BUNDLE · Security scanned · 3 skills

    AI Feature Reliability Suite — Prove It Works, Model the Cost, Harden the Security

    Shipping an LLM feature comes down to three questions, and this suite answers all three. Is it good? Design the eval suite — metrics matched to your failure modes, a golden dataset, anchored rubrics, and LLM-as-judge prompts with the bias mitigations built in. What will it cost? Model the token spend per request and at scale, rank the cost drivers, and get an optimization plan with estimated savings and unit economics against your pricing. Is it safe? Red-team your own agent for prompt-injection and tool-misuse holes, with a defensive test plan and a prioritized mitigations list. Quality, cost, and security — the three things that sink an AI launch — proven before you ship. For developers building LLM features and agents on Claude Code, Cursor, Codex CLI, and any SKILL.md agent. Each skill includes reference templates, guides, and worked examples.

    Works with every agent that reads SKILL.md — Claude Code, Cursor, Codex CLI, Gemini CLI, GitHub Copilot, Windsurf, OpenClaw, and more.

    Created by PubsProToolkit
    $26 (regular price $42)
    Save 38% ($16)

    One payment, lifetime access. 3 skills unlock instantly in your library.

    30-day refund guarantee

    Instant unlock in your library

    Free updates from the creator

    What's included

    3 skills

    See it in action

    You say

    I'm building a RAG bot for our internal HR docs. I need to make sure it doesn't hallucinate benefits info and always cites the right PDF. Give me an eval suite design.

    Your agent does

    Here is the eval suite design for your HR RAG bot, written to three files (eval-design-worksheet.md, llm-judge-prompt-template.md, sample-eval-suite.md):

    Metrics (each mapped to a failure mode)

    • Faithfulness (blocking): LLM-as-judge rubric that flags any benefits claim not grounded in the retrieved source.
    • Citation accuracy (blocking): programmatic check that every cited PDF exists and matches the answer's source.
    • Correct refusal (warning): checks the bot declines when the docs don't cover the question.
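
    As an illustration of what the programmatic citation check could look like once you wire it into your own test framework, here is a minimal Python sketch. It is not taken from the bundle's files. The bracketed citation format (`[file.pdf]`), the function names, and the idea of comparing against the set of retrieved PDFs are all assumptions made for this example.

```python
import re

# Assumption: the bot cites sources inline as [some-file.pdf].
# Adjust the pattern to whatever citation format your bot actually emits.
CITATION_PATTERN = re.compile(r"\[([^\[\]]+\.pdf)\]", re.IGNORECASE)


def extract_citations(answer: str) -> list[str]:
    """Return the PDF file names cited in the answer, in order of appearance."""
    return CITATION_PATTERN.findall(answer)


def check_citations(answer: str, known_pdfs: set[str], retrieved_pdfs: set[str]) -> dict:
    """Check that every cited PDF exists and was among the retrieved sources.

    known_pdfs:     file names of every PDF in the HR corpus (hypothetical input)
    retrieved_pdfs: file names the retriever returned for this question
    """
    cited = extract_citations(answer)
    # Cited files that are not in the corpus at all (likely invented).
    missing = [name for name in cited if name not in known_pdfs]
    # Cited files that exist but were not retrieved for this answer.
    unretrieved = [
        name for name in cited
        if name in known_pdfs and name not in retrieved_pdfs
    ]
    # An answer with no citation at all also fails this check.
    passed = bool(cited) and not missing and not unretrieved
    return {
        "passed": passed,
        "cited": cited,
        "missing": missing,
        "unretrieved": unretrieved,
    }
```

    Because this metric is marked blocking, a test would typically assert on `passed` and print `missing` and `unretrieved` when it fails.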

    Golden dataset: 10 starter cases spanning common questions, edge cases, an adversarial "invent a policy" prompt, and a correct-refusal case.

    Rubrics: anchored 1-to-5 scales with observable descriptions per level.

    Judge prompt: mitigates position, length, self-preference, and halo bias, with a human calibration step before you rely on it.

    CI gate: fail the build if faithfulness drops below your threshold on any prompt or model change.
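
    On the harness side, a gate like this can be a small function whose return value becomes the process exit code. The sketch below is a hypothetical example, not part of the bundle. It assumes faithfulness scores on the 1-to-5 rubric scale, a gate on the mean score, and a threshold you choose yourself (4.0 here is only a placeholder).

```python
from statistics import mean


def faithfulness_gate(scores: list[float], threshold: float) -> int:
    """Return a CI exit code: 0 if the gate passes, 1 if the build should fail.

    scores:    per-case faithfulness scores from your judge (assumed 1-to-5 scale)
    threshold: minimum acceptable mean score (your choice, not a recommended value)
    """
    if not scores:
        # Treat a run that produced no scores as a failure, not a silent pass.
        return 1
    return 0 if mean(scores) >= threshold else 1


# Hypothetical scores for five golden-dataset cases; the mean is 4.4.
example_exit_code = faithfulness_gate([5, 4, 5, 3, 5], threshold=4.0)
```

    In a CI step you would pass the result to `sys.exit()` so that a nonzero code fails the build. Gating on the mean is one design choice. A stricter variant fails if any single case scores below the threshold.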

    Note: this designs and writes the artifacts — you plug them into your own test framework or CI to run them. It does not execute evals or call models.

    How to install

    Drop the file into your AI tool. Works with Claude, Cursor, ChatGPT, and 20+ more.

    Reviews

    No reviews yet on the included skills. Be the first to try this bundle.
