    AI Feature Eval Writer — Golden Datasets, Rubrics, and LLM-as-Judge Prompts That Actually Catch Regressions

    by PubsProToolkit

    Design and write the eval suite for your LLM-powered feature — the metrics that match your failure modes, a golden dataset plan with starter cases, anchored rubrics, LLM-as-judge prompts with the known bias mitigations, and pass/fail gates wired for CI.

    Updated Jul 2026
    Security scanned
    Works with agents including Claude Code

    $14 or 70 credits

    30-day refund guarantee

    Secure checkout via Stripe

    Included in download

    • Build a golden dataset for regression testing prompt changes.
    • Create anchored rubrics for subjective LLM quality grading.
    • File read and write automation included (file_read, file_write)
    • Ready for Claude Code and other SKILL.md agents
    • Instant install

    See it in action

    You say

    I'm building a RAG bot for our internal HR docs. I need to make sure it doesn't hallucinate benefits info and always cites the right PDF. Give me an eval suite design.

    Your agent does

    Here is the eval suite design for your HR RAG bot, written to three files (eval-design-worksheet.md, llm-judge-prompt-template.md, sample-eval-suite.md):

    Metrics (each mapped to a failure mode)

    • Faithfulness (blocking): LLM-as-judge rubric that flags any benefits claim not grounded in the retrieved source.
    • Citation accuracy (blocking): programmatic check that every cited PDF exists and matches the answer's source.
    • Correct refusal (warning): checks the bot declines when the docs don't cover the question.

    Golden dataset: 10 starter cases spanning common questions, edge cases, an adversarial "invent a policy" prompt, and a correct-refusal case.

    Rubrics: anchored 1-to-5 scales with observable descriptions per level.

    Judge prompt: mitigates position, length, self-preference, and halo bias, with a human calibration step before you rely on it.

    CI gate: fail the build if faithfulness drops below your threshold on any prompt or model change.

    Note: this designs and writes the artifacts — you plug them into your own test framework or CI to run them. It does not execute evals or call models.
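To make the blocking versus warning distinction concrete, here is a minimal Python sketch of how a gate like the one described above might be wired into a test run. The metric names come from the example; the `MetricResult` type, the `evaluate_gate` helper, and every score and threshold are hypothetical illustrations, since the skill writes the design and leaves execution to your own framework.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    score: float      # mean score over the golden dataset, scaled 0.0 to 1.0
    threshold: float  # minimum acceptable score (illustrative value)
    blocking: bool    # True fails the build; False only emits a warning


def evaluate_gate(results):
    """Return (passed, messages) for a list of MetricResult."""
    passed = True
    messages = []
    for r in results:
        if r.score >= r.threshold:
            continue
        level = "FAIL" if r.blocking else "WARN"
        messages.append(f"{level}: {r.name} {r.score:.2f} < {r.threshold:.2f}")
        if r.blocking:
            passed = False
    return passed, messages


# Hypothetical scores from one eval run of the HR RAG bot.
results = [
    MetricResult("faithfulness", 0.91, 0.95, blocking=True),
    MetricResult("citation_accuracy", 1.00, 1.00, blocking=True),
    MetricResult("correct_refusal", 0.80, 0.90, blocking=False),
]
passed, messages = evaluate_gate(results)
```

In a CI job you would typically print `messages` and exit with a non-zero status when `passed` is false, so that only blocking metrics can stop a merge.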

    About This Skill

    Teams ship LLM features with unit tests for the plumbing and vibes for the model. Then a prompt tweak or model upgrade quietly breaks quality, and nobody notices until users do. Evals are the missing test suite, and writing them is a craft: bad rubrics measure fluency instead of correctness, and naive judge prompts have known biases.

    AI Feature Eval Writer does the design and writes the artifacts. Describe your feature, what good looks like, and the failure modes you fear, and it produces the eval plan:

    • Metrics: each failure mode becomes its own metric with the cheapest grader that works (programmatic checks first, LLM-as-judge only where genuinely needed) and a blocking or warning threshold.
    • Golden dataset: a design with 8 to 12 concrete starter cases, including adversarial and correct-refusal cases.
    • Rubrics: anchored 1-to-5 scales with observable level descriptions.
    • Judge prompts: ready to use, mitigating position, length, self-preference, and halo biases, with a human calibration step before you trust them.
    • CI gate: a regression gate that runs on every prompt or model change.

    The download includes three reference files: the eval-design worksheet, the LLM-as-judge prompt template with bias guards, and a complete worked sample suite. It designs and writes the artifacts; it does not execute evals or call models. Works with Claude Code, Cursor, Codex CLI, Gemini CLI, and any SKILL.md agent.
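As an illustration of one of the bias mitigations named above, the sketch below shows a common way to reduce position bias in a pairwise judge: ask twice with the answers swapped and keep only verdicts that agree. This is not taken from the skill's template. The `Judge` signature, `debiased_preference`, and both stand-in judges are hypothetical, and a real judge would wrap a model call in your own harness. Length, self-preference, and halo bias need separate handling.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer)
# returns "first", "second", or "tie". A real judge would call a model.
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge twice with the answers swapped; keep only consistent verdicts."""
    forward = judge(question, answer_a, answer_b)
    backward = judge(question, answer_b, answer_a)

    # Map positional verdicts back to the answer labels A and B.
    forward_pick = {"first": "A", "second": "B"}.get(forward, "tie")
    backward_pick = {"first": "B", "second": "A"}.get(backward, "tie")

    if forward_pick == backward_pick:
        return forward_pick
    return "inconsistent"  # flag for human review rather than guessing


def always_first(question, first, second):
    """Stand-in judge with pure position bias: favors whatever it sees first."""
    return "first"


def position_independent(question, first, second):
    """Stand-in judge whose verdict depends only on content (here, text length)."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

A position-biased judge produces contradictory verdicts across the two orderings, so its output is surfaced as `"inconsistent"` instead of silently counting as a win for whichever answer came first.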

    Use Cases

    • Build a golden dataset for regression testing prompt changes.
    • Create anchored rubrics for subjective LLM quality grading.
    • Define programmatic checks for AI-generated JSON and structured data.
    • Establish CI/CD gates to prevent model quality regressions.
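For the structured-data use case, a programmatic check can be as small as the Python sketch below, which uses only the standard library. The field names, the allowed priority values, and the `check_structured_output` helper are hypothetical; substitute the contract your own feature is supposed to honor.

```python
import json

# Hypothetical contract for a feature that returns a support-ticket summary.
REQUIRED_FIELDS = {"summary": str, "priority": str, "tags": list}
ALLOWED_PRIORITIES = {"low", "medium", "high"}


def check_structured_output(raw: str) -> list:
    """Return a list of problems; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}")

    # Enum check runs only when the field is present with the right type.
    priority = data.get("priority")
    if isinstance(priority, str) and priority not in ALLOWED_PRIORITIES:
        problems.append(f"unexpected priority: {priority}")
    return problems
```

Checks like this are deterministic and cheap, which is why they are worth running before reaching for an LLM-as-judge grader.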

    Security Scanned

    Passed automated security review

    Permissions

    Read Files
    Write Files

    File Scopes

    references/**

    The skill only reads the context you provide and writes Markdown artifacts (the eval plan, rubrics, judge-prompt templates, and worked sample suite) to your project. It needs Read Files to review any feature notes or spec you point it at and Write Files to save the generated artifacts. It does not use a terminal, make network requests, run evals, call or benchmark models, or read environment variables.

    Works with any agent that follows the SKILL.md standard, including Claude Code, Cursor, Codex CLI, Gemini CLI, and VS Code Copilot. Requires only file read/write access — no terminal, network, or environment variables. The skill designs and writes eval artifacts (Markdown); running the resulting evals requires your own test framework or CI provider.

    Creator

    PubsProToolkit builds rigor-first skills for AI agents — they write your docs and content properly, then adversarially review them to catch what's wrong before it ships. The result: cleaner output and a hard quality gate in one toolkit. Built by a CMPP-certified, PhD medical writer who brings regulated-industry standards to developer docs, content, compliance, and research integrity.
