AI Feature Eval Writer — Golden Datasets, Rubrics, and LLM-as-Judge Prompts That Actually Catch Regressions
Design and write the eval suite for your LLM-powered feature — the metrics that match your failure modes, a golden dataset plan with starter cases, anchored rubrics, LLM-as-judge prompts with the known bias mitigations, and pass/fail gates wired for CI.
- Build a golden dataset for regression testing prompt changes.
- Create anchored rubrics for subjective LLM quality grading.
- Define programmatic checks for AI-generated JSON and structured data.
$14 or 70 credits
Secure checkout via Stripe
Included in download
- Build a golden dataset for regression testing prompt changes.
- Create anchored rubrics for subjective LLM quality grading.
- file_read, file_write automation included
- Ready for SKILL.md agents, including Claude Code
- Instant install
See it in action
You say
I'm building a RAG bot for our internal HR docs. I need to make sure it doesn't hallucinate benefits info and always cites the right PDF. Give me an eval suite design.
Your agent does
Here is the eval suite design for your HR RAG bot, written to three files (eval-design-worksheet.md, llm-judge-prompt-template.md, sample-eval-suite.md):
Metrics (each mapped to a failure mode)
- Faithfulness (blocking): LLM-as-judge rubric that flags any benefits claim not grounded in the retrieved source.
- Citation accuracy (blocking): programmatic check that every cited PDF exists and matches the answer's source.
- Correct refusal (warning): checks the bot declines when the docs don't cover the question.
Golden dataset: 10 starter cases spanning common questions, edge cases, an adversarial "invent a policy" prompt, and a correct-refusal case.
Rubrics: anchored 1-to-5 scales with observable descriptions per level.
Judge prompt: mitigates position, length, self-preference, and halo bias, with a human calibration step before you rely on it.
CI gate: fail the build if faithfulness drops below your threshold on any prompt or model change.
Note: this designs and writes the artifacts — you plug them into your own test framework or CI to run them. It does not execute evals or call models.
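To make the citation-accuracy metric in the sample above concrete, here is a minimal sketch of a programmatic grader you might write yourself. It is not part of the skill's output. The citation format (a file name in square brackets, such as [leave-policy.pdf]), the function name check_citations, and the example file names are all assumptions for illustration.

```python
import re

# Assumed citation format: the bot cites sources as [filename.pdf].
CITATION_PATTERN = re.compile(r"\[([^\[\]]+\.pdf)\]", re.IGNORECASE)


def check_citations(answer: str, retrieved_pdfs: set[str], known_pdfs: set[str]) -> dict:
    """Return a pass/fail verdict plus the reasons for any failure."""
    cited = set(CITATION_PATTERN.findall(answer))
    problems = []
    if not cited:
        problems.append("no citation found")
    for pdf in sorted(cited):
        if pdf not in known_pdfs:
            problems.append(f"cited file does not exist: {pdf}")
        elif pdf not in retrieved_pdfs:
            problems.append(f"cited file was not retrieved for this answer: {pdf}")
    return {"passed": not problems, "cited": sorted(cited), "problems": problems}


# Hypothetical document set and answers.
known = {"benefits-2024.pdf", "leave-policy.pdf"}
good = check_citations(
    "You get 20 days of leave [leave-policy.pdf].", {"leave-policy.pdf"}, known
)
bad = check_citations(
    "Dental is fully covered [dental-plan.pdf].", {"benefits-2024.pdf"}, known
)
print(good["passed"], bad["problems"])
```

A check like this is cheap and deterministic, which is why it can be a blocking metric without needing an LLM judge.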
About This Skill
Teams ship LLM features with unit tests for the plumbing and vibes for the model. Then a prompt tweak or model upgrade quietly breaks quality, and nobody notices until users do. Evals are the missing test suite, and writing them is a craft: bad rubrics measure fluency instead of correctness, and naive judge prompts have known biases.
AI Feature Eval Writer does the design and writes the artifacts. Describe your feature, what good looks like, and the failure modes you fear, and it produces the eval plan:
- Metrics: each failure mode becomes its own metric with the cheapest grader that works (programmatic checks first, LLM-as-judge only where genuinely needed) and a blocking or warning threshold.
- Golden dataset: a design with 8 to 12 concrete starter cases, including adversarial and correct-refusal cases.
- Rubrics: anchored 1-to-5 scales with observable level descriptions.
- Judge prompts: ready-to-use prompts that mitigate position, length, self-preference, and halo biases, with a human calibration step before you trust them.
- CI regression gate: a gate that runs on every prompt or model change.
The download includes three reference files: the eval-design worksheet, the LLM-as-judge prompt template with bias guards, and a complete worked sample suite. It designs and writes the artifacts; it does not execute evals or call models. Works with Claude Code, Cursor, Codex CLI, Gemini CLI, and any SKILL.md agent.
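One common way to mitigate the position bias mentioned above, when a judge compares two answers (for example, old prompt versus new prompt), is to judge both orderings and only count a win when the verdict survives the swap. The sketch below is an illustration under stated assumptions, not the skill's own template: the Judge signature, compare_with_swap, and the two stand-in judges are hypothetical, and a real judge would wrap a model call that you supply.

```python
from typing import Callable

# Assumed judge interface: (question, first_answer, second_answer) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def compare_with_swap(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "a", "b", or "tie" after judging both presentation orders."""
    first_pass = judge(question, answer_a, answer_b)   # A is shown first
    second_pass = judge(question, answer_b, answer_a)  # B is shown first
    a_wins_first = first_pass == "first"
    a_wins_second = second_pass == "second"
    if a_wins_first and a_wins_second:
        return "a"
    if not a_wins_first and not a_wins_second:
        return "b"
    return "tie"  # the verdict flipped with the order, so do not trust it


def always_first(question: str, first: str, second: str) -> str:
    """Stand-in judge with maximal position bias: always picks the first answer."""
    return "first"


def prefers_cited(question: str, first: str, second: str) -> str:
    """Stand-in judge that prefers whichever answer contains a citation marker."""
    return "first" if "[" in first else "second"


q = "How many days of leave do I get?"
cited = "You get 20 days [leave-policy.pdf]."
uncited = "You get 20 days."
print(compare_with_swap(always_first, q, cited, uncited))   # tie
print(compare_with_swap(prefers_cited, q, cited, uncited))  # a
```

The biased stand-in judge ends up with a tie instead of a false win, which is the point of the swap: order-dependent verdicts are discarded, not averaged in.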
Use Cases
- Build a golden dataset for regression testing prompt changes.
- Create anchored rubrics for subjective LLM quality grading.
- Define programmatic checks for AI-generated JSON and structured data.
- Establish CI/CD gates to prevent model quality regressions.
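As an illustration of the programmatic checks for AI-generated JSON listed above, a grader can be as small as a parse step plus a field and type check. This is a sketch only: the output contract in REQUIRED_FIELDS and the function name check_json_output are assumptions, and your feature's real schema will differ.

```python
import json

# Assumed output contract for the feature: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list, "confidence": float}


def check_json_output(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            problems.append(f"wrong type for {field}: expected {expected.__name__}")
    return problems


ok = check_json_output(
    '{"answer": "20 days", "sources": ["leave-policy.pdf"], "confidence": 0.9}'
)
broken = check_json_output('{"answer": "20 days", "sources": "leave-policy.pdf"}')
print(ok, broken)
```

Returning a list of reasons, not just a boolean, makes a failing golden-dataset case easier to triage when a prompt change breaks the format.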
Known Limitations
This skill designs and writes the eval artifacts; it does not execute evals, run test suites, call or benchmark models, or connect to any service. You run the resulting evals in your own test framework or CI provider. It generates a starter golden dataset (about 8 to 12 cases) that you should expand with your real production examples, and its LLM-as-judge prompts require a one-time human calibration pass before you trust the scores. Output is Markdown text and prompt templates, not runnable code or a hosted dashboard, and it does not include automatic, ongoing, or lifetime updates. The download contains three reference files: eval-design-worksheet.md, llm-judge-prompt-template.md, and sample-eval-suite.md.
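The human calibration pass described above can be summarized with a few simple agreement numbers. The sketch below assumes you have scored the same cases by hand and with the judge on the 1-to-5 rubric; the function name calibration_report, the choice of statistics, and the scores are illustrative assumptions, not something the skill prescribes.

```python
def calibration_report(human: list[int], judge: list[int]) -> dict:
    """Compare judge scores with human scores on the same cases (1-to-5 scale)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(human)
    pairs = list(zip(human, judge))
    return {
        # Share of cases where the judge gave exactly the human score.
        "exact_agreement": sum(h == j for h, j in pairs) / n,
        # Share of cases where the judge was at most one point off.
        "within_one": sum(abs(h - j) <= 1 for h, j in pairs) / n,
        # Positive means the judge scores higher than humans on average.
        "judge_minus_human": sum(j - h for h, j in pairs) / n,
    }


# Made-up scores for eight cases, for illustration only.
human_scores = [5, 4, 2, 1, 3, 5, 2, 4]
judge_scores = [5, 4, 3, 1, 4, 5, 4, 4]
report = calibration_report(human_scores, judge_scores)
print(report)
```

In this made-up example the judge never scores below the human and runs half a point high on average, the kind of leniency you would want to correct in the rubric or prompt before gating on the judge.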
How to Install
mkdir -p ~/.claude/skills && \
  curl -sL https://www.agensi.io/api/install/ai-feature-eval-writer-golden-datasets-rubrics-and-llm-as-judge-prompts-that-actually-catch-regressions -o /tmp/ai-feature-eval-writer-golden-datasets-rubrics-and-llm-as-judge-prompts-that-actually-catch-regressions.zip && \
  unzip -o /tmp/ai-feature-eval-writer-golden-datasets-rubrics-and-llm-as-judge-prompts-that-actually-catch-regressions.zip -d ~/.claude/skills && \
  rm /tmp/ai-feature-eval-writer-golden-datasets-rubrics-and-llm-as-judge-prompts-that-actually-catch-regressions.zip
Free skills install directly. Paid skills require purchase; use the download button above after buying.
Security Scanned
Passed automated security review
Permissions
File Scopes
The skill only reads the context you provide and writes Markdown artifacts (the eval plan, rubrics, judge-prompt templates, and worked sample suite) to your project. It needs Read Files to review any feature notes or spec you point it at and Write Files to save the generated artifacts. It does not use a terminal, make network requests, run evals, call or benchmark models, or read environment variables.
Compatibility
Works with any agent that follows the SKILL.md standard, including Claude Code, Cursor, Codex CLI, Gemini CLI, and VS Code Copilot. Requires only file read/write access — no terminal, network, or environment variables. The skill designs and writes eval artifacts (Markdown); running the resulting evals requires your own test framework or CI provider.
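Because running the evals is left to your own test framework or CI provider, the gate itself is something you write. Here is a minimal sketch of a blocking/warning gate, assuming your eval run produces a pass rate per metric. The metric names, thresholds, and the function evaluate_gates are hypothetical; pick thresholds only after calibrating your graders.

```python
# Hypothetical gate configuration: blocking metrics fail the build, others only warn.
GATES = {
    "faithfulness": {"min_pass_rate": 0.95, "blocking": True},
    "citation_accuracy": {"min_pass_rate": 1.0, "blocking": True},
    "correct_refusal": {"min_pass_rate": 0.90, "blocking": False},
}


def evaluate_gates(pass_rates: dict, gates: dict = GATES) -> tuple:
    """Return (exit_code, messages); exit_code is 1 if any blocking metric fails."""
    exit_code = 0
    messages = []
    for metric, gate in gates.items():
        rate = pass_rates.get(metric, 0.0)  # a missing metric counts as 0.0
        if rate >= gate["min_pass_rate"]:
            continue
        level = "FAIL" if gate["blocking"] else "WARN"
        messages.append(f"{level}: {metric} {rate:.2f} < {gate['min_pass_rate']:.2f}")
        if gate["blocking"]:
            exit_code = 1
    return exit_code, messages


# In CI these numbers would come from your own eval run; hard-coded here.
code, msgs = evaluate_gates(
    {"faithfulness": 0.92, "citation_accuracy": 1.0, "correct_refusal": 0.80}
)
print(code)
print("\n".join(msgs))
```

In a real CI script you would finish with sys.exit(code), so a blocking failure fails the build while warnings only show up in the log.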
Creator
PubsProToolkit builds rigor-first skills for AI agents — they write your docs and content properly, then adversarially review them to catch what's wrong before it ships. The result: cleaner output and a hard quality gate in one toolkit. Built by a CMPP-certified, PhD medical writer who brings regulated-industry standards to developer docs, content, compliance, and research integrity.