AI Feature Eval Writer — Golden Datasets, Rubrics, and LLM-as-Judge Prompts That Actually Catch Regressions

Design and write the eval suite for your LLM-powered feature — the metrics that match your failure modes, a golden dataset plan with starter cases, anchored rubrics, LLM-as-judge prompts with the known bias mitigations, and pass/fail gates wired for CI.

Updated Jul 2026

Works with agents including Claude Code, Cursor, Codex CLI, and Gemini CLI

Security scanned · Instant install

$14 · or 70 credits

30-day refund guarantee

Secure checkout via Stripe

Included in download

file_read and file_write automation included
Ready for agents including Claude Code

See it in action

You say

I'm building a RAG bot for our internal HR docs. I need to make sure it doesn't hallucinate benefits info and always cites the right PDF. Give me an eval suite design.

Your agent does

Here is the eval suite design for your HR RAG bot, written to three files (eval-design-worksheet.md, llm-judge-prompt-template.md, sample-eval-suite.md):

Metrics (each mapped to a failure mode)

Faithfulness (blocking): LLM-as-judge rubric that flags any benefits claim not grounded in the retrieved source.
Citation accuracy (blocking): programmatic check that every cited PDF exists and matches the answer's source.
Correct refusal (warning): checks the bot declines when the docs don't cover the question.

Golden dataset: 10 starter cases spanning common questions, edge cases, an adversarial "invent a policy" prompt, and a correct-refusal case.

Rubrics: anchored 1-to-5 scales with observable descriptions per level.

Judge prompt: mitigates position, length, self-preference, and halo bias, with a human calibration step before you rely on it.

CI gate: fail the build if faithfulness drops below your threshold on any prompt or model change.

Note: this designs and writes the artifacts — you plug them into your own test framework or CI to run them. It does not execute evals or call models.
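The citation accuracy metric above is a programmatic check, so it can be sketched without a model call. The following is a minimal illustration, not the skill's output: the `[source: file.pdf]` citation format, the function name `check_citations`, and the file names in the usage note are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

# ASSUMPTION: the bot cites sources inline as "[source: filename.pdf]".
# Adapt the pattern to whatever citation format your bot actually emits.
CITATION_PATTERN = re.compile(r"\[source:\s*([^\]]+?\.pdf)\s*\]", re.IGNORECASE)


@dataclass
class CitationResult:
    passed: bool
    cited: list[str]
    problems: list[str]


def check_citations(answer: str, known_pdfs: set[str], expected_pdf: str) -> CitationResult:
    """Check that every cited PDF exists and that the expected source is cited."""
    cited = [name.strip() for name in CITATION_PATTERN.findall(answer)]
    problems: list[str] = []

    if not cited:
        problems.append("no citation found")

    # Every cited file must be part of the known document set.
    for name in cited:
        if name not in known_pdfs:
            problems.append(f"cited file does not exist: {name}")

    # The golden case names the PDF that should back the answer.
    if cited and expected_pdf not in cited:
        problems.append(f"expected source not cited: {expected_pdf}")

    return CitationResult(passed=not problems, cited=cited, problems=problems)
```

For example, with a hypothetical document set `{"benefits-2025.pdf", "leave-policy.pdf"}`, an answer ending in `[source: benefits-2025.pdf]` passes when that file is the expected source, while an answer citing `made-up.pdf` fails on both counts: the file does not exist and the expected source is missing.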

by PubsProToolkit

Also available via Agensi MCP: your AI agent can load this skill on demand via MCP.

About This Skill

Teams ship LLM features with unit tests for the plumbing and vibes for the model. Then a prompt tweak or model upgrade quietly breaks quality, and nobody notices until users do. Evals are the missing test suite, and writing them is a craft: bad rubrics measure fluency instead of correctness, and naive judge prompts have known biases.

AI Feature Eval Writer does the design and writes the artifacts. Describe your feature, what good looks like, and the failure modes you fear, and it produces the eval plan:

Metrics: each failure mode becomes its own metric with the cheapest grader that works (programmatic checks first, LLM-as-judge only where genuinely needed) and a blocking or warning threshold.
Golden dataset: a design with 8 to 12 concrete starter cases, including adversarial and correct-refusal cases.
Rubrics: anchored 1-to-5 scales with observable level descriptions.
Judge prompts: ready-to-use prompts that mitigate position, length, self-preference, and halo biases, with a human calibration step before you trust them.
CI gate: a regression gate that runs on every prompt or model change.

The download includes three reference files: the eval-design worksheet, the LLM-as-judge prompt template with bias guards, and a complete worked sample suite. It designs and writes the artifacts; it does not execute evals or call models. Works with Claude Code, Cursor, Codex CLI, Gemini CLI, and any SKILL.md agent.
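Since the skill itself does not run anything, the blocking-versus-warning thresholds described above still need a small runner on your side. Here is one way that might look, as a sketch under stated assumptions: the `Gate` class, the `evaluate_gates` function, and the choice to gate on the mean score per metric are all hypothetical, not something the skill prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    metric: str
    threshold: float  # minimum acceptable mean score, on the metric's own scale
    blocking: bool    # True: fail the build; False: report a warning only


def evaluate_gates(
    scores: dict[str, list[float]], gates: list[Gate]
) -> tuple[list[str], list[str]]:
    """Return (failures, warnings) for metrics whose mean falls below threshold.

    ASSUMPTION: gating on the mean across golden cases. A stricter suite might
    gate on the minimum score or on a pass rate instead.
    """
    failures: list[str] = []
    warnings: list[str] = []
    for gate in gates:
        values = scores.get(gate.metric, [])
        if not values:
            # A metric with no scores is treated as a violation, not a pass.
            message = f"{gate.metric}: no scores recorded"
        else:
            mean = sum(values) / len(values)
            if mean >= gate.threshold:
                continue
            message = f"{gate.metric}: mean {mean:.2f} below threshold {gate.threshold:.2f}"
        (failures if gate.blocking else warnings).append(message)
    return failures, warnings
```

A CI script could call `evaluate_gates`, print both lists, and exit with a non-zero status only when `failures` is non-empty, so that warning-level metrics stay visible without stopping the build.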

Use Cases

Build a golden dataset for regression testing prompt changes.
Create anchored rubrics for subjective LLM quality grading.
Define programmatic checks for AI-generated JSON and structured data.
Establish CI/CD gates to prevent model quality regressions.
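As an illustration of the third use case, a programmatic check for JSON output can be written with the Python standard library alone. This is a minimal sketch: the field names in `REQUIRED_FIELDS` and the function name `check_structured_output` are hypothetical, and a real suite might prefer a full schema validator.

```python
import json

# ASSUMPTION: a hypothetical output contract, mapping each required field
# to the Python type it should decode to.
REQUIRED_FIELDS = {"answer": str, "sources": list}


def check_structured_output(raw: str, required: dict = REQUIRED_FIELDS) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems: list[str] = []
    for field, expected in required.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            problems.append(
                f"wrong type for {field}: expected {expected.__name__}, "
                f"got {type(data[field]).__name__}"
            )
    return problems
```

Because the check is deterministic and needs no model call, it fits the "cheapest grader that works" idea: run it on every golden case first and reserve LLM-as-judge scoring for qualities a parser cannot see.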

Known Limitations

This skill designs and writes the eval artifacts; it does not execute evals, run test suites, call or benchmark models, or connect to any service. You run the resulting evals in your own test framework or CI provider. It generates a starter golden dataset (about 8 to 12 cases) that you should expand with your real production examples, and its LLM-as-judge prompts require a one-time human calibration pass before you trust the scores. Output is Markdown text and prompt templates, not runnable code or a hosted dashboard, and it does not include automatic, ongoing, or lifetime updates. The download contains three reference files: eval-design-worksheet.md, llm-judge-prompt-template.md, and sample-eval-suite.md.
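The human calibration pass mentioned above comes down to comparing judge scores with human scores on the same cases. One possible way to summarize that comparison is sketched below; the function name `calibration_report` and the three statistics it reports are assumptions chosen for the example, not a procedure defined by the skill.

```python
def calibration_report(human: list[int], judge: list[int]) -> dict[str, float]:
    """Compare human and LLM-judge scores given on the same 1-to-5 rubric.

    Both lists must be aligned: human[i] and judge[i] score the same case.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty score lists of equal length")

    n = len(human)
    diffs = [abs(h - j) for h, j in zip(human, judge)]
    return {
        # Share of cases where the judge gave exactly the human score.
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        # Share of cases where the judge was at most one rubric level off.
        "within_one": sum(d <= 1 for d in diffs) / n,
        # Average distance between judge and human, in rubric levels.
        "mean_abs_diff": sum(diffs) / n,
    }
```

What counts as "calibrated enough" is a judgment call for your team. Low exact agreement with high within-one agreement usually points at fuzzy rubric anchors, while large differences on specific cases are worth reading one by one.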

How to Install

mkdir -p ~/.claude/skills \
  && curl -sL https://www.agensi.io/api/install/ai-feature-eval-writer-golden-datasets-rubrics-and-llm-as-judge-prompts-that-actually-catch-regressions -o /tmp/ai-feature-eval-writer-golden-datasets-rubrics-and-llm-as-judge-prompts-that-actually-catch-regressions.zip \
  && unzip -o /tmp/ai-feature-eval-writer-golden-datasets-rubrics-and-llm-as-judge-prompts-that-actually-catch-regressions.zip -d ~/.claude/skills \
  && rm /tmp/ai-feature-eval-writer-golden-datasets-rubrics-and-llm-as-judge-prompts-that-actually-catch-regressions.zip

Free skills install directly. Paid skills require purchase; use the download button above after buying.

Early access skill

Built by PubsProToolkit

Works with any agent that follows the SKILL.md standard, …

Security Scanned

Passed automated security review

Permissions

Read Files

Write Files

File Scopes

references/**

The skill only reads the context you provide and writes Markdown artifacts (the eval plan, rubrics, judge-prompt templates, and worked sample suite) to your project. It needs Read Files to review any feature notes or spec you point it at and Write Files to save the generated artifacts. It does not use a terminal, make network requests, run evals, call or benchmark models, or read environment variables.

Creator

PubsProToolkit

PubsProToolkit builds rigor-first skills for AI agents — they write your docs and content properly, then adversarially review them to catch what's wrong before it ships. The result: cleaner output and a hard quality gate in one toolkit. Built by a CMPP-certified, PhD medical writer who brings regulated-industry standards to developer docs, content, compliance, and research integrity.
