    How to Eval and Benchmark Your SKILL.md Skills (2026 Guide)

    Test, benchmark, and A/B test your SKILL.md skills before publishing. How to use the skill-creator, define test prompts, and tune trigger descriptions.

    May 8, 2026 · 9 min read

    A skill that works today might break after a model update. Evals let you test whether your SKILL.md skills actually do what they claim, and benchmarks let you measure them over time. Here's how to use the skill-creator's eval system to build skills that stay reliable.

    Quick Answer: Use the skill-creator in eval mode to test your SKILL.md skills against defined prompts and expected outputs. It runs 4 parallel sub-agents in isolated contexts, grades pass/fail, and tracks metrics like pass rate, token usage, and elapsed time. Run evals after every model update or skill edit. Access it in Claude Code, Claude.ai, or Cowork by asking Claude to "use the skill-creator to eval my skill." No code required.

    Why evals matter for SKILL.md skills

    SKILL.md skills are text instructions. When the underlying model updates, the way it interprets those instructions can shift. A code review skill that reliably flagged security issues last week might start missing them after a model release. You won't know unless you test.

    Before the skill-creator update in March 2026, most skill authors followed the same pattern: write the SKILL.md, test it once manually, ship it, and hope. There was no structured way to verify that a skill worked, let alone measure whether it improved or degraded over time.

    The skill-creator now operates in four modes:

    • Create walks you through writing a new skill from scratch.
    • Eval runs your skill against test prompts and grades the output against criteria you define.
    • Improve runs blind A/B comparisons between two versions of a skill.
    • Benchmark runs the eval suite multiple times and reports mean pass rate, standard deviation, token usage, and elapsed time.

    Everything works in natural language. You don't need to write code or configure test frameworks. If you can describe what your skill should do, you can eval it.

    For background on the SKILL.md format itself, read What Is SKILL.md?.

    How to run your first eval

    Step 1. Open the skill-creator

    In Claude Code, start a session and say:

    Use the skill-creator to eval my code-reviewer skill
    

    In Claude.ai or Cowork, the same prompt works. Claude loads the skill-creator and switches to eval mode.

    Step 2. Define test prompts

    The skill-creator asks you to provide test prompts. These are the requests a user would type that should trigger your skill. For a code review skill, you might define:

    • "Review this Python function for security issues" with a sample function containing an SQL injection vulnerability
    • "Check this React component for performance problems" with a component that re-renders unnecessarily
    • "Review my git diff before I commit" with staged changes that include a hardcoded API key

    You also define negative prompts, requests that should not trigger the skill:

    • "Write a README for my project"
    • "Help me debug this error"

    Step 3. Define expected outputs

    For each test prompt, describe what the skill's output should include. You don't need exact text matches. Describe the criteria:

    • "Should identify the SQL injection risk in the get_user function"
    • "Should flag the missing useMemo or useCallback for the expensive computation"
    • "Should catch the hardcoded API key and recommend using environment variables"

    Step 4. Run the eval

    The skill-creator launches 4 sub-agents in parallel, each in an isolated context. This prevents contamination between tests. One test's output can't influence another. Each sub-agent runs a test prompt through your skill and grades the output against your criteria.
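
    The skill-creator handles this orchestration for you, but the pattern is easy to picture. A rough sketch of the idea, not the tool's actual internals (run_test here is a stand-in for a sub-agent call):

        from concurrent.futures import ThreadPoolExecutor

        test_prompts = [
            "Review this Python function for security issues",
            "Check this React component for performance problems",
            "Review my git diff before I commit",
            "Write a README for my project",  # negative: skill should not fire
        ]

        def run_test(prompt: str) -> str:
            # Stand-in for a sub-agent running the skill in a fresh, isolated
            # context; nothing from the other tests is visible here.
            return f"(output for: {prompt!r})"

        with ThreadPoolExecutor(max_workers=4) as pool:
            outputs = list(pool.map(run_test, test_prompts))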

    You get a results summary showing which tests passed, which failed, and why. If the skill missed the hardcoded API key on line 42, the eval tells you that specifically.

    Step 5. Fix and re-run

    The skill-creator identifies gaps and suggests edits to your SKILL.md. If the skill missed an API key pattern, it might propose adding a step like "Scan for string patterns matching common API key formats (sk-, AKIA, ghp_, xox-)" to your instructions. After the edit, re-run the eval to verify the fix.
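
    To see what that added instruction asks Claude to look for, here is the same pattern check expressed as a quick script (illustrative only; the skill itself stays plain text):

        import re

        # Prefixes of common API key formats, as named in the suggested step.
        KEY_PATTERN = re.compile(r"\b(sk-|AKIA|ghp_|xox-)[A-Za-z0-9_-]+")

        staged_diff = 'API_KEY = "sk-test123456"  # hardcoded credential'
        match = KEY_PATTERN.search(staged_diff)
        if match:
            print(f"Possible hardcoded API key: {match.group(0)}")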

    This loop of eval, fix, and re-eval is where skills get good. Most skills go from roughly 60% pass rate on first eval to 90%+ after two or three iterations.

    How to benchmark a skill

    Benchmarks are evals with statistics. Instead of running the eval once, benchmark mode runs it N times and reports aggregate results.

    Use the skill-creator to benchmark my code-reviewer skill, 5 runs
    

    You get three metrics across runs:

    Pass rate as mean and standard deviation (e.g., 85% plus or minus 5%). This tells you how consistent the skill is. A skill that scores 100% once and 60% the next has a reliability problem even if the average looks decent.
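
    If you log each run's pass rate, the same numbers are easy to reproduce yourself. For example, five runs at 90%, 80%, 85%, 90%, and 80% give the 85% plus or minus 5% figure above:

        import statistics

        pass_rates = [0.90, 0.80, 0.85, 0.90, 0.80]  # one entry per benchmark run
        print(statistics.mean(pass_rates))   # 0.85
        print(statistics.stdev(pass_rates))  # 0.05 (sample standard deviation)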

    Token usage shows how many tokens the skill consumes per run. Skills that use excessive tokens cost more and risk hitting context limits. If your skill uses 8,000 tokens for something that should take 2,000, the instructions are probably too verbose or triggering unnecessary reasoning.

    Elapsed time tells you how long each run takes. Useful for skills integrated into CI pipelines or team workflows where speed matters.

    The real value is establishing a baseline. Run a benchmark today, save the numbers. After a model update, run it again. If pass rate drops from 90% to 70%, you know the update broke something and you know exactly which test cases failed.

    How to A/B test skill versions

    The improve mode runs a blind comparison between two versions of your skill.

    Use the skill-creator to improve my code-reviewer skill
    

    The skill-creator runs the same test prompts through both versions and passes the outputs to a separate comparator agent. The comparator doesn't know which output came from which version. It grades both on quality and tells you which performs better and why.

    This eliminates author bias. You aren't the one deciding whether your changes helped. An independent agent evaluates only the output quality, blind to which version produced it.
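
    A minimal sketch of what "blind" means here (the skill-creator's actual mechanics may differ):

        import random

        outputs = {
            "v1": "review produced by the current skill",
            "v2": "review produced by the edited skill",
        }

        # Shuffle the labels so the comparator sees only "A" and "B"
        # and cannot favor a known version.
        versions = list(outputs.items())
        random.shuffle(versions)
        blinded = {"A": versions[0][1], "B": versions[1][1]}
        key = {"A": versions[0][0], "B": versions[1][0]}  # revealed after grading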

    How to tune trigger descriptions

    A skill that never fires is useless. A skill that fires on every prompt wastes tokens.

    The skill-creator analyzes your skill's description and when_to_use fields against sample prompts and suggests edits that reduce both false positives (firing when it shouldn't) and false negatives (not firing when it should).

    Use the skill-creator to optimize triggers for my code-reviewer skill
    

    It generates prompts that should trigger the skill and prompts that shouldn't, then tests each one. After multiple iterations, it produces a revised description optimized for accurate triggering. Anthropic ran this across their own document-creation skills and saw improved accuracy on 5 of the 6 skills.

    Tip: Anthropic's own guidance says to make descriptions "a little bit pushy" because Claude tends to underuse skills when they'd actually help. A description like "Reviews staged git changes for security, logic, and style issues before commit" triggers more reliably than "Helps with code review."
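
    Applied to the code-reviewer example, the frontmatter might look like this (field names follow the ones referenced above; treat the exact layout as illustrative, not canonical):

        ---
        name: code-reviewer
        description: Reviews staged git changes for security, logic, and style issues before commit.
        when_to_use: The user asks to review code, a diff, or staged changes, or mentions pre-commit checks.
        ---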

    What to eval: capability skills vs. preference skills

    Not every skill needs the same kind of eval.

    Capability skills do a specific job: review code, generate tests, write commit messages, scaffold documentation. Evals for these should test whether the job gets done correctly. Define clear pass/fail criteria based on output quality.

    Preference skills enforce a style or process: "always use conventional commits," "follow our naming convention," "structure components this way." Evals for these should test whether the process is followed. Check that the output matches the prescribed format, not just that it's generally good.
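
    For the conventional-commits example, a pass/fail criterion can be as mechanical as a format match. A hypothetical check, assuming the standard type(scope): subject shape:

        import re

        # Conventional commit: type, optional scope, colon, space, subject.
        CONVENTIONAL = re.compile(
            r"^(feat|fix|docs|style|refactor|test|chore)(\([\w-]+\))?: .+"
        )

        print(bool(CONVENTIONAL.match("feat(auth): add login rate limiting")))  # True
        print(bool(CONVENTIONAL.match("added some stuff")))                     # False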

    The distinction matters because model updates affect capability skills more than preference skills. A smarter model might change how it approaches a code review, breaking carefully sequenced steps. A preference skill that says "use kebab-case for filenames" is less likely to break because the instruction is unambiguous.

    For more on writing effective skills, see How to Create a SKILL.md File from Scratch.

    Building evals into your skill workflow

    If you publish skills on a marketplace like Agensi, evals are how you prove quality. Any creator can claim their skill works. Eval results show that it actually does.

    Here's a workflow that keeps skills reliable long term:

    Before publishing. Run evals with at least 5 test prompts covering the main use cases and 2-3 negative prompts. Aim for 90%+ pass rate. Run a benchmark with 3-5 runs to confirm consistency.

    After model updates. Re-run benchmarks within a week of any Claude model release. Compare against your baseline. If pass rate drops more than 10 percentage points, investigate and fix.

    When iterating. Use improve mode before making changes. Run the A/B test so you have a before/after comparison. This prevents regressions where fixing one test case breaks another.

    For teams. Store eval prompts and expected outputs alongside your SKILL.md in the skill folder. This makes evals reproducible by anyone on the team, not just the original author.
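
    One way to lay that out (the structure and file names are placeholders, not a requirement of the format):

        code-reviewer/
        ├── SKILL.md
        └── evals/
            ├── prompts.md     # positive and negative test prompts
            └── expected.md    # pass/fail criteria for each prompt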

    Common eval mistakes to avoid

    Testing only the happy path. If your code review skill only gets tested on code with obvious bugs, you won't know how it handles clean code or subtle issues. Include edge cases.

    Vague expected outputs. "Should give good feedback" isn't a useful criterion. "Should identify the race condition in the async handler and suggest using a mutex or lock" is.

    Running evals once. A single pass doesn't tell you about consistency. Always benchmark with multiple runs, especially for capability skills where model behavior has inherent variance.

    Ignoring token usage. A skill that passes all evals but uses 15,000 tokens for what should take 2,000 is wasting money and context. Watch the metrics.

    Skipping negative tests. Without prompts that shouldn't trigger your skill, you can't measure false positives. A skill with high recall but low precision wastes tokens on every unrelated request.

    Where to go from here

    If you publish skills, evals are how you build trust. Browse the SKILL.md skills marketplace on Agensi to see what others ship, then run the same evals on your own. Share your results. Skills that pass benchmarks across multiple Claude versions are the ones developers trust enough to pay for.

    For more on the broader workflow, read How to Create a SKILL.md File from Scratch and SKILL.md Creator Checklist.

