    How to Eval and Benchmark Your SKILL.md Skills (2026 Guide)

    Test, benchmark, and A/B test your SKILL.md skills before publishing. How to use the skill-creator, define test prompts, and tune trigger descriptions.

    May 8, 2026 · 9 min read

    A skill that works today might break after a model update. Evals let you test whether your SKILL.md skills actually do what they claim, and benchmarks let you measure them over time. Here's how to use the skill-creator's eval system to build skills that stay reliable.

    Quick Answer: Use the skill-creator in eval mode to test your SKILL.md skills against defined prompts and expected outputs. It runs 4 parallel sub-agents in isolated contexts, grades pass/fail, and tracks metrics like pass rate, token usage, and elapsed time. Run evals after every model update or skill edit. Access it in Claude Code, Claude.ai, or Cowork by asking Claude to "use the skill-creator to eval my skill." No code required.

    Why evals matter for SKILL.md skills

    SKILL.md skills are text instructions. When the underlying model updates, the way it interprets those instructions can shift. A code review skill that reliably flagged security issues last week might start missing them after a model release. You won't know unless you test.

    Before the skill-creator update in March 2026, most skill authors followed the same pattern: write the SKILL.md, test it once manually, ship it, and hope. There was no structured way to verify that a skill worked, let alone measure whether it improved or degraded over time.

    The skill-creator now operates in four modes:

    • Create walks you through writing a new skill from scratch.
    • Eval runs your skill against test prompts and grades the output against criteria you define.
    • Improve runs blind A/B comparisons between two versions of a skill.
    • Benchmark runs the eval suite multiple times and reports mean pass rate, standard deviation, token usage, and elapsed time.

    Everything works in natural language. You don't need to write code or configure test frameworks. If you can describe what your skill should do, you can eval it.

    For background on the SKILL.md format itself, read What Is SKILL.md?.

    How to run your first eval

    Step 1. Open the skill-creator

    In Claude Code, start a session and say:

    Use the skill-creator to eval my code-reviewer skill
    

    In Claude.ai or Cowork, the same prompt works. Claude loads the skill-creator and switches to eval mode.

    Step 2. Define test prompts

    The skill-creator asks you to provide test prompts. These are the requests a user would type that should trigger your skill. For a code review skill, you might define:

    • "Review this Python function for security issues" with a sample function containing an SQL injection vulnerability
    • "Check this React component for performance problems" with a component that re-renders unnecessarily
    • "Review my git diff before I commit" with staged changes that include a hardcoded API key

    You also define negative prompts, requests that should not trigger the skill:

    • "Write a README for my project"
    • "Help me debug this error"

    Step 3. Define expected outputs

    For each test prompt, describe what the skill's output should include. You don't need exact text matches. Describe the criteria:

    • "Should identify the SQL injection risk in the get_user function"
    • "Should flag the missing useMemo or useCallback for the expensive computation"
    • "Should catch the hardcoded API key and recommend using environment variables"

    Step 4. Run the eval

    The skill-creator launches 4 sub-agents in parallel, each in an isolated context. This prevents contamination between tests. One test's output can't influence another. Each sub-agent runs a test prompt through your skill and grades the output against your criteria.
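
    The skill-creator handles this orchestration for you, but the pattern is easy to picture. A rough sketch of the idea, not the tool's actual internals (run_test here is a stand-in for a sub-agent call):

        from concurrent.futures import ThreadPoolExecutor

        test_prompts = [
            "Review this Python function for security issues",
            "Check this React component for performance problems",
            "Review my git diff before I commit",
            "Write a README for my project",  # negative: skill should not fire
        ]

        def run_test(prompt: str) -> str:
            # Stand-in for a sub-agent running the skill in a fresh, isolated
            # context; nothing from the other tests is visible here.
            return f"(output for: {prompt!r})"

        with ThreadPoolExecutor(max_workers=4) as pool:
            outputs = list(pool.map(run_test, test_prompts))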

    You get a results summary showing which tests passed, which failed, and why. If the skill missed the hardcoded API key on line 42, the eval tells you that specifically.

    Step 5. Fix and re-run

    The skill-creator identifies gaps and suggests edits to your SKILL.md. If the skill missed an API key pattern, it might propose adding a step like "Scan for string patterns matching common API key formats (sk-, AKIA, ghp_, xox-)" to your instructions. After the edit, re-run the eval to verify the fix.
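
    To see what that added instruction asks Claude to look for, here is the same pattern check expressed as a quick script (illustrative only; the skill itself stays plain text):

        import re

        # Prefixes of common API key formats, as named in the suggested step.
        KEY_PATTERN = re.compile(r"\b(sk-|AKIA|ghp_|xox-)[A-Za-z0-9_-]+")

        staged_diff = 'API_KEY = "sk-test123456"  # hardcoded credential'
        match = KEY_PATTERN.search(staged_diff)
        if match:
            print(f"Possible hardcoded API key: {match.group(0)}")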

    This loop of eval, fix, and re-eval is where skills get good. Most skills go from roughly 60% pass rate on first eval to 90%+ after two or three iterations.

    How to benchmark a skill

    Benchmarks are evals with statistics. Instead of running the eval once, benchmark mode runs it N times and reports aggregate results.

    Use the skill-creator to benchmark my code-reviewer skill, 5 runs
    

    You get three metrics across runs:

    Pass rate as mean and standard deviation (e.g., 85% plus or minus 5%). This tells you how consistent the skill is. A skill that scores 100% once and 60% the next has a reliability problem even if the average looks decent.
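
    If you log each run's pass rate, the same numbers are easy to reproduce yourself. For example, five runs at 90%, 80%, 85%, 90%, and 80% give the 85% plus or minus 5% figure above:

        import statistics

        pass_rates = [0.90, 0.80, 0.85, 0.90, 0.80]  # one entry per benchmark run
        print(statistics.mean(pass_rates))   # 0.85
        print(statistics.stdev(pass_rates))  # 0.05 (sample standard deviation)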

    Token usage shows how many tokens the skill consumes per run. Skills that use excessive tokens cost more and risk hitting context limits. If your skill uses 8,000 tokens for something that should take 2,000, the instructions are probably too verbose or triggering unnecessary reasoning.

    Elapsed time tells you how long each run takes. Useful for skills integrated into CI pipelines or team workflows where speed matters.

    The real value is establishing a baseline. Run a benchmark today, save the numbers. After a model update, run it again. If pass rate drops from 90% to 70%, you know the update broke something and you know exactly which test cases failed.

    How to A/B test skill versions

    The improve mode runs a blind comparison between two versions of your skill.

    Use the skill-creator to improve my code-reviewer skill
    

    The skill-creator runs the same test prompts through both versions and passes the outputs to a separate comparator agent. The comparator doesn't know which output came from which version. It grades both on quality and tells you which performs better and why.

    This eliminates author bias. You aren't the one deciding whether your changes helped. An independent agent evaluates only the output quality, blind to which version produced it.
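
    A minimal sketch of what "blind" means here (the skill-creator's actual mechanics may differ):

        import random

        outputs = {
            "v1": "review produced by the current skill",
            "v2": "review produced by the edited skill",
        }

        # Shuffle the labels so the comparator sees only "A" and "B"
        # and cannot favor a known version.
        versions = list(outputs.items())
        random.shuffle(versions)
        blinded = {"A": versions[0][1], "B": versions[1][1]}
        key = {"A": versions[0][0], "B": versions[1][0]}  # revealed after grading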

    How to tune trigger descriptions

    A skill that never fires is useless. A skill that fires on every prompt wastes tokens.

    The skill-creator analyzes your skill's description and when_to_use fields against sample prompts and suggests edits that reduce both false positives (firing when it shouldn't) and false negatives (not firing when it should).

    Use the skill-creator to optimize triggers for my code-reviewer skill
    

    It generates prompts that should trigger the skill and prompts that shouldn't, then tests each one. After multiple iterations, it produces a revised description optimized for accurate triggering. Anthropic ran this across their own document-creation skills and saw improved accuracy on 5 of the 6 skills.

    Tip: Anthropic's own guidance says to make descriptions "a little bit pushy" because Claude tends to underuse skills when they'd actually help. A description like "Reviews staged git changes for security, logic, and style issues before commit" triggers more reliably than "Helps with code review."
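
    Applied to the code-reviewer example, the frontmatter might look like this (field names follow the ones referenced above; treat the exact layout as illustrative, not canonical):

        ---
        name: code-reviewer
        description: Reviews staged git changes for security, logic, and style issues before commit.
        when_to_use: The user asks to review code, a diff, or staged changes, or mentions pre-commit checks.
        ---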

    What to eval: capability skills vs. preference skills

    Not every skill needs the same kind of eval.

    Capability skills do a specific job: review code, generate tests, write commit messages, scaffold documentation. Evals for these should test whether the job gets done correctly. Define clear pass/fail criteria based on output quality.

    Preference skills enforce a style or process: "always use conventional commits," "follow our naming convention," "structure components this way." Evals for these should test whether the process is followed. Check that the output matches the prescribed format, not just that it's generally good.
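
    For the conventional-commits example, a pass/fail criterion can be as mechanical as a format match. A hypothetical check, assuming the standard type(scope): subject shape:

        import re

        # Conventional commit: type, optional scope, colon, space, subject.
        CONVENTIONAL = re.compile(
            r"^(feat|fix|docs|style|refactor|test|chore)(\([\w-]+\))?: .+"
        )

        print(bool(CONVENTIONAL.match("feat(auth): add login rate limiting")))  # True
        print(bool(CONVENTIONAL.match("added some stuff")))                     # False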

    The distinction matters because model updates affect capability skills more than preference skills. A smarter model might change how it approaches a code review, breaking carefully sequenced steps. A preference skill that says "use kebab-case for filenames" is less likely to break because the instruction is unambiguous.

    For more on writing effective skills, see How to Create a SKILL.md File from Scratch.

    Building evals into your skill workflow

    If you publish skills on a marketplace like Agensi, evals are how you prove quality. Any creator can claim their skill works. Eval results show that it actually does.

    Here's a workflow that keeps skills reliable long term:

    Before publishing. Run evals with at least 5 test prompts covering the main use cases and 2-3 negative prompts. Aim for 90%+ pass rate. Run a benchmark with 3-5 runs to confirm consistency.

    After model updates. Re-run benchmarks within a week of any Claude model release. Compare against your baseline. If pass rate drops more than 10 percentage points, investigate and fix.

    When iterating. Use improve mode before making changes. Run the A/B test so you have a before/after comparison. This prevents regressions where fixing one test case breaks another.

    For teams. Store eval prompts and expected outputs alongside your SKILL.md in the skill folder. This makes evals reproducible by anyone on the team, not just the original author.
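
    One way to lay that out (the structure and file names are placeholders, not a requirement of the format):

        code-reviewer/
        ├── SKILL.md
        └── evals/
            ├── prompts.md     # positive and negative test prompts
            └── expected.md    # pass/fail criteria for each prompt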

    Common eval mistakes to avoid

    Testing only the happy path. If your code review skill only gets tested on code with obvious bugs, you won't know how it handles clean code or subtle issues. Include edge cases.

    Vague expected outputs. "Should give good feedback" isn't a useful criterion. "Should identify the race condition in the async handler and suggest using a mutex or lock" is.

    Running evals once. A single pass doesn't tell you about consistency. Always benchmark with multiple runs, especially for capability skills where model behavior has inherent variance.

    Ignoring token usage. A skill that passes all evals but uses 15,000 tokens for what should take 2,000 is wasting money and context. Watch the metrics.

    Skipping negative tests. Without prompts that shouldn't trigger your skill, you can't measure false positives. A skill with high recall but low precision wastes tokens on every unrelated request.

    Where to go from here

    If you publish skills, evals are how you build trust. Browse the SKILL.md skills marketplace on Agensi to see what others ship, then run the same evals on your own. Share your results. Skills that pass benchmarks across multiple Claude versions are the ones developers trust enough to pay for.

    For more on the broader workflow, read How to Create a SKILL.md File from Scratch and SKILL.md Creator Checklist.

