    Rag Eval

    by Ifásola

    Diagnose RAG bottlenecks with retrieval metrics (Recall@k, MRR, nDCG) to tell retrieval failures apart from ranking failures.

    Updated Jun 2026
    Security scanned
    Cursor

    Sample input

    Evaluate our latest retriever results in retrieval_results.jsonl and tell me where to focus.

    Sample output

    Metrics:

    • Recall@5: 0.45
    • MRR: 0.32
    • nDCG@5: 0.38

    Verdict: [RETRIEVAL BOTTLENECK] Recall is critically low. Your retriever is missing the relevant docs entirely. Focus on improving your embedding model or chunking strategy before tuning the prompt.

    About This Skill

    Diagnostic Tools for RAG Performance

    Pinpointing why a Retrieval-Augmented Generation (RAG) system is failing can be a guessing game. Is the embedding model weak? Is the chunking strategy off? Or is the LLM hallucinating even though it was given the right context? This skill replaces the guesswork with a standardized evaluation of your retrieval pipeline.

    Data-Driven Insights

    By comparing your retriever's output against a labeled ground-truth set, this tool calculates standard retrieval metrics: Recall@k, Precision@k, Hit-Rate, Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (nDCG). It then turns those numbers into a technical verdict on where your bottleneck lies: retrieval, ranking, or generation.
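
For readers who want to see what these metrics measure, here is a minimal sketch using the textbook definitions with binary relevance. It is not the skill's own code: the function names, the `(ranked_ids, relevant_ids)` input shape, and the choice of binary relevance for nDCG are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k slots filled by relevant documents."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def hit_rate_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant document is in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG with binary relevance (assumption): gain is 1 for relevant, else 0."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def evaluate(queries: List[Tuple[Sequence[str], Set[str]]], k: int) -> Dict[str, float]:
    """Average each metric over all (ranked_ids, relevant_ids) pairs."""
    names = ("recall", "precision", "hit_rate", "mrr", "ndcg")
    if not queries:
        return {name: 0.0 for name in names}
    totals = {name: 0.0 for name in names}
    for ranked, relevant in queries:
        totals["recall"] += recall_at_k(ranked, relevant, k)
        totals["precision"] += precision_at_k(ranked, relevant, k)
        totals["hit_rate"] += hit_rate_at_k(ranked, relevant, k)
        totals["mrr"] += reciprocal_rank(ranked, relevant)
        totals["ndcg"] += ndcg_at_k(ranked, relevant, k)
    return {name: total / len(queries) for name, total in totals.items()}
```

Calling `evaluate([(["d3", "d1", "d7", "d2", "d9"], {"d1", "d2", "d4"})], k=5)` on this made-up query gives a recall of 2/3 (two of three relevant documents retrieved) and an MRR of 0.5 (the first relevant document sits at rank 2).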

    What it helps you solve

    • Low Recall: Identifies when your embeddings, chunking strategy, or indexing are failing to surface relevant documents.
    • Ranking Issues: Detects when relevant documents are retrieved but ranked too low to make it into the LLM's context window.
    • Generation Bottlenecks: Confirms when retrieval is healthy, indicating that your issues stem from the prompt or the LLM's reasoning capabilities.
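
The three cases above amount to a small decision rule, which could be sketched as follows. The thresholds, the function name `diagnose`, and the `[RANKING BOTTLENECK]` and `[GENERATION BOTTLENECK]` labels are hypothetical; only the `[RETRIEVAL BOTTLENECK]` label appears in the sample output, and the skill's real cutoffs are not documented here.

```python
from typing import Dict

# Hypothetical thresholds: tune them against your own labeled set.
RECALL_FLOOR = 0.70
MRR_FLOOR = 0.50


def diagnose(
    metrics: Dict[str, float],
    recall_floor: float = RECALL_FLOOR,
    mrr_floor: float = MRR_FLOOR,
) -> str:
    """Map averaged metrics (keys "recall" and "mrr") to a bottleneck label."""
    # Relevant documents never reach the top k: fix retrieval first.
    if metrics["recall"] < recall_floor:
        return "[RETRIEVAL BOTTLENECK]"
    # Relevant documents are retrieved, but the first one sits too far down.
    if metrics["mrr"] < mrr_floor:
        return "[RANKING BOTTLENECK]"
    # Retrieval and ranking look healthy, so look at the prompt or the LLM.
    return "[GENERATION BOTTLENECK]"
```

Checking recall before MRR matters: a low MRR says little about ranking quality when the relevant documents were never retrieved in the first place.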

    The tool has no external dependencies: it runs on the Python standard library alone, so it fits into CI/CD pipelines and local development workflows without extra setup.

    Use Cases

    • Identify if RAG failures are caused by retrieval, ranking, or generation.
    • Calculate Recall@k and nDCG to benchmark different embedding models.
    • Automate regression testing for vector database index updates.
    • Generate data-driven verdicts to guide chunking and metadata strategy.
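
As one way to picture the regression-testing use case, the sketch below compares a candidate run's metrics against a stored baseline and reports any metric that dropped by more than a tolerance. The function name `find_regressions`, the metric dictionaries, and the 0.02 tolerance are assumptions for illustration, not part of the skill.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,  # hypothetical allowed drop per metric
) -> List[str]:
    """Return one message per metric that is missing or dropped too far."""
    problems = []
    for name, old_value in sorted(baseline.items()):
        if name not in candidate:
            problems.append(f"{name}: missing from candidate run")
            continue
        drop = old_value - candidate[name]
        if drop > tolerance:
            problems.append(
                f"{name}: {old_value:.3f} -> {candidate[name]:.3f} (drop {drop:.3f})"
            )
    return problems
```

In a CI job, a non-empty result could be printed and followed by a non-zero exit code so that an index update that hurts retrieval fails the build.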

    Security Scanned

    Passed automated security review

    Permissions

    No special permissions declared or detected

    Requires Python 3.8+. No external dependencies (standard library only). Works with any SKILL.md-compatible agent (Claude Code, Cursor, Codex CLI, Gemini CLI).
