Rag Eval
by Ifásola
Diagnose RAG bottlenecks with standard retrieval metrics (Recall, MRR, nDCG) to identify retrieval or ranking failures.
- Identify if RAG failures are caused by retrieval, ranking, or generation.
- Calculate Recall@k and nDCG to benchmark different embedding models.
- Automate regression testing for vector database index updates.
$5 or 25 credits
Secure checkout via Stripe
Included in download
- Ready for Cursor
- Instant install
Sample input
Evaluate our latest retriever results in retrieval_results.jsonl and tell me where to focus.
Sample output
Metrics:
- Recall@5: 0.45
- MRR: 0.32
- nDCG@5: 0.38
Verdict: [RETRIEVAL BOTTLENECK] Recall is critically low. Your retriever is missing the relevant docs entirely. Focus on improving your embedding model or chunking strategy before tuning the prompt.
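The listing does not document the layout of retrieval_results.jsonl. As a minimal sketch, assume one JSON object per line with three hypothetical fields: `query_id`, `retrieved` (ranked document IDs) and `relevant` (ground-truth document IDs). Under that assumption, a file like this could be parsed with the standard library alone:

```python
import json
from typing import Dict, Iterable, Iterator

# Hypothetical schema: these field names are an assumption, not the skill's documented format.
REQUIRED_FIELDS = {"query_id", "retrieved", "relevant"}


def parse_results(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one record per non-blank JSONL line, validating the assumed fields."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        yield {
            "query_id": record["query_id"],
            "retrieved": list(record["retrieved"]),  # ranked order is preserved
            "relevant": set(record["relevant"]),     # order is irrelevant for ground truth
        }


# Example use against a file on disk:
#   with open("retrieval_results.jsonl", encoding="utf-8") as fh:
#       records = list(parse_results(fh))
```

Failing early on a malformed line keeps a bad export from silently skewing the metrics.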
About This Skill
Diagnostic Tools for RAG Performance
Pinpointing why a Retrieval-Augmented Generation (RAG) system is failing can be a guessing game. Is the embedding model weak? Is the chunking strategy off? Or is the LLM simply hallucinating despite having the right context? This skill eliminates the guesswork by providing a standardized evaluation framework for your retrieval pipeline.
Data-Driven Insights
By comparing your retriever's output against a labeled ground-truth set, this tool calculates industry-standard metrics including Recall@k, Precision@k, Hit-Rate, Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (nDCG). It goes beyond raw numbers to provide a technical verdict on where your bottleneck lies.
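To make the metrics named above concrete, here is a rough sketch of how they are commonly computed for a single query with binary relevance (a document is either relevant or not). It is not the skill's own code: the function names are illustrative, and it assumes each document ID appears at most once in the ranked list.

```python
import math
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of all relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top k slots that hold a relevant document."""
    if k <= 0:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / k


def hit_rate_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant document is in the top k, else 0.0."""
    return 1.0 if set(retrieved[:k]) & relevant else 0.0


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the best possible DCG."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, doc_id in enumerate(retrieved[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

MRR is the mean of `reciprocal_rank` over all queries, and the other reported figures are likewise per-query values averaged across the evaluation set.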
What it helps you solve
- Low Recall: Identifies when your embeddings, chunking strategy, or indexing are failing to surface relevant documents.
- Ranking Issues: Detects when relevant documents are being retrieved but ranked too low for the LLM's context window.
- Generation Bottlenecks: Confirms when retrieval is healthy, indicating that your issues stem from the prompt or the LLM's reasoning capabilities.
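The three cases above suggest a simple decision order: check recall first, then ranking quality, and only then look past retrieval. The sketch below illustrates that order. The threshold values, the function name and every label other than RETRIEVAL BOTTLENECK are assumptions made for illustration; the listing does not state the cut-offs the skill actually uses.

```python
def diagnose(
    recall: float,
    mrr: float,
    ndcg: float,
    recall_floor: float = 0.7,   # assumed threshold, tune for your corpus
    ranking_floor: float = 0.6,  # assumed threshold, tune for your corpus
) -> str:
    """Map aggregate retrieval metrics to a coarse verdict."""
    if recall < recall_floor:
        # Relevant documents never reach the candidate list.
        return "RETRIEVAL BOTTLENECK"
    if mrr < ranking_floor or ndcg < ranking_floor:
        # Relevant documents are retrieved but sit too far down the ranking.
        return "RANKING BOTTLENECK"
    # Retrieval looks healthy, so the prompt or the LLM is the next suspect.
    return "RETRIEVAL HEALTHY"
```

With these assumed thresholds, the sample metrics (Recall@5 0.45, MRR 0.32, nDCG@5 0.38) fall into the first branch, matching the sample verdict.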
This developer-centric tool has no external dependencies: it runs on the Python standard library alone, which makes it easy to integrate into CI/CD pipelines or local development workflows.
Use Cases
- Identify if RAG failures are caused by retrieval, ranking, or generation.
- Calculate Recall@k and nDCG to benchmark different embedding models.
- Automate regression testing for vector database index updates.
- Generate data-driven verdicts to guide chunking and metadata strategy.
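For the regression-testing use case, one possible approach is to store the metrics from a known-good index as a baseline and fail the build when a later run drops below it. The following is a sketch under that assumption. The function name, the metric keys and the 0.02 tolerance are all hypothetical and not part of the skill.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,  # assumed slack for run-to-run noise
) -> List[str]:
    """Return a message for every metric that fell more than `tolerance` below baseline."""
    regressions = []
    for name, base_value in sorted(baseline.items()):
        new_value = current.get(name)
        if new_value is None:
            regressions.append(f"{name}: missing from current run")
        elif new_value < base_value - tolerance:
            regressions.append(f"{name}: {base_value:.3f} -> {new_value:.3f}")
    return regressions
```

In a CI job, a non-empty result would typically be printed and followed by a non-zero exit code so that the index update is blocked until someone reviews it.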
How to Install
mkdir -p ~/.claude/skills && curl -sL https://www.agensi.io/api/install/rag-eval -o /tmp/rag-eval.zip && unzip -o /tmp/rag-eval.zip -d ~/.claude/skills && rm /tmp/rag-eval.zip
Free skills install directly. Paid skills require purchase; use the download button above after buying.
Security Scanned
Passed automated security review
Permissions
No special permissions declared or detected
Requires Python 3.8+. No external dependencies (standard library only). Works with any SKILL.md-compatible agent (Claude Code, Cursor, Codex CLI, Gemini CLI).
More Premium Skills
designing-hybrid-context-layers
Architects the right retrieval strategy for every query — teaching your agent when to use RAG, a knowledge graph, or a temporal index instead of defaulting to vector search for everything.
synthesizing-institutional-knowledge
Builds the organizational memory schema your AI agent needs to answer why — capturing decision provenance, causal chains, and event context that embedding-based retrieval permanently discards.
diagnosing-rag-failure-modes
RAG fails quietly. It retrieves documents, returns confident-looking answers, and misses the question entirely — because the question required connecting facts across documents, reasoning about sequence, or tracing causation. This skill gives you a five-question diagnostic checklist that classifies any failing query as either RAG-safe or structurally RAG-incompatible, then maps it to the specific failure pattern and the architectural fix that resolves it.
ai-automation-qa-pack
Professional QA & UAT documentation generator for AI automation agencies and complex agent deployments.