Rag Failure Diagnostics

Diagnose broken RAG systems across 8 failure categories, including chunking, embeddings, retrieval, reranking, and hallucination. Includes recall@k measurement.

Updated Jun 2026

A structured diagnostic workflow for broken RAG systems. Identifies root cause across chunking, embedding, retrieval, reranking, generation, and citation layers. Outputs a fix plan ranked by impact.

Security scanned

Instant install

Free

Included in download

Downloadable skill package
2 permissions declared

Name: Rag Failure Diagnostics
Availability: InStock
Author: Agensi

by Kaymue

0 installs

Also available via Agensi MCP: your AI agent can load this skill on demand.

About This Skill

# RAG Failure Diagnostics

Your RAG system gives bad answers. Users complain. You don't know if it's the chunking, the embeddings, the retriever, the reranker, or the LLM. This skill is the diagnostic workflow that finds the broken layer and tells you exactly how to fix it.

## What it does

A **layered diagnostic protocol** for RAG pipelines. The skill walks your agent through 8 failure categories, each with specific probes, expected signals, and fixes:

1. **Chunking**: wrong sizes, broken boundaries, lost context
2. **Embedding model**: wrong model for domain, mis-normalization, off-the-shelf vs fine-tuned
3. **Vector store**: index type mismatch, distance metric wrong, no metadata filter
4. **Retriever**: top-k too low, no hybrid search, semantic-only
5. **Reranker**: missing, mis-tuned, or scoring wrong
6. **Prompt assembly**: context stuffing, instruction dilution, no citations
7. **LLM hallucination**: ignoring context, inventing facts, refusing to say "I don't know"
8. **End-to-end latency / cost**: N+1 queries, no caching, oversized context

For each, the skill provides:

- **Probe questions** to ask the user / inspect the system
- **Code-level checks** to run against the codebase
- **Expected signals** (what good looks like)
- **Concrete fix** with copy-paste code

## When to use it

- Your RAG answers are off-topic or hallucinating
- Users say "it doesn't know what's in our docs"
- Retrieval latency is too high for production
- You're paying too much for LLM calls because context is too long
- You can't tell which layer is the bottleneck
- You're about to ship a RAG feature and want a pre-flight check

## Why it's better than ad-hoc prompting

Most "debug my RAG" prompts give vague advice ("try smaller chunks").
This skill is different:

- **Systematic**: walks all 8 layers in order, so no layer gets skipped
- **Quantified**: every probe returns a number, not a feeling
- **Reproducible**: same inputs → same diagnosis
- **Prioritized**: ranks fixes by impact and effort
- **Code-aware**: actually reads your retriever/embedder/prompter code

## Architecture

```
┌─────────────────────────────────────────────────────────┐
│ Agent (Claude/Cursor)                                   │
│ - Reads user complaint, asks clarifying questions       │
│ - Inspects codebase with Grep/Read                      │
│ - Runs diagnostic scripts                               │
│ - Synthesizes fix plan                                  │
└───────────────┬─────────────────────────────────────────┘
                │
                ▼
┌─────────────────────────────────────────────────────────┐
│ skills/rag-failure-diagnostics/                         │
│   scripts/                                              │
│   ├── probe_chunking.py       # Chunk size & boundary   │
│   ├── probe_embeddings.py     # Embedding quality probe │
│   ├── probe_retrieval.py      # Recall@k calculation    │
│   ├── probe_latency.py        # Per-layer timing        │
│   └── probe_hallucination.py  # Faithfulness check      │
│   references/                                           │
│   ├── diagnosis-workflow.md                             │
│   ├── chunking-strategies.md                            │
│   ├── embedding-model-selection.md                      │
│   └── fix-templates.md                                  │
│   data/                                                 │
│   └── golden_test_set.json  # 50 Q&A pairs for testing  │
└─────────────────────────────────────────────────────────┘
```

## Quick start

```bash
# 1. Install
pip install numpy scikit-learn sentence-transformers

# 2. Probe chunking
python scripts/probe_chunking.py --chunks-file ./chunks.jsonl

# 3. Probe embeddings
python scripts/probe_embeddings.py --docs ./corpus/ --queries ./queries.txt

# 4. Probe retrieval
python scripts/probe_retrieval.py --golden ./data/golden_test_set.json \
    --index ./faiss.index --k 5

# 5. Probe latency
python scripts/probe_latency.py --pipeline ./rag_pipeline.py --queries ./queries.txt

# 6. Probe hallucination
python scripts/probe_hallucination.py --answers ./answers.jsonl --context ./contexts.jsonl
```

## The 8 failure categories (summary)

| # | Layer | Symptom | Most likely cause |
|---|-------|---------|-------------------|
| 1 | Chunking | "It misses mid-paragraph context" | Chunks too small / no overlap |
| 2 | Embeddings | "It can't find docs I know are there" | Wrong model for domain / no normalization |
| 3 | Vector store | "Search returns weird neighbors" | Wrong distance metric (cosine vs L2) |
| 4 | Retriever | "Top-k results are off-topic" | k too low, no hybrid, no metadata filter |
| 5 | Reranker | "First results are good, rest is junk" | Missing reranker, or it's not trained on your data |
| 6 | Prompt assembly | "Model ignores the context" | Stuffing 50k tokens, instructions lost |
| 7 | LLM hallucination | "It invents citations" | No "say I don't know" rule, temperature too high |
| 8 | Latency / cost | "p95 latency is 8 seconds" | No caching, oversized context, N+1 retrievals |

## Pricing

Single-purchase, lifetime access. $19.00. Includes:

- 5 Python diagnostic scripts
- 4 reference docs (workflow, chunking, embeddings, fixes)
- 50 Q&A golden test set for evaluation
- 12 fix templates (copy-paste LangChain / LlamaIndex code)
- Future updates for the same major version

## Example usage

> "Our RAG system returns the wrong product specs 40% of the time. Latency is 6s. Find the broken layer."

The skill will:

1. Read your retriever + prompt code
2. Run chunking probe (likely finds: chunks too small, no overlap)
3. Run embedding probe (likely finds: general-purpose 384d model, `all-MiniLM-L6-v2`, on technical product docs)
4. Run retrieval probe (likely finds: recall@5 = 0.42, target is 0.85)
5. Run hallucination probe (likely finds: model invents spec values)
6. Output ranked fix plan:
   - **Fix 1** (high impact, low effort): Switch to semantic chunking with 200-token chunks, 20-token overlap → recall@5 to 0.78
   - **Fix 2** (high impact, medium effort): Replace `all-MiniLM-L6-v2` with `bge-large-en-v1.5` → recall@5 to 0.86
   - **Fix 3** (medium impact, low effort): Add `bge-reranker-base` → precision@5 to 0.91
   - **Fix 4** (high impact, medium effort): Add "Say 'I don't know' if context doesn't contain the answer" → hallucination from 40% to 5%

## What you should already have

This skill assumes:

- A working RAG pipeline (any framework: LangChain, LlamaIndex, custom)
- Access to a sample of user queries and expected answers (golden set)
- Permission to read the codebase

It does **not** require:

- A GPU (probes run on CPU with small models)
- A specific vector DB (works with FAISS, Pinecone, Qdrant, Weaviate, Chroma)
- Production data (works fine with samples)

## Compatibility

Works with any agent that supports the SKILL.md standard and can execute Python: Claude Code, OpenClaw, Codex CLI, Cursor, Gemini CLI, Cline, Windsurf, Aider. Tested on Linux, macOS, Windows.

## Tags

rag, llm, retrieval, embeddings, vector-db, langchain, llamaindex, debugging, ai-architecture
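
## Sketch: computing recall@k

The retrieval probe described above reports a recall@k figure (for example, recall@5 = 0.42 against a 0.85 target). The source of `probe_retrieval.py` is not shown on this page, so the following is only a minimal sketch of how such a metric is commonly computed. The function name `recall_at_k`, the dictionary layout, and the document IDs are assumptions made for illustration, not the skill's actual schema.

```python
from typing import Dict, List, Set


def recall_at_k(
    retrieved: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Mean over queries of |top-k retrieved ∩ relevant| / |relevant|.

    `retrieved` maps a query ID to its ranked list of document IDs.
    `relevant` maps a query ID to the set of document IDs that answer it.
    Both layouts are hypothetical; adapt them to your golden set.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")

    scores = []
    for query_id, gold in relevant.items():
        if not gold:
            continue  # a query with no labelled documents has no defined recall
        top_k = set(retrieved.get(query_id, [])[:k])
        scores.append(len(top_k & gold) / len(gold))

    return sum(scores) / len(scores) if scores else 0.0


# Toy golden set: three queries with hand-labelled relevant documents.
relevant = {
    "q1": {"d1"},
    "q2": {"d2", "d5"},
    "q3": {"d4"},
}
# Ranked retriever output for the same queries.
retrieved = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d5"],
    "q3": ["d8", "d6", "d0"],
}

print(f"recall@2 = {recall_at_k(retrieved, relevant, 2):.2f}")
print(f"recall@3 = {recall_at_k(retrieved, relevant, 3):.2f}")
```

Comparing the value at several `k` helps separate two symptoms from the summary table: if recall climbs sharply as `k` grows, the relevant documents are indexed but ranked too low (a retriever or reranker problem); if it stays flat, they are probably not being matched at all (a chunking or embedding problem).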

Use Cases

A structured diagnostic workflow for broken RAG systems. Identifies root cause across chunking, embedding, retrieval, reranking, generation, and citation layers. Outputs a fix plan ranked by impact.
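
The page does not document how the fix plan is ordered beyond "ranked by impact" (and, in the skill description, "impact and effort"). As a rough sketch under that assumption, one plausible rule is to sort by impact first and break ties by lower effort. The `Fix` class, the `LEVEL` mapping, and the `rank_fixes` helper below are hypothetical names; the skill's real ordering may differ.

```python
from dataclasses import dataclass
from typing import List

# Ordinal scale for the qualitative labels used in the fix plan (an assumption).
LEVEL = {"low": 1, "medium": 2, "high": 3}


@dataclass(frozen=True)
class Fix:
    name: str
    impact: str  # "low" | "medium" | "high"
    effort: str  # "low" | "medium" | "high"


def rank_fixes(fixes: List[Fix]) -> List[Fix]:
    """Highest impact first; among equal impact, lowest effort first.

    sorted() is stable, so fixes with identical labels keep their input order.
    """
    return sorted(fixes, key=lambda f: (-LEVEL[f.impact], LEVEL[f.effort]))


# The four fixes from the example usage, deliberately out of order.
candidates = [
    Fix("Add bge-reranker-base", impact="medium", effort="low"),
    Fix("Add an 'I don't know' rule to the prompt", impact="high", effort="medium"),
    Fix("Swap the embedding model", impact="high", effort="medium"),
    Fix("Semantic chunking with overlap", impact="high", effort="low"),
]

plan = rank_fixes(candidates)
for position, fix in enumerate(plan, start=1):
    print(f"{position}. {fix.name} ({fix.impact} impact, {fix.effort} effort)")
```

A weighted score (for instance impact divided by effort) would rank the low-effort reranker higher; which rule suits you depends on whether quick wins or the largest gains matter more.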

How to Install

mkdir -p ~/.claude/skills && curl -sL https://www.agensi.io/api/install/rag-failure-diagnostics -o /tmp/rag-failure-diagnostics.zip && unzip -o /tmp/rag-failure-diagnostics.zip -d ~/.claude/skills && rm /tmp/rag-failure-diagnostics.zip

Free skills install directly. Paid skills require purchase - use the download button above after buying.

Reviews

No reviews yet - be the first to share your experience.

Only users who have downloaded or purchased this skill can leave a review.

Security Scanned

Passed automated security review

Permissions

Terminal / Shell

Read Files

File Scopes

data/**

scripts/**

Creator

Kaymue
