
    RAG Failure Diagnostics

    by Kaymue

    Diagnose broken RAG systems. Covers 8 failure categories, including chunking, embeddings, retrieval, reranking, and hallucination, with recall@k measurement.

    Updated Jun 2026
    0 installs

    Free

    Included in download

    • Downloadable skill package
    • 2 permissions declared
    • Instant install

    About This Skill

# RAG Failure Diagnostics

Your RAG system gives bad answers. Users complain. You don't know if it's the chunking, the embeddings, the retriever, the reranker, or the LLM. This skill is the diagnostic workflow that finds the broken layer and tells you how to fix it.

## What it does

A **layered diagnostic protocol** for RAG pipelines. The skill walks your agent through 8 failure categories, each with specific probes, expected signals, and fixes:

1. **Chunking**: wrong sizes, broken boundaries, lost context
2. **Embedding model**: wrong model for the domain, mis-normalization, off-the-shelf vs fine-tuned
3. **Vector store**: index type mismatch, wrong distance metric, no metadata filter
4. **Retriever**: top-k too low, no hybrid search, semantic-only
5. **Reranker**: missing, mis-tuned, or scoring wrong
6. **Prompt assembly**: context stuffing, instruction dilution, no citations
7. **LLM hallucination**: ignoring context, inventing facts, refusing to say "I don't know"
8. **End-to-end latency / cost**: N+1 queries, no caching, oversized context

For each category, the skill provides:

- **Probe questions** to ask the user or to answer by inspecting the system
- **Code-level checks** to run against the codebase
- **Expected signals** (what good looks like)
- **Concrete fix** with copy-paste code

## When to use it

- Your RAG answers are off-topic or hallucinating
- Users say "it doesn't know what's in our docs"
- Retrieval latency is too high for production
- You're paying too much for LLM calls because context is too long
- You can't tell which layer is the bottleneck
- You're about to ship a RAG feature and want a pre-flight check

## Why it's better than ad-hoc prompting

Most "debug my RAG" prompts give vague advice ("try smaller chunks"). This skill is different:

- **Systematic**: walks all 8 layers in order, with no skipping
- **Quantified**: every probe returns a number, not a feeling
- **Reproducible**: same inputs → same diagnosis
- **Prioritized**: ranks fixes by impact and effort
- **Code-aware**: reads your retriever, embedder, and prompt-assembly code

## Architecture

```
┌───────────────────────────────────────────────────────────┐
│  Agent (Claude/Cursor)                                    │
│  - Reads user complaint, asks clarifying questions        │
│  - Inspects codebase with Grep/Read                       │
│  - Runs diagnostic scripts                                │
│  - Synthesizes fix plan                                   │
└───────────────┬───────────────────────────────────────────┘
                │
                ▼
┌───────────────────────────────────────────────────────────┐
│  skills/rag-failure-diagnostics/                          │
│    scripts/                                               │
│    ├── probe_chunking.py       # Chunk size & boundary    │
│    ├── probe_embeddings.py     # Embedding quality probe  │
│    ├── probe_retrieval.py      # Recall@k calculation     │
│    ├── probe_latency.py        # Per-layer timing         │
│    └── probe_hallucination.py  # Faithfulness check       │
│    references/                                            │
│    ├── diagnosis-workflow.md                              │
│    ├── chunking-strategies.md                             │
│    ├── embedding-model-selection.md                       │
│    └── fix-templates.md                                   │
│    data/                                                  │
│    └── golden_test_set.json    # 50 Q&A pairs for testing │
└───────────────────────────────────────────────────────────┘
```

## Quick start

```bash
# 1. Install
pip install numpy scikit-learn sentence-transformers

# 2. Probe chunking
python scripts/probe_chunking.py --chunks-file ./chunks.jsonl

# 3. Probe embeddings
python scripts/probe_embeddings.py --docs ./corpus/ --queries ./queries.txt

# 4. Probe retrieval
python scripts/probe_retrieval.py --golden ./data/golden_test_set.json \
    --index ./faiss.index --k 5

# 5. Probe latency
python scripts/probe_latency.py --pipeline ./rag_pipeline.py --queries ./queries.txt

# 6. Probe hallucination
python scripts/probe_hallucination.py --answers ./answers.jsonl --context ./contexts.jsonl
```

## The 8 failure categories (summary)

| # | Layer | Symptom | Most likely cause |
|---|-------|---------|-------------------|
| 1 | Chunking | "It misses mid-paragraph context" | Chunks too small / no overlap |
| 2 | Embeddings | "It can't find docs I know are there" | Wrong model for domain / no normalization |
| 3 | Vector store | "Search returns weird neighbors" | Wrong distance metric (cosine vs L2) |
| 4 | Retriever | "Top-k results are off-topic" | k too low, no hybrid search, no metadata filter |
| 5 | Reranker | "First results are good, rest is junk" | Missing reranker, or it's not trained on your data |
| 6 | Prompt assembly | "Model ignores the context" | Stuffing 50k tokens, instructions lost |
| 7 | LLM hallucination | "It invents citations" | No "say I don't know" rule, temperature too high |
| 8 | Latency / cost | "p95 latency is 8 seconds" | No caching, oversized context, N+1 retrievals |

## Pricing

Single-purchase, lifetime access. $19.00.

Includes:

- 5 Python diagnostic scripts
- 4 reference docs (workflow, chunking, embeddings, fixes)
- 50 Q&A golden test set for evaluation
- 12 fix templates (copy-paste LangChain / LlamaIndex code)
- Future updates for the same major version

## Example usage

> "Our RAG system returns the wrong product specs 40% of the time. Latency is 6s. Find the broken layer."

The skill will:

1. Read your retriever + prompt code
2. Run the chunking probe (likely finds: chunks too small, no overlap)
3. Run the embedding probe (likely finds: a small general-purpose 384d model, `all-MiniLM-L6-v2`, on technical product docs)
4. Run the retrieval probe (likely finds: recall@5 = 0.42, target is 0.85)
5. Run the hallucination probe (likely finds: model invents spec values)
6. Output a ranked fix plan:
   - **Fix 1** (high impact, low effort): Switch to semantic chunking with 200-token chunks, 20-token overlap → recall@5 to 0.78
   - **Fix 2** (high impact, medium effort): Replace `all-MiniLM-L6-v2` with `bge-large-en-v1.5` → recall@5 to 0.86
   - **Fix 3** (medium impact, low effort): Add `bge-reranker-base` → precision@5 to 0.91
   - **Fix 4** (high impact, medium effort): Add "Say 'I don't know' if context doesn't contain the answer" to the prompt → hallucination from 40% to 5%

## What you should already have

This skill assumes:

- A working RAG pipeline (any framework: LangChain, LlamaIndex, custom)
- Access to a sample of user queries and expected answers (golden set)
- Permission to read the codebase

It does **not** require:

- A GPU (probes run on CPU with small models)
- A specific vector DB (works with FAISS, Pinecone, Qdrant, Weaviate, Chroma)
- Production data (works fine with samples)

## Compatibility

Works with any agent that supports the SKILL.md standard and can execute Python: Claude Code, OpenClaw, Codex CLI, Cursor, Gemini CLI, Cline, Windsurf, Aider. Tested on Linux, macOS, Windows.

## Tags

rag, llm, retrieval, embeddings, vector-db, langchain, llamaindex, debugging, ai-architecture
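
## Sketch: computing recall@k

The retrieval probe and the example fix plan above both report recall@k. The sketch below shows how that metric is commonly computed from ranked retrieval results and a golden set. It is not the code shipped in `probe_retrieval.py`: the function names, the document IDs, and the shape of `runs` are assumptions made for illustration, and the actual layout of `golden_test_set.json` is not documented here.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved IDs."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    if k <= 0:
        raise ValueError("k must be positive")
    # Only the first k ranked results count as retrieved.
    hits = relevant.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one pair per query."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    return sum(scores) / len(scores)


# Hypothetical results for two golden-set queries (IDs are made up).
runs = [
    (["d3", "d7", "d1", "d9", "d2"], {"d1", "d4"}),  # finds 1 of 2 relevant docs
    (["d5", "d6", "d8", "d0", "d4"], {"d5"}),        # finds 1 of 1 relevant docs
]

print(mean_recall_at_k(runs, k=5))  # 0.75
```

Averaging per-query recall keeps queries with many relevant documents from dominating the score. A low value here points at the chunking, embedding, or retriever layers rather than at the LLM, since the right context never reached the prompt.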

    Use Cases

    • A structured diagnostic workflow for broken RAG systems. Identifies root cause across chunking, embedding, retrieval, reranking, generation, and citation layers. Outputs a fix plan ranked by impact.


    Security Scanned

    Passed automated security review

    Permissions

    Terminal / Shell
    Read Files

    File Scopes

    data/**
    scripts/**

    Works with any agent that supports the universal SKILL.md standard
