
    diagnosing-rag-failure-modes

    RAG fails quietly. It retrieves documents, returns confident-looking answers, and misses the question entirely — because the question required connecting facts across documents, reasoning about sequence, or tracing causation. This skill gives you a five-question diagnostic checklist that classifies any failing query as either RAG-safe or structurally RAG-incompatible, then maps it to the specific failure pattern and the architectural fix that resolves it.

    by Loreto


    About This Skill

    Problems It Solves

    • Silent retrieval failure — RAG pipelines return plausible-sounding results on multi-hop and causal queries, making failures hard to detect. Teams iterate on embedding quality and chunking strategy for weeks before realizing the query type is the problem, not the implementation.

    • Wrong fix applied — Most RAG debugging focuses on embedding models, chunk size, and reranking. These are the right levers for factual lookup failures. They do nothing for relational and temporal failures, where the architecture itself is mismatched to the query.

    • Query type blindness — No standard vocabulary exists for distinguishing "what is X" from "how did X come to be" at the pipeline level. Without this distinction, every query gets routed to the same retrieval system regardless of structural fit.

    • Scale degradation — RAG degrades on large corpora not because the embeddings get worse, but because the signal-to-noise ratio collapses. Teams add reranking layers and see marginal improvement, missing that tiered retrieval is the actual fix.

    What You Get

    • The two-class query taxonomy — A clear, actionable split between Class A (factual lookup, RAG-safe) and Class B (relational/temporal, RAG danger zone), with concrete examples of each so classification is fast and unambiguous.

    • Five-question diagnostic checklist — Run any failing query through five yes/no checks (multi-document join required? order matters? causation chain? time span? why, not just what?) to score it as Class A, borderline, or Class B in under two minutes.

    • Four named failure patterns — Multi-hop relational failure, temporal sequencing failure, organizational context failure, and scale failure — each with a symptom description, a worked example, and a specific architectural fix.

    • Failure Classification Report template — A structured output artifact (query, class, failure patterns, root cause paragraph, recommended fix, references) that communicates a diagnosis clearly to engineers, architects, and non-technical stakeholders.

    • Architectural fix references — Each failure pattern maps directly to a companion skill (designing-hybrid-context-layers, temporal-reasoning-sleuth, synthesizing-institutional-knowledge) so diagnosis connects immediately to remediation.
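    The taxonomy and checklist above can be sketched in a few lines of Python. The check names and the scoring thresholds (0 yes-answers = Class A, 1 = borderline, 2+ = Class B) are illustrative assumptions about how the skill scores queries, not its exact specification:

```python
from dataclasses import dataclass

# The five yes/no diagnostic checks. Names and thresholds are
# illustrative assumptions, not the skill's exact spec.
CHECKS = [
    "multi_document_join",  # must facts be joined across documents?
    "order_matters",        # does the answer depend on event ordering?
    "causation_chain",      # is a cause-and-effect chain required?
    "time_span",            # does the query cover a span of time?
    "why_not_what",         # is it a "why" question, not just a "what"?
]

@dataclass
class Diagnosis:
    query: str
    score: int
    klass: str

def classify(query: str, answers: dict[str, bool]) -> Diagnosis:
    """Score a failing query: 0 yes-answers -> Class A (RAG-safe),
    1 -> borderline, 2+ -> Class B (relational/temporal danger zone)."""
    score = sum(1 for check in CHECKS if answers.get(check, False))
    klass = "Class A" if score == 0 else "borderline" if score == 1 else "Class B"
    return Diagnosis(query, score, klass)

d = classify(
    "Why did we deprecate the v1 API?",
    {"multi_document_join": True, "causation_chain": True, "time_span": True},
)
print(d.klass, d.score)  # Class B 3
```

    Because each check is a plain yes/no, the whole diagnostic stays auditable: the score is just a count, and the class boundary is explicit rather than buried in a model.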

    Who Should Use This

    • Engineers and AI architects whose RAG pipeline is returning poor results and who need to determine whether the problem is implementation quality (fixable with tuning) or architectural mismatch (requires a different retrieval approach).

    • Teams building agents over organizational knowledge bases — ADRs, incident reports, policy documents, vendor contracts — where some queries will always be relational or temporal in nature.

    • Technical leads evaluating whether to add a knowledge graph, timeline index, or hybrid retrieval layer, and who need a principled basis for the recommendation rather than intuition.

    Use Cases

    • RAG pipeline debugging: An agent over internal documentation fails on "Why did we deprecate the v1 API?" — a query that requires linking the deprecation notice, the downstream services affected, and the architectural rationale from a decision record written two years earlier. The diagnostic checklist scores it as Class B (3 checks: multi-document join, causal chain, temporal span). Root cause: structural RAG mismatch. Fix: knowledge graph traversal.

    • Architecture investment justification: A team wants to add a knowledge graph but needs to demonstrate to engineering leadership why the current vector store cannot be tuned to handle the failing queries. The failure classification report provides a structured argument with root cause analysis and specific pattern attribution.

    • Onboarding agent quality review: A new onboarding assistant answers "What is our PTO policy?" correctly but fails on "Why is our engineering team structured the way it is?" The diagnostic separates these as Class A and Class B respectively — and identifies that the second query requires organizational context provenance that was never ingested, not better embeddings.

    • Vendor evaluation: A team is evaluating RAG vendors and receives demo results on their sample queries. Running the diagnostic checklist against the sample set reveals that all demo queries were Class A. Their actual production queries are 60% Class B. The vendor's system is being benchmarked on a task distribution it will never face in production.
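    The vendor-evaluation scenario reduces to a distribution check: score each sample query with the checklist, then compare the Class B share of the demo set against the production set. A minimal sketch, assuming the 2+ yes-answer threshold for Class B and made-up score data:

```python
def class_b_share(yes_counts: list[int]) -> float:
    """Fraction of queries scoring 2+ yes-answers on the five
    diagnostic checks (assumed Class B threshold)."""
    return sum(1 for n in yes_counts if n >= 2) / len(yes_counts)

demo_queries = [0, 0, 1, 0, 0]                       # vendor demo set: Class A / borderline only
production_queries = [3, 0, 2, 4, 1, 2, 3, 0, 1, 3]  # sampled real workload (hypothetical)

print(f"demo Class B share: {class_b_share(demo_queries):.0%}")        # 0%
print(f"production Class B share: {class_b_share(production_queries):.0%}")  # 60%
```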

    How to Install

    unzip diagnosing-rag-failure-modes.zip -d ~/.claude/skills/

    Free


    Security Scanned

    Passed automated security review (8/8 checks)

    Tags

    rag
    ai-architecture
    knowledge-graphs
    debugging
    llmops
    retrieval
    knowledge base agent failure
    multi-hop retrieval
    causal reasoning
    temporal reasoning
    enterprise AI
    AI diagnostics

    Best with Claude Code 1.2+. No external dependencies required. The diagnostic checklist and classification report are architecture-agnostic and apply regardless of embedding model, vector store, or retrieval framework in use. Designed as a first-step diagnostic that routes to designing-hybrid-context-layers (remediation architecture), temporal-reasoning-sleuth (temporal sequencing fixes), and synthesizing-institutional-knowledge (provenance ingestion).
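    In pipeline code, that routing can be as simple as a lookup from diagnosed failure pattern to companion skill. The pattern keys below paraphrase the four named patterns; the scale-failure entry is an assumption, since the text names tiered retrieval as its fix without naming a dedicated companion skill:

```python
# Map diagnosed failure pattern -> companion skill named in this listing.
# Keys are paraphrased pattern names; the "scale" routing is assumed.
REMEDIATION = {
    "multi_hop_relational": "designing-hybrid-context-layers",
    "temporal_sequencing": "temporal-reasoning-sleuth",
    "organizational_context": "synthesizing-institutional-knowledge",
    "scale": "designing-hybrid-context-layers",  # tiered retrieval; assumed routing
}

print(REMEDIATION["temporal_sequencing"])  # temporal-reasoning-sleuth
```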

    Creator

    Loreto


    Over 20 years of experience in data exploration and digital signal processing across sectors including fintech, aerospace, and defense. Expertise in risk analysis, engine health monitoring, and predictive maintenance for one of the world’s leading jet engine manufacturers, developing machine learning models and helping organizations achieve real impact from their analytics initiatives. Passionate about agentic workflows, the enterprise context layer, and information synthesis. Specializing in enterprise AI.
