Retrieval Augmented Generation: Common Failure Modes


Retrieval Augmented Generation has become the default architecture for LLM applications that need to reference specific knowledge bases. The basic concept is straightforward: retrieve relevant documents, include them in the context, and let the LLM generate responses grounded in that context.

In practice, RAG systems fail in ways that surprise teams who’ve only implemented basic examples. These failures are consistent enough across implementations that they’re worth understanding before you encounter them in production.

Retrieval Returns Wrong Documents

This is the fundamental failure mode. Your vector search returns documents that seem semantically similar but aren’t actually relevant to the user’s question. The LLM then generates responses based on incorrect context.

This happens more often than people expect because semantic similarity doesn’t equal relevance. A user asking “how do I reset my password” might get documents about “password security best practices” because the wording overlaps semantically, even though the question is about account recovery and the document is about password management.

The symptom is plausible-sounding responses that are factually wrong relative to your actual documentation. Users report that “the AI told me something that contradicts what’s in the docs.”

Fixes involve improving retrieval quality: better chunking, metadata filtering, hybrid search that combines semantic and keyword matching, or reranking retrieved results before passing them to the LLM. There’s no single solution; you need to measure retrieval accuracy for your specific content.
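One common way to layer keyword and semantic retrieval is reciprocal rank fusion. Here’s a minimal sketch, assuming each retriever hands back a ranked list of document IDs; the k=60 constant is the conventional default, not a tuned value:

```python
# Minimal reciprocal rank fusion (RRF) sketch for hybrid search.
# semantic_hits and keyword_hits are ranked lists of document IDs from a
# vector store and a keyword index respectively (assumed inputs).
from collections import defaultdict

def rrf_merge(semantic_hits: list[str], keyword_hits: list[str],
              k: int = 60) -> list[str]:
    """Merge two ranked lists: each document scores 1 / (k + rank) per list
    it appears in, so documents ranked highly by either retriever rise."""
    scores: defaultdict[str, float] = defaultdict(float)
    for hits in (semantic_hits, keyword_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "doc_reset" wins because both retrievers rank it near the top.
print(rrf_merge(["doc_security", "doc_reset", "doc_2fa"],
                ["doc_reset", "doc_recovery"])[:3])
```

A cross-encoder reranker can then reorder the merged list before it reaches the LLM, which is where the reranking step mentioned above usually slots in.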

Retrieved Context Contradicts Itself

When you retrieve multiple document chunks, they might contain contradictory information. Maybe you have documentation from different versions of your product, or different authors wrote conflicting instructions.

The LLM receives contradictory context and either gets confused, picks one arbitrarily, or tries to hedge by saying “some sources suggest X while others say Y,” which is honest but unhelpful.

This is particularly problematic when versioning isn’t handled properly. Your knowledge base contains docs for versions 2.0, 2.1, and 3.0 of your product. A user asking about a feature gets retrieved chunks from different versions with different instructions.

Mitigation requires metadata-based filtering (only retrieve docs for the version the user is actually using) or document deduplication and conflict resolution before the retrieval step.
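A minimal sketch of the filtering approach, assuming each chunk carries a version field in its metadata; the Chunk shape and the scoring function are illustrative stand-ins, not any particular vector store’s API:

```python
# Version-scoped retrieval sketch: filter BEFORE ranking so chunks from
# other product versions can never contradict the answer.
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str
    version: str  # e.g. "2.0", "2.1", "3.0"

def naive_score(query: str, text: str) -> float:
    # Toy stand-in for real similarity scoring (vector search, BM25, ...):
    # fraction of query terms that appear in the chunk.
    terms = query.lower().split()
    return sum(t in text.lower() for t in terms) / len(terms)

def retrieve_for_version(chunks: list[Chunk], query: str,
                         user_version: str, top_k: int = 5) -> list[Chunk]:
    candidates = [c for c in chunks if c.version == user_version]
    candidates.sort(key=lambda c: naive_score(query, c.text), reverse=True)
    return candidates[:top_k]
```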

Relevant Information Split Across Chunks

Your documents get chunked for embedding. A user’s question requires information that spans multiple chunks. The retrieval system returns the most relevant chunk but misses the second chunk that contains critical context.

The LLM generates an incomplete or partially incorrect answer because it’s missing part of the picture. Users get answers that are technically correct for what was retrieved but incomplete for what they actually asked.

This happens with procedural documentation where steps are split across chunks, or with conceptual explanations where the setup is in one chunk and the payoff is in another.

Solutions include larger chunks (which creates other problems), overlapping chunks (more expensive but captures split content better), or hierarchical retrieval that fetches parent sections when child chunks match.
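A minimal overlapping-chunk sketch; the 200-word window and 50-word overlap are illustrative defaults you’d tune against your own retrieval metrics:

```python
def chunk_with_overlap(text: str, chunk_size: int = 200,
                       overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size words, each sharing overlap
    words with its predecessor, so content that straddles a boundary
    appears intact in at least one chunk."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # avoid a redundant tail chunk already covered above
    return chunks
```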

Retrieved Context Too Large for Window

You retrieve lots of relevant documents but collectively they exceed the model’s context window or, more realistically, consume so much of the context that there’s no room for actual response generation.

This forces you to either truncate retrieved results (losing potentially relevant information) or summarize them (adding cost and latency, plus potential information loss).

The failure mode is subtle—answers become more generic or miss details because the LLM couldn’t fit all relevant context and had to work with truncated information.

Fixes include better filtering to retrieve fewer but higher-quality documents, summarization of retrieved chunks before including them in context, or using models with larger context windows at higher cost.
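For the filtering route, one simple policy is to greedily pack the highest-scoring chunks into a fixed token budget, dropping whole chunks rather than truncating mid-sentence. A sketch, using word counts as a crude token estimate (a real system would use the model’s tokenizer):

```python
def pack_context(scored_chunks: list[tuple[str, float]],
                 budget_tokens: int) -> list[str]:
    """Keep the highest-scoring chunks that fit the budget; skip chunks
    that would overflow rather than truncating them mid-sentence."""
    packed, used = [], 0
    for text, _score in sorted(scored_chunks, key=lambda c: c[1], reverse=True):
        cost = len(text.split())  # crude token estimate; use a real tokenizer
        if used + cost <= budget_tokens:
            packed.append(text)
            used += cost
    return packed
```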

Retrieval Quality Degrades with Scale

Your RAG system works great with 100 documents. You scale to 10,000 documents and suddenly retrieval quality drops. More documents mean more potential matches, many of which are marginally relevant but not actually helpful.

The signal-to-noise ratio decreases as the knowledge base grows. Your retrieval returns some relevant documents but also irrelevant ones that weren’t in the corpus when you tested at small scale.

This requires more sophisticated retrieval strategies—metadata filtering to narrow the search space, hierarchical document organization, or multiple retrieval passes with different strategies.
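One cheap guard is a similarity-score floor on top of top-k retrieval: as the corpus grows, marginal matches start making the top k, and returning fewer (or zero) documents beats padding the context with noise. A sketch, with an illustrative cutoff you’d have to calibrate:

```python
def filter_by_score(hits: list[tuple[str, float]],
                    min_score: float = 0.75, top_k: int = 5) -> list[str]:
    """Keep at most top_k hits, and only those above the score floor.
    The 0.75 cutoff is illustrative; calibrate it on labeled queries."""
    return [doc_id for doc_id, score in hits if score >= min_score][:top_k]
```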

Hallucination Despite Retrieved Context

The LLM has the correct context right in front of it, yet still generates information not present in that context. This is particularly frustrating because you built RAG specifically to prevent hallucination.

This happens when the retrieved context is relevant but doesn’t completely answer the question. The LLM fills gaps with plausible-sounding but invented information.

Some prompting strategies help: explicitly instruct the model to only use provided context, ask it to cite specific documents for claims, or structure outputs to separate “information from provided context” from “inferences based on that context.”
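A minimal sketch of a grounding prompt along those lines; the wording is illustrative, and the moves that matter are the context-only instruction, the required citation format, and explicit permission to say the answer isn’t in the context:

```python
def build_grounded_prompt(question: str,
                          chunks: list[tuple[str, str]]) -> str:
    """chunks is a list of (doc_id, text) pairs from retrieval."""
    context = "\n\n".join(f"[{doc_id}]\n{text}" for doc_id, text in chunks)
    return (
        "Answer using ONLY the context below. Cite the [doc_id] of every "
        "document you rely on. If the context does not contain the answer, "
        "say so instead of guessing.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```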

But you can’t completely eliminate hallucination through prompting alone. Validation of outputs against source documents helps catch this, though it adds latency and cost.

User Questions Don’t Match Document Structure

Your documentation is structured for browsing by humans. User questions are phrased as natural language queries. The mismatch means semantically relevant documents don’t surface because the phrasing differs too much.

Documentation says “Creating a new user account” but users ask “how do I sign up.” Semantic embedding should handle this, but in practice, the similarity isn’t always high enough to reliably retrieve the right section.

This requires query expansion (rewrite the user’s question in multiple ways and retrieve for each), better document preprocessing (add common question phrasings to documentation embeddings), or fine-tuning embeddings on your specific domain.
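A minimal query-expansion sketch: retrieve once per rephrasing and merge with deduplication. The rephrase function here is a hypothetical stand-in; in practice you’d prompt an LLM for paraphrases:

```python
def rephrase(question: str) -> list[str]:
    # Hypothetical stand-in: a real version would ask an LLM for paraphrases,
    # e.g. "how do I sign up" -> "creating a new user account".
    return [question, question.replace("sign up", "create an account")]

def expanded_retrieve(question: str, search_fn, top_k: int = 5) -> list[str]:
    """Run search_fn(query, top_k) once per rephrasing; deduplicate while
    preserving the order documents were first seen."""
    seen: set[str] = set()
    merged: list[str] = []
    for variant in rephrase(question):
        for doc_id in search_fn(variant, top_k):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged[:top_k]
```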

Citation and Sourcing Problems

Users want to know where information came from. RAG systems retrieve documents but often don’t propagate citation information through to final responses. The LLM generates an answer but users can’t verify it against source docs.

Building proper citation requires tracking which retrieved chunks influenced which parts of the response. This isn’t trivial—the LLM might synthesize information from multiple chunks in ways that make attribution ambiguous.

Some implementations ask the LLM to cite sources explicitly, but this is unreliable. The model might cite documents it didn’t actually use or fail to cite ones it did use. Programmatic tracking of chunk-to-response mapping is more reliable but complex to implement.
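A middle ground is to verify the citations the model does emit against the chunks that were actually provided. A sketch; this catches citations of documents that were never retrieved, though it can’t prove the model used the ones it cites:

```python
import re

def check_citations(response: str,
                    provided_ids: set[str]) -> dict[str, list[str]]:
    """Extract [doc_id]-style citations and split them into ones that were
    actually in the context versus fabricated ones."""
    cited = re.findall(r"\[([\w\-]+)\]", response)
    return {
        "verified": [c for c in cited if c in provided_ids],
        "fabricated": [c for c in cited if c not in provided_ids],
    }

print(check_citations("Reset it in settings [doc_reset]; see also [doc_404].",
                      provided_ids={"doc_reset", "doc_2fa"}))
# {'verified': ['doc_reset'], 'fabricated': ['doc_404']}
```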

Performance and Latency Issues

RAG adds latency. Embedding the query, searching the vector database, and then running LLM inference takes longer than just running inference. For applications where response time matters, this can be problematic.

Users expect near-instant responses. If embedding the query takes 100ms, retrieval takes 200ms, and LLM inference takes 800ms, your total latency is over a second. That’s acceptable for some applications but not for others.

Optimization requires parallelizing steps where possible, caching query embeddings for repeated questions, maintaining hot caches of frequently retrieved documents, or accepting quality tradeoffs to use faster retrieval methods.
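Caching query embeddings is often the easiest win, since repeated and near-duplicate questions are common. A minimal sketch; embed is a stand-in for the real embedding call, and the normalization is a cheap way to raise hit rates:

```python
from functools import lru_cache

def embed(text: str) -> list[float]:
    # Stand-in for the real (slow, paid) embedding model call.
    return [float(len(text))]

@lru_cache(maxsize=10_000)
def cached_embedding(query: str) -> tuple[float, ...]:
    # Normalizing first means "How do I reset my password?" and
    # "how do i reset my password" share one cache entry.
    return tuple(embed(query.strip().lower().rstrip("?")))
```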

Measuring What Actually Matters

Many teams don’t measure retrieval quality independently from end-to-end quality. If users complain about wrong answers, you don’t know if the problem is retrieval failure or LLM failure.

Measure retrieval precision and recall separately. For a sample of queries, manually evaluate whether the retrieved documents were actually relevant. This isolates retrieval quality from generation quality.
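A minimal precision/recall-at-k sketch against such a hand-labeled sample, where relevant holds the doc IDs a human judged relevant for the query:

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str],
                          k: int = 5) -> tuple[float, float]:
    top = retrieved[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall_at_k(["doc_reset", "doc_security", "doc_2fa"],
                             {"doc_reset", "doc_recovery"})
print(f"precision@5={p:.2f} recall@5={r:.2f}")  # precision@5=0.33 recall@5=0.50
```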

Measure context utilization—is the LLM actually using the retrieved context or mostly relying on its base knowledge? If the latter, your retrieval might be poor or unnecessary.

Track user satisfaction and feedback, but also log failed interactions for analysis. The queries where users said “that didn’t answer my question” are your most valuable debugging data.

What Actually Works in Production

No single RAG implementation handles all these failure modes. Production systems layer multiple strategies:

  • Hybrid search (semantic + keyword) for better retrieval
  • Metadata filtering to narrow search space
  • Reranking of retrieved results before passing to LLM
  • Explicit instruction to use only provided context
  • Chunk overlap to handle split information
  • Query expansion for better matching
  • Response validation against source documents

The specific combination depends on your content, query patterns, and quality requirements. Start simple, measure failures, add complexity to address specific observed problems rather than preemptively building for every possible failure mode.

RAG isn’t a solved problem. It’s a framework with known challenges that require thoughtful engineering to address. Understanding the common failure modes means you can build systems that handle them proactively rather than discovering them in production when users complain.