RAG

Intro

Retrieval-Augmented Generation (RAG) combines retrieval and generation: retrieve evidence from your corpus, then generate an answer grounded in that evidence. It matters because knowledge changes faster than model weights, and RAG lets you update knowledge without retraining the model.
In practice, strong RAG systems are pipelines, not prompts. The main engineering work is query processing, retrieval quality, context assembly, evaluation, and production operations.
Example: in a support assistant, a user asks, "What changed in API v2 rate limits?" RAG retrieves release notes and policy docs first, then the model answers with citations to the exact source sections instead of guessing from stale parametric memory.

Core Flow

flowchart LR
    Q[User Query] --> T[Query Translation]
    T --> R[Retrieval and Fusion]
    R --> RR[Optional Reranking]
    RR --> C[Context Assembly]
    C --> G[LLM Generation]
    G --> V[Groundedness and Citation Checks]

RAG Patterns Ranked by Commonness

This ranking is a practical adoption heuristic, not market-share data. It orders patterns by how often they appear as default production guidance in current vendor docs, open-source frameworks, and enterprise RAG architectures. Start at the top and move down only when evaluation shows a specific failure that cheaper patterns do not fix.

1. Baseline Single-Pass RAG

Work principle

The system embeds the user query, retrieves the most similar chunks, places those chunks into the prompt, and asks the model to answer from that context. It is the simplest useful RAG loop: one query in, one retrieval pass, one generated answer out.
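
The loop above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, and `CHUNKS`, `retrieve`, and `build_prompt` are hypothetical names invented for the example.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    # One retrieval pass: rank every chunk by similarity to the query.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks, k=2):
    # Number the retrieved chunks so the answer can cite them.
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieve(query, chunks, k)))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

CHUNKS = [  # hypothetical corpus
    "API v2 rate limits are 100 requests per minute per key.",
    "Invoices are emailed on the first day of each month.",
    "API v1 is deprecated and will be removed next year.",
]
prompt = build_prompt("What are the API v2 rate limits?", CHUNKS, k=1)
```

The resulting `prompt` would be sent to the generator in a single call; everything before that call is plain retrieval code.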

flowchart LR
    Q[User query] --> E[Embed query]
    E --> R[Retrieve chunks]
    R --> C[Assemble context]
    C --> G[Generate answer]

Where it fits:

Main risk:

2. Hybrid Search plus Reranking

Work principle

Run lexical search and vector search together, merge their candidates, then rerank the merged set so the generator sees the best few passages. Lexical search catches exact terms; vector search catches semantic matches; reranking removes noise before context assembly.
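
A sketch of the fuse-then-rerank step, under stated assumptions: the two input rankings are taken as given (hypothetical outputs of a keyword index and a vector index), fusion uses reciprocal rank fusion as one common choice that the text above does not prescribe, and `rerank` is a toy term-overlap scorer standing in for a real reranking model.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    # Reciprocal rank fusion: each list contributes 1 / (k + rank) per document.
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

def rerank(query, doc_ids, docs, top_n=2):
    # Toy reranker: count query terms that appear verbatim in the passage.
    q_terms = set(query.lower().split())
    def overlap(doc_id):
        return len(q_terms & set(docs[doc_id].lower().split()))
    return sorted(doc_ids, key=overlap, reverse=True)[:top_n]

DOCS = {  # hypothetical passages
    "d1": "error code e42 means upstream timeout",
    "d2": "timeouts and retries overview",
    "d3": "e42 appears in logs",
    "d4": "billing faq",
}
lexical_hits = ["d1", "d3", "d2"]  # assumed keyword search output
vector_hits = ["d2", "d1", "d4"]   # assumed vector search output

fused = rrf_fuse([lexical_hits, vector_hits])
best = rerank("error code E42 timeout", fused, DOCS)
```

Fusion works on ranks rather than raw scores, so the lexical and vector scores never need to be on the same scale.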

flowchart LR
    Q[User query] --> L[Keyword search]
    Q --> V[Vector search]
    L --> CC[Candidate chunks]
    V --> CC
    CC --> F[Fuse candidates]
    F --> RR[Rerank evidence]
    RR --> C[Assemble context]
    C --> G[Generate answer]

Where it fits:

Main risk:

3. Query Rewriting and Routing

Work principle

Before retrieval, a small model or rules engine rewrites the user request into a better search query and routes it to the cheapest capable path. The rewrite makes implicit intent explicit; the router decides whether to use normal RAG, web search, SQL, multi-hop retrieval, or no retrieval.
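
A rules-engine version of this step might look like the sketch below. The glossary, the keyword patterns, and the route names are all illustrative assumptions; a small model could replace either function.

```python
import re

ABBREVIATIONS = {"sso": "single sign-on", "k8s": "kubernetes"}  # hypothetical glossary

def rewrite(query):
    # Make implicit intent explicit by expanding known abbreviations.
    words = re.findall(r"[a-z0-9\-]+", query.lower())
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def route(query):
    # Pick the cheapest capable path with simple keyword rules.
    q = query.lower()
    if re.search(r"\b(how many|count|average|total)\b", q):
        return "sql"
    if re.search(r"\b(today|latest|news)\b", q):
        return "web_search"
    if re.fullmatch(r"\s*(hi|hello|thanks|thank you)[!. ]*", q):
        return "no_retrieval"
    return "rag"

search_query = rewrite("How do I enable SSO?")
chosen = route("How many tickets were opened last week?")
```

Keeping the router as a pure function makes it easy to replay logged queries through it and measure how often each route is chosen.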

flowchart LR
    Q[User query] --> A[Analyze intent]
    A --> W[Rewrite query]
    A --> RT[Choose route]
    W --> R[Retrieve chunks]
    RT --> R
    R --> CC[Candidate chunks]
    CC --> C[Assemble context]
    C --> G[Generate answer]

Where it fits:

Main risk:

4. Parent-Document and Recursive Retrieval

Work principle

Index small chunks for precise matching, but return a larger parent section or document window for generation. Retrieval stays sharp, while the model receives enough surrounding context to interpret tables, definitions, and dependencies.
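
The small-chunk-to-parent mapping can be sketched as follows. Word-count chunking and term-overlap scoring are simplifying assumptions, and `PARENTS` is a hypothetical two-section corpus.

```python
def build_index(parents, chunk_words=6):
    # Index small chunks, each tagged with the id of its parent section.
    index = []
    for parent_id, text in parents.items():
        words = text.split()
        for i in range(0, len(words), chunk_words):
            index.append((" ".join(words[i:i + chunk_words]), parent_id))
    return index

def score(query, chunk):
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve_parents(query, index, parents, k=2):
    # Match on small chunks, then return each distinct parent once.
    top = sorted(index, key=lambda item: score(query, item[0]), reverse=True)[:k]
    seen, expanded = set(), []
    for _, parent_id in top:
        if parent_id not in seen:
            seen.add(parent_id)
            expanded.append(parents[parent_id])
    return expanded

PARENTS = {  # hypothetical parent sections
    "limits": "Rate limits apply per API key. The default quota is 100 requests per minute. Burst traffic is throttled.",
    "billing": "Invoices are issued monthly. Payment is due within 30 days of the invoice date.",
}
index = build_index(PARENTS)
context = retrieve_parents("default quota requests", index, PARENTS)
```

Deduplicating by parent id matters here: several matched chunks often share one parent, and repeating it would waste context window.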

flowchart LR
    D[Document] --> P[Parent sections]
    P --> S[Small chunks]
    S --> I[Chunk index]
    Q[User query] --> R[Retrieve small chunks]
    I --> R
    R --> M[Matched chunks]
    M --> X[Expand to parents]
    P --> X
    X --> C[Assemble context]
    C --> G[Generate answer]

Where it fits:

Main risk:

5. Multi-Query Fusion

Work principle

Generate several search variants for the same user question, retrieve for each variant, deduplicate results, then fuse the rankings. This raises recall when no single query wording captures all relevant evidence.
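
A minimal sketch of the variant, retrieve, and fuse steps. The synonym table stands in for an LLM that would normally write the variants, `search` is a toy term-overlap retriever, and reciprocal rank fusion is one assumed way to fuse the rankings; all names are hypothetical.

```python
from collections import defaultdict

SYNONYMS = {"cancel": ["terminate", "close"]}  # hypothetical; an LLM would usually write variants

def variants(query):
    out = [query]
    for word, alternatives in SYNONYMS.items():
        if word in query.split():
            out += [query.replace(word, alt) for alt in alternatives]
    return out

def search(query, docs, k=2):
    # Toy retriever: rank documents by the number of shared terms.
    terms = set(query.lower().split())
    scored = [(len(terms & set(text.lower().split())), doc_id) for doc_id, text in docs.items()]
    hits = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [doc_id for _, doc_id in hits[:k]]

def multi_query_search(query, docs, k=2, rrf_k=60):
    # Summing into one dict per doc id deduplicates and fuses in a single step.
    fused = defaultdict(float)
    for variant in variants(query):
        for rank, doc_id in enumerate(search(variant, docs, k), start=1):
            fused[doc_id] += 1.0 / (rrf_k + rank)
    return sorted(fused, key=lambda d: (-fused[d], d))

DOCS = {  # hypothetical corpus
    "a": "how to terminate your subscription",
    "b": "close account and delete data",
    "c": "cancel a pending order",
    "d": "shipping times",
}
single = search("cancel account", DOCS)
fused = multi_query_search("cancel account", DOCS)
```

In this toy corpus the single query never surfaces document `a`, while the fused result does, which is the recall gain the pattern aims for.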

flowchart LR
    Q[User query] --> M[Generate variants]
    M --> R1[Retrieve variant one]
    M --> R2[Retrieve variant two]
    M --> R3[Retrieve variant three]
    R1 --> CC[Candidate chunks]
    R2 --> CC
    R3 --> CC
    CC --> F[Fuse and dedupe]
    F --> C[Assemble context]
    C --> G[Generate answer]

Where it fits:

Main risk:

6. Contextual Retrieval

Work principle

Add a short document-aware explanation to each chunk before indexing it. The retriever no longer sees a bare fragment; it sees the fragment plus enough context to know what the fragment means inside the original document.
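
The effect of enrichment can be shown with a toy scorer. In this sketch the context sentence is a fixed template built from the document title, which is an assumption made for brevity; a production system might generate it with a model. `DOCS` and `QUERY` are hypothetical.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def enrich(doc_title, chunk):
    # Prepend document-level context before the chunk is indexed.
    return f"From the document '{doc_title}': {chunk}"

def overlap(query, text):
    return len(tokens(query) & tokens(text))

DOCS = [  # hypothetical (title, chunk) pairs
    ("API v2 rate limits", "The cap is 100 requests per minute."),
    ("Upload policy", "The cap is 50 MB per file."),
]
QUERY = "What is the v2 API cap?"

bare_scores = [overlap(QUERY, chunk) for _, chunk in DOCS]
enriched_scores = [overlap(QUERY, enrich(title, chunk)) for title, chunk in DOCS]
```

The bare chunks tie because neither mentions the API by name; once the title is attached, the first chunk scores higher for this query.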

flowchart LR
    D[Source document] --> CH[Raw chunk]
    D --> CT[Chunk context]
    CH --> EN[Enriched chunk]
    CT --> EN
    EN --> IDX[Index]
    Q[User query] --> R[Retrieve enriched chunks]
    IDX --> R
    R --> C[Assemble context]
    C --> G[Generate answer]

Where it fits:

Main risk:

7. Multimodal RAG

Work principle

Retrieve and pass evidence across text, tables, images, charts, and scanned pages. The system either converts non-text content into text-like representations or uses vision-capable embeddings and models so the answer can cite visual evidence.
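
The sketch below illustrates the first option, converting non-text content into text-like representations, together with a simple modality router. The routing keywords, the `Evidence` record, and the source names are illustrative assumptions; image captions are assumed to have been produced upstream.

```python
import re
from dataclasses import dataclass

@dataclass
class Evidence:
    source: str
    modality: str  # "text", "table", or "image"
    text: str      # text-like representation used for matching

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def table_to_text(headers, rows):
    # Serialize each row as "header=value" pairs so a text retriever can match it.
    return "; ".join(", ".join(f"{h}={v}" for h, v in zip(headers, row)) for row in rows)

def route_modalities(query):
    terms, wanted = tokens(query), {"text"}
    if terms & {"chart", "figure", "diagram", "screenshot"}:
        wanted.add("image")
    if terms & {"table", "row", "column", "compare"}:
        wanted.add("table")
    return wanted

def retrieve(query, store):
    wanted = route_modalities(query)
    return [e for e in store if e.modality in wanted and tokens(query) & tokens(e.text)]

STORE = [  # hypothetical evidence store
    Evidence("pricing.pdf#p2", "table", table_to_text(["plan", "price"], [["basic", "10"], ["pro", "30"]])),
    Evidence("faq.md", "text", "Refunds are processed within five days."),
    Evidence("arch.png", "image", "Diagram of the request pipeline"),
]
hits = retrieve("Compare the price of the pro plan", STORE)
```

Each hit keeps its `source`, so the generated answer can cite the table or page the evidence came from.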

flowchart LR
    Q[User query] --> RT[Modality router]
    RT --> T[Text retrieval]
    RT --> I[Image retrieval]
    RT --> B[Table retrieval]
    T --> E[Mixed evidence]
    I --> E
    B --> E
    E --> C[Assemble context]
    C --> G[Generate answer]

Where it fits:

Main risk:

8. HyDE

Work principle

The model writes a hypothetical answer first, embeds that synthetic answer, and searches with the answer embedding instead of the raw query. The fake answer acts like a semantic bridge when the user query is too short or uses different vocabulary than the corpus.
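
A toy comparison of raw-query search and search with the hypothetical answer. `draft_hypothetical_answer` is a stub returning canned text where a real system would call a model, and `embed` is a bag-of-words stand-in for an embedding model; the corpus and query are invented for the example.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy stand-in for an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

CANNED_DRAFTS = {  # hypothetical model output
    "Why is my app slow?": "Apps are slow when latency is high, often from cold starts or missing database indexes.",
}

def draft_hypothetical_answer(query):
    # Stub for the model call that writes the hypothetical answer.
    return CANNED_DRAFTS.get(query, query)

def nearest(vector, corpus):
    return max(corpus, key=lambda doc: cosine(vector, embed(doc)))

def hyde_search(query, corpus):
    # Search with the embedding of the draft, not of the raw query.
    return nearest(embed(draft_hypothetical_answer(query)), corpus)

CORPUS = [
    "High latency is usually caused by cold starts and missing database indexes.",
    "The mobile app is available on iOS and Android.",
]
raw_hit = nearest(embed("Why is my app slow?"), CORPUS)
hyde_hit = hyde_search("Why is my app slow?", CORPUS)
```

The draft may contain wrong facts and still help, because only its vocabulary is used for retrieval; the final answer is generated from the retrieved chunks and the original query.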

flowchart LR
    Q[User query] --> H[Draft hypothetical answer]
    H --> E[Embed draft]
    E --> R[Retrieve chunks]
    Q --> C[Assemble context]
    R --> C
    C --> G[Generate answer]

Where it fits:

Main risk:

9. Iterative Multi-Hop Retrieval

Work principle

The system retrieves evidence, reasons about what is missing, creates a follow-up query, and retrieves again. It repeats for a small number of hops until the evidence covers the question.
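
The hop loop can be sketched with a bounded `for` loop. The gap-reasoning step here is a deliberately naive assumption (chase capitalized names that have not been searched yet); in practice a model would write the follow-up query. The corpus and the names in it are hypothetical.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, corpus, exclude):
    # Return the best unseen document, or None if nothing shares a term.
    candidates = [d for d in corpus if d not in exclude]
    best = max(candidates, key=lambda d: len(tokens(query) & tokens(d)), default=None)
    if best is None or not tokens(query) & tokens(best):
        return None
    return best

def follow_up(evidence, seen_terms):
    # Toy gap reasoning: build a query from names not searched so far.
    new = [w for w in re.findall(r"\b[A-Z][a-z]+\b", evidence) if w.lower() not in seen_terms]
    return " ".join(new) if new else None

def multi_hop(question, corpus, max_hops=3):
    evidence, query, seen = [], question, tokens(question)
    for _ in range(max_hops):  # hard cap on hops
        hit = retrieve(query, corpus, evidence)
        if hit is None:
            break
        evidence.append(hit)
        query = follow_up(hit, seen)
        if query is None:
            break
        seen |= tokens(query)
    return evidence

CORPUS = [
    "The billing service is owned by the Payments team.",
    "The Payments team is managed by Dana.",
    "The search service is owned by the Discovery team.",
]
chain = multi_hop("Who manages the team that owns the billing service?", CORPUS)
```

The `max_hops` cap and the early exits keep latency and cost bounded even when the follow-up step keeps producing new queries.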

flowchart LR
    Q[User query] --> R1[Retrieve chunks]
    R1 --> EC[Evidence context]
    EC --> RE[Reason gaps]
    RE --> F[Follow up query]
    F --> R2[Retrieve more chunks]
    R2 --> EC
    EC --> C[Assemble context]
    C --> G[Generate answer]

Where it fits:

Main risk:

10. Agentic RAG

Work principle

An agent decides which retrieval or data tools to call, observes the result, and chooses the next action. Unlike a fixed pipeline, the path can change per query.
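
A minimal sketch of the decide, act, observe loop. `choose_action` is a rule-based stub for the model that would normally pick the next tool, and both tools return canned strings; every name and value is hypothetical.

```python
def search_docs(question):
    return "Docs: the enterprise plan includes 24/7 support."  # canned observation

def query_sql(question):
    return "SQL: 42 enterprise accounts are active."  # canned observation

TOOLS = {"docs": search_docs, "sql": query_sql}

def choose_action(question, scratchpad):
    # Stub policy: an LLM would normally choose based on the scratchpad.
    used = {step["tool"] for step in scratchpad}
    q = question.lower()
    if "how many" in q and "sql" not in used:
        return "sql"
    if "support" in q and "docs" not in used:
        return "docs"
    return "finish"

def run_agent(question, max_steps=4):
    scratchpad = []
    for _ in range(max_steps):  # step budget guards against loops
        tool = choose_action(question, scratchpad)
        if tool == "finish":
            break
        scratchpad.append({"tool": tool, "observation": TOOLS[tool](question)})
    return scratchpad

steps = run_agent("How many enterprise accounts are active, and what support do they get?")
```

The tool sequence differs per question: a support-only question would call just `docs`, which is what separates this from a fixed pipeline.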

flowchart LR
    Q[User query] --> A[Agent reasoning]
    A --> T[Choose tool]
    T --> O[Observe evidence]
    O --> S[Update scratchpad]
    S --> A
    S --> C[Assemble context]
    C --> G[Generate answer]

Where it fits:

Main risk:

11. GraphRAG

Work principle

Build a knowledge graph from documents, connect entities and relationships, summarize communities, then retrieve from graph neighborhoods or community summaries. The graph gives the retriever explicit relationship structure that flat chunks do not contain.
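
The graph-neighborhood half of this pattern can be sketched with a plain adjacency map and breadth-first search. Entity extraction is assumed to have already produced `TRIPLES` (hypothetical data), and community summarization is omitted to keep the example short.

```python
from collections import defaultdict, deque

TRIPLES = [  # hypothetical output of an entity and relation extractor
    ("billing-service", "owned_by", "payments-team"),
    ("payments-team", "managed_by", "dana"),
    ("search-service", "owned_by", "discovery-team"),
]

def build_graph(triples):
    # Store each fact on both endpoints so traversal works in either direction.
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        fact = f"{subj} {rel} {obj}"
        graph[subj].append((obj, fact))
        graph[obj].append((subj, fact))
    return graph

def neighborhood(graph, start, max_depth=2):
    # Breadth-first search that collects facts within max_depth hops of start.
    facts, seen, queue = [], {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for other, fact in graph[node]:
            if fact not in facts:
                facts.append(fact)
            if other not in seen:
                seen.add(other)
                queue.append((other, depth + 1))
    return facts

graph = build_graph(TRIPLES)
evidence = neighborhood(graph, "billing-service")
```

The returned facts link the service to its owning team and that team's manager, a relationship chain that no single flat chunk in this toy data states outright.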

flowchart LR
    D[Documents] --> ER[Extract entities]
    ER --> KG[Knowledge graph]
    KG --> CS[Community summaries]
    Q[User query] --> GS[Graph search]
    KG --> GS
    CS --> GS
    GS --> E[Graph evidence]
    E --> C[Assemble context]
    C --> G[Generate answer]

Where it fits:

Main risk:

12. Corrective and Self-Reflective RAG

Work principle

Add an evaluator or specially trained model that decides whether retrieved evidence is relevant and whether the generated answer is supported. If evidence looks weak, the system retries retrieval, falls back to web search, or rejects unsupported output.
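
A sketch of the evaluate, correct, and check loop. Term overlap stands in for a trained evaluator, the thresholds are arbitrary assumptions that would need calibration, and both retrievers are stubs returning canned passages.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def relevance(query, passage):
    # Toy evaluator: share of query terms found in the passage.
    q = tokens(query)
    return len(q & tokens(passage)) / len(q) if q else 0.0

def corrective_retrieve(query, primary, fallback, threshold=0.5):
    # Try the primary retriever, then the fallback; reject if both look weak.
    for name, retriever in (("primary", primary), ("fallback", fallback)):
        passage = retriever(query)
        if passage and relevance(query, passage) >= threshold:
            return name, passage
    return "rejected", None

def is_supported(answer, passage, threshold=0.8):
    # Toy support check: share of answer terms that appear in the evidence.
    a = tokens(answer)
    return bool(a) and len(a & tokens(passage)) / len(a) >= threshold

def primary_index(query):
    return "Invoices are emailed monthly."  # canned weak hit

def web_fallback(query):
    return "API v2 rate limits are 100 requests per minute."  # canned fallback hit

route, passage = corrective_retrieve("API v2 rate limits", primary_index, web_fallback)
grounded = is_supported("Rate limits are 100 requests per minute", passage)
```

Returning the route name alongside the passage makes it possible to log how often the fallback fires, which is a useful signal for retrieval quality.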

flowchart LR
    Q[User query] --> R[Retrieve]
    R --> E[Evaluate evidence]
    E --> P[Evidence passes]
    E --> W[Evidence weak]
    W --> X[Correct retrieval]
    X --> R
    P --> C[Assemble context]
    C --> G[Generate answer]
    G --> S[Check support]

Where it fits:

Main risk:

Pattern Selection Guide

| Pattern | Commonness | Best For | Runtime Cost | When to Skip |
| --- | --- | --- | --- | --- |
| Baseline Single-Pass RAG | Mainstream baseline | First version and simple factual lookup | Low | Retrieval metrics already show exact-term or precision failures |
| Hybrid Search plus Reranking | Mainstream production default | Enterprise text with exact terms and semantic matches | Medium | Tiny curated corpus where dense retrieval is already excellent |
| Query Rewriting and Routing | Common | Vague queries and mixed complexity traffic | Low to medium | Users already write precise search queries |
| Parent-Document and Recursive Retrieval | Common | Long documents and structure-sensitive answers | Medium | Short standalone snippets answer most questions |
| Multi-Query Fusion | Emerging | Compound or synonym-heavy questions | Medium | Simple single-intent lookup traffic |
| Contextual Retrieval | Emerging | Chunks that lose meaning outside the source document | Indexing cost high and runtime cost low | Fast-changing corpora where enrichment goes stale quickly |
| Multimodal RAG | Emerging | PDFs, tables, figures, scans, diagrams | Medium to high | Text-only corpus |
| HyDE | Niche | Vocabulary mismatch and sparse queries | Medium | Queries are already specific and direct retrieval works |
| Iterative Multi-Hop Retrieval | Rare to emerging | Multi-hop evidence chains | High | Single-hop answers dominate traffic |
| Agentic RAG | Rare to emerging | Multiple tools and dynamic investigation | High | One data source and one retrieval path are enough |
| GraphRAG | Rare and specialized | Entity relationships and global synthesis | High | Simple fact lookup or frequently changing data |
| Corrective and Self-Reflective RAG | Research and very rare | High-risk answers needing custom critique | High | You cannot train evaluators or calibrate thresholds |

Adoption order: ship baseline RAG first, then add hybrid search and reranking. Add query rewriting, parent-document retrieval, or multi-query fusion when evals show recall gaps. Use contextual retrieval, multimodal RAG, iterative multi-hop retrieval, agentic RAG, or GraphRAG only for the specific failure modes they solve. Treat Self-RAG, CRAG, and Speculative RAG as research patterns unless your team can justify the training, evaluator, or specialist-model overhead.

Operational Baselines

RAG vs Fine-Tuning

RAG and fine-tuning optimize different parts of the system. RAG externalizes knowledge into retrievable sources, while fine-tuning changes model behavior in weights. Choosing correctly prevents expensive retraining for problems that retrieval can solve more safely.

Example: if product policy changes weekly, RAG can update by reindexing documents. Fine-tuning would require repeated retraining cycles and still provide weak source traceability.

| Axis | RAG | Fine-tuning |
| --- | --- | --- |
| Knowledge freshness | High | Low |
| Source traceability | High | Low |
| Behavioral consistency | Medium | High |
| Time to first value | Faster | Slower |
| Operational complexity | Retrieval and index ops | Training, eval, and release ops |

Decision rules:

  1. Start with RAG when facts change often or citation is required.
  2. Add fine-tuning when output style or policy behavior remains unstable after prompt and retrieval tuning.
  3. Keep mutable facts in retrieval; keep behavior patterns in fine-tuned weights.
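
The three rules above can be written down as a small decision helper. This is only one way to encode them; the parameter names and return labels are assumptions made for the sketch.

```python
def recommend(facts_change_often, citation_required, behavior_unstable_after_tuning):
    # Rule 1: RAG first when facts move or answers must cite sources.
    plan = []
    if facts_change_often or citation_required:
        plan.append("rag")
    # Rule 2: add fine-tuning only if behavior is still unstable after
    # prompt and retrieval tuning.
    if behavior_unstable_after_tuning:
        plan.append("fine-tuning")
    return plan

def where_to_keep(kind):
    # Rule 3: mutable facts live in retrieval, behavior lives in weights.
    return {"fact": "retrieval index", "behavior": "fine-tuned weights"}[kind]

weekly_policy_bot = recommend(facts_change_often=True, citation_required=True,
                              behavior_unstable_after_tuning=False)
```

An empty list from `recommend` means none of the rules calls for either technique yet.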

The combined pattern fine-tunes the model for behavior (format, tone, refusal policy) and uses RAG for current factual knowledge, which keeps updates fast while preserving behavioral control.

Questions

References


What's next