Caching

Intro

A RAG pipeline repeats expensive work on every query: embedding the question, searching the index, and generating an answer with an LLM. Caching removes that repetition by storing results at each stage so that repeated queries can skip the computation. The payoff is lower latency, lower cost, and reduced load on embedding models, vector databases, and LLMs.

The correct model is layered caching: a separate cache at each pipeline stage, each with its own key design, TTL policy, and invalidation trigger. A cache at a single layer is not enough. If you cache only responses, every response-cache miss still pays the full embedding and retrieval cost, and you lose the chance to serve retrieval results in under a second.
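
As a rough illustration of what "its own key design, TTL policy, and invalidation trigger" could look like, the sketch below declares one policy per layer. The key fields follow the diagrams in the Flow section. The `LayerPolicy` type, the TTL values, and the invalidation triggers are hypothetical placeholders chosen for illustration, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerPolicy:
    """Hypothetical per-layer cache policy (names and values are assumptions)."""
    name: str
    key_fields: tuple[str, ...]  # inputs hashed into the cache key
    ttl_seconds: int             # illustrative TTL, tune per workload
    invalidate_on: str           # event that should flush or re-version the layer


LAYERS = (
    LayerPolicy(
        name="embedding",
        key_fields=("query", "embedding_model_version"),
        ttl_seconds=7 * 24 * 3600,
        invalidate_on="embedding model change",
    ),
    LayerPolicy(
        name="retrieval",
        key_fields=("query", "filters", "tenant", "index_version"),
        ttl_seconds=3600,
        invalidate_on="index update",
    ),
    LayerPolicy(
        name="response",
        key_fields=("prompt", "context", "llm_version"),
        ttl_seconds=900,
        invalidate_on="LLM or prompt template change",
    ),
)


def policy_for(name: str) -> LayerPolicy:
    """Look up the policy for a layer by name."""
    for layer in LAYERS:
        if layer.name == name:
            return layer
    raise KeyError(name)
```

Keeping the policy declarative makes it easier to review each layer's key fields in one place, which matters because a missing field is a correctness bug rather than a tuning issue.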

The hard part specific to RAG is that cache correctness is a security problem, not just a freshness problem. If cache keys omit authorization context, a query from an authorized user can populate the cache with evidence that a second, unauthorized user later receives. Every cache layer must include permission-scoping fields in its key.
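
A minimal sketch of a permission-scoped key for the retrieval layer follows. The function name, the `groups` parameter, and the query normalization (lowercasing and whitespace collapsing) are assumptions made for illustration; the point is only that tenant and permission fields are hashed together with the query, so two users with different access never share an entry.

```python
import hashlib
import json


def retrieval_cache_key(query, filters, tenant_id, groups, index_version):
    """Build a cache key that includes authorization context (hypothetical helper)."""
    payload = {
        # Assumed normalization: lowercase and collapse whitespace.
        "q": " ".join(query.lower().split()),
        "filters": filters,
        "tenant": tenant_id,
        # Sort and de-duplicate so group order does not change the key.
        "groups": sorted(set(groups)),
        "index": index_version,
    }
    # Canonical JSON gives a stable byte string for identical inputs.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


same_a = retrieval_cache_key("Refund policy?", {"lang": "en"}, "acme", ["staff", "finance"], "v7")
same_b = retrieval_cache_key("refund  policy?", {"lang": "en"}, "acme", ["finance", "staff"], "v7")
narrower = retrieval_cache_key("Refund policy?", {"lang": "en"}, "acme", ["staff"], "v7")
other_tenant = retrieval_cache_key("Refund policy?", {"lang": "en"}, "globex", ["staff", "finance"], "v7")
```

Here `same_a` and `same_b` are equal, while `narrower` and `other_tenant` differ from both, so a user with fewer permissions or a different tenant triggers a fresh, correctly filtered search instead of reading another user's results.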

Flow

Cache Hit Diagram

sequenceDiagram
  participant App
  participant EC as Embedding Cache
  participant RC as Retrieval Cache
  participant LC as Response Cache

  App->>EC: hash query + model ver
  EC-->>App: stored vector

  App->>RC: hash query + filters + tenant + index ver
  RC-->>App: doc IDs + scores

  Note over App: assemble context from docs

  App->>LC: hash prompt + context + model ver
  LC-->>App: cached answer

Cache Miss Diagram

sequenceDiagram
  participant App
  participant EC as Embedding Cache
  participant EM as Embedding Model
  participant RC as Retrieval Cache
  participant VDB as Vector DB
  participant LC as Response Cache
  participant LLM

  App->>EC: hash query + model ver
  EC-->>App: miss
  App->>EM: embed query
  EM-->>App: vector
  App->>EC: store vector

  App->>RC: hash query + filters + tenant + index ver
  RC-->>App: miss
  App->>VDB: ANN search
  VDB-->>App: doc IDs + scores
  App->>RC: store results

  Note over App: assemble context from docs

  App->>LC: hash prompt + context + model ver
  LC-->>App: miss
  App->>LLM: generate
  LLM-->>App: answer
  App->>LC: store response
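
The hit and miss flows above can be sketched as one cache-aside helper applied at each layer. This is a simplified illustration under stated assumptions: the caches are in-memory dicts with no TTL or eviction, the `Pipeline` class and the stub backends are hypothetical, and the "context" is just the list of document IDs rather than fetched document text.

```python
import hashlib
import json


def make_key(*parts):
    """Hash an ordered list of JSON-serializable parts into a cache key."""
    blob = json.dumps(parts, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def get_or_compute(cache, key, compute):
    """Cache-aside: return the stored value, or compute, store, and return it."""
    if key in cache:
        return cache[key]
    value = compute()
    cache[key] = value
    return value


class Pipeline:
    """Hypothetical three-layer pipeline; backends are injected callables."""

    def __init__(self, embed, search, generate, embed_ver, index_ver, llm_ver):
        self.embed, self.search, self.generate = embed, search, generate
        self.embed_ver, self.index_ver, self.llm_ver = embed_ver, index_ver, llm_ver
        self.embedding_cache, self.retrieval_cache, self.response_cache = {}, {}, {}

    def answer(self, query, filters, tenant):
        # Layer 1: embedding cache keyed on query + embedding model version.
        vector = get_or_compute(
            self.embedding_cache,
            make_key(query, self.embed_ver),
            lambda: self.embed(query),
        )
        # Layer 2: retrieval cache keyed on query + filters + tenant + index version.
        hits = get_or_compute(
            self.retrieval_cache,
            make_key(query, filters, tenant, self.index_ver),
            lambda: self.search(vector, filters, tenant),
        )
        # Assemble context (simplified to document IDs in this sketch).
        context = [doc_id for doc_id, _score in hits]
        prompt = f"Answer using the context.\nQuestion: {query}"
        # Layer 3: response cache keyed on prompt + context + LLM version.
        return get_or_compute(
            self.response_cache,
            make_key(prompt, context, self.llm_ver),
            lambda: self.generate(prompt, context),
        )


# Stub backends that count calls, standing in for the real services.
calls = {"embed": 0, "search": 0, "generate": 0}


def fake_embed(query):
    calls["embed"] += 1
    return [float(len(query))]


def fake_search(vector, filters, tenant):
    calls["search"] += 1
    return [(f"{tenant}-doc-1", 0.91), (f"{tenant}-doc-2", 0.84)]


def fake_generate(prompt, context):
    calls["generate"] += 1
    return f"answer based on {len(context)} documents"


pipeline = Pipeline(fake_embed, fake_search, fake_generate, "emb-v1", "idx-v1", "llm-v1")
first = pipeline.answer("what is the refund window?", {"lang": "en"}, "acme")
second = pipeline.answer("what is the refund window?", {"lang": "en"}, "acme")
```

The first call misses at every layer and populates all three caches; the second call is served entirely from cache, so each stub backend is invoked exactly once.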

Embedding Cache

How it works:

Main risk:

Retrieval Cache

How it works:

Main risk:

LLM Response Cache

LLM response caching operates at two levels that solve different problems.

Provider-level prompt caching (KV cache reuse):

Application-level response caching (exact or semantic match):
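
To make the difference between exact and semantic matching concrete, here is a small sketch, not a production design. The `ResponseCache` class, the `scope` argument (standing in for the permission-scoping fields), the linear scan, and the 0.95 similarity threshold are all assumptions for illustration; a real semantic cache would typically use a vector index and a tuned threshold.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ResponseCache:
    """Hypothetical response cache with exact lookup and semantic fallback."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold  # assumed cutoff, must be tuned in practice
        self.exact = {}             # (scope, query text) -> answer
        self.entries = []           # (scope, query vector, answer)

    def put(self, scope, query, vector, answer):
        self.exact[(scope, query)] = answer
        self.entries.append((scope, vector, answer))

    def get(self, scope, query, vector):
        # Exact match: identical query text within the same permission scope.
        if (scope, query) in self.exact:
            return self.exact[(scope, query)]
        # Semantic match: nearest stored query in the same scope, if close enough.
        best_score, best_answer = 0.0, None
        for entry_scope, entry_vector, answer in self.entries:
            if entry_scope != scope:
                continue
            score = cosine(vector, entry_vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = ResponseCache()
cache.put("acme", "refund window?", [1.0, 0.0], "30 days")
exact_hit = cache.get("acme", "refund window?", [1.0, 0.0])
semantic_hit = cache.get("acme", "how long for refunds?", [0.99, 0.05])
unrelated = cache.get("acme", "office hours?", [0.0, 1.0])
other_scope = cache.get("globex", "refund window?", [1.0, 0.0])
```

In this sketch `exact_hit` and `semantic_hit` both return the stored answer, while `unrelated` and `other_scope` return `None`: the first because its vector is too far from any stored query, the second because the entry belongs to a different scope.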

Main risk:

Pitfalls

Questions

References


What's next