Monitoring

Intro

RAG monitoring is the continuous observation of a deployed RAG pipeline to detect quality regressions, performance degradation, and data staleness before users notice. Offline Evaluation validates a pipeline before deployment — it answers "is this version good enough to ship?" Monitoring validates it after — it answers "is it still working as expected right now?" The distinction matters because production traffic exposes failure modes that static eval sets cannot anticipate: new query patterns, corpus drift, model behavior changes after provider updates, and load-dependent latency spikes.

The mechanism: each request flows through multiple stages — query translation, embedding, retrieval, reranking, context assembly, generation — and each stage can degrade independently. Monitoring instruments each stage with metrics and traces, samples a fraction of responses for quality scoring via LLM-as-judge, and fires alerts when metrics breach thresholds relative to a rolling baseline. Without per-stage instrumentation, teams observe "answers got worse" but cannot tell whether retrieval stopped finding relevant documents, the reranker misordered them, or the generator hallucinated despite good context.

Example: aggregate faithfulness scores look stable at 0.91, but segmenting by tenant reveals that a financial services tenant dropped to 0.72 after a corpus update replaced their regulatory FAQ with a new document format that the chunking pipeline handles poorly. A global dashboard shows green. The tenant files a support ticket before the engineering team notices — because the alert fires on the global metric, not the segment.

flowchart TD
    P[Production traffic] --> I[Instrument per-stage telemetry]
    I --> D[Deterministic metrics on 100 pct of requests]
    I --> S[Sample 5 to 20 pct for LLM-as-judge scoring]
    D --> A[Alerting engine]
    S --> A
    A --> Seg{Segment-level breach}
    Seg -->|Yes| Diag[Diagnose with per-stage traces]
    Seg -->|No| A
    Diag --> Fix[Fix pipeline or corpus]
    Fix --> V[Re-evaluate offline]
    V --> P

Instrumentation

How you instrument determines what you can observe. OpenTelemetry's GenAI semantic conventions (v1.40+) provide a standard attribute schema for LLM operations — gen_ai.client.token.usage, gen_ai.client.operation.duration, gen_ai.server.time_to_first_token — with provider-specific extensions for OpenAI, Anthropic, AWS Bedrock, and Azure AI Inference. Building on this standard avoids lock-in to a single observability vendor.

Each pipeline stage — query translation, embedding, retrieval, reranking, context assembly, generation — should emit its own span within a parent trace. The gen_ai.operation.name attribute distinguishes stages: retrieval for the retriever span, embeddings for embedding generation, chat for the LLM call. This gives you per-stage latency breakdown, error attribution, and input/output size at each boundary.
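The per-stage span idea can be sketched without any tracing SDK. The example below is a minimal illustration under stated assumptions: `StageTrace` and its `stage` helper are hypothetical names, not part of OpenTelemetry, and the stage contents are placeholders. In a real deployment each `with` block would correspond to a span carrying gen_ai.operation.name.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StageTrace:
    """Hypothetical in-memory stand-in for a parent trace (not an OpenTelemetry API)."""
    trace_id: str
    spans: list = field(default_factory=list)

    @contextmanager
    def stage(self, name: str, operation: Optional[str] = None):
        # One record per pipeline stage; `operation` mirrors gen_ai.operation.name.
        record = {"stage": name, "operation": operation, "error": None}
        start = time.perf_counter()
        try:
            yield record  # the caller may attach doc IDs, token counts, sizes
        except Exception as exc:
            record["error"] = type(exc).__name__  # attribute the failure to this stage
            raise
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000.0
            self.spans.append(record)


trace = StageTrace(trace_id="req-001")
with trace.stage("retrieval", operation="retrieval") as span:
    span["doc_ids"] = ["doc-7", "doc-12"]  # placeholder values for illustration
with trace.stage("generation", operation="chat") as span:
    span["output_tokens"] = 42

for s in trace.spans:
    print(s["stage"], f'{s["duration_ms"]:.3f} ms', s["error"])
```

Because every record carries its own duration and error field, latency and failures can be attributed to a stage rather than to the request as a whole.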

For each request, capture: the raw and translated query, retrieved document IDs with relevance scores, token counts (input and output via gen_ai.usage.input_tokens / gen_ai.usage.output_tokens), and model metadata. Logging the full prompt and response for every request is expensive at scale. A common pattern is to log full traces for a configurable sample (5–20%) and log only structured metadata (latency, token count, document IDs, scores) for 100%.
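One way to implement the split between sampled full traces and always-on metadata is a deterministic hash of the trace ID, so the same request is always either in or out of the sample. This is a sketch under assumptions: `log_full_trace` and `build_log_record` are hypothetical helpers, and the 10% default is just one point in the 5–20% range mentioned above.

```python
import hashlib


def log_full_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of traces for full logging."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


def build_log_record(trace_id, metadata, prompt, response, sample_rate=0.10):
    # Structured metadata (latency, token counts, doc IDs, scores) is always kept.
    record = {"trace_id": trace_id, **metadata}
    # Full prompt and response are kept only for the sampled fraction.
    if log_full_trace(trace_id, sample_rate):
        record["prompt"] = prompt
        record["response"] = response
    return record


record = build_log_record(
    "req-001",
    {"latency_ms": 840, "input_tokens": 1200, "doc_ids": ["doc-7"]},
    prompt="(full prompt text)",
    response="(full response text)",
)
print(sorted(record))
```

Hashing the ID instead of calling a random number generator keeps the decision reproducible across services that see the same trace.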

Quality Metrics

Quality metrics split into two categories: deterministic metrics that require no model calls, and semantic metrics that require an LLM-as-judge.

Deterministic Metrics

Compute these on every request — they are free and instant.

Empty-result rate — fraction of queries where retrieval returns zero documents. Even a small empty-result rate (>1%) signals coverage gaps in the index. A new query cluster that hits zero results means the corpus does not cover that topic, or the query translation step is producing rewrites that embed into an unexpected region of the vector space.

Retrieval count distribution — number of documents retrieved per query. Sudden drops suggest index issues or filter misconfigurations. Sudden increases suggest that relevance thresholds were loosened or that query translation is producing overly broad rewrites.

Citation rate — fraction of responses that include citations when the prompt instructs the model to cite sources. A drop in citation rate signals the generator is ignoring the retrieved context — often an early indicator of prompt regression or model behavior change.

Abstention rate — fraction of queries where the system declines to answer. Track alongside abstention correctness: what fraction of abstentions were warranted (no relevant documents existed) vs. false abstentions (relevant documents were retrieved but the generator refused to answer).

Response length — median and p95 response token count. Abrupt length shifts can indicate prompt regression, model behavior change after a provider update, or context assembly bugs that produce truncated or bloated prompts.

Metric	What it answers	Alert when
Empty-result rate	Are there corpus coverage gaps?	Exceeds 2× historical segment average
Retrieval count distribution	Is the index returning expected volumes?	Sudden drop or spike outside normal range
Citation rate	Is the generator attending to retrieved context?	Drops from baseline — early prompt regression signal
Abstention rate	Is the system refusing correctly?	Spikes (over-refusal) or drops with low-evidence queries
Response length	Is context assembly behaving normally?	p95 shifts abruptly in either direction
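The deterministic metrics above can all be derived from structured request logs. The following is a minimal sketch, assuming a hypothetical log schema with `doc_ids`, `has_citation`, `abstained`, and `response_tokens` fields, and assuming every logged request was prompted to cite sources.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def deterministic_metrics(requests):
    n = len(requests)
    return {
        "empty_result_rate": sum(len(r["doc_ids"]) == 0 for r in requests) / n,
        "citation_rate": sum(r["has_citation"] for r in requests) / n,
        "abstention_rate": sum(r["abstained"] for r in requests) / n,
        "p95_response_tokens": percentile([r["response_tokens"] for r in requests], 95),
    }


# Placeholder log records for illustration only.
requests = [
    {"doc_ids": ["a", "b"], "has_citation": True, "abstained": False, "response_tokens": 120},
    {"doc_ids": [], "has_citation": False, "abstained": True, "response_tokens": 18},
    {"doc_ids": ["c"], "has_citation": True, "abstained": False, "response_tokens": 95},
    {"doc_ids": ["d", "e", "f"], "has_citation": False, "abstained": False, "response_tokens": 310},
]
print(deterministic_metrics(requests))
```

In practice these would be computed per segment and per time window, then compared against the historical values in the table.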

Retrieval Quality Metrics

Retrieval quality metrics require a labeled evaluation set — a set of queries with known relevant documents. Unlike deterministic metrics that run on every request, these are computed on a scheduled basis (nightly or on every deployment) against a golden query set. They complement real-time signals by answering "is retrieval finding the right documents?" rather than just "is it returning documents at all?"

Recall@k — of all relevant documents in the corpus, what fraction appears in the top-k results. Recall@5 = 0.8 means 80% of relevant documents land in the top 5. This is the primary retrieval health metric: if relevant evidence is missing from the context, no amount of generation quality can compensate. A drop after a corpus update typically means new documents are poorly embedded or the index structure needs rebuilding. Example: a support bot's Recall@5 drops from 0.87 to 0.71 after a bulk FAQ import because the imported schema causes the chunking pipeline to split documents across boundaries, breaking embedding coherence.

Precision@k — of the k documents retrieved, what fraction is relevant. Precision@5 = 0.6 means 3 of 5 retrieved chunks are on-topic. Low precision floods the context window with noise — the generator must sift relevant evidence from irrelevant material, which increases hallucination risk and wastes tokens. Example: expanding retrieval from top-5 to top-10 improves recall but drops Precision@10 to 0.35; adding a re-ranker restores precision to 0.7 while retaining the recall gain.

HitRate@k — the fraction of queries for which at least one relevant document appears in the top-k results. Binary per query (hit or miss), making it the simplest minimum-bar check. HitRate@5 = 0.92 means 8% of queries receive zero relevant context — a hard failure floor regardless of generation quality. Example: HitRate@5 is 0.94 globally but drops to 0.71 for a specific product category, exposing a corpus coverage gap rather than a ranking problem.

MRR (Mean Reciprocal Rank) — the average of 1/rank for the first relevant document across queries. If the first relevant result is at position 3, the reciprocal rank is 1/3. MRR = 0.75 corresponds to a harmonic-mean rank of about 1.33 (1/0.75) for the first relevant document. Rewards pushing the best result higher; particularly sensitive to re-ranker quality when the generator reads only the top-1 or top-2 chunks. Example: MRR drops from 0.82 to 0.64 after an embedding model upgrade that improves overall Recall@10 but consistently buries the single most relevant chunk at position 3–4.

MAP (Mean Average Precision) — the mean of Average Precision (AP) scores across queries. AP for a single query is the mean of Precision@k at each rank position where a relevant document appears. MAP = 1.0 requires all relevant documents at the top of the ranked list. More informative than MRR when multiple relevant documents per query are expected, because it penalizes both missing documents and ranking them late. Example: a legal assistant has MRR = 0.88 (finds one relevant case near the top) but MAP = 0.51 (misses most of the additional cases the lawyer needs) — improving MAP requires expanding recall, not just top-1 ranking.

nDCG@k (Normalized Discounted Cumulative Gain) — measures ranking quality with graded relevance. Documents at higher positions contribute more to the score, and more relevant documents contribute more than partially relevant ones. nDCG@5 = 0.83 means the actual ranking achieves 83% of the discounted gain of the ideal ordering. Unlike MAP (which treats relevance as binary), nDCG captures the difference between a partially-relevant and a highly-relevant document at the same position. Example: nDCG@10 = 0.79 but nDCG@3 = 0.61, meaning the model finds relevant documents but places them at positions 4–7; a re-ranker targeting top-3 precision brings nDCG@3 to 0.81 without lowering nDCG@10, which can only rise when relevant documents move up within the top 10.

Metric	What it answers	When to prefer
Recall@k	Did we find the relevant documents?	Primary metric — always track
Precision@k	How much noise is in the context?	Context window is tight or token cost matters
HitRate@k	Does any relevant doc appear?	Minimum-bar coverage check; fast to interpret
MRR	Is the best result ranked first?	Generator uses only top-1 or top-2 chunks
MAP	Are all relevant docs found and ranked high?	Multiple relevant documents expected per query
nDCG@k	Is the full ranking quality good?	Generator uses all k chunks with position-aware weighting
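The six metrics in the table reduce to a few lines each for a single query. The sketch below assumes binary relevance is given as a set of document IDs, graded relevance as a dict of gains, and that the ranked list contains no duplicate IDs; the function names and sample IDs are illustrative.

```python
import math


def recall_at_k(ranked, relevant, k):
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked, relevant, k):
    return len(set(ranked[:k]) & relevant) / k


def hit_at_k(ranked, relevant, k):
    return 1.0 if set(ranked[:k]) & relevant else 0.0


def reciprocal_rank(ranked, relevant):
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0  # no relevant document retrieved


def average_precision(ranked, relevant):
    hits, total = 0, 0.0
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position  # precision at this relevant position
    return total / len(relevant)  # missing relevant docs contribute zero


def ndcg_at_k(ranked, gains, k):
    def dcg(gain_list):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gain_list, start=1))

    actual = dcg([gains.get(doc_id, 0.0) for doc_id in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


ranked = ["d3", "d1", "d9", "d2", "d7"]    # retriever output, best first
relevant = {"d1", "d2", "d4"}              # binary labels from the golden set
gains = {"d1": 3.0, "d2": 1.0, "d4": 2.0}  # graded labels for nDCG

print(recall_at_k(ranked, relevant, 5), precision_at_k(ranked, relevant, 5))
print(reciprocal_rank(ranked, relevant), average_precision(ranked, relevant))
print(round(ndcg_at_k(ranked, gains, 5), 4))
```

HitRate@k, MRR, and MAP are the means of `hit_at_k`, `reciprocal_rank`, and `average_precision` across all queries in the golden set.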

Track all six on a nightly schedule against your golden query set. Gate deployments primarily on Recall@k and nDCG@k; use MRR and HitRate for fast diagnosis when those signals drop.

LLM-as-Judge Metrics

For semantic quality, run an LLM judge asynchronously on a sampled fraction of production traffic. Use binary pass/fail judgments rather than numeric scales — binary judgments reduce calibration noise and inter-judge variance, and correlate better with domain expert assessment than 1–5 scores. Use a smaller, cheaper model (GPT-4o-mini, Claude Haiku) as the production judge and reserve the expensive model for weekly calibration runs where you compare cheap-judge scores against expensive-judge scores on the same sample to track judge agreement drift.

Faithfulness (groundedness) — does every claim in the answer trace back to the retrieved context? The judge decomposes the response into atomic claims and checks each against the provided passages. Faithfulness = supported_claims / total_claims. This is the single most important online quality metric for RAG because it directly measures hallucination risk. For a cheaper alternative in high-volume systems, RAGAS offers FaithfulnesswithHHEM, which replaces the LLM in the claim-verification step with an open-source T5-based classifier; claim extraction still uses an LLM, so judge API cost is reduced rather than eliminated.

Answer relevancy — does the response actually address the user's question? RAGAS computes this by generating N synthetic questions from the response and measuring cosine similarity between those questions and the original query. A faithfully grounded answer can still score low on relevancy if retrieval returned off-topic documents and the generator faithfully summarized them. This metric is reference-free, making it practical for online monitoring where ground-truth answers are unavailable.

Context relevancy — were the retrieved documents relevant to the query? This catches retrieval regressions that have not yet propagated to answer quality because the generator compensated using parametric knowledge. When context relevancy drops but faithfulness holds, the system is at elevated hallucination risk — the retrieved context is no longer providing useful evidence, and the model is filling gaps from its training data. Once parametric knowledge runs out for a query type, faithfulness will follow context relevancy downward.

Answer correctness — does the answer actually solve the user's question? A response can be perfectly faithful (every claim is grounded) but still wrong if it misses the key constraint, answers a different question, or is incomplete. Requires a reference answer, making it an offline metric run against a golden set rather than sampled production traffic.

Citation validity — do the citations in the answer actually support the specific claims they are attached to? This is stricter than faithfulness: the answer may be grounded overall, but a specific citation may point to an irrelevant passage. The judge maps each cited passage to the claim it supposedly supports and checks the entailment. Useful when the system surfaces citations to end users and trust in those citations is a product requirement.

Response completeness — does the answer cover all aspects of the query? A query asking "compare A and B" expects coverage of both; a query listing three requirements expects all three addressed. Partial answers that are faithful and relevant still fail on completeness. Requires a reference answer or a rubric defining what "complete" means for the query type.

Noise Sensitivity — measures incorrect claims introduced when retrieved context contains irrelevant chunks. Catches a failure mode the other metrics miss: the model hallucinating claims that are consistent with noisy context rather than ground truth. Recall@k and faithfulness can both look healthy while noise sensitivity is high — retrieval found enough relevant chunks to pass recall, and the model grounded most claims, but noisy chunks triggered additional fabricated ones. Requires reference. Lower is better.

Context Entities Recall — compares named entities in the reference answer against entities present in the retrieved context. Useful for entity-heavy domains (legal, medical, financial) where missing a specific name, date, or identifier is a hard failure even when general topic recall is adequate. A system can score well on Recall@k while consistently missing the exact entity the user needs. Requires reference.

Metric	What it answers	Reference needed
Faithfulness	Are all claims grounded in retrieved context?	No
Answer relevancy	Does the response address the question?	No
Context relevancy	Were retrieved documents relevant to the query?	No
Answer correctness	Does the answer actually solve the question?	Yes
Citation validity	Does each citation support its attached claim?	No
Response completeness	Are all aspects of the query covered?	Yes
Noise Sensitivity	Does noisy context introduce fabricated claims?	Yes
Context Entities Recall	Are required named entities present in context?	Yes
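Two of the reference-free scores above are simple arithmetic once the judge and the embedding model have done their part. The sketch below assumes the per-claim verdicts and the embedding vectors are already available; it is not the RAGAS implementation, and the function names and toy vectors are illustrative.

```python
import math


def faithfulness(claim_verdicts):
    """supported_claims / total_claims for one response."""
    if not claim_verdicts:
        return None  # nothing to verify, for example an abstention
    return sum(claim_verdicts) / len(claim_verdicts)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevancy(query_vec, synthetic_question_vecs):
    """Mean cosine similarity between the query and questions generated from the answer."""
    sims = [cosine(query_vec, q) for q in synthetic_question_vecs]
    return sum(sims) / len(sims)


# Judge verdicts for four atomic claims: three supported, one not.
print(faithfulness([True, True, False, True]))

# Toy 2-d embeddings: one synthetic question matches the query, one is orthogonal.
print(answer_relevancy([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

A binary pass/fail verdict per response can then be derived by thresholding these values, with the threshold chosen during calibration.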

Performance and Cost Metrics

Per-stage latency — p50, p95, p99 for each pipeline stage separately. A p95 spike in reranking can stay hidden in end-to-end latency when the other stages are fast enough to keep the total within its SLO, so a per-stage breakdown is required to localize it.

End-to-end latency — total request duration from query receipt to response delivery. Set SLOs on p95 end-to-end, but always diagnose regressions with per-stage breakdown.

Token usage — input and output tokens per request via gen_ai.client.token.usage. Track daily cost aggregates and per-query cost. A sudden increase suggests prompt bloat or context window misuse.

Cache hit rate — per Caching layer. Drops after corpus updates are expected; sustained drops on stable corpora indicate a cache key design or invalidation problem.

Error rate — rate of failed requests (model API errors, timeouts, malformed responses), segmented by stage so failures are attributed to the correct component.

Metric	What it answers	Alert when
Per-stage latency	Which stage is the bottleneck?	p95 for any stage exceeds SLO budget
End-to-end latency	Is the overall SLO being met?	p95 exceeds SLO for 10+ minutes
Token usage	Is prompt assembly efficient?	Per-query cost increases >30% from baseline
Cache hit rate	Is caching working correctly?	Sustained drop on a stable corpus
Error rate	Are pipeline stages failing?	Exceeds historical baseline per stage
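Per-stage percentiles and the budget check from the table can be sketched as follows. The span schema (`stage`, `duration_ms`), the budget values, and the sample durations are assumptions for illustration; the percentile uses the nearest-rank method.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def stage_latency_summary(spans):
    by_stage = defaultdict(list)
    for span in spans:
        by_stage[span["stage"]].append(span["duration_ms"])
    return {
        stage: {p: percentile(durations, p) for p in (50, 95, 99)}
        for stage, durations in by_stage.items()
    }


def over_budget(summary, p95_budget_ms):
    # Stages without a configured budget never alert.
    return sorted(
        stage for stage, stats in summary.items()
        if stats[95] > p95_budget_ms.get(stage, math.inf)
    )


spans = (
    [{"stage": "retrieval", "duration_ms": d} for d in (20, 22, 25, 21)]
    + [{"stage": "reranking", "duration_ms": d} for d in (30, 32, 31, 400)]
)
summary = stage_latency_summary(spans)
print(summary)
print(over_budget(summary, {"retrieval": 50, "reranking": 100}))
```

The reranking median stays near 31 ms while its p95 is 400 ms, which is the kind of tail behavior a mean or an end-to-end figure can hide.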

Data Health Metrics

Index freshness lag — time between a document being updated in the source system and its new embedding being available in the index. Track as a distribution, not just an average — a median lag of 2 hours is fine, but a p99 lag of 3 days means some documents are silently stale.

Ingestion failure rate — fraction of documents that fail during the embedding/indexing pipeline. Silent ingestion failures create invisible coverage gaps that surface as empty retrieval results for specific query types.

Corpus size — total document and chunk count over time. Unexpected drops signal accidental deletions or pipeline failures.

Metric	What it answers	Alert when
Index freshness lag	Are documents being indexed promptly?	p99 lag exceeds acceptable staleness window
Ingestion failure rate	Are documents being lost silently?	Exceeds 1% of scheduled ingestions
Corpus size	Is the index growing or shrinking as expected?	Unexpected drop (deletion or pipeline failure)
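The difference between a healthy median and a stale tail is easy to see with a small calculation. This sketch assumes each document record carries hypothetical `updated_at` and `indexed_at` timestamps and that the acceptable staleness window is 24 hours; the dates are placeholders.

```python
import math
from datetime import datetime, timedelta


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def freshness_report(docs, max_staleness):
    lags = [d["indexed_at"] - d["updated_at"] for d in docs]
    report = {"median": percentile(lags, 50), "p99": percentile(lags, 99)}
    report["alert"] = report["p99"] > max_staleness  # alert on the tail, not the median
    return report


now = datetime(2025, 1, 10, 12, 0)  # arbitrary placeholder timestamp
docs = [{"updated_at": now - timedelta(hours=2), "indexed_at": now} for _ in range(98)]
docs += [{"updated_at": now - timedelta(days=3), "indexed_at": now} for _ in range(2)]

report = freshness_report(docs, max_staleness=timedelta(hours=24))
print(report["median"], report["p99"], report["alert"])
```

Here 98 of 100 documents are indexed within two hours, yet the p99 lag is three days, so the alert fires even though the median looks fine.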

Segmentation

Global aggregate metrics hide localized regressions. A pipeline change that improves average faithfulness by 2% can simultaneously degrade faithfulness by 20% for a specific tenant whose documents use a different format.

Segment every metric by at least:

Tenant or user group — multi-tenant systems must catch per-tenant regressions.
Query cluster — group similar queries by topic, intent, or embedding proximity and track metrics per cluster.
Document source type — different sources (PDFs, wikis, APIs, databases) have different chunking, formatting, and retrieval quality characteristics.
Language — if the system serves multiple languages, each has its own retrieval and generation quality profile.

Segmentation is not optional. Without it, you are monitoring the average, and the average lies.
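A small numeric sketch shows how a healthy global average can coexist with a failing segment. The tenant names, sample counts, and scores are illustrative assumptions chosen to mirror the financial-services example from the introduction.

```python
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def segment_means(samples, key, metric="faithfulness"):
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample[metric])
    return {segment: mean(scores) for segment, scores in groups.items()}


# 90 scored samples from one tenant, 10 from another (placeholder data).
samples = [{"tenant": "retail", "faithfulness": 0.93} for _ in range(90)]
samples += [{"tenant": "finserv", "faithfulness": 0.72} for _ in range(10)]

global_mean = mean([s["faithfulness"] for s in samples])
per_tenant = segment_means(samples, key="tenant")

print(round(global_mean, 3))
print({tenant: round(score, 3) for tenant, score in per_tenant.items()})
```

The global mean is about 0.909 while the smaller tenant sits at 0.72, so a threshold applied only to the global value would never fire.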

Alerting

Effective RAG alerting uses relative thresholds anchored to a rolling baseline, not absolute values. Absolute thresholds ("faithfulness must be above 0.9") are brittle — they break across corpus changes, model updates, and seasonal query shifts. Relative thresholds ("faithfulness must not drop more than 5% from the 7-day rolling baseline") adapt automatically because the baseline tracks the current system state.

Signal	Alert condition	Why
Faithfulness (sampled)	Drops >5% from 7-day rolling baseline for any segment	Catches hallucination regressions before user impact
Empty-result rate	Exceeds 2× the historical segment average	Signals index coverage gap or filter misconfiguration
p95 end-to-end latency	Exceeds SLO budget for 10+ minutes	Performance regression or upstream dependency issue
Ingestion failure rate	Exceeds 1% of scheduled ingestions	Silent data loss accumulating
Token cost per query	Increases >30% from baseline	Prompt bloat, context window misuse, or upstream retrieval change

Recompute baselines after any intentional pipeline change (model swap, prompt update, index rebuild). See the same baseline principle in Evaluation.
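A relative threshold against a rolling baseline can be sketched per segment as below. This assumes one daily score per segment and interprets "5%" as a relative drop from the baseline (reading it as percentage points is the other possible interpretation); `breached_segments` and the sample values are hypothetical.

```python
def breached_segments(history, today, max_drop=0.05, window=7):
    """Return segments whose score fell more than max_drop below the rolling baseline."""
    alerts = {}
    for segment, current in today.items():
        recent = history.get(segment, [])[-window:]  # last `window` daily scores
        if not recent:
            continue  # no baseline yet, for example a brand-new segment
        baseline = sum(recent) / len(recent)
        drop = (baseline - current) / baseline
        if drop > max_drop:
            alerts[segment] = round(drop, 3)
    return alerts


history = {
    "finserv": [0.90] * 7,  # placeholder daily faithfulness scores
    "retail": [0.93] * 7,
}
today = {"finserv": 0.72, "retail": 0.92}

print(breached_segments(history, today))
```

After an intentional pipeline change, clearing the affected segments' history restarts the baseline so the expected shift does not trigger alerts.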

Pitfalls

Monitoring Only Latency While Quality Degrades

A system meets latency SLOs consistently while serving increasingly ungrounded answers. This happens when a model API becomes faster but less accurate (cheaper model silently substituted by the provider), or when cache hit rates increase but cached responses are stale. Latency-only SLOs create a false sense of health.

Mitigation: always pair latency metrics with sampled quality metrics. A dashboard that says "latency is fine, faithfulness dropped 8% in the legal-docs segment" is more actionable than "all systems nominal."

Judge Drift Without Calibration

The LLM judge used for production scoring drifts over time — either because the judge model is updated by the provider, or because the distribution of inputs changes. Faithfulness scores shift gradually but nobody notices because the absolute numbers still look reasonable.

Mitigation: maintain a small calibration set (50–100 examples) with human-labeled ground truth. Run the judge against this set weekly. Track judge-human agreement rate. If agreement drops below 80%, recalibrate the judge prompt or switch to a different judge model. This is the monitoring-side counterpart to the LLM-as-judge bias problem described in LLM-as-a-Judge.
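The weekly calibration check amounts to comparing two lists of binary labels. A minimal sketch, assuming the judge and the human annotators both produce pass/fail verdicts for the same calibration examples; the label data here is synthetic.

```python
def agreement_rate(judge_labels, human_labels):
    if len(judge_labels) != len(human_labels):
        raise ValueError("judge and human labels must cover the same examples")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def needs_recalibration(judge_labels, human_labels, min_agreement=0.80):
    return agreement_rate(judge_labels, human_labels) < min_agreement


# Synthetic calibration set: 50 human labels, judge disagrees on the first 12.
human = [True] * 30 + [False] * 20
judge = [not label for label in human[:12]] + human[12:]

print(agreement_rate(judge, human))
print(needs_recalibration(judge, human))
```

Tracking this rate over successive weeks shows drift even when the production faithfulness numbers still look reasonable.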

Alerting on Global Aggregates Instead of Segments

The most common monitoring failure in multi-tenant RAG. Global faithfulness is 0.92. One tenant's faithfulness is 0.68. The alert never fires because the global metric is above threshold. The tenant discovers the problem before the engineering team does.

Mitigation: fire alerts at the segment level, not the global level. If segment-level alerting creates too many alerts, implement a tiered system — alert immediately on high-priority segments (large tenants, high-risk domains), batch low-priority segments into a daily digest.

Sampling Bias in Quality Scoring

If the sampling strategy for LLM-as-judge evaluation is uniform random, it under-represents rare but important query types (multi-hop questions, negation queries, edge-case domains). These rare queries are often the ones that fail most.

Mitigation: use stratified sampling. Allocate a fixed fraction of the judge budget to each query cluster, ensuring that small clusters still get scored. Alternatively, over-sample queries where deterministic signals suggest risk — low retrieval scores, unusually high token counts, or long response latency.
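Stratified sampling with an equal share of the judge budget per cluster can be sketched as follows. The cluster labels, the budget of 100, and the equal-split policy are assumptions; a production system might weight clusters by risk instead.

```python
import random
from collections import defaultdict


def stratified_sample(requests, budget, cluster_key="cluster", seed=0):
    rng = random.Random(seed)  # seeded for reproducible illustration
    by_cluster = defaultdict(list)
    for request in requests:
        by_cluster[request[cluster_key]].append(request)

    per_cluster = max(1, budget // len(by_cluster))  # equal share per cluster
    sample = []
    for cluster in sorted(by_cluster):
        items = by_cluster[cluster]
        sample.extend(rng.sample(items, min(per_cluster, len(items))))
    return sample


# 950 common queries and 50 rare multi-hop queries (placeholder traffic).
requests = [{"id": i, "cluster": "faq"} for i in range(950)]
requests += [{"id": 950 + i, "cluster": "multi_hop"} for i in range(50)]

sample = stratified_sample(requests, budget=100)
counts = defaultdict(int)
for request in sample:
    counts[request["cluster"]] += 1
print(dict(counts))
```

Uniform sampling at the same budget would score about 5 multi-hop queries on average; the stratified version scores 50.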

Tradeoffs

Approach	Coverage	Cost	Latency impact	Reliability
Deterministic metrics only	Low — catches format and count anomalies, not semantic quality	Lowest — no model calls	Zero — computed from existing data	Perfect — deterministic
Full LLM-as-judge on every request	Highest — every response scored	Highest — model API cost per request	High if synchronous, zero if async	Subject to judge drift and prompt sensitivity
Sampled LLM-as-judge (5–20%)	High — covers the distribution statistically	Moderate — proportional to sample rate	Zero if async	Requires careful sampling to avoid bias
Human review of flagged samples	Highest precision — catches judge errors	Highest in human time	Delayed — hours to days	Gold standard for calibration, low throughput
Embedding drift detection	Medium — catches retrieval distribution shifts	Low — statistical comparison	Zero — computed offline	Detects slow drift, not sudden failures

Decision rule: combine deterministic metrics on 100% of traffic (fast, free), sampled LLM-as-judge on 5–20% (quality coverage), and periodic human review for calibration. Use embedding drift detection as an early warning for retrieval degradation between judge scoring cycles.
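Embedding drift detection can take many forms; one simple option is to compare the centroid of recent query embeddings against a baseline window. The sketch below assumes that approach, uses toy 2-d vectors, and leaves the alert threshold to be chosen empirically. It is an illustration rather than a prescribed method.

```python
import math


def centroid(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return 1.0  # treat a zero vector as maximally distant
    return 1.0 - dot / norm


def drift_score(baseline_vectors, current_vectors):
    """Cosine distance between the centroids of two embedding windows."""
    return cosine_distance(centroid(baseline_vectors), centroid(current_vectors))


baseline = [[1.0, 0.0], [0.9, 0.1]]  # last week's query embeddings (toy data)
stable = [[0.95, 0.05], [1.0, 0.0]]  # similar traffic
shifted = [[0.0, 1.0], [0.1, 0.9]]   # a new query pattern

print(round(drift_score(baseline, stable), 4))
print(round(drift_score(baseline, shifted), 4))
```

A centroid comparison detects a shift in where queries land on average but not a change in spread, so it complements sampled judge scoring rather than replacing it.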

Questions

Why is sampled LLM-as-judge scoring preferred over scoring every response in production?

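Scoring every response adds at least one judge call per request, which can roughly double model cost when the judge is comparable to the generator, and adds latency if run synchronously.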
At production scale (thousands of queries/hour), full scoring is prohibitively expensive.
Sampled scoring (5–20%) provides statistical coverage of the quality distribution at a fraction of the cost.
Stratified sampling ensures rare but important query types (multi-hop, negation, edge-case domains) are represented.
Async execution decouples scoring from user-facing latency — users never wait for the judge.
Full scoring is reserved for offline evaluation runs against labeled eval sets.
Tradeoff: lower sample rates reduce cost but increase the risk of missing localized regressions in small query clusters. Start at 10–20% and reduce only with evidence that the distribution is stable.

Why should RAG alerting use relative regression thresholds instead of absolute quality targets?

Absolute thresholds ("faithfulness > 0.9") are brittle across corpus changes, model updates, and query distribution shifts.
A threshold calibrated at launch becomes meaningless after the corpus doubles or query mix evolves.
Relative thresholds ("no more than 5% drop from 7-day rolling baseline") adapt automatically because the baseline tracks current system state.
Relative thresholds prevent the failure mode where a team sets an ambitious absolute target, cannot reach it consistently, and silently disables the alert.
Baselines must be recomputed after intentional pipeline changes (model swap, prompt update, index rebuild) to avoid false alarms on expected shifts.
Tradeoff: relative thresholds can miss slow, gradual degradation that stays within the rolling window. Complement with periodic absolute floor checks (e.g., monthly review of whether the baseline itself is still acceptable).

How does monitoring differ from evaluation in a RAG system, and why do you need both?

Evaluation validates a pipeline configuration against a labeled dataset before deployment — it gates releases.
Monitoring validates the pipeline continuously against live traffic after deployment — it catches production regressions.
Eval sets are static snapshots; production traffic shifts continuously with new query patterns, corpus updates, and model provider changes.
Monitoring catches failure modes that no static eval set anticipates: seasonal query shifts, silent model downgrades, load-dependent degradation.
The feedback loop connects them: failing production traces identified by monitoring get added to the eval set, preventing recurrence in future releases.
Tradeoff: monitoring alone detects problems after users are affected; evaluation alone misses production-specific failures. Neither is sufficient without the other.
