Code Safari

Chapter 60·Intermediate·9 min read

Evaluating RAG: How to Measure a RAG System

How to evaluate a RAG system — separating retrieval metrics from generation metrics, measuring faithfulness and relevance, building a test set, and using an LLM as a judge. A plain-English guide to RAG evaluation.

June 30, 2026

We've built the full RAG pipeline — chunking, embeddings, vector search, retrieval, and re-ranking. One question remains, and it's the one that separates a demo from a product: is it any good, and how would you know? Without measurement, tuning RAG is guesswork — you change a setting, the answers "feel" different, and you have no idea if you helped or hurt. This chapter is about replacing that feeling with numbers.

Two failure modes, measured separately

The single most important idea in RAG evaluation: a wrong answer has two possible causes, and you must tell them apart.

| Failure | Where it happened | Example |
| --- | --- | --- |
| Retrieval failure | The search stage | The chunk with the answer was never fetched |
| Generation failure | The model | Good context was retrieved, but the model ignored or misused it |

If you only score the final answer, you can't tell which half to fix. So good evaluation grades retrieval and generation independently — then you know whether to work on chunking and search, or on the prompt and model.
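As a rough illustration of that split, here is a minimal Python sketch that attributes a failed test case to one stage. The function name `diagnose` and the chunk IDs are made up for this example, and it assumes you already know which chunks contain the answer and whether the final answer was judged correct. Treating any missing needed chunk as a retrieval failure is a simplification, not the only reasonable rule.

```python
def diagnose(retrieved_ids, relevant_ids, answer_is_correct):
    """Attribute one test case to a pipeline stage (hypothetical helper)."""
    if answer_is_correct:
        return "ok"
    # Assumption: if any needed chunk is missing from the retrieved set,
    # blame retrieval, because the model never saw the full evidence.
    if not set(relevant_ids) <= set(retrieved_ids):
        return "retrieval failure"
    # Everything needed was in the context, so the model misused it.
    return "generation failure"


if __name__ == "__main__":
    print(diagnose(["c1", "c2"], ["c3"], answer_is_correct=False))
    print(diagnose(["c1", "c3"], ["c3"], answer_is_correct=False))
```

Counting how many test cases land in each bucket gives a first hint about which half of the system deserves attention.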

Evaluating retrieval

The retrieval question is concrete: for each test query, did we fetch the chunks needed to answer it, and how high did they rank? Given a question and the chunks that genuinely contain its answer, you can measure:

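  • Were the right chunks retrieved at all? (Recall: of the chunks that contain the answer, what fraction appear in the top results we returned?)
  • How high did they rank? (A rank-based metric such as mean reciprocal rank rewards placing the first relevant chunk near the top, which is what a good reranker should do.)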

These metrics isolate the search half of the system entirely from the model. If retrieval scores poorly, no amount of prompt tuning will save you — the answer simply isn't in the context.
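Both questions can be turned into numbers with very little code. The sketch below computes recall at k and the reciprocal rank for a single query; the function names and chunk IDs are illustrative assumptions, not part of any particular library.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(relevant & top_k) / len(relevant)


def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


if __name__ == "__main__":
    retrieved = ["c7", "c2", "c9", "c4"]  # ranked search results (made up)
    relevant = ["c2", "c4"]               # chunks that contain the answer
    print(recall_at_k(retrieved, relevant, k=3))   # 0.5
    print(reciprocal_rank(retrieved, relevant))    # 0.5
```

Averaging these values over every query in a test set gives a retrieval score that does not depend on the model at all. The mean of the reciprocal ranks is the usual mean reciprocal rank (MRR).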

Evaluating generation: faithfulness and relevance

Once the right context is in the prompt, two different things can still go wrong with the answer, and they need two different metrics:

[Figure: the retrieved context and the generated answer are checked twice, "Faithful to context?" and "Relevant to question?". A good answer must be both grounded in the context and on-topic for the question.]

  • Faithfulness (grounding): is every claim in the answer actually supported by the retrieved text? This is RAG's anti-hallucination metric — it catches the model inventing details the sources don't contain.
  • Answer relevance: does the answer actually address the question? A response can be perfectly faithful to the context and still miss the point — quoting a true but irrelevant passage.

You need both. A faithful-but-irrelevant answer is useless; a relevant-but-unfaithful answer is dangerous. Tracking them separately tells you whether to tighten the prompt's grounding instructions or its focus.
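One common way to make faithfulness concrete is to break the answer into individual claims, mark each as supported or unsupported by the context, and report the supported fraction. The sketch below assumes those per-claim verdicts already exist (from a human or a judge model); the function names and the 0.8 threshold are arbitrary choices for illustration.

```python
def faithfulness_score(claim_supported):
    """Fraction of the answer's claims that the retrieved context supports."""
    if not claim_supported:
        raise ValueError("need at least one claim")
    return sum(claim_supported) / len(claim_supported)


def what_to_tighten(faithfulness, relevance, threshold=0.8):
    """Map low scores to the part of the prompt to revisit (assumed threshold)."""
    actions = []
    if faithfulness < threshold:
        actions.append("grounding instructions")
    if relevance < threshold:
        actions.append("focus on the question")
    return actions


if __name__ == "__main__":
    # Four claims in the answer, one of them not found in the context.
    score = faithfulness_score([True, True, False, True])
    print(score)                                  # 0.75
    print(what_to_tighten(score, relevance=0.9))  # ['grounding instructions']
```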

LLM-as-judge: scoring at scale

Grading faithfulness and relevance by hand is accurate but doesn't scale past a few dozen examples. The common solution is LLM-as-judge: give a strong model the question, the context, and the answer, plus a clear rubric, and have it score each dimension.
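A judge of this kind can be sketched in a few lines. Everything here is an assumption made for illustration: the rubric wording, the 1 to 5 scale, the JSON reply format, and `call_model`, a hypothetical function standing in for whatever model client you use (it takes a prompt string and returns the model's reply as a string).

```python
import json

# Hypothetical rubric. Doubled braces are literal braces for str.format.
RUBRIC = """You are grading a RAG answer. Score each dimension from 1 to 5.
faithfulness: 5 = every claim is supported by the context, 1 = most are not.
relevance: 5 = the answer directly addresses the question, 1 = it is off-topic.
Reply with JSON only, for example {{"faithfulness": 4, "relevance": 5}}.

Question: {question}
Context: {context}
Answer: {answer}"""


def build_judge_prompt(question, context, answer):
    return RUBRIC.format(question=question, context=context, answer=answer)


def parse_scores(reply):
    """Parse the judge's JSON reply and reject out-of-range scores."""
    scores = json.loads(reply)
    result = {}
    for name in ("faithfulness", "relevance"):
        value = int(scores[name])
        if not 1 <= value <= 5:
            raise ValueError(f"{name} out of range: {value}")
        result[name] = value
    return result


def judge_answer(question, context, answer, call_model):
    """call_model is a stand-in: prompt string in, reply string out."""
    prompt = build_judge_prompt(question, context, answer)
    return parse_scores(call_model(prompt))


if __name__ == "__main__":
    # A canned reply instead of a real model, so the sketch runs offline.
    fake_model = lambda prompt: '{"faithfulness": 5, "relevance": 2}'
    print(judge_answer("q", "some context", "some answer", fake_model))
```

Keeping the prompt builder and the parser separate from the model call makes it easy to swap judge models or test the parsing without spending tokens.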

It's fast and surprisingly effective — but treat the judge as a measurement instrument that itself needs checking:

  • Validate it against a sample of human judgments so you trust its scores.
  • Use clear rubrics, not a vague "is this good?" — specific criteria give consistent grades.
  • Watch for bias — judges can favor longer or more confident answers regardless of correctness.

Used carefully, an LLM judge turns a slow manual chore into an automated metric you can run on every change.
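Checking the judge against humans can start very simply. The sketch below assumes you have pass/fail labels from a person and from the judge for the same sample of answers, and reports how often they agree and where they differ. The labels are invented for the example, and plain agreement is only a first check; a chance-corrected statistic such as Cohen's kappa is a reasonable next step.

```python
def agreement_rate(human_labels, judge_labels):
    """Share of examples where the judge gave the same label as the human."""
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def disagreements(human_labels, judge_labels):
    """Indices worth reading by hand to see why the judge differed."""
    return [i for i, (h, j) in enumerate(zip(human_labels, judge_labels)) if h != j]


if __name__ == "__main__":
    human = [True, True, False, True, False]   # made-up sample labels
    judge = [True, False, False, True, False]
    print(agreement_rate(human, judge))  # 0.8
    print(disagreements(human, judge))   # [1]
```

Reading the disagreeing examples is often where bias shows up, for instance a judge that passes long, confident answers a human rejected.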

Build a test set, then iterate

None of these metrics help without something to run them on. The foundation of RAG evaluation is a test set: a fixed collection of representative questions, each paired with the context that should be retrieved and a known-good answer.
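One possible shape for such a test set is a JSON Lines file with one case per line. The sketch below is an assumption about layout, not a standard: the field names, the `EvalCase` class, and the sample questions are all invented for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected_chunk_ids: tuple  # chunks that should be retrieved
    reference_answer: str      # known-good answer


def parse_cases(jsonl_text):
    """Parse one JSON object per non-empty line into EvalCase records."""
    cases = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        cases.append(
            EvalCase(
                question=row["question"],
                expected_chunk_ids=tuple(row["expected_chunk_ids"]),
                reference_answer=row["reference_answer"],
            )
        )
    return cases


# Made-up sample data, inlined so the sketch runs without a file.
SAMPLE = """\
{"question": "What is the refund window?", "expected_chunk_ids": ["policy-3"], "reference_answer": "30 days"}
{"question": "Who approves travel?", "expected_chunk_ids": ["hr-12", "hr-13"], "reference_answer": "Your manager"}
"""

if __name__ == "__main__":
    for case in parse_cases(SAMPLE):
        print(case.question, case.expected_chunk_ids)
```

Keeping the set in a plain file under version control means every run is scored against exactly the same questions.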

With that in hand, evaluation becomes a loop:

[Figure: Change something → Run the test set → Compare the scores → Keep it or revert. Evaluation turns tuning from guesswork into a measurable loop.]

Now "I tweaked the chunk size" produces a number, not a vibe. You can prove a change helped, catch regressions before users do, and see which stage each change moved. Start small (even a few dozen well-chosen questions beat flying blind) and grow the set as you discover the queries your system gets wrong.
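The "compare the scores" step can be as small as a dictionary diff. The sketch below assumes each run produces a dict of averaged metrics; the metric names, the scores, the 0.01 tolerance, and the keep-only-if-nothing-regressed rule are all illustrative choices you would adapt.

```python
def compare_runs(baseline, candidate, tolerance=0.01):
    """Return per-metric deltas and a keep/revert decision.

    Assumed rule: keep the change only if at least one metric improved
    by more than the tolerance and none regressed by more than it.
    """
    deltas = {name: candidate[name] - baseline[name] for name in baseline}
    improved = [name for name, d in deltas.items() if d > tolerance]
    regressed = [name for name, d in deltas.items() if d < -tolerance]
    decision = "keep" if improved and not regressed else "revert"
    return deltas, decision


if __name__ == "__main__":
    # Made-up scores from two runs over the same test set.
    baseline = {"recall_at_5": 0.70, "faithfulness": 0.90, "relevance": 0.85}
    candidate = {"recall_at_5": 0.80, "faithfulness": 0.90, "relevance": 0.85}
    deltas, decision = compare_runs(baseline, candidate)
    for name, delta in deltas.items():
        print(f"{name}: {delta:+.2f}")
    print(decision)
```

Because retrieval and generation metrics are reported side by side, the deltas also show which stage a change actually moved.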

Recap

  • A RAG answer can fail at retrieval (wrong context) or generation (good context used badly) — measure them separately.
  • Retrieval metrics ask whether the needed chunks were fetched and how highly they ranked.
  • Faithfulness checks the answer is grounded in the context (anti-hallucination); answer relevance checks it addresses the question.
  • LLM-as-judge scores these at scale — powerful, but validate it against human judgment and use clear rubrics.
  • A fixed test set turns tuning into a measurable loop, so you can prove improvements and catch regressions.

That completes RAG Explained — you can now build a retrieval pipeline and measure it. RAG and agents are two halves of practical AI engineering; agents often use RAG as their long-term memory. Explore the rest from the guides hub.
