Chapter 23·Intermediate·11 min read
Prompt Evaluation: How to Know If a Prompt Is Actually Good
How do you evaluate an LLM prompt? A practical guide to prompt evaluation — building a test set, choosing metrics, using LLM-as-judge, A/B testing prompts, and catching regressions so prompt engineering becomes measurable, not guesswork.
June 29, 2026
This is the finale, and it's the chapter that separates prompt tinkering from prompt engineering. You've learned a whole toolkit — basic prompting, few-shot, chain-of-thought, structured output, templates, system prompts, and RAG. But how do you know any given prompt is actually good? Not by vibes. By evaluation.
Why "it looks good" isn't enough
The natural way to judge a prompt is to read an output or two and think "yeah, that seems fine." This fails for three reasons:
- Sampling variance. As inference showed, the same prompt gives different outputs each run. One good answer proves nothing.
- Cherry-picking. You tend to test the inputs you had in mind when writing the prompt — not the messy real ones.
- No baseline. Without measurement, you can't tell whether your "improved" prompt is actually better or just different.
Step 1: build a test set
Everything starts with a test set (or "eval set"): a fixed collection of inputs that represent what your prompt will really face, ideally paired with the expected or ideal output.
Good test sets:
- Cover the distribution — typical inputs and the awkward edge cases (empty, very long, ambiguous, adversarial).
- Are representative — drawn from real usage where possible, not idealised examples.
- Are stable — you reuse the same set across prompt versions so comparisons are fair.
Even 20–50 well-chosen examples is enormously more informative than eyeballing one or two. This is the single highest-leverage step.
Step 2: choose a metric that fits the task
How you score depends on what kind of output you have:
| Task type | How to measure |
|---|---|
| Classification (sentiment, category) | Exact match / accuracy against the label |
| Extraction (structured output) | Field-by-field correctness; valid-format rate |
| Short factual answers | Exact or near match to the expected answer |
| Open-ended writing | Human rating or LLM-as-judge against a rubric |
| RAG answers | Groundedness (is it supported by context?), correctness, citation accuracy |
Step 3: LLM-as-judge for subjective quality
For open-ended outputs — writing, explanations, tone — there's no exact answer to match, and human grading doesn't scale. The modern solution is LLM-as-judge: use a strong model to score outputs against a rubric you define.
"Rate the response 1–5 on helpfulness, accuracy, and tone. A 5 is accurate, directly answers the question, and is concise. Explain your score, then give the number."
It's fast, consistent, and cheap enough to run on a whole test set. But use it carefully:
- Give a clear rubric — vague criteria produce vague, unreliable scores.
- Spot-check against human judgment — make sure the judge agrees with you on a sample before trusting it at scale.
- Watch for known biases — judges can favour longer answers or their own style. Design the rubric to counter that.
Step 4: A/B test your prompt changes
Now the payoff. When you tweak a prompt — add an example, change the system prompt, adjust the format — run both versions on the same test set and compare scores.
This is how "I think this is better" becomes "this raised accuracy from 82% to 91%." Change one thing at a time so you know what caused the difference — the same discipline a scientist or a careful engineer applies.
Step 5: guard against regressions
Here's the trap that catches everyone: fixing one case often quietly breaks another. You add an instruction to handle empty inputs, and it subtly degrades the normal ones. You'd never notice from spot-checking — but a full test-set run catches it immediately.
So treat your test set like a regression suite: every prompt change re-runs the whole set, and you watch not just the average but which cases changed. A net gain that secretly tanks a critical category isn't a gain.
This connects back to prompt templates: because a template runs on everything, and a one-word change ships to all users, that change deserves the same evaluation rigor you'd give a code deploy.
Putting it all together
The professional prompt-engineering loop looks like this:
- Write a prompt using the techniques in this guide.
- Run it against your frozen test set.
- Score with a task-appropriate metric (exact match, LLM-as-judge, etc.).
- Change one thing and A/B it against the baseline.
- Keep the change only if it improves the score without regressions.
- Repeat — and keep the test set growing as you discover new failure modes.
Recap
- Judging prompts by "it looks good" fails — sampling variance, cherry-picking, and no baseline all mislead.
- Build a representative test set of inputs (with expected outputs) covering typical and edge cases — the highest-leverage step.
- Match the metric to the task: exact match, field correctness, groundedness, or human/LLM judgment.
- LLM-as-judge scales subjective scoring against a rubric — with a clear rubric and human spot-checks.
- A/B test every change on the same set, one variable at a time, and guard against regressions with a full re-run.
That completes Prompt Engineering from Beginner to Advanced — from your first clear prompt to a measurable, production-grade practice. Pair it with How Large Language Models Work to understand the machine you're now skilled at driving, and explore more in the full guide library.