Prompt Evaluation: How to Know If a Prompt Is Actually Good

How do you evaluate an LLM prompt? A practical guide to prompt evaluation — building a test set, choosing metrics, using LLM-as-judge, A/B testing prompts, and catching regressions so prompt engineering becomes measurable, not guesswork.

This is the finale, and it's the chapter that separates prompt tinkering from prompt engineering. You've learned a whole toolkit — basic prompting, few-shot, chain-of-thought, structured output, templates, system prompts, and RAG. But how do you know any given prompt is actually good? Not by vibes. By evaluation.

Why "it looks good" isn't enough

The natural way to judge a prompt is to read an output or two and think "yeah, that seems fine." This fails for three reasons:

Sampling variance. As the earlier discussion of inference showed, the same prompt can give different outputs on each run. One good answer proves nothing.
Cherry-picking. You tend to test the inputs you had in mind when writing the prompt — not the messy real ones.
No baseline. Without measurement, you can't tell whether your "improved" prompt is actually better or just different.

Step 1: build a test set

Everything starts with a test set (or "eval set"): a fixed collection of inputs that represent what your prompt will really face, ideally paired with the expected or ideal output.

Figure: Collect real inputs → Add expected outputs → Cover edge cases → Freeze it as your benchmark. A test set is the fixed yardstick you measure every prompt against.

Good test sets:

Cover the distribution — typical inputs and the awkward edge cases (empty, very long, ambiguous, adversarial).
Are representative — drawn from real usage where possible, not idealised examples.
Are stable — you reuse the same set across prompt versions so comparisons are fair.

Even 20–50 well-chosen examples are far more informative than eyeballing one or two outputs. Building the test set is the single highest-leverage step.
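One way to make a test set concrete is a small, frozen file of cases. The sketch below assumes a JSON Lines layout, and the field names (`id`, `input`, `expected`, `tags`) are hypothetical, chosen for illustration rather than taken from any particular tool.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One test-set entry. Field names are illustrative, not a standard."""
    id: str
    input: str
    expected: str
    tags: tuple = ()  # e.g. ("typical",) or ("edge", "empty")


def load_cases(jsonl_text: str) -> list[EvalCase]:
    """Parse one JSON object per non-blank line into EvalCase records."""
    cases = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        cases.append(
            EvalCase(row["id"], row["input"], row["expected"], tuple(row.get("tags", [])))
        )
    return cases


# Made-up sample data: one typical input and two awkward ones.
SAMPLE = """
{"id": "c1", "input": "Love it, works perfectly.", "expected": "positive", "tags": ["typical"]}
{"id": "c2", "input": "", "expected": "neutral", "tags": ["edge", "empty"]}
{"id": "c3", "input": "Great, it broke on day one.", "expected": "negative", "tags": ["edge", "ambiguous"]}
"""

cases = load_cases(SAMPLE)
edge_cases = [c for c in cases if "edge" in c.tags]
print(len(cases), "cases,", len(edge_cases), "tagged as edge cases")
```

Tagging each case makes it easy to check later that the set still covers both typical and awkward inputs, and keeping the file under version control is one simple way to keep it stable across prompt versions.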

Step 2: choose a metric that fits the task

How you score depends on what kind of output you have:

Task type	How to measure
Classification (sentiment, category)	Exact match / accuracy against the label
Extraction (structured output)	Field-by-field correctness; valid-format rate
Short factual answers	Exact or near match to the expected answer
Open-ended writing	Human rating or LLM-as-judge against a rubric
RAG answers	Groundedness (is it supported by context?), correctness, citation accuracy
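The first two rows of the table can be sketched as small scoring functions. This is a minimal illustration using only the standard library. The function names are hypothetical, and the normalisation choices (trimming and lowercasing labels, treating "valid format" as "parses as a JSON object") are assumptions you would adapt to your own task.

```python
import json


def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of predictions that match the label after trimming and lowercasing."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be the same length")
    if not labels:
        return 0.0
    hits = sum(p.strip().lower() == l.strip().lower() for p, l in zip(predictions, labels))
    return hits / len(labels)


def field_correctness(predicted: dict, expected: dict) -> float:
    """Fraction of expected fields whose predicted value matches exactly."""
    if not expected:
        return 0.0
    correct = sum(predicted.get(key) == value for key, value in expected.items())
    return correct / len(expected)


def valid_format_rate(raw_outputs: list[str]) -> float:
    """Fraction of raw outputs that parse as a JSON object."""
    if not raw_outputs:
        return 0.0
    valid = 0
    for raw in raw_outputs:
        try:
            if isinstance(json.loads(raw), dict):
                valid += 1
        except json.JSONDecodeError:
            pass  # not valid JSON at all
    return valid / len(raw_outputs)


# Made-up outputs, for illustration only.
acc = accuracy(["positive", "Negative ", "neutral"], ["positive", "negative", "positive"])
fields = field_correctness(
    {"name": "Ada", "year": 1815},
    {"name": "Ada", "year": 1815, "city": "London"},
)
fmt = valid_format_rate(['{"a": 1}', "not json", "[1, 2]"])
print(f"accuracy={acc:.2f} field_correctness={fields:.2f} valid_format_rate={fmt:.2f}")
```

Reporting the valid-format rate separately from field correctness helps distinguish "the model returned malformed output" from "the model returned well-formed but wrong values".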

Step 3: LLM-as-judge for subjective quality

For open-ended outputs — writing, explanations, tone — there's no exact answer to match, and human grading doesn't scale. The modern solution is LLM-as-judge: use a strong model to score outputs against a rubric you define.

"Rate the response 1–5 on helpfulness, accuracy, and tone. A 5 is accurate, directly answers the question, and is concise. Explain your score, then give the number."

Figure: LLM-as-judge scoring outputs against a rubric (illustrative). Prompt v1 scores 3.4/5; Prompt v2 scores 4.1/5.
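To run a rubric like the one quoted above over a whole test set, you need two pieces of plumbing: a function that builds the judge prompt and one that pulls the number out of the judge's reply. The sketch below shows both. The final "Score: N" line is an added convention assumed here to make parsing reliable, and the actual model call is left out because it depends on your provider.

```python
import re
from typing import Optional

RUBRIC = (
    "Rate the response 1-5 on helpfulness, accuracy, and tone. "
    "A 5 is accurate, directly answers the question, and is concise. "
    "Explain your score, then give the number."
)


def build_judge_prompt(question: str, response: str, rubric: str = RUBRIC) -> str:
    """Assemble the text sent to the judge model for one output."""
    return (
        f"{rubric}\n\n"
        f"Question:\n{question}\n\n"
        f"Response to grade:\n{response}\n\n"
        # Assumed convention, not part of the original rubric:
        "Finish with a line of the form 'Score: N'."
    )


def parse_score(judge_reply: str) -> Optional[int]:
    """Return the last 'Score: N' value (1 to 5) in the reply, or None if absent."""
    matches = re.findall(r"Score:\s*([1-5])\b", judge_reply)
    return int(matches[-1]) if matches else None


prompt = build_judge_prompt("What is the capital of France?", "Paris.")
score = parse_score("Accurate and concise, answers directly.\nScore: 5")
missing = parse_score("I would rather not give a number.")
print(score, missing)
```

Returning `None` instead of guessing lets you count unparseable judge replies as their own failure category rather than silently skewing the average.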

LLM-as-judge is fast, reasonably consistent, and cheap enough to run on a whole test set. But use it carefully:

Give a clear rubric — vague criteria produce vague, unreliable scores.
Spot-check against human judgment — make sure the judge agrees with you on a sample before trusting it at scale.
Watch for known biases — judges can favour longer answers or their own style. Design the rubric to counter that.
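The spot-check against human judgment can be made quantitative. The sketch below compares judge and human scores on the same sample and reports how often they agree and whether the judge leans high or low. The function name, the one-point tolerance, and the sample scores are all assumptions for illustration.

```python
def agreement(judge_scores: list[int], human_scores: list[int], tolerance: int = 1) -> dict:
    """Compare judge and human scores given to the same outputs."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(judge_scores)
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return {
        "exact": sum(d == 0 for d in diffs) / n,
        "within_tolerance": sum(abs(d) <= tolerance for d in diffs) / n,
        # Positive means the judge scores higher than the human on average.
        "mean_bias": sum(diffs) / n,
    }


# Made-up scores for five outputs graded by both the judge and a person.
report = agreement(judge_scores=[4, 5, 3, 4, 2], human_scores=[4, 4, 3, 2, 2])
print(report)
```

A consistently positive `mean_bias` on long answers, for example, would be a hint that the judge is rewarding length and the rubric needs tightening.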

Step 4: A/B test your prompt changes

Now the payoff. When you tweak a prompt — add an example, change the system prompt, adjust the format — run both versions on the same test set and compare scores.

Figure: Run v1 on test set → Change one thing to get v2 → Run v2 on same set → Compare scores. A/B testing prompts: change one thing, measure the difference.

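This is how "I think this is better" becomes "this raised accuracy from 82% to 91%." Change one thing at a time so you know what caused the difference, the same discipline a scientist or a careful engineer applies. Keep the size of the test set in mind when reading the numbers: on a 50-example set, each case is worth two percentage points, so a gap of one or two cases may be noise rather than a real improvement.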
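The mechanics of an A/B run fit in a few lines. In the sketch below, `fake_model` is a rigged stand-in for a real model call so the example runs offline; the templates, cases, and function names are hypothetical, and the resulting scores illustrate the workflow only, not real model behaviour.

```python
from typing import Callable

# Two versions that differ in exactly one thing: the output-format instruction.
V1 = "Classify the sentiment of this review.\nReview: {input}"
V2 = "Classify the sentiment of this review. Answer with one word.\nReview: {input}"

CASES = [
    ("Love it, works great.", "positive"),
    ("It broke on day one.", "negative"),
    ("Bad fit, returned it.", "negative"),
    ("Does the job.", "positive"),
]


def fake_model(prompt: str) -> str:
    """Stand-in for a real model: it only answers tersely when told to."""
    review = prompt.rsplit("Review: ", 1)[-1].lower()
    label = "negative" if ("broke" in review or "bad" in review) else "positive"
    return label if "one word" in prompt else f"The sentiment is {label}."


def evaluate(template: str, cases: list[tuple[str, str]], call_model: Callable[[str], str]) -> float:
    """Exact-match accuracy of one prompt template over (input, expected) pairs."""
    if not cases:
        raise ValueError("test set is empty")
    hits = 0
    for text, expected in cases:
        output = call_model(template.format(input=text))
        hits += output.strip().lower() == expected
    return hits / len(cases)


score_v1 = evaluate(V1, CASES, fake_model)
score_v2 = evaluate(V2, CASES, fake_model)
print(f"v1={score_v1:.2f} v2={score_v2:.2f} delta={score_v2 - score_v1:+.2f}")
```

Passing the model call in as a parameter keeps the comparison honest: both versions go through the same cases, the same scoring, and the same model settings, so the prompt is the only variable.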

Step 5: guard against regressions

Here's the trap that catches everyone: fixing one case often quietly breaks another. You add an instruction to handle empty inputs, and it subtly degrades the normal ones. You'd never notice from spot-checking — but a full test-set run catches it immediately.

So treat your test set like a regression suite: every prompt change re-runs the whole set, and you watch not just the average but which cases changed. A net gain that secretly tanks a critical category isn't a gain.
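Watching which cases changed, rather than just the average, can be sketched as a diff between two runs. The example below assumes each run is stored as a mapping from case id to pass/fail, with a separate mapping from case id to category; these structures and names are illustrative.

```python
from collections import defaultdict


def diff_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Case ids that flipped between two runs over the same test set."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same case ids")
    return {
        "fixed": sorted(c for c in baseline if not baseline[c] and candidate[c]),
        "broken": sorted(c for c in baseline if baseline[c] and not candidate[c]),
    }


def pass_rate_by_category(results: dict[str, bool], category: dict[str, str]) -> dict[str, float]:
    """Pass rate per category, so a drop in one group is not hidden by the average."""
    totals, passes = defaultdict(int), defaultdict(int)
    for case_id, passed in results.items():
        totals[category[case_id]] += 1
        passes[category[case_id]] += passed
    return {cat: passes[cat] / totals[cat] for cat in totals}


# Made-up results: v2 adds an instruction for empty inputs.
category = {"c1": "normal", "c2": "normal", "c3": "empty", "c4": "empty"}
v1 = {"c1": True, "c2": True, "c3": False, "c4": False}
v2 = {"c1": True, "c2": False, "c3": True, "c4": True}

changes = diff_runs(v1, v2)
by_cat_v2 = pass_rate_by_category(v2, category)
print(changes)
print(by_cat_v2)
```

In this made-up run the overall pass rate rises from 2 of 4 to 3 of 4, yet the diff shows that `c2`, a normal input, now fails. That is exactly the kind of quiet regression an average alone would hide.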

This connects back to prompt templates: a template runs on every request, so a one-word change ships to all users and deserves the same evaluation rigor you'd give a code deploy.

Putting it all together

The professional prompt-engineering loop looks like this:

Write a prompt using the techniques in this guide.
Run it against your frozen test set.
Score with a task-appropriate metric (exact match, LLM-as-judge, etc.).
Change one thing and A/B it against the baseline.
Keep the change only if it improves the score without regressions.
Repeat — and keep the test set growing as you discover new failure modes.
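The "keep the change only if it improves the score without regressions" step can be written down as an explicit rule. The sketch below is one possible rule, assuming per-case scores between 0 and 1 and a hand-picked set of critical case ids; the function name, the threshold, and the data are hypothetical.

```python
def mean(scores: dict[str, float]) -> float:
    return sum(scores.values()) / len(scores)


def keep_change(
    baseline: dict[str, float],
    candidate: dict[str, float],
    critical: frozenset = frozenset(),
    min_gain: float = 0.0,
) -> bool:
    """Keep the candidate only if the mean score rises and no critical case gets worse."""
    if not baseline or baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same non-empty set of case ids")
    improved = mean(candidate) - mean(baseline) > min_gain
    regressed = [c for c in critical if candidate[c] < baseline[c]]
    return improved and not regressed


# Made-up per-case scores on a four-case test set.
baseline = {"c1": 1.0, "c2": 0.0, "c3": 0.0, "c4": 1.0}
candidate_a = {"c1": 1.0, "c2": 1.0, "c3": 0.0, "c4": 1.0}  # fixes c2, breaks nothing
candidate_b = {"c1": 0.0, "c2": 1.0, "c3": 1.0, "c4": 1.0}  # same mean gain, but breaks c1

decision_a = keep_change(baseline, candidate_a, critical=frozenset({"c1"}))
decision_b = keep_change(baseline, candidate_b, critical=frozenset({"c1"}))
print(decision_a, decision_b)
```

Both candidates raise the mean from 0.50 to 0.75, but only the first is kept, because the second degrades a case marked as critical. Setting `min_gain` above zero is one way to avoid accepting changes whose gain is within the noise of a small test set.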

Recap

Judging prompts by "it looks good" fails — sampling variance, cherry-picking, and no baseline all mislead.
Build a representative test set of inputs (with expected outputs) covering typical and edge cases — the highest-leverage step.
Match the metric to the task: exact match, field correctness, groundedness, or human/LLM judgment.
LLM-as-judge scales subjective scoring against a rubric — with a clear rubric and human spot-checks.
A/B test every change on the same set, one variable at a time, and guard against regressions with a full re-run.

That completes Prompt Engineering from Beginner to Advanced — from your first clear prompt to a measurable, production-grade practice. Pair it with How Large Language Models Work to understand the machine you're now skilled at driving, and explore more in the full guide library.