Re-ranking in RAG: Putting the Best Context First

What is re-ranking in RAG? A plain-English guide to two-stage retrieval — using a fast retriever to gather candidates and a slower, sharper reranker to reorder them, why cross-encoders win, and the cost trade-off.

In the retrieval chapter we fetched the top-k chunks and noticed a quiet problem: the truly best passage often isn't ranked first. Embedding search is fast but coarse, and hybrid search hands you a merged list whose order is only as good as the rough scores that were fused to produce it. Re-ranking is the step that fixes this: a second, sharper pass that reorders candidates so the strongest context lands at the top, where the model will actually use it.

Why first-stage ranking is imperfect

Vector search optimizes for speed. To search millions of chunks in milliseconds it relies on approximate nearest-neighbor search and a single similarity score per chunk, and that score is a blunt instrument: it measures how close two vectors are, not whether the chunk answers the question. The result: relevant chunks are usually somewhere in your top 20, but the single most useful one might sit at rank 7, not rank 1.

You could pass more chunks to be safe, but that spends tokens and dilutes the prompt with marginal text. The better fix is to order the candidates well so a small, high-quality set rises to the top.

Two-stage retrieval: cast wide, then refine

The standard pattern is two-stage retrieval: a fast first stage for recall, a precise second stage for ranking.

Query → Fast retrieve top ~50 → Rerank the 50 → Keep top ~5 → Into the prompt

Figure: Retrieve many candidates fast, then rerank a shortlist precisely.

Retrieve (recall): use fast vector/hybrid search to gather a generous candidate set — say the top 50. The goal here is to not miss the right chunk, not to order it perfectly.
Rerank (precision): run a slower, sharper model over just those 50 to score relevance carefully, then keep the best handful.

You get the best of both: the speed of approximate search across the whole store, and careful scoring spent only on a small shortlist.
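
The two stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: `embed` is a toy bag-of-words stand-in for a real embedding model, `retrieve` does an exact scan where a real system would use an approximate index, and `score_pair` is a hypothetical callable standing in for whatever reranker you plug in. All function names here are invented for the example.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, chunk_vectors, k=50):
    """Stage 1 (recall): cheap similarity against precomputed chunk vectors."""
    q = embed(query)
    order = sorted(range(len(chunks)), key=lambda i: cosine(q, chunk_vectors[i]), reverse=True)
    return [chunks[i] for i in order[:k]]


def rerank(query, candidates, score_pair, top_n=5):
    """Stage 2 (precision): score every (query, chunk) pair, keep the best few."""
    return sorted(candidates, key=lambda c: score_pair(query, c), reverse=True)[:top_n]


def two_stage(query, chunks, chunk_vectors, score_pair, k=50, top_n=5):
    """Cast wide with retrieve(), then refine the shortlist with rerank()."""
    shortlist = retrieve(query, chunks, chunk_vectors, k=k)
    return rerank(query, shortlist, score_pair, top_n=top_n)
```

Note that the chunk vectors are built once, ahead of time, while `score_pair` is only ever called on the `k` shortlisted chunks. That split is the whole point of the design: the expensive scorer never touches the full store.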

Why cross-encoders rerank better

The reason a reranker is sharper comes down to how it compares. Recall that retrieval embeds the query and each chunk separately and compares the two vectors — fast, because chunk vectors are precomputed, but each side is summarized without knowing the other.

A cross-encoder reranker instead reads the query and a chunk together, as one input, and outputs a direct relevance score. Seeing both at once lets it catch nuances a separated comparison misses — whether the chunk actually answers this question, not just whether it's broadly on-topic.

	First-stage retriever	Cross-encoder reranker
Compares	Two separate vectors	Query + chunk together
Speed	Very fast	Slower
Precision	Coarse	Sharp
Runs over	The whole store	A small shortlist

The cost of reading every query-chunk pair together is exactly why you can't run a cross-encoder over your whole database — and exactly why the two-stage design exists.
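
To make the difference concrete, here is a toy sketch of the two scoring shapes. It is an illustration only: a real cross-encoder is a trained neural model, and the `cross_score` function below is a hypothetical stand-in that simply rewards query word pairs appearing in the same order in the chunk. What it does show faithfully is the interface: the separate-vector scorer sees two independent summaries, while the joint scorer sees both texts at once.

```python
import math
from collections import Counter


def encode(text):
    """Toy bi-encoder side: summarizes ONE text with no knowledge of the other."""
    return Counter(text.lower().split())


def bi_encoder_score(query_vec, chunk_vec):
    """Compare two independently built vectors (cosine similarity)."""
    dot = sum(query_vec[t] * chunk_vec[t] for t in query_vec)
    norm = math.sqrt(sum(v * v for v in query_vec.values())) * math.sqrt(
        sum(v * v for v in chunk_vec.values())
    )
    return dot / norm if norm else 0.0


def cross_score(query, chunk):
    """Toy joint scorer: reads query and chunk together.

    Returns the fraction of adjacent query word pairs that also appear,
    in the same order, in the chunk. A bag-of-words vector cannot see this.
    """
    q = query.lower().split()
    c = chunk.lower().split()
    q_pairs = set(zip(q, q[1:]))
    c_pairs = set(zip(c, c[1:]))
    return len(q_pairs & c_pairs) / max(len(q_pairs), 1)


query = "convert pdf to word"
chunk_a = "convert pdf to word using the export menu"
chunk_b = "convert word to pdf using the export menu"

q_vec = encode(query)
print(bi_encoder_score(q_vec, encode(chunk_a)), bi_encoder_score(q_vec, encode(chunk_b)))
print(cross_score(query, chunk_a), cross_score(query, chunk_b))
```

Both chunks contain exactly the same words, so the separate-vector comparison cannot tell them apart, even though only the first one answers the question. The joint scorer separates them because it compares the texts directly. Note also that `encode(chunk)` can be computed before any query arrives, whereas `cross_score` cannot run until the query is known.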

The cost trade-off

Reranking buys accuracy with latency and money. A reranker runs a heavier model once per query-chunk pair, so its cost grows in step with the number of candidates and is far higher per query than a vector lookup. That's acceptable because it runs only on the shortlist, but it's a real cost to weigh:

More candidates reranked = better odds of surfacing the best chunk, but slower and costlier.
Fewer candidates = cheaper and faster, but you might rerank a set that already missed the answer.

As with every stage in this guide, it's a dial, not a free win — tune the candidate count to your latency and budget.
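
One way to reason about the dial is a back-of-the-envelope latency estimate. The sketch below assumes the reranker processes query-chunk pairs in fixed-size batches and that each batch takes roughly the same time. The default batch size and per-batch latency are made-up placeholders, not measurements; substitute numbers from your own setup.

```python
import math


def rerank_latency_ms(num_candidates, batch_size=16, ms_per_batch=40.0):
    """Rough estimate: one model pass per batch of query-chunk pairs."""
    batches = math.ceil(num_candidates / batch_size)
    return batches * ms_per_batch


def max_candidates(latency_budget_ms, batch_size=16, ms_per_batch=40.0):
    """Largest candidate count that fits the budget under the same assumptions."""
    return int(latency_budget_ms // ms_per_batch) * batch_size


print(rerank_latency_ms(50))   # 4 batches of up to 16 pairs
print(max_candidates(200))     # how many candidates fit in a 200 ms budget
```

Under these placeholder numbers, reranking 50 candidates takes four batches, and a 200 ms budget leaves room for five. The useful habit is the calculation itself: measure your real per-batch time, then pick the candidate count your latency budget can afford.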

When reranking is worth it

Reranking isn't always necessary, so here's the tell. If you inspect your retrieved chunks and find that the answer is usually in there somewhere, but the model still gets it wrong, your bottleneck is ordering — and reranking is often the highest-return upgrade you can add. If the right chunk isn't being retrieved at all, reranking won't help; go back to chunking, embeddings, or hybrid search first. Fix recall before you fix ranking.
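
That check can be turned into a small diagnostic if you have a handful of test queries where you know which chunk holds the answer. The sketch below is one possible way to do it, assuming each result is a pair of (ranked chunk ids, id of the known-good chunk). The function names and both thresholds are illustrative choices, not established cutoffs.

```python
def hit_rate(results, k):
    """Fraction of queries whose known-good chunk appears in the top-k ids."""
    if not results:
        return 0.0
    hits = sum(1 for ranked_ids, gold_id in results if gold_id in ranked_ids[:k])
    return hits / len(results)


def diagnose(results, wide_k=50, prompt_k=5, min_recall=0.8, min_gap=0.15):
    """Return 'recall', 'ordering', or 'ok' for a set of evaluated queries."""
    wide = hit_rate(results, wide_k)      # is the answer retrieved at all?
    narrow = hit_rate(results, prompt_k)  # does it reach the prompt?
    if wide < min_recall:
        return "recall"    # fix chunking, embeddings, or hybrid search first
    if wide - narrow >= min_gap:
        return "ordering"  # answer is retrieved but ranked too low: try reranking
    return "ok"
```

A result of "ordering" means the answer usually makes it into the wide candidate set but often falls outside the few chunks that reach the prompt, which is the case reranking is meant to address.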

Recap

Re-ranking reorders retrieved candidates so the most relevant chunk rises to the top, where the model uses it.
First-stage search is fast but coarse — the best chunk often sits a few ranks down, unseen.
Two-stage retrieval casts a wide net fast, then carefully reranks a shortlist for precision.
Cross-encoder rerankers read the query and chunk together, judging relevance more sharply than separate-vector comparison.
Reranking trades latency and cost for accuracy — add it when the answer is retrieved but ranked too low.

We've built a strong retrieval pipeline. But how do we know it's good — and catch it when it regresses? The last piece is evaluation. Continue to Evaluating RAG.