Retrieval in RAG: Fetching the Right Context

How the retrieval step in RAG works — top-k search, hybrid keyword-plus-semantic retrieval, query rewriting, and how retrieved chunks become the model's prompt. A plain-English guide to fetching good context.

We've built the index: documents chunked, embedded, and stored in a vector database. Everything so far was preparation. Retrieval is the live step that runs on every single query: take the user's question, find the right chunks, and assemble them into the prompt the model answers from. It's where all the earlier groundwork pays off, or doesn't.

The retrieval step, end to end

At query time, retrieval is a short pipeline:

User question → Embed it → Top-k search (+ filters) → Assemble chunks → Into the prompt

Figure: Retrieval turns a raw question into grounded context for the model.

Embed the question with the same model used for the chunks, search the vector store for the top-k nearest matches (applying any metadata filters), and stitch those chunks into the context. Simple in outline — but each step has a way to go wrong, and small improvements here move the whole system.
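The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production setup: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory `index` list stands in for a vector database, and the chunk texts and `product` metadata field are invented for the example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, index, k=2, filters=None):
    """Return the k chunks most similar to the question, after metadata filtering."""
    q_vec = embed(question)  # must be the same embed() used to build the index
    candidates = [
        c for c in index
        if not filters or all(c["meta"].get(key) == val for key, val in filters.items())
    ]
    ranked = sorted(candidates, key=lambda c: cosine(q_vec, c["vector"]), reverse=True)
    return ranked[:k]

# Hypothetical chunks, invented for illustration.
docs = [
    {"text": "To cancel your plan, open Billing and choose Cancel.", "meta": {"product": "app"}},
    {"text": "The Pro plan includes priority support.", "meta": {"product": "app"}},
    {"text": "Cancel a hardware order within 14 days.", "meta": {"product": "store"}},
]
index = [{**d, "vector": embed(d["text"])} for d in docs]

hits = retrieve("How do I cancel my plan?", index, k=2, filters={"product": "app"})
context = "\n\n".join(h["text"] for h in hits)  # naive assembly: just join the chunks
```

The metadata filter runs before ranking, so the hardware chunk never competes even though it mentions "cancel". A real vector store would typically apply the filter and the nearest-neighbor search inside the database rather than in application code.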

Semantic search has a blind spot

Embeddings are great at meaning and surprisingly bad at exact tokens. Ask for error code E-4021 or part number XJ-9 and semantic search may miss it: identifiers like these carry little "meaning" for an embedding model to encode, so the nearest vectors often belong to chunks about other codes or parts rather than the exact one you asked for.

Query type	Semantic search	Keyword search
"ways to cancel my plan"	Strong	Weak
"error code E-4021"	Weak	Strong
"the Henderson contract"	Mixed	Strong

Meaning-based and exact-match search fail in opposite situations. That observation leads straight to the fix.

Hybrid search: meaning plus exactness

Hybrid search runs both a semantic search and a traditional keyword search, then merges the results. You get the recall of meaning-based matching and the precision of exact-term matching — covering both columns of that table.

Merging the two result lists fairly is its own small problem — the next chapter on re-ranking is largely about putting a combined pile of candidates into the right order.
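One common way to merge two ranked lists is reciprocal rank fusion (RRF), which scores each chunk by its rank positions rather than by raw scores that aren't comparable across the two searches. The sketch below is an illustration of that one technique, not necessarily the approach this guide uses later; the chunk IDs are hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of chunk IDs; a higher fused score ranks earlier."""
    scores = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            # Each list contributes 1 / (k + rank) for every chunk it returned.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic_hits = ["c7", "c2", "c9"]  # hypothetical IDs from vector search, best first
keyword_hits = ["c2", "c4", "c7"]   # hypothetical IDs from keyword search, best first

merged = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(merged)  # ['c2', 'c7', 'c4', 'c9']
```

Chunks that appear in both lists (`c2`, `c7`) rise above chunks found by only one method, which is the behavior hybrid search is after.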

Query rewriting: fix the question first

Users don't ask clean search queries. They ask "what about the second one?" or "is that covered?" — questions full of pronouns and missing context that retrieve terribly on their own.

Query rewriting uses a model to rewrite the question into a clear, standalone search query before retrieval:

Conversation so far: "Tell me about your Pro plan." → [answer]
Follow-up:           "Does it include support?"
Rewritten query:     "Does the Pro plan include customer support?"

That rewritten query embeds and retrieves far better than the raw follow-up, because it carries its own context. For multi-turn assistants especially, query rewriting is often the single biggest retrieval improvement — the search is only as good as the question you feed it.
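A rewriting step might look roughly like the sketch below. Everything named here is an assumption for illustration: the prompt wording, the `rewrite_query` helper, and the `llm` parameter, which stands for any function that takes a prompt string and returns the model's text. `fake_llm` returns a canned answer so the example runs without a real model.

```python
from typing import Callable, List, Tuple

REWRITE_TEMPLATE = (
    "Rewrite the follow-up question as a standalone search query.\n"
    "Resolve pronouns using the conversation. Return only the query.\n\n"
    "Conversation:\n{history}\n\nFollow-up: {question}\nStandalone query:"
)

def rewrite_query(history: List[Tuple[str, str]], question: str,
                  llm: Callable[[str], str]) -> str:
    """Ask a model for a standalone query; fall back to the raw question if it returns nothing."""
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    prompt = REWRITE_TEMPLATE.format(history=transcript, question=question)
    rewritten = llm(prompt).strip()
    return rewritten or question

def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; always returns a canned rewrite.
    return "Does the Pro plan include customer support?\n"

history = [("user", "Tell me about your Pro plan."), ("assistant", "[answer]")]
query = rewrite_query(history, "Does it include support?", fake_llm)
```

The rewritten `query`, not the raw follow-up, is what gets embedded and searched. Falling back to the original question keeps retrieval working if the rewrite call returns an empty string.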

Assembling the context

Once you have the winning chunks, how you put them into the prompt matters more than people expect. The same chunks, formatted two ways, can produce noticeably different answers. A few habits that help:

Mark boundaries clearly so the model knows where each source starts and ends.
Keep source labels with each chunk so the model can cite them and you preserve grounding.
Mind the order: models tend to use material near the start and end of a long context more reliably than material buried in the middle, so the position of key chunks matters.
Respect the budget — every chunk spends tokens; more isn't automatically better.
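The four habits above can be combined in one small assembly function. This is a sketch under stated assumptions: the `<source>` tag format and the `assemble_context` helper are illustrative choices, chunks are assumed to arrive best-first, and the word count is a crude stand-in for a real tokenizer.

```python
def assemble_context(chunks, max_tokens=200):
    """Format ranked chunks with labeled boundaries, stopping at a token budget."""
    blocks, used = [], 0
    for n, chunk in enumerate(chunks, start=1):
        cost = len(chunk["text"].split())  # crude word count, not a real tokenizer
        if used + cost > max_tokens:
            break  # budget spent: drop this and all lower-ranked chunks
        # Explicit open/close markers plus a source label the model can cite.
        blocks.append(f'<source id="{n}" label="{chunk["source"]}">\n{chunk["text"]}\n</source>')
        used += cost
    return "\n\n".join(blocks)

# Hypothetical ranked chunks, best first.
chunks = [
    {"source": "pricing.md", "text": "The Pro plan includes priority support."},
    {"source": "faq.md", "text": "Support is available on weekdays."},
    {"source": "legal.md", "text": "word " * 500},
]
context = assemble_context(chunks, max_tokens=50)
print(context)
```

With a budget of 50, the two short chunks fit and the oversized third chunk is left out. Keeping the strongest chunk first is only one ordering policy; it is worth testing alternatives against your own model.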

This is prompt engineering applied to retrieved results, and it's the bridge between "we found good chunks" and "the model gave a good answer."

This is where RAG lives or dies

Step back and notice where we are. The model hasn't even run yet, and almost every way RAG fails has already had its chance: a missed chunk, a vague query, the wrong top-k, badly assembled context. The generation step can only be as good as the context retrieval hands it. That's why, of the whole pipeline, retrieval is where most real-world quality work happens — and why the next chapter pushes further on getting the ordering right.

Recap

Retrieval is the per-query step: embed the question, fetch the top-k chunks (with filters), and assemble the prompt.
Semantic search has a blind spot for exact strings — codes, IDs, rare names — that keyword search handles well.
Hybrid search runs both and merges them, patching each method's weakness with the other's strength.
Query rewriting turns vague, context-dependent questions into clear standalone queries — a big quality lift.
How you assemble context (boundaries, labels, order, budget) shapes the final answer; retrieval is where RAG lives or dies.

We can pull a pile of candidate chunks — but the best one isn't always near the top. Re-ranking reorders candidates so the strongest context wins. Continue to Re-ranking in RAG.