Chapter 25 · Intermediate · 11 min read
RAG Prompts: Grounding LLM Answers in Your Own Data
What is RAG and how do you prompt with retrieved context? A practical guide to Retrieval-Augmented Generation — why it beats fine-tuning for facts, how to structure a RAG prompt, cite sources, and stop the model from hallucinating.
June 29, 2026
We've now seen how to instruct, exemplify, format, and govern the model. But one limitation keeps recurring: the model only knows what it learned during training, with a frozen knowledge cutoff, and it knows nothing about your private documents. RAG — Retrieval-Augmented Generation — is the technique that fixes this through prompting, and it's one of the most important patterns in applied AI.
The problem RAG solves
Ask a model "What's our company's refund policy?" and it can't know — that's in your internal docs, which were never in its training data. Ask "What happened in the news yesterday?" and it can't know — that's after its cutoff. The model will often hallucinate a plausible answer rather than admit ignorance.
You could fine-tune the model on your data, but as the fine-tuning chapter explained, fine-tuning is poor at reliably memorising facts and impractical for fast-changing information, since every update means another training run. There's a better way: just put the relevant facts in the prompt.
How RAG works, end to end
RAG has two phases: retrieve, then generate.
- Retrieve. When a question comes in, search your knowledge base for the most relevant passages. This usually uses embeddings: your documents are split into chunks, each chunk is embedded into a vector, stored in a vector database, and the question is embedded and matched to the closest chunks by semantic similarity.
- Generate. Take those retrieved chunks and put them in the prompt as context, then ask the model to answer the question using that context.
The model never "learned" your data — it's reading it fresh from the prompt, exactly as it would read anything you paste in.
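The retrieve phase can be sketched in a few lines. This is a toy illustration rather than a production pipeline: `embed` here is a bag-of-words counter standing in for a real embedding model (so it matches on shared words, not on meaning), the in-memory list stands in for a vector database, and the sample chunks and function names are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question, best first."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical knowledge base, already split into chunks.
chunks = [
    "Refunds are available within 30 days of purchase with a receipt.",
    "Our office is closed on public holidays.",
    "Shipping takes three to five business days.",
]
top = retrieve("What is the refund policy for a purchase?", chunks, k=1)
print(top)
```

Note that this toy version does not match "refund" to "refunds". A real embedding model is what lets retrieval work on semantic similarity rather than exact word overlap.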
The anatomy of a RAG prompt
The prompting craft is in phase two. A solid RAG prompt has three parts:
Context:
{{retrieved_chunks}}
Question:
{{user_question}}
Instructions: Answer the question using only the context above. If the answer isn't in the context, say "I don't have that information." Cite the source for each claim.
Let's break down why each piece matters.
| Part | Why it's there |
|---|---|
| Context | The retrieved facts the model should rely on |
| Question | The user's actual query |
| Grounding rule | "Use only the context" keeps it from drifting to training data |
| Honesty rule | "Say I don't know" stops it inventing answers |
| Citation rule | Lets users verify and builds trust |
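As a minimal sketch, the template can be assembled with a small helper. The function name `build_rag_prompt` and the sample chunk are assumptions made for illustration; the instruction wording is taken from the template in this chapter.

```python
def build_rag_prompt(chunks: list[str], question: str) -> str:
    """Assemble context, question, and instructions into one prompt string."""
    # Separate chunks with blank lines so their boundaries stay visible.
    context = "\n\n".join(chunks)
    instructions = (
        "Answer the question using only the context above. "
        "If the answer isn't in the context, say \"I don't have that information.\" "
        "Cite the source for each claim."
    )
    return (
        f"Context:\n{context}\n\n"
        f"Question:\n{question}\n\n"
        f"Instructions: {instructions}"
    )


# Hypothetical retrieved chunk and user question.
prompt = build_rag_prompt(
    ["Refunds are available within 30 days of purchase with a receipt."],
    "What is the refund policy?",
)
print(prompt)
```

Keeping the instructions in one place like this makes it easier to change the grounding, honesty, or citation wording later without touching the retrieval code.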
Grounding: the anti-hallucination instruction
The single most important line in a RAG prompt is the grounding instruction: tell the model to answer only from the provided context, and to admit when the answer isn't there.
Pair it with citations ("cite which source each fact came from"). Citations do double duty: they let users verify answers, and they nudge the model to actually base its answer on the retrieved text rather than its memory.
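One way to make citations checkable is to label each chunk before it goes into the prompt, then confirm that the answer only cites labels that exist. The sketch below assumes a `[S1]`, `[S2]` labelling scheme, which is a convention chosen for this example, and the model answer is a hard-coded hypothetical string rather than real model output.

```python
import re


def label_chunks(chunks: list[str]) -> str:
    """Prefix each chunk with a source label such as [S1] so the model can cite it."""
    return "\n".join(f"[S{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))


def cited_labels(answer: str) -> set[int]:
    """Extract the source numbers cited in an answer, e.g. {1, 3}."""
    return {int(n) for n in re.findall(r"\[S(\d+)\]", answer)}


def invalid_citations(answer: str, num_chunks: int) -> set[int]:
    """Return cited source numbers that do not match any supplied chunk."""
    return {n for n in cited_labels(answer) if not 1 <= n <= num_chunks}


chunks = [
    "Refunds are available within 30 days of purchase with a receipt.",
    "Shipping takes three to five business days.",
]
context = label_chunks(chunks)

# Hypothetical model answer: the second citation points at a source we never supplied.
answer = "Refunds are available within 30 days [S1]. Gift cards are excluded [S4]."
print(invalid_citations(answer, len(chunks)))  # prints {4}
```

A check like this only catches citations to sources that do not exist. It does not prove that a cited chunk actually supports the claim, which still needs human or automated review.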
Retrieval quality is the real ceiling
Here's the hard truth about RAG: the prompt can only be as good as what retrieval feeds it. If your retrieval step pulls the wrong chunks, no prompting wizardry will produce a right answer — the model is faithfully working from bad context.
So most RAG failures are actually retrieval failures:
| Symptom | Likely retrieval cause |
|---|---|
| Answer is irrelevant | Wrong chunks retrieved |
| Answer misses obvious info | Relevant chunk not retrieved |
| Answer is partially right | Chunks too small / context fragmented |
| Model ignores context | Too many chunks, signal buried (lost in the middle) |
Improving RAG usually means improving the upstream retrieval pipeline, not just the prompt: chunking (how documents are split), embedding quality, and re-ranking (reordering retrieved results so the most relevant chunks come first instead of being buried mid-context).
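Chunking is the easiest of these to illustrate. The sketch below splits text into fixed-size word windows that overlap, so a sentence cut at one boundary still appears whole in a neighbouring chunk. The function name and the default `size` and `overlap` values are assumptions for the example, not recommended settings; real pipelines often split on sentences, headings, or tokens instead.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of up to `size` words, sharing `overlap` words between neighbours."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the window already reached the end of the text
    return chunks


# Ten placeholder words, windows of four words with one word of overlap.
text = " ".join(f"w{i}" for i in range(10))
parts = chunk_words(text, size=4, overlap=1)
print(parts)  # prints ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Larger chunks keep more surrounding context together but dilute the match with the question; smaller chunks match more precisely but risk the fragmented, partially right answers listed in the table above.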
RAG vs fine-tuning, settled
To close the loop with the fine-tuning chapter:
| | RAG | Fine-tuning |
|---|---|---|
| Adds | Knowledge | Behaviour |
| Updates | Instantly (just change the data) | Requires re-training |
| Cost | Per-query retrieval | One-off training |
| Best for | Facts: fresh, private, specific | Style, format, task patterns |
| Hallucination | Reduces it (grounded) | Doesn't address it |
For giving a model facts, RAG is almost always the right answer. The two are also complementary: you can fine-tune behaviour and use RAG for knowledge in the same system.
Recap
- RAG retrieves relevant text and puts it in the prompt so the model answers from real, current sources instead of its frozen memory.
- It exists because models don't know your private data or post-cutoff facts — and it beats fine-tuning for knowledge.
- A RAG prompt supplies context + question + grounding/honesty/citation rules.
- The key anti-hallucination move is "answer only from the context, and say 'I don't know' otherwise."
- Retrieval quality is the ceiling — most RAG failures are bad retrieval (chunking, embeddings, re-ranking), not bad prompts.
- Use RAG for facts, fine-tuning for behaviour — often together.
We've now built a full prompt-engineering toolkit. The last question is the one that separates guessing from engineering: how do you actually know a prompt is good? Continue to the finale, Evaluation.