RAG Prompts: Grounding LLM Answers in Your Own Data

What is RAG and how do you prompt with retrieved context? A practical guide to Retrieval-Augmented Generation — why it beats fine-tuning for facts, how to structure a RAG prompt, cite sources, and stop the model from hallucinating.

We've now seen how to instruct, exemplify, format, and govern the model. But one limitation keeps recurring: the model only knows what it learned during training, with a frozen knowledge cutoff, and it knows nothing about your private documents. RAG — Retrieval-Augmented Generation — is the technique that fixes this through prompting, and it's one of the most important patterns in applied AI.

The problem RAG solves

Ask a model "What's our company's refund policy?" and it can't know — that's in your internal docs, which were never in its training data. Ask "What happened in the news yesterday?" and it can't know — that's after its cutoff. The model will often hallucinate a plausible answer rather than admit ignorance.

You could fine-tune the model on your data, but as the fine-tuning chapter explained, fine-tuning is poor at reliably memorising facts and impractical for fast-changing information, because every update means another training run. There's a better way: put the relevant facts in the prompt.

How RAG works, end to end

RAG has two phases: retrieve, then generate.

User question → Retrieve relevant chunks → Put them in the prompt → Model answers from them

Figure: Retrieval-Augmented Generation (find relevant text, then answer from it)

Retrieve. When a question comes in, search your knowledge base for the most relevant passages. This usually relies on embeddings. Ahead of time, your documents are split into chunks, and each chunk is embedded into a vector and stored in a vector database. At query time, the question is embedded the same way and matched to the closest chunks by semantic similarity.
Generate. Take those retrieved chunks and put them in the prompt as context, then ask the model to answer the question using that context.
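
The retrieve phase can be sketched in a few lines. This is a toy illustration only: the `embed` function here is a bag-of-words counter standing in for a real embedding model, the in-memory list stands in for a vector database, and the names `embed`, `cosine`, and `retrieve` are made up for this example. Word counts match on shared vocabulary, not meaning, so a real system would swap in learned embeddings.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical knowledge base, already split into chunks.
chunks = [
    "Refunds are available within 30 days of purchase with a receipt.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]
top = retrieve("What is the refund policy for a purchase?", chunks, k=1)
print(top)
```

The shape is the same with production components: embed the chunks once, embed the question per query, and keep the top-k matches for the generate phase.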

The model never "learned" your data — it's reading it fresh from the prompt, exactly as it would read anything you paste in.

The anatomy of a RAG prompt

The prompting craft is in phase two. A solid RAG prompt has three parts:

Context: {{retrieved_chunks}}

Question: {{user_question}}

Instructions: Answer the question using only the context above. If the answer isn't in the context, say "I don't have that information." Cite the source for each claim.

Let's break down why each piece matters. The instructions line bundles three separate rules: grounding, honesty, and citation.

Part	Why it's there
Context	The retrieved facts the model should rely on
Question	The user's actual query
Grounding rule	"Use only the context" keeps it from drifting to training data
Honesty rule	"Say 'I don't have that information'" stops it inventing answers
Citation rule	Lets users verify and builds trust
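
To make the template concrete, here is a minimal sketch of assembling it in code. The function name `build_rag_prompt`, the `(source_label, text)` chunk shape, and the bracketed label format are assumptions for illustration. The instruction text is the one from the template above.

```python
INSTRUCTIONS = (
    "Answer the question using only the context above. "
    "If the answer isn't in the context, say \"I don't have that information.\" "
    "Cite the source for each claim."
)


def build_rag_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Assemble context + question + instructions into one prompt string.

    `chunks` holds (source_label, text) pairs. Labelling each chunk gives
    the model something concrete to cite.
    """
    context = "\n\n".join(f"[{label}] {text}" for label, text in chunks)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        f"Instructions: {INSTRUCTIONS}"
    )


# Hypothetical retrieved chunk and question.
prompt = build_rag_prompt(
    "What is the refund window?",
    [("refund-policy.md", "Refunds are available within 30 days of purchase.")],
)
print(prompt)
```

Keeping the instructions in one constant means the grounding, honesty, and citation rules stay identical across every query, which makes later evaluation easier.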

Grounding: the anti-hallucination instruction

The single most important line in a RAG prompt is the grounding instruction: tell the model to answer only from the provided context, and to admit when the answer isn't there.

Pair it with citations ("cite which source each fact came from"). Citations do double duty: they let users verify answers, and they nudge the model to actually base its answer on the retrieved text rather than its memory.
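
Citations are also something you can check mechanically. The sketch below assumes the model was asked to cite sources as bracketed labels such as `[refund-policy.md]`. That format, and the function name `check_citations`, are assumptions for this example. It only confirms that each cited label was actually among the retrieved sources; it does not prove the cited chunk supports the claim.

```python
import re


def check_citations(answer: str, allowed_sources: set[str]) -> dict:
    """Compare bracketed citations in an answer with the retrieved source labels."""
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    return {
        "cited": sorted(cited & allowed_sources),    # labels we did retrieve
        "unknown": sorted(cited - allowed_sources),  # labels we never supplied
        "has_citation": bool(cited & allowed_sources),
    }


# Hypothetical model answer and the labels that were in the prompt.
answer = (
    "Refunds are allowed within 30 days [refund-policy.md]. "
    "Shipping is free [faq.md]."
)
report = check_citations(answer, {"refund-policy.md", "shipping.md"})
print(report)
```

An answer with no valid citation, or one that cites a label you never supplied, is a useful signal that the model may have leaned on its memory instead of the context.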

Retrieval quality is the real ceiling

Here's the hard truth about RAG: the prompt can only be as good as what retrieval feeds it. If your retrieval step pulls the wrong chunks, no prompting wizardry will produce a right answer — the model is faithfully working from bad context.

So most RAG failures are actually retrieval failures:

Symptom	Likely retrieval cause
Answer is irrelevant	Wrong chunks retrieved
Answer misses obvious info	Relevant chunk not retrieved
Answer is partially right	Chunks too small / context fragmented
Model ignores context	Too many chunks, signal buried (lost in the middle)

Improving RAG usually means improving the upstream retrieval pipeline, not just the prompt: chunking (how documents are split), embedding quality, and re-ranking (reordering retrieved results by relevance so the best chunks make it into the prompt and sit where the model is most likely to use them).
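
Chunking is the easiest of these to illustrate. A common mitigation for fragmented context is to let neighbouring chunks overlap, so a sentence that straddles a boundary appears whole in at least one chunk. The sketch below splits on whitespace with a fixed word window. The function name `chunk_words` and the default sizes are illustrative assumptions, not recommended values, and many pipelines split on sentences, headings, or tokens instead.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words
    with the previous chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Tiny demonstration with a 10-word "document".
doc = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(doc, size=4, overlap=1):
    print(chunk)
```

Larger chunks keep more context together but dilute the similarity signal; smaller ones match more precisely but fragment the answer. The right trade-off is something to measure on your own documents.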

RAG vs fine-tuning, settled

To close the loop with the fine-tuning chapter:

	RAG	Fine-tuning
Adds	Knowledge	Behaviour
Updates	Instantly (just change the data)	Requires re-training
Cost	Per-query retrieval	One-off training
Best for	Facts: fresh, private, specific	Style, format, task patterns
Hallucination	Reduces it (grounded)	Doesn't address it

For giving a model facts, RAG is almost always the right answer. The two approaches are also complementary: you can fine-tune behaviour and use RAG for knowledge in the same system.

Recap

RAG retrieves relevant text and puts it in the prompt so the model answers from real, current sources instead of its frozen memory.
It exists because models don't know your private data or post-cutoff facts — and it beats fine-tuning for knowledge.
A RAG prompt supplies context + question + grounding/honesty/citation rules.
The key anti-hallucination move is "answer only from the context, and say 'I don't know' otherwise."
Retrieval quality is the ceiling — most RAG failures are bad retrieval (chunking, embeddings, re-ranking), not bad prompts.
Use RAG for facts, fine-tuning for behaviour — often together.

We've now built a full prompt-engineering toolkit. The last question is the one that separates guessing from engineering: how do you actually know a prompt is good? Continue to the finale, Evaluation.