What Is RAG? Why Retrieval-Augmented Generation Exists

What is Retrieval-Augmented Generation? A plain-English explanation of why RAG exists — how it fixes an LLM's fixed knowledge, hallucinations, and lack of access to private data by retrieving relevant text before answering.

Ask a general LLM "what's our company's refund policy?" and it will confidently make something up. It has never seen your policy — that document wasn't in its training data, and the model has no way to look it up. Retrieval-Augmented Generation (RAG) is the standard fix: give the model the right document at question time, so it answers from real text instead of fuzzy memory.

This guide builds on ideas from the LLM guide. If tokens and the context window are unfamiliar, a quick skim of that guide will make everything here land faster.

The problem: a model's knowledge is frozen and generic

An LLM's knowledge is baked into its weights during training, and that creates three hard limits:

Limit	What it means
Frozen in time	The model knows nothing after its training cutoff — no recent events, no new docs.
Generic	It learned from public text, not your data: your wiki, your tickets, your codebase.
Opaque	It can't tell you where a fact came from, so you can't verify it.

These limits are exactly why a raw model hallucinates on specific questions: asked about something it never saw, it produces the most plausible-sounding text rather than admitting ignorance. The knowledge simply isn't there to recall.

The idea: retrieve, then generate

RAG's insight is almost embarrassingly simple: before the model answers, look up relevant text and paste it into the prompt. The model then answers using that text.

Question -> Retrieve relevant text -> Add it to the prompt -> Model answers from it

Figure: RAG inserts a retrieval step before the model ever generates

Instead of asking the model to recall your refund policy from memory, you find the policy document, drop it into the context, and ask: "Using this, answer the question." Now the model isn't remembering — it's reading. That shift from recall to reading is the whole game.
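A minimal sketch of that paste-into-the-prompt step, in Python. The function name `build_prompt`, the prompt wording, and the sample policy text are illustrative assumptions rather than anything prescribed by a particular library; the resulting string is what you would send to whichever model you use.

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Paste retrieved text into the prompt ahead of the question."""
    # Number each chunk so the model (and the reader) can refer back to it.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical policy text, standing in for a chunk a retriever found.
chunks = ["Refunds are available within 30 days of purchase with a receipt."]
prompt = build_prompt("What is our refund policy?", chunks)
print(prompt)
```

Note that nothing about the model changes here: the only difference between a raw query and a RAG query is what the prompt contains.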

Why not just retrain the model on your data?

A fair question: if the model doesn't know your data, why not train it in? Because fine-tuning knowledge into weights is the expensive, brittle way to do it:

It costs significant compute and expertise every time.
It's stale the moment your data changes — a new policy means another training run.
It's hard to make the model cite sources or forget a specific fact on demand.

RAG sidesteps all of that. Your knowledge lives in a searchable store, outside the model. Update a document, re-index it, and the next query uses the new version, with no retraining. (Fine-tuning still has its place, but for teaching new facts, RAG is usually the right tool. We cover the distinction in the LLM fine-tuning chapter.)

The two phases of every RAG system

Every RAG system has the same two-phase shape — one offline, one online:

Index: chunk + store your docs -> Query: embed the question -> Retrieve top matches -> Generate the answer

Figure: Indexing happens once; retrieval happens on every query

Indexing (offline, ahead of time): take your documents, split them into chunks, and store them in a way you can search by meaning. It runs once per document and again only when that document changes. This is the subject of the next few chapters: embeddings, chunking, and vector databases.
Retrieval + generation (online, per query): for each question, find the most relevant chunks and hand them to the model to answer.
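The two phases can be sketched end to end in a few lines of Python. This is a toy under loud assumptions: `embed` here is just lowercase word counts standing in for a real embedding model, `chunk` splits on blank lines, the "index" is an in-memory list rather than a vector database, and the document contents are made up. The shape (index once, then score and rank per query) is the point, not the components.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(doc: str) -> list[str]:
    """Naive chunking: one chunk per non-empty paragraph."""
    return [p.strip() for p in doc.split("\n\n") if p.strip()]


def build_index(docs: dict[str, str]) -> list[tuple[str, str, Counter]]:
    """Offline phase: chunk every document and store (source, text, vector)."""
    return [(name, c, embed(c)) for name, doc in docs.items() for c in chunk(doc)]


def retrieve(index, question: str, k: int = 2) -> list[tuple[float, str, str]]:
    """Online phase: score every chunk against the question, keep the top k."""
    q = embed(question)
    scored = [(cosine(q, vec), name, text) for name, text, vec in index]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]


# Hypothetical documents, for illustration only.
docs = {
    "refunds.txt": (
        "Refunds are available within 30 days of purchase.\n\n"
        "Refunds go back to the original payment method."
    ),
    "shipping.txt": "Standard shipping takes five business days.",
}

index = build_index(docs)  # done ahead of time
top = retrieve(index, "Within how many days are refunds available?", k=1)
print(top[0][1], "->", top[0][2])
```

The retrieved chunks would then be placed in the prompt for the generation step. Because word counts only match exact words, this toy retriever fails on paraphrases, which is the gap real embeddings are meant to close.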

The rest of this guide walks that pipeline end to end, one stage per chapter.

Grounding: the real payoff

Beyond just knowing your data, RAG gives you something a raw model can't: grounding. Because every answer is built from specific retrieved chunks, you can show the user which sources backed each claim — a citation, a link, a quote.

That matters for two reasons. It lets people verify answers instead of trusting them blindly, and it sharply reduces hallucination, because the model is steered toward the supplied text rather than its own guesses.
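One way to make grounding concrete is to keep each chunk's source next to its text and resolve citation markers in the answer back to those sources. The sketch below assumes the model was asked to cite chunks as `[1]`, `[2]`, and so on; that convention, the `Chunk` class, the `cited_sources` helper, and the sample answer string are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # where the text came from, e.g. a file name or URL
    text: str


def cited_sources(answer: str, chunks: list[Chunk]) -> list[str]:
    """Map [n] markers in an answer back to the sources of the numbered chunks."""
    sources: list[str] = []
    for marker in re.findall(r"\[(\d+)\]", answer):
        i = int(marker) - 1  # markers are 1-based, list indices are 0-based
        if 0 <= i < len(chunks) and chunks[i].source not in sources:
            sources.append(chunks[i].source)
    return sources


chunks = [
    Chunk("refunds.txt", "Refunds are available within 30 days of purchase."),
    Chunk("shipping.txt", "Standard shipping takes five business days."),
]

# Hypothetical model output; a real answer would come from the LLM call.
answer = "You can get a refund within 30 days of purchase [1]."
print(cited_sources(answer, chunks))  # ['refunds.txt']
```

Markers that point outside the retrieved set are ignored rather than trusted, since a model can emit a citation number that matches no chunk.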

RAG is only as good as its retrieval

One honest caveat to carry through the whole guide: RAG doesn't remove hallucination; it relocates the risk to retrieval. If your system fails to find the right chunk (because it was split badly, embedded poorly, or ranked too low), the model gets unhelpful or wrong context and is back to guessing. "Garbage in, garbage out" is the law of RAG. That's why most of this guide is really about retrieving well.
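A small illustration of a retrieval miss, using a deliberately weak retriever. The keyword-overlap scoring, the `min_overlap` threshold, and the sample text are assumptions chosen to show the failure, not a recommended design: the right chunk exists, but a paraphrased question shares no words with it.

```python
import re
from typing import Optional


def words(text: str) -> set[str]:
    """Lowercase word set for a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(question: str, chunk: str) -> int:
    """Count distinct words shared by the question and the chunk."""
    return len(words(question) & words(chunk))


def retrieve_or_none(question: str, chunks: list[str], min_overlap: int = 1) -> Optional[str]:
    """Return the best-matching chunk, or None if nothing clears the threshold."""
    best = max(chunks, key=lambda c: keyword_overlap(question, c))
    return best if keyword_overlap(question, best) >= min_overlap else None


# Hypothetical chunks, for illustration only.
chunks = [
    "Refunds are available within 30 days of purchase.",
    "Standard shipping takes five business days.",
]

# Same intent, different wording: only the first question shares words with a chunk.
hit = retrieve_or_none("Are refunds available?", chunks)
miss = retrieve_or_none("Can I get my money back?", chunks)
print(hit)
print(miss)
```

Returning `None` instead of the least-bad chunk lets the application say it found nothing, rather than handing the model irrelevant context to guess from.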

Recap

An LLM's knowledge is frozen, generic, and opaque — it can't answer about new, private, or specific data.
RAG fixes this by retrieving relevant text and putting it in the prompt, so the model reads instead of recalls.
It beats retraining for knowledge: fast updates, no training runs, and answers you can trace to sources.
Every RAG system has two phases: index your data once, then retrieve + generate on each query.
RAG's payoff is grounding (verifiable, cited answers) — but it's only as good as what it retrieves.

The pipeline starts with a deceptively deep question: how do you search text by meaning rather than by keyword? The answer is embeddings. Continue to Embeddings for RAG.