Code Safari

Chapter 22·Intermediate·12 min read

How LLMs Are Trained: Pretraining and RLHF Explained

How is a large language model trained? A plain-English guide to LLM training — pretraining on huge text, how parameters are tuned by predicting the next token, and how RLHF and instruction tuning turn a raw model into a helpful assistant.

June 29, 2026

We've explored the finished machine: tokens, embeddings, the Transformer, and attention. But a fresh Transformer is useless — its billions of parameters start as random noise. Training is the process that turns that noise into a model that can write, reason, and assist. Here's how it works, no math required.

The one trick: predict the next token

All of an LLM's ability comes from one deceptively simple training task: predict the next token.

Take a sentence from the training data, hide what comes next, and ask the model to guess it. Compare its guess to the real answer. If it's wrong, nudge its parameters a tiny bit in the direction that would have made the right answer more likely. Then do it again — across trillions of tokens.

Show text, hide next token → Model predicts → Compare to truth → Nudge parameters → Repeat
Figure: The training loop, repeated trillions of times

No single nudge teaches much. But trillions of them, across the breadth of human writing, gradually carve grammar, facts, reasoning patterns, and style into the parameters. This is the same "next-token prediction" we met in chapter one; training runs that prediction on text where the real next token is already known, and uses each miss to adjust the parameters.
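For readers who want to see the loop as code, here is a minimal sketch under heavy assumptions: the "model" is just a table of scores (one row per character), the "tokens" are single characters, and the names `train_bigram` and `predict_next` are made up for this illustration. A real LLM swaps the table for a Transformer, but the show, predict, compare, nudge cycle has the same shape.

```python
import math
import random

def train_bigram(text, steps=2000, lr=0.5, seed=0):
    """Train a toy character-level next-token predictor by gradient descent."""
    vocab = sorted(set(text))
    index = {ch: i for i, ch in enumerate(vocab)}
    size = len(vocab)
    rng = random.Random(seed)
    # Parameters start as small random noise: one row of scores per current token.
    logits = [[rng.uniform(-0.1, 0.1) for _ in range(size)] for _ in range(size)]
    pairs = [(index[a], index[b]) for a, b in zip(text, text[1:])]
    losses = []
    for step in range(steps):
        cur, nxt = pairs[step % len(pairs)]       # show text, hide next token
        row = logits[cur]
        top = max(row)
        exps = [math.exp(x - top) for x in row]
        total = sum(exps)
        probs = [e / total for e in exps]          # model predicts (softmax)
        losses.append(-math.log(probs[nxt]))       # compare to truth
        for j in range(size):                      # nudge parameters
            grad = probs[j] - (1.0 if j == nxt else 0.0)
            row[j] -= lr * grad
    return vocab, index, logits, losses

def predict_next(vocab, index, logits, ch):
    """Return the character the trained table scores highest after `ch`."""
    row = logits[index[ch]]
    return vocab[row.index(max(row))]

vocab, index, logits, losses = train_bigram("hello hello hello ")
print(predict_next(vocab, index, logits, "h"))
print(sum(losses[:100]) / 100, sum(losses[-100:]) / 100)
```

The two printed averages compare the error over the first and last hundred steps. The second should be clearly smaller, which is the "trillions of tiny nudges" idea at toy scale.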

Phase 1: Pretraining

The first and largest phase is pretraining: running that next-token loop over a vast corpus — books, websites, code, articles, reference text. This is where the model absorbs language and world knowledge in bulk.

Pretraining has defining characteristics:

Property | Detail
Scale | Trillions of tokens; billions of parameters
Cost | Enormous compute and energy, among the most expensive things in tech
Frequency | Done rarely; the result is a reusable base model
Output | A "base" model that's knowledgeable but not yet a helpful assistant

This is the phase where scaling laws showed their power: more data, more parameters, and more compute predictably produced a better model. That discovery drove the whole race we describe in the history of LLMs.

Why a raw model isn't enough

Here's a surprise: a freshly pretrained base model is powerful but not the helpful assistant you're used to. It's a pure text-continuer. Ask it "What's the capital of France?" and it might continue with "What's the capital of Germany? What's the capital of Spain?" — because in its training data, questions are often followed by more questions.

The base model has the knowledge and fluency, but not the behaviour. It doesn't know it's supposed to be helpful, answer directly, or refuse harmful requests. Turning a base model into an assistant takes more training — the alignment phase.
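To make the "pure text-continuer" behaviour concrete, here is a small sketch that stands in for a base model with a word-counting table. This is an assumption for illustration only: real base models are neural networks, and the names `fit_continuer`, `continue_text`, and `QUIZ_PAGE` are hypothetical. The point is that a model trained only to continue text reproduces whatever usually came next in its data.

```python
from collections import Counter, defaultdict

def fit_continuer(corpus, order=2):
    """Count which word follows each run of `order` words in the corpus."""
    words = corpus.split()
    table = defaultdict(Counter)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        table[context][words[i + order]] += 1
    return table

def continue_text(table, prompt, max_words=4, order=2):
    """Greedily append the most frequent next word; return only the added words."""
    words = prompt.split()
    added = []
    for _ in range(max_words):
        context = tuple(words[-order:])
        if context not in table:
            break  # never saw this context in training, so stop
        next_word = table[context].most_common(1)[0][0]
        words.append(next_word)
        added.append(next_word)
    return " ".join(added)

# Toy "training data": a quiz page where questions follow questions.
QUIZ_PAGE = (
    "What's the capital of France? "
    "What's the capital of Germany? "
    "What's the capital of Spain?"
)

table = fit_continuer(QUIZ_PAGE)
print(continue_text(table, "What's the capital of France?"))
```

The continuation begins another question rather than answering, because nothing in the toy data ever showed an answer following a question.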

Phase 2: Instruction tuning

The first step of alignment is instruction tuning (a form of supervised fine-tuning). The model is trained on many examples of the form instruction → good response: a question paired with a helpful answer, a task paired with a correct completion.

Write instruction–response pairs → Train the base model on them → Model learns to answer, not continue
Figure: Instruction tuning teaches the model the assistant pattern

After this, the model understands the shape of being an assistant: when it sees a question, it should produce an answer. Instruction tuning is itself a kind of fine-tuning, which the next chapter covers in depth.
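Here is a rough sketch of what one instruction → response training example might look like once it is prepared for the same next-token loop. Several things here are assumptions rather than anything this chapter specifies: the "Instruction:/Response:" template, the whitespace "tokenizer", the `<end>` marker, and the choice to score only the response tokens (a common practice, but not a universal one). The function name `build_example` is hypothetical.

```python
def build_example(instruction, response):
    """Turn one instruction-response pair into tokens plus a loss mask."""
    # Hypothetical template; real projects each define their own format.
    prompt = f"Instruction: {instruction}\nResponse:"
    prompt_tokens = prompt.split()                  # toy whitespace tokenizer
    response_tokens = response.split() + ["<end>"]  # marker for "stop here"
    tokens = prompt_tokens + response_tokens
    # 0 = skip this position when computing the error (the prompt),
    # 1 = train on it (the response the model should learn to produce).
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

pairs = [
    ("What's the capital of France?", "The capital of France is Paris."),
    ("Translate 'hello' to Spanish.", "Hola."),
]
dataset = [build_example(instruction, response) for instruction, response in pairs]

for tokens, loss_mask in dataset:
    trained_on = [t for t, m in zip(tokens, loss_mask) if m == 1]
    print(trained_on)
```

Training then runs ordinary next-token prediction over `tokens`, counting errors only where the mask is 1, so the model is rewarded for producing the answer rather than for continuing the question.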

Phase 3: RLHF — learning from human preference

The final polish is RLHF — Reinforcement Learning from Human Feedback. The idea: it's hard to write the perfect answer for every prompt, but easy for a human to compare two answers and say which is better.

So the process is:

  1. The model generates several responses to a prompt.
  2. Humans (or a model trained to mimic them) rank which responses are better.
  3. The model is nudged to produce more of what people preferred and less of what they didn't.
Figure: RLHF rewards the responses humans prefer. Helpful & clear: 👍. Vague, rambling, harmful: 👎.

RLHF is what makes a model feel helpful, honest, and harmless — it learns the hard-to-specify qualities of a good response. As we noted in the history of LLMs, this step is what turned a raw text predictor into ChatGPT.
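The ranking step (step 2 above) can be sketched as a tiny "reward model" that learns to score preferred responses higher. Treat everything here as an assumption for illustration: the two hand-made features, the example preference pairs, the pairwise logistic loss (a common choice for reward models, though this chapter does not name one), and the function names. Real reward models are neural networks that learn their own features.

```python
import math

HEDGE_WORDS = ("maybe", "stuff")

def features(response):
    """Hypothetical hand-made features: capped length and a vagueness flag."""
    words = response.split()
    length = min(len(words), 20) / 20.0
    vague = 1.0 if any(w in response.lower() for w in HEDGE_WORDS) else 0.0
    return [length, vague]

def score(weights, response):
    """Reward: a weighted sum of the features."""
    return sum(w * x for w, x in zip(weights, features(response)))

def train_reward_model(preferences, steps=200, lr=0.5):
    """Learn weights so that each chosen response outscores its rejected one."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        for chosen, rejected in preferences:
            margin = score(weights, chosen) - score(weights, rejected)
            # Pairwise logistic loss is -log(sigmoid(margin)); its gradient
            # shrinks as the model grows more confident in the human's choice.
            p_wrong = 1.0 / (1.0 + math.exp(margin))
            f_chosen, f_rejected = features(chosen), features(rejected)
            for k in range(len(weights)):
                weights[k] += lr * p_wrong * (f_chosen[k] - f_rejected[k])
    return weights

# Each pair is (response a human preferred, response they rejected).
preferences = [
    ("Paris is the capital of France.",
     "Maybe some city, it depends, lots of stuff to say about that really."),
    ("Water boils at 100 degrees Celsius at sea level.",
     "Maybe hot, maybe not, it depends on a lot of stuff."),
]

weights = train_reward_model(preferences)
candidates = [
    "Maybe it is stuff in the sky.",
    "The sky looks blue because air scatters blue light.",
]
print(max(candidates, key=lambda r: score(weights, r)))
```

This covers only the scoring half. In full RLHF, scores like these become the signal that nudges the language model itself (step 3), which is not shown here.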

The full picture

Phase | What it does | Result
Pretraining | Next-token prediction over huge text | Knowledge + fluency (base model)
Instruction tuning | Train on instruction → response pairs | Knows how to assist
RLHF | Learn from human preferences | Helpful, safe, well-behaved

The first phase is enormous and rare; the alignment phases are smaller but decisive. Together they produce the assistant you actually talk to.

Recap

  • LLMs train by predicting the next token, trillions of times, nudging parameters after each miss.
  • Pretraining runs this over a vast corpus — this is where knowledge and fluency are baked in, at huge cost, and where the knowledge cutoff comes from.
  • A raw base model continues text but doesn't assist — it needs alignment.
  • Instruction tuning teaches the assistant pattern; RLHF uses human preference to make it helpful, honest, and harmless.
  • The result is one reusable set of parameters — frozen knowledge plus learned behaviour.

Pretraining is general. But what if you need a model that's expert in your domain — legal, medical, your company's tone? That's fine-tuning. Continue to Fine-tuning LLMs: adapting a model to your task.
