What's Inside an AI Model: Training, Parameters, and Why Size Matters

What is a model, really? A no-math look inside a large language model — what parameters are, how training actually works, what 'billions of parameters' means, and why size matters (but isn't everything).

Many people imagine an AI model as a kind of program: pages of code with facts and rules written into it. That picture is wrong, and replacing it with an accurate one is the most clarifying step you can take.

In this chapter we open the box. What is a model? What are the "parameters" everyone counts? What actually happens during "training"? And why does size matter — but not as much as the headlines suggest? No math, just the real mechanics.

A model is mostly numbers

Strip away the marketing and a trained language model is two things:

A fixed architecture — the wiring that says how to push text through the system (for today's models, a Transformer, from the history chapter).
A gigantic set of parameters — numbers that fill in that wiring.

That's it. When you download a model, you're mostly downloading the parameters. The "intelligence" isn't in handwritten rules; it's in the specific values of billions of numbers. Change the numbers and you change what the model knows and how it behaves.

So what is a parameter?

Think of a parameter as a dial: a single adjustable number. On its own it means nothing. But wire billions of them together, set each to the right value, and they collectively transform an input ("the capital of France is") into an output (a high probability for "Paris").
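For readers who like to see the idea in code, here is a minimal sketch of "same wiring, different numbers". Everything in it is invented for illustration: the three-word vocabulary, the two input features standing in for the text, and the function name `tiny_model`. A real language model has the same split between fixed architecture and learned numbers, only at a vastly larger scale.

```python
import math

VOCAB = ["Paris", "London", "banana"]  # toy vocabulary (hypothetical)

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def tiny_model(params, features):
    """Fixed 'architecture': one weighted sum per vocabulary word, then softmax.

    params   -- one row of numbers per vocabulary word (the dials)
    features -- numbers describing the input text (made up here)
    """
    scores = [sum(w * x for w, x in zip(row, features)) for row in params]
    return dict(zip(VOCAB, softmax(scores)))

features = [1.0, 0.5]  # stand-in for "the capital of France is"

params_a = [[2.0, 1.0], [0.5, 0.0], [-1.0, -1.0]]
params_b = [[-1.0, -1.0], [0.5, 0.0], [2.0, 1.0]]  # same wiring, different numbers

pred_a = tiny_model(params_a, features)
pred_b = tiny_model(params_b, features)
print(max(pred_a, key=pred_a.get))  # most likely word under the first set of dials
print(max(pred_b, key=pred_b.get))  # most likely word under the second set
```

The function never changes between the two calls. Only the numbers passed in as `params` differ, and that alone flips the top prediction.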

When a model is described as "70 billion parameters" (70B), that's the number of dials it has. Each one started as a random value and was nudged, over training, toward a setting that helps the model predict text better.

More dials means more capacity — more room to store patterns and represent subtle distinctions. That's the core reason size correlates with capability, though as we'll see, it's not the whole story.
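To make the counting concrete, the sketch below tallies the dials in a small stack of fully connected layers and estimates how much storage a given parameter count needs. Two assumptions are worth flagging: a Transformer is wired differently from this plain stack (the counting idea is the same, the formula is not), and 2 bytes per parameter is a common storage choice rather than a universal one. The helper names are hypothetical.

```python
def count_dense_parameters(layer_sizes):
    """Count the dials in a plain stack of fully connected layers.

    Each layer has one weight per (input, output) pair plus one bias per output.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

def storage_gigabytes(n_params, bytes_per_param=2):
    """Approximate size on disk, assuming 2 bytes (16 bits) per parameter."""
    return n_params * bytes_per_param / 1e9

toy_count = count_dense_parameters([4, 8, 3])  # hypothetical toy network
print(toy_count)                               # (4*8 + 8) + (8*3 + 3) = 67
print(storage_gigabytes(70 * 10**9))           # 140.0
```

Under those assumptions a 70B model is roughly a 140 GB download, which is why "mostly downloading the parameters" is a fair description.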

How training actually works

Training is where those random dials become useful. The process is conceptually simple; it is just repeated an enormous number of times.

Figure: One training step (show text with the next word hidden, the model guesses it, compare to the real word, nudge parameters to be less wrong), repeated trillions of times across the dataset.

Here's the loop in words:

Show the model some text with the next token hidden. For example: "The mitochondria is the powerhouse of the ___".
The model guesses a probability for every possible next token.
Compare its guess to the actual next token ("cell"). The gap between guess and reality is the error (often called the "loss").
Adjust every parameter a tiny amount in the direction that would have made the correct token slightly more likely. This nudging is done by an algorithm called gradient descent — but you don't need the math; just picture "turn each dial a hair toward being less wrong."

Do this once and almost nothing changes. Do it across trillions of tokens of text, for weeks or months, on thousands of specialised chips, and the dials gradually settle into a configuration that predicts language remarkably well. Grammar, facts, styles, and patterns of reasoning all get baked into the parameters as a side effect of relentlessly reducing prediction error.
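The loop described above can be sketched in a few lines, for readers who want to see it run. This is a deliberately tiny stand-in, not how a real model is built: the "model" is just a table with one score per (current word, next word) pair instead of a Transformer, the corpus is a single made-up sentence, and names such as `training_step` and `params` are hypothetical. The four steps, though, are the same ones.

```python
import math
import random

corpus = "the cat sat on the mat the cat sat on the mat".split()
vocab = sorted(set(corpus))
index = {word: i for i, word in enumerate(vocab)}

# The dials: one score per (current word, next word) pair, started at small
# random values (seeded so the run is repeatable).
rng = random.Random(0)
params = [[rng.uniform(-0.1, 0.1) for _ in vocab] for _ in vocab]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def training_step(current, actual_next, learning_rate=0.5):
    """Run steps 2 to 4 for one hidden word; return the loss before the nudge."""
    row = params[index[current]]
    probs = softmax(row)                    # 2. a probability for every token
    target = index[actual_next]
    loss = -math.log(probs[target])         # 3. how wrong was the guess?
    for j in range(len(vocab)):             # 4. gradient descent on this row
        gradient = probs[j] - (1.0 if j == target else 0.0)
        row[j] -= learning_rate * gradient  # (other rows have zero gradient here)
    return loss

def one_pass():
    """Go through the corpus once and return the average loss."""
    pairs = list(zip(corpus, corpus[1:]))   # 1. show a word, hide the next one
    return sum(training_step(a, b) for a, b in pairs) / len(pairs)

def predict(current):
    probs = softmax(params[index[current]])
    return vocab[probs.index(max(probs))]

first_loss = one_pass()
for _ in range(50):
    last_loss = one_pass()

print(first_loss, last_loss)  # the average loss falls as the dials settle
print(predict("cat"))         # the word the table now rates most likely after "cat"
```

Nothing in the code states that "sat" follows "cat". That pattern ends up in the numbers purely because each step made the observed next word slightly more likely.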

Two phases: learn the world, then learn the job

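Training happens in stages, and it helps to separate them: one expensive pre-training stage that builds general ability, followed by cheaper fine-tuning and feedback stages that shape it into an assistant.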

Phase	Input	What it produces	Cost
Pre-training	Vast raw text from the web, books, code	Broad language ability — a "base model"	Enormous (most of the total)
Fine-tuning	Curated examples of good answers	A model that follows instructions	Small by comparison
RLHF / feedback	Human preferences between answers	A helpful, safe, on-topic assistant	Small by comparison

Pre-training is the expensive part — this is where the model reads a large slice of the internet and builds general competence. Fine-tuning and human feedback (the RLHF we met in the history chapter) then shape that raw ability into something pleasant and useful to talk to.

A useful mental image: pre-training is a broad education; fine-tuning is on-the-job training for a specific role.

What the training data is

The base model's abilities are downstream of what it read. Pre-training datasets are huge mixes, typically including:

Web pages — the bulk, filtered for quality.
Books and articles — long-form, well-structured prose.
Code — public repositories, which sharpen reasoning and structure.
Reference text — encyclopaedias, documentation, Q&A sites.

Two consequences fall straight out of this:

The model's "knowledge" reflects its data, including its biases and gaps. If something is rare or absent in the data, the model is weak on it.
The model has a knowledge cutoff — it only saw text up to a certain date, so it doesn't inherently know about events after that unless given the information in the prompt.
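The second point has a practical workaround: put the missing information in the prompt. The sketch below shows one simple way to do that. The function name `build_prompt`, the product "ExampleApp", and the note about it are all invented for illustration, and the prompt wording is an assumption rather than a required format.

```python
def build_prompt(question, fresh_facts):
    """Place up-to-date notes in front of the question."""
    lines = ["Use the notes below when answering.", "", "Notes:"]
    lines += [f"- {fact}" for fact in fresh_facts]
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)

prompt = build_prompt(
    "What changed in the latest release of ExampleApp?",
    ["ExampleApp 4.2 added offline mode."],  # invented fact, for illustration only
)
print(prompt)
```

The model's parameters are untouched by this. It simply has the new fact available in its input when it predicts the answer.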

Why size matters

For years, the reliable way to get a better model was to make it bigger — more parameters, more data, more compute. Larger models could:

Store more patterns, from rare facts to niche writing styles.
Capture subtler relationships that smaller models smear over.
Show abilities, often called emergent, that are weak or absent below a certain scale.

Figure: Capacity grows with parameters, but so do training and running costs (illustrative). The chart compares small (~7B), medium (~70B), and large (~400B) models.

But scale isn't free. Bigger models cost more to train, more to run, and respond more slowly. That tension — capability versus cost and speed — drives a lot of today's engineering.
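A rough back-of-the-envelope calculation shows how quickly those costs grow. The sketch below uses a widely quoted rule of thumb for dense Transformer models: training takes about 6 arithmetic operations per parameter per training token, and generating one token takes about 2 operations per parameter. Both are approximations, and the 1 trillion training tokens is an assumed figure chosen only so the three sizes can be compared like for like.

```python
def training_flops(n_params, n_tokens):
    """Rule-of-thumb training compute: about 6 operations per parameter per token."""
    return 6 * n_params * n_tokens

def inference_flops_per_token(n_params):
    """Rule-of-thumb compute to generate one token: about 2 operations per parameter."""
    return 2 * n_params

TOKENS = 10**12  # assumed: every model is trained on the same 1 trillion tokens

sizes = {"small": 7 * 10**9, "medium": 70 * 10**9, "large": 400 * 10**9}
for name, n_params in sizes.items():
    print(
        name,
        f"train: {training_flops(n_params, TOKENS):.1e} operations",
        f"per generated token: {inference_flops_per_token(n_params):.1e} operations",
    )
```

With the data held fixed, ten times the parameters means ten times the training compute and ten times the work for every token the model produces, which is where the extra cost and slower responses come from.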

Why size isn't everything

Here's the nuance the headlines miss: a smaller, well-made model can beat a bigger, sloppier one. Parameter count is capacity, not quality. What you do with that capacity matters just as much:

Data quality. A model trained on cleaner, more relevant text learns more per parameter. Garbage in, garbage out — at scale.
Training technique. Better methods extract more capability from the same size.
Alignment and tuning. A smaller model that's well-tuned to follow instructions often feels smarter than a larger raw one.
Distillation. Large models can "teach" smaller ones, packing much of the ability into a fraction of the size.
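Distillation is the easiest of these four to show in code. In one common formulation, the small "student" model is trained to match the large "teacher" model's full spread of probabilities, not just its top answer. The sketch below computes that mismatch for a single prediction. The scores are made up, the function names are hypothetical, and the temperature setting (which softens both distributions) is one conventional choice, not the only way to do it.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn scores into probabilities; a higher temperature flattens them."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_scores, student_scores, temperature=2.0):
    """KL divergence from the teacher's softened probabilities to the student's.

    Zero when the student matches the teacher exactly, larger the more they differ.
    """
    teacher = softmax(teacher_scores, temperature)
    student = softmax(student_scores, temperature)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student))

teacher_scores = [4.0, 1.0, 0.5]  # made-up scores for three candidate tokens
close_student = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
far_student = [0.0, 3.0, 1.0]     # prefers a different token

print(distillation_loss(teacher_scores, close_student))
print(distillation_loss(teacher_scores, far_student))
```

During distillation this number plays the role of the error in the training loop: the student's dials are nudged to make it smaller, pulling its predictions toward the teacher's.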

This is why the industry trend has partly reversed from "biggest wins" to "smartest-per-dollar wins." Efficient mid-size models now rival giants from a couple of years earlier.

If you only look at...	You'll miss...
Parameter count	Data quality and training method
Benchmark scores	How it behaves on your real tasks
Raw capability	Cost, speed, and reliability

Recap

A model is an architecture plus billions of parameters — knowledge lives in the numbers, not in written rules.
A parameter is a tunable dial; "70B" means 70 billion of them.
Training is a simple loop — guess the next token, measure the error, nudge the dials — repeated at vast scale.
It happens in phases: pre-training (learn broadly) then fine-tuning and feedback (become a helpful assistant).
Size buys capacity, but data, technique, and tuning matter just as much.

We've seen how the model is built and what's inside it. Next, we use that understanding to explain one of AI's most talked-about flaws: Why AI hallucinates — and what it reveals about how it thinks.