Code Safari

Expedition 10 · Beginner · 13 min read

Training, Testing, and Why Models Overfit (Explained Simply)

A model that scores perfectly on its own study material can still be useless. This is the most important idea in machine learning: training vs. testing, overfitting vs. underfitting, and why a model that memorises is a model that fails.

June 17, 2026

We've established the goal of machine learning: not to memorise examples, but to generalise — to perform well on new inputs it has never seen (chapter one). This chapter is about the single most important practice that makes generalisation real, and the failure that haunts every project when it doesn't.

If you take one technical idea away from this whole guide, make it this one. It's the difference between a model that looks brilliant in a demo and one that actually works in the wild.

The trap: grading a model on its own homework

Imagine a student who gets the answer key to an exam, memorises every question-and-answer pair, then sits the exact same exam. They score 100%. Are they a genius? You have no idea — because you tested them on the very material they memorised. The score tells you nothing about whether they understand the subject.

Machine learning has exactly this trap. If you train a model on some examples and then measure how well it does on those same examples, a model that simply memorised them will look perfect. The score is meaningless. You learn nothing about how it handles the real world.

The fix: split your data

The solution is simple and it underpins all of machine learning. Before training, you split your examples into two piles:

  • The training set — the model learns from this. It sees these examples and adjusts to them.
  • The test set — locked away during training. The model never sees it while learning. You use it only at the end, to measure performance on genuinely fresh data.

A common split is something like 80% for training and 20% for testing. The test set is your stand-in for "the real world the model will face later."

[Diagram: all labelled data → split (~80% train / ~20% test) → train on the training set → score on the held-out test set]
The train/test split: learn on one slice, judge on a slice the model never saw

Now the score means something. If the model does well on the test set, data it never studied, it has almost certainly captured a real pattern rather than just memorised. That's generalisation, measured.
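To make the split concrete, here is a minimal sketch in plain Python. The helper name `train_test_split`, the fixed seed, and the made-up house data are all assumptions for illustration; libraries offer their own versions of this step. The point is only that the examples are shuffled once and then cut into two piles that never overlap.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples, then cut it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labelled data: (house size in square metres, price) pairs.
examples = [(size, 1000 * size) for size in range(50, 150)]

train, test = train_test_split(examples)
print(len(train), len(test))  # 80 20
```

Shuffling before cutting matters: if the data arrived sorted (say, by price), an unshuffled cut would put all the expensive houses in one pile.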

Overfitting: when a model memorises instead of learns

Here's the failure that hold-out testing is designed to expose. Overfitting is when a model learns its training examples too well, noise, quirks, and accidents included, and as a result fails on anything new.

Picture predicting house prices. The real pattern is roughly "bigger house, better location, higher price." But an overfit model goes further: it latches onto irrelevant accidents in your specific data — "houses sold on a Tuesday went for slightly more," "the one with a red door was pricey." Those weren't real patterns; they were coincidences in your particular examples. The model treated noise as signal.

The result is the classic overfitting signature: excellent on training data, poor on test data. It memorised the study set and can't handle the exam.

Model | Training accuracy | Test accuracy
Good model | 92% | 89%
Overfit model | 99% | 67%
The signature of overfitting: a big gap between training and test accuracy

See the contrast. The good model scores similarly on both — it learned something that transfers. The overfit model is near-perfect on training but collapses on the test set. That gap between training and test performance is the alarm bell every practitioner watches for.
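The gap is easy to reproduce on toy data. The sketch below is an illustration under stated assumptions, not a real training pipeline: the data is synthetic (the true pattern is "label is 1 when x > 0.5", with 15% of labels flipped as noise), the "memoriser" is just a lookup table, and the simple rule is hard-coded rather than learned.

```python
import random

rng = random.Random(42)

def make_example():
    """Synthetic example: real pattern is x > 0.5, with 15% label noise."""
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.15:
        label = 1 - label  # flip the label to simulate noise
    return x, label

train = [make_example() for _ in range(200)]
test = [make_example() for _ in range(200)]

# Model A: pure memorisation. A lookup table of every training example,
# with a fixed guess of 0 for any input it has not seen before.
memory = dict(train)

def memoriser(x):
    return memory.get(x, 0)

# Model B: a one-line rule matching the real pattern (hard-coded here for
# brevity; a real model would have to learn the threshold from `train`).
def simple_rule(x):
    return int(x > 0.5)

def accuracy(model, examples):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in examples) / len(examples)

for name, model in [("memoriser", memoriser), ("simple rule", simple_rule)]:
    train_acc = accuracy(model, train)
    test_acc = accuracy(model, test)
    print(f"{name}: train {train_acc:.0%}, test {test_acc:.0%}, "
          f"gap {train_acc - test_acc:+.0%}")
```

The memoriser reproduces every training label, noise included, so it is perfect on the training set and reduced to guessing on the test set. The simple rule gets the noisy examples wrong in both sets, so its two scores land close together.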

Underfitting: the opposite mistake

You can also err the other way. Underfitting is when a model is too simple to capture the real pattern at all. It does poorly on the training data and on new data — it never learned enough to begin with.

If overfitting is a student who memorised the answer key without understanding, underfitting is a student who barely studied and grasps neither the practice questions nor the exam.

Aspect | Underfitting | Just right | Overfitting
Model complexity | Too simple | Balanced | Too complex
Training performance | Poor | Good | Excellent (suspiciously)
Test performance | Poor | Good | Poor
The student | Didn't study enough | Understood the topic | Memorised the answer key

The art of machine learning lives in that middle column: a model complex enough to catch the genuine pattern, but not so complex it starts memorising noise. Too simple and you miss the signal; too complex and you drown in the noise.
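One way to see all three columns at once is to take a single model and turn its complexity dial. The sketch below uses a tiny nearest-neighbours classifier as a stand-in (a choice made for this example, not something the chapter prescribes) on synthetic data with 15% label noise. Note that the dial runs backwards here: a small `k` means a more complex model, because each prediction follows just a handful of training points.

```python
import random

rng = random.Random(7)

def make_example():
    """Synthetic example: real pattern is x > 0.5, with 15% label noise."""
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.15:
        label = 1 - label
    return x, label

train = [make_example() for _ in range(201)]
test = [make_example() for _ in range(201)]

def knn_predict(x, k):
    """Majority vote among the k training points closest to x (k is odd)."""
    nearest = sorted(train, key=lambda ex: abs(ex[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > k)

def accuracy(k, examples):
    """Fraction of examples that a k-neighbour vote labels correctly."""
    return sum(knn_predict(x, k) == y for x, y in examples) / len(examples)

settings = [
    (len(train), "too simple"),  # every point votes: always the majority label
    (25, "balanced"),            # a local vote that smooths over noisy labels
    (1, "too complex"),          # copies the single nearest training label
]
for k, description in settings:
    print(f"k={k:>3} ({description}): "
          f"train {accuracy(k, train):.0%}, test {accuracy(k, test):.0%}")
```

With `k` equal to the whole training set the model gives the same answer for every input, so it does poorly everywhere. With `k=1` each training point is its own nearest neighbour, so training accuracy is perfect while the noisy labels it copied hurt it on the test set.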

How practitioners fight overfitting

Overfitting is the everyday enemy, so there's a standard toolkit for it:

  • More data, more variety. The richer and more varied your examples, the harder it is for the model to memorise its way through — there's simply too much to memorise, so learning the real pattern becomes the easier path. This is the most reliable fix, which is part of why data matters so much (the subject of the next chapter).
  • Simpler models. Deliberately limiting how complex a model can get stops it from fitting every quirk. Sometimes a plainer model that generalises beats a fancy one that memorises.
  • A validation set. Beyond train and test, teams often hold out a third slice — a validation set — used during development to tune the model and catch overfitting early, keeping the final test set pristine for one honest end-of-line measurement.
  • Early stopping. Watch performance on held-out validation data as training proceeds and stop once it stops improving, before the model slides into memorising.

[Diagram: train set (learn) → validation set (tune and catch overfitting) → test set (final honest score)]
The three-way split many teams actually use
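A three-way split can be sketched the same way as a two-way one. The 70/15/15 proportions and the helper name `three_way_split` below are assumptions chosen for illustration; the chapter does not prescribe specific sizes.

```python
import random

def three_way_split(examples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle a copy of the examples and cut it into (train, val, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    n_test = round(len(shuffled) * test_fraction)
    test = shuffled[:n_test]                  # locked away until the very end
    val = shuffled[n_test:n_test + n_val]     # used while tuning the model
    train = shuffled[n_test + n_val:]         # the model learns from this
    return train, val, test

# Stand-in data: 1000 example IDs instead of real labelled examples.
train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))  # 700 150 150
```

The discipline is in how the slices are used, not in the code: tune against `val` as often as you like, and touch `test` once.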

None of these require math to understand. They're all variations on one instinct: keep the model honest by constantly checking it against data it hasn't seen.
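Early stopping in particular comes down to a few lines of bookkeeping. The sketch below assumes you already have one validation loss per training round (the numbers are made up), and it adds a `patience` setting, a common refinement that waits a couple of rounds before giving up so that one unlucky round does not end training. In a real training loop the losses would arrive one at a time rather than as a finished list.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the training round whose model we would keep.

    Training stops once the validation loss has gone `patience` rounds
    without beating its best value so far.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss  # new best: remember it
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_epoch

# Made-up validation losses: improving at first, then creeping back up
# as the model starts to memorise the training set.
val_losses = [0.90, 0.62, 0.48, 0.41, 0.43, 0.47, 0.55, 0.70]
print(early_stopping_epoch(val_losses))  # 3
```

Here training would halt at round 5 and keep the model from round 3, the last point where the validation loss was still falling.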

Why this is the idea that matters most

Almost every machine-learning disaster traces back to this chapter. A fraud model that aced its tests but missed real fraud? Likely overfit to historical quirks. A demo that dazzled but flopped in production? The demo was probably run on training data. A model that "worked last year" but degrades now? The world drifted away from the examples it memorised.

Hold "train on one slice, judge on another, and mind the gap" in your head and you'll see through a huge fraction of AI hype — and understand why the boring work of data and evaluation matters more than the choice of algorithm.

Recap

  • Never grade a model on its training data — that measures memory, not understanding.
  • Split your data: train on one slice, test on a held-out slice the model never saw.
  • Overfitting = memorising noise; great on training, poor on test (watch for the gap).
  • Underfitting = too simple; poor on both. The goal sits in the middle.
  • Fight overfitting with more/varied data, simpler models, a validation set, and early stopping.

We've leaned hard on one phrase: "more and better data." But what is the data, exactly — what goes in, what answer comes out, and what makes data "good"? That's the foundation everything else rests on, and it's where we go next: Features, labels, and why data quality is everything.
