Features, Labels, and Why Data Quality Is Everything in ML

Practitioners say models are only as good as their data — and they mean it literally. Here's what features and labels actually are, why feature choice often beats algorithm choice, and the data problems (bias, leakage, imbalance) that quietly wreck models.

Three chapters in, one phrase keeps returning: a model is only as good as its data. We've treated it as a slogan. This chapter makes it concrete — because the parts of machine learning that decide success or failure are usually not the algorithm, but the data you feed it and the answers you train it against.

Practitioners have a saying: most of the job isn't building clever models, it's wrangling data. Here's why that's true, starting with the two words at the heart of it.

Features and labels: the two halves of an example

Back in chapter one we said supervised learning is built from examples that pair an input with an answer. Those two parts have names.

A feature is an input — a measurable fact about the thing you're studying. For a house: its size, number of bedrooms, location, age. Each is a feature. Together they're the information the model gets to look at.
A label is the answer — the thing you want the model to predict. For that house: its actual sale price.

The model's whole job in supervised learning is to learn the relationship between the features and the label: given these inputs, predict that answer.

Model	Features (inputs)	Label (answer)
House price model	Size, bedrooms, location, age	Sale price
Spam filter	Words used, sender, number of links	Spam / not spam
Medical model	Symptoms, test results, age	Diagnosis
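
In code, the house row of the table amounts to splitting each example into two parts. The sketch below is one minimal way to picture that in Python. The field names and the three house records are invented for illustration, and location is left out only to keep every feature numeric.

```python
# Hypothetical house records: every value here is invented for illustration.
houses = [
    {"size_sqm": 120, "bedrooms": 3, "age_years": 15, "sale_price": 310_000},
    {"size_sqm": 85, "bedrooms": 2, "age_years": 40, "sale_price": 195_000},
    {"size_sqm": 200, "bedrooms": 5, "age_years": 5, "sale_price": 540_000},
]

FEATURE_NAMES = ["size_sqm", "bedrooms", "age_years"]  # what the model sees
LABEL_NAME = "sale_price"                              # what it must predict


def split_features_and_label(record):
    """Return (features, label) for one example."""
    features = [record[name] for name in FEATURE_NAMES]
    label = record[LABEL_NAME]
    return features, label


# By convention the feature rows are collected as X and the labels as y.
X = []
y = []
for house in houses:
    features, label = split_features_and_label(house)
    X.append(features)
    y.append(label)

print(X)  # one row of features per house
print(y)  # one label per house
```

The X and y names are only a common convention; nothing about the idea depends on them.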

In unsupervised learning (from chapter two) there are features but no labels — that's exactly what makes it unsupervised. For the rest of this chapter we'll focus on the labelled, supervised case, since that's where features and labels both come into play.

Why features often matter more than the algorithm

Here's something that surprises newcomers: which features you give the model frequently matters more than which algorithm you pick. A great algorithm fed poor features loses to a simple algorithm fed insightful ones.

Suppose you're predicting whether someone will repay a loan. Hand the model a raw birth date and it has to work hard to extract anything useful. Hand it age, years in current job, and debt-to-income ratio — facts you computed from the raw data — and even a simple model does well, because you've handed it the signal directly.

That craft of turning raw data into informative features is called feature engineering, and it's where experienced practitioners spend much of their time.
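
For the loan example, the step from raw fields to those three features might look roughly like this. It is a sketch under assumptions: the raw field names, the sample values, and the `as_of` date are all hypothetical, and real lending data would need far more care.

```python
from datetime import date

# Hypothetical raw loan application; field names and values are invented.
raw_application = {
    "birth_date": date(1990, 6, 15),
    "job_start_date": date(2018, 3, 1),
    "monthly_debt_payments": 1_200.0,
    "monthly_income": 4_000.0,
}


def whole_years_between(start, end):
    """Count completed years from start to end."""
    years = end.year - start.year
    # Subtract one if this year's anniversary has not been reached yet.
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


def engineer_loan_features(raw, as_of):
    """Turn raw fields into the three features named in the text."""
    return {
        "age": whole_years_between(raw["birth_date"], as_of),
        "years_in_current_job": whole_years_between(raw["job_start_date"], as_of),
        "debt_to_income": raw["monthly_debt_payments"] / raw["monthly_income"],
    }


loan_features = engineer_loan_features(raw_application, as_of=date(2024, 1, 10))
print(loan_features)
```

Passing the reference date in explicitly, rather than reading today's date inside the function, keeps the features reproducible when the same record is processed again later.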

Figure: raw data (birth date, timestamps, text) → transform into useful features (age, weekday, word counts) → model learns far more easily. Feature engineering: raw data rarely arrives model-ready; you shape it into signal.

Simple transformations punch above their weight:

A birth date → an age the model can use directly.
A timestamp → "weekday or weekend?", "time of day" — often what actually drives behaviour.
A blob of text → counts of key words, length, presence of links.
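
The timestamp and text transformations in the list above can be sketched with nothing but the Python standard library. The hour boundaries for "time of day", the key words, and the sample message are assumptions chosen for illustration; a real project would pick them from knowledge of the problem.

```python
import re
from datetime import datetime


def timestamp_features(ts):
    """Derive coarse calendar features from a datetime."""
    hour = ts.hour
    # These hour bands are an arbitrary choice for illustration.
    if 6 <= hour < 12:
        part_of_day = "morning"
    elif 12 <= hour < 18:
        part_of_day = "afternoon"
    elif 18 <= hour < 24:
        part_of_day = "evening"
    else:
        part_of_day = "night"
    return {
        # weekday() returns 0 for Monday through 6 for Sunday.
        "is_weekend": ts.weekday() >= 5,
        "part_of_day": part_of_day,
    }


def text_features(text, key_words):
    """Count key words, measure length, and flag links in a text blob."""
    words = re.findall(r"[a-z']+", text.lower())
    features = {f"count_{w}": words.count(w) for w in key_words}
    features["length_chars"] = len(text)
    features["has_link"] = bool(re.search(r"https?://", text))
    return features


message = "Free prize! Claim your FREE gift at http://example.com"
when = datetime(2024, 3, 9, 21, 30)  # a Saturday evening

time_feats = timestamp_features(when)
text_feats = text_features(message, ["free", "prize"])
print(time_feats)
print(text_feats)
```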

None of this is glamorous and none of it is math-heavy. It's mostly knowing your problem well enough to point the model at what matters.

Garbage in, garbage out — the failure modes

Because the model learns everything from the data, any flaw in the data becomes a flaw in the model. Three flaws cause an outsized share of real-world failures.

Bias: the model inherits your data's prejudices

If your training data reflects human bias, the model learns that bias and applies it at scale — with a misleading air of mathematical objectivity. A hiring model trained on a company's past decisions will reproduce whatever patterns were in those decisions, including the unfair ones. The model isn't malicious; it's faithfully learning the examples it was given.
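
A toy illustration of that inheritance follows. It is a deliberately crude lookup "model" that memorises the majority past decision, not a real learning algorithm, and the groups and decisions are invented. The point is only that nothing in the code is unfair on its own; the skew comes entirely from the examples.

```python
from collections import defaultdict

# Invented past decisions: (group, qualified, hired). Qualified group B
# candidates were hired less often than qualified group A candidates.
history = [
    ("A", True, True), ("A", True, True), ("A", True, True), ("A", False, False),
    ("B", True, False), ("B", True, False), ("B", True, True), ("B", False, False),
]


def train_majority_model(rows):
    """Learn the most common past decision for each (group, qualified) pair."""
    tallies = defaultdict(lambda: [0, 0])  # key -> [not hired, hired]
    for group, qualified, hired in rows:
        tallies[(group, qualified)][int(hired)] += 1
    return {key: counts[1] > counts[0] for key, counts in tallies.items()}


model = train_majority_model(history)

# Two equally qualified candidates who differ only in group membership.
print(model[("A", True)])  # prediction for the group A candidate
print(model[("B", True)])  # prediction for the group B candidate
```

The two candidates get different predictions purely because the historical decisions differed.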

Data leakage: a clue from the future

Leakage is subtle and brutal. It happens when a feature secretly contains information the model wouldn't have at prediction time — often the answer itself, in disguise.

Classic example: a model predicting whether a patient has a disease, fed a feature like "was prescribed the treatment for that disease." Of course that predicts the diagnosis almost perfectly, but the prescription only exists after the diagnosis was made. In testing, the model looks flawless. In real use, where that feature isn't available yet, it's worthless. Leakage is one of the top reasons a model aces the lab and faceplants in production.

Figure: in testing (leaky), 99%; in production, 55%. Leakage's deceptive signature: suspiciously perfect in testing, useless in the real world.

The cure is vigilance: for every feature, ask "would I actually know this at the moment I need to predict?" If the answer is no — or "only because the outcome already happened" — it's leakage.
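
One way to make that question routine is to write down, for every feature, when its value becomes known, then filter on it. The catalogue below is a hypothetical sketch for the disease example; the feature names and the two availability tags are assumptions, and in practice deciding the tags is the hard part.

```python
# Hypothetical feature catalogue for the disease example. Each tag records
# when the value becomes available relative to the diagnosis being made.
FEATURE_CATALOGUE = {
    "symptoms": "before_diagnosis",
    "test_results": "before_diagnosis",
    "age": "before_diagnosis",
    "prescribed_treatment_for_disease": "after_diagnosis",
}


def split_by_availability(catalogue):
    """Separate features that are safe to use from likely leaks."""
    safe = [name for name, when in catalogue.items() if when == "before_diagnosis"]
    leaky = [name for name, when in catalogue.items() if when != "before_diagnosis"]
    return safe, leaky


safe_features, leaky_features = split_by_availability(FEATURE_CATALOGUE)
print("keep:", safe_features)
print("drop:", leaky_features)
```

A check like this catches only the leaks someone has tagged honestly; it is a habit, not a guarantee.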

Imbalance: when the rare case is the important one

Some of the most valuable problems are about rare events: fraud, disease, equipment failure. If 99% of transactions are legitimate, a lazy model can score 99% accuracy by labelling everything legitimate — and catch exactly zero fraud. The headline number looks great; the model is useless.

This connects straight back to overfitting and honest evaluation: a single accuracy figure can hide total failure on the cases you actually care about. Imbalanced problems need deliberate handling: gathering more of the rare examples, or measuring success with metrics that don't let the model ignore them, such as recall (the share of real fraud cases the model actually catches).
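
The arithmetic behind the lazy model is easy to check. The sketch below uses invented data (1,000 transactions, 10 of them fraudulent) and two hand-written metrics: accuracy, and recall, meaning the share of actual fraud cases that were caught.

```python
# Invented data: 1,000 transactions, 10 of them fraudulent (1 = fraud).
labels = [1] * 10 + [0] * 990

# The "lazy model": predict legitimate for every transaction.
predictions = [0] * len(labels)


def accuracy(y_true, y_pred):
    """Share of all predictions that are correct."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of actual positive cases that the model caught.

    Assumes y_true contains at least one positive case.
    """
    actual_positives = sum(t == positive for t in y_true)
    caught = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return caught / actual_positives


print(accuracy(labels, predictions))  # 0.99
print(recall(labels, predictions))    # 0.0
```

Reporting both numbers side by side is what exposes the problem: the accuracy looks excellent while the recall shows that no fraud was caught at all.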

How much data, and how good?

Two dials, both matter:

Quantity — more examples generally help, especially against overfitting, because there's more pattern to learn and less room to memorise. But ten thousand clean examples usually beat a million junk ones.
Quality — accurate labels, consistent measurements, and good coverage of the situations the model will actually face. A model trained only on sunny-day driving footage is dangerous the first time it rains, no matter how many sunny clips it saw.
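
Coverage, at least, can be checked mechanically before training. This is a rough sketch assuming each training clip carries a weather tag; the tags, the list of expected conditions, and the 1% threshold are all made up for illustration.

```python
from collections import Counter

# Invented metadata: the weather condition tagged on each training clip.
training_conditions = ["sunny"] * 950 + ["cloudy"] * 50

# Conditions the deployed model is expected to face (an assumption here).
expected_conditions = ["sunny", "cloudy", "rain", "snow"]


def coverage_gaps(seen, expected, min_share=0.01):
    """Return expected conditions that are missing or below min_share."""
    counts = Counter(seen)
    total = len(seen)
    return [c for c in expected if counts[c] / total < min_share]


gaps = coverage_gaps(training_conditions, expected_conditions)
print(gaps)  # ['rain', 'snow']
```

An empty result would not prove the data is good, only that no listed condition is missing outright.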

Recap

A feature is an input the model sees; a label is the answer it's trained to predict.
Feature engineering — shaping raw data into informative features — often matters more than the choice of algorithm.
Bias in the data becomes bias in the model, automated and authoritative.
Data leakage — a feature that smuggles in the answer — makes a model look perfect in testing and fail in reality.
Imbalance lets a model hide total failure behind a high accuracy number; rare-but-important cases need deliberate care.
Quality usually beats quantity — clean, representative data wins.

We now have the full picture of how a model learns and what it learns from. The last question is practical: how does a model actually go from an idea in someone's head to a working system real people rely on? That's the journey we close with — How a machine learning model goes from idea to deployment.