From GPT to ChatGPT: A Short History of the LLM Era

How did we get from autocomplete to AI that writes essays and code? A clear, non-technical history of large language models — the Transformer, the GPT series, the scaling race, and the ChatGPT moment that changed everything.

Five years ago, "AI writes your email" sounded like science fiction. Today it's a checkbox in your inbox. The technology behind that leap — the large language model, or LLM — didn't appear overnight, but it did arrive astonishingly fast.

This chapter is the short version of how we got here. You don't need the math or the politics — just the few key moments that turned next-token prediction (from the previous chapter) into a world-changing tool.

Before 2017: the awkward years

Computers have tried to handle language for decades. Early systems used hand-written grammar rules — brittle and easy to break. Later ones used statistics over word sequences, which worked for autocomplete but fell apart over long sentences.

The deep-learning era brought RNNs and LSTMs — networks that read text one word at a time, carrying a memory forward. They were a real step up, but they had two stubborn problems: they read sequentially (slow, hard to scale) and they forgot the start of long passages by the time they reached the end.

The field needed an architecture that could look at all the words at once and figure out which ones mattered. In 2017, it got one.

The moment everything changed

2017: The Transformer
Google's paper 'Attention Is All You Need' introduces an architecture that processes all tokens in parallel and learns which ones relate to which. It scales far better than anything before it.
2018: GPT-1 and BERT
OpenAI and Google show that pre-training one big model on raw text, then fine-tuning it, beats building separate models per task.
2019: GPT-2
A much larger model writes coherent paragraphs. It was considered striking enough that its full release was initially staged out of caution.
2020: GPT-3
175 billion parameters. It can do new tasks from just a few examples in the prompt, with no retraining needed. The power of scale becomes undeniable.
2022: ChatGPT
A fine-tuned, chat-friendly model behind a simple interface reaches 100M users in two months. AI goes mainstream.
2023: GPT-4 and rivals
More capable, multimodal models arrive; Anthropic, Google, Meta and others ship serious competitors. The race is on.
2024+: Cheaper, faster, multimodal
Models handle images and audio, process long documents, run at a fraction of the cost, and get embedded into everyday software.

The key milestones from research paper to mainstream technology

Let's unpack the four ideas that did the heavy lifting.

Idea 1: the Transformer and "attention"

The 2017 Transformer introduced a mechanism called attention. In plain terms: when the model processes a word, attention lets it look at every other word in the text and decide how relevant each one is to the current word.

In "the trophy didn't fit in the suitcase because it was too big," attention is what helps the model link "it" to "trophy" rather than "suitcase." It learns these relationships rather than being told them.
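For readers who like to see the mechanics, here is a minimal sketch of the attention calculation in plain Python. It is an illustration under loud assumptions, not how any production model is written: the two-number word vectors are hand-picked for the example (real models learn them during training and use hundreds or thousands of dimensions), and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product: larger when two vectors point the same way."""
    return sum(x * y for x, y in zip(a, b))

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # Relevance of every word to the current word
    scores = [dot(query, k) / scale for k in keys]
    weights = softmax(scores)
    # Blend the value vectors, weighted by relevance
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Assumption: toy vectors chosen by hand so that "it" resembles "trophy"
words = ["trophy", "suitcase", "big"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = keys
query_for_it = [2.0, 0.0]

weights, output = attention(query_for_it, keys, values)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

With these toy numbers, "trophy" receives the largest weight and "suitcase" the smallest. In a trained model nobody picks the vectors; they are adjusted automatically until the weights land in useful places.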

Two things made this a breakthrough:

Parallelism. Unlike older models that crawled word by word, a Transformer can process every token in a passage at once during training. That makes training dramatically faster, which means you can train on far more data.
Scalability. The architecture kept getting better as you made it bigger. That single property set off everything that followed.

Idea 2: pre-training, then adapting

Before GPT, you built a separate model for each task: one for translation, one for sentiment, one for summarisation. Each needed its own labelled dataset.

GPT flipped the recipe. Pre-train one model on a giant pile of general text so it learns language broadly, then adapt that single model to specific tasks. One generalist outperformed a roomful of specialists — and you only had to pay the enormous training cost once.

This is why we talk about "foundation models." You build the foundation once, then everyone builds on top of it. We'll look at what that training actually involves in What's inside an AI model.

Idea 3: scale, and the surprises it brought

From GPT-2 to GPT-3, the recipe barely changed. What changed was size — more data, more parameters, more compute. And it kept working. Each jump in scale brought a jump in capability.

Model (year)	Parameters
GPT-1 (2018)	117M
GPT-2 (2019)	1.5B
GPT-3 (2020)	175B

Approximate parameter counts across the GPT line (illustrative; newer models' sizes are not public)
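To make the size of each jump concrete, the short sketch below divides each model's approximate parameter count by its predecessor's. The numbers are the ones quoted in this chapter; the function name is hypothetical.

```python
# Approximate parameter counts quoted in this chapter
params = {"GPT-1": 117e6, "GPT-2": 1.5e9, "GPT-3": 175e9}

def growth_factors(counts):
    """Return (older, newer, factor) for each consecutive pair of models."""
    names = list(counts)
    return [(a, b, counts[b] / counts[a]) for a, b in zip(names, names[1:])]

for older, newer, factor in growth_factors(params):
    print(f"{older} -> {newer}: about {factor:.0f}x more parameters")
```

The first jump is roughly 13x and the second roughly 117x, so the step to GPT-3 was far larger than the step before it.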

The genuinely strange part: scaling unlocked abilities nobody explicitly trained for. GPT-3 could do basic arithmetic, translate between languages, and pick up a new task from examples given in the prompt. These are often called emergent skills: smaller versions of the same kind of model handled them poorly or not at all. Make the model big enough, and new behaviour appears.

This observation — scale reliably buys capability — became the field's guiding strategy and justified the eye-watering cost of training ever-larger models.
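"Following examples given in the prompt" is easier to picture with a concrete prompt. The sketch below only builds the text you would send to a model; it does not call one. The English/French layout and the function name are assumptions chosen for illustration, not a format any particular model requires.

```python
def build_few_shot_prompt(examples, new_input):
    """Format a few worked examples followed by one left unanswered."""
    lines = []
    for source, target in examples:
        lines.append(f"English: {source}")
        lines.append(f"French: {target}")
        lines.append("")  # blank line between examples
    # The final example is left open for the model to complete
    lines.append(f"English: {new_input}")
    lines.append("French:")
    return "\n".join(lines)

examples = [("cheese", "fromage"), ("good morning", "bonjour")]
prompt = build_few_shot_prompt(examples, "thank you")
print(prompt)
```

A large enough model tends to continue this text with a French translation, because that is the most plausible next piece of the pattern. No retraining happens; the examples live only in the prompt.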

Idea 4: making models actually helpful

Raw, pre-trained models are powerful but awkward. Ask one a question and it might continue with more questions, because that's a plausible text continuation. It was a brilliant autocomplete, not an assistant.

The fix was alignment — most famously RLHF, reinforcement learning from human feedback. Humans rank the model's responses, and the model is tuned to produce the kind of answers people prefer: helpful, on-topic, instruction-following.

Stage	What it produces
Pre-training	A model that completes text
Instruction tuning	A model that follows instructions
RLHF / feedback	A model that's helpful, safe, and on-topic
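The "humans rank the responses" step can be turned into a number a training process can act on. One common formulation, sketched here as an assumption rather than as the method any particular lab uses, trains a separate scoring model with a pairwise loss: the loss is small when the answer people preferred gets the higher score. The scores below are made up for illustration.

```python
import math

def preference_loss(score_preferred, score_rejected):
    """Pairwise loss -log(sigmoid(margin)): low when the preferred answer scores higher."""
    margin = score_preferred - score_rejected
    # Two algebraically equal forms, chosen so exp() never sees a large positive number
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical scores for two answers to the same question
agrees_with_humans = preference_loss(score_preferred=2.0, score_rejected=0.0)
disagrees_with_humans = preference_loss(score_preferred=0.0, score_rejected=2.0)

print(f"scorer agrees with the human ranking:    {agrees_with_humans:.3f}")
print(f"scorer disagrees with the human ranking: {disagrees_with_humans:.3f}")
```

Training nudges the scorer to shrink this loss across many ranked pairs, and the language model is then tuned toward answers the scorer rates highly.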

RLHF is the quiet hero of the ChatGPT story. The underlying model existed before; alignment is what made it feel like talking to a capable assistant instead of a strange text generator.

The ChatGPT moment

In November 2022, OpenAI wrapped an aligned model in a plain chat interface and released it for free. That packaging — not a new breakthrough so much as the right model behind the right door — was the spark.

ChatGPT reached an estimated 100 million users in about two months, which analysts at the time called the fastest adoption of any consumer app. Suddenly everyone, not just researchers, could feel what these models could do. Every major tech company reprioritised around AI within months.

Era	Defining question
2017–2020	"Can a model write coherently?"
2020–2022	"How big can we go, and what emerges?"
2022–2023	"Can anyone use it?"
2023–now	"How reliable, cheap, and capable can it get?"

Where things stand now

The frontier has shifted from raw capability to reliability and reach:

Multimodal — models read images, hear audio, and generate across formats, not just text.
Cheaper and faster — costs have fallen sharply, putting capable models in everyday apps.
Longer context — models can now read whole books or codebases at once (more on this in Tokens and context windows).
Embedded everywhere — from search to spreadsheets to code editors.

The open questions are no longer "can it talk?" but "can we trust it, afford it, and control it?" — which is exactly where the rest of this guide goes.

Recap

The 2017 Transformer and its attention mechanism made scalable language models possible.
GPT introduced the pre-train-then-adapt recipe; one generalist beat many specialists.
Scaling kept working and produced emergent abilities nobody trained for.
RLHF / alignment turned a raw text predictor into a helpful assistant.
ChatGPT (2022) put it in everyone's hands and launched the current AI era.

Now that you know how we got here, let's open the box. Next: What's inside an AI model — training, parameters, and why size matters.