LLM Limitations: What Large Language Models Can't Do

What are the limitations of LLMs? An honest guide to what large language models can't do — hallucination, no real-time knowledge, weak math and counting, no true understanding or memory — and how to work around each one.

We've followed a large language model from raw text to a streamed answer: tokens, embeddings, the Transformer, attention, training, and inference. It's a genuinely remarkable machine. But using it well means being clear-eyed about what it can't do. This final chapter is the honest list — and how to work around each limit.

The root cause of almost everything

Nearly every limitation flows from one fact established in chapter one: an LLM is built to produce plausible text, not true text. It's a next-token predictor optimised for what sounds right given its training data — and "sounds right" and "is right" are not the same thing.

Keep that frame and the rest of this chapter is just the specific ways it shows up.
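To make "plausible, not true" concrete, here is a deliberately tiny sketch: a word-pair counter, not a real LLM. The corpus is hypothetical and constructed so that the false claim is the more frequent one. A real model conditions on far more context, but the principle that frequency drives the prediction is the same.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: the false claim appears more often than the true one.
corpus = [
    "the capital of australia is sydney",
    "the capital of australia is sydney",
    "the capital of australia is canberra",
]

# Count which word follows each word.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, nxt in zip(words, words[1:]):
        follow_counts[current][nxt] += 1

def most_likely_next(word):
    """Return the most frequent follower of `word` in the toy corpus."""
    return follow_counts[word].most_common(1)[0][0]

# Prints the most frequent continuation, which here is the wrong answer.
print(most_likely_next("is"))  # sydney
```

Nothing in the counting step knows or cares which continuation is correct. It only knows which one it saw most.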

1. Hallucination

The most famous limitation: when the model doesn't know something, it often invents a plausible answer rather than admitting ignorance. Typical examples are fake citations, non-existent functions, confidently wrong dates, and plausible-sounding statistics with no source.

This isn't lying — the model has no concept of truth to violate. It's just doing its job: generating likely text. When the likely text happens to be false, you get a hallucination. We cover the mechanism in depth in why AI hallucinates.

Work around it: treat factual output as a draft to verify, ask for sources you can check, and use RAG to ground answers in real documents.
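One small piece of "verify" can be automated. The sketch below assumes you asked the model to quote its sources and that you have the source texts on hand. The function names and the sample text are hypothetical, and a verbatim match is only a first filter: it catches invented quotes but cannot tell you whether a real quote actually supports the claim.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't matter."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def quote_is_supported(quote: str, sources: list[str]) -> bool:
    """True only if the quoted passage appears verbatim (after normalising) in a source."""
    needle = normalize(quote)
    return any(needle in normalize(doc) for doc in sources)

# Hypothetical source text, for illustration only.
sources = ["The knowledge cutoff is the date after which the model saw no training data."]

print(quote_is_supported("the model saw no training data", sources))    # True
print(quote_is_supported("the model updates itself nightly", sources))  # False
```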

2. Frozen, incomplete knowledge

As we saw in training, a model's knowledge is baked in during pretraining and stops at a knowledge cutoff. Two consequences:

It knows nothing after that cutoff — recent events, new releases, today's prices.
It knows nothing private it never saw — your codebase, your company's docs, your customer data.

[Figure: Model lacks the fact → Retrieve from a source → Put it in the prompt → Model answers from it. Caption: Give the model knowledge it lacks by retrieving it at answer time.]

Work around it: connect the model to live tools (search, APIs) and use retrieval to inject current or private facts into the context window at answer time.
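The retrieve-then-prompt flow can be sketched in a few lines. This is a minimal illustration under loud assumptions: the documents are made up, retrieval is naive word overlap (production systems typically use embedding search), and the resulting prompt would be sent to whatever model API you use, which is not shown here.

```python
import re

# Hypothetical private documents the model never saw in training.
DOCUMENTS = [
    "Refund policy: customers may return items within 30 days of delivery.",
    "Support hours: the help desk is open Monday to Friday, 9am to 5pm.",
    "Shipping: standard orders ship within two business days.",
]

def tokenize(text):
    """Split text into a set of lowercase words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, k=1):
    """Rank documents by how many words they share with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, documents):
    """Put the retrieved text into the prompt so the model answers from it."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

prompt = build_prompt("When is the help desk open?", DOCUMENTS)
print(prompt)
```

The instruction to say "you don't know" when the context lacks the answer is a design choice: it gives the model a plausible alternative to inventing one.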

3. Weak at exact operations

Because the model perceives tokens, not characters or numbers, and predicts likely continuations rather than computing results, anything requiring exactness is shaky:

Task	Why it struggles
Counting letters ("R's in strawberry")	Sees tokens, not letters
Multi-step arithmetic	Predicts plausible digits, doesn't calculate
Counting items precisely	No internal counter
Exact quotes / lookups	Reconstructs from patterns, may drift

Work around it: hand these to real tools. Modern systems let the model call a calculator, run code, or query a database instead of guessing — far more reliable than asking it to compute in its head.
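As a rough sketch of what "hand these to real tools" means, the code below defines two hypothetical tools and a dispatcher. The tool names and the `run_tool` helper are illustrative assumptions, not any vendor's tool-calling API. In a real system the model would emit the tool name and arguments, and your code would run the tool and return the result.

```python
import ast
import operator

# Arithmetic operators the hypothetical calculator tool supports.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def calculate(expression: str):
    """Evaluate a plain arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)

def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a letter by looking at actual characters."""
    return word.lower().count(letter.lower())

TOOLS = {"calculate": calculate, "count_letter": count_letter}

def run_tool(name, *args):
    """Dispatch a tool call by name, as a tool-calling loop might."""
    return TOOLS[name](*args)

print(run_tool("count_letter", "strawberry", "r"))  # 3
print(run_tool("calculate", "1234 * 5678 + 9"))     # 7006661
```

Both answers are computed, not predicted, so they are the same every time.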

4. No true understanding or grounding

The model has no beliefs, intentions, or contact with the physical world. It has never seen a sunset or held a cup — it has only read text about them. Its "knowledge" is statistical structure in language, not grounded experience.

This is why it can produce something fluent and authoritative that's also nonsense, and not notice. There's no internal model of reality checking the output against the world — just patterns checking against patterns.

5. No memory and bounded attention

From the context window chapter: the model has no memory between requests, and within a request it can only see what fits in the window. Once a conversation outgrows the window, its earliest parts drop out, and details buried in the middle of long inputs can be overlooked ("lost in the middle").

Work around it: front-load what matters, restate key facts in long sessions, and don't assume it "remembers" anything you didn't re-send.
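One way to apply this advice in code is to pin the key facts at the front and fill the rest of the window with the newest messages. The sketch below is an assumption-heavy illustration: `estimate_tokens` is a crude word count standing in for a real tokenizer, and the budget and messages are made up.

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly one token per word."""
    return len(text.split())

def fit_to_window(system, history, budget):
    """Keep the pinned system message, then as many of the newest messages as fit."""
    remaining = budget - estimate_tokens(system)
    kept = []
    for message in reversed(history):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    return [system] + kept[::-1]  # restore chronological order

# Hypothetical conversation and budget, for illustration only.
system = "Key facts: the customer is on the Pro plan."
history = [
    "user: hello there",
    "assistant: hi, how can I help?",
    "user: what plan am I on?",
]
window = fit_to_window(system, history, budget=22)
print(window)
```

With this budget the oldest message is dropped while the pinned facts survive, which is the behaviour you want when early context would otherwise fall out of the window.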

6. Bias, inconsistency, and sensitivity

A few more to keep on your radar:

Bias. The model reflects patterns — including biases — in its training data.
Inconsistency. Thanks to sampling, the same prompt can give different answers; it may even contradict itself across a long output.
Prompt sensitivity. Small wording changes can meaningfully change results — which is exactly why prompt engineering is a real skill.
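The inconsistency point is easier to see with numbers. This sketch applies the standard softmax-with-temperature formula to made-up scores for three candidate tokens. The scores are hypothetical, and real systems usually combine temperature with other sampling settings.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for t in (1.0, 0.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lower temperature nearly all of the probability lands on the top-scoring token, so repeated runs are much more likely to agree. The other tokens keep a small nonzero chance, so agreement is more likely but not guaranteed.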

A practical operating manual

You don't need to fear these limits — you need to design around them. The reliable pattern:

Verify anything that matters. Output is a fast first draft, not a final authority.
Retrieve, don't trust memory. Use RAG for facts; don't rely on baked-in knowledge for specifics.
Offload exact work to tools. Calculators, code, and databases for anything precise.
Lower the temperature when you need consistency. It reduces variation between runs but does not guarantee identical output.
Keep a human in the loop for high-stakes decisions.

Do this, and the limitations stop being traps. You get the model's real strengths — fluency, breadth, speed, transformation — while covering its weaknesses with verification and tools.

Recap

Almost every limitation traces to one fact: the model optimises for plausible, not true.
Hallucination (confident invention), frozen knowledge, and weak exact operations are the big three.
It has no true understanding, no memory between requests, and bounded attention, and a fluent, authoritative tone is no guarantee of accuracy.
It can be biased, inconsistent, and sensitive to prompt wording.
Work with it by verifying, retrieving facts, offloading exact tasks to tools, lowering temperature, and keeping humans in the loop.

That completes How Large Language Models Work — from a single next-token prediction to the full, honest picture of the machine. The natural next step is learning to drive it well. Continue to the Prompt Engineering guide to turn this understanding into results.