If you’ve used a large language model for anything factual – a date, a citation, a person’s biography – you’ve probably encountered it: the model answers with complete fluency and total confidence, and is simply wrong. Not wrong in the way a confused student is wrong. Wrong in the way a very articulate person is wrong when they have never heard of the topic but have decided to answer anyway.
This phenomenon has been labeled hallucination, borrowing from psychiatry to describe outputs that feel real but aren’t grounded in reality. It’s arguably the most important unsolved problem in applied AI right now, and the research landscape around it is more nuanced than most public discussion suggests.
What’s Actually Happening
The term hallucination is useful as shorthand but slightly misleading as a description of mechanism. Language models don’t “make things up” in any intentional sense. They do something more specific and more interesting: they generate text that is highly plausible under their training distribution, without any guaranteed connection to factual ground truth.
During pre-training, the model learns statistical patterns in text. It learns that certain phrases follow certain other phrases. It learns associations between entities, attributes, and relationships. But it doesn’t learn a fact-checked world model – it learns a statistical shadow of the world as described in text.
This creates a few distinct failure modes that researchers tend to separate carefully:
Intrinsic hallucination occurs when the model’s output contradicts information present in the input context. If you paste in a paragraph and ask a question the paragraph answers, and the model gives a different answer, that’s intrinsic.
Extrinsic hallucination occurs when the model generates information that’s neither supported nor contradicted by context – it’s simply added. Of these two context-based failure modes, this is the more common one in open-ended generation.
Knowledge boundary confusion is arguably the most common issue at scale: the model has some representation of a topic but that representation is noisy, partial, or outdated, and the model doesn’t know where the edge of its knowledge is.
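The first two categories can be written down as a small decision rule. The sketch below is only an illustration: it assumes some upstream checker (a human annotator or an entailment model, neither shown here) has already judged a claim against the input context, and the names `Verdict` and `classify_claim` are hypothetical. Knowledge boundary confusion is left out because it cannot be read off the context alone.

```python
from enum import Enum


class Verdict(Enum):
    """How a generated claim relates to the input context (assumed to be given)."""
    SUPPORTED = "supported"
    CONTRADICTED = "contradicted"
    NOT_MENTIONED = "not_mentioned"


def classify_claim(verdict: Verdict) -> str:
    """Map a context verdict onto the intrinsic/extrinsic distinction."""
    if verdict is Verdict.CONTRADICTED:
        # The output disagrees with what the context says.
        return "intrinsic hallucination"
    if verdict is Verdict.NOT_MENTIONED:
        # The output adds something the context neither supports nor contradicts.
        return "extrinsic hallucination"
    return "grounded"


for v in Verdict:
    print(v.value, "->", classify_claim(v))
```

Note that an extrinsic claim may still happen to be true; the label only says the context cannot confirm it.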
Why Fluency Makes It Worse
There’s a cruel irony in how these models are trained. The same training objective that makes language models so fluent and coherent also makes hallucination hard to detect.
A model trained to minimize perplexity – to predict the next token as accurately as possible – learns that confident, well-formed sentences are rewarded in the training distribution. Real text is rarely uncertain and halting. So models learn to sound certain even when the underlying representation is noisy.
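To make the objective concrete, here is a minimal sketch of how perplexity is computed from the probabilities a model assigned to the tokens that actually appeared. The probability lists are made-up illustrative numbers, not measurements from any real model.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


# Illustrative values only: probabilities assigned to each observed token.
well_predicted = [0.9, 0.8, 0.95, 0.85]
poorly_predicted = [0.3, 0.2, 0.4, 0.25]

print(perplexity(well_predicted))    # lower is better for the objective
print(perplexity(poorly_predicted))
```

Nothing in this quantity refers to whether the text is true. It only measures how well the model predicted the text it saw, which is why fluent and factual are not the same thing.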
This means the surface signals we’d normally use to spot an unreliable answer – hesitation, vagueness, hedging – are actively suppressed by training. Hallucinations come out sounding just like correct answers.
What the Research Is Actually Trying
The research response to hallucination is not a single thread but several distinct bets, each with genuine results and genuine limitations.
Retrieval-Augmented Generation (RAG) grounds the model’s responses in retrieved documents at inference time. Rather than relying solely on parametric memory baked into weights during training, RAG systems retrieve relevant passages from a corpus and pass them to the model as context. The model is then effectively being asked to synthesize and summarize rather than recall.
RAG substantially reduces factual errors on knowledge-intensive tasks. But it introduces its own problems: retrieval quality matters enormously, retrieved context can conflict with model priors, and the model can still fail to faithfully represent what the retrieved text says.
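The retrieve-then-generate flow can be sketched in a few lines. This is a toy under loud assumptions: retrieval here is plain word overlap rather than the dense or hybrid search a real system would likely use, the three-passage corpus is a stand-in, and `retrieve` and `build_prompt` are hypothetical helper names. The model call itself is omitted.

```python
import re


def tokenize(text):
    """Lowercase word set; a crude stand-in for a real retriever's representation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, corpus, k=2):
    """Return the k passages sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(corpus, key=lambda p: len(query_words & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages):
    """Put the retrieved passages in front of the question as context."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )


corpus = [
    "The Eiffel Tower is in Paris.",
    "Mount Everest is the highest mountain above sea level.",
    "Python was first released in 1991.",
]
query = "When was Python first released?"
print(build_prompt(query, retrieve(query, corpus, k=1)))
```

Even in this toy the weak points are visible: if `retrieve` ranks the wrong passage first, the prompt grounds the model in the wrong text, and nothing in `build_prompt` forces the model to stay faithful to what it was given.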
Calibration training tries to teach models to know what they don’t know. Work on uncertainty quantification trains models to produce better-calibrated confidence estimates, so that a stated high confidence actually correlates with correctness. This is technically harder than it sounds, because confidence in language models isn’t a single scalar – it’s distributed across token probabilities in ways that don’t straightforwardly map to semantic certainty.
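One common way to measure calibration is expected calibration error (ECE): bucket answers by stated confidence and compare each bucket's average confidence with its actual accuracy. The sketch below assumes each answer has already been reduced to one confidence number in [0, 1], which, as noted above, is exactly the step that is hard for language models. The sample values are illustrative only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Confidence 1.0 goes into the last bin rather than overflowing.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Illustrative: a model that claims 95% confidence but is right half the time.
print(expected_calibration_error([0.95, 0.95, 0.95, 0.95], [True, True, False, False]))
```

A score of zero means stated confidence matches accuracy in every bucket. It is an average, so a low score can still hide individual overconfident errors.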
RLHF and factuality feedback use human preference data to steer models toward factually accurate outputs. The challenge is that factual accuracy is expensive to label at scale – you need domain experts, not just preference annotators. Some recent work uses AI-assisted fact-checking to scale this signal.
Inference-time interventions attempt to reduce hallucination at generation time without retraining. Methods like chain-of-thought prompting, self-consistency (generating multiple answers and taking the majority), and iterative self-critique have all shown empirical benefits, though the mechanisms are still debated.
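Self-consistency is the easiest of these to sketch: sample several answers and keep the most frequent one. In the sketch below, `sample_fn` is a hypothetical placeholder for a call to a model with sampling enabled; a stub that replays canned answers stands in for it so the example runs on its own. Exact string matching after normalization is also a simplifying assumption, since real answers often need fuzzier comparison.

```python
from collections import Counter


def self_consistent_answer(sample_fn, question, n_samples=5):
    """Sample n answers and return the majority answer with its agreement rate."""
    answers = [sample_fn(question).strip().lower() for _ in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_samples


def make_stub_sampler(canned_answers):
    """Hypothetical stand-in for a model: replays canned answers in order."""
    remaining = iter(canned_answers)
    return lambda question: next(remaining)


sampler = make_stub_sampler(["1991", "1991 ", "1989", "1991", "1990"])
answer, agreement = self_consistent_answer(sampler, "When was Python first released?")
print(answer, agreement)
```

The agreement rate is a rough uncertainty signal an interface could surface, though a model can also be consistently wrong, so high agreement is not proof of correctness.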
The Honest Assessment
None of these approaches fully solves the problem, and that’s not defeatism – it’s the correct read of the literature. RAG helps with knowledge boundary failures but doesn’t help if the retrieved document itself is wrong or misleading. Calibration training improves average cases but doesn’t eliminate overconfident errors. RLHF can introduce sycophancy, where models learn to say what sounds right rather than what is right.
The deeper issue is that hallucination is not a single bug but a symptom of how these models work at a fundamental level: models that generate by predicting plausible text will generate plausible-sounding text even in the absence of reliable knowledge.
Progress is real. Models trained today hallucinate meaningfully less than their predecessors on standard benchmarks. But benchmarks measure what they measure, and deployed systems face a much wider distribution of queries than benchmarks cover.
This is not a reason to stop deploying language models. It is a reason to deploy them with retrieval grounding wherever possible, to build interfaces that expose uncertainty rather than hiding it, and to treat high-stakes factual claims from any language model as hypotheses requiring verification rather than facts requiring citation.