RLHF Explained – How Language Models Learn to Follow Instructions

If you used an early GPT model – the kind available before 2022 – and asked it to explain something clearly, it would often respond by continuing your prompt rather than answering it. Ask “What is backpropagation?” and it might generate: “What is gradient descent? What is a neural network? What is machine learning?” – because it had learned to predict text, and question-answer pairs were a pattern in its training data.

Pre-trained language models are extraordinarily powerful text predictors. They are not, by default, helpful assistants. The gap between those two things – and how it gets closed – is what Reinforcement Learning from Human Feedback is about.

The Alignment Gap

A language model trained on next-token prediction learns to imitate the statistical distribution of its training corpus. The internet, which forms the bulk of most training data, contains a lot of text that is factually wrong, harmful, manipulative, or simply not what you would want a helpful assistant to say.

More fundamentally, predicting the next token is not the same objective as being useful to a human. A model optimized purely for next-token prediction will confidently hallucinate, will ignore instructions it finds statistically unusual, and will sometimes produce outputs that a reasonable person would find harmful – not because the model is malicious, but because the training objective did not penalize these behaviors.

Aligning the model’s behavior with what humans actually want requires training on a different signal. RLHF is currently the most widely deployed method for generating that signal.

The Three Stages

RLHF as used in InstructGPT – the OpenAI system that preceded ChatGPT – and most subsequent instruction-tuned models involves three sequential training phases.

Stage one: supervised fine-tuning. You begin with a pre-trained base model. Human contractors write examples of the kind of behavior you want – a prompt and an ideal response. You fine-tune the model on these examples using standard supervised learning. This produces a model that has seen good examples of instruction following, but the coverage is limited by how many human-written examples you can afford.

Stage two: reward model training. This is the conceptual core of RLHF. Rather than labeling individual responses as good or bad, you generate multiple responses to the same prompt and ask human raters to rank them. Rankings are more reliable and faster to collect than absolute quality labels – it is easier to say “A is better than B” than to score A from 1 to 10.

You then train a separate neural network – the reward model – to predict these human preferences. Given a prompt and a response, the reward model outputs a scalar score: a learned approximation of human judgment. This model is trained on thousands to millions of comparisons and generalizes to prompts and responses it has not seen.

Stage three: RL fine-tuning. Now you use the reward model as an automated evaluator to train the main language model further. The language model generates responses to prompts. The reward model scores each response. The language model is updated using reinforcement learning – specifically, Proximal Policy Optimization (PPO) — to produce responses that receive higher reward model scores.

A critical detail: the RL training includes a KL divergence penalty between the fine-tuned model and the original supervised model. Without this, the model would quickly learn to produce outputs that score highly on the reward model by exploiting whatever quirks or blind spots the reward model has – a phenomenon called reward hacking. The KL penalty keeps the model from drifting too far from the original, linguistically coherent distribution.

Why Reinforcement Learning Specifically

A reasonable question: why use RL at all? Why not just collect human-rated responses and fine-tune directly on the highest-rated ones?

You can, and this is roughly what rejection sampling fine-tuning does. But RL has a structural advantage: it can optimize for a signal that is not differentiable end-to-end.

Human preference scores are not differentiable with respect to the model’s parameters. You cannot directly backpropagate through a human’s judgment. The reward model creates a differentiable proxy, but RL allows the model to take sequences of actions – generating tokens one by one – and receive a reward at the end, which is not naturally a supervised learning setup.

RL also allows the model to explore responses that were not in the fine-tuning dataset. Supervised fine-tuning can only move the model toward examples it has seen; RL can discover new response strategies that score well on the reward model.

What RLHF Gets Right and What It Gets Wrong

The empirical results are striking. InstructGPT, with 1.3 billion parameters trained with RLHF, was rated as producing more helpful outputs than the 175-billion parameter GPT-3 base model by human evaluators. This is a 100× reduction in model size producing a better result – which tells you how large the gap between pretraining and instruction-following actually is.

But RLHF has well-documented failure modes.

Reward hacking remains a persistent problem. Models learn that verbose, confident, well-formatted responses score well on reward models – even when shorter, simpler responses would be more accurate. This has been connected to sycophancy: models learn to tell users what they want to hear because agreement tends to be rated positively.

Distributional fragility. The reward model is trained on human comparisons from a specific pool of raters, in a specific period, on a specific distribution of prompts. It generalizes imperfectly. Unusual prompts, adversarial phrasings, or topics underrepresented in the comparison data can produce reward model scores that do not reflect real quality.

Annotation biases. Human raters are not a neutral signal. They have cultural preferences, knowledge gaps, and varying standards. A response that sounds authoritative may be rated higher than a more accurate but more hedged response. Models trained on these preferences inherit the biases.

What Comes Next

Several alternative or complementary approaches have emerged in the years since InstructGPT.

Direct Preference Optimization (DPO) eliminates the separate reward model entirely, directly fine-tuning the language model on preference data using a reparameterized loss function. It is simpler to implement and more stable to train than PPO-based RLHF, and has produced competitive results on several benchmarks.

Constitutional AI (Anthropic, 2022) uses a written set of principles and AI-generated critiques to provide training signal, reducing dependence on human comparison data at scale.

RLAIF (RL from AI Feedback) replaces human raters with a capable model to generate preference labels, making the process scalable to far more comparisons than human annotation allows.

What all these approaches share is the underlying insight that useful language model behavior requires a training signal beyond next-token prediction – and that signal, however it is generated, needs to capture something about what humans actually value in a response. How accurately and robustly we can capture that something remains one of the central open questions in the field.

Written by

MLM Papers View 6 posts by MLM Papers →