InstructGPT: How AI Learned What Humans Actually Want
The paper behind ChatGPT. InstructGPT showed how to use human feedback to align model outputs with human preferences—turning a capable language model into an actually helpful assistant. This is reinforcement learning from human feedback (RLHF) made real.
Better Instructions, Same Problem
FLAN showed us that instruction tuning works. Fine-tune a model on tasks described in natural language, and it follows instructions better. Google's FLAN and OpenAI's own supervised fine-tuning experiments proved the concept.
But there was a catch.
Try asking a fine-tuned GPT-3 a question, and you'd get an answer. Try asking it a tricky question, and it might:
- Make up confident-sounding nonsense (hallucinate)
- Give an answer that's technically correct but misleading
- Follow instructions too literally when common sense should override them
- Produce verbose, boring outputs when conciseness would be better
The model had learned to follow instructions. But it hadn't learned what humans actually wanted.
Here's the core tension: instruction tuning optimizes for doing what you're told. RLHF optimizes for doing what you want.
Those aren't the same thing.
The Human-in-the-Loop Insight
The breakthrough was deceptively simple: ask humans to rate model outputs, then train the model to predict which outputs humans prefer.
Instead of:
Input: "Write me a poem about cats"
Output: "Roses are red, violets are blue, cats are... [continues awkwardly]"
You get:
Input: "Write me a poem about cats"
Rating by humans: 9/10 - "Delightful, creative, captures cat personality"
And you train the model to generate outputs that humans rate highly.
But how do you optimize for human preference? You can't just compute "how much did humans like this?" as a loss function.
Enter: Reinforcement Learning from Human Feedback (RLHF).
The Three-Step Process
InstructGPT isn't just one technique. It's a pipeline:
Step 1: Supervised Fine-Tuning (SFT)
Start with FLAN-style instruction tuning: train the model to follow instructions on demonstration data. This is the baseline.
Instruction: Write a haiku about spring
Output: Blossoms break the frost
Green shoots push through tired earth
Hope returns again
Data: ~13K instruction-output pairs, written by human contractors.
Why this step? The reward model (coming next) needs plausible outputs to compare, and a purely pretrained model won't reliably attempt instruction-following. You need the model to at least try to follow instructions before humans can usefully rank its outputs.
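Under the hood, SFT is ordinary supervised learning: maximize the log-likelihood of the demonstration response given the instruction. Here is a toy sketch of that loss (the `sft_loss` helper and the example numbers are illustrative, not from the paper): next-token cross-entropy, masked so only response tokens contribute.

```python
def sft_loss(target_logprobs, response_mask):
    """Masked next-token cross-entropy for supervised fine-tuning.

    target_logprobs: model log-probability of each target token
    response_mask:   1 for response tokens, 0 for prompt tokens
                     (prompt tokens are conditioned on, not trained on)
    """
    masked = [-lp * m for lp, m in zip(target_logprobs, response_mask)]
    return sum(masked) / max(sum(response_mask), 1)

# Toy example: 2 prompt tokens (masked out), 2 response tokens
loss = sft_loss([-0.1, -0.2, -0.3, -0.4], [0, 0, 1, 1])
```

Masking the prompt is a common design choice: the model should learn to produce responses, not to reproduce instructions.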
Step 2: Reward Modeling
This is where humans come in. Take model outputs and have humans compare them.
Instruction: Explain quantum entanglement to a 10-year-old
Output A (GPT-3 SFT): "Quantum entanglement is a phenomenon where two
or more particles become correlated such that the quantum state cannot
be described independently..."
Output B (GPT-3 SFT): "Imagine two magic coins that are connected. When
you flip one and it lands on heads, the other coin instantly lands on
tails—even if it's far away. That's kind of like entanglement..."
Human rater: Prefers Output B (much better for 10-year-olds)
Collect ~33K comparisons (humans ranking pairs of model outputs).
From these pairwise comparisons, train a reward model: a neural network that takes text and outputs a score representing "how much would a human like this?"
# Conceptual reward model
# InstructGPT used a 6B GPT-3 model with the final unembedding layer
# removed, replaced by a linear projection to a single scalar output.
reward_model = GPT3RewardModel(base_model="gpt3-6B")
# Takes generated text, outputs a single score
score = reward_model(prompt + response) # float, higher = humans prefer it
The reward model learns to predict: "if a human read this, would they like it?"
Key insight: You're not training the language model yet. You're training a separate model that judges outputs.
Step 3: Reinforcement Learning (RLHF)
Now use the reward model to train the language model.
The language model generates text. The reward model scores it. The language model gets reinforcement: high scores → encourage those outputs, low scores → discourage them.
This is classical reinforcement learning:
- Agent: Language model (GPT-3)
- Environment: Text generation task
- Reward: Score from the reward model
- Objective: Maximize expected reward
# Simplified RL loop (conceptual pseudocode)
language_model = GPT3()
reward_model = trained_reward_model  # from Step 2

for epoch in range(num_epochs):
    # Generate text from the language model
    prompt = "Explain quantum entanglement to a 10-year-old"
    generated_text = language_model.generate(prompt)

    # Score it with the reward model
    reward = reward_model(prompt + generated_text)

    # Update the language model to maximize reward
    # (in practice via PPO - Proximal Policy Optimization)
    loss = -reward  # negative because optimizers minimize
    loss.backward()
    optimizer.step()
But there's a catch: if you purely optimize for reward, the model might find weird loopholes or degrade in capability.
Solution: Constrain the model to stay close to the SFT version using a technique called KL divergence penalty.
# RL objective with KL penalty (conceptual)
reward = reward_model(generated_text)
kl_penalty = kl_divergence(language_model, sft_model)  # drift from the SFT policy
loss = -reward + 0.02 * kl_penalty  # 0.02 = KL coefficient
This says: "maximize reward, but don't drift too far from the SFT baseline." It prevents reward hacking and preserves general capability.
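For intuition, here is a toy version of that trade-off on explicit next-token distributions (the helper names and example distributions are mine; the real penalty is computed per token over the whole generation, not on one distribution):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_reward(rm_score, policy_probs, sft_probs, beta=0.02):
    """Reward-model score minus a penalty for drifting away from the SFT policy."""
    return rm_score - beta * kl_divergence(policy_probs, sft_probs)
```

If the policy's distribution matches the SFT model's, the penalty is zero; the further it drifts, the more reward it must earn to compensate.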
Why This Actually Works
Alignment Without Explicit Rules
You can't write down "how to be helpful" in rules. What does it mean to give a helpful answer?
- Accuracy
- Clarity
- Appropriate tone
- Acknowledging uncertainty
- Conciseness vs. detail (context-dependent)
- Honesty even when it's uncomfortable
- ...dozens of other factors
RLHF learns these implicitly. The reward model sees thousands of human preferences and learns their underlying pattern.
Scaling Human Feedback
You can't have humans evaluate every output. So you:
- Have humans compare a sample of outputs (~33K comparisons)
- Train a model to predict human preferences
- Use that model to reward the language model
This is scalable. The human comparisons cost money but are manageable; the rest is computation.
Emergent Behavior
Something remarkable happens: the model learns concepts it was never explicitly trained on.
After RLHF, models become better at:
- Admitting uncertainty ("I'm not sure, but...")
- Declining harmful requests
- Showing reasoning step-by-step
- Correcting themselves
These behaviors emerge from optimizing for human preference, even though they weren't explicitly rewarded.
The Technical Picture
Reward Model Architecture
InstructGPT used a 6B parameter GPT-3 model with:
- The final unembedding layer removed
- A linear projection layer added to output a single scalar reward score
- Input: prompt + model-generated response
- Output: single scalar reward score
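That scalar head is just a learned linear projection of the final hidden state. A minimal sketch (the `reward_head` helper and toy vectors are illustrative, standing in for the 6B model's activations):

```python
def reward_head(hidden_state, weights, bias=0.0):
    """Linear projection of a final hidden state to a single scalar reward."""
    return sum(h * w for h, w in zip(hidden_state, weights)) + bias

# Toy 4-dim hidden state in place of the real model's final activations
score = reward_head([1.0, -2.0, 0.5, 3.0], [0.1, 0.2, 0.3, 0.4])
```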
The model learns from contrastive pairs: "this output is better than that one."
Training objective:
# Given a pair (better_output, worse_output)
score_better = reward_model(better_output)
score_worse = reward_model(worse_output)
# Want: score_better > score_worse
loss = -log(sigmoid(score_better - score_worse))
This is binary classification in disguise: given two texts, predict which one humans prefer.
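The objective above can be written as a tiny runnable function (a sketch of the pairwise ranking loss; the helper name is mine):

```python
import math

def pairwise_ranking_loss(score_better, score_worse):
    """-log(sigmoid(better - worse)): small when the reward model
    ranks the human-preferred output higher, large when it ranks it lower."""
    margin = score_better - score_worse
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the scores are equal the loss is log 2 (a coin flip); it shrinks toward zero as the reward model separates the preferred output by a wider margin.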
RL Algorithm: PPO
InstructGPT uses Proximal Policy Optimization (PPO), a standard RL algorithm.
PPO is stable and sample-efficient compared to simpler policy gradient methods. It prevents the model from changing too rapidly per update, which helps preserve capability.
Simplified version:
for epoch in range(epochs):
    # Rollout: generate completions for prompts
    prompts = load_batch(prompt_dataset)
    completions = language_model.generate(prompts)

    # Evaluate: score completions with the reward model
    rewards = reward_model(completions)

    # Update: optimize the policy with PPO
    ppo_update(language_model, prompts, completions, rewards)
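The heart of PPO is its clipped surrogate objective, which caps how much credit a single update can claim for moving the policy. A toy per-sample version (the helper name and the standard epsilon of 0.2 are illustrative; InstructGPT's exact PPO details are in the paper):

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO's per-sample objective: take the more pessimistic of the raw and
    clipped probability-ratio terms, so large policy jumps earn no extra credit."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

This pessimistic `min` is what makes PPO's updates conservative: pushing the probability ratio beyond the clip range stops improving the objective, so the policy changes gradually.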
Hyperparameters
- Batch size: 256-512 prompts per update
- Learning rate: Small, ~1e-5 (don't want to overwrite pretraining)
- KL coefficient: 0.02 (balance between reward and staying close to SFT)
- RL epochs: Few iterations (1-5) over each batch
- Prompt diversity: Mix in-distribution and out-of-distribution prompts
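Gathered into one place, the settings above might look like this in code (an illustrative config object mirroring the ballpark values listed, not OpenAI's actual training code):

```python
from dataclasses import dataclass

@dataclass
class RLHFConfig:
    batch_size: int = 512        # prompts per update
    learning_rate: float = 1e-5  # small, to avoid overwriting pretraining
    kl_coefficient: float = 0.02 # balance reward vs. staying close to SFT
    ppo_epochs: int = 4          # few passes over each batch

config = RLHFConfig()
```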
Results: InstructGPT vs GPT-3
Human Evaluation
Human labelers compared outputs from different models side by side. The results were measured as win rates — how often labelers preferred one model's output over another:
| Comparison | InstructGPT preferred |
|---|---|
| InstructGPT (175B) vs. GPT-3 (175B) | ~85% of the time |
| InstructGPT (175B) vs. GPT-3 prompted (175B) | ~71% of the time |
| InstructGPT (175B) vs. SFT (175B) | ~61% of the time |
The most striking result: the 1.3B parameter InstructGPT model was preferred over the 175B GPT-3 — a model 100× larger. RLHF made a tiny model more useful than a giant one.
Specific Behaviors
Writing Quality:
- GPT-3 (SFT): Rambling, unclear, often grammatically awkward
- InstructGPT: Concise, well-structured, polished
Truthfulness:
- GPT-3: Confident lies ("Einstein discovered electricity")
- InstructGPT: Admits uncertainty ("I don't know, but...")
Harm Avoidance:
- GPT-3: Explains how to make explosives or synthesize drugs if asked
- InstructGPT: Declines harmful requests while being respectful
Math/Logic:
- GPT-3: Often wrong on arithmetic
- InstructGPT: Frequently correct (and shows work)
Scaling
The technique scales across model sizes. OpenAI tested 1.3B, 6B, and 175B parameter models, and all three benefited from RLHF, with consistent preference improvements at every scale. Combined with the fact that the 1.3B InstructGPT model beat the raw 175B GPT-3, this showed that alignment techniques can matter more than raw scale, a finding with major practical implications.
Why This Mattered: The Path to ChatGPT
InstructGPT was published by OpenAI in March 2022 — eight months before ChatGPT launched in November 2022.
InstructGPT was the technique that made ChatGPT possible. OpenAI described ChatGPT as a "sibling model" to InstructGPT, trained with the same RLHF pipeline on a more conversational dataset.
The pipeline is:
- Pretraining (GPT-3 style)
- Instruction tuning (FLAN style)
- RLHF (InstructGPT style)
This became the standard. Claude, Gemini, and other modern assistants all use RLHF.
Why RLHF Mattered
Before InstructGPT:
- Models were impressive but unpredictable
- They would confidently hallucinate
- They'd follow instructions to absurd extremes
- Companies couldn't reliably control behavior
After InstructGPT:
- Models were genuinely helpful
- They admitted uncertainty
- They declined harmful requests
- Companies had a path to alignment
This wasn't just an incremental improvement. It was the difference between "impressive toy" and "actually usable assistant."
The Alignment Story
InstructGPT introduced a framework for alignment: use human feedback to shape model behavior.
This is important because:
- Alignment is hard to specify: You can't write perfect rules for helpful behavior
- Humans can judge better than they can describe: We know good answers when we see them
- Scalable: Train a reward model once, use it to train many models
But it's not perfect:
- Reward models can be gamed (model learns to exploit the reward signal)
- Human raters have biases (biased feedback → biased reward model)
- Changing the reward model can dramatically change behavior
- You're only as good as your training data (biases accumulate)
Still, RLHF became the standard approach to alignment in LLMs. It's not perfect, but it works.
Practical Insights
Cost Breakdown (Estimated)
Approximate costs for RLHF training at GPT-3 scale (industry estimates, not from the paper):
- Pretraining: $5-10 million (one-time)
- Instruction data: $50K-100K (13K examples × human annotation)
- Reward model training: $100K+ (GPU time)
- RLHF: $500K-1M+ (lots of RL sampling and compute)
This is expensive, but far cheaper than pretraining, and practical for companies operating at scale.
Why You See This Everywhere
RLHF works. Modern models use it. Why?
- Simple concept: Humans prefer X, reward X
- Empirically effective: consistent, measurable gains in human preference
- Scalable: Doesn't require model retraining from scratch
- Adaptable: Can swap reward models or feedback sources
The Bigger Picture
From Language Modeling to Assistance
The three-step pipeline (pretraining → instruction tuning → RLHF) is how we went from:
GPT-2/3 era (2018-2020):
You: "Write a story about dragons"
GPT-3: "Write a story about dragons flying in the sky, breathing fire on
castles, fighting knights, and... [continues for pages unprompted]"
You: [frustrated] "I didn't ask for all that"
ChatGPT era (2022+):
You: "Write a short story about dragons"
ChatGPT: "# The Lonely Dragon
In a valley of grey stone lived Ember, a dragon with scales of silver..."
[200 words, exactly what was asked for, engaging, well-written]
RLHF is a large part of why the second version exists.
Open Questions
RLHF doesn't solve everything:
- Hallucination: Models still make things up; RLHF helps but doesn't eliminate it
- Adversarial robustness: Careful prompting can still break alignment
- Long-horizon reasoning: Hard to reward long chains of reasoning with short feedback
- Value misalignment: What if human raters are wrong? What if we optimize for the wrong thing?
These remain active research areas.
Series Navigation
- Attention Is All You Need: The Paper That Changed AI
- BERT: How AI Learned to Truly Read
- GPT-2: How AI Learned to Write
- FLAN: How AI Learned to Follow Instructions
- InstructGPT: How AI Learned What Humans Actually Want ← You are here
Last Updated: March 30, 2026
Author: RESEARCHER
Category: Research / Tutorial
Difficulty: Intermediate
Prerequisite: Understanding of FLAN and basic reinforcement learning concepts