FLAN: How AI Learned to Follow Instructions
The paper that bridged pretraining and ChatGPT. Instruction tuning showed how a simple format, describing tasks in natural language, could make models dramatically better at understanding and following what you ask them to do.
The Problem with Pretraining
By 2021, we had powerful models. GPT-2, GPT-3, and their cousins could generate text, answer questions, and summarize documents. But there was a catch: they were brilliant at continuing text patterns, not necessarily at following instructions.
Ask GPT-2 to summarize an article, and it might instead continue writing the article. Ask it to classify sentiment and it might generate more text that looks like the original. The models excelled at predicting what comes next, but they didn't reliably understand what you were asking them to do.
Why? Because they were trained on raw internet text, where the task is always the same: "predict the next token." They never learned the meta-skill of understanding task descriptions.
Instruction tuning changed this.
The Core Idea: Format Tasks as Instructions
The insight behind FLAN is deceptively simple: if you describe tasks in natural language and fine-tune on those descriptions, models learn to follow instructions.
Instead of training on unlabeled text, you create examples like this:
Instruction: Summarize the following text in one sentence.
Text: The Arctic is warming twice as fast as the rest of the planet due to
the albedo effect. When ice melts, it exposes dark water that absorbs more heat,
accelerating further warming.
Output: The Arctic warms faster than Earth overall because melting ice exposes
water that absorbs more heat.
Instruction: What is the sentiment of the following text?
Text: I can't believe how much I loved that movie. The characters were amazing
and the ending made me cry with joy.
Output: positive
Instruction: Translate the following to French.
Text: Good morning, how are you?
Output: Bonjour, comment allez-vous?
Then you fine-tune the model on these examples, teaching it to take instructions and produce appropriate outputs.
The magic: the model learns the pattern itself. It figures out that:
- When it sees "Summarize the following text," it should produce a short summary
- When it sees "What is the sentiment," it should classify
- When it sees "Translate," it should change languages
You don't need to train separate models for each task. One model, many instructions.
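The "one model, many instructions" idea can be sketched in a few lines: each task contributes (instruction, input, output) triples, and everything is flattened into a single training set with a shared format. A minimal sketch (the task names and examples below are illustrative, not the actual FLAN data):

```python
# Sketch: flattening many tasks into one instruction-formatted dataset.
# Task names, instructions, and examples are illustrative.

def format_example(instruction, text, output):
    """Render one (instruction, input, output) triple as a training string."""
    return f"Instruction: {instruction}\nText: {text}\nOutput: {output}"

raw_tasks = {
    "summarization": [
        ("Summarize the following text in one sentence.",
         "The Arctic is warming twice as fast as the rest of the planet...",
         "The Arctic warms faster than Earth overall."),
    ],
    "sentiment": [
        ("What is the sentiment of the following text?",
         "I loved that movie.",
         "positive"),
    ],
    "translation": [
        ("Translate the following to French.",
         "Good morning, how are you?",
         "Bonjour, comment allez-vous?"),
    ],
}

# One flat dataset: the model never sees the task names, only the
# natural-language instructions.
training_examples = [
    format_example(*ex) for examples in raw_tasks.values() for ex in examples
]

print(len(training_examples))  # 3
```

The point of the shared format is that nothing in the training string identifies which task it came from; the instruction itself has to carry that information.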
Why This Works
Pattern Recognition Across Tasks
The model learns a meta-pattern: instructions describe what to do, outputs show how to do it.
When trained on hundreds of different tasks (summarization, classification, translation, question answering, paraphrasing, reasoning), the model starts to extract the common principle: understand what the instruction is asking, then produce the appropriate output.
This is transfer learning at the meta level. Instead of transferring knowledge about language, you're transferring the ability to follow instructions.
Zero-Shot Generalization
Here's where it gets powerful. After instruction tuning, the model becomes better at zero-shot performance on tasks it has never seen.
Compare:
Before instruction tuning (vanilla GPT-2):
Input: Classify the sentiment: "This movie was terrible."
Output: "This movie was terrible and I hated every minute of it. The plot
was predictable, the acting was wooden..."
It continues the text instead of classifying.
After instruction tuning (FLAN):
Input: Classify the sentiment: "This movie was terrible."
Output: negative
It understands the task and performs it correctly, even on unseen tasks.
Why? Because during tuning, it learned the general pattern: "when someone describes a task in natural language, produce the expected output type."
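The "unseen tasks" claim rests on FLAN's held-out-cluster evaluation protocol: to measure zero-shot performance on a task type, the model is fine-tuned on every other task cluster and tested only on the held-out one. A minimal sketch of that split (the cluster contents below are illustrative, not FLAN's actual groupings):

```python
# Sketch of a held-out-cluster split, in the spirit of FLAN's evaluation.
# Cluster names and dataset names are illustrative placeholders.

task_clusters = {
    "nli": ["anli", "rte", "cb"],
    "sentiment": ["sst2", "imdb"],
    "translation": ["wmt_en_fr", "wmt_en_de"],
    "summarization": ["cnn_dailymail", "xsum"],
}

def holdout_split(clusters, held_out):
    """Return (train_tasks, eval_tasks): fine-tune on all clusters
    except `held_out`, evaluate zero-shot on `held_out` only."""
    train = [task for name, tasks in clusters.items()
             if name != held_out for task in tasks]
    return train, list(clusters[held_out])

train_tasks, eval_tasks = holdout_split(task_clusters, "nli")
print(eval_tasks)              # ['anli', 'rte', 'cb']
print("anli" in train_tasks)   # False
```

Because no NLI data appears in the fine-tuning mix, any gains on the held-out cluster must come from the instruction-following skill itself, not from memorized task data.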
Scaling
FLAN showed something remarkable: the gains from instruction tuning grow with both the number of fine-tuning tasks and the model size.
- More task clusters during fine-tuning = better generalization to unseen tasks
- Larger models = better instruction following (the 137B model benefited far more than smaller variants)
- The effect compounds: a large model trained on diverse instructed tasks can handle completely novel instructions
The FLAN Collection
The original FLAN paper (2021) fine-tuned on 62 tasks. But the real power came with the FLAN Collection (2022), a curated dataset of 1.8K tasks covering:
- NLP classics: summarization, translation, Q&A, classification
- Reasoning: arithmetic, common sense, multi-step logic
- Knowledge: trivia, factual questions, definitions
- Generation: creative writing, paraphrasing, expansion
- Domain-specific: medical, legal, scientific texts
The scale mattered. By training on this breadth of task formulations, the model internalized the concept of instruction-following itself.
The Technical Picture
Format
FLAN uses a consistent format across all tasks:
[Instruction describing the task]
[Input/context if needed]
[Expected output]
For example:
Instruction: Identify the category of the product based on the description.
Input: A lightweight, waterproof jacket perfect for hiking in rain.
Output: clothing
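On top of this shared format, the FLAN authors wrote multiple natural-language templates per dataset, so the model learns the underlying task rather than memorizing one fixed phrasing. A small sketch of that idea (these templates are illustrative stand-ins, not the originals):

```python
# Sketch: several phrasings of the same task, sampled at training time.
# Templates here are illustrative, not the actual FLAN templates.
import random

sentiment_templates = [
    "What is the sentiment of the following text?\nText: {text}\nOutput:",
    "Is the following review positive or negative?\n{text}\nAnswer:",
    "Classify the sentiment of this passage: {text}\nSentiment:",
]

def render(templates, text, rng=random):
    """Pick a random phrasing of the same underlying task."""
    return rng.choice(templates).format(text=text)

prompt = render(sentiment_templates, "This movie was terrible.")
print(prompt)
```

Sampling over phrasings acts as a regularizer: the only stable signal across templates is the task itself, which is exactly what the model should latch onto.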
Training Procedure
- Start with a pretrained model (the original FLAN used Google's 137B parameter LaMDA-PT; later versions used PaLM and T5)
- Add instruction/output pairs as training data
- Fine-tune on these pairs for several epochs
- The model learns to map instructions to appropriate outputs
Key Hyperparameters
- Learning rate: Lower than pretraining (you don't want to forget what you learned)
- Batch size: Moderate (32-128, depending on model size)
- Epochs: Few (1-5; instruction tuning converges quickly)
- Loss: Standard language modeling loss on the output only
Example training loop (conceptual, using a generic causal LM):
# Start with a pretrained model.
# FLAN used Google's 137B LaMDA-PT; this simplified example
# uses a smaller causal LM to illustrate the concept.
import torch
from torch.optim import Adam
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("model-name")
tokenizer = AutoTokenizer.from_pretrained("model-name")

# Create instruction-output pairs
training_data = [
    ("Summarize: The Arctic is warming...", "The Arctic warms faster..."),
    ("Classify sentiment: This movie was great!", "positive"),
    # ... over 60 different task types
]

# Fine-tune
optimizer = Adam(model.parameters(), lr=1e-4)
for epoch in range(3):
    for instruction, output in training_data:
        inputs = tokenizer(instruction + " " + output, return_tensors="pt")
        labels = inputs["input_ids"].clone()
        # Loss on the output tokens only: mask the instruction tokens
        # with -100 so the cross-entropy ignores them.
        prompt_len = len(tokenizer(instruction + " ")["input_ids"])
        labels[:, :prompt_len] = -100
        optimizer.zero_grad()
        loss = model(**inputs, labels=labels).loss
        loss.backward()
        optimizer.step()
Results: The Numbers
FLAN's impact was dramatic. The 137B FLAN model was evaluated on unseen task types (tasks it had never been explicitly trained on) and compared against zero-shot and few-shot GPT-3 (175B):
- Zero-shot FLAN surpassed zero-shot GPT-3 on 20 of 25 evaluated tasks
- FLAN even outperformed few-shot GPT-3 (with carefully crafted examples) on benchmarks like ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze
- Natural Language Inference saw the largest gains: FLAN dramatically improved on ANLI and RTE
- Reading comprehension (BoolQ, MultiRC) also showed strong improvements
Not everything improved equally. FLAN was less effective on commonsense reasoning and coreference resolution, only outperforming the base LaMDA-PT model on three of seven tasks in those categories.
The ablation studies revealed three key factors for success:
- Number of tasks matters: more task clusters during fine-tuning improved generalization
- Model scale matters: instruction tuning benefits larger models more
- Natural language instructions matter: removing the instruction templates significantly hurt performance
Why This Matters
From Prediction to Understanding
Pretraining teaches models to predict text. Instruction tuning teaches them to understand what you're asking and do it.
This is the bridge between:
- Language models (GPT-2, GPT-3): brilliant at predicting text
- Assistants (ChatGPT, Claude): trained to follow instructions
Without instruction tuning, ChatGPT wouldn't exist. Or rather, it would be much worse: a model that continues your prompt instead of answering your question.
One Model, Many Capabilities
Before FLAN, the standard approach was to fine-tune separate models for each task, or use multi-task learning without explicit instructions (like T5). FLAN showed that adding natural language instructions to multi-task fine-tuning unlocked something new: the model generalized to task types it had never seen during training.
Fine-tune once on diverse instructed tasks, and the model handles novel instructions at inference time. Cost-effective. Elegant. Practical.
Democratizing AI
While the original FLAN required a 137B parameter model, the technique itself was model-agnostic. Later work, especially FLAN-T5 (2022), applied instruction tuning to models ranging from 80M to 11B parameters, making the approach accessible on modest hardware. FLAN-T5 11B outperformed the base T5 11B by double-digit margins, and even matched PaLM 62B on some challenging benchmarks.
The Connection to ChatGPT (Spoiler)
FLAN solved the "understanding instructions" problem. But there's a follow-up question: what if the instructions themselves come from humans, rated by humans, based on human preferences?
That's where InstructGPT (coming next in this series) enters the story. InstructGPT takes instruction tuning further: instead of using generic task instructions, it uses human feedback to align model outputs with what humans actually want.
But that's a story for another paper.
The Code, Conceptually
The insight of FLAN is that instruction formatting is learnable. Here's the core pattern a fine-tuned model internalizes:
def instruction_following_model(instruction, input_text=None):
    """
    A FLAN-tuned model understands this pattern:
    1. Read the instruction (e.g., "Summarize", "Classify", "Translate")
    2. Parse what task type it is
    3. Apply the appropriate transformation to the input
    4. Output the result in the expected format
    """
    # The model has learned this implicitly from training on many task types
    prompt = instruction
    if input_text:
        prompt += f"\nInput: {input_text}\n"
    prompt += "Output: "
    # The model continues this prompt appropriately
    output = model.generate(prompt, max_tokens=100)
    return output
The beauty is: the model doesn't have explicit if-else logic. It learns the pattern from examples.
Practical Applications
Internal
- Better zero-shot performance on new tasks
- Reduced need for task-specific fine-tuning
- Easier to add new capabilities (just add new task examples)
External (What You See)
- ChatGPT understanding complex questions
- Claude answering in multiple languages
- Gemini following specific formatting requests
- Any modern AI assistant being genuinely assistive
Limitations and Caveats
Hallucination
Instruction tuning doesn't eliminate hallucination. If asked a question it doesn't know, an instruction-tuned model will still confidently make up an answer. Better prompting helps, but doesn't solve it.
Complex Reasoning
Simple instructions work great. But very complex, multi-step reasoning still benefits from larger models or additional techniques (like chain-of-thought, which would become its own big paper).
Task Distribution
FLAN works best when trained on diverse tasks. If you only fine-tune on one type (say, only translation), it loses some generalization ability.
Model Size Matters
Instruction tuning helps small models, but a small instruction-tuned model is still weaker than a large one. You can't completely overcome architecture limits through data.
Why FLAN Mattered for the Field
- Empirical proof: Natural language instructions work. This wasn't obvious beforehand.
- Scaling insight: Instruction tuning scales with model and task diversity.
- Practical path: Showed how to take a raw LLM and make it usable.
- Future directions: Enabled all the work that followed on alignment, RLHF, and multi-task systems.
Without FLAN, the jump to ChatGPT would have been harder to conceptualize.
Series Navigation
- Attention Is All You Need: The Paper That Changed AI
- BERT: How AI Learned to Truly Read
- GPT-2: How AI Learned to Write
- FLAN: How AI Learned to Follow Instructions (you are here)
- InstructGPT: How AI Learned What Humans Actually Want
Last Updated: March 30, 2026
Author: RESEARCHER
Category: Research / Tutorial
Difficulty: Intermediate
Prerequisite: Understanding of GPT-2 and basic fine-tuning concepts