FLAN: How AI Learned to Follow Instructions
The paper that bridged pretraining and ChatGPT. Instruction tuning showed how a simple format, describing tasks in natural language, could make models dramatically better at understanding and following what you ask them to do.
The Problem with Pretraining
By 2021, we had powerful models. GPT-2, GPT-3, and their cousins could generate text, answer questions, and summarize documents. But there was a catch: they were brilliant at continuing text patterns, not necessarily at following instructions.
Ask GPT-2 to summarize an article, and it might instead continue writing the article. Ask it to classify sentiment and it might generate more text that looks like the original. The models excelled at predicting what comes next, but they didn't reliably understand what you were asking them to do.
Why? Because they were trained on raw internet text, where the task is always the same: "predict the next token." They never learned the meta-skill of understanding task descriptions.
Instruction tuning changed this.
The Core Idea: Format Tasks as Instructions
The insight behind FLAN is deceptively simple: if you describe tasks in natural language and fine-tune on those descriptions, models learn to follow instructions.
Instead of training on unlabeled text, you create examples like this:
Instruction: Summarize the following text in one sentence.
Text: The Arctic is warming twice as fast as the rest of the planet due to
the albedo effect. When ice melts, it exposes dark water that absorbs more heat,
accelerating further warming.
Output: The Arctic warms faster than Earth overall because melting ice exposes
water that absorbs more heat.
Instruction: What is the sentiment of the following text?
Text: I can't believe how much I loved that movie. The characters were amazing
and the ending made me cry with joy.
Output: positive
Instruction: Translate the following to French.
Text: Good morning, how are you?
Output: Bonjour, comment allez-vous?
Then you fine-tune the model on these examples, teaching it to take instructions and produce appropriate outputs.
The magic: the model learns the pattern itself. It figures out that:
- When it sees "Summarize the following text," it should produce a short summary
- When it sees "What is the sentiment," it should classify
- When it sees "Translate," it should change languages
You don't need to train separate models for each task. One model, many instructions.
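The "one model, many instructions" idea can be sketched in a few lines: each task contributes (instruction, input, output) triples, and everything is flattened into a single training set with a shared format. A minimal sketch (the task names and examples below are illustrative, not the actual FLAN data):

```python
# Sketch: flattening many tasks into one instruction-formatted dataset.
# Task names, instructions, and examples are illustrative.

def format_example(instruction, text, output):
    """Render one (instruction, input, output) triple as a training string."""
    return f"Instruction: {instruction}\nText: {text}\nOutput: {output}"

raw_tasks = {
    "summarization": [
        ("Summarize the following text in one sentence.",
         "The Arctic is warming twice as fast as the rest of the planet...",
         "The Arctic warms faster than Earth overall."),
    ],
    "sentiment": [
        ("What is the sentiment of the following text?",
         "I loved that movie.",
         "positive"),
    ],
    "translation": [
        ("Translate the following to French.",
         "Good morning, how are you?",
         "Bonjour, comment allez-vous?"),
    ],
}

# One flat dataset: the model never sees the task names, only the
# natural-language instructions.
training_examples = [
    format_example(*ex) for examples in raw_tasks.values() for ex in examples
]

print(len(training_examples))  # 3
```

The point of the shared format is that nothing in the training string identifies which task it came from; the instruction itself has to carry that information.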
Why This Works
Pattern Recognition Across Tasks
The model learns a meta-pattern: instructions describe what to do, outputs show how to do it.
When trained on hundreds of different tasks (summarization, classification, translation, question answering, paraphrasing, reasoning), the model starts to extract the common principle: understand what the instruction is asking, then produce the appropriate output.
This is transfer learning at the meta level. Instead of transferring knowledge about language, you're transferring the ability to follow instructions.
Zero-Shot Generalization
Here's where it gets powerful. After instruction tuning, the model becomes better at zero-shot performance on tasks it has never seen.
Compare:
Before instruction tuning (vanilla GPT-2):
Input: Classify the sentiment: "This movie was terrible."
Output: "This movie was terrible and I hated every minute of it. The plot
was predictable, the acting was wooden..."
It continues the text instead of classifying.
After instruction tuning (FLAN):
Input: Classify the sentiment: "This movie was terrible."
Output: negative
It understands the task and performs it correctly, even on unseen tasks.
Why? Because during tuning, it learned the general pattern: "when someone describes a task in natural language, produce the expected output type."
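The "unseen tasks" claim rests on FLAN's held-out-cluster evaluation protocol: to measure zero-shot performance on a task type, the model is fine-tuned on every other task cluster and tested only on the held-out one. A minimal sketch of that split (the cluster contents below are illustrative, not FLAN's actual groupings):

```python
# Sketch of a held-out-cluster split, in the spirit of FLAN's evaluation.
# Cluster names and dataset names are illustrative placeholders.

task_clusters = {
    "nli": ["anli", "rte", "cb"],
    "sentiment": ["sst2", "imdb"],
    "translation": ["wmt_en_fr", "wmt_en_de"],
    "summarization": ["cnn_dailymail", "xsum"],
}

def holdout_split(clusters, held_out):
    """Return (train_tasks, eval_tasks): fine-tune on all clusters
    except `held_out`, evaluate zero-shot on `held_out` only."""
    train = [task for name, tasks in clusters.items()
             if name != held_out for task in tasks]
    return train, list(clusters[held_out])

train_tasks, eval_tasks = holdout_split(task_clusters, "nli")
print(eval_tasks)              # ['anli', 'rte', 'cb']
print("anli" in train_tasks)   # False
```

Because no NLI data appears in the fine-tuning mix, any gains on the held-out cluster must come from the instruction-following skill itself, not from memorized task data.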
Scaling
FLAN showed something remarkable: the gains from instruction tuning grow with both the number of fine-tuning tasks and the model size.
- More task clusters during fine-tuning = better generalization to unseen tasks
- Larger models = better instruction following (the 137B model benefited far more than smaller variants)
- The effect compounds: a large model trained on diverse instructed tasks can handle completely novel instructions
The FLAN Collection
The original FLAN paper (2021) fine-tuned on 62 tasks. But the real power came with the FLAN Collection (2022), a curated dataset of 1.8K tasks covering:
- NLP classics: summarization, translation, Q&A, classification
- Reasoning: arithmetic, common sense, multi-step logic
- Knowledge: trivia, factual questions, definitions
- Generation: creative writing, paraphrasing, expansion
- Domain-specific: medical, legal, scientific texts
The scale mattered. By training on this breadth of task formulations, the model internalized the concept of instruction-following itself.
The Technical Picture
Format
FLAN uses a consistent format across all tasks:
[Instruction describing the task]
[Input/context if needed]
[Expected output]
For example:
Instruction: Identify the category of the product based on the description.
Input: A lightweight, waterproof jacket perfect for hiking in rain.
Output: clothing
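On top of this shared format, the FLAN authors wrote multiple natural-language templates per dataset, so the model learns the underlying task rather than memorizing one fixed phrasing. A small sketch of that idea (these templates are illustrative stand-ins, not the originals):

```python
# Sketch: several phrasings of the same task, sampled at training time.
# Templates here are illustrative, not the actual FLAN templates.
import random

sentiment_templates = [
    "What is the sentiment of the following text?\nText: {text}\nOutput:",
    "Is the following review positive or negative?\n{text}\nAnswer:",
    "Classify the sentiment of this passage: {text}\nSentiment:",
]

def render(templates, text, rng=random):
    """Pick a random phrasing of the same underlying task."""
    return rng.choice(templates).format(text=text)

prompt = render(sentiment_templates, "This movie was terrible.")
print(prompt)
```

Sampling over phrasings acts as a regularizer: the only stable signal across templates is the task itself, which is exactly what the model should latch onto.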
Training Procedure
- Start with a pretrained model (the original FLAN used Google's 137B parameter LaMDA-PT; later versions used PaLM and T5)
- Add instruction/output pairs as training data
- Fine-tune on these pairs for several epochs
- The model learns to map instructions to appropriate outputs
Key Hyperparameters
- Learning rate: Lower than pretraining (you don't want to forget what you learned)
- Batch size: Moderate (32-128, depending on model size)
- Epochs: Few (1-5; instruction tuning converges quickly)
- Loss: Standard language modeling loss on the output only
Example training loop (conceptual, using a generic causal LM):
# Start with a pretrained model.
# FLAN used Google's 137B LaMDA-PT; this simplified example
# uses a smaller causal LM to illustrate the concept.
import torch
from torch.optim import Adam
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("model-name")
tokenizer = AutoTokenizer.from_pretrained("model-name")

# Create instruction-output pairs
training_data = [
    ("Summarize: The Arctic is warming...", "The Arctic warms faster..."),
    ("Classify sentiment: This movie was great!", "positive"),
    # ... over 60 different task types
]

# Fine-tune
optimizer = Adam(model.parameters(), lr=1e-4)
for epoch in range(3):
    for instruction, output in training_data:
        inputs = tokenizer(instruction + " " + output, return_tensors="pt")
        labels = inputs["input_ids"].clone()
        # Loss on the output tokens only: mask the instruction tokens
        # with -100 so the cross-entropy ignores them.
        prompt_len = len(tokenizer(instruction + " ")["input_ids"])
        labels[:, :prompt_len] = -100
        optimizer.zero_grad()
        loss = model(**inputs, labels=labels).loss
        loss.backward()
        optimizer.step()
Results: The Numbers
FLAN's impact was dramatic. The 137B FLAN model was evaluated on unseen task types (tasks it had never been explicitly trained on) and compared against zero-shot and few-shot GPT-3 (175B):
- Zero-shot FLAN surpassed zero-shot GPT-3 on 20 of 25 evaluated tasks
- FLAN even outperformed few-shot GPT-3 (with carefully crafted examples) on benchmarks like ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze
- Natural Language Inference saw the largest gains: FLAN dramatically improved on ANLI and RTE
- Reading comprehension (BoolQ, MultiRC) also showed strong improvements
Not everything improved equally. FLAN was less effective on commonsense reasoning and coreference resolution, only outperforming the base LaMDA-PT model on three of seven tasks in those categories.
The ablation studies revealed three key factors for success:
- Number of tasks matters: more task clusters during fine-tuning improved generalization
- Model scale matters: instruction tuning benefits larger models more
- Natural language instructions matter: removing the instruction templates significantly hurt performance
Why This Matters
From Prediction to Understanding
Pretraining teaches models to predict text. Instruction tuning teaches them to understand what you're asking and do it.
This is the bridge between:
- Language models (GPT-2, GPT-3): brilliant at predicting text
- Assistants (ChatGPT, Claude): trained to follow instructions
Without instruction tuning, ChatGPT wouldn't exist. Or rather, it would be much worse: a model that continues your prompt instead of answering your question.
One Model, Many Capabilities
Before FLAN, the standard approach was to fine-tune separate models for each task, or use multi-task learning without explicit instructions (like T5). FLAN showed that adding natural language instructions to multi-task fine-tuning unlocked something new: the model generalized to task types it had never seen during training.
Fine-tune once on diverse instructed tasks, and the model handles novel instructions at inference time. Cost-effective. Elegant. Practical.
Democratizing AI
While the original FLAN required a 137B parameter model, the technique itself was model-agnostic. Later work, especially FLAN-T5 (2022), applied instruction tuning to models ranging from 80M to 11B parameters, making the approach accessible on modest hardware. FLAN-T5 11B outperformed the base T5 11B by double-digit margins, and even matched PaLM 62B on some challenging benchmarks.
The Connection to ChatGPT (Spoiler)
FLAN solved the "understanding instructions" problem. But there's a follow-up question: what if the instructions themselves come from humans, rated by humans, based on human preferences?
That's where InstructGPT (coming next in this series) enters the story. InstructGPT takes instruction tuning further: instead of using generic task instructions, it uses human feedback to align model outputs with what humans actually want.
But that's a story for another paper.
The Code, Conceptually
The insight of FLAN is that instruction formatting is learnable. Here's the core pattern a fine-tuned model internalizes:
def instruction_following_model(instruction, input_text=None):
    """
    A FLAN-tuned model understands this pattern:
    1. Read the instruction (e.g., "Summarize", "Classify", "Translate")
    2. Parse what task type it is
    3. Apply the appropriate transformation to the input
    4. Output the result in the expected format
    """
    # The model has learned this implicitly from training on many task types
    prompt = instruction
    if input_text:
        prompt += f"\nInput: {input_text}\n"
    prompt += "Output: "
    # The model continues this prompt appropriately
    output = model.generate(prompt, max_tokens=100)
    return output
The beauty is: the model doesn't have explicit if-else logic. It learns the pattern from examples.
Practical Applications
Internal
- Better zero-shot performance on new tasks
- Reduced need for task-specific fine-tuning
- Easier to add new capabilities (just add new task examples)
External (What You See)
- ChatGPT understanding complex questions
- Claude answering in multiple languages
- Gemini following specific formatting requests
- Any modern AI assistant being genuinely assistive
Limitations and Caveats
Hallucination
Instruction tuning doesn't eliminate hallucination. If asked a question it doesn't know, an instruction-tuned model will still confidently make up an answer. Better prompting helps, but doesn't solve it.
Complex Reasoning
Simple instructions work great. But very complex, multi-step reasoning still benefits from larger models or additional techniques (like chain-of-thought, which would become its own big paper).
Task Distribution
FLAN works best when trained on diverse tasks. If you only fine-tune on one type (say, only translation), it loses some generalization ability.
Model Size Matters
Instruction tuning helps small models, but a small instruction-tuned model is still weaker than a large one. You can't completely overcome architecture limits through data.
Why FLAN Mattered for the Field
- Empirical proof: Natural language instructions work. This wasn't obvious beforehand.
- Scaling insight: Instruction tuning scales with model and task diversity.
- Practical path: Showed how to take a raw LLM and make it usable.
- Future directions: Enabled all the work that followed on alignment, RLHF, and multi-task systems.
Without FLAN, the jump to ChatGPT would have been harder to conceptualize.
Series Navigation
- Attention Is All You Need: The Paper That Changed AI
- BERT: How AI Learned to Truly Read
- GPT-2: How AI Learned to Write
- FLAN: How AI Learned to Follow Instructions (you are here)
- InstructGPT: How AI Learned What Humans Actually Want
Last Updated: March 30, 2026
Author: RESEARCHER
Category: Research / Tutorial
Difficulty: Intermediate
Prerequisite: Understanding of GPT-2 and basic fine-tuning concepts