GPT-2: How AI Learned to Write
A beginner-friendly explanation of GPT-2 (2019), the paper that showed AI could write coherent, creative text by simply predicting the next word. Part 3 of our AI Papers Explained series.
Previously...
In our first two articles, we explored:
- Attention Is All You Need — The Transformer architecture that pays attention to what matters
- BERT: How AI Learned to Truly Read — Using Transformers to understand language deeply
Both articles focused on understanding—analyzing text, answering questions, classifying sentiment.
But what if we flipped the problem? Instead of understanding existing text, what if we asked an AI to generate new text?
In 2019, OpenAI released GPT-2 and showed that the answer was surprisingly powerful: Just predict the next word.
The Big Idea
GPT-2 stands for Generative Pre-trained Transformer 2 (the sequel to GPT-1).
The core idea is deceptively simple:
If you're really good at predicting what word comes next, you can write fluently.
Think about how you read. When you see:
"The pizza was cold, so I put it in the..."
→ Next word is probably: oven
You don't consciously think about this. Your brain just predicts what's likely to come next based on patterns it learned from years of reading.
GPT-2 does exactly the same thing. It reads millions of texts and learns: "When I see these words, what word typically comes next?" Once trained, you can start it with any beginning and let it keep predicting, generating text that reads naturally.
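This next-word-from-patterns idea can be sketched with a toy bigram model in Python. The counts over a tiny made-up corpus stand in for GPT-2's 1.5 billion learned parameters; the objective (predict the most likely next word) is the same:

```python
from collections import Counter, defaultdict

# Toy illustration of next-word prediction (not GPT-2 itself):
# count which word follows which in a small corpus, then predict
# the most frequent follower.
corpus = (
    "the pizza was cold so i put it in the oven . "
    "the soup was cold so i put it in the microwave . "
    "the pizza was hot so i put it on the table ."
).split()

followers = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    followers[word][nxt] += 1

def predict_next(word):
    """Return the most frequent next word seen in the corpus."""
    return followers[word].most_common(1)[0][0]

print(predict_next("pizza"))  # "was" (follows "pizza" twice)
print(predict_next("cold"))   # "so"
```

GPT-2 replaces this simple lookup with a deep Transformer that conditions on the entire preceding context, but it is trained on exactly this kind of "what comes next?" signal.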
The Problem GPT-2 Solved
The BERT Limitation
BERT was brilliant at understanding, but it couldn't generate text. You couldn't ask BERT to write a story or essay. It could only analyze or classify existing text.
This is because BERT reads in both directions—it sees the whole sentence, including words that haven't "happened yet" in the reading process. This makes it great for understanding, but useless for generation (you can't look ahead when writing).
The Solution: Decoder-Only Transformers
GPT-2 used only the decoder side of the Transformer architecture (from our Attention article). Remember the two-sided Transformer?
The decoder can only see words that came before it—perfect for writing one word at a time, without cheating by looking ahead.
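This "no looking ahead" rule is enforced with a causal attention mask. A minimal sketch of the mask itself, assuming a boolean representation (real implementations add negative infinity to the blocked attention scores before the softmax):

```python
# Causal (look-back-only) mask used by decoder-only models:
# position i may attend to positions 0..i, never to later ones.
def causal_mask(n):
    """mask[i][j] is True iff position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

for row in causal_mask(4):
    print(["x" if allowed else "." for allowed in row])
# Position 0 sees only itself; position 3 sees everything before it.
```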
How Does GPT-2 Learn?
GPT-2 uses a single, elegant training objective:
Predict the Next Word (Language Modeling)
During training, GPT-2 sees text and learns to predict what comes next:
Input text: "The cat sat on the"
Prediction: [predict next word]
Target: "mat"
Input text: "The cat sat on the mat"
Prediction: [predict next word]
Target: "." (period)
Input text: "The cat sat on the mat."
Prediction: [predict next word]
Target: "It" (next sentence)
By seeing billions of examples like these, GPT-2 learns the statistical patterns of language.
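In practice, every position in a training sequence becomes a prediction example at once: the targets are simply the inputs shifted left by one token. A sketch, with whole words standing in for GPT-2's subword tokens:

```python
# Every position is a training example: the target sequence is the
# input sequence shifted left by one token ("teacher forcing").
tokens = ["The", "cat", "sat", "on", "the", "mat", "."]

inputs  = tokens[:-1]   # what the model sees
targets = tokens[1:]    # what it must predict at each position

for context_len, target in enumerate(targets, start=1):
    context = " ".join(inputs[:context_len])
    print(f"{context!r:30} -> {target!r}")
```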
Why This Works
By predicting the next word millions of times on diverse text, GPT-2 learns:
- Grammar — Which word combinations are grammatical
- Semantics — What words relate to each other (cats, dogs, animals)
- Facts — When trained on Wikipedia, it learns facts about the world
- Style — Formal writing vs. casual, poetry vs. prose
- Common sense — Why certain sequences make sense
Pre-training on Internet Scale
The Game-Changer: WebText
Unlike BERT, which trained on Wikipedia and books, GPT-2 trained on WebText—a dataset of 40GB of text from web pages.
What's the difference? The internet contains:
- Diverse writing styles (blogs, forums, news, Reddit)
- Modern language and current events
- Niche knowledge and interests
- Different forms of humor and sarcasm
This diversity is crucial. It exposes GPT-2 to more varied language than any previous model.
Training Specs
| Aspect | Value |
|---|---|
| Training data | WebText (40GB) |
| Model size | 1.5 billion parameters |
| Training duration | Weeks of large-scale compute (exact setup not disclosed in the paper) |
| Vocabulary | 50,257 BPE tokens |
| Context window | 1,024 tokens (~750 words) |
For perspective: GPT-2's 1.5 billion parameters dwarfed BERT (340M parameters max).
The Model Sizes: Small to XL
OpenAI released GPT-2 in stages, from small to large:
| Size | Parameters | Use Case |
|---|---|---|
| Small | 124M | Fast, lightweight, educational |
| Medium | 355M | Balance of speed and quality |
| Large | 774M | High-quality output |
| XL | 1.5B | Best results (slow, compute-intensive) |
Interestingly, even the "small" 124M parameter version could write remarkably coherent text. Bigger wasn't necessary for capability—but it helped.
How Generation Works (Autoregressive Decoding)
When you ask GPT-2 to write, here's what happens:
Step 1: Start with a Prompt
Prompt: "Once upon a time"
Step 2: Predict the Next Word (One at a Time)
GPT-2 assigns a probability to every token in its vocabulary and picks one, typically the most likely or a random sample from the distribution.
Step 3: Repeat (Autoregressive Generation)
"Once upon a time"
→ "Once upon a time there"
→ "Once upon a time there was"
→ "Once upon a time there was a"
→ "Once upon a time there was a small"
→ "Once upon a time there was a small village"
→ [continue...]
GPT-2 generates word-by-word, using its own previous outputs as input. This is called autoregressive decoding.
Key insight: Each word is predicted one at a time, conditioned only on the words that came before it. This means:
- ✅ Can generate text of any length
- ❌ Cannot revise words already written
- ❌ Cannot look ahead to plan
What Could GPT-2 Actually Do?
Generation Tasks
1. Story Writing
Prompt: "It was a dark and stormy night when..."
Output: GPT-2 continues with a coherent narrative:
"...the lights went out. I fumbled for my flashlight, but my hands were shaking too badly to grip it. Outside, the wind howled like a beast in pain, rattling the windows as if trying to break through..."
2. News Article Writing
Prompt: "Breaking: New study finds coffee reduces cancer risk"
Output: GPT-2 generates a plausible news article with citations and quotes (sometimes completely fabricated—see "hallucinations" below).
3. Code Generation
Prompt: "def fibonacci(n):\n"
Output: GPT-2 can generate reasonable Python code:
```python
def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)
```
4. Question Answering (With Prompt Engineering)
Prompt: "Q: What is the capital of France? A:"
Output: "Paris"
By framing questions in a format the model has seen in its training data, GPT-2 can answer without explicit fine-tuning.
5. Summarization
Prompt: "[Article text] TL;DR:"
Output: GPT-2 generates a summary.
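These prompts are just ordinary strings; the task is specified entirely by their format. A sketch of such templates (illustrative helpers, not from the GPT-2 paper):

```python
# Zero-shot prompting: the prompt's format tells the model which task
# to perform, because it has seen similar patterns during training.
def qa_prompt(question):
    return f"Q: {question} A:"

def tldr_prompt(article):
    return f"{article} TL;DR:"

print(qa_prompt("What is the capital of France?"))
print(tldr_prompt("[Article text]"))
```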
The "Multitask" Insight
Here's the key discovery: The same model could do all these tasks without being explicitly trained on them.
BERT required fine-tuning for each task. GPT-2 just needed the right prompt and it would adapt. This is called zero-shot capability (few-shot prompting, where examples are included in the prompt, became the headline result of GPT-3).
The Dark Side: Hallucinations and Bias
Hallucination (Making Stuff Up)
GPT-2 is trained to predict plausible text, not necessarily truthful text. If you ask it for facts, it might confidently provide completely fabricated information.
Example:
Q: "In what year was the Eiffel Tower built?"
GPT-2 answer: "1889" ✓ Correct
Q: "What are the top 5 causes of death in Antarctica?"
GPT-2 answer: [Makes up completely fictional statistics]
✗ Confidently false
GPT-2 has no mechanism to check facts. It just predicts plausible-sounding text.
Bias in Training Data
GPT-2 trained on internet text, which contains all of humanity's biases. When asked to complete prompts about gender, race, or nationality, GPT-2 often outputs stereotypes.
This led to important debates about:
- Should powerful models be released publicly?
- How do we mitigate bias in training data?
- What are the societal implications?
Why GPT-2 Was Revolutionary
1. Generation is Powerful
BERT showed that understanding is learnable. GPT-2 showed that generation is too—and it's arguably more impressive because humans perceive generation as more "intelligent."
2. The Decoder-Only Architecture Dominates
Despite BERT's success, it turned out the decoder-only architecture (GPT-style) was more flexible:
- Can be used for understanding (with prompts)
- Can be used for generation (naturally)
- Can be used for reasoning (with chain-of-thought)
Every major model after GPT-2—GPT-3, GPT-4, Claude, Gemini—uses decoder-only architecture.
3. Scale Matters More Than Task-Specific Design
BERT needed fine-tuning for each task. GPT-2 showed you could just make the model bigger and handle diverse tasks with clever prompting.
This led to the scaling hypothesis:
Bigger model + More data = Better performance across the board
This hypothesis shaped the entire trajectory of AI development.
4. It's "Unsupervised"
The key word in the paper's title is "unsupervised." GPT-2 required:
- No labeled data (unlike supervised learning)
- No task-specific annotations
- Just raw text from the internet
This made it scalable to unprecedented levels.
GPT-2 vs. BERT: The Fundamental Difference
Both are Transformers, both are pre-trained on massive data, so what's fundamentally different?
| Aspect | BERT | GPT-2 |
|---|---|---|
| Task | Understanding | Generation |
| Training | Fill in blanks + relationships | Predict next word |
| Can see | Both directions | Only previous words |
| Strength | Classification, analysis | Writing, creation |
| Fine-tuning | Required for tasks | Optional (prompting works) |
| Architecture | Encoder only | Decoder only |
The Concern: When Does GPT-2 Stop Being Safe?
When OpenAI released GPT-2, they made an unusual decision: They didn't release the full model at first.
They released:
- Small model (124M) ✓
- Medium model (355M) ✓
- Large model (774M) ✓
- XL model (1.5B) — Not immediately released
This sparked debate about AI safety and responsibility:
Question: If a model can generate convincing text, could it be misused to create:
- Disinformation campaigns?
- Fake news at scale?
- Spam or phishing?
OpenAI's answer: "Probably. But the research community should study this, so we'll release it eventually."
In November 2019, nine months after the initial announcement, they released the full model, and the world didn't end. But the debate continues: How responsible should AI labs be?
GPT-2's Legacy: The Lineage
GPT-2 wasn't the end; it was the beginning of a lineage: GPT-2 → GPT-3 → GPT-4 and beyond.
Every model learned from GPT-2's insights:
- ✅ Decoder-only architecture
- ✅ Scaling laws matter
- ✅ Prompting > Fine-tuning (for capable models)
- ✅ Diverse pre-training data works
Numbers and Citations
| Metric | Value |
|---|---|
| Published | February 2019 |
| Organization | OpenAI |
| Model sizes | 4 (Small, Medium, Large, XL) |
| Largest model | 1.5 billion parameters |
| Training data | WebText (40 GB) |
| Evaluated on | 8 different benchmarks |
| Unique finding | Zero-shot multitask learning |
| Citations | 50,000+ (highly influential) |
| Commercial impact | Led to GPT-3 → ChatGPT → AI boom |
Why You Should Care Today (2026)
GPT-2 is from 2019. Why does it matter in 2026?
- It's still used — Many applications use small GPT-2 models for efficiency
- It established the playbook — Decoder-only + scaling + prompting is now standard
- It started the AI revolution — Without GPT-2, no GPT-3, no ChatGPT, no AI boom
- It raised important questions — About AI safety, bias, and responsible release
When you use ChatGPT, Gemini, or Claude today, you're using descendants of GPT-2's architecture and training approach.
The Broader Lesson
BERT taught us: AI can understand language deeply.
GPT-2 taught us: AI can generate language fluently.
Together, they showed that:
- The Transformer architecture is universal
- Pre-training on diverse data works
- Scale is a reliable path to capability
- The same model can handle many tasks with clever prompting
These insights directly led to the AI revolution we're living through in 2026.
Comparing the Series So Far
| Paper | Year | Focus | Innovation | Impact |
|---|---|---|---|---|
| Attention | 2017 | Architecture | Transformer | The foundation |
| BERT | 2018 | Understanding | Bidirectional pre-training | NLP boom |
| GPT-2 | 2019 | Generation | Decoder-only scaling | ChatGPT path |
Further Reading
- Previous articles in this series: Attention Is All You Need; BERT: How AI Learned to Truly Read
- The original paper: Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. openai.com/blog
- Interactive playground: Run small GPT-2 models on huggingface.co
- Next in the series: GPT-3 (2020) — Few-shot learning and emergence
Last Updated: March 27, 2026
Author: RESEARCHER
Category: Research / Explainer
Difficulty: Beginner-friendly
Series: AI Papers Explained — Part 3 of 3 (Foundation Era)