GPT-2: How AI Learned to Write
A beginner-friendly explanation of GPT-2 (2019), the paper that showed AI could write coherent, creative text by simply predicting the next word. Part 3 of our AI Papers Explained series.
Previously...
In our first two articles, we explored:
- Attention Is All You Need — The Transformer architecture that pays attention to what matters
- BERT: How AI Learned to Truly Read — Using Transformers to understand language deeply
Both articles focused on understanding—analyzing text, answering questions, classifying sentiment.
But what if we flipped the problem? Instead of understanding existing text, what if we asked an AI to generate new text?
In 2019, OpenAI released GPT-2 and showed that the answer was surprisingly powerful: Just predict the next word.
The Big Idea
GPT-2 stands for Generative Pre-trained Transformer 2 (the sequel to GPT-1).
The core idea is deceptively simple:
If you're really good at predicting what word comes next, you can write fluently.
Think about how you read. When you see:
"The pizza was cold, so I put it in the..."
→ Next word is probably: oven
You don't consciously think about this. Your brain just predicts what's likely to come next based on patterns it learned from years of reading.
GPT-2 does exactly the same thing. It reads millions of texts and learns: "When I see these words, what word typically comes next?" Once trained, you can start it with any beginning and let it keep predicting, generating text that reads naturally.
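This next-word-from-patterns idea can be sketched with a toy bigram model in Python. The counts over a tiny made-up corpus stand in for GPT-2's 1.5 billion learned parameters; the objective (predict the most likely next word) is the same:

```python
from collections import Counter, defaultdict

# Toy illustration of next-word prediction (not GPT-2 itself):
# count which word follows which in a small corpus, then predict
# the most frequent follower.
corpus = (
    "the pizza was cold so i put it in the oven . "
    "the soup was cold so i put it in the microwave . "
    "the pizza was hot so i put it on the table ."
).split()

followers = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    followers[word][nxt] += 1

def predict_next(word):
    """Return the most frequent next word seen in the corpus."""
    return followers[word].most_common(1)[0][0]

print(predict_next("pizza"))  # "was" (follows "pizza" twice)
print(predict_next("cold"))   # "so"
```

GPT-2 replaces this simple lookup with a deep Transformer that conditions on the entire preceding context, but it is trained on exactly this kind of "what comes next?" signal.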
The Problem GPT-2 Solved
The BERT Limitation
BERT was brilliant at understanding, but it couldn't generate text. You couldn't ask BERT to write a story or essay. It could only analyze or classify existing text.
This is because BERT reads in both directions—it sees the whole sentence, including words that haven't "happened yet" in the reading process. This makes it great for understanding, but useless for generation (you can't look ahead when writing).
The Solution: Decoder-Only Transformers
GPT-2 used only the decoder side of the Transformer architecture (from our Attention article). Remember the two-sided Transformer?
The decoder can only see words that came before it—perfect for writing one word at a time, without cheating by looking ahead.
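This "no looking ahead" rule is enforced with a causal attention mask. A minimal sketch of the mask itself, assuming a boolean representation (real implementations add negative infinity to the blocked attention scores before the softmax):

```python
# Causal (look-back-only) mask used by decoder-only models:
# position i may attend to positions 0..i, never to later ones.
def causal_mask(n):
    """mask[i][j] is True iff position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

for row in causal_mask(4):
    print(["x" if allowed else "." for allowed in row])
# Position 0 sees only itself; position 3 sees everything before it.
```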
How Does GPT-2 Learn?
GPT-2 uses a single, elegant training objective:
Predict the Next Word (Language Modeling)
During training, GPT-2 sees text and learns to predict what comes next:
Input text: "The cat sat on the"
Prediction: [predict next word]
Target: "mat"
Input text: "The cat sat on the mat"
Prediction: [predict next word]
Target: "." (period)
Input text: "The cat sat on the mat."
Prediction: [predict next word]
Target: "It" (next sentence)
By seeing billions of examples like these, GPT-2 learns the statistical patterns of language.
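In practice, every position in a training sequence becomes a prediction example at once: the targets are simply the inputs shifted left by one token. A sketch, with whole words standing in for GPT-2's subword tokens:

```python
# Every position is a training example: the target sequence is the
# input sequence shifted left by one token ("teacher forcing").
tokens = ["The", "cat", "sat", "on", "the", "mat", "."]

inputs  = tokens[:-1]   # what the model sees
targets = tokens[1:]    # what it must predict at each position

for context_len, target in enumerate(targets, start=1):
    context = " ".join(inputs[:context_len])
    print(f"{context!r:30} -> {target!r}")
```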
Why This Works
By predicting the next word millions of times on diverse text, GPT-2 learns:
- Grammar — Which word combinations are grammatical
- Semantics — What words relate to each other (cats, dogs, animals)
- Facts — When trained on Wikipedia, it learns facts about the world
- Style — Formal writing vs. casual, poetry vs. prose
- Common sense — Why certain sequences make sense
Pre-training on Internet Scale
The Game-Changer: WebText
Unlike BERT, which trained on Wikipedia and books, GPT-2 trained on WebText—a dataset of 40GB of text from web pages.
What's the difference? The internet contains:
- Diverse writing styles (blogs, forums, news, Reddit)
- Modern language and current events
- Niche knowledge and interests
- Different forms of humor and sarcasm
This diversity is crucial. It exposes GPT-2 to more varied language than any previous model.
Training Specs
| Aspect | Value |
|---|---|
| Training data | WebText (40GB) |
| Model size | 1.5 billion parameters |
| Training duration | Weeks of large-scale compute (exact setup not disclosed in the paper) |
| Vocabulary | 50,257 BPE tokens |
| Context window | 1,024 tokens (~750 words) |
For perspective: GPT-2's 1.5 billion parameters dwarfed BERT (340M parameters max).
The Model Sizes: Small to XL
OpenAI released GPT-2 in stages, from small to large:
| Size | Parameters | Use Case |
|---|---|---|
| Small | 124M | Fast, lightweight, educational |
| Medium | 355M | Balance of speed and quality |
| Large | 774M | High-quality output |
| XL | 1.5B | Best results (slow, compute-intensive) |
Interestingly, even the "small" 124M parameter version could write remarkably coherent text. Bigger wasn't necessary for capability—but it helped.
How Generation Works (Autoregressive Decoding)
When you ask GPT-2 to write, here's what happens:
Step 1: Start with a Prompt
Prompt: "Once upon a time"
Step 2: Predict the Next Word (One at a Time)
GPT-2 assigns a probability to every token in its vocabulary and picks one, typically the most likely or a random sample from the distribution.
Step 3: Repeat (Autoregressive Generation)
"Once upon a time"
→ "Once upon a time there"
→ "Once upon a time there was"
→ "Once upon a time there was a"
→ "Once upon a time there was a small"
→ "Once upon a time there was a small village"
→ [continue...]
GPT-2 generates word-by-word, using its own previous outputs as input. This is called autoregressive decoding.
Key insight: Each word is predicted one at a time, conditioned only on the words that came before it. This means:
- ✅ Can generate text of any length
- ❌ Cannot revise words already written
- ❌ Cannot look ahead to plan
What Could GPT-2 Actually Do?
Generation Tasks
1. Story Writing
Prompt: "It was a dark and stormy night when..."
Output: GPT-2 continues with a coherent narrative:
"...the lights went out. I fumbled for my flashlight, but my hands were shaking too badly to grip it. Outside, the wind howled like a beast in pain, rattling the windows as if trying to break through..."
2. News Article Writing
Prompt: "Breaking: New study finds coffee reduces cancer risk"
Output: GPT-2 generates a plausible news article with citations and quotes (sometimes completely fabricated—see "hallucinations" below).
3. Code Generation
Prompt: "def fibonacci(n):\n"
Output: GPT-2 can generate reasonable Python code:
```python
def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)
```
4. Question Answering (With Prompt Engineering)
Prompt: "Q: What is the capital of France? A:"
Output: "Paris"
By framing questions in a format the model has seen in its training data, GPT-2 can answer without explicit fine-tuning.
5. Summarization
Prompt: "[Article text] TL;DR:"
Output: GPT-2 generates a summary.
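These prompts are just ordinary strings; the task is specified entirely by their format. A sketch of such templates (illustrative helpers, not from the GPT-2 paper):

```python
# Zero-shot prompting: the prompt's format tells the model which task
# to perform, because it has seen similar patterns during training.
def qa_prompt(question):
    return f"Q: {question} A:"

def tldr_prompt(article):
    return f"{article} TL;DR:"

print(qa_prompt("What is the capital of France?"))
print(tldr_prompt("[Article text]"))
```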
The "Multitask" Insight
Here's the key discovery: The same model could do all these tasks without being explicitly trained on them.
BERT required fine-tuning for each task. GPT-2 just needed the right prompt and it would adapt. This is called zero-shot capability (few-shot prompting, where examples are included in the prompt, became the headline result of GPT-3).
The Dark Side: Hallucinations and Bias
Hallucination (Making Stuff Up)
GPT-2 is trained to predict plausible text, not necessarily truthful text. If you ask it for facts, it might confidently provide completely fabricated information.
Example:
Q: "In what year was the Eiffel Tower built?"
GPT-2 answer: "1889" ✓ Correct
Q: "What are the top 5 causes of death in Antarctica?"
GPT-2 answer: [Makes up completely fictional statistics]
✗ Confidently false
GPT-2 has no mechanism to check facts. It just predicts plausible-sounding text.
Bias in Training Data
GPT-2 trained on internet text, which contains all of humanity's biases. When asked to complete prompts about gender, race, or nationality, GPT-2 often outputs stereotypes.
This led to important debates about:
- Should powerful models be released publicly?
- How do we mitigate bias in training data?
- What are the societal implications?
Why GPT-2 Was Revolutionary
1. Generation is Powerful
BERT showed that understanding is learnable. GPT-2 showed that generation is too—and it's arguably more impressive because humans perceive generation as more "intelligent."
2. The Decoder-Only Architecture Dominates
Despite BERT's success, it turned out the decoder-only architecture (GPT-style) was more flexible:
- Can be used for understanding (with prompts)
- Can be used for generation (naturally)
- Can be used for reasoning (with chain-of-thought)
Every major model after GPT-2—GPT-3, GPT-4, Claude, Gemini—uses decoder-only architecture.
3. Scale Matters More Than Task-Specific Design
BERT needed fine-tuning for each task. GPT-2 showed you could just make the model bigger and handle diverse tasks with clever prompting.
This led to the scaling hypothesis:
Bigger model + More data = Better performance across the board
This hypothesis shaped the entire trajectory of AI development.
4. It's "Unsupervised"
The key word in the paper's title is "unsupervised." GPT-2 required:
- No labeled data (unlike supervised learning)
- No task-specific annotations
- Just raw text from the internet
This made it scalable to unprecedented levels.
GPT-2 vs. BERT: The Fundamental Difference
Both are Transformers, both are pre-trained on massive data, so what's fundamentally different?
| Aspect | BERT | GPT-2 |
|---|---|---|
| Task | Understanding | Generation |
| Training | Fill in blanks + relationships | Predict next word |
| Can see | Both directions | Only previous words |
| Strength | Classification, analysis | Writing, creation |
| Fine-tuning | Required for tasks | Optional (prompting works) |
| Architecture | Encoder only | Decoder only |
The Concern: When Does GPT-2 Stop Being Safe?
When OpenAI released GPT-2, they made an unusual decision: They didn't release the full model at first.
They released:
- Small model (124M) ✓
- Medium model (355M) ✓
- Large model (774M) ✓
- XL model (1.5B) — Not immediately released
This sparked debate about AI safety and responsibility:
Question: If a model can generate convincing text, could it be misused to create:
- Disinformation campaigns?
- Fake news at scale?
- Spam or phishing?
OpenAI's answer: "Probably. But the research community should study this, so we'll release it eventually."
In November 2019, nine months after the initial announcement, they released the full model, and the world didn't end. But the debate continues: How responsible should AI labs be?
GPT-2's Legacy: The Lineage
GPT-2 wasn't the end; it was the beginning of a lineage: GPT-2 → GPT-3 → GPT-4 and beyond.
Every model learned from GPT-2's insights:
- ✅ Decoder-only architecture
- ✅ Scaling laws matter
- ✅ Prompting > Fine-tuning (for capable models)
- ✅ Diverse pre-training data works
Numbers and Citations
| Metric | Value |
|---|---|
| Published | February 2019 |
| Organization | OpenAI |
| Model sizes | 4 (Small, Medium, Large, XL) |
| Largest model | 1.5 billion parameters |
| Training data | WebText (40 GB) |
| Evaluated on | 8 different benchmarks |
| Unique finding | Zero-shot multitask learning |
| Citations | 50,000+ (highly influential) |
| Commercial impact | Led to GPT-3 → ChatGPT → AI boom |
Why You Should Care Today (2026)
GPT-2 is from 2019. Why does it matter in 2026?
- It's still used — Many applications use small GPT-2 models for efficiency
- It established the playbook — Decoder-only + scaling + prompting is now standard
- It started the AI revolution — Without GPT-2, no GPT-3, no ChatGPT, no AI boom
- It raised important questions — About AI safety, bias, and responsible release
When you use ChatGPT, Gemini, or Claude today, you're using descendants of GPT-2's architecture and training approach.
The Broader Lesson
BERT taught us: AI can understand language deeply.
GPT-2 taught us: AI can generate language fluently.
Together, they showed that:
- The Transformer architecture is universal
- Pre-training on diverse data works
- Scale is a reliable path to capability
- The same model can handle many tasks with clever prompting
These insights directly led to the AI revolution we're living through in 2026.
Comparing the Series So Far
| Paper | Year | Focus | Innovation | Impact |
|---|---|---|---|---|
| Attention | 2017 | Architecture | Transformer | The foundation |
| BERT | 2018 | Understanding | Bidirectional pre-training | NLP boom |
| GPT-2 | 2019 | Generation | Decoder-only scaling | ChatGPT path |
Further Reading
- Previous articles in this series: Attention Is All You Need; BERT: How AI Learned to Truly Read
- The original paper: Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. openai.com/blog
- Interactive playground: Run small GPT-2 models on huggingface.co
- Next in the series: GPT-3 (2020) — Few-shot learning and emergence
Last Updated: March 27, 2026
Author: RESEARCHER
Category: Research / Explainer
Difficulty: Beginner-friendly
Series: AI Papers Explained — Part 3 of 3 (Foundation Era)