GPT-3: The Model That Proved Bigger Could Be Smarter
In 2020, OpenAI scaled GPT-2 by over 100×, to 175 billion parameters, and discovered something unexpected: the model could perform tasks it was never trained on, just by reading a few examples in its prompt. "Language Models are Few-Shot Learners" didn't just set new benchmarks. It changed what we thought language models could do.
Previously...
In our earlier articles, we traced the arc of modern AI:
- Attention Is All You Need → The Transformer architecture (2017)
- BERT → Teaching AI to understand language (2018)
- GPT-2 → Teaching AI to generate language (2019)
GPT-2 showed that predicting the next word, at scale, could produce remarkably fluent text. OpenAI even delayed the full release over concerns about misuse.
But GPT-2 had 1.5 billion parameters. What would happen if you made it 100 times bigger?
In May 2020, OpenAI answered that question with GPT-3, and the answer surprised everyone.
The Paper
Title: "Language Models are Few-Shot Learners"
Authors: Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 21 others at OpenAI
Published: May 28, 2020 (arXiv), presented at NeurIPS 2020
ArXiv: 2005.14165
Award: NeurIPS 2020 Best Paper Award
The Big Idea
GPT-3's core claim is deceptively simple:
If you make a language model big enough, it can learn new tasks from just a few examples, without any fine-tuning.
This is called in-context learning (or few-shot learning), and it was GPT-3's breakthrough discovery. You don't retrain the model. You don't update its weights. You just show it a few examples in the prompt, and it figures out the pattern.
Here's what that looks like:
Translate English to French:
sea otter → loutre de mer
peppermint → menthe poivrée
cheese → fromage
plaid shirt →
GPT-3 completes: chemise à carreaux
Nobody trained GPT-3 to be a translator. Nobody fine-tuned it on English-French pairs. It learned translation as a side effect of being trained to predict the next word on a massive dataset, and it could apply that knowledge with just a few demonstrations.
Why This Matters
Before GPT-3: One Model Per Task
Before GPT-3, the standard approach to NLP was:
- Pretrain a language model on generic text (like BERT or GPT-2)
- Fine-tune it on task-specific data (sentiment analysis, question answering, translation, etc.)
- Deploy that single-purpose model
This worked, but it was expensive and rigid. Want to do a new task? Collect labeled data, fine-tune again, deploy a new model. Every task required its own dataset and its own training run.
After GPT-3: One Model, Many Tasks
GPT-3 showed you could skip fine-tuning entirely. One model could handle translation, summarization, question answering, code generation, arithmetic, and more, just by changing the prompt.
This was a paradigm shift. Instead of engineering models, you could engineer prompts.
The Three Modes of Learning
The paper carefully distinguished three ways to use GPT-3, each requiring different amounts of information in the prompt:
Zero-Shot
Give GPT-3 only a task description. No examples at all.
Translate the following English text to French: "What time is it?"
This tests whether the model understands the task from the description alone.
One-Shot
Give GPT-3 one example, then ask it to perform the task.
Translate English to French:
hello → bonjour
What time is it? →
Few-Shot
Give GPT-3 several examples (typically 10 to 100, limited by the context window), then ask.
Translate English to French:
hello → bonjour
goodbye → au revoir
thank you → merci
What time is it? →
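The three modes differ only in how many demonstrations appear before the unfinished line. A minimal sketch, assuming the "input → output" layout from the translation examples above (the exact formatting varied by task in the paper, and `build_prompt` is a hypothetical helper, not an OpenAI API):

```python
# Sketch of the paper's three prompting modes as plain string templates.
# k controls the number of demonstrations: k=0 zero-shot, k=1 one-shot,
# k>=2 few-shot. The model is asked to complete the final unfinished line.

def build_prompt(description, examples, query, k):
    """Build a k-shot prompt from a task description and (input, output) pairs."""
    lines = [description]
    for src, tgt in examples[:k]:          # include at most k demonstrations
        lines.append(f"{src} → {tgt}")
    lines.append(f"{query} →")             # the model completes this line
    return "\n".join(lines)

examples = [("hello", "bonjour"), ("goodbye", "au revoir"), ("thank you", "merci")]

zero_shot = build_prompt("Translate English to French:", examples, "What time is it?", k=0)
few_shot  = build_prompt("Translate English to French:", examples, "What time is it?", k=3)

print(few_shot)
```

No weights change between these settings; the only variable is the prompt string itself.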
The paper's key finding: larger models benefited far more from few-shot examples than smaller ones. A small model might improve marginally with examples. GPT-3 improved dramatically. This suggested that in-context learning was an emergent ability that appeared with scale.
The Scale of GPT-3
By the Numbers
| Property | GPT-2 | GPT-3 |
|---|---|---|
| Parameters | 1.5 billion | 175 billion |
| Transformer layers | 48 | 96 |
| Hidden dimension | 1600 | 12288 |
| Attention heads | 25 | 96 |
| Context window | 1024 tokens | 2048 tokens |
| Training data | ~40 GB (WebText) | ~570 GB (mixed) |
| Training tokens | ~10 billion | 300 billion |
GPT-3 was approximately 117× larger than GPT-2 in parameter count. At the time, it was the largest non-sparse language model ever trained: 10× bigger than Microsoft's Turing-NLG (17 billion parameters), which had held the record just months earlier.
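The table's parameter counts can be sanity-checked with a standard back-of-envelope rule for decoder-only Transformers: each layer carries roughly 4·d² attention weights and 8·d² MLP weights, so non-embedding parameters ≈ 12 · n_layers · d_model². This is a rough estimate that ignores embeddings, biases, and layer norms:

```python
# Approximate non-embedding parameter count of a decoder-only Transformer:
# params ≈ 12 * n_layers * d_model^2 (attention ~4d^2 + MLP ~8d^2 per layer).

def approx_params(n_layers, d_model):
    return 12 * n_layers * d_model ** 2

gpt2 = approx_params(48, 1600)     # ~1.5 billion, matching GPT-2
gpt3 = approx_params(96, 12288)    # ~174 billion, close to GPT-3's 175B

print(f"GPT-2 ≈ {gpt2 / 1e9:.1f}B, GPT-3 ≈ {gpt3 / 1e9:.0f}B")
```

The estimate lands within about 1% of the headline 175B figure, which mostly reflects the embedding matrices and other terms the rule leaves out.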
Training Data
GPT-3 was trained on a diverse mix of internet text:
| Dataset | Tokens | Weight in Training |
|---|---|---|
| Common Crawl (filtered) | 410 billion | 60% |
| WebText2 | 19 billion | 22% |
| Books1 | 12 billion | 8% |
| Books2 | 55 billion | 8% |
| Wikipedia | 3 billion | 3% |
A critical detail: the weights don't match the token counts. Wikipedia, despite being only 3 billion tokens, was sampled at 3% of training, meaning the model saw Wikipedia content roughly 3.4 times during training. Common Crawl, with 410 billion tokens, was sampled at 60%, meaning the model saw less than half of it. This upsampling of high-quality sources was a deliberate design choice.
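The "how many times was each dataset seen" figure follows directly from the table: effective epochs = (sampling weight × total training tokens) / dataset tokens. A quick check, using the article's rounded token counts (so results differ slightly from the exact epoch figures in the paper's Table 2.2):

```python
# Effective epochs per dataset at 300B total training tokens:
# epochs = (sampling_weight * total_tokens) / dataset_tokens.
# Token counts are rounded, so results are approximate.

TOTAL = 300e9
datasets = {                      # name: (tokens, sampling weight)
    "Common Crawl": (410e9, 0.60),
    "WebText2":     (19e9,  0.22),
    "Books1":       (12e9,  0.08),
    "Books2":       (55e9,  0.08),
    "Wikipedia":    (3e9,   0.03),
}

epochs = {name: w * TOTAL / toks for name, (toks, w) in datasets.items()}
for name, e in epochs.items():
    print(f"{name}: seen ~{e:.2f} times")
```

Common Crawl comes out at ~0.44 epochs (less than half seen), while the small, high-quality sources were repeated several times over.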
The Common Crawl data wasn't used raw. OpenAI filtered it using a classifier trained to distinguish high-quality text (similar to WebText, their curated dataset) from low-quality web scrapes. They also performed fuzzy deduplication using MinHash to remove near-duplicate documents. This filtering reduced the raw Common Crawl from roughly 45 TB of compressed plaintext down to roughly 570 GB of cleaned text.
Training Compute
Training GPT-3 required approximately 3.14 × 10²³ FLOPs (floating-point operations). For context:
- That's 3,640 petaflop/s-days, meaning a computer performing one quadrillion operations per second would need 3,640 days (about 10 years) to complete the training
- Lambda Labs estimated the cost at roughly $4.6 million using V100 GPUs at cloud pricing (2020 rates)
- Training was done on NVIDIA V100 GPUs on a high-bandwidth cluster provided by Microsoft
- All models in the GPT-3 family were trained for 300 billion tokens
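These compute numbers follow from the widely used training-cost approximation FLOPs ≈ 6·N·D, where N is the parameter count and D the number of training tokens (the factor of 6 covers the forward and backward passes). A quick check against the figures above:

```python
# Training-compute estimate: FLOPs ≈ 6 * N * D
# (N = parameters, D = training tokens; 6 covers forward + backward passes).

N = 175e9          # parameters
D = 300e9          # training tokens

flops = 6 * N * D                      # ≈ 3.15e23, matching the paper's 3.14e23
pfs_days = flops / (1e15 * 86400)      # convert to petaflop/s-days

print(f"{flops:.2e} FLOPs ≈ {pfs_days:,.0f} petaflop/s-days")
```

The rule of thumb reproduces both headline numbers (3.14 × 10²³ FLOPs and ~3,640 petaflop/s-days) to within rounding.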
The Results: What GPT-3 Could Do
Language Understanding (LAMBADA)
The LAMBADA benchmark tests a model's ability to predict the last word of a passage, requiring long-range contextual understanding. Previous models struggled hereâthe task was considered very difficult.
GPT-3 results:
- Zero-shot: 76.2% accuracy (previous state of the art: 68%)
- One-shot: 72.5% accuracy
- Few-shot: 86.4% accuracy
The few-shot result surpassed the previous state of the art by 18 percentage points, without any task-specific training.
Question Answering (TriviaQA)
On TriviaQA, a challenging open-domain question-answering benchmark:
- Zero-shot: 64.3% accuracy
- One-shot: 68.0% accuracy
- Few-shot: 71.2% accuracy
GPT-3 in the few-shot setting outperformed a fine-tuned T5-11B model, one that had been specifically trained on the task.
General Language Understanding (SuperGLUE)
On the SuperGLUE benchmark (a suite of challenging language understanding tasks):
- Few-shot GPT-3: 71.8 points
- Fine-tuned BERT-Large (2018): 69.0 points
- Human baseline: 89.8 points
GPT-3 exceeded BERT's fine-tuned performance without any fine-tuning at all. It still fell well short of human performance, but the trend was clear.
Arithmetic
Perhaps the most surprising result: GPT-3 could do arithmetic, despite never being explicitly trained on math:
| Task | GPT-3 (few-shot) |
|---|---|
| 2-digit addition | 100% |
| 3-digit addition | 80.2% |
| 4-digit addition | 25.5% |
| 5-digit addition | 9.3% |
| 2-digit subtraction | 98.9% |
| 2-digit multiplication | 29.2% |
The model learned arithmetic from patterns in text. It could reliably add two-digit numbers and even handle some three-digit addition, all from next-word prediction. Performance degraded with complexity, but the fact it worked at all was remarkable.
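The paper evaluated arithmetic with natural-language prompts in the form "Q: What is X plus Y? A: Z". A minimal sketch of generating such few-shot evaluation prompts (`addition_prompt` is a hypothetical helper; the specific demonstrations GPT-3 saw varied):

```python
# Few-shot arithmetic prompts in the style the paper used
# ("Q: What is X plus Y? A: Z"), ending with an unanswered question
# for the model to complete.

import random

def addition_prompt(n_digits, n_examples=3, seed=0):
    """Return a k-shot addition prompt and the expected final answer."""
    rng = random.Random(seed)
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    lines = []
    for _ in range(n_examples):
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        lines.append(f"Q: What is {a} plus {b}? A: {a + b}")
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    lines.append(f"Q: What is {a} plus {b}? A:")   # model must complete this
    return "\n".join(lines), a + b

prompt, answer = addition_prompt(2)
print(prompt)
```

Scoring a model on this task is then just string comparison: did the completion start with the correct sum?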
News Article Generation
In a particularly striking test, GPT-3 generated news articles that humans had difficulty distinguishing from real ones. In an experiment with 80 participants, human evaluators could only identify GPT-3-generated articles 52% of the time, barely above random chance (50%).
How In-Context Learning Works
The Mystery
In-context learning is strange if you think about it. During few-shot prompting, GPT-3's weights don't change. No gradients are computed. No parameters are updated. The model just reads the examples and somehow performs the task.
How?
The paper proposed that GPT-3 functions as a kind of meta-learner. During pretraining on diverse internet text, it encountered millions of implicit "tasks": text that switches between languages, text that follows question-answer patterns, text that demonstrates cause and effect. The model didn't just learn language; it learned how to recognize and adapt to patterns.
Think of it like a musician who has played thousands of songs. Show them a few bars of a new song in a style they know, and they can improvise along, not because they've learned that specific song, but because they've internalized the patterns that make music work.
Scale Is the Key Ingredient
The paper tested eight different model sizes, from 125 million to 175 billion parameters. A consistent finding emerged:
- Zero-shot performance improved steadily with model size
- Few-shot performance improved much faster with model size
This means larger models don't just perform better; they learn more efficiently from examples. In-context learning is an emergent property of scale. Smaller models simply don't exhibit it in the same way.
The paper's Figure 1.3 (aggregating across 42 benchmarks) showed this strikingly: the gap between zero-shot and few-shot performance widens as models get bigger. A 125M-parameter model barely benefits from examples. A 175B-parameter model transforms.
The Model Family
GPT-3 wasn't actually one model. It was a family of eight models, ranging in size:
| Model | Parameters | API Name |
|---|---|---|
| GPT-3 Small | 125M | – |
| GPT-3 Medium | 350M | Ada |
| GPT-3 Large | 760M | – |
| GPT-3 XL | 1.3B | Babbage |
| GPT-3 2.7B | 2.7B | – |
| GPT-3 6.7B | 6.7B | Curie |
| GPT-3 13B | 13B | – |
| GPT-3 175B | 175B | Davinci |
When people say "GPT-3," they typically mean Davinci, the full 175B model. (OpenAI never officially confirmed the API-name-to-size mapping; these are widely cited estimates.) The smaller variants were important for the paper's scaling analysis and were offered through the API at lower cost.
In September 2020, Microsoft announced it had licensed GPT-3 exclusively, though others could still access the model through OpenAI's API.
The Limitations GPT-3 Acknowledged
The paper was unusually candid about GPT-3's shortcomings:
1. Text Generation Coherence
While GPT-3 generated impressively fluent text, longer passages would often lose coherence, repeat themselves, or contradict earlier statements. The 2,048-token context window meant it literally couldn't "remember" anything beyond roughly 1,500 words.
2. Common-Sense Reasoning
GPT-3 struggled with certain types of reasoning that feel obvious to humans:
Q: If I put cheese in the fridge in the morning,
where is the cheese at lunchtime?
GPT-3 sometimes: "On the counter" or "In the store"
Physical reasoning, temporal reasoning, and causal chains remained weak spots.
3. Comparison Tasks
On benchmarks requiring comparison between two texts (like determining if two sentences mean the same thing), GPT-3 performed poorly. The paper attributed this to the autoregressive architectureâpredicting the next token is fundamentally different from comparing two sequences.
4. Societal Risks
The paper devoted an entire section to potential harms:
- Misinformation: GPT-3 could generate convincing fake news
- Bias: The model reflected biases present in its training data, including gender, racial, and religious biases
- Spam and phishing: Automated generation of persuasive text at scale
- Energy use: Training required enormous compute resources
The authors specifically studied bias in GPT-3's outputs, finding, for example, that the model associated certain occupations with specific genders and produced varying sentiment when prompted about different religions and races. They published these findings as a call for further research on mitigation.
Why GPT-3 Changed Everything
1. It Validated the Scaling Hypothesis
Before GPT-3, the idea that "just make it bigger" would lead to qualitatively new capabilities was debated. GPT-3 settled the debate. In-context learning didn't exist in smaller models; it emerged at scale. This sent every major AI lab racing to train larger models.
2. It Created the Prompt Engineering Paradigm
If you've ever carefully crafted a prompt for ChatGPT, you're using techniques that GPT-3 made possible. The paper showed that how you ask matters as much as what you ask. This spawned an entirely new discipline: prompt engineering.
3. It Made the API Model Viable
GPT-3 was the first major language model released as an API (June 11, 2020). Instead of every company training their own model, they could call OpenAI's API. This fundamentally changed the AI industry's business modelâfrom "train your own" to "pay per token."
4. It Set the Stage for ChatGPT
GPT-3 proved the foundation worked. The two subsequent breakthroughs, instruction tuning (FLAN) and RLHF (InstructGPT), made GPT-3 actually useful as an assistant. ChatGPT, released in November 2022, was essentially GPT-3.5 with instruction tuning and RLHF applied. Without GPT-3 proving that scale unlocks capability, that pipeline would never have been built.
GPT-3 in the Series Arc
Our series has traced a clear progression:
| Step | Paper | What It Showed |
|---|---|---|
| Architecture | Attention Is All You Need (2017) | Transformers work |
| Understanding | BERT (2018) | Pretraining + fine-tuning works |
| Generation | GPT-2 (2019) | Next-word prediction generates fluent text |
| Scale | GPT-3 (2020) | Scale unlocks in-context learning |
| Instruction | FLAN (2021) | Instruction format improves task following |
| Alignment | InstructGPT (2022) | Human feedback aligns model behavior |
GPT-3 is the pivot point. Everything before it was about building better architectures and training techniques. Everything after it was about making large models more useful, more aligned, and more accessible.
Without GPT-3, we might still be fine-tuning BERT variants for individual tasks. GPT-3 showed that a single, massive model could be the foundation for everything.
Practical Takeaways
What GPT-3 Got Right
- Scale matters: 175B parameters enabled genuinely new capabilities
- Data diversity matters: Training on books, web pages, and Wikipedia created broad knowledge
- Evaluation flexibility: Testing zero-shot, one-shot, and few-shot gave a richer picture of capability
- Transparency about limitations: The paper's honesty about bias and risks set a standard
What GPT-3 Got Wrong (in hindsight)
- Undertrained by modern standards: Scaling-laws research later showed GPT-3 was too large for its data budget; the Chinchilla paper suggested a 175B model should see ~3.5 trillion tokens, not 300 billion
- Context window too small: 2,048 tokens seems tiny by today's standards (modern models handle 100K+)
- No instruction following: Raw GPT-3 was powerful but hard to control; it took FLAN and InstructGPT to make it actually useful
- Closed source: Unlike GPT-2, GPT-3's weights were never publicly released, sparking debate about open research
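The undertraining point is easy to verify with Chinchilla's rule of thumb that compute-optimal training uses roughly 20 tokens per parameter:

```python
# Chinchilla rule of thumb: compute-optimal training uses ~20 tokens
# per parameter. Compare GPT-3's actual token budget against that.

params = 175e9
optimal_tokens = 20 * params              # ≈ 3.5e12 (3.5 trillion)
actual_tokens = 300e9

print(f"Optimal: {optimal_tokens / 1e12:.1f}T tokens; actual: {actual_tokens / 1e12:.1f}T")
print(f"GPT-3 saw only {actual_tokens / optimal_tokens:.0%} of the Chinchilla-optimal budget")
```

By this measure GPT-3 was trained on under a tenth of its compute-optimal token budget, which is exactly the gap later models closed.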
Series Navigation
- Attention Is All You Need: The Paper That Changed AI
- BERT: How AI Learned to Truly Read
- GPT-2: How AI Learned to Write
- FLAN: How AI Learned to Follow Instructions
- InstructGPT: How AI Learned What Humans Actually Want
- Chain-of-Thought: How AI Learned to Show Its Work
- Scaling Laws: Why Bigger Isn't Always Better
- Mixture of Experts: How AI Learned to Cheat the Scaling Laws
- GPT-3: The Model That Proved Bigger Could Be Smarter ← You are here
Last Updated: April 5, 2026
Author: RESEARCHER
Category: Research
Difficulty: Intermediate
Paper: Brown et al., "Language Models are Few-Shot Learners" (arXiv:2005.14165, May 2020, NeurIPS 2020 Best Paper Award)