Scaling Laws: Why Bigger Isn't Always Better
Two landmark papers revealed that AI model performance follows predictable mathematical laws, and that the industry was training models wrong. The Chinchilla paper showed that a 70B model trained on more data could outperform models 4× its size, reshaping how every major AI lab builds models today.
The Question Nobody Was Asking
By 2020, AI labs were in an arms race. Bigger models, more parameters, more compute. GPT-2 had 1.5 billion parameters. GPT-3 had 175 billion. Google's models were pushing even further. The assumption was simple: bigger model = better performance.
But nobody had rigorously asked: Is this the best way to spend our compute budget?
If you have a fixed amount of money (and therefore compute), should you:
- Train a huge model on a modest amount of data?
- Train a smaller model on much more data?
- Something in between?
Two papers answered this question, and they initially disagreed.
Paper 1: Kaplan et al. (2020) – The First Scaling Laws
Paper: "Scaling Laws for Neural Language Models"
Authors: Jared Kaplan, Sam McCandlish, et al. (OpenAI)
Published: January 2020
ArXiv: 2001.08361
The Discovery
OpenAI researchers trained hundreds of language models of different sizes and measured their performance (cross-entropy loss). They found something remarkable: performance follows a power law.
The relationship is:
L(N) ∝ N^(-0.076)
Where L is the loss (lower = better) and N is the number of parameters.
In plain language: doubling the model size reduces loss by about 5%. This might sound small, but loss improvements compound, and cross-entropy loss relates exponentially to perplexity (a common measure of model quality).
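That 5% figure follows directly from the power law; a quick sketch makes the arithmetic explicit:

```python
# Kaplan's power law: loss scales as L(N) ∝ N^(-alpha_N), with alpha_N ≈ 0.076.
alpha_N = 0.076

# Multiplying model size by k multiplies loss by k^(-alpha_N).
loss_multiplier = 2 ** (-alpha_N)          # effect of doubling N
reduction_pct = (1 - loss_multiplier) * 100

print(f"Doubling N multiplies loss by {loss_multiplier:.3f} "
      f"(~{reduction_pct:.1f}% reduction)")
```

Chaining doublings shows the compounding: ten doublings (a 1024× larger model) multiply loss by 0.949^10, roughly a 41% reduction.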
Three Power Laws
Kaplan found that performance scales predictably with three independent variables:
- Model size (N): More parameters → lower loss
- Dataset size (D): More training data → lower loss
- Compute (C): More training FLOPs → lower loss
Each follows its own power law. Performance improves smoothly and predictably as you increase any of them β there are no sudden jumps or plateaus.
The Key Conclusion
Given a fixed compute budget, Kaplan's analysis suggested that model size should be scaled faster than dataset size. In other words: build a bigger model and train it on relatively less data.
This recommendation shaped the industry. GPT-3 (175B parameters) was trained on only 300 billion tokens, a ratio of roughly 1.7 tokens per parameter. The prevailing wisdom became: go big on parameters.
Why It Mattered
Before Kaplan, scaling decisions were largely based on intuition and trial-and-error. After Kaplan, labs had mathematical formulas to predict how much performance they'd gain from a given investment. This turned model design from art into engineering.
Paper 2: Chinchilla (2022) – The Correction
Paper: "Training Compute-Optimal Large Language Models"
Authors: Jordan Hoffmann, Sebastian Borgeaud, et al. (DeepMind)
Published: March 2022
ArXiv: 2203.15556
Two years after Kaplan, DeepMind ran a much larger experiment and reached a different conclusion.
The Experiment
Hoffmann's team trained over 400 language models ranging from 70 million to 16 billion parameters, on datasets from 5 billion to 500 billion tokens. They systematically varied the ratio of model size to training data to find the optimal balance.
The Key Finding
For every doubling of model size, the number of training tokens should also be doubled.
The optimal ratio is approximately 20 tokens per parameter.
This directly contradicted the industry's practice. Most large models were significantly undertrained: too many parameters, not enough data.
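The 20-tokens-per-parameter rule is just a multiplication; as a back-of-the-envelope sketch (the helper name here is my own):

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Rule-of-thumb training set size: ~20 tokens per parameter."""
    return n_params * tokens_per_param

for n in (70e9, 175e9, 280e9):
    print(f"{n / 1e9:.0f}B params -> {chinchilla_optimal_tokens(n) / 1e12:.1f}T tokens")
```

For a 70B model this gives 1.4T tokens, exactly the dataset DeepMind used for Chinchilla.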
The Proof: Chinchilla vs. Gopher
DeepMind proved their point with a dramatic experiment. They took the same compute budget used to train Gopher (their 280B parameter model trained on 300B tokens) and instead trained Chinchilla: a 70B parameter model on 1.4 trillion tokens.
The results:
| Model | Parameters | Training Tokens | Tokens/Param | MMLU Accuracy |
|---|---|---|---|---|
| Gopher | 280B | 300B | ~1.1 | 60.0% |
| GPT-3 | 175B | 300B | ~1.7 | 43.9% |
| Jurassic-1 | 178B | 300B | ~1.7 | n/a |
| Megatron-Turing | 530B | 270B | ~0.5 | n/a |
| Chinchilla | 70B | 1.4T | 20 | 67.5% |
Chinchilla, despite being 4× smaller than Gopher, outperformed it on virtually every benchmark. On MMLU (Massive Multitask Language Understanding), Chinchilla scored 67.5% compared to Gopher's 60.0%, a 7.5 percentage point improvement.
The conclusion was stark: Gopher, GPT-3, Jurassic-1, and Megatron-Turing were all undertrained. They had too many parameters relative to their training data.
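It's easy to verify that Gopher and Chinchilla really did consume comparable budgets, using the common approximation from the scaling-law literature that training costs about 6 FLOPs per parameter per token (C ≈ 6ND):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

gopher = train_flops(280e9, 300e9)       # ~5.0e23 FLOPs
chinchilla = train_flops(70e9, 1.4e12)   # ~5.9e23 FLOPs
print(f"Gopher: {gopher:.2e} FLOPs, Chinchilla: {chinchilla:.2e} FLOPs")
```

The two budgets land within about 15% of each other, close enough that the benchmark gap is attributable to the parameter/data split rather than to extra compute.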
What Went Wrong with Kaplan?
The disagreement between Kaplan and Chinchilla was later traced to methodological differences in the Kaplan study:
- Embedding parameters excluded: Kaplan did not count parameters in the token embedding layer. For smaller models, this creates a significant bias in the scaling coefficients.
- Smaller models studied: Kaplan's experiments used smaller models than Chinchilla's, limiting the accuracy of extrapolation.
- Warmup and optimizer tuning: Differences in learning rate warmup and scale-dependent optimizer settings affected the results.
When these factors are corrected, the two studies converge toward the Chinchilla conclusion: scale model size and data equally.
The Implications
Most Models Were Undertrained
The Chinchilla paper implied that virtually every large model in existence was suboptimal:
| Model | Actual Tokens | Chinchilla-Optimal Tokens | Undertrained By |
|---|---|---|---|
| GPT-3 (175B) | 300B | ~3.5T | ~12× |
| Gopher (280B) | 300B | ~5.6T | ~19× |
| Megatron (530B) | 270B | ~10.6T | ~39× |
These models could have achieved the same performance at a fraction of their size (or much better performance at the same size) if trained on more data.
The Data Problem
If models need 20 tokens per parameter, then:
- A 70B model needs 1.4 trillion tokens
- A 175B model needs 3.5 trillion tokens
- A 500B model needs 10 trillion tokens
Where does all this data come from? This shifted the industry's bottleneck from compute to data quality. Suddenly, curating massive, high-quality text datasets became as important as building bigger GPUs.
Inference Cost
Chinchilla had a practical bonus: smaller models are cheaper to run. A 70B model needs roughly 4× less memory and compute per token at inference time than a 280B model. Same performance, much lower serving costs.
This matters enormously for production deployment. Running GPT-3 (175B) costs roughly 2.5× more per query than running a Chinchilla-optimal 70B model that achieves the same or better results.
The Power Law Formulas
For those interested in the mathematics, here are the core relationships.
Kaplan's Laws (2020)
Performance as a function of model size (with sufficient data):
L(N) = (N_c / N)^α_N where α_N ≈ 0.076
Performance as a function of dataset size (with sufficient model):
L(D) = (D_c / D)^α_D where α_D ≈ 0.095
Performance as a function of compute:
L(C) = (C_c / C)^α_C where α_C ≈ 0.050
Chinchilla's Laws (2022)
Combined loss function:
L(N, D) = E + A/N^α + B/D^β
Where:
- E is the irreducible loss (entropy of natural language, ~1.69 nats)
- A/N^α captures the model size contribution
- B/D^β captures the dataset size contribution
- α ≈ 0.34, β ≈ 0.28
The optimal allocation for a compute budget C:
N_opt ∝ C^0.50 (parameters scale as the square root of compute)
D_opt ∝ C^0.50 (tokens scale as the square root of compute)
Both scale equally, confirming the "double both" rule.
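A numerical sketch recovers this square-root behavior from the loss formula itself, using the fitted constants reported in the Chinchilla paper (A ≈ 406.4, B ≈ 410.7): fix a budget C, let D = C/(6N), and search for the N that minimizes the loss.

```python
import numpy as np

# Fitted constants from Hoffmann et al. (2022)
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    """Chinchilla loss: irreducible term + model-size term + data-size term."""
    return E + A / N**alpha + B / D**beta

def optimal_split(C, n_grid=4000):
    """Grid-search the model size N that minimizes loss for budget C = 6*N*D."""
    N = np.logspace(8, 13, n_grid)   # candidate parameter counts
    D = C / (6 * N)                  # tokens implied by the budget
    i = np.argmin(loss(N, D))
    return N[i], D[i]

n1, d1 = optimal_split(1e23)
n2, d2 = optimal_split(4e23)   # 4x the compute
print(f"4x compute -> N grows {n2 / n1:.2f}x, D grows {d2 / d1:.2f}x")
# Both roughly double, i.e. each scales close to C^0.5
```

With these exact exponents the analytic answer is N ∝ C^(β/(α+β)) ≈ C^0.45 and D ∝ C^0.55, which the paper rounds to the "both scale as C^0.5" rule.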
Real-World Impact
LLaMA (Meta, 2023)
Meta's LLaMA models were explicitly designed using Chinchilla-optimal scaling:
| Model | Parameters | Training Tokens | Tokens/Param |
|---|---|---|---|
| LLaMA 7B | 7B | 1T | ~143 |
| LLaMA 13B | 13B | 1T | ~77 |
| LLaMA 33B | 33B | 1.4T | ~42 |
| LLaMA 65B | 65B | 1.4T | ~22 |
Note: LLaMA actually trained beyond Chinchilla-optimal (more tokens per parameter than 20). Meta found that performance continued improving past the Chinchilla point; the model hadn't fully saturated on the data.
This led to a revision: 20 tokens/parameter is a minimum, not a maximum. More data almost always helps if you can afford the compute.
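The table above is easy to sanity-check: every LLaMA model sits at or above the 20-tokens-per-parameter line.

```python
# Tokens-per-parameter for the LLaMA family: (params, training tokens)
llama = {"7B": (7e9, 1e12), "13B": (13e9, 1e12),
         "33B": (33e9, 1.4e12), "65B": (65e9, 1.4e12)}

for name, (n_params, n_tokens) in llama.items():
    ratio = n_tokens / n_params
    print(f"LLaMA {name}: {ratio:.0f} tokens/param (Chinchilla floor: 20)")
```

Note the pattern: the smaller the model, the further past the Chinchilla point it was trained, which is exactly what you want when the model will be served many times and inference cost dominates.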
Industry Shift
After Chinchilla, the industry pivoted:
- Data collection scaled up massively (Common Crawl, RedPajama, The Pile)
- Smaller, better-trained models became competitive with larger ones
- Inference costs dropped as labs built efficient models instead of maximum-size ones
- Open-source models became viable (you can fit a 7B or 13B model on consumer hardware)
Why This Matters for the Series
The Economics of AI
Every article in this series so far has focused on techniques: attention, BERT, GPT-2, instruction tuning, RLHF, chain-of-thought. Scaling laws explain the economics behind these techniques:
- Why is GPT-4 sized the way it is? Most likely because scaling laws predicted the optimal allocation of OpenAI's compute budget.
- Why did LLaMA succeed? Because Meta followed Chinchilla-optimal scaling instead of just building the biggest model.
- Why are newer models getting smaller but better? Because the industry learned that data matters as much as parameters.
Prediction, Not Guessing
Before scaling laws, model design was trial-and-error. After scaling laws, labs could:
- Set a compute budget
- Calculate the optimal model size
- Calculate the required dataset size
- Predict the expected performance
This turned AI from "let's build the biggest model we can afford" into "let's build the most efficient model we can afford."
Limitations and Open Questions
Does Scaling Hit a Wall?
Power laws predict smooth improvement forever. But does performance actually plateau at some point? Current evidence suggests:
- No wall has been hit yet for language modeling loss
- But downstream task improvements can saturate even as loss continues dropping
- The relationship between loss and "usefulness" is complex
Data Quality vs. Quantity
Chinchilla assumed all tokens are equally valuable. In practice, high-quality data (curated text, books, academic papers) is worth much more than low-quality data (raw web scrapes, duplicates). Recent work suggests that data quality scaling laws may be even more impactful than quantity scaling.
Beyond Language
Scaling laws have been observed in:
- Vision models (ViT)
- Multimodal models (CLIP)
- Code generation (Codex)
- Protein folding (AlphaFold)
The universality of power-law scaling across domains suggests a deeper mathematical principle that isn't fully understood yet.
Inference-Time Compute
Recent work (OpenAI's o1, reasoning models) suggests a new dimension: scaling compute at inference time (thinking longer) can substitute for scaling model parameters. This may rewrite the scaling laws for reasoning-heavy tasks.
Series Navigation
- Attention Is All You Need: The Paper That Changed AI
- BERT: How AI Learned to Truly Read
- GPT-2: How AI Learned to Write
- FLAN: How AI Learned to Follow Instructions
- InstructGPT: How AI Learned What Humans Actually Want
- Chain-of-Thought: How AI Learned to Show Its Work
- Scaling Laws: Why Bigger Isn't Always Better (you are here)
Last Updated: March 31, 2026
Author: RESEARCHER
Category: Research
Difficulty: Intermediate
Papers:
- Kaplan et al., "Scaling Laws for Neural Language Models" (arXiv:2001.08361, January 2020)
- Hoffmann et al., "Training Compute-Optimal Large Language Models" (arXiv:2203.15556, March 2022)