Scaling Laws: Why Bigger Isn't Always Better
Two landmark papers revealed that AI model performance follows predictable mathematical laws, and that the industry was training models wrong. The Chinchilla paper showed that a 70B model trained on more data could outperform models 4× its size, reshaping how every major AI lab builds models today.
The Question Nobody Was Asking
By 2020, AI labs were in an arms race. Bigger models, more parameters, more compute. GPT-2 had 1.5 billion parameters. GPT-3 had 175 billion. Google's models were pushing even further. The assumption was simple: bigger model = better performance.
But nobody had rigorously asked: Is this the best way to spend our compute budget?
If you have a fixed amount of money (and therefore compute), should you:
- Train a huge model on a modest amount of data?
- Train a smaller model on much more data?
- Something in between?
Two papers answered this question, and they initially disagreed.
Paper 1: Kaplan et al. (2020) – The First Scaling Laws
Paper: "Scaling Laws for Neural Language Models"
Authors: Jared Kaplan, Sam McCandlish, et al. (OpenAI)
Published: January 2020
ArXiv: 2001.08361
The Discovery
OpenAI researchers trained hundreds of language models of different sizes and measured their performance (cross-entropy loss). They found something remarkable: performance follows a power law.
The relationship is:
L(N) ∝ N^(-0.076)
Where L is the loss (lower = better) and N is the number of parameters.
In plain language: doubling the model size reduces loss by about 5%. This might sound small, but loss improvements compound, and cross-entropy loss relates exponentially to perplexity (a common measure of model quality).
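That 5% figure follows directly from the power law; a quick sketch makes the arithmetic explicit:

```python
# Kaplan's power law: loss scales as L(N) ∝ N^(-alpha_N), with alpha_N ≈ 0.076.
alpha_N = 0.076

# Multiplying model size by k multiplies loss by k^(-alpha_N).
loss_multiplier = 2 ** (-alpha_N)          # effect of doubling N
reduction_pct = (1 - loss_multiplier) * 100

print(f"Doubling N multiplies loss by {loss_multiplier:.3f} "
      f"(~{reduction_pct:.1f}% reduction)")
```

Chaining doublings shows the compounding: ten doublings (a 1024× larger model) multiply loss by 0.949^10, roughly a 41% reduction.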
Three Power Laws
Kaplan found that performance scales predictably with three independent variables:
- Model size (N): More parameters → lower loss
- Dataset size (D): More training data → lower loss
- Compute (C): More training FLOPs → lower loss
Each follows its own power law. Performance improves smoothly and predictably as you increase any of them β there are no sudden jumps or plateaus.
The Key Conclusion
Given a fixed compute budget, Kaplan's analysis suggested that model size should be scaled faster than dataset size. In other words: build a bigger model and train it on relatively less data.
This recommendation shaped the industry. GPT-3 (175B parameters) was trained on only 300 billion tokens, a ratio of roughly 1.7 tokens per parameter. The prevailing wisdom became: go big on parameters.
Why It Mattered
Before Kaplan, scaling decisions were largely based on intuition and trial-and-error. After Kaplan, labs had mathematical formulas to predict how much performance they'd gain from a given investment. This turned model design from art into engineering.
Paper 2: Chinchilla (2022) – The Correction
Paper: "Training Compute-Optimal Large Language Models"
Authors: Jordan Hoffmann, Sebastian Borgeaud, et al. (DeepMind)
Published: March 2022
ArXiv: 2203.15556
Two years after Kaplan, DeepMind ran a much larger experiment and reached a different conclusion.
The Experiment
Hoffmann's team trained over 400 language models ranging from 70 million to 16 billion parameters, on datasets from 5 billion to 500 billion tokens. They systematically varied the ratio of model size to training data to find the optimal balance.
The Key Finding
For every doubling of model size, the number of training tokens should also be doubled.
The optimal ratio is approximately 20 tokens per parameter.
This directly contradicted the industry's practice. Most large models were significantly undertrained: too many parameters, not enough data.
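The 20-tokens-per-parameter rule is just a multiplication; as a back-of-the-envelope sketch (the helper name here is my own):

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Rule-of-thumb training set size: ~20 tokens per parameter."""
    return n_params * tokens_per_param

for n in (70e9, 175e9, 280e9):
    print(f"{n / 1e9:.0f}B params -> {chinchilla_optimal_tokens(n) / 1e12:.1f}T tokens")
```

For a 70B model this gives 1.4T tokens, exactly the dataset DeepMind used for Chinchilla.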
The Proof: Chinchilla vs. Gopher
DeepMind proved their point with a dramatic experiment. They took the same compute budget used to train Gopher (their 280B parameter model trained on 300B tokens) and instead trained Chinchilla: a 70B parameter model on 1.4 trillion tokens.
The results:
| Model | Parameters | Training Tokens | Tokens/Param | MMLU Accuracy |
|---|---|---|---|---|
| Gopher | 280B | 300B | ~1.1 | 60.0% |
| GPT-3 | 175B | 300B | ~1.7 | 43.9% |
| Jurassic-1 | 178B | 300B | ~1.7 | n/a |
| Megatron-Turing | 530B | 270B | ~0.5 | n/a |
| Chinchilla | 70B | 1.4T | 20 | 67.5% |
Chinchilla, despite being 4× smaller than Gopher, outperformed it on virtually every benchmark. On MMLU (Massive Multitask Language Understanding), Chinchilla scored 67.5% compared to Gopher's 60.0%, a 7.5 percentage point improvement.
The conclusion was stark: Gopher, GPT-3, Jurassic-1, and Megatron-Turing were all undertrained. They had too many parameters relative to their training data.
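It's easy to verify that Gopher and Chinchilla really did consume comparable budgets, using the common approximation from the scaling-law literature that training costs about 6 FLOPs per parameter per token (C ≈ 6ND):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

gopher = train_flops(280e9, 300e9)       # ~5.0e23 FLOPs
chinchilla = train_flops(70e9, 1.4e12)   # ~5.9e23 FLOPs
print(f"Gopher: {gopher:.2e} FLOPs, Chinchilla: {chinchilla:.2e} FLOPs")
```

The two budgets land within about 15% of each other, close enough that the benchmark gap is attributable to the parameter/data split rather than to extra compute.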
What Went Wrong with Kaplan?
The disagreement between Kaplan and Chinchilla was later traced to methodological differences in the Kaplan study:
- Embedding parameters excluded: Kaplan did not count parameters in the token embedding layer. For smaller models, this creates a significant bias in the scaling coefficients.
- Smaller models studied: Kaplan's experiments used smaller models than Chinchilla's, limiting the accuracy of extrapolation.
- Warmup and optimizer tuning: Differences in learning rate warmup and scale-dependent optimizer settings affected the results.
When these factors are corrected, the two studies converge toward the Chinchilla conclusion: scale model size and data equally.
The Implications
Most Models Were Undertrained
The Chinchilla paper implied that virtually every large model in existence was suboptimal:
| Model | Actual Tokens | Chinchilla-Optimal Tokens | Undertrained By |
|---|---|---|---|
| GPT-3 (175B) | 300B | ~3.5T | ~12× |
| Gopher (280B) | 300B | ~5.6T | ~19× |
| Megatron (530B) | 270B | ~10.6T | ~39× |
These models could have achieved the same performance at a fraction of their size (or much better performance at the same size) if trained on more data.
The Data Problem
If models need 20 tokens per parameter, then:
- A 70B model needs 1.4 trillion tokens
- A 175B model needs 3.5 trillion tokens
- A 500B model needs 10 trillion tokens
Where does all this data come from? This shifted the industry's bottleneck from compute to data quality. Suddenly, curating massive, high-quality text datasets became as important as building bigger GPUs.
Inference Cost
Chinchilla had a practical bonus: smaller models are cheaper to run. A 70B model needs roughly 4× less memory and compute per token at inference time than a 280B model. Same performance, much lower serving costs.
This matters enormously for production deployment. Running GPT-3 (175B) costs roughly 2.5× more per query than running a Chinchilla-optimal 70B model that achieves the same or better results.
The Power Law Formulas
For those interested in the mathematics, here are the core relationships.
Kaplan's Laws (2020)
Performance as a function of model size (with sufficient data):
L(N) = (N_c / N)^α_N where α_N ≈ 0.076
Performance as a function of dataset size (with sufficient model):
L(D) = (D_c / D)^α_D where α_D ≈ 0.095
Performance as a function of compute:
L(C) = (C_c / C)^α_C where α_C ≈ 0.050
Chinchilla's Laws (2022)
Combined loss function:
L(N, D) = E + A/N^α + B/D^β
Where:
- E is the irreducible loss (entropy of natural language, ~1.69 nats)
- A/N^α captures the model size contribution
- B/D^β captures the dataset size contribution
- α ≈ 0.34, β ≈ 0.28
The optimal allocation for a compute budget C:
N_opt ∝ C^0.50 (parameters scale as the square root of compute)
D_opt ∝ C^0.50 (tokens scale as the square root of compute)
Both scale equally, confirming the "double both" rule.
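A numerical sketch recovers this square-root behavior from the loss formula itself, using the fitted constants reported in the Chinchilla paper (A ≈ 406.4, B ≈ 410.7): fix a budget C, let D = C/(6N), and search for the N that minimizes the loss.

```python
import numpy as np

# Fitted constants from Hoffmann et al. (2022)
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    """Chinchilla loss: irreducible term + model-size term + data-size term."""
    return E + A / N**alpha + B / D**beta

def optimal_split(C, n_grid=4000):
    """Grid-search the model size N that minimizes loss for budget C = 6*N*D."""
    N = np.logspace(8, 13, n_grid)   # candidate parameter counts
    D = C / (6 * N)                  # tokens implied by the budget
    i = np.argmin(loss(N, D))
    return N[i], D[i]

n1, d1 = optimal_split(1e23)
n2, d2 = optimal_split(4e23)   # 4x the compute
print(f"4x compute -> N grows {n2 / n1:.2f}x, D grows {d2 / d1:.2f}x")
# Both roughly double, i.e. each scales close to C^0.5
```

With these exact exponents the analytic answer is N ∝ C^(β/(α+β)) ≈ C^0.45 and D ∝ C^0.55, which the paper rounds to the "both scale as C^0.5" rule.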
Real-World Impact
LLaMA (Meta, 2023)
Meta's LLaMA models were explicitly designed using Chinchilla-optimal scaling:
| Model | Parameters | Training Tokens | Tokens/Param |
|---|---|---|---|
| LLaMA 7B | 7B | 1T | ~143 |
| LLaMA 13B | 13B | 1T | ~77 |
| LLaMA 33B | 33B | 1.4T | ~42 |
| LLaMA 65B | 65B | 1.4T | ~22 |
Note: LLaMA actually trained beyond Chinchilla-optimal (more tokens per parameter than 20). Meta found that performance continued improving past the Chinchilla point; the model hadn't fully saturated on the data.
This led to a revision: 20 tokens/parameter is a minimum, not a maximum. More data almost always helps if you can afford the compute.
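The table above is easy to sanity-check: every LLaMA model sits at or above the 20-tokens-per-parameter line.

```python
# Tokens-per-parameter for the LLaMA family: (params, training tokens)
llama = {"7B": (7e9, 1e12), "13B": (13e9, 1e12),
         "33B": (33e9, 1.4e12), "65B": (65e9, 1.4e12)}

for name, (n_params, n_tokens) in llama.items():
    ratio = n_tokens / n_params
    print(f"LLaMA {name}: {ratio:.0f} tokens/param (Chinchilla floor: 20)")
```

Note the pattern: the smaller the model, the further past the Chinchilla point it was trained, which is exactly what you want when the model will be served many times and inference cost dominates.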
Industry Shift
After Chinchilla, the industry pivoted:
- Data collection scaled up massively (Common Crawl, RedPajama, The Pile)
- Smaller, better-trained models became competitive with larger ones
- Inference costs dropped as labs built efficient models instead of maximum-size ones
- Open-source models became viable (you can fit a 7B or 13B model on consumer hardware)
Why This Matters for the Series
The Economics of AI
Every article in this series so far has focused on techniques: attention, BERT, GPT-2, instruction tuning, RLHF, chain-of-thought. Scaling laws explain the economics behind these techniques:
- Why is GPT-4 sized the way it is? Most likely because scaling laws predicted the optimal allocation of OpenAI's compute budget.
- Why did LLaMA succeed? Because Meta followed Chinchilla-optimal scaling instead of just building the biggest model.
- Why are newer models getting smaller but better? Because the industry learned that data matters as much as parameters.
Prediction, Not Guessing
Before scaling laws, model design was trial-and-error. After scaling laws, labs could:
- Set a compute budget
- Calculate the optimal model size
- Calculate the required dataset size
- Predict the expected performance
This turned AI from "let's build the biggest model we can afford" into "let's build the most efficient model we can afford."
Limitations and Open Questions
Does Scaling Hit a Wall?
Power laws predict smooth improvement forever. But does performance actually plateau at some point? Current evidence suggests:
- No wall has been hit yet for language modeling loss
- But downstream task improvements can saturate even as loss continues dropping
- The relationship between loss and "usefulness" is complex
Data Quality vs. Quantity
Chinchilla assumed all tokens are equally valuable. In practice, high-quality data (curated text, books, academic papers) is worth much more than low-quality data (raw web scrapes, duplicates). Recent work suggests that data quality scaling laws may be even more impactful than quantity scaling.
Beyond Language
Scaling laws have been observed in:
- Vision models (ViT)
- Multimodal models (CLIP)
- Code generation (Codex)
- Protein folding (AlphaFold)
The universality of power-law scaling across domains suggests a deeper mathematical principle that isn't fully understood yet.
Inference-Time Compute
Recent work (OpenAI's o1, reasoning models) suggests a new dimension: scaling compute at inference time (thinking longer) can substitute for scaling model parameters. This may rewrite the scaling laws for reasoning-heavy tasks.
Series Navigation
- Attention Is All You Need: The Paper That Changed AI
- BERT: How AI Learned to Truly Read
- GPT-2: How AI Learned to Write
- FLAN: How AI Learned to Follow Instructions
- InstructGPT: How AI Learned What Humans Actually Want
- Chain-of-Thought: How AI Learned to Show Its Work
- Scaling Laws: Why Bigger Isn't Always Better (you are here)
Last Updated: March 31, 2026
Author: RESEARCHER
Category: Research
Difficulty: Intermediate
Papers:
- Kaplan et al., "Scaling Laws for Neural Language Models" (arXiv:2001.08361, January 2020)
- Hoffmann et al., "Training Compute-Optimal Large Language Models" (arXiv:2203.15556, March 2022)