Mixture of Experts: How AI Learned to Cheat the Scaling Laws
What if you could have a model with 671 billion parameters but only pay to run 37 billion? Mixture of Experts is the architecture trick behind GPT-4, Mixtral, and DeepSeek: models that are simultaneously massive and efficient. Two landmark papers, and the open models built on them, explain how.
The Scaling Dilemma
In the last article, we learned that AI performance follows predictable power laws. Bigger models perform better. More data helps. But there's a catch:
Bigger models cost more to run.
A 280B parameter model needs ~560 GB of memory just for its weights at 16-bit precision, and uses roughly 4× more electricity per query than a 70B model. Scaling laws tell you how much improvement you'll get from a bigger model, but they don't help you avoid the bill.
What if there was a way to have a model with hundreds of billions of parameters, capturing all that knowledge and capability, but only activate a fraction of them for each query?
That's exactly what Mixture of Experts does.
The Core Idea
A standard Transformer processes every token through every parameter. If you have a 70B model, all 70 billion parameters do work for every single token. This is called a dense model.
A Mixture of Experts (MoE) model replaces some of the Transformer's layers with multiple parallel "expert" networks. For each token, a router (also called a gate) decides which experts to activate. Only the selected experts process that token. The rest sit idle.
Dense Model (every token uses all parameters):
Token → [████████████████████████] → Output
        All 70B parameters active
MoE Model (each token uses only selected experts):
Token → Router → Expert 3 ──┐
                 Expert 7 ──┴→ Output
                 (6 other experts idle)
                 Only ~13B parameters active
The result: a model that stores knowledge across many more parameters than it uses per query. You get the capacity of a huge model with the inference cost of a small one.
Paper 1: The Origin – Sparsely-Gated Mixture of Experts (2017)
Paper: "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer"
Authors: Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean (Google)
Published: January 2017
ArXiv: 1701.06538
The Breakthrough
Shazeer and team (including Geoffrey Hinton and Jeff Dean) built a model with 137 billion parameters in 2017 β years before GPT-3. They achieved this by applying MoE layers between stacked LSTM layers (the dominant architecture before Transformers).
The key innovation was the sparsely-gated design. Previous MoE approaches activated all experts for every input, which defeated the purpose. Shazeer introduced a gating network that selects only the top-k experts (typically k=2), keeping the rest at zero. This made the model sparse: most parameters are idle for any given input.
The Gating Mechanism
The router is a small neural network that takes a token's representation and outputs a probability distribution over all experts:
Gate(x) = Softmax(x · W_gate)
To select only the top-k experts, they use a "noisy top-k" approach:
- Add learnable noise to the gate outputs (helps exploration)
- Keep only the top-k values, set everything else to −∞
- Apply softmax to get the final weights
The selected experts process the token, and their outputs are combined using the gate weights as coefficients.
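The noisy top-k gate can be sketched in a few lines of NumPy. This is a simplified illustration: the paper scales the noise with a second learned matrix, which is folded into a fixed `noise_std` here.

```python
import numpy as np

def noisy_top_k_gate(x, w_gate, k=2, noise_std=1.0, rng=None):
    """Sketch of sparsely-gated top-k routing for a single token.

    x      : token representation, shape (d_model,)
    w_gate : gating weights, shape (d_model, num_experts)
    Returns the indices of the k selected experts and their softmax weights.
    """
    rng = rng or np.random.default_rng(0)
    logits = x @ w_gate                                        # raw gate scores
    logits = logits + rng.normal(0, noise_std, logits.shape)   # exploration noise
    top_k = np.argsort(logits)[-k:]                            # k largest scores
    masked = np.full_like(logits, -np.inf)                     # everything else → −∞
    masked[top_k] = logits[top_k]
    weights = np.exp(masked - masked.max())                    # softmax over masked logits
    weights /= weights.sum()                                   # non-selected experts get exactly 0
    return top_k, weights[top_k]

# toy example: 8 experts, 16-dim token
rng = np.random.default_rng(42)
x = rng.normal(size=16)
w_gate = rng.normal(size=(16, 8))
experts, weights = noisy_top_k_gate(x, w_gate, k=2, rng=rng)
print(experts, weights)  # 2 expert indices; their weights sum to 1
```

Because the masked logits are −∞ for unselected experts, their softmax weight is exactly zero, which is what makes the layer sparse.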
The Load Balancing Problem
Without intervention, the router quickly learns to send most tokens to the same few "popular" experts. Those experts get more training, become even better, attract more tokens: a vicious cycle. Soon you have a 137B model where only 2 experts do any work.
The solution: an auxiliary loss that penalises uneven expert utilisation. This extra term in the training objective encourages the router to distribute tokens roughly equally across all experts. It's a balancing act: you want specialisation, but not monopoly.
Results
The 137B MoE model achieved better results on language modelling and machine translation than dense models of comparable compute cost, demonstrating that sparse models could be both larger and faster.
However, the approach had practical problems: high communication costs across devices, training instability, and difficulty with fine-tuning.
Paper 2: Switch Transformers (2021)
Paper: "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity"
Authors: William Fedus, Barret Zoph, Noam Shazeer (Google)
Published: January 2021 (Journal of Machine Learning Research, 2022)
ArXiv: 2101.03961
Four years after Shazeer's original paper, a Google team including Shazeer applied the MoE concept to Transformers and scaled it dramatically.
The Key Simplification: Top-1 Routing
Shazeer's original paper used top-2 routing (activate 2 experts per token). The prevailing belief was that routing to at least 2 experts was necessary for the gate to learn effectively.
Switch Transformers challenged this assumption: top-1 routing works just as well, and it's simpler. Each token goes to exactly one expert. This halves the compute per MoE layer compared to top-2, and eliminates the need to combine two expert outputs.
Top-2 routing (Shazeer 2017):
Token → Expert A (weight 0.6) + Expert B (weight 0.4) → Combined
Top-1 routing (Switch Transformer):
Token → Expert A → Output
(simpler, faster, works just as well)
The Results
Switch Transformers achieved remarkable training speedups:
- 7× speedup over T5-Base and T5-Large at the same compute cost
- 4× speedup over T5-XXL (Google's previously largest model)
- Scaled to 1.6 trillion parameters, the largest language model at the time
The 1.6T model achieved the same perplexity as T5-XXL while reaching that performance far faster during pretraining. The compute per token was similar to a much smaller dense model, because each token only activated one expert.
Where MoE Layers Go
In a standard Transformer, each layer has two main components:
- Attention: tokens communicate with each other
- Feed-Forward Network (FFN): each token is processed independently
MoE replaces only the FFN layers. The attention layers remain dense (every token uses the full attention mechanism). This makes sense: attention is about relationships between tokens, while FFNs are about processing individual tokens, the part where specialisation helps most.
Standard Transformer Layer:
Input → [Attention] → [FFN] → Output
        (dense)       (dense)
MoE Transformer Layer:
Input → [Attention] → [Router → Expert₁ | Expert₂ | ... | Expert_n] → Output
        (dense)        (sparse: only 1 expert active per token)
Simplified Load Balancing
Switch Transformers introduced a cleaner load-balancing loss:
L_balance = α · N · Σᵢ (fᵢ · Pᵢ)
Where fᵢ is the fraction of tokens routed to expert i, and Pᵢ is the average router probability for expert i. When both are uniform (each expert gets 1/N of the tokens), the loss is minimised. The coefficient α controls how strongly to enforce balance.
This replaced Shazeer's more complex noise-based balancing with a simple, differentiable penalty term.
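The balance loss above is simple enough to compute directly. A minimal sketch, assuming we already have the router's softmax outputs and the hard top-1 assignment for a batch of tokens:

```python
import numpy as np

def switch_balance_loss(router_probs, expert_assignment, num_experts, alpha=0.01):
    """Switch Transformer auxiliary load-balancing loss (sketch).

    router_probs      : (num_tokens, num_experts) softmax outputs of the router
    expert_assignment : (num_tokens,) top-1 expert index per token
    """
    num_tokens = router_probs.shape[0]
    # f_i: fraction of tokens actually routed to expert i
    f = np.bincount(expert_assignment, minlength=num_experts) / num_tokens
    # P_i: average router probability assigned to expert i
    P = router_probs.mean(axis=0)
    return alpha * num_experts * np.sum(f * P)

# perfectly balanced routing over 4 experts: loss = alpha * N * N * (1/N * 1/N) = alpha
probs = np.full((8, 4), 0.25)                 # uniform router distribution
assign = np.array([0, 1, 2, 3, 0, 1, 2, 3])   # each expert gets 1/4 of the tokens
print(switch_balance_loss(probs, assign, num_experts=4))  # → 0.01
```

If instead the router piles everything onto one expert (fᵢ and Pᵢ both peaked), the f·P dot product grows, so the loss pushes the gradient back toward balance.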
From Papers to Products: The MoE Revolution
Mixtral 8x7B (Mistral, December 2023)
Mixtral brought MoE to the open-source community and demonstrated its power in a way anyone could verify.
Architecture:
- 8 experts per layer, top-2 routing (2 experts active per token)
- 46.7B total parameters, 12.9B active per token
- Same base architecture as Mistral 7B, but with MoE FFN layers
Results:
- MMLU: 70.6%, matching or exceeding LLaMA 2 70B (a model with 5.4× more active parameters)
- Outperformed LLaMA 2 70B on reasoning (GSM8K: 58.4% vs 56.8%) and knowledge (ARC-Challenge: 66.4% vs 64.6%)
- Also outperformed GPT-3.5 on most benchmarks
- ~6× faster inference than dense models of similar quality
The numbers tell a striking story:
| Model | Total Params | Active Params | MMLU | Inference Cost |
|---|---|---|---|---|
| LLaMA 2 70B | 70B | 70B (dense) | ~68.9% | Baseline |
| Mixtral 8x7B | 46.7B | 12.9B | 70.6% | ~5-6× cheaper |
| GPT-3.5 | ~175B (est) | ~175B (dense) | ~70% | API pricing |
Mixtral matched a 70B dense model while activating only 12.9B parameters per token. That's the MoE promise delivered.
DeepSeek-V3 (DeepSeek, December 2024)
DeepSeek took MoE to its logical extreme and stunned the AI industry with its efficiency.
Architecture:
- 256 routed experts per layer + 1 shared expert (processes all tokens)
- 8 experts active per token out of 256
- 671B total parameters, 37B active per token
- 14.8 trillion training tokens
Cost:
- Total training: 2.788 million H800 GPU hours
- At $2/GPU hour: $5.576 million total training cost
- For context, GPT-4's training is estimated at $100+ million
Results:
- Matched or exceeded GPT-4o and Claude 3.5 Sonnet on many benchmarks
- Outperformed other open-source models on MMLU, coding, and math
- State-of-the-art performance among open models at a fraction of the cost
DeepSeek-V3 also introduced auxiliary-loss-free load balancing, a technique that achieves balanced expert utilisation without the explicit penalty term that can distort training. Instead, it uses a bias mechanism in the routing to dynamically adjust load distribution.
How the Router Works
The router is perhaps the most interesting component of an MoE model. It's a small learned network that makes a crucial decision: which expert should handle this token?
The Routing Process
1. Token arrives as a vector (e.g., dimension 4096)
2. Router multiplies by weight matrix: scores = token · W_router
(W_router shape: 4096 Γ num_experts)
3. Apply softmax to get probabilities over experts
4. Select top-k experts by probability
5. Route token to selected experts
6. Combine expert outputs (weighted by router probabilities)
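The six steps above can be sketched end-to-end for a top-2 MoE layer. The "experts" here are toy single linear maps standing in for full FFNs, and the batched loop is written for clarity rather than speed:

```python
import numpy as np

def moe_layer(tokens, w_router, expert_weights, k=2):
    """Top-k MoE forward pass for a batch of tokens (sketch).

    tokens         : (num_tokens, d_model)
    w_router       : (d_model, num_experts)
    expert_weights : list of (d_model, d_model) matrices, one per expert
                     (stand-ins for full FFN experts)
    """
    logits = tokens @ w_router                           # step 2: router scores
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)            # step 3: softmax over experts
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        top_k = np.argsort(probs[t])[-k:]                # step 4: pick top-k experts
        gate = probs[t, top_k] / probs[t, top_k].sum()   # renormalise gate weights
        for g, e in zip(gate, top_k):
            out[t] += g * (tokens[t] @ expert_weights[e])  # steps 5-6: route and combine
    return out

rng = np.random.default_rng(0)
d, n_experts = 16, 8
tokens = rng.normal(size=(4, d))
w_router = rng.normal(size=(d, n_experts))
expert_mats = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]
y = moe_layer(tokens, w_router, expert_mats)
print(y.shape)  # (4, 16)
```

With k=1 this becomes Switch-style routing; with k=2 it matches Shazeer's and Mixtral's setup.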
What Do Experts Learn?
Research has shown that experts develop genuine specialisation. In multilingual models, different experts handle different languages. In code models, some experts specialise in syntax while others handle logic.
From the Switch Transformers paper, analysis of token routing revealed:
- Some experts specialise in punctuation and formatting
- Others handle content words in specific domains
- The routing is not random β it reflects meaningful linguistic patterns
However, expert specialisation is typically soft rather than hard. An "English expert" might handle 60% English tokens and 40% mixed; it's a preference, not an exclusive assignment.
The Shared Expert Trick
DeepSeek introduced the concept of shared experts: one or more experts that process every token regardless of routing. This handles common patterns (basic grammar, frequent words) while routed experts handle specialised knowledge.
DeepSeek-V3 routing:
Token ─→ Shared Expert (always active) ────┐
    └─→ Router → 8 of 256 Routed Experts ──┴→ Combined Output
This improves stability and ensures no token gets poor-quality processing even if the router makes a suboptimal choice.
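In code, the shared expert is simply an always-on path added to the routed output. This is a sketch of the combination only; DeepSeek's actual gating and normalisation details are not shown here.

```python
import numpy as np

def shared_plus_routed(token, shared_ffn, routed_output):
    """Combine a shared expert with the routed experts' output (sketch).

    token         : (d_model,) input token representation
    shared_ffn    : callable applied to every token regardless of routing
    routed_output : (d_model,) weighted sum from the selected routed experts
    """
    return shared_ffn(token) + routed_output  # shared path always contributes

d = 8
w_shared = np.eye(d) * 0.5                    # toy shared expert: scale by 0.5
token = np.ones(d)
routed = np.full(d, 2.0)                      # pretend the routed experts produced this
out = shared_plus_routed(token, lambda x: x @ w_shared, routed)
print(out)  # 0.5 + 2.0 = 2.5 in every dimension
```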
The Tradeoffs
MoE isn't free. There are real costs and challenges.
Memory
A 46.7B parameter MoE model needs memory for all 46.7B parameters, even though only 12.9B are active at a time. All experts must be loaded and ready. You need the RAM of a 47B model but get the compute of a 13B model.
For DeepSeek-V3 (671B total), this is extreme: you need enough memory to hold 671B parameters even though inference only uses 37B per token.
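A rough estimate of what "all experts must be loaded" means in gigabytes, assuming 2 bytes per parameter (fp16/bf16 weights) and ignoring activations and KV cache:

```python
def weight_memory_gb(params_billions, bytes_per_param=2):
    """Approximate memory needed just to hold the weights, in GB."""
    # 1e9 params per billion, divided by 1e9 bytes per GB, cancels out
    return params_billions * bytes_per_param

print(weight_memory_gb(46.7))  # Mixtral 8x7B: ~93 GB loaded, ~26 GB active per token
print(weight_memory_gb(671))   # DeepSeek-V3: ~1342 GB loaded, ~74 GB active per token
```

The gap between loaded and active memory is the price of MoE: you pay for capacity in RAM, not in FLOPs.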
Communication
When experts are spread across multiple GPUs (expert parallelism), tokens must be sent to whichever GPU holds the selected expert. This creates an all-to-all communication pattern that can bottleneck training and inference.
Switch Transformers introduced techniques to limit communication: restricting how many tokens can go to any single expert (capacity factor) and co-locating experts that frequently handle similar tokens.
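The capacity factor can be sketched as a hard cap on tokens per expert; in the real model, overflow tokens are passed through unchanged via the residual connection rather than processed. A minimal version:

```python
import numpy as np

def apply_capacity(expert_assignment, num_experts, capacity_factor=1.25):
    """Enforce a per-expert token cap (sketch of the Switch capacity factor).

    expert_assignment : (num_tokens,) top-1 expert index per token
    Returns a copy where overflow tokens are marked -1 (dropped / passed through).
    """
    num_tokens = len(expert_assignment)
    capacity = int(capacity_factor * num_tokens / num_experts)  # cap per expert
    counts = np.zeros(num_experts, dtype=int)
    kept = expert_assignment.copy()
    for t, e in enumerate(expert_assignment):
        if counts[e] >= capacity:
            kept[t] = -1          # expert is full: this token overflows
        else:
            counts[e] += 1
    return kept

# 8 tokens, 4 experts, capacity_factor 1.25 -> cap of 2 tokens per expert
assign = np.array([0, 0, 0, 1, 1, 2, 3, 0])
capped = apply_capacity(assign, num_experts=4)
print(capped)  # the 3rd and 4th tokens sent to expert 0 overflow and are marked -1
```

A higher capacity factor drops fewer tokens but wastes more buffer memory per expert; 1.0-1.25 is the range the Switch paper explores.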
Training Instability
MoE models are notoriously harder to train than dense models. The router must learn simultaneously with the experts, creating complex dynamics. Experts can "collapse" (one expert dominates), routing can oscillate, and gradients through the discrete routing decisions are challenging.
Switch Transformers addressed this with:
- Selective precision: Using float32 for the router while keeping experts in bfloat16
- Smaller parameter initialisation to reduce early instability
- Expert dropout during fine-tuning to combat overfitting
Fine-Tuning Challenges
MoE models historically struggle with fine-tuning. The sparse routing means each expert sees fewer examples during fine-tuning, leading to overfitting. Dense models, where every parameter sees every example, fine-tune more stably.
This gap has narrowed with better techniques, but it remains a practical consideration.
Dense vs. Sparse: When to Use What
| Property | Dense | MoE (Sparse) |
|---|---|---|
| Training speed | Baseline | 4-7× faster (same compute) |
| Inference speed | All params active | Only k experts active |
| Memory | = active params | >> active params |
| Fine-tuning | Stable | Historically tricky |
| Knowledge capacity | Limited by size | Much larger per FLOP |
| Serving complexity | Simple | Needs expert parallelism |
| Total parameters | = what you pay for | Much more than what you pay for |
Use dense when:
- You need simple deployment
- Memory is the bottleneck (not compute)
- Fine-tuning stability matters most
Use MoE when:
- You want maximum capability per FLOP
- You can afford the memory overhead
- Serving infrastructure supports expert parallelism
- Training speed matters (pretraining at scale)
The Bigger Picture
MoE and the Scaling Laws
In our scaling laws article, we showed that performance scales as a power law with model size and data. MoE doesn't change these laws; it changes the economics.
A 671B MoE model doesn't perform like a 671B dense model. It performs somewhere between its active parameter count (37B) and its total parameter count, depending on the task. But crucially, it achieves this performance at the inference cost of a 37B model.
MoE shifts the scaling curve: you can access a higher point on the capability axis without proportionally increasing the compute axis.
The Industry Today
MoE is no longer experimental. It's the architecture behind some of the most capable AI systems in production:
- GPT-4: widely reported to use an MoE architecture (unconfirmed by OpenAI, but leaked details suggest a sparse expert design)
- Mixtral 8x7B / 8x22B: open-source MoE models from Mistral
- DeepSeek-V3 / R1: state-of-the-art open MoE with 671B total params
- Grok-1: xAI's 314B parameter MoE model (open-sourced)
- Arctic: Snowflake's 480B MoE for enterprise tasks
The trend is clear: the largest and most capable models are increasingly sparse. Dense models persist at smaller scales (Mistral 7B, Llama 3 8B) where the memory overhead of MoE isn't justified.
From Dense to Sparse: A Timeline
| Year | Milestone | Total Params | Active Params / Routing | Key Innovation |
|---|---|---|---|---|
| 2017 | Shazeer MoE | 137B | top-2 routing | Sparsely-gated routing |
| 2021 | Switch Transformer | 1.6T | top-1 routing | Simplified routing, scale |
| 2023 | Mixtral 8x7B | 46.7B | 12.9B (top-2) | Open-source MoE |
| 2024 | Grok-1 | 314B | ~86B (top-2) | First open MoE from xAI |
| 2024 | DeepSeek-V3 | 671B | 37B (8 of 256) | 256 experts, shared experts |
Why This Matters for the Series
This article completes a trilogy about the economics of AI:
- Scaling Laws showed that performance is predictable: given a compute budget, you know what to expect.
- Mixture of Experts showed how to cheat: get more capability per dollar by making models sparse.
- Together, they explain why modern AI looks the way it does: huge models that run fast, trained on massive data, following mathematical rules that make the whole enterprise predictable.
The Transformer architecture from Article 1 gave us the foundation. BERT and GPT-2 showed what to do with it. Instruction tuning and RLHF taught models to be useful. Chain-of-thought unlocked reasoning. Scaling laws explained the economics. And Mixture of Experts showed how to push past the limits: bigger capacity, smaller bill.
Series Navigation
- Attention Is All You Need: The Paper That Changed AI
- BERT: How AI Learned to Truly Read
- GPT-2: How AI Learned to Write
- FLAN: How AI Learned to Follow Instructions
- InstructGPT: How AI Learned What Humans Actually Want
- Chain-of-Thought: How AI Learned to Show Its Work
- Scaling Laws: Why Bigger Isn't Always Better
- Mixture of Experts: How AI Learned to Cheat the Scaling Laws (you are here)
Last Updated: April 1, 2026
Author: RESEARCHER
Category: Research
Difficulty: Intermediate
Papers:
- Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (arXiv:1701.06538, January 2017)
- Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity" (arXiv:2101.03961, January 2021)
- Jiang et al., "Mixtral of Experts" (arXiv:2401.04088, January 2024)
- DeepSeek-AI, "DeepSeek-V3 Technical Report" (arXiv:2412.19437, December 2024)