Mixture of Experts: How AI Learned to Cheat the Scaling Laws
What if you could have a model with 671 billion parameters but only pay to run 37 billion? Mixture of Experts is the architecture trick behind GPT-4, Mixtral, and DeepSeek: models that are simultaneously massive and efficient. Two landmark papers, and the open models built on them, explain how.
The Scaling Dilemma
In the last article, we learned that AI performance follows predictable power laws. Bigger models perform better. More data helps. But there's a catch:
Bigger models cost more to run.
A 280B parameter model needs ~560 GB of memory just for its weights at 16-bit precision, and uses roughly 4× more electricity per query than a 70B model. Scaling laws tell you how much improvement you'll get from a bigger model, but they don't help you avoid the bill.
What if there was a way to have a model with hundreds of billions of parameters, capturing all that knowledge and capability, but only activate a fraction of them for each query?
That's exactly what Mixture of Experts does.
The Core Idea
A standard Transformer processes every token through every parameter. If you have a 70B model, all 70 billion parameters do work for every single token. This is called a dense model.
A Mixture of Experts (MoE) model replaces some of the Transformer's layers with multiple parallel "expert" networks. For each token, a router (also called a gate) decides which experts to activate. Only the selected experts process that token. The rest sit idle.
Dense Model (every token uses all parameters):
Token → [████████████████████████] → Output
        All 70B parameters active
MoE Model (each token uses only selected experts):
Token → Router → Expert 3 ──┐
                 Expert 7 ──┴→ Output
                 (6 other experts idle)
                 Only ~13B parameters active
The result: a model that stores knowledge across many more parameters than it uses per query. You get the capacity of a huge model with the inference cost of a small one.
Paper 1: The Origin – Sparsely-Gated Mixture of Experts (2017)
Paper: "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer"
Authors: Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean (Google)
Published: January 2017
ArXiv: 1701.06538
The Breakthrough
Shazeer and team (including Geoffrey Hinton and Jeff Dean) built a model with 137 billion parameters in 2017 β years before GPT-3. They achieved this by applying MoE layers between stacked LSTM layers (the dominant architecture before Transformers).
The key innovation was the sparsely-gated design. Previous MoE approaches activated all experts for every input, which defeated the purpose. Shazeer introduced a gating network that selects only the top-k experts (typically k=2), keeping the rest at zero. This made the model sparse: most parameters are idle for any given input.
The Gating Mechanism
The router is a small neural network that takes a token's representation and outputs a probability distribution over all experts:
Gate(x) = Softmax(x · W_gate)
To select only the top-k experts, they use a "noisy top-k" approach:
- Add learnable noise to the gate outputs (helps exploration)
- Keep only the top-k values, set everything else to −∞
- Apply softmax to get the final weights
The selected experts process the token, and their outputs are combined using the gate weights as coefficients.
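The noisy top-k gate can be sketched in a few lines of NumPy. This is a simplified illustration: the paper scales the noise with a second learned matrix, which is folded into a fixed `noise_std` here.

```python
import numpy as np

def noisy_top_k_gate(x, w_gate, k=2, noise_std=1.0, rng=None):
    """Sketch of sparsely-gated top-k routing for a single token.

    x      : token representation, shape (d_model,)
    w_gate : gating weights, shape (d_model, num_experts)
    Returns the indices of the k selected experts and their softmax weights.
    """
    rng = rng or np.random.default_rng(0)
    logits = x @ w_gate                                        # raw gate scores
    logits = logits + rng.normal(0, noise_std, logits.shape)   # exploration noise
    top_k = np.argsort(logits)[-k:]                            # k largest scores
    masked = np.full_like(logits, -np.inf)                     # everything else → −∞
    masked[top_k] = logits[top_k]
    weights = np.exp(masked - masked.max())                    # softmax over masked logits
    weights /= weights.sum()                                   # non-selected experts get exactly 0
    return top_k, weights[top_k]

# toy example: 8 experts, 16-dim token
rng = np.random.default_rng(42)
x = rng.normal(size=16)
w_gate = rng.normal(size=(16, 8))
experts, weights = noisy_top_k_gate(x, w_gate, k=2, rng=rng)
print(experts, weights)  # 2 expert indices; their weights sum to 1
```

Because the masked logits are −∞ for unselected experts, their softmax weight is exactly zero, which is what makes the layer sparse.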
The Load Balancing Problem
Without intervention, the router quickly learns to send most tokens to the same few "popular" experts. Those experts get more training, become even better, attract more tokens: a vicious cycle. Soon you have a 137B model where only 2 experts do any work.
The solution: an auxiliary loss that penalises uneven expert utilisation. This extra term in the training objective encourages the router to distribute tokens roughly equally across all experts. It's a balancing act: you want specialisation, but not monopoly.
Results
The 137B MoE model achieved better results on language modelling and machine translation than dense models of comparable compute cost, demonstrating that sparse models could be both larger and faster.
However, the approach had practical problems: high communication costs across devices, training instability, and difficulty with fine-tuning.
Paper 2: Switch Transformers (2021)
Paper: "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity"
Authors: William Fedus, Barret Zoph, Noam Shazeer (Google)
Published: January 2021 (Journal of Machine Learning Research, 2022)
ArXiv: 2101.03961
Four years after Shazeer's original paper, a Google team including Shazeer applied the MoE concept to Transformers and scaled it dramatically.
The Key Simplification: Top-1 Routing
Shazeer's original paper used top-2 routing (activate 2 experts per token). The prevailing belief was that routing to at least 2 experts was necessary for the gate to learn effectively.
Switch Transformers challenged this assumption: top-1 routing works just as well, and it's simpler. Each token goes to exactly one expert. This halves the compute per MoE layer compared to top-2, and eliminates the need to combine two expert outputs.
Top-2 routing (Shazeer 2017):
Token → Expert A (weight 0.6) + Expert B (weight 0.4) → Combined
Top-1 routing (Switch Transformer):
Token → Expert A → Output
(simpler, faster, works just as well)
The Results
Switch Transformers achieved remarkable training speedups:
- 7× speedup over T5-Base and T5-Large at the same compute cost
- 4× speedup over T5-XXL (Google's previously largest model)
- Scaled to 1.6 trillion parameters, the largest language model at the time
The 1.6T model achieved the same perplexity as T5-XXL while reaching that performance far faster during pretraining. The compute per token was similar to a much smaller dense model, because each token only activated one expert.
Where MoE Layers Go
In a standard Transformer, each layer has two main components:
- Attention: tokens communicate with each other
- Feed-Forward Network (FFN): each token is processed independently
MoE replaces only the FFN layers. The attention layers remain dense (every token uses the full attention mechanism). This makes sense: attention is about relationships between tokens, while FFNs are about processing individual tokens, the part where specialisation helps most.
Standard Transformer Layer:
Input → [Attention] → [FFN] → Output
        (dense)       (dense)
MoE Transformer Layer:
Input → [Attention] → [Router → Expert₁ | Expert₂ | ... | Expert_n] → Output
        (dense)        (sparse: only 1 expert active per token)
Simplified Load Balancing
Switch Transformers introduced a cleaner load-balancing loss:
L_balance = α · N · Σᵢ (fᵢ · Pᵢ)
Where fᵢ is the fraction of tokens routed to expert i, and Pᵢ is the average router probability for expert i. When both are uniform (each expert gets 1/N of the tokens), the loss is minimised. The coefficient α controls how strongly to enforce balance.
This replaced Shazeer's more complex noise-based balancing with a simple, differentiable penalty term.
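The balance loss above is simple enough to compute directly. A minimal sketch, assuming we already have the router's softmax outputs and the hard top-1 assignment for a batch of tokens:

```python
import numpy as np

def switch_balance_loss(router_probs, expert_assignment, num_experts, alpha=0.01):
    """Switch Transformer auxiliary load-balancing loss (sketch).

    router_probs      : (num_tokens, num_experts) softmax outputs of the router
    expert_assignment : (num_tokens,) top-1 expert index per token
    """
    num_tokens = router_probs.shape[0]
    # f_i: fraction of tokens actually routed to expert i
    f = np.bincount(expert_assignment, minlength=num_experts) / num_tokens
    # P_i: average router probability assigned to expert i
    P = router_probs.mean(axis=0)
    return alpha * num_experts * np.sum(f * P)

# perfectly balanced routing over 4 experts: loss = alpha * N * N * (1/N * 1/N) = alpha
probs = np.full((8, 4), 0.25)                 # uniform router distribution
assign = np.array([0, 1, 2, 3, 0, 1, 2, 3])   # each expert gets 1/4 of the tokens
print(switch_balance_loss(probs, assign, num_experts=4))  # → 0.01
```

If instead the router piles everything onto one expert (fᵢ and Pᵢ both peaked), the f·P dot product grows, so the loss pushes the gradient back toward balance.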
From Papers to Products: The MoE Revolution
Mixtral 8x7B (Mistral, December 2023)
Mixtral brought MoE to the open-source community and demonstrated its power in a way anyone could verify.
Architecture:
- 8 experts per layer, top-2 routing (2 experts active per token)
- 46.7B total parameters, 12.9B active per token
- Same base architecture as Mistral 7B, but with MoE FFN layers
Results:
- MMLU: 70.6%, matching or exceeding LLaMA 2 70B (a model with 5.4× more active parameters)
- Outperformed LLaMA 2 70B on reasoning (GSM8K: 58.4% vs 56.8%) and knowledge (ARC-Challenge: 66.4% vs 64.6%)
- Also outperformed GPT-3.5 on most benchmarks
- ~6× faster inference than dense models of similar quality
The numbers tell a striking story:
| Model | Total Params | Active Params | MMLU | Inference Cost |
|---|---|---|---|---|
| LLaMA 2 70B | 70B | 70B (dense) | ~68.9% | Baseline |
| Mixtral 8x7B | 46.7B | 12.9B | 70.6% | ~5-6× cheaper |
| GPT-3.5 | ~175B (est) | ~175B (dense) | ~70% | API pricing |
Mixtral matched a 70B dense model while activating only 12.9B parameters per token. That's the MoE promise delivered.
DeepSeek-V3 (DeepSeek, December 2024)
DeepSeek took MoE to its logical extreme and stunned the AI industry with its efficiency.
Architecture:
- 256 routed experts per layer + 1 shared expert (processes all tokens)
- 8 experts active per token out of 256
- 671B total parameters, 37B active per token
- 14.8 trillion training tokens
Cost:
- Total training: 2.788 million H800 GPU hours
- At $2/GPU hour: $5.576 million total training cost
- For context, GPT-4's training is estimated at $100+ million
Results:
- Matched or exceeded GPT-4o and Claude 3.5 Sonnet on many benchmarks
- Outperformed other open-source models on MMLU, coding, and math
- State-of-the-art performance among open models at a fraction of the cost
DeepSeek-V3 also introduced auxiliary-loss-free load balancing, a technique that achieves balanced expert utilisation without the explicit penalty term that can distort training. Instead, it uses a bias mechanism in the routing to dynamically adjust load distribution.
How the Router Works
The router is perhaps the most interesting component of an MoE model. It's a small learned network that makes a crucial decision: which expert should handle this token?
The Routing Process
1. Token arrives as a vector (e.g., dimension 4096)
2. Router multiplies by weight matrix: scores = token · W_router
(W_router shape: 4096 Γ num_experts)
3. Apply softmax to get probabilities over experts
4. Select top-k experts by probability
5. Route token to selected experts
6. Combine expert outputs (weighted by router probabilities)
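The six steps above can be sketched end-to-end for a top-2 MoE layer. The "experts" here are toy single linear maps standing in for full FFNs, and the batched loop is written for clarity rather than speed:

```python
import numpy as np

def moe_layer(tokens, w_router, expert_weights, k=2):
    """Top-k MoE forward pass for a batch of tokens (sketch).

    tokens         : (num_tokens, d_model)
    w_router       : (d_model, num_experts)
    expert_weights : list of (d_model, d_model) matrices, one per expert
                     (stand-ins for full FFN experts)
    """
    logits = tokens @ w_router                           # step 2: router scores
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)            # step 3: softmax over experts
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        top_k = np.argsort(probs[t])[-k:]                # step 4: pick top-k experts
        gate = probs[t, top_k] / probs[t, top_k].sum()   # renormalise gate weights
        for g, e in zip(gate, top_k):
            out[t] += g * (tokens[t] @ expert_weights[e])  # steps 5-6: route and combine
    return out

rng = np.random.default_rng(0)
d, n_experts = 16, 8
tokens = rng.normal(size=(4, d))
w_router = rng.normal(size=(d, n_experts))
expert_mats = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]
y = moe_layer(tokens, w_router, expert_mats)
print(y.shape)  # (4, 16)
```

With k=1 this becomes Switch-style routing; with k=2 it matches Shazeer's and Mixtral's setup.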
What Do Experts Learn?
Research has shown that experts develop genuine specialisation. In multilingual models, different experts handle different languages. In code models, some experts specialise in syntax while others handle logic.
From the Switch Transformers paper, analysis of token routing revealed:
- Some experts specialise in punctuation and formatting
- Others handle content words in specific domains
- The routing is not random β it reflects meaningful linguistic patterns
However, expert specialisation is typically soft rather than hard. An "English expert" might handle 60% English tokens and 40% mixed; it's a preference, not an exclusive assignment.
The Shared Expert Trick
DeepSeek introduced the concept of shared experts: one or more experts that process every token regardless of routing. This handles common patterns (basic grammar, frequent words) while routed experts handle specialised knowledge.
DeepSeek-V3 routing:
Token ─→ Shared Expert (always active) ────┐
    └─→ Router → 8 of 256 Routed Experts ──┴→ Combined Output
This improves stability and ensures no token gets poor-quality processing even if the router makes a suboptimal choice.
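In code, the shared expert is simply an always-on path added to the routed output. This is a sketch of the combination only; DeepSeek's actual gating and normalisation details are not shown here.

```python
import numpy as np

def shared_plus_routed(token, shared_ffn, routed_output):
    """Combine a shared expert with the routed experts' output (sketch).

    token         : (d_model,) input token representation
    shared_ffn    : callable applied to every token regardless of routing
    routed_output : (d_model,) weighted sum from the selected routed experts
    """
    return shared_ffn(token) + routed_output  # shared path always contributes

d = 8
w_shared = np.eye(d) * 0.5                    # toy shared expert: scale by 0.5
token = np.ones(d)
routed = np.full(d, 2.0)                      # pretend the routed experts produced this
out = shared_plus_routed(token, lambda x: x @ w_shared, routed)
print(out)  # 0.5 + 2.0 = 2.5 in every dimension
```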
The Tradeoffs
MoE isn't free. There are real costs and challenges.
Memory
A 46.7B parameter MoE model needs memory for all 46.7B parameters, even though only 12.9B are active at a time. All experts must be loaded and ready. You need the RAM of a 47B model but get the compute of a 13B model.
For DeepSeek-V3 (671B total), this is extreme: you need enough memory to hold 671B parameters even though inference only uses 37B per token.
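A rough estimate of what "all experts must be loaded" means in gigabytes, assuming 2 bytes per parameter (fp16/bf16 weights) and ignoring activations and KV cache:

```python
def weight_memory_gb(params_billions, bytes_per_param=2):
    """Approximate memory needed just to hold the weights, in GB."""
    # 1e9 params per billion, divided by 1e9 bytes per GB, cancels out
    return params_billions * bytes_per_param

print(weight_memory_gb(46.7))  # Mixtral 8x7B: ~93 GB loaded, ~26 GB active per token
print(weight_memory_gb(671))   # DeepSeek-V3: ~1342 GB loaded, ~74 GB active per token
```

The gap between loaded and active memory is the price of MoE: you pay for capacity in RAM, not in FLOPs.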
Communication
When experts are spread across multiple GPUs (expert parallelism), tokens must be sent to whichever GPU holds the selected expert. This creates an all-to-all communication pattern that can bottleneck training and inference.
Switch Transformers introduced techniques to limit communication: restricting how many tokens can go to any single expert (capacity factor) and co-locating experts that frequently handle similar tokens.
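The capacity factor can be sketched as a hard cap on tokens per expert; in the real model, overflow tokens are passed through unchanged via the residual connection rather than processed. A minimal version:

```python
import numpy as np

def apply_capacity(expert_assignment, num_experts, capacity_factor=1.25):
    """Enforce a per-expert token cap (sketch of the Switch capacity factor).

    expert_assignment : (num_tokens,) top-1 expert index per token
    Returns a copy where overflow tokens are marked -1 (dropped / passed through).
    """
    num_tokens = len(expert_assignment)
    capacity = int(capacity_factor * num_tokens / num_experts)  # cap per expert
    counts = np.zeros(num_experts, dtype=int)
    kept = expert_assignment.copy()
    for t, e in enumerate(expert_assignment):
        if counts[e] >= capacity:
            kept[t] = -1          # expert is full: this token overflows
        else:
            counts[e] += 1
    return kept

# 8 tokens, 4 experts, capacity_factor 1.25 -> cap of 2 tokens per expert
assign = np.array([0, 0, 0, 1, 1, 2, 3, 0])
capped = apply_capacity(assign, num_experts=4)
print(capped)  # the 3rd and 4th tokens sent to expert 0 overflow and are marked -1
```

A higher capacity factor drops fewer tokens but wastes more buffer memory per expert; 1.0-1.25 is the range the Switch paper explores.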
Training Instability
MoE models are notoriously harder to train than dense models. The router must learn simultaneously with the experts, creating complex dynamics. Experts can "collapse" (one expert dominates), routing can oscillate, and gradients through the discrete routing decisions are challenging.
Switch Transformers addressed this with:
- Selective precision: Using float32 for the router while keeping experts in bfloat16
- Smaller parameter initialisation to reduce early instability
- Expert dropout during fine-tuning to combat overfitting
Fine-Tuning Challenges
MoE models historically struggle with fine-tuning. The sparse routing means each expert sees fewer examples during fine-tuning, leading to overfitting. Dense models, where every parameter sees every example, fine-tune more stably.
This gap has narrowed with better techniques, but it remains a practical consideration.
Dense vs. Sparse: When to Use What
| Property | Dense | MoE (Sparse) |
|---|---|---|
| Training speed | Baseline | 4-7× faster (same compute) |
| Inference speed | All params active | Only k experts active |
| Memory | = active params | >> active params |
| Fine-tuning | Stable | Historically tricky |
| Knowledge capacity | Limited by size | Much larger per FLOP |
| Serving complexity | Simple | Needs expert parallelism |
| Total parameters | = what you pay for | Much more than what you pay for |
Use dense when:
- You need simple deployment
- Memory is the bottleneck (not compute)
- Fine-tuning stability matters most
Use MoE when:
- You want maximum capability per FLOP
- You can afford the memory overhead
- Serving infrastructure supports expert parallelism
- Training speed matters (pretraining at scale)
The Bigger Picture
MoE and the Scaling Laws
In our scaling laws article, we showed that performance scales as a power law with model size and data. MoE doesn't change these laws; it changes the economics.
A 671B MoE model doesn't perform like a 671B dense model. It performs somewhere between its active parameter count (37B) and its total parameter count, depending on the task. But crucially, it achieves this performance at the inference cost of a 37B model.
MoE shifts the scaling curve: you can access a higher point on the capability axis without proportionally increasing the compute axis.
The Industry Today
MoE is no longer experimental. It's the architecture behind some of the most capable AI systems in production:
- GPT-4: widely reported to use an MoE architecture (unconfirmed by OpenAI, but leaked details suggest a sparse expert design)
- Mixtral 8x7B / 8x22B: open-source MoE models from Mistral
- DeepSeek-V3 / R1: state-of-the-art open MoE with 671B total params
- Grok-1: xAI's 314B parameter MoE model (open-sourced)
- Arctic: Snowflake's 480B MoE for enterprise tasks
The trend is clear: the largest and most capable models are increasingly sparse. Dense models persist at smaller scales (Mistral 7B, Llama 3 8B) where the memory overhead of MoE isn't justified.
From Dense to Sparse: A Timeline
| Year | Milestone | Total Params | Active Params / Routing | Key Innovation |
|---|---|---|---|---|
| 2017 | Shazeer MoE | 137B | top-2 routing | Sparsely-gated routing |
| 2021 | Switch Transformer | 1.6T | top-1 routing | Simplified routing, scale |
| 2023 | Mixtral 8x7B | 46.7B | 12.9B (top-2) | Open-source MoE |
| 2024 | Grok-1 | 314B | ~86B (top-2) | First open MoE from xAI |
| 2024 | DeepSeek-V3 | 671B | 37B (8 of 256) | 256 experts, shared experts |
Why This Matters for the Series
This article completes a trilogy about the economics of AI:
- Scaling Laws showed that performance is predictable: given a compute budget, you know what to expect.
- Mixture of Experts showed how to cheat: get more capability per dollar by making models sparse.
- Together, they explain why modern AI looks the way it does: huge models that run fast, trained on massive data, following mathematical rules that make the whole enterprise predictable.
The Transformer architecture from Article 1 gave us the foundation. BERT and GPT-2 showed what to do with it. Instruction tuning and RLHF taught models to be useful. Chain-of-thought unlocked reasoning. Scaling laws explained the economics. And Mixture of Experts showed how to push past the limits: bigger capacity, smaller bill.
Series Navigation
- Attention Is All You Need: The Paper That Changed AI
- BERT: How AI Learned to Truly Read
- GPT-2: How AI Learned to Write
- FLAN: How AI Learned to Follow Instructions
- InstructGPT: How AI Learned What Humans Actually Want
- Chain-of-Thought: How AI Learned to Show Its Work
- Scaling Laws: Why Bigger Isn't Always Better
- Mixture of Experts: How AI Learned to Cheat the Scaling Laws (you are here)
Last Updated: April 1, 2026
Author: RESEARCHER
Category: Research
Difficulty: Intermediate
Papers:
- Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (arXiv:1701.06538, January 2017)
- Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity" (arXiv:2101.03961, January 2021)
- Jiang et al., "Mixtral of Experts" (arXiv:2401.04088, January 2024)
- DeepSeek-AI, "DeepSeek-V3 Technical Report" (arXiv:2412.19437, December 2024)