Mixture of Experts
Welcome to Part VII, the frontier. We begin with one of the most important architectural ideas behind today's largest models: the Mixture of Experts (MoE). Scaling laws (Chapter 16) told us that bigger models are better — but bigger models cost more to run, because every parameter is used for every token. MoE breaks this link: it lets a model have FAR more parameters while using only a FRACTION of them for any given token. More capacity, almost the same compute.
The Problem MoE Solves
In a normal ('dense') model, every token passes through every parameter — all the weights are used for all the tokens. So doubling the parameters doubles the compute per token. This is the wall scaling runs into: more capacity means proportionally more cost, forever. MoE asks a radical question: what if each token only used the parameters it actually needs, rather than all of them?
Active vs Total Parameters
This gives MoE its defining characteristic: a distinction between TOTAL parameters (all the experts combined — the model's full capacity) and ACTIVE parameters (the few used for a given token — what determines compute cost). An MoE model might have 8× the total parameters of a dense model but use only a fraction per token, getting much of the quality of a huge model at the cost of a small one.
| | Dense model | MoE model |
|---|---|---|
| Total parameters | All used per token | Many (the full capacity) |
| Active per token | = total | A small fraction of total |
| Compute per token | Scales with total params | Scales with ACTIVE params |
| Capacity | Limited by compute budget | Decoupled from compute |
| Analogy | One overworked generalist | A hospital of specialists |
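To make the total-versus-active distinction concrete, here is a minimal sketch that counts FFN parameters for a single MoE layer. The helper name `moe_ffn_params` and the sizes are illustrative assumptions, and each expert is assumed to be a plain two-matrix MLP without biases.

```python
def moe_ffn_params(d_model: int, d_ff: int, n_experts: int, k: int):
    """Return (total, active) FFN parameter counts for one MoE layer.

    Assumes each expert is a two-matrix MLP (d_model -> d_ff -> d_model)
    with no biases, plus a linear router of shape (d_model, n_experts).
    """
    per_expert = 2 * d_model * d_ff
    router = d_model * n_experts
    total = n_experts * per_expert + router   # what must be stored
    active = k * per_expert + router          # what one token actually uses
    return total, active


# Hypothetical sizes, chosen only for illustration.
total, active = moe_ffn_params(d_model=1024, d_ff=4096, n_experts=8, k=2)
print(f"total FFN params:  {total:,}")
print(f"active FFN params: {active:,}")
print(f"active / total:    {active / total:.3f}")
```

With 8 experts and k=2 the ratio comes out close to one quarter; the small remainder is the router, which every token uses.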
Where exactly does MoE go in a Transformer? Recall from Chapter 13 that each Transformer block has an attention sub-layer and a feed-forward (FFN) sub-layer. MoE replaces the single FFN with MANY FFNs — the 'experts' — plus a small 'router' that decides which experts each token uses. Attention stays the same; only the FFN becomes a mixture of experts.
Anatomy of an MoE Layer
A sparse MoE layer has two parts. The EXPERTS are several independent feed-forward networks (say 8 of them), each identical in structure to the dense FFN it replaces. The ROUTER (or 'gate') is a small network that looks at each token and decides which experts should process it. For each token, only the chosen experts run — the rest are skipped entirely. That skipping is what makes it 'sparse'.
Arch Stack: An MoE layer replaces the FFN with experts + a router
| output (weighted combination of expert outputs) | |
| Expert 0 Expert 1 ... Expert 7 | only the chosen few run |
| Router / Gate | picks top-k experts per token |
| input token | (d,) |
Which Layers Become MoE?
Typically, the FFN in EVERY Transformer block (or every other block) is replaced with an MoE layer, while attention remains dense and shared across all tokens. Since the FFN holds most of a Transformer's parameters, turning FFNs into MoE layers is where the huge parameter expansion comes from. The model becomes a stack of blocks, each with shared attention and a bank of routed experts.
The router is the brain of an MoE layer. For each token, it must decide which experts to use. The standard method is TOP-K ROUTING: the router scores all experts for the token, then sends the token to only the k highest-scoring ones (k is small — often 1 or 2). Let us see exactly how it works.
The Routing Computation
The router is just a small linear layer. For each token, it produces a score (a 'logit') for every expert. A softmax turns these into weights, and the top-k experts by weight are selected. The token is processed by those k experts, and their outputs are combined — weighted by the router's scores — into the layer's output. So the router both CHOOSES the experts and WEIGHTS their contributions.
For token x:
  scores = softmax(W_router · x)                    # one weight per expert
  top_k  = indices of the k largest scores
  output = Σ_{i ∈ top_k} scores[i] · Expert_i(x)    # combine only the chosen k
# Only the k chosen experts are computed. k=2 is common (e.g. Mixtral).
Visualizing Routing
Here is what routing looks like for a few tokens with 5 experts and top-2 routing — each token activates exactly 2 experts, and different tokens go to different experts:
Expert Route: Top-2 routing: each token to its 2 highest-scoring experts
Notice that each token (each row) lights up exactly 2 experts, and the assignment varies by token — the router learns to send different tokens to different specialists. Over a whole batch, all experts get used, but each individual token only pays for 2.
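The routing pattern described above can be reproduced in a few lines. This is only a sketch: the router is an untrained linear layer, the token vectors are random stand-ins, and the sizes (4 tokens, 5 experts, top-2) are chosen to mirror the description rather than any real model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed so the printout is repeatable

n_tokens, d, n_experts, k = 4, 16, 5, 2
x = torch.randn(n_tokens, d)                         # stand-in token vectors
router = torch.nn.Linear(d, n_experts, bias=False)   # untrained router

scores = F.softmax(router(x), dim=-1)                # (n_tokens, n_experts)
topk_w, topk_idx = scores.topk(k, dim=-1)            # (n_tokens, k)

# Build a 0/1 routing matrix: row = token, column = expert.
route = torch.zeros(n_tokens, n_experts, dtype=torch.long)
route.scatter_(1, topk_idx, 1)

for t in range(n_tokens):
    row = " ".join("#" if v else "." for v in route[t].tolist())
    print(f"token {t}: {row}")
```

Every printed row contains exactly two `#` marks, one per selected expert; which columns they fall in depends on the token.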
import torch
import torch.nn.functional as F

class FFN(torch.nn.Module):
    """Minimal stand-in expert: a two-layer MLP (hidden width 4*d is an assumed default)."""
    def __init__(self, d, d_hidden=None):
        super().__init__()
        d_hidden = d_hidden or 4 * d
        self.net = torch.nn.Sequential(
            torch.nn.Linear(d, d_hidden), torch.nn.GELU(), torch.nn.Linear(d_hidden, d))

    def forward(self, x):
        return self.net(x)

class MoELayer(torch.nn.Module):
    def __init__(self, d, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = torch.nn.Linear(d, n_experts)   # scores each expert
        self.experts = torch.nn.ModuleList([FFN(d) for _ in range(n_experts)])

    def forward(self, x):                             # x: (tokens, d)
        scores = F.softmax(self.router(x), dim=-1)    # (tokens, n_experts)
        # Pick the top-k experts per token
        topk_w, topk_idx = scores.topk(self.k, dim=-1)        # (tokens, k)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)    # renormalize so the k weights sum to 1
        out = torch.zeros_like(x)
        for i in range(self.k):                       # for each chosen slot
            for e in range(len(self.experts)):
                mask = (topk_idx[:, i] == e)          # tokens routed to expert e in slot i
                if mask.any():
                    out[mask] += topk_w[mask, i:i+1] * self.experts[e](x[mask])
        return out
# Only experts that received tokens actually compute. k=2 of 8 experts
# run per token -> ~1/4 the FFN compute of using all 8.
Let us make the compute savings concrete, because the numbers are what make MoE compelling. The key ratio is how many experts EXIST versus how many RUN per token. With 8 experts and top-2 routing, each token uses only 2 of 8, so the FFN does roughly a quarter of the work it would if all experts ran, yet the model holds 8 experts' worth of knowledge.
A Concrete Example: Mixtral
Mixtral 8x7B (Mistral, 2023) is the canonical open MoE. It has 8 experts per layer and uses top-2 routing. Its name suggests '8×7B = 56B' but the real numbers are subtler: because attention and embeddings are shared, the TOTAL parameter count is about 47B, while the ACTIVE parameters per token are only about 13B. So Mixtral runs at roughly the cost of a 13B dense model but has the capacity of a 47B one — and it performs accordingly, rivaling much larger dense models.
| Mixtral 8x7B | Value | Meaning |
|---|---|---|
| Experts per layer | 8 | The bank of specialists |
| Active per token (k) | 2 | Top-2 routing |
| Total parameters | ~47B | The full capacity (memory) |
| Active parameters | ~13B | Compute cost per token |
| Runs like a... | ~13B dense model | Cheap to run |
| Performs like a... | Much larger model | Big capacity |
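The ~47B and ~13B figures in the table can be roughly reproduced with back-of-the-envelope arithmetic. The dimensions below are assumptions based on the published Mixtral 8x7B configuration (hidden size 4096, 32 layers, expert width 14336 with three weight matrices per expert, grouped-query attention, 32000-token vocabulary, untied embeddings); normalization weights are ignored, so treat the result as an estimate rather than an official count.

```python
# Assumed Mixtral-8x7B-style dimensions (treat as assumptions, not a spec).
d_model, n_layers = 4096, 32
d_ff, n_experts, k = 14336, 8, 2
n_heads, n_kv_heads, head_dim = 32, 8, 128
vocab = 32000

expert_ffn = 3 * d_model * d_ff                        # gated FFN: three matrices
attention = (2 * d_model * n_heads * head_dim          # query + output projections
             + 2 * d_model * n_kv_heads * head_dim)    # key + value projections
router = d_model * n_experts
embeddings = 2 * vocab * d_model                       # input + output, untied

# Parts every token uses regardless of routing.
shared = n_layers * (attention + router) + embeddings
total = shared + n_layers * n_experts * expert_ffn     # all 8 experts stored
active = shared + n_layers * k * expert_ffn            # only 2 experts run

print(f"total  ~ {total / 1e9:.1f}B")
print(f"active ~ {active / 1e9:.1f}B")
```

Under these assumptions the totals land near 46.7B and 12.9B, consistent with the rounded figures in the table. Nearly all of the gap between total and active comes from the six expert FFNs per layer that a given token does not use.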
Mixtral '8x7B' is NOT 8 x 7B = 56B, because:
- attention layers are SHARED (not duplicated per expert)
- embeddings are SHARED
- only the FFN is replicated into 8 experts
Total ≈ 47B (shared parts + 8 expert FFNs)
Active ≈ 13B (shared parts + 2 expert FFNs per token)
MoE has a notorious failure mode that does not exist in dense models, and understanding it is essential. Left to itself, the router tends to collapse, sending almost all tokens to just a few experts while the rest go unused. This 'expert collapse' wastes the model's capacity and is the central challenge of training MoE.
The Vicious Cycle of Collapse
Expert collapse arises from a self-reinforcing feedback loop. Early in training, by chance, the router slightly favors a few experts. Those experts get more tokens, so they train faster and get better. Because they are better, the router favors them even more, so they get even more tokens. Meanwhile, the neglected experts get few tokens, barely train, stay bad, and get neglected further. The rich get richer; most experts wither. The model ends up using only a handful of its experts — throwing away the capacity MoE was supposed to provide.
Visualizing Collapse
A collapsed MoE looks like this — nearly all tokens funnel to two overloaded experts while the rest sit idle (grayed):
Expert Route: Expert collapse: most tokens funnel to a few experts
The Fix: An Auxiliary Load-Balancing Loss
The standard cure is to ADD a penalty to the training loss that encourages BALANCED expert usage. This 'auxiliary load-balancing loss' is high when tokens are distributed unevenly across experts and low when they are spread out. By minimizing it alongside the main loss, training is pushed to use all experts roughly equally — breaking the rich-get-richer cycle. It is a gentle, constant pressure toward balance.
For each expert i, over a batch:
f_i = fraction of tokens routed to expert i
P_i = average router probability assigned to expert i
L_balance = n_experts · Σ f_i · P_i
# Minimized when load is SPREAD EVENLY across experts.
# Added to the main loss: L = L_main + α · L_balance
The intuition: the loss is minimized when every expert gets an equal share of tokens. By penalizing imbalance, it stops any expert from monopolizing the routing, forcing the router to spread tokens out and give every expert enough traffic to train. The coefficient α sets how strongly balance is enforced: too little and you risk collapse, too much and you override the router's useful specialization.
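The formula above translates almost directly into PyTorch. The sketch below is one plausible implementation, not a reference one: the function name `load_balance_loss` is hypothetical, and f_i is computed over routing slots (tokens × k) so that it sums to 1 for any k.

```python
import torch
import torch.nn.functional as F


def load_balance_loss(router_logits: torch.Tensor, k: int) -> torch.Tensor:
    """n_experts * sum_i f_i * P_i for one batch of router logits.

    router_logits: (tokens, n_experts)
    f_i: fraction of routing slots (tokens * k) assigned to expert i
    P_i: mean softmax probability the router gives expert i
    """
    n_tokens, n_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)
    topk_idx = probs.topk(k, dim=-1).indices                  # (tokens, k)
    counts = torch.bincount(topk_idx.reshape(-1), minlength=n_experts)
    f = counts.float() / (n_tokens * k)    # hard counts: no gradient flows here
    P = probs.mean(dim=0)                  # soft probabilities: gradient flows here
    return n_experts * torch.sum(f * P)


n_tokens, n_experts = 32, 8
token_ids = torch.arange(n_tokens)

# Balanced: token t strongly prefers expert (t mod n_experts).
balanced_logits = 5.0 * F.one_hot(token_ids % n_experts, n_experts).float()
# Collapsed: every token strongly prefers expert 0.
collapsed_logits = 5.0 * F.one_hot(torch.zeros(n_tokens, dtype=torch.long), n_experts).float()

balanced = load_balance_loss(balanced_logits, k=1)
collapsed = load_balance_loss(collapsed_logits, k=1)
print(f"balanced:  {balanced.item():.3f}")
print(f"collapsed: {collapsed.item():.3f}")
```

The balanced case evaluates to 1, the value this expression takes under perfectly uniform routing, while the collapsed case is several times larger. Because f_i comes from hard counts, the gradient reaches the router only through P_i.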
Even with load balancing, in any given batch some experts receive more tokens than others. For efficient hardware execution, each expert is usually given a fixed CAPACITY — a maximum number of tokens it can process per batch. This creates a new wrinkle: what happens when more tokens are routed to an expert than its capacity allows?
Why Fixed Capacity?
Hardware (GPUs/TPUs) is most efficient with FIXED, predictable tensor shapes. If experts could receive any number of tokens, the computation shapes would vary unpredictably, hurting efficiency. So each expert gets a fixed buffer sized for its expected share — the CAPACITY FACTOR controls how much slack above the average each expert gets. This keeps the computation regular and fast.
Token Dropping
When more tokens are routed to an expert than its capacity, the overflow tokens are DROPPED — they skip that expert entirely (often passing through unchanged via the residual connection). This sounds alarming, but a modest drop rate is tolerable: the token still gets processed by the rest of the network and its OTHER chosen expert (with top-2 routing). Still, heavy dropping hurts quality, so the capacity factor and load balancing must keep drops low.
| Capacity factor | Effect |
|---|---|
| Low (e.g. 1.0) | Tight buffers, less memory/compute, but more token dropping |
| Medium (e.g. 1.25) | Common balance — modest slack, few drops |
| High (e.g. 2.0) | Almost no dropping, but wastes memory/compute on padding |
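The effect of the capacity factor can be seen with a small simulation. This sketch assumes one common convention, capacity = capacity factor × (tokens × k / experts) rounded up, and uses deliberately skewed, made-up routing assignments; the helper names are hypothetical.

```python
import math
import torch


def expert_capacity(n_tokens: int, n_experts: int, k: int, capacity_factor: float) -> int:
    """Slots per expert: capacity_factor times the average load, rounded up."""
    return math.ceil(capacity_factor * n_tokens * k / n_experts)


def count_dropped(expert_idx: torch.Tensor, n_experts: int, capacity: int) -> int:
    """Number of routing assignments that overflow their expert's buffer."""
    load = torch.bincount(expert_idx.reshape(-1), minlength=n_experts)
    return int((load - capacity).clamp(min=0).sum())


torch.manual_seed(0)
n_tokens, n_experts, k = 64, 8, 1
# Hypothetical, deliberately skewed top-1 assignments (expert 0 is favored).
weights = torch.tensor([4.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
expert_idx = torch.multinomial(weights, n_tokens, replacement=True)

for cf in (1.0, 1.25, 2.0):
    cap = expert_capacity(n_tokens, n_experts, k, cf)
    dropped = count_dropped(expert_idx, n_experts, cap)
    print(f"capacity factor {cf}: capacity={cap}, dropped={dropped}")
```

Raising the capacity factor can only reduce the number of dropped assignments, at the price of larger, partly empty buffers for the under-used experts.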
MoE's compute savings come with real systems costs. The experts must be STORED (lots of memory), and when experts are spread across multiple devices, tokens must be SENT to wherever their expert lives (lots of communication). These systems challenges are why MoE is harder to deploy than its compute numbers suggest, and connect directly to the distributed training of Chapter 18 and serving of Chapter 31.
Memory: You Store Everything
The first cost is memory. Even though only k experts run per token, ALL the experts must be held in memory — the router might send the next token to any of them. So an MoE model's memory footprint is its TOTAL parameter count, not its active count. A 47B-total MoE needs ~47B params' worth of memory even though it computes like a 13B model. MoE trades the scarce resource (compute) for the more available one (memory) — but memory is still a real constraint.
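A quick calculation shows the gap between what must be stored and what is used per token. The sketch assumes 16-bit weights (2 bytes per parameter) and counts weights only, ignoring activations, the KV cache, and any optimizer state.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


total_params, active_params = 47e9, 13e9   # figures quoted in this chapter
print(f"must be stored:    {weight_memory_gb(total_params):.0f} GB")
print(f"touched per token: {weight_memory_gb(active_params):.0f} GB")
```

Under these assumptions the weights occupy about 94 GB even though each token touches only about 26 GB of them.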
Expert Parallelism and Communication
When the experts don't all fit on one device, they are spread across many — 'expert parallelism' (a form of the model parallelism from Chapter 18). But then a token routed to an expert on a DIFFERENT device must be SENT there, processed, and the result sent back. This all-to-all communication — every device potentially sending tokens to every other — is a major bottleneck, and managing it efficiently is central to MoE systems.
Device Grid: Expert parallelism: experts spread across devices
| | GPU 0 | GPU 1 | GPU 2 | GPU 3 |
|---|---|---|---|---|
| Experts | E0,E1 | E2,E3 | E4,E5 | E6,E7 |
Tool Trace: A token's journey when its expert is on another device
| GPU 0 | Token arrives; router sends it to Expert 5 (on GPU 2) | → |
| Network | All-to-all: token shipped to GPU 2 | → |
| GPU 2 | Expert 5 processes the token | • |
| Network | Result shipped back to GPU 0 | ← |
| GPU 0 | Combines expert outputs; continues | ← |
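The communication pattern can be sketched on a single machine by counting how many tokens each device would have to ship to every other device. Nothing here is actually distributed: the routing choices are random stand-ins, and the layout (two experts per device on four devices) follows the grid above.

```python
import torch

n_devices, experts_per_device = 4, 2       # E0,E1 on GPU 0; E2,E3 on GPU 1; ...
n_experts = n_devices * experts_per_device

torch.manual_seed(0)
tokens_per_device = 6
# Hypothetical top-1 expert choice for every token, grouped by source device.
expert_idx = torch.randint(0, n_experts, (n_devices, tokens_per_device))

dest_device = expert_idx // experts_per_device   # which GPU hosts each chosen expert

# send_counts[src, dst] = number of tokens device src must ship to device dst.
send_counts = torch.zeros(n_devices, n_devices, dtype=torch.long)
for src in range(n_devices):
    send_counts[src] = torch.bincount(dest_device[src], minlength=n_devices)

print(send_counts)
off_device = send_counts.sum() - send_counts.diagonal().sum()
print(f"tokens crossing the network: {int(off_device)} of {expert_idx.numel()}")
```

The diagonal holds tokens whose expert happens to be local; everything off the diagonal is traffic. A real system would hand a matrix like this to an all-to-all collective, once to dispatch tokens and once more to return the results.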
The basic MoE recipe has been refined in many ways to improve quality, balance, and efficiency. Knowing the main variants helps you understand modern MoE models, which rarely use the vanilla recipe.
| Variant | What it changes |
|---|---|
| Fine-grained experts | Many smaller experts (e.g. 64) instead of a few big ones |
| Shared experts | Some experts ALWAYS run (handle common patterns); rest routed |
| Expert choice routing | Experts pick their top tokens (not tokens picking experts) |
| Top-1 routing | Only 1 expert per token (Switch Transformer) — maximal sparsity |
| Dropless MoE | Block-sparse compute; no token dropping (MegaBlocks) |
| Soft MoE | Soft, differentiable assignment instead of hard top-k |
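Expert choice routing is easiest to understand as a change of axis in the top-k operation. The sketch below uses an untrained router and random token vectors, and the per-expert `capacity` of 3 is an arbitrary illustrative value.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_tokens, d, n_experts = 12, 16, 4
capacity = 3                                        # tokens each expert will take

x = torch.randn(n_tokens, d)                        # stand-in token vectors
router = torch.nn.Linear(d, n_experts, bias=False)  # untrained router

scores = F.softmax(router(x), dim=-1)               # (n_tokens, n_experts)

# Token choice: each token picks experts (top-k along the expert axis).
# Expert choice: each expert picks tokens (top-k along the token axis).
gate, chosen_tokens = scores.t().topk(capacity, dim=-1)   # both (n_experts, capacity)

# How many experts selected each token (can be 0, 1, or more).
per_token = torch.bincount(chosen_tokens.reshape(-1), minlength=n_tokens)
print("tokens chosen by each expert:", chosen_tokens.tolist())
print("experts per token:           ", per_token.tolist())
```

Every expert receives exactly `capacity` tokens, so load is balanced by construction. The trade-off is visible in `per_token`: some tokens may be picked by several experts and others by none.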
Fine-Grained Experts
Instead of a few large experts, use MANY smaller ones (e.g. 64 experts, routing to 8). This gives finer-grained specialization — more distinct combinations of experts a token can use — and often better quality at the same active-parameter cost. DeepSeek's MoE models pioneered this fine-grained approach, finding that many small experts outperform few large ones.
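The phrase "more distinct combinations" can be quantified by counting the expert subsets available to a token. This sketch compares a coarse layout (8 experts, top-2) with the fine-grained example sizes above (64 experts, routing to 8).

```python
import math

# Distinct expert subsets a single token can be routed to.
coarse = math.comb(8, 2)     # 8 experts, top-2
fine = math.comb(64, 8)      # 64 experts, top-8

print(f"8 choose 2  = {coarse:,}")
print(f"64 choose 8 = {fine:,}")
```

The coarse layout offers 28 possible combinations; the fine-grained one offers more than four billion.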
Shared Experts
A clever refinement (DeepSeek-MoE and others): designate one or more SHARED experts that ALWAYS process every token, alongside the routed experts. The shared experts learn common, general patterns that all tokens need, freeing the routed experts to specialize on the rest. This reduces redundancy (the routed experts don't each have to relearn common patterns) and improves both balance and quality.
Expert Route: Shared + routed experts: E0 always runs, plus top-2 routed
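A shared-plus-routed layer can be sketched as an ordinary FFN whose output is added to a top-k mixture. This is an illustrative sketch, not any particular model's implementation: the class name `SharedExpertMoE`, the helper `make_ffn`, and all sizes are assumptions.

```python
import torch
import torch.nn.functional as F


def make_ffn(d: int, d_hidden: int) -> torch.nn.Module:
    """A plain two-layer MLP used for both shared and routed experts."""
    return torch.nn.Sequential(
        torch.nn.Linear(d, d_hidden), torch.nn.GELU(), torch.nn.Linear(d_hidden, d)
    )


class SharedExpertMoE(torch.nn.Module):
    """One always-on shared expert plus top-k routed experts (illustrative)."""

    def __init__(self, d: int, d_hidden: int, n_routed: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        self.shared = make_ffn(d, d_hidden)
        self.routed = torch.nn.ModuleList([make_ffn(d, d_hidden) for _ in range(n_routed)])
        self.router = torch.nn.Linear(d, n_routed, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (tokens, d)
        out = self.shared(x)                                 # every token, no routing
        scores = F.softmax(self.router(x), dim=-1)
        topk_w, topk_idx = scores.topk(self.k, dim=-1)       # (tokens, k)
        for e, expert in enumerate(self.routed):
            # Positions (token, slot) where expert e was selected.
            token_pos, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_pos.numel() > 0:
                w = topk_w[token_pos, slot].unsqueeze(-1)
                out = out.index_add(0, token_pos, w * expert(x[token_pos]))
        return out


layer = SharedExpertMoE(d=16, d_hidden=32)
y = layer(torch.randn(5, 16))
print(y.shape)
```

The router only scores the routed experts; the shared expert sits outside the routing decision entirely, which is why it sees every token.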
MoE models are trained much like dense models (Part IV), but the routing adds quirks and instabilities that require care. Understanding these helps explain why MoE, despite its appeal, took years to become reliable at scale.
Training Instabilities
MoE training is less stable than dense training. The discrete top-k selection is non-differentiable (you can't smoothly backprop through 'pick the top 2'), so the router learns only through the weights it assigns to the selected experts, often helped by noisy gating or straight-through-style approximations. Routing can oscillate, experts can collapse (Section 32.5), and the auxiliary loss must be tuned. MoE models are also more sensitive to hyperparameters and prone to loss spikes. Much of the engineering is about keeping training stable.
Router Z-Loss and Stabilization
Beyond the load-balancing loss, MoE training often adds a 'router z-loss' that keeps the router's logits from growing too large (which would make routing overconfident and unstable). Together with careful initialization, gradient clipping (Chapter 15), and tuned auxiliary-loss weights, these stabilizers make large MoE training tractable. The recurring theme: the router needs babysitting that dense FFNs never required.
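The router z-loss is short enough to write out. The sketch below uses the commonly cited form, the mean over tokens of the squared log-sum-exp of the router logits; the function name `router_z_loss` is hypothetical, and in practice the term is added to the training loss with a small coefficient.

```python
import torch


def router_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Mean over tokens of (logsumexp of the router logits) squared."""
    z = torch.logsumexp(router_logits, dim=-1)   # (tokens,)
    return (z ** 2).mean()


small = torch.zeros(4, 8)    # all logits 0, so logsumexp = log(8)
large = small + 10.0         # same softmax probabilities, larger logits

print(f"small logits: {router_z_loss(small).item():.3f}")
print(f"large logits: {router_z_loss(large).item():.3f}")
```

Shifting every logit by a constant leaves the routing probabilities unchanged, yet the z-loss rises sharply. That is the point: it penalizes logit magnitude itself, independent of which experts are chosen.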
| MoE training concern | Mitigation |
|---|---|
| Expert collapse | Auxiliary load-balancing loss |
| Overconfident routing | Router z-loss |
| Non-differentiable top-k | Noisy gating / straight-through estimators |
| Loss spikes / instability | Careful init, grad clipping, lower lr |
| Token dropping | Capacity factor tuning, dropless kernels |
Fine-Tuning MoE: A Known Difficulty
MoE models are notoriously harder to FINE-TUNE than dense models. With far more total parameters but limited fine-tuning data, MoE models overfit more easily, and the routing learned during pretraining can be disrupted by fine-tuning. Techniques like freezing the router, using higher auxiliary-loss weights during fine-tuning, or fine-tuning only some experts help, but MoE fine-tuning remains finickier than dense fine-tuning — a real practical consideration when choosing MoE.
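Freezing the router is the simplest of these techniques to show in code. The sketch below uses a hypothetical `TinyMoE` stand-in whose router lives in a submodule named `router`; real models name their modules differently, so the name check would need adapting.

```python
import torch


class TinyMoE(torch.nn.Module):
    """Hypothetical stand-in for an MoE layer with a `router` submodule."""

    def __init__(self, d: int = 16, n_experts: int = 4):
        super().__init__()
        self.router = torch.nn.Linear(d, n_experts)
        self.experts = torch.nn.ModuleList([torch.nn.Linear(d, d) for _ in range(n_experts)])


model = TinyMoE()

# Freeze every parameter that belongs to the router.
for name, param in model.named_parameters():
    if name.split(".")[0] == "router":
        param.requires_grad_(False)

# Hand only the still-trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)   # lr is an arbitrary placeholder

print(f"trainable tensors: {len(trainable)}")
```

With the router frozen, the token-to-expert assignments learned in pretraining stay fixed and only the experts adapt to the new data.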
MoE is not a research curiosity — it powers many of the most capable models in production. A tour of real MoE models grounds the concepts and shows how the variants combine in practice.
| Model | MoE design |
|---|---|
| Switch Transformer | Google's early large MoE; top-1 routing, up to thousands of experts |
| GLaM | Google; 1.2T total params, ~8% active per token |
| Mixtral 8x7B / 8x22B | Mistral; 8 experts, top-2; the popular open MoE |
| DeepSeek-MoE / V2 / V3 | Fine-grained + shared experts; very efficient |
| GPT-4 (reported) | Widely believed to be a large MoE |
| Grok, others | Many frontier models reportedly use MoE |
The Switch Transformer: Simplifying to Top-1
The Switch Transformer (Fedus et al., 2021) was a landmark that simplified MoE by routing each token to just ONE expert (top-1), maximizing sparsity. It showed MoE could scale past a trillion parameters (its largest variant has about 1.6T) and train stably with the right load balancing. It refined and popularized many of the techniques still used today (a simplified auxiliary loss, capacity factors), building on earlier sparse MoE work, and demonstrated that simpler routing could work at massive scale.
Mixtral: MoE Goes Mainstream Open Source
Mixtral 8x7B (2023) brought MoE to the open community. With 8 experts and top-2 routing, it matched or beat much larger dense models while running at ~13B active cost. Its open release let everyone experiment with MoE, study routing, and build on it — making it the reference open MoE and the model most people first encounter when learning MoE hands-on.
GPT-4 and the Frontier
While architectures of closed frontier models aren't officially disclosed, GPT-4 is widely reported to be a large MoE, and many other frontier models are believed to use experts. The reason is exactly this chapter's thesis: at the largest scales, MoE is how you get enormous capacity at a serving cost that remains feasible. MoE has become a standard tool in the frontier toolkit.
MoE is powerful but not always the right choice. Weighing it against a dense model is an important judgment, and the answer depends heavily on scale, hardware, and use case.
| MoE wins | Dense wins |
|---|---|
| Large-scale pretraining | Smaller models |
| Serving a general model cheaply | Heavy fine-tuning / customization |
| Compute is the bottleneck | Memory is the bottleneck |
| Fast interconnect available | Limited / single-device hardware |
| Maximize capacity per FLOP | Simplicity and predictability |
| Capacity matters most | Stability matters most |
The Core Trade-off, Summarized
MoE gives more capacity per unit of compute, paid for with more memory, more systems complexity, more training instability, and harder fine-tuning. If you are pretraining or serving a very large general model on good hardware and compute-per-token is your constraint, MoE is often the right call. If you are working at smaller scale, on limited hardware, or need to fine-tune frequently, a dense model's simplicity and stability may win despite its lower capacity-per-FLOP.
Let us consolidate the chapter into one coherent picture of how a Mixture of Experts works and what it takes to make it succeed.
Pipeline Flow: The complete MoE picture
| 1 | Replace FFN | Swap each FFN for a bank of expert FFNs + a router |
| 2 | Route | Router scores experts; each token goes to its top-k |
| 3 | Compute sparsely | Only the k chosen experts run per token |
| 4 | Balance | Auxiliary loss spreads tokens across experts (avoid collapse) |
| 5 | Distribute | Spread experts across devices; all-to-all communication |
| 6 | Result | Huge total capacity, small active compute per token |
The Three Ideas to Remember
If you remember three things about MoE: First, MoE DECOUPLES capacity from compute — many total parameters, few active per token, by routing each token to a subset of experts. Second, ROUTING IS THE HEART AND THE HAZARD — a learned router enables specialization but risks expert collapse, fixed by load-balancing losses. Third, the COST IS RELOCATED, NOT REMOVED — MoE trades compute for memory and systems complexity, which is a great trade at the frontier and a questionable one at small scale.
MoE Quick-Reference
| Concept | Key idea | Remember |
|---|---|---|
| Core idea | More params, same compute | Decouple capacity from cost |
| MoE layer | Experts + router replace the FFN | Attention stays dense |
| Top-k routing | Each token to its top-k experts | k=1 or 2; learned router |
| Active vs total | Few active, many total | Compute ∝ active; memory ∝ total |
| Expert collapse | Router favors a few experts | Rich-get-richer; wastes capacity |
| Load-balancing loss | Penalize uneven usage | Essential but delicate |
| Capacity / dropping | Fixed buffers; overflow dropped | Dropless kernels fix it |
| Systems cost | Store all, communicate tokens | Memory + all-to-all bottleneck |
| Variants | Fine-grained, shared experts | Modern models combine them |
Exercises
Exercises 1–10 are pen-and-paper or derivations; 11–20 require code.
Further reading: “Adaptive Mixtures of Local Experts” (Jacobs et al., 1991) — the original idea. “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” (Shazeer et al., 2017) — MoE for deep learning. “Switch Transformers” (Fedus et al., 2021). “GLaM” (Du et al., 2021). “Mixtral of Experts” (Jiang et al., 2024). “DeepSeekMoE” (Dai et al., 2024) for fine-grained and shared experts. “MegaBlocks” (Gale et al., 2022) for dropless MoE. “Expert Choice Routing” (Zhou et al., 2022).
Next → Chapter 33: Long Context & Memory
MoE scaled a model's PARAMETERS efficiently. The next frontier is scaling how much a model can ATTEND to — its context length. Chapter 33 tackles long context and memory: why attention's quadratic cost makes long contexts expensive, the position-encoding tricks (RoPE scaling, ALiBi, YaRN) that let models extend far beyond their training length, efficient attention variants, and the external-memory approaches that let models recall information across vast or unbounded contexts. Having grown the model's capacity with experts, we now grow its window onto the world — and confront the 'lost in the middle' and quadratic-scaling limits that long context must overcome.