Part VII: Frontier Techniques & Future

Chapter 32

Mixture of Experts

Scaling parameter count without scaling compute: sparse MoE layers, top-k routing, the load-balancing problem and expert collapse, and how models like Mixtral and GPT-4 use experts to grow capacity affordably.

20 Exercises

Learning Objectives

1.	Explain the motivation for MoE: more parameters at constant compute.
2.	Understand the sparse MoE layer: experts plus a router.
3.	Understand top-k routing and how tokens are dispatched to experts.
4.	Distinguish active parameters from total parameters.
5.	Understand the load-balancing problem and expert collapse.
6.	Apply the auxiliary load-balancing loss.
7.	Reason about the systems challenges of MoE (memory, communication).
8.	Understand fine-grained and shared-expert variants.
9.	Explain how Mixtral and GPT-4-style models use MoE.
10.	Weigh the trade-offs of MoE versus dense models.

Welcome to Part VII, the frontier. We begin with one of the most important architectural ideas behind today's largest models: the Mixture of Experts (MoE). Scaling laws (Chapter 16) told us that bigger models are better — but bigger models cost more to run, because every parameter is used for every token. MoE breaks this link: it lets a model have FAR more parameters while using only a FRACTION of them for any given token. More capacity, almost the same compute.

The Problem MoE Solves

In a normal ('dense') model, every token passes through every parameter — all the weights are used for all the tokens. So doubling the parameters doubles the compute per token. This is the wall scaling runs into: more capacity means proportionally more cost, forever. MoE asks a radical question: what if each token only used the parameters it actually needs, rather than all of them?

✧

Intuition: A Hospital of Specialists, Not One Overworked Generalist

Imagine a hospital. A DENSE model is like one doctor who must personally know everything and see every patient — to handle more conditions, that single doctor must become impossibly knowledgeable, and every patient waits for the same overloaded person. An MoE model is like a hospital of SPECIALISTS: a receptionist (the router) directs each patient to the few relevant specialists. The hospital collectively knows vastly more than any one doctor, but each patient only consults a couple of them.

So the hospital's total knowledge (total parameters) can be enormous, while any single patient's visit (the compute per token) stays small. This is the essence of MoE: grow the total capacity by adding specialists, but route each token to only a few. Capacity scales; per-token cost does not.

Active vs Total Parameters

This gives MoE its defining characteristic: a distinction between TOTAL parameters (all the experts combined — the model's full capacity) and ACTIVE parameters (the few used for a given token — what determines compute cost). An MoE model might have 8× the total parameters of a dense model but use only a fraction per token, getting much of the quality of a huge model at the cost of a small one.

	Dense model	MoE model
Total parameters	All used per token	Many (the full capacity)
Active per token	= total	A small fraction of total
Compute per token	Scales with total params	Scales with ACTIVE params
Capacity	Limited by compute budget	Decoupled from compute
Analogy	One overworked generalist	A hospital of specialists

✧

MoE Note: The Headline Trade-off

MoE buys capacity with MEMORY rather than compute. You must STORE all the experts (lots of memory — the total parameter count), but you only COMPUTE with a few per token (little compute — the active parameter count). Since memory is often cheaper and more available than compute for inference, this is frequently a winning trade. It is why the largest frontier models — reportedly including GPT-4 — use MoE: it is how you get a trillion-parameter model that is affordable to run.

Keep this trade in mind throughout the chapter: MoE does not give capacity for free. It trades the abundant resource (memory) for the scarce one (compute per token). The rest of the chapter is about making this trade work — routing well, balancing load, and managing the memory and communication costs.

Where exactly does MoE go in a Transformer? Recall from Chapter 13 that each Transformer block has an attention sub-layer and a feed-forward (FFN) sub-layer. MoE replaces the single FFN with MANY FFNs — the 'experts' — plus a small 'router' that decides which experts each token uses. Attention stays the same; only the FFN becomes a mixture of experts.

Anatomy of an MoE Layer

A sparse MoE layer has two parts. The EXPERTS are several independent feed-forward networks (say 8 of them), each identical in structure to the dense FFN it replaces. The ROUTER (or 'gate') is a small network that looks at each token and decides which experts should process it. For each token, only the chosen experts run — the rest are skipped entirely. That skipping is what makes it 'sparse'.

Arch Stack: An MoE layer replaces the FFN with experts + a router

output (weighted combination of expert outputs)
Expert 0 Expert 1 ... Expert 7	only the chosen few run
Router / Gate	picks top-k experts per token
input token	(d,)

Expert

One of several independent feed-forward networks in an MoE layer. Each is the same size as the dense FFN it replaces; only a few experts process any given token.

Router (gate)

A small learned network that, for each token, decides which experts should process it and with what weight. Routing is the heart of MoE.

Which Layers Become MoE?

Typically, the FFN in EVERY Transformer block (or every other block) is replaced with an MoE layer, while attention remains dense and shared across all tokens. Since the FFN holds most of a Transformer's parameters, turning FFNs into MoE layers is where the huge parameter expansion comes from. The model becomes a stack of blocks, each with shared attention and a bank of routed experts.

✧

MoE Note: MoE Is a Drop-In FFN Replacement

Conceptually, MoE is a localized change: take the FFN (a single neural network applied to every token) and replace it with a SET of FFNs plus a router that picks which ones to use. In the basic recipe each expert has the same shape as the FFN it replaces; fine-grained variants use smaller ones. Everything else about the Transformer (attention, residuals, layer norm, the training objective) is unchanged. This modularity is why MoE could be adopted so readily: it slots into the existing Transformer without redesigning it.

Because the change is localized to the FFN, all the machinery of earlier parts — pretraining, fine-tuning, alignment, inference optimization — still applies. MoE adds capacity and a routing problem, but it does not throw away the Transformer; it extends it.

The router is the brain of an MoE layer. For each token, it must decide which experts to use. The standard method is TOP-K ROUTING: the router scores all experts for the token, then sends the token to only the k highest-scoring ones (k is small — often 1 or 2). Let us see exactly how it works.

The Routing Computation

The router is just a small linear layer. For each token, it produces a score (a 'logit') for every expert. A softmax turns these into weights, and the top-k experts by weight are selected. The token is processed by those k experts, and their outputs are combined — weighted by the router's scores — into the layer's output. So the router both CHOOSES the experts and WEIGHTS their contributions.

text•Top-k routing
For token x:
  scores  = softmax(W_router · x)        # one weight per expert
  top_k   = indices of the k largest scores
  output  = Σ_{i ∈ top_k} scores[i] · Expert_i(x)   # combine only the chosen k

# Only the k chosen experts are computed. k=2 is common (e.g. Mixtral).
# Implementations often renormalize the k selected scores so they sum to 1.

Visualizing Routing

Here is what routing looks like for a few tokens with 5 experts and top-2 routing — each token activates exactly 2 experts, and different tokens go to different experts:

Expert Route: Top-2 routing: each token to its 2 highest-scoring experts

Notice that each token (each row) lights up exactly 2 experts, and the assignment varies by token — the router learns to send different tokens to different specialists. Over a whole batch, all experts get used, but each individual token only pays for 2.
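The routing pattern described above can be reproduced with a small standard-library sketch. The router logits here are random numbers standing in for a trained router, and the helper names (`softmax`, `top_k_indices`) are illustrative, not part of any library.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(weights, k):
    """Indices of the k largest weights, highest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:k]

random.seed(0)
n_tokens, n_experts, k = 6, 5, 2

# Hypothetical router logits: one row per token, one column per expert.
logits = [[random.gauss(0.0, 1.0) for _ in range(n_experts)]
          for _ in range(n_tokens)]

# routing[t] holds the experts chosen for token t.
routing = [top_k_indices(softmax(row), k) for row in logits]

# Token-by-expert grid: '#' marks an active expert, '.' an idle one.
for t, chosen in enumerate(routing):
    cells = ["#" if e in chosen else "." for e in range(n_experts)]
    print(f"token {t}: {' '.join(cells)}")

# How many tokens each expert received across the whole batch.
load = [sum(e in chosen for chosen in routing) for e in range(n_experts)]
print("tokens per expert:", load)
```

Every row contains exactly two `#` marks, while the per-expert totals generally differ from one another. That unevenness is the raw material of the load-balancing problem.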

Python•Top-k routing from scratch
import torch
import torch.nn.functional as F


class FFN(torch.nn.Module):
    # One expert: a plain two-layer feed-forward block.
    # (The 4x hidden width and GELU are illustrative choices.)
    def __init__(self, d, hidden_mult=4):
        super().__init__()
        self.up   = torch.nn.Linear(d, hidden_mult * d)
        self.down = torch.nn.Linear(hidden_mult * d, d)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))


class MoELayer(torch.nn.Module):
    def __init__(self, d, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router  = torch.nn.Linear(d, n_experts)   # scores each expert
        self.experts = torch.nn.ModuleList([FFN(d) for _ in range(n_experts)])

    def forward(self, x):                       # x: (tokens, d)
        scores = F.softmax(self.router(x), dim=-1)   # (tokens, n_experts)
        # Pick the top-k experts per token
        topk_w, topk_idx = scores.topk(self.k, dim=-1)  # (tokens, k)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)  # renormalize

        out = torch.zeros_like(x)
        for i in range(self.k):                  # for each chosen slot
            for e in range(len(self.experts)):
                mask = (topk_idx[:, i] == e)        # tokens routed to expert e
                if mask.any():
                    out[mask] += topk_w[mask, i:i+1] * self.experts[e](x[mask])
        return out

# Only experts that received tokens actually compute. k=2 of 8 experts
# run per token -> ~1/4 the FFN compute of using all 8.

✧

MoE Note: Routing Is Learned, Not Fixed

A crucial point: the router is TRAINED along with everything else. Nobody assigns 'this expert handles verbs, that one handles numbers'; the router and experts learn their specializations together during training, driven only by the goal of predicting tokens well. What each expert ends up specializing in is often surprising and not cleanly interpretable. The division of labor emerges from optimization, much like the emergent behaviors we saw in reasoning (Chapter 25).

This learned routing is also what makes MoE tricky. Because routing is learned and discrete (top-k is a hard choice), training can go wrong in ways dense models cannot — most notably expert collapse (Section 32.5), where the router learns to ignore most experts. Managing the router is the central challenge of MoE.

Let us make the compute savings concrete, because the numbers are what make MoE compelling. The key ratio is how many experts EXIST versus how many RUN per token. With 8 experts and top-2 routing, each token uses only 2 of 8 — the FFN does roughly a quarter of the work it would if all experts ran, yet the model holds 8 experts' worth of knowledge.

A Concrete Example: Mixtral

Mixtral 8x7B (Mistral, 2023) is the canonical open MoE. It has 8 experts per layer and uses top-2 routing. Its name suggests '8×7B = 56B' but the real numbers are subtler: because attention and embeddings are shared, the TOTAL parameter count is about 47B, while the ACTIVE parameters per token are only about 13B. So Mixtral runs at roughly the cost of a 13B dense model but has the capacity of a 47B one — and it performs accordingly, rivaling much larger dense models.

Mixtral 8x7B	Value	Meaning
Experts per layer	8	The bank of specialists
Active per token (k)	2	Top-2 routing
Total parameters	~47B	The full capacity (memory)
Active parameters	~13B	Compute cost per token
Runs like a...	~13B dense model	Cheap to run
Performs like a...	Much larger model	Big capacity

text•Why total != experts x expert-size
Mixtral '8x7B' is NOT 8 x 7B = 56B, because:
  - attention layers are SHARED (not duplicated per expert)
  - embeddings are SHARED
  - only the FFN is replicated into 8 experts

Total ≈ 47B (shared parts + 8 expert FFNs)
Active ≈ 13B (shared parts + 2 expert FFNs per token)
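The two approximate figures are enough to back out how the parameters split, if we model the network as a shared part S plus expert FFNs of combined size F per expert (summed over all layers) and treat the router as negligible. That two-term model and the function name below are simplifying assumptions, and since 47B and 13B are rounded, the results are rough estimates rather than official numbers.

```python
def split_shared_and_expert(total, active, n_experts, k):
    """Solve total = S + n_experts*F and active = S + k*F for (S, F)."""
    per_expert = (total - active) / (n_experts - k)
    shared = active - k * per_expert
    return shared, per_expert

# Approximate figures quoted in the text, in billions of parameters.
total_b, active_b = 47.0, 13.0
shared_b, expert_b = split_shared_and_expert(total_b, active_b, n_experts=8, k=2)

print(f"shared part (attention, embeddings, ...): ~{shared_b:.1f}B")
print(f"one expert, summed over all layers:       ~{expert_b:.1f}B")
print(f"active / total parameters:                {active_b / total_b:.0%}")
```

Under these assumptions almost all of the parameters sit in the expert FFNs, which is why replicating only the FFN is enough to multiply capacity.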

✧

MoE Changes the Scaling Equation

Scaling laws (Chapter 16) related performance to compute and parameters. MoE partially DECOUPLES them: you can grow the total parameter count (and capacity) by adding experts, while holding the per-token compute roughly fixed. This gives a new axis of scaling — 'sparse scaling' — where you buy capability with memory and total parameters rather than with compute per token.

This is why MoE became central at the frontier: as compute per token became the binding constraint on serving cost, MoE offered a way to keep growing capacity without growing that cost proportionally. It is one of the main reasons the largest deployed models could get so large while remaining (relatively) affordable to run.

MoE has a notorious failure mode that does not exist in dense models, and understanding it is essential. Left to itself, the router tends to collapse — sending almost all tokens to just a few experts while the rest go unused. This 'expert collapse' wastes the model's capacity and is the central challenge of training MoE.

The Vicious Cycle of Collapse

Expert collapse arises from a self-reinforcing feedback loop. Early in training, by chance, the router slightly favors a few experts. Those experts get more tokens, so they train faster and get better. Because they are better, the router favors them even more, so they get even more tokens. Meanwhile, the neglected experts get few tokens, barely train, stay bad, and get neglected further. The rich get richer; most experts wither. The model ends up using only a handful of its experts — throwing away the capacity MoE was supposed to provide.

Expert collapse

A failure mode where the router learns to route most tokens to a small number of experts, leaving the others under-trained and unused, wasting the model's capacity. Caused by a rich-get-richer feedback loop in routing.

Visualizing Collapse

A collapsed MoE looks like this — nearly all tokens funnel to two overloaded experts while the rest sit idle (grayed):

Expert Route: Expert collapse: most tokens funnel to a few experts

The Fix: An Auxiliary Load-Balancing Loss

The standard cure is to ADD a penalty to the training loss that encourages BALANCED expert usage. This 'auxiliary load-balancing loss' is high when tokens are distributed unevenly across experts and low when they are spread out. By minimizing it alongside the main loss, training is pushed to use all experts roughly equally — breaking the rich-get-richer cycle. It is a gentle, constant pressure toward balance.

text•Auxiliary load-balancing loss (sketch)
For each expert i, over a batch:
  f_i = fraction of tokens routed to expert i
  P_i = average router probability assigned to expert i

  L_balance = n_experts · Σ f_i · P_i

# Minimized when load is SPREAD EVENLY across experts.
# Added to the main loss: L = L_main + α · L_balance

The intuition: the loss is minimized when every expert gets an equal share of tokens. By penalizing imbalance, it stops any expert from monopolizing the routing, forcing the router to spread tokens out and give every expert enough traffic to train. The coefficient (alpha) sets how strongly balance is enforced — too little and you risk collapse, too much and you override the router's useful specialization.
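To see the loss respond to imbalance, here is a small sketch that evaluates the formula above on two hand-made batches with top-1 routing. The probabilities are invented for illustration and the helper names are hypothetical. A real implementation would compute the same quantity on tensors inside the training loop.

```python
def load_balance_loss(probs, assignments, n_experts):
    """n_experts * sum_i f_i * P_i, for top-1 routing.

    probs[t][i]    : router probability of expert i for token t
    assignments[t] : expert actually chosen for token t
    """
    n_tokens = len(probs)
    total = 0.0
    for i in range(n_experts):
        f_i = sum(1 for a in assignments if a == i) / n_tokens   # token share
        p_i = sum(row[i] for row in probs) / n_tokens            # mean prob
        total += f_i * p_i
    return n_experts * total

def confident_row(favourite, n_experts, high=0.7):
    """A router output that puts `high` on one expert, the rest spread evenly."""
    low = (1.0 - high) / (n_experts - 1)
    return [high if i == favourite else low for i in range(n_experts)]

n_experts = 4

# Balanced batch: each of 4 tokens prefers a different expert.
balanced_assign = [0, 1, 2, 3]
balanced_probs = [confident_row(a, n_experts) for a in balanced_assign]

# Collapsed batch: every token prefers expert 0.
collapsed_assign = [0, 0, 0, 0]
collapsed_probs = [confident_row(a, n_experts) for a in collapsed_assign]

balanced = load_balance_loss(balanced_probs, balanced_assign, n_experts)
collapsed = load_balance_loss(collapsed_probs, collapsed_assign, n_experts)
print(f"balanced:  {balanced:.2f}")
print(f"collapsed: {collapsed:.2f}")
```

The balanced batch scores 1.0, the minimum for evenly spread load, while the collapsed batch scores 2.8. The optimizer is therefore pushed away from the collapsed configuration.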

⚠️

Load Balancing Is a Delicate Trade-off

The load-balancing loss is essential but delicate. Too WEAK, and the model collapses to a few experts — wasting capacity. Too STRONG, and you force balance so hard that the router can no longer specialize meaningfully — tokens get spread evenly but not SENSIBLY, hurting quality. Tuning this balance is one of the trickiest parts of training MoE, and much MoE research is about achieving good balance without sacrificing useful specialization.

This tension — between balanced utilization and meaningful specialization — is fundamental to MoE. You want all experts USED (balance) AND each expert doing something USEFUL (specialization), and these pull against each other. Getting both is the art of MoE training.

Even with load balancing, in any given batch some experts receive more tokens than others. For efficient hardware execution, each expert is usually given a fixed CAPACITY — a maximum number of tokens it can process per batch. This creates a new wrinkle: what happens when more tokens are routed to an expert than its capacity allows?

Why Fixed Capacity?

Hardware (GPUs/TPUs) is most efficient with FIXED, predictable tensor shapes. If experts could receive any number of tokens, the computation shapes would vary unpredictably, hurting efficiency. So each expert gets a fixed buffer sized for its expected share — the CAPACITY FACTOR controls how much slack above the average each expert gets. This keeps the computation regular and fast.

Expert capacity

The maximum number of tokens an expert will process in a batch, fixed for hardware efficiency. Set by a capacity factor (e.g. 1.25 = 25% above the average even share).

Token Dropping

When more tokens are routed to an expert than its capacity, the overflow tokens are DROPPED — they skip that expert entirely (often passing through unchanged via the residual connection). This sounds alarming, but a modest drop rate is tolerable: the token still gets processed by the rest of the network and its OTHER chosen expert (with top-2 routing). Still, heavy dropping hurts quality, so the capacity factor and load balancing must keep drops low.

Capacity factor	Effect
Low (e.g. 1.0)	Tight buffers, less memory/compute, but more token dropping
Medium (e.g. 1.25)	Common balance — modest slack, few drops
High (e.g. 2.0)	Almost no dropping, but wastes memory/compute on padding
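The effect of the capacity factor can be checked with a toy dispatch. The sketch assumes top-1 routing, first-come-first-served buffers, and a capacity of the even share times the capacity factor, rounded up. Real systems differ in these details, and the routing assignments are made up.

```python
import math

def expert_capacity(n_tokens, n_experts, capacity_factor):
    """Max tokens per expert: capacity factor times the even share, rounded up."""
    return math.ceil(capacity_factor * n_tokens / n_experts)

def dispatch(assignments, n_experts, capacity):
    """Fill expert buffers in token order; overflow tokens are dropped."""
    buffers = [[] for _ in range(n_experts)]
    dropped = []
    for token, expert in enumerate(assignments):
        if len(buffers[expert]) < capacity:
            buffers[expert].append(token)
        else:
            dropped.append(token)   # token skips the expert entirely
    return buffers, dropped

# Hypothetical top-1 routing for 8 tokens over 4 experts; expert 0 is popular.
assignments = [0, 0, 0, 1, 1, 2, 3, 0]
n_experts = 4

for cf in (1.0, 1.25, 2.0):
    cap = expert_capacity(len(assignments), n_experts, cf)
    _, dropped = dispatch(assignments, n_experts, cap)
    print(f"capacity factor {cf}: capacity {cap}, dropped tokens {dropped}")
```

With these assignments the three settings drop two, one, and zero tokens respectively, matching the low, medium, and high rows of the table.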

✧

MoE Note: Dropless MoE

Token dropping is an artifact of fixed-shape hardware efficiency, not a fundamental necessity. Newer approaches ('dropless' MoE, e.g. MegaBlocks) use clever sparse computation that handles variable expert loads WITHOUT dropping any tokens or wasting capacity on padding. These use block-sparse matrix operations to process exactly the tokens each expert received, efficiently, whatever the load.

This is a good example of how MoE's challenges are often SYSTEMS problems with systems solutions. The capacity/dropping issue arises from how GPUs like regular shapes; better kernels (dropless MoE) solve it. Much of making MoE practical is this kind of careful systems engineering around the routing, which we turn to next.

MoE's compute savings come with real systems costs. The experts must be STORED (lots of memory), and when experts are spread across multiple devices, tokens must be SENT to wherever their expert lives (lots of communication). These systems challenges are why MoE is harder to deploy than its compute numbers suggest, and connect directly to the distributed training of Chapter 18 and serving of Chapter 31.

Memory: You Store Everything

The first cost is memory. Even though only k experts run per token, ALL the experts must be held in memory, because the router might send the next token to any of them. So an MoE model's memory footprint is set by its TOTAL parameter count, not its active count. A 47B-total MoE needs ~47B params' worth of memory even though it computes like a 13B model. MoE spends the more available resource (memory) to save the scarce one (compute), but memory is still a real constraint.
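A back-of-the-envelope calculation makes the gap concrete. It assumes 16-bit weights (2 bytes per parameter) and counts weights only, ignoring activations, the KV cache, and any optimizer state. The function name is illustrative.

```python
BYTES_PER_PARAM = 2   # assumption: 16-bit weights (fp16 / bf16)

def weight_memory_gb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

total_params = 47e9    # what must be stored
active_params = 13e9   # what is used per token

stored_gb = weight_memory_gb(total_params)
dense_equiv_gb = weight_memory_gb(active_params)

print(f"MoE weights to store:         {stored_gb:.0f} GB")
print(f"13B dense model, for compare: {dense_equiv_gb:.0f} GB")
```

Under these assumptions the MoE needs roughly 94 GB for weights against about 26 GB for a dense model with the same per-token compute.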

Expert Parallelism and Communication

When the experts don't all fit on one device, they are spread across many — 'expert parallelism' (a form of the model parallelism from Chapter 18). But then a token routed to an expert on a DIFFERENT device must be SENT there, processed, and the result sent back. This all-to-all communication — every device potentially sending tokens to every other — is a major bottleneck, and managing it efficiently is central to MoE systems.

Device Grid: Expert parallelism: experts spread across devices

	GPU 0	GPU 1	GPU 2	GPU 3
Experts	E0,E1	E2,E3	E4,E5	E6,E7

Tool Trace: A token's journey when its expert is on another device

GPU 0	Token arrives; router sends it to Expert 5 (on GPU 2)	→
Network	All-to-all: token shipped to GPU 2	→
GPU 2	Expert 5 processes the token	•
Network	Result shipped back to GPU 0	←
GPU 0	Combines expert outputs; continues	←
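The placement in the grid and the trace above can be expressed in a few lines. The sketch assumes the same layout (two consecutive experts per GPU) and uses an invented batch of top-2 routing decisions to count how many expert calls have to cross the network. The function names are illustrative.

```python
EXPERTS_PER_DEVICE = 2   # E0,E1 on GPU 0; E2,E3 on GPU 1; and so on

def device_of_expert(expert):
    """Which GPU holds a given expert under the layout above."""
    return expert // EXPERTS_PER_DEVICE

def remote_sends(token_device, chosen_experts):
    """Chosen experts that live on a different device than the token."""
    return [e for e in chosen_experts if device_of_expert(e) != token_device]

# The token from the trace: it sits on GPU 0 and is routed to Expert 5.
print("Expert 5 lives on GPU", device_of_expert(5))

# Hypothetical batch: (device holding the token, its top-2 experts).
batch = [(0, [0, 5]), (0, [1, 0]), (1, [2, 7]), (2, [4, 5]), (3, [0, 1])]

n_calls = sum(len(experts) for _, experts in batch)
n_remote = sum(len(remote_sends(dev, experts)) for dev, experts in batch)
print(f"{n_remote} of {n_calls} expert calls cross devices")
```

Each remote call implies two transfers, the token out and the result back, which is why the all-to-all step weighs so heavily on slow interconnects.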

⚠️

Communication Can Eat the Compute Savings

The all-to-all communication of expert parallelism can be so expensive that it eats into MoE's compute savings, especially at smaller scales or with slow interconnects. This is why MoE shines most in large, well-connected clusters with fast interconnects (NVLink, InfiniBand) where the communication is affordable. On modest hardware, the routing overhead can make MoE less attractive than its theoretical compute savings suggest.

The lesson echoes Chapters 18 and 31: at the frontier, the algorithm and the systems are inseparable. MoE's elegant compute savings are real, but realizing them requires solving hard distributed-systems problems — memory, communication, load balancing — that determine whether MoE actually pays off in practice.

The basic MoE recipe has been refined in many ways to improve quality, balance, and efficiency. Knowing the main variants helps you understand modern MoE models, which rarely use the vanilla recipe.

Variant	What it changes
Fine-grained experts	Many smaller experts (e.g. 64) instead of a few big ones
Shared experts	Some experts ALWAYS run (handle common patterns); rest routed
Expert choice routing	Experts pick their top tokens (not tokens picking experts)
Top-1 routing	Only 1 expert per token (Switch Transformer) — maximal sparsity
Dropless MoE	Block-sparse compute; no token dropping (MegaBlocks)
Soft MoE	Soft, differentiable assignment instead of hard top-k

Fine-Grained Experts

Instead of a few large experts, use MANY smaller ones (e.g. 64 experts, routing to 8). This gives finer-grained specialization — more distinct combinations of experts a token can use — and often better quality at the same active-parameter cost. DeepSeek's MoE models pioneered this fine-grained approach, finding that many small experts outperform few large ones.
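The phrase "more distinct combinations" can be quantified by counting the possible expert subsets per token. The sketch compares the two configurations mentioned in this chapter (8 experts with top-2, and 64 experts with top-8). It counts combinations only and makes no claim that the two setups have equal parameter budgets.

```python
import math

# Number of distinct expert subsets a single token can be routed to.
few_large = math.comb(8, 2)     # 8 experts, top-2
many_small = math.comb(64, 8)   # 64 experts, top-8

print(f"8 experts, top-2:  {few_large:,} combinations")
print(f"64 experts, top-8: {many_small:,} combinations")
```

The jump from a few dozen combinations to several billion is the sense in which fine-grained experts allow finer specialization.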

Shared Experts

A clever refinement (DeepSeek-MoE and others): designate one or more SHARED experts that ALWAYS process every token, alongside the routed experts. The shared experts learn common, general patterns that all tokens need, freeing the routed experts to specialize on the rest. This reduces redundancy (the routed experts don't each have to relearn common patterns) and improves both balance and quality.

Expert Route: Shared + routed experts: E0 always runs, plus top-2 routed
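A toy version of the shared-plus-routed combination is sketched below, using scalar functions as stand-in experts. Adding the shared output unweighted and renormalizing the routed weights over the top-k are assumptions made for illustration, and specific models may combine the terms differently. All names are hypothetical.

```python
def shared_plus_routed(x, shared_experts, routed_experts, weights, k):
    """Sum of every shared expert plus the weighted top-k routed experts."""
    out = sum(expert(x) for expert in shared_experts)   # always run
    order = sorted(range(len(routed_experts)),
                   key=lambda i: weights[i], reverse=True)
    top = order[:k]
    norm = sum(weights[i] for i in top)                 # renormalize over top-k
    for i in top:
        out += (weights[i] / norm) * routed_experts[i](x)
    return out, top

# Stand-in experts: simple scalar functions instead of feed-forward networks.
shared = [lambda x: x + 1.0]
routed = [lambda x: 10.0 * x, lambda x: 20.0 * x,
          lambda x: 30.0 * x, lambda x: 40.0 * x]

# Hypothetical router weights for one token over the 4 routed experts.
router_weights = [0.1, 0.5, 0.1, 0.3]

output, chosen = shared_plus_routed(1.0, shared, routed, router_weights, k=2)
print("routed experts used:", chosen)
print("output:", output)
```

The shared expert contributes to every token regardless of the router, so the routed experts only need to supply what differs from token to token.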

✧

MoE Note: Expert Choice: Flipping the Routing

A neat alternative is EXPERT CHOICE routing: instead of each token picking its top-k experts, each EXPERT picks its top tokens (up to its capacity). This guarantees perfect load balance by construction — every expert gets exactly its capacity, no more, no less — sidestepping the collapse problem entirely. The trade-off is that some tokens might be picked by many experts and others by none, so it is used carefully.

These variants show that the basic 'tokens pick top-k experts' recipe is just one point in a design space. Modern MoE models mix and match — fine-grained experts, shared experts, careful routing — to balance the competing pressures of specialization, balance, and efficiency that define MoE engineering.
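Expert choice routing, described in the note above, is easy to sketch: each expert sorts the tokens by affinity and keeps its top `capacity`. The scores are invented, and the function name is illustrative.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the affinity of token t for expert e.
    Returns, per expert, the sorted list of tokens it picked."""
    n_tokens, n_experts = len(scores), len(scores[0])
    picks = []
    for e in range(n_experts):
        ranked = sorted(range(n_tokens),
                        key=lambda t: scores[t][e], reverse=True)
        picks.append(sorted(ranked[:capacity]))
    return picks

# Hypothetical affinities: 4 tokens (rows) by 2 experts (columns).
scores = [
    [0.9, 0.8],
    [0.7, 0.6],
    [0.2, 0.1],
    [0.3, 0.4],
]

picks = expert_choice(scores, capacity=2)
experts_per_token = [sum(t in chosen for chosen in picks)
                     for t in range(len(scores))]

print("tokens picked by each expert:", picks)
print("experts serving each token:  ", experts_per_token)
```

Both experts are filled exactly to capacity, yet tokens 0 and 1 are served twice and tokens 2 and 3 not at all. This is the trade-off the note mentions.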

MoE models are trained much like dense models (Part IV), but the routing adds quirks and instabilities that require care. Understanding these helps explain why MoE, despite its appeal, took years to become reliable at scale.

Training Instabilities

MoE training is less stable than dense training. The discrete top-k selection is non-differentiable (you can't smoothly backprop through 'pick the top 2'), so the router receives gradient only through the weights of the experts it actually selected; workarounds such as noisy gating or straight-through estimators are approximations and add noise of their own. Routing can oscillate, experts can collapse (Section 32.5), and the auxiliary loss must be tuned. MoE models are also more sensitive to hyperparameters and prone to loss spikes. Much of the engineering is about keeping training stable.

Router Z-Loss and Stabilization

Beyond the load-balancing loss, MoE training often adds a 'router z-loss' that keeps the router's logits from growing too large (which would make routing overconfident and unstable). Together with careful initialization, gradient clipping (Chapter 15), and tuned auxiliary-loss weights, these stabilizers make large MoE training tractable. The recurring theme: the router needs babysitting that dense FFNs never required.
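As a sketch of the idea, the router z-loss is commonly written as the mean over tokens of the squared log-sum-exp of the router logits. The code below uses that form, which is assumed here rather than taken from this chapter, and the logit values are invented.

```python
import math

def router_z_loss(logits):
    """Mean over tokens of (log sum_j exp(logit_j)) squared."""
    total = 0.0
    for row in logits:
        m = max(row)   # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(v - m) for v in row))
        total += lse ** 2
    return total / len(logits)

# Hypothetical router logits for one token over 4 experts.
small_logits = [[0.1, -0.2, 0.0, 0.3]]
large_logits = [[10.0, -20.0, 0.0, 30.0]]

z_small = router_z_loss(small_logits)
z_large = router_z_loss(large_logits)
print(f"small logits: {z_small:.2f}")
print(f"large logits: {z_large:.2f}")
```

Large logits are penalized far more heavily than small ones, so minimizing this term alongside the main loss discourages the router from becoming overconfident.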

MoE training concern	Mitigation
Expert collapse	Auxiliary load-balancing loss
Overconfident routing	Router z-loss
Non-differentiable top-k	Noisy gating / straight-through estimators
Loss spikes / instability	Careful init, grad clipping, lower lr
Token dropping	Capacity factor tuning, dropless kernels

Fine-Tuning MoE: A Known Difficulty

MoE models are notoriously harder to FINE-TUNE than dense models. With far more total parameters but limited fine-tuning data, MoE models overfit more easily, and the routing learned during pretraining can be disrupted by fine-tuning. Techniques like freezing the router, using higher auxiliary-loss weights during fine-tuning, or fine-tuning only some experts help, but MoE fine-tuning remains finickier than dense fine-tuning — a real practical consideration when choosing MoE.

✧

MoE Note: The Trade-off Sharpens at Fine-Tuning Time

A practical decision point: if you plan to heavily fine-tune or specialize a model, a dense model may be easier to work with than an MoE one, despite MoE's serving efficiency. MoE shines for large-scale pretraining and serving a general model cheaply; its fine-tuning difficulties can offset those gains if your workflow involves frequent customization. As always, match the architecture to the use case — MoE is a powerful tool, not a universal upgrade.

This connects to the recurring Part VI–VII theme: every frontier technique has trade-offs. MoE buys serving efficiency at the cost of memory, systems complexity, training instability, and fine-tuning difficulty. Whether that trade is worth it depends entirely on your scale, hardware, and how you will use the model.

MoE is not a research curiosity — it powers many of the most capable models in production. A tour of real MoE models grounds the concepts and shows how the variants combine in practice.

Model	MoE design
Switch Transformer	Google's early large MoE; top-1 routing, up to thousands of experts
GLaM	Google; 1.2T total params, ~8% active per token
Mixtral 8x7B / 8x22B	Mistral; 8 experts, top-2; the popular open MoE
DeepSeek-MoE / V2 / V3	Fine-grained + shared experts; very efficient
GPT-4 (reported)	Widely believed to be a large MoE
Grok, others	Many frontier models reportedly use MoE

The Switch Transformer: Simplifying to Top-1

The Switch Transformer (Fedus et al., 2021) was a landmark that simplified MoE by routing each token to just ONE expert (top-1), maximizing sparsity. It showed that MoE could scale past a trillion parameters and train stably with the right load balancing. It simplified and popularized many of the techniques still used today, such as the auxiliary loss and capacity factors, and demonstrated that simpler routing could work at massive scale.

Mixtral: MoE Goes Mainstream Open Source

Mixtral 8x7B (2023) brought MoE to the open community. With 8 experts and top-2 routing, it matched or beat much larger dense models while running at ~13B active cost. Its open release let everyone experiment with MoE, study routing, and build on it — making it the reference open MoE and the model most people first encounter when learning MoE hands-on.

GPT-4 and the Frontier

While architectures of closed frontier models aren't officially disclosed, GPT-4 is widely reported to be a large MoE, and many other frontier models are believed to use experts. The reason is exactly this chapter's thesis: at the largest scales, MoE is how you get enormous capacity at a serving cost that remains feasible. MoE has become a standard tool in the frontier toolkit.

✧

History: From 1991 to the Frontier

The Mixture of Experts idea is old — it dates to work by Jacobs, Jordan, Hinton, and others around 1991, long before Transformers. The notion of specialized sub-networks with a gating function is decades old. What changed was scale: when models grew large enough that compute-per-token became the binding constraint, the old MoE idea found its moment, combined with the Transformer and modern distributed systems.

This is a recurring pattern in deep learning (echoing the histories in Parts II–III): foundational ideas wait, sometimes for decades, until scale and complementary advances make them shine. MoE is a striking example — a 1991 idea that became central to the 2020s frontier.

MoE is powerful but not always the right choice. Weighing it against a dense model is an important judgment, and the answer depends heavily on scale, hardware, and use case.

MoE wins	Dense wins
Large-scale pretraining	Smaller models
Serving a general model cheaply	Heavy fine-tuning / customization
Compute is the bottleneck	Memory is the bottleneck
Fast interconnect available	Limited / single-device hardware
Maximize capacity per FLOP	Simplicity and predictability
Capacity matters most	Stability matters most

The Core Trade-off, Summarized

MoE gives more capacity per unit of compute, paid for with more memory, more systems complexity, more training instability, and harder fine-tuning. If you are pretraining or serving a very large general model on good hardware and compute-per-token is your constraint, MoE is often the right call. If you are working at smaller scale, on limited hardware, or need to fine-tune frequently, a dense model's simplicity and stability may win despite its lower capacity-per-FLOP.

✧

Intuition: There Is No Free Lunch — Only Different Lunches

MoE does not give capacity for free; it relocates the cost from compute to memory and complexity. Whether that relocation helps depends on which resource is scarce for YOU. At the frontier, where compute is the binding constraint and memory and engineering talent are available, MoE is a great trade. For a hobbyist on one GPU who wants to fine-tune, the relocated costs may dominate.

This is the mature view of every technique in this book: there is no universally best architecture, only the best fit for a given set of constraints. Understanding MoE deeply — what it costs and what it buys — is what lets you make that judgment well, rather than reaching for it because it is fashionable.

Let us consolidate the chapter into one coherent picture of how a Mixture of Experts works and what it takes to make it succeed.

Pipeline Flow: The complete MoE picture

1	Replace FFN	Swap each FFN for a bank of expert FFNs + a router
2	Route	Router scores experts; each token goes to its top-k
3	Compute sparsely	Only the k chosen experts run per token
4	Balance	Auxiliary loss spreads tokens across experts (avoid collapse)
5	Distribute	Spread experts across devices; all-to-all communication
6	Result	Huge total capacity, small active compute per token

The Three Ideas to Remember

If you remember three things about MoE: First, MoE DECOUPLES capacity from compute — many total parameters, few active per token, by routing each token to a subset of experts. Second, ROUTING IS THE HEART AND THE HAZARD — a learned router enables specialization but risks expert collapse, fixed by load-balancing losses. Third, the COST IS RELOCATED, NOT REMOVED — MoE trades compute for memory and systems complexity, which is a great trade at the frontier and a questionable one at small scale.

✧

MoE Note: MoE Embodies the Frontier's Theme

MoE captures the spirit of Part VII: the frontier is about pushing past the limits of the straightforward approach (here, dense scaling) with cleverer architectures that change the fundamental trade-offs. MoE didn't make models better by brute force; it changed the EQUATION — decoupling capacity from compute. The frontier techniques in the coming chapters (long context, agents) similarly push past limits by rethinking what seemed fixed.

And like all frontier techniques, MoE is not a finished story. Better routing, better balancing, better systems, and entirely new sparse architectures are active research. You now understand the foundation; the field is still building on it.

MoE Quick-Reference

Concept	Key idea	Remember
Core idea	More params, same compute	Decouple capacity from cost
MoE layer	Experts + router replace the FFN	Attention stays dense
Top-k routing	Each token to its top-k experts	k typically 1 or 2; learned router
Active vs total	Few active, many total	Compute ∝ active; memory ∝ total
Expert collapse	Router favors a few experts	Rich-get-richer; wastes capacity
Load-balancing loss	Penalize uneven usage	Essential but delicate
Capacity / dropping	Fixed buffers; overflow dropped	Dropless kernels fix it
Systems cost	Store all, communicate tokens	Memory + all-to-all bottleneck
Variants	Fine-grained, shared experts	Modern models combine them
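
To make the load-balancing row concrete, here is a minimal sketch of one common formulation, the Switch-Transformer-style auxiliary loss: the number of experts times the sum over experts of f_i (the fraction of routing slots expert i receives) times P_i (the mean router probability for expert i). Treat the exact form as an assumption: it may differ from the variant used elsewhere in this chapter, and the function name `load_balancing_loss` is made up for this example.

```python
import numpy as np

def load_balancing_loss(router_logits, k=1):
    """Auxiliary loss n_experts * sum_i f_i * P_i (Switch-style sketch)."""
    n_tokens, n_experts = router_logits.shape
    z = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)   # softmax over all experts
    topk = np.argsort(-router_logits, axis=-1)[:, :k]           # experts each token is sent to
    counts = np.bincount(topk.ravel(), minlength=n_experts)
    f = counts / (n_tokens * k)    # fraction of routing slots per expert (not differentiable)
    p = probs.mean(axis=0)         # mean router probability per expert (carries the gradient)
    return n_experts * float(np.sum(f * p)), f

# Balanced routing: 8 tokens, each of 4 experts is the favourite of exactly 2 tokens.
balanced_logits = 3.0 * np.tile(np.eye(4), (2, 1))
balanced_loss, balanced_f = load_balancing_loss(balanced_logits)

# Collapsed routing: every token prefers expert 0.
collapsed_logits = np.zeros((8, 4))
collapsed_logits[:, 0] = 3.0
collapsed_loss, collapsed_f = load_balancing_loss(collapsed_logits)

print(balanced_loss, balanced_f)
print(collapsed_loss, collapsed_f)
```

In this toy setup the balanced case gives a loss of 1 and the collapsed case gives a much larger value, which is the signal the training objective uses to push the router back toward even usage. In training, the loss is multiplied by a small coefficient and added to the language-modeling loss.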

Exercises

Exercises 1–10 are pen-and-paper or derivations; 11–20 require code.

✎

Exercise 1: Pen & Paper

Explain how MoE decouples total parameters from compute per token. Use the hospital-of-specialists analogy in your own words.

✎

Exercise 2: Pen & Paper

Distinguish active and total parameters. For Mixtral 8x7B, why is the total ~47B (not 56B) and the active ~13B?

✎

Exercise 3: Pen & Paper

Describe the anatomy of a sparse MoE layer. What does it replace in the Transformer, and what stays the same?

✎

Exercise 4: Pen & Paper

Explain top-k routing step by step. What does the router compute, and how are the chosen experts' outputs combined?

✎

Exercise 5: Derive

For E experts, top-k routing, and shared attention, write the active and total parameter counts in terms of the per-expert FFN size and shared size.

✎

Exercise 6: Pen & Paper

Explain expert collapse and the rich-get-richer feedback loop that causes it. Why can't it happen in a dense model?

✎

Exercise 7: Pen & Paper

Explain the auxiliary load-balancing loss and identify the routing distribution that minimizes it. Why is the coefficient a delicate trade-off?

✎

Exercise 8: Pen & Paper

Explain expert capacity and token dropping. Why do fixed capacities exist, and how does dropless MoE avoid dropping?

✎

Exercise 9: Pen & Paper

Describe the systems costs of MoE (memory and communication). Why can all-to-all communication eat the compute savings?

✎

Exercise 10: Pen & Paper

Compare MoE and dense models. Give two scenarios where MoE wins and two where dense wins, and justify each.

✎

Exercise 11: Code

Implement a top-k MoE layer from scratch (router + experts). Verify that only k experts run per token and the outputs are correctly weighted.

✎

Exercise 12: Code

Count parameters: build an MoE layer and a dense FFN whose size matches the MoE layer's active parameters. Report total vs active parameters for each and the ratio of their per-token compute.

✎

Exercise 13: Code

Visualize routing: feed a batch through your MoE layer and plot how many tokens each expert receives. Identify any imbalance.

✎

Exercise 14: Code Lab

Reproduce expert collapse: train a small MoE WITHOUT load balancing on a toy task and show the routing collapses to a few experts. Plot expert usage over training.

✎

Exercise 15: Code

Implement the auxiliary load-balancing loss. Add it to the training from Exercise 14 and show it restores balanced expert usage.

✎

Exercise 16: Code

Implement expert capacity and token dropping. Vary the capacity factor and measure the drop rate and its effect on a toy task's loss.

✎

Exercise 17: Code

Implement shared experts: designate one expert that always runs alongside the top-k routed experts. Compare quality and balance to vanilla top-k.

✎

Exercise 18: Code

Implement expert-choice routing (experts pick their top tokens). Show it achieves perfect load balance by construction, and discuss its trade-off.

✎

Exercise 19: Code Lab

Simulate expert parallelism: place experts on different (simulated) devices and measure the all-to-all communication cost as a function of the number of devices.

✎

Exercise 20: Code (Challenge)

Build a small MoE Transformer and train it on a language-modeling task. Implement top-k routing, the load-balancing loss, the router z-loss, and capacity/dropping. Compare it to a dense model with matched ACTIVE parameters on quality, and to one with matched TOTAL parameters on compute. Then deliberately remove the load-balancing loss and show expert collapse, and ablate the number of experts and k to map the capacity/compute/quality trade-off.

Further reading: “Adaptive Mixtures of Local Experts” (Jacobs et al., 1991), the original idea. “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” (Shazeer et al., 2017), MoE for deep learning. “Switch Transformers” (Fedus et al., 2021). “GLaM” (Du et al., 2021). “Mixtral of Experts” (Jiang et al., 2024). “DeepSeekMoE” (Dai et al., 2024) for fine-grained and shared experts. “MegaBlocks” (Gale et al., 2022) for dropless MoE. “Mixture-of-Experts with Expert Choice Routing” (Zhou et al., 2022).

Next → Chapter 33: Long Context & Memory

MoE scaled a model's PARAMETERS efficiently. The next frontier is scaling how much a model can ATTEND to — its context length. Chapter 33 tackles long context and memory: why attention's quadratic cost makes long contexts expensive, the position-encoding tricks (RoPE scaling, ALiBi, YaRN) that let models extend far beyond their training length, efficient attention variants, and the external-memory approaches that let models recall information across vast or unbounded contexts. Having grown the model's capacity with experts, we now grow its window onto the world — and confront the 'lost in the middle' and quadratic-scaling limits that long context must overcome.

Attempt each exercise before checking the worked solutions.
