Solutions Appendix

Chapter 32

Mixture of Experts

20 Solutions

Detailed solutions for the exercises in Chapter 32. Try solving them yourself before checking the answers.

Exercise 1 (Pen & Paper)

How does MoE decouple total parameters from compute per token? Use the hospital analogy.

Solution

In a dense model every parameter processes every token, so more parameters means proportionally more compute. MoE routes each token to only k of E experts, so total CAPACITY grows with E (all experts exist) while per-token COMPUTE grows only with k (only k run). Hospital analogy: a dense model is one overworked generalist who must see every patient; an MoE model is a hospital of specialists where a receptionist (router) sends each patient to just the relevant few. The hospital's total knowledge is huge, but each visit consults only a couple of doctors — capacity scales, per-token cost doesn't.

Exercise 2 (Pen & Paper)

Active vs total parameters; for Mixtral 8x7B why is total ~47B (not 56B) and active ~13B?

Solution

Total parameters = the full model (all experts) = capacity; active parameters = those used per token = compute cost. Mixtral '8x7B' is not 8×7B=56B because attention layers and embeddings are SHARED across experts — only the FFNs are replicated into 8 experts. So total ≈ 47B (shared parts + 8 expert FFNs) and active ≈ 13B (shared parts + the 2 expert FFNs that run per token under top-2 routing). Mixtral thus runs at ~13B cost with ~47B capacity.
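The arithmetic can be checked with a short count. This is a sketch, not an official breakdown: the hyperparameters below (hidden size 4096, FFN width 14336, 32 layers, 32 query heads with 8 key/value heads, vocabulary 32000, untied input and output embeddings) are assumptions taken from the publicly released Mixtral 8x7B configuration, not from this chapter, and the router is counted with the shared part.

```python
# Hyperparameters assumed from Mixtral 8x7B's public config (not stated in this appendix).
D_MODEL, D_FF = 4096, 14336
N_LAYERS, VOCAB = 32, 32000
N_HEADS, N_KV_HEADS = 32, 8
N_EXPERTS, TOP_K = 8, 2


def mixtral_like_counts():
    """Return (shared, total, active) parameter counts for the assumed config."""
    head_dim = D_MODEL // N_HEADS
    kv_dim = N_KV_HEADS * head_dim                              # grouped-query attention
    attention = 2 * D_MODEL * D_MODEL + 2 * D_MODEL * kv_dim    # Q and O, plus K and V
    router = D_MODEL * N_EXPERTS                                # one score per expert
    norms = 2 * D_MODEL                                         # two norm weights per block
    expert_ffn = 3 * D_MODEL * D_FF                             # gate, up, down projections
    embeddings = 2 * VOCAB * D_MODEL + D_MODEL                  # input table, output head, final norm
    shared = N_LAYERS * (attention + router + norms) + embeddings
    total = shared + N_LAYERS * N_EXPERTS * expert_ffn          # all experts stored
    active = shared + N_LAYERS * TOP_K * expert_ffn             # only top-k experts run
    return shared, total, active


shared, total, active = mixtral_like_counts()
print(f"shared : {shared / 1e9:.2f}B")
print(f"total  : {total / 1e9:.2f}B")
print(f"active : {active / 1e9:.2f}B")
print(f"naive 8 x 7B: {8 * 7}B")
```

Under these assumptions the total comes out near 46.7B and the active count near 12.9B, consistent with the ~47B and ~13B figures above. The gap from 56B is the shared part, which is counted once rather than eight times.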

Exercise 3 (Pen & Paper)

Anatomy of a sparse MoE layer; what does it replace, what stays the same?

Solution

A sparse MoE layer replaces the single FEED-FORWARD network of a Transformer block with (a) several independent FFNs ('experts') and (b) a small router/gate that picks which experts each token uses. Attention, residuals, layer norm, embeddings, and the training objective all stay the same — MoE is a drop-in replacement for the FFN. Only the chosen k experts run per token, which is what makes it 'sparse'.

Exercise 4 (Pen & Paper)

Explain top-k routing step by step; what does the router compute, how are outputs combined?

Solution

For each token: (1) the router (a small linear layer) produces a score for every expert; (2) a softmax turns scores into weights; (3) the top-k experts by weight are selected; (4) only those k experts process the token; (5) their outputs are combined as a weighted sum using the router's (renormalized) weights. So the router both CHOOSES the experts and WEIGHTS their contributions; the rest of the experts are skipped entirely.
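Steps (1) to (3) and the renormalization in step (5) can be sketched for a single token as follows. The function names and the example logits are hypothetical, and the router's linear layer is replaced by a hand-written list of scores.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Weights are the softmax probabilities of the chosen experts,
    renormalized so that they sum to 1.
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


# Hypothetical router scores for one token over four experts.
gate = top_k_gate([2.0, 0.5, 1.0, -1.0], k=2)
for expert, weight in gate.items():
    print(f"expert {expert}: weight {weight:.4f}")
```

Experts 0 and 2 have the highest scores here, so only they would run, and their two weights sum to 1.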

Exercise 5 (Derive)

For E experts, top-k routing, shared attention: write active and total parameter counts.

Solution

Let S = shared parameters (attention, embeddings, norms) and f = parameters per expert FFN. Then:

Total = S + E·f

Active = S + k·f

Capacity (total) grows linearly with the number of experts E, while compute (active) grows only with k. The ratio of total to active FFN parameters is E/k — for Mixtral, 8/2 = 4× more FFN capacity than compute. The shared S is paid in both.

Exercise 6 (Pen & Paper)

Explain expert collapse and the rich-get-richer loop; why can't it happen in a dense model?

Solution

Expert collapse: the router learns to send most tokens to a few experts, leaving the rest under-trained and unused. The loop is self-reinforcing — early random favoritism gives some experts more tokens, so they train faster and improve, so the router favors them more, so they get even more tokens, while neglected experts stay bad and ignored. A dense model has no router and no separate experts — every parameter is always used — so there is nothing to collapse onto; collapse is unique to the routed, discrete-choice structure of MoE.

Exercise 7 (Pen & Paper)

Explain the auxiliary load-balancing loss; what minimizes it; why is the coefficient delicate?

Solution

The auxiliary loss penalizes uneven expert usage — typically proportional to Σ_i f_i·P_i (fraction of tokens routed to expert i times its mean router probability) — and is minimized when load is spread EVENLY across experts. Added to the main loss, it pushes the router to use all experts, breaking the rich-get-richer cycle. The coefficient is delicate: too small risks collapse (wasted capacity); too large forces balance so hard that the router can't specialize meaningfully (tokens spread evenly but not sensibly), hurting quality. It must balance utilization against useful specialization.

Exercise 8 (Pen & Paper)

Explain expert capacity and token dropping; why fixed capacities; how does dropless MoE avoid it?

Solution

For hardware efficiency, each expert is given a fixed CAPACITY (max tokens per batch), set by a capacity factor, so tensor shapes are regular and predictable. When more tokens route to an expert than its capacity, the overflow is DROPPED (skips that expert, often passing through via the residual). Fixed capacities exist because GPUs are fastest with fixed shapes. Dropless MoE (e.g. MegaBlocks) uses block-sparse matrix operations that process exactly the tokens each expert received — whatever the count — without padding or dropping, removing the artifact entirely through better kernels.

Exercise 9 (Pen & Paper)

Describe MoE's systems costs (memory and communication); why can all-to-all eat the savings?

Solution

Memory: although only k experts run per token, ALL experts must be STORED (the router might send any token to any of them), so the memory footprint is the TOTAL parameter count, not the active. Communication: when experts are spread across devices (expert parallelism), a token must be SENT to wherever its expert lives, processed, and the result sent back — an all-to-all exchange every MoE layer. This all-to-all communication can be so expensive (especially on slow interconnects) that it offsets the compute savings, which is why MoE shines most on large, well-connected clusters.

Exercise 10 (Pen & Paper)

Compare MoE and dense; two scenarios where MoE wins, two where dense wins.

Solution

MoE wins: (1) large-scale pretraining/serving of a general model where compute-per-token is the binding constraint — it buys capacity cheaply; (2) when fast interconnect and ample memory are available to absorb its systems costs. Dense wins: (1) heavy fine-tuning/customization — MoE is notoriously harder to fine-tune and prone to overfitting; (2) small-scale or limited/single-device hardware where MoE's memory and communication overheads dominate and its routing instabilities aren't worth it. The choice hinges on which resource (compute vs memory/engineering) is scarce for you.

Exercise 11 (Code)

Implement a top-k MoE layer; verify only k experts run and outputs are correctly weighted.

Solution

Compute router softmax, select the top-k experts per token, run only those, and combine with renormalized weights (Exercise 4). Verifying that exactly k experts execute per token and that the weighted combination matches a manual computation confirms the sparse routing and gating are correct.
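One possible shape for this exercise is sketched below in dependency-free Python. It is a forward pass only with randomly initialized weights; the class names `Expert` and `TopKMoE`, the ReLU expert, and the tiny dimensions are illustrative assumptions, and a practical version would use a tensor library. Each expert counts its own calls so the sparsity can be checked directly.

```python
import math
import random


def matvec(weights, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


class Expert:
    """Two-layer ReLU FFN that counts how often it is invoked."""

    def __init__(self, d_model, d_ff, rng):
        self.w_in = rand_matrix(d_ff, d_model, rng)
        self.w_out = rand_matrix(d_model, d_ff, rng)
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        hidden = [max(0.0, h) for h in matvec(self.w_in, x)]
        return matvec(self.w_out, hidden)


class TopKMoE:
    def __init__(self, d_model, d_ff, n_experts, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.router = rand_matrix(n_experts, d_model, rng)
        self.experts = [Expert(d_model, d_ff, rng) for _ in range(n_experts)]

    def gate(self, x):
        """Softmax over router scores, keep top-k, renormalize their weights."""
        logits = matvec(self.router, x)
        peak = max(logits)
        exps = [math.exp(l - peak) for l in logits]
        probs = [e / sum(exps) for e in exps]
        chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[: self.k]
        norm = sum(probs[i] for i in chosen)
        return [(i, probs[i] / norm) for i in chosen]

    def forward(self, x):
        out = [0.0] * len(x)
        for i, weight in self.gate(x):          # only the chosen experts are called
            y = self.experts[i](x)
            out = [o + weight * yi for o, yi in zip(out, y)]
        return out


moe = TopKMoE(d_model=4, d_ff=8, n_experts=6, k=2)
token = [0.3, -1.2, 0.7, 0.1]
output = moe.forward(token)
calls_after_forward = [e.calls for e in moe.experts]
print("calls per expert:", calls_after_forward)

# Manual recomputation of the weighted sum for comparison.
manual = [0.0] * len(token)
for i, weight in moe.gate(token):
    y = moe.experts[i](token)
    manual = [m + weight * yi for m, yi in zip(manual, y)]
print("matches manual:", all(abs(a - b) < 1e-12 for a, b in zip(output, manual)))
```

The call counters should show exactly k experts with one call each and the rest at zero, and the manual weighted sum should agree with `forward`.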

Exercise 12 (Code)

Count parameters: MoE layer vs matched-active dense FFN; report total/active and compute ratio.

Solution

Building an MoE layer and a dense FFN of the same ACTIVE size and reporting total vs active parameters (Exercise 5) shows the MoE has E/k times more total FFN parameters at the same per-token compute — quantifying the capacity-for-memory trade that defines MoE.
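A minimal counting sketch, assuming bias-free two-matrix FFNs and hypothetical sizes (d_model = 512, expert width 2048, E = 8, k = 2). The matched-active dense FFN is taken to be one with hidden width k times the expert width, which is one reasonable way to match per-token FFN compute.

```python
def ffn_params(d_model, d_ff):
    """Bias-free FFN: an up projection and a down projection."""
    return 2 * d_model * d_ff


def moe_counts(d_model, d_ff, n_experts, k):
    """Return (total, active, router) parameter counts for one MoE layer."""
    router = d_model * n_experts
    per_expert = ffn_params(d_model, d_ff)
    total = router + n_experts * per_expert
    active = router + k * per_expert
    return total, active, router


d_model, d_ff, E, k = 512, 2048, 8, 2
moe_total, moe_active, router = moe_counts(d_model, d_ff, E, k)
dense = ffn_params(d_model, k * d_ff)     # dense FFN with the same active FFN size

print(f"dense FFN params      : {dense:,}")
print(f"MoE active params     : {moe_active:,} (includes {router:,} router)")
print(f"MoE total params      : {moe_total:,}")
print(f"total/active (experts): {(moe_total - router) / (moe_active - router):.1f}x")
```

Ignoring the small router, the MoE layer stores E/k = 4 times the parameters of the dense FFN while activating the same number per token.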

Exercise 13 (Code)

Visualize routing: plot tokens-per-expert for a batch; identify imbalance.

Solution

Feeding a batch through the router and histogramming how many tokens each expert receives reveals whether routing is balanced or skewed. A freshly initialized router is usually only mildly uneven, while a router trained without balancing shows a clearly lopsided distribution, the visual signature of (incipient) expert collapse (Exercise 6), which motivates the load-balancing loss.
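A text-only sketch of the histogram, assuming top-1 routing and synthetic router logits. The Gaussian logits and the `bias` vector that favors expert 0 are made up for illustration; with real data the logits would come from the router, and a plotting library could replace the text bars.

```python
import random
from collections import Counter


def tokens_per_expert(batch_logits, n_experts):
    """Top-1 routing: count how many tokens pick each expert."""
    picks = Counter(max(range(n_experts), key=lambda i: row[i]) for row in batch_logits)
    return [picks.get(i, 0) for i in range(n_experts)]


def text_histogram(counts, width=40):
    peak = max(counts) or 1
    return "\n".join(
        f"expert {i}: {'#' * round(width * c / peak)} {c}" for i, c in enumerate(counts)
    )


def imbalance(counts):
    """Busiest expert's load divided by the mean load; 1.0 is perfectly even."""
    return max(counts) / (sum(counts) / len(counts))


rng = random.Random(0)
n_experts, n_tokens = 8, 512
# Synthetic stand-in for router outputs (assumption, not real model data).
even_logits = [[rng.gauss(0.0, 1.0) for _ in range(n_experts)] for _ in range(n_tokens)]
bias = [3.0] + [0.0] * (n_experts - 1)          # hypothetical favoritism toward expert 0
skewed_logits = [[l + b for l, b in zip(row, bias)] for row in even_logits]

even_counts = tokens_per_expert(even_logits, n_experts)
skewed_counts = tokens_per_expert(skewed_logits, n_experts)

print("unbiased router\n" + text_histogram(even_counts))
print(f"imbalance: {imbalance(even_counts):.2f}\n")
print("biased router\n" + text_histogram(skewed_counts))
print(f"imbalance: {imbalance(skewed_counts):.2f}")
```

The max-over-mean ratio gives a single number to track alongside the picture: values near 1 indicate even usage, and values approaching the number of experts indicate that one expert is taking nearly everything.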

Exercise 14 (Code Lab)

Reproduce expert collapse: train a small MoE without load balancing; plot expert usage over training.

Solution

Training without the auxiliary loss shows the tokens-per-expert distribution collapsing over time onto a few experts (Exercise 6) — a direct reproduction of the rich-get-richer dynamic, with most experts going dark.

Exercise 15 (Code)

Implement the auxiliary load-balancing loss; show it restores balance.

Solution

Adding the Σ f_i·P_i penalty (Exercise 7) to the training from Exercise 14 keeps the tokens-per-expert distribution roughly uniform throughout training — demonstrating that the load-balancing loss prevents collapse and restores full capacity utilization.
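The penalty itself can be sketched as below. This covers only the loss computation, not the training comparison; it assumes top-1 routing and scales the sum by the number of experts E (a common convention, so that perfectly even routing scores 1.0). The two probability tables are hand-made examples.

```python
def load_balancing_loss(router_probs):
    """E * sum_i f_i * P_i for top-1 routing.

    f_i: fraction of tokens whose top-1 expert is i.
    P_i: mean router probability assigned to expert i.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    f = [0.0] * n_experts
    p = [0.0] * n_experts
    for probs in router_probs:
        winner = max(range(n_experts), key=lambda i: probs[i])
        f[winner] += 1.0 / n_tokens
        for i in range(n_experts):
            p[i] += probs[i] / n_tokens
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))


# Hand-made router outputs: four tokens, four experts.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
collapsed = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]   # every token prefers expert 0

balanced_loss = load_balancing_loss(balanced)
collapsed_loss = load_balancing_loss(collapsed)
print(f"balanced : {balanced_loss:.2f}")
print(f"collapsed: {collapsed_loss:.2f}")
```

In training, this value would be multiplied by the coefficient discussed in Exercise 7 and added to the main loss. In a differentiable implementation the gradient reaches the router through P_i, since the hard counts f_i are not differentiable.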

Exercise 16 (Code)

Implement expert capacity and token dropping; vary the capacity factor; measure drop rate and loss effect.

Solution

Imposing a fixed per-expert capacity and dropping overflow (Exercise 8), then sweeping the capacity factor, shows low factors causing high drop rates (and worse loss) while higher factors reduce drops at the cost of wasted padding — quantifying the capacity/efficiency trade-off.
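The drop-rate half of this sweep can be sketched without a model. The snippet assumes top-1 routing, first-come-first-served slot filling, and capacity defined as the capacity factor times the perfectly even share (rounded up); the skewed assignment list is hypothetical, and the effect on loss is not measured here.

```python
import math


def expert_capacity(n_tokens, n_experts, capacity_factor):
    """Slots per expert: capacity factor times the perfectly even share."""
    return math.ceil(capacity_factor * n_tokens / n_experts)


def dispatch(assignments, n_experts, capacity):
    """Fill experts in token order; return (per-expert load, dropped token ids)."""
    load = [0] * n_experts
    dropped = []
    for token_id, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
        else:
            dropped.append(token_id)      # overflow: this token skips the expert
    return load, dropped


# Hypothetical skewed top-1 routing of 16 tokens over 4 experts.
assignments = [0] * 8 + [1] * 4 + [2] * 3 + [3] * 1
n_experts = 4

drop_rates, unused_slots = [], []
for cf in (0.5, 1.0, 1.5, 2.0):
    cap = expert_capacity(len(assignments), n_experts, cf)
    load, dropped = dispatch(assignments, n_experts, cap)
    drop_rates.append(len(dropped) / len(assignments))
    unused_slots.append(n_experts * cap - sum(load))     # padded, wasted slots
    print(f"factor {cf}: capacity {cap}, drop rate {drop_rates[-1]:.3f}, "
          f"unused slots {unused_slots[-1]}")
```

For this routing pattern the drop rate falls from about 56% to zero as the factor rises from 0.5 to 2.0, while the number of padded slots grows from 1 to 16, which is the trade-off described above in miniature.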

Exercise 17 (Code)

Implement shared experts; compare quality and balance to vanilla top-k.

Solution

Designating one always-on shared expert (alongside top-k routed ones) lets it absorb common patterns, so the routed experts specialize on the rest. Comparing to vanilla top-k typically shows improved quality and more balanced routing — the DeepSeek-MoE refinement in miniature.

Exercise 18 (Code)

Implement expert-choice routing; show perfect load balance by construction; discuss the trade-off.

Solution

In expert-choice routing each expert picks its top tokens (up to capacity) rather than tokens picking experts, so every expert gets exactly its capacity — perfect balance by construction, sidestepping collapse. The trade-off: some tokens may be chosen by many experts and others by none, so token coverage is uneven — a different problem traded for the balance guarantee.
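A small sketch of the selection step, with a hand-made affinity table (six tokens, three experts, capacity 2) chosen so that both sides of the trade-off are visible. The function names are illustrative.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the affinity of token t for expert e.

    Each expert independently keeps its `capacity` highest-scoring tokens.
    Returns one list of token ids per expert.
    """
    n_tokens = len(scores)
    n_experts = len(scores[0])
    picks = []
    for e in range(n_experts):
        ranked = sorted(range(n_tokens), key=lambda t: scores[t][e], reverse=True)
        picks.append(ranked[:capacity])
    return picks


def coverage(picks, n_tokens):
    """How many experts selected each token."""
    counts = [0] * n_tokens
    for chosen in picks:
        for t in chosen:
            counts[t] += 1
    return counts


# Hypothetical affinities: token 0 is attractive to everyone, tokens 4 and 5 to no one.
scores = [
    [0.90, 0.80, 0.70],
    [0.60, 0.10, 0.20],
    [0.20, 0.70, 0.10],
    [0.10, 0.20, 0.60],
    [0.30, 0.30, 0.30],
    [0.05, 0.05, 0.05],
]

picks = expert_choice(scores, capacity=2)
loads = [len(p) for p in picks]
token_coverage = coverage(picks, len(scores))
print("tokens per expert :", loads)
print("experts per token :", token_coverage)
```

Every expert ends up with exactly two tokens, but token 0 is processed by all three experts while tokens 4 and 5 are processed by none and would pass through on the residual path only.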

Exercise 19 (Code Lab)

Simulate expert parallelism; measure all-to-all communication cost vs number of devices.

Solution

Placing experts on different simulated devices and routing tokens to them shows the all-to-all communication volume growing as experts spread across more devices (Exercise 9) — demonstrating why the interconnect, not just compute, governs whether MoE's savings are realized.
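An idealized version of this simulation is sketched below. It assumes perfectly uniform routing (one token for every combination of home slot and expert), round-robin expert placement, and it counts transferred values rather than time, so it says nothing about latency or bandwidth. All sizes are hypothetical.

```python
def simulate_all_to_all(n_devices, n_experts=8, slots=64, d_model=1024):
    """Count tokens that must leave their home device to reach their expert.

    Returns (remote_tokens, total_tokens, values_moved).
    """
    expert_device = [e % n_devices for e in range(n_experts)]   # round-robin placement
    remote = total = 0
    for slot in range(slots):
        home = slot % n_devices                  # device holding this token
        for expert in range(n_experts):          # uniform routing: one token per pair
            total += 1
            if expert_device[expert] != home:
                remote += 1
    # Each remote token is dispatched and its result returned: two transfers.
    values_moved = 2 * remote * d_model
    return remote, total, values_moved


device_counts = (1, 2, 4, 8)
remote_tokens = []
for n in device_counts:
    remote, total, moved = simulate_all_to_all(n)
    remote_tokens.append(remote)
    print(f"{n} device(s): {remote}/{total} tokens remote "
          f"({remote / total:.1%}), {moved:,} values moved per MoE layer")
```

Under uniform routing the remote fraction is 1 - 1/D for D devices, so most tokens already cross the interconnect at D = 4, and the exchange repeats at every MoE layer.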

Exercise 20 (Code, Challenge)

Build a small MoE Transformer (routing, load-balance loss, z-loss, capacity); compare to dense at matched active and matched total; remove balancing to show collapse; ablate E and k.

Solution

A full small MoE should match a dense model of its ACTIVE size on per-token compute while approaching a dense model of its TOTAL size on quality, which is the core value of MoE. Removing the load-balancing loss reproduces collapse (quality drops, experts go unused), and ablating E and k maps the capacity/compute/quality surface: more experts add capacity (and systems cost), while higher k adds compute. The exercise integrates every concept in the chapter.
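Of the components listed in the exercise, the router z-loss is the only one not covered by an earlier solution. A minimal sketch follows, assuming the commonly used form: the mean over tokens of the squared log-sum-exp of the router logits, which discourages very large logits. The example logits are made up, and the coefficient that would scale this term in the total loss is left out.

```python
import math


def router_z_loss(batch_logits):
    """Mean over tokens of (log sum_j exp(logit_j)) ** 2."""
    total = 0.0
    for logits in batch_logits:
        peak = max(logits)                                   # stabilize the exponentials
        lse = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += lse ** 2
    return total / len(batch_logits)


# Hypothetical router logits for two tokens, and the same logits scaled by 10.
small = [[1.0, -1.0, 0.5, 0.0], [0.2, 0.1, -0.3, 0.4]]
large = [[10.0 * l for l in row] for row in small]

small_loss = router_z_loss(small)
large_loss = router_z_loss(large)
print(f"z-loss, small logits: {small_loss:.3f}")
print(f"z-loss, large logits: {large_loss:.3f}")
```

Scaling the logits up leaves the top-k choice unchanged but raises the penalty sharply, which is the pressure the z-loss applies to keep router outputs in a numerically comfortable range.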
