Mixture of Experts
Detailed solutions for the exercises in Chapter 32. Try solving them yourself before checking the answers.
Solution
In a dense model every parameter processes every token, so more parameters means proportionally more compute. MoE routes each token to only k of E experts, so total CAPACITY grows with E (all experts exist) while per-token COMPUTE grows only with k (only k run). Hospital analogy: a dense model is one overworked generalist who must see every patient; an MoE model is a hospital of specialists where a receptionist (router) sends each patient to just the relevant few. The hospital's total knowledge is huge, but each visit consults only a couple of doctors — capacity scales, per-token cost doesn't.
Solution
Total parameters = the full model (all experts) = capacity; active parameters = those used per token = compute cost. Mixtral '8x7B' is not 8×7B=56B because attention layers and embeddings are SHARED across experts — only the FFNs are replicated into 8 experts. So total ≈ 47B (shared parts + 8 expert FFNs) and active ≈ 13B (shared parts + the 2 expert FFNs that run per token under top-2 routing). Mixtral thus runs at ~13B cost with ~47B capacity.
Solution
A sparse MoE layer replaces the single FEED-FORWARD network of a Transformer block with (a) several independent FFNs ('experts') and (b) a small router/gate that picks which experts each token uses. Attention, residuals, layer norm, embeddings, and the training objective all stay the same — MoE is a drop-in replacement for the FFN. Only the chosen k experts run per token, which is what makes it 'sparse'.
Solution
For each token: (1) the router (a small linear layer) produces a score for every expert; (2) a softmax turns scores into weights; (3) the top-k experts by weight are selected; (4) only those k experts process the token; (5) their outputs are combined as a weighted sum using the router's (renormalized) weights. So the router both CHOOSES the experts and WEIGHTS their contributions; the rest of the experts are skipped entirely.
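The five steps above can be sketched for a single token in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the bias-free linear router, the toy "experts" that merely scale their input, and the names `moe_forward`, `make_expert`, and `calls` are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_weights, experts, k):
    """Route one token vector x through the top-k of len(experts) experts.

    router_weights: one weight vector per expert (assumed bias-free linear router).
    experts: list of callables, each mapping a vector to a vector.
    Returns (output vector, {expert index: renormalized gate weight}).
    """
    # (1) one router score per expert: dot product of x with that expert's row
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in router_weights]
    # (2) softmax turns scores into weights
    probs = softmax(scores)
    # (3) indices of the k largest weights
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # renormalize the selected weights so they sum to 1
    norm = sum(probs[i] for i in top)
    gates = {i: probs[i] / norm for i in top}
    # (4) run only the selected experts; (5) weighted sum of their outputs
    out = [0.0] * len(x)
    for i, g in gates.items():
        y = experts[i](x)
        out = [o + g * y_j for o, y_j in zip(out, y)]
    return out, gates

calls = []  # records which experts actually ran

def make_expert(i):
    """Toy expert i scales its input by (i + 1); purely illustrative."""
    def expert(v):
        calls.append(i)
        return [(i + 1) * v_j for v_j in v]
    return expert

experts = [make_expert(i) for i in range(4)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
x = [2.0, 1.0]
out, gates = moe_forward(x, router_weights, experts, k=2)
print(sorted(calls), gates, out)
```

Here the router scores are 2, 1, -2, -1, so only experts 0 and 1 run; the `calls` list makes the skipped experts directly observable.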
Solution
Let S = shared parameters (attention, embeddings, norms) and f = parameters per expert FFN (summed over all MoE layers). Then:
Total = S + E·f
Active = S + k·f
Capacity (total) grows linearly with the number of experts E, while compute (active) grows only with k. The ratio of total to active FFN parameters is (E·f)/(k·f) = E/k; for Mixtral, 8/2 = 4× more FFN capacity than compute. The shared S is paid in both.
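A short calculation makes the scaling concrete, assuming total = S + E·f and active = S + k·f. The values of S and f below are made-up round numbers in arbitrary units, chosen only for illustration; they are not Mixtral's real parameter breakdown.

```python
def moe_param_counts(shared, per_expert, num_experts, top_k):
    """Return (total, active) parameter counts for a sparse MoE model."""
    total = shared + num_experts * per_expert   # every expert must exist
    active = shared + top_k * per_expert        # only top_k experts run per token
    return total, active

# Hypothetical sizes (arbitrary units), with Mixtral-style E = 8, k = 2.
S, f, E, k = 2, 5, 8, 2
total, active = moe_param_counts(S, f, E, k)
ffn_ratio = (E * f) / (k * f)                   # reduces to E / k

print(total, active, ffn_ratio)                 # 42 12 4.0
```

Doubling E in this sketch raises the total while leaving the active count unchanged, which is the capacity-without-compute effect described above.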
Solution
Expert collapse: the router learns to send most tokens to a few experts, leaving the rest under-trained and unused. The loop is self-reinforcing — early random favoritism gives some experts more tokens, so they train faster and improve, so the router favors them more, so they get even more tokens, while neglected experts stay bad and ignored. A dense model has no router and no separate experts — every parameter is always used — so there is nothing to collapse onto; collapse is unique to the routed, discrete-choice structure of MoE.
Solution
The auxiliary loss penalizes uneven expert usage — typically proportional to Σ_i f_i·P_i (fraction of tokens routed to expert i times its mean router probability) — and is minimized when load is spread EVENLY across experts. Added to the main loss, it pushes the router to use all experts, breaking the rich-get-richer cycle. The coefficient is delicate: too small risks collapse (wasted capacity); too large forces balance so hard that the router can't specialize meaningfully (tokens spread evenly but not sensibly), hurting quality. It must balance utilization against useful specialization.
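The penalty can be computed directly from a batch of router outputs. The sketch below assumes top-1 routing and multiplies Σ_i f_i·P_i by the number of experts so that a perfectly balanced batch scores 1.0; that scaling is one common convention and is an assumption here, as is the function name `load_balance_loss`.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Auxiliary load-balancing term for top-1 routing.

    router_probs: per-token list of softmax probabilities over experts.
    assignments: per-token index of the expert the token was sent to.
    """
    n = len(assignments)
    # f_i: fraction of tokens dispatched to expert i
    f = [sum(1 for a in assignments if a == i) / n for i in range(num_experts)]
    # P_i: mean router probability given to expert i
    P = [sum(p[i] for p in router_probs) / n for i in range(num_experts)]
    # Assumed scaling by num_experts: the balanced value becomes 1.0
    return num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

E = 4
balanced = load_balance_loss([[0.25] * E for _ in range(8)],
                             [t % E for t in range(8)], E)
collapsed = load_balance_loss([[1.0, 0.0, 0.0, 0.0] for _ in range(8)],
                              [0] * 8, E)
print(balanced, collapsed)   # 1.0 4.0
```

In training this value would be multiplied by the small coefficient discussed above and added to the main loss; the fully collapsed batch scores E times worse than the balanced one.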
Solution
For hardware efficiency, each expert is given a fixed CAPACITY (the maximum number of tokens it will process per batch), set by a capacity factor, so tensor shapes are regular and predictable; GPUs are fastest with fixed shapes. When more tokens route to an expert than its capacity allows, the overflow tokens are DROPPED: they skip that expert and are typically carried forward only by the residual connection. Dropless MoE (e.g. MegaBlocks) uses block-sparse matrix operations that process exactly the tokens each expert received, whatever the count, without padding or dropping, removing the artifact entirely through better kernels.
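Token dropping under a fixed capacity can be shown with a small dispatch routine. The capacity formula used here, ceil(capacity_factor × tokens / experts) for top-1 routing, is an assumed common choice; real systems may round or scale it differently, and the first-come-first-served drop order is also an assumption of this sketch.

```python
import math

def dispatch_with_capacity(assignments, num_experts, capacity_factor):
    """Top-1 dispatch with a fixed per-expert capacity.

    Returns (capacity, kept token indices per expert, dropped token indices).
    Tokens are admitted in arrival order; later overflow tokens are dropped.
    """
    capacity = math.ceil(capacity_factor * len(assignments) / num_experts)
    kept = [[] for _ in range(num_experts)]
    dropped = []
    for token, expert in enumerate(assignments):
        if len(kept[expert]) < capacity:
            kept[expert].append(token)
        else:
            dropped.append(token)   # would fall back to the residual path
    return capacity, kept, dropped

assignments = [0, 0, 0, 0, 0, 1, 1, 2]   # 8 tokens routed among 4 experts
capacity, kept, dropped = dispatch_with_capacity(assignments, 4, 1.0)
print(capacity, kept, dropped)
# 2 [[0, 1], [5, 6], [7], []] [2, 3, 4]
```

Expert 0 is oversubscribed and loses three tokens, while experts 2 and 3 have unused slots that a fixed-shape kernel would fill with padding.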
Solution
Memory: although only k experts run per token, ALL experts must be STORED (the router might send any token to any of them), so the memory footprint is the TOTAL parameter count, not the active. Communication: when experts are spread across devices (expert parallelism), a token must be SENT to wherever its expert lives, processed, and the result sent back — an all-to-all exchange every MoE layer. This all-to-all communication can be so expensive (especially on slow interconnects) that it offsets the compute savings, which is why MoE shines most on large, well-connected clusters.
Solution
MoE wins: (1) large-scale pretraining/serving of a general model where compute-per-token is the binding constraint — it buys capacity cheaply; (2) when fast interconnect and ample memory are available to absorb its systems costs. Dense wins: (1) heavy fine-tuning/customization — MoE is notoriously harder to fine-tune and prone to overfitting; (2) small-scale or limited/single-device hardware where MoE's memory and communication overheads dominate and its routing instabilities aren't worth it. The choice hinges on which resource (compute vs memory/engineering) is scarce for you.
Solution
Compute router softmax, select the top-k experts per token, run only those, and combine with renormalized weights (Exercise 4). Verifying that exactly k experts execute per token and that the weighted combination matches a manual computation confirms the sparse routing and gating are correct.
Solution
Building an MoE layer and a dense FFN of the same ACTIVE size and reporting total vs active parameters (Exercise 5) shows the MoE has E/k times more total FFN parameters at the same per-token compute — quantifying the capacity-for-memory trade that defines MoE.
Solution
Feeding a batch and histogramming how many tokens each expert receives reveals whether routing is balanced or skewed. An untrained or unbalanced router often shows a lopsided distribution, the visual signature of (incipient) expert collapse (Exercise 6), which motivates the load-balancing loss.
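A text histogram is enough to see the skew. In the sketch below the router scores are synthetic: experts 0 and 1 receive a constant bonus to imitate a router that has started to favor them. The bias values, batch size, and seed are all illustrative assumptions. Top-k is taken on raw scores, which selects the same experts as top-k on the softmax because softmax preserves ordering.

```python
import random
from collections import Counter

def tokens_per_expert(router_scores, k):
    """Count how many tokens select each expert under top-k routing.

    router_scores: per-token list of scores, one per expert.
    """
    counts = Counter()
    for scores in router_scores:
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        counts.update(top)
    return counts

random.seed(0)
E, k, T = 8, 2, 1000
# Hypothetical skewed router: experts 0 and 1 get a constant score bonus.
bias = [2.0, 2.0] + [0.0] * (E - 2)
scores = [[random.gauss(0, 1) + bias[i] for i in range(E)] for _ in range(T)]
counts = tokens_per_expert(scores, k)

for i in range(E):
    print(f"expert {i}: {'#' * (counts[i] // 20)} {counts[i]}")
```

Setting every entry of `bias` to zero gives a roughly flat histogram for comparison.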
Solution
Training without the auxiliary loss shows the tokens-per-expert distribution collapsing over time onto a few experts (Exercise 6) — a direct reproduction of the rich-get-richer dynamic, with most experts going dark.
Solution
Adding the Σ f_i·P_i penalty (Exercise 7) to the training from Exercise 14 keeps the tokens-per-expert distribution roughly uniform throughout training — demonstrating that the load-balancing loss prevents collapse and restores full capacity utilization.
Solution
Imposing a fixed per-expert capacity and dropping overflow (Exercise 8), then sweeping the capacity factor, shows low factors causing high drop rates (and worse loss) while higher factors reduce drops at the cost of wasted padding — quantifying the capacity/efficiency trade-off.
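The sweep can be run on per-expert token counts alone. This sketch assumes top-1 routing (each token counted once) and the capacity formula ceil(factor × tokens / experts); the example load `[5, 2, 1, 0]` is invented to represent a skewed batch.

```python
import math

def sweep_capacity(counts, factors):
    """For each capacity factor, report (factor, capacity, drop rate, padding fraction).

    counts: number of tokens routed to each expert (top-1 routing assumed).
    drop rate: dropped tokens / all tokens.
    padding fraction: empty slots / all allocated slots.
    """
    tokens, num_experts = sum(counts), len(counts)
    rows = []
    for cf in factors:
        cap = math.ceil(cf * tokens / num_experts)
        dropped = sum(max(0, c - cap) for c in counts)
        padding = sum(max(0, cap - c) for c in counts)
        rows.append((cf, cap, dropped / tokens, padding / (num_experts * cap)))
    return rows

rows = sweep_capacity([5, 2, 1, 0], [0.5, 1.0, 2.0, 2.5])
for cf, cap, drop_rate, pad_frac in rows:
    print(f"factor={cf:<4} capacity={cap} drop={drop_rate:.3f} padding={pad_frac:.3f}")
```

For this load the drop rate falls from 0.625 to 0 as the factor rises, while the padding fraction climbs from 0.25 to 0.6, which is the trade-off the exercise asks you to quantify.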
Solution
Designating one always-on shared expert (alongside top-k routed ones) lets it absorb common patterns, so the routed experts specialize on the rest. Comparing to vanilla top-k typically shows improved quality and more balanced routing — the DeepSeek-MoE refinement in miniature.
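Structurally, the shared expert is just an extra term that bypasses the router. The sketch below uses scalar toy experts and adds the shared output unweighted to the gated top-k sum; both choices are simplifying assumptions for illustration and are not claimed to match DeepSeek-MoE's exact formulation.

```python
import math

def shared_plus_routed(x, shared_expert, routed_experts, router_scores, k):
    """Return (shared_expert(x) + gated sum of top-k routed experts, chosen indices).

    The shared expert always runs; only k routed experts run.
    """
    m = max(router_scores)
    exps = [math.exp(s - m) for s in router_scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)           # renormalize over the chosen k
    routed = sum(probs[i] / norm * routed_experts[i](x) for i in top)
    return shared_expert(x) + routed, top

# Scalar toy experts, purely illustrative.
shared = lambda v: 10.0 * v
routed = [lambda v: v, lambda v: 2.0 * v, lambda v: 3.0 * v]

y, chosen = shared_plus_routed(1.0, shared, routed, [0.0, 0.0, -5.0], k=2)
print(y, chosen)   # 11.5 [0, 1]
```

The output is 10.0 from the shared expert plus 0.5·1 + 0.5·2 from the two routed experts with equal gates.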
Solution
In expert-choice routing each expert picks its top tokens (up to capacity) rather than tokens picking experts, so every expert gets exactly its capacity — perfect balance by construction, sidestepping collapse. The trade-off: some tokens may be chosen by many experts and others by none, so token coverage is uneven — a different problem traded for the balance guarantee.
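Both the balance guarantee and the uneven coverage show up in a tiny example. The score matrix below is invented so that one token is attractive to every expert and another to none; `expert_choice` is a hypothetical helper name.

```python
def expert_choice(scores, capacity):
    """Expert-choice routing: each expert selects its top `capacity` tokens.

    scores[t][e]: router affinity of token t for expert e.
    Returns (tokens chosen per expert, number of experts that chose each token).
    """
    num_tokens, num_experts = len(scores), len(scores[0])
    chosen = []
    coverage = [0] * num_tokens
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        picks = ranked[:capacity]
        chosen.append(picks)
        for t in picks:
            coverage[t] += 1
    return chosen, coverage

scores = [
    [0.9, 0.8, 0.7],   # token 0: attractive to every expert
    [0.6, 0.1, 0.2],
    [0.1, 0.5, 0.6],
    [0.2, 0.2, 0.1],   # token 3: attractive to none
]
chosen, coverage = expert_choice(scores, capacity=2)
print(chosen)     # [[0, 1], [0, 2], [0, 2]]
print(coverage)   # [3, 1, 2, 0]
```

Every expert processes exactly two tokens, yet token 0 is handled three times and token 3 not at all.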
Solution
Placing experts on different simulated devices and routing tokens to them shows the all-to-all communication volume growing as experts spread across more devices (Exercise 9) — demonstrating why the interconnect, not just compute, governs whether MoE's savings are realized.
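A counting simulation is enough to see the trend without any real hardware. The sketch assumes experts are split evenly across devices and that tokens on every device are routed uniformly over all experts (a perfectly balanced router); under those assumptions it enumerates every (home device, expert) pair once.

```python
def cross_device_fraction(num_devices, num_experts):
    """Fraction of token-to-expert dispatches that cross a device boundary.

    Assumes num_experts is divisible by num_devices and routing is uniform.
    """
    experts_per_device = num_experts // num_devices
    remote = total = 0
    for home in range(num_devices):          # device where the token lives
        for expert in range(num_experts):    # expert the token is routed to
            owner = expert // experts_per_device
            total += 1
            if owner != home:
                remote += 1
    return remote / total

fractions = {d: cross_device_fraction(d, num_experts=8) for d in (1, 2, 4, 8)}
print(fractions)   # {1: 0.0, 2: 0.5, 4: 0.75, 8: 0.875}
```

Under these assumptions the remote fraction is 1 − 1/D for D devices, and each remote dispatch costs two transfers (the token out and the result back), so communication volume rises toward two transfers per routed token as experts spread out.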
Solution
A full small MoE shows it matching a dense model of its ACTIVE size on compute while approaching a dense model of its TOTAL size on quality, which is the core MoE value. Removing the load-balancing loss reproduces collapse (quality drops, experts go unused), and ablating E and k maps the capacity/compute/quality surface: more experts add capacity (and systems cost), higher k adds compute. This project is the integrated demonstration of every concept in the chapter.