Solutions Appendix

Chapter 18

Distributed Training

20 Solutions

Detailed solutions for the exercises in Chapter 18. Try solving them yourself before checking the answers.

Exercise 1 (Pen & Paper)

Derive the 16-bytes-per-parameter rule for mixed-precision AdamW.

Solution

Per parameter: bf16 weight (2 bytes) + bf16 gradient (2) + fp32 master weight copy (4) + fp32 Adam first moment m (4) + fp32 Adam second moment v (4) = 16 bytes. So model + optimizer state needs ≈16 bytes per parameter — the rule of thumb for sizing training memory (a 1B-param model needs ~16 GB just for these, before activations).
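
The tally above can be written out as a small calculation. This is a minimal sketch; the dictionary keys and the helper name `training_state_gb` are illustrative, and 1 GB is taken as 10^9 bytes.

```python
# Bytes stored per parameter under the mixed-precision AdamW recipe above.
BYTES_PER_PARAM = {
    "bf16_weight": 2,
    "bf16_gradient": 2,
    "fp32_master_weight": 4,
    "fp32_adam_m": 4,
    "fp32_adam_v": 4,
}


def training_state_gb(num_params: float) -> float:
    """Model + optimizer state in GB (1 GB = 1e9 bytes), excluding activations."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


print(sum(BYTES_PER_PARAM.values()))  # 16
print(training_state_gb(1e9))         # 16.0
```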

Exercise 2 (Pen & Paper)

13B model: params+optimizer memory under DDP, ZeRO-1/2/3 with P=32.

Solution

Total state = 16 bytes × 13×10⁹ ≈ 208 GB. DDP replicates all of it on every GPU → 208 GB each (infeasible on one device). ZeRO-1 shards the 12 bytes of optimizer state across 32: ≈ 13e9×(4 + 12/32) ≈ 57 GB/GPU. ZeRO-2 also shards gradients: ≈ 13e9×(2 + 14/32) ≈ 32 GB/GPU. ZeRO-3 shards parameters too: ≈ 208/32 ≈ 6.5 GB/GPU. Sharding progressively trades communication for a near-linear memory reduction in P.
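
The four cases can be checked with a short helper. This is a sketch under the same 2 + 2 + 12 byte split used above; the function name `state_gb_per_gpu` and the convention that stage 0 means plain DDP are assumptions made for illustration.

```python
def state_gb_per_gpu(num_params: float, num_gpus: int, stage: int) -> float:
    """Per-GPU model + optimizer state in GB. stage 0 = DDP, 1/2/3 = ZeRO stage."""
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= num_gpus    # ZeRO-1 shards the optimizer state
    if stage >= 2:
        grads /= num_gpus    # ZeRO-2 also shards the gradients
    if stage >= 3:
        weights /= num_gpus  # ZeRO-3 also shards the parameters
    return num_params * (weights + grads + optim) / 1e9


results = {stage: state_gb_per_gpu(13e9, 32, stage) for stage in range(4)}
for stage, gb in results.items():
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```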

Exercise 3 (Pen & Paper)

Why does all-reduce = reduce-scatter + all-gather, and why is ring all-reduce cost GPU-count-independent?

Solution

All-reduce (every GPU ends with the sum) decomposes into reduce-scatter (each GPU ends owning the reduced value of one shard) followed by all-gather (every GPU collects all shards). The ring algorithm passes shards of size (data size)/P around a ring for P−1 steps in each phase, so each GPU sends/receives 2×(data size)×(P−1)/P ≈ 2×data in total, essentially independent of P. Communication VOLUME per GPU is therefore nearly constant regardless of how many GPUs participate, which is the property that makes ring all-reduce scale. Only the number of steps, 2(P−1), grows with P, so the latency term is not GPU-count-independent.

Exercise 4 (Pen & Paper)

Why is plain data parallelism memory-wasteful? Quantify for a 7B model on 64 GPUs; how does ZeRO fix it?

Solution

DDP replicates the full model + optimizer state on every GPU, so 64 GPUs store 64 identical copies. For a 7B model at 16 bytes per parameter that is ≈112 GB per GPU, or 64 × 112 GB ≈ 7.2 TB in aggregate, of which 63 copies are redundant. ZeRO eliminates this by SHARDING parameters, gradients, and optimizer state across the 64 GPUs (each holds 1/64, ≈1.75 GB with all three sharded) and gathering them on demand, trading some extra communication for an ~64× reduction in per-GPU state memory.

Exercise 5 (Derive)

Derive the pipeline bubble fraction (P−1)/(m+P−1); compute for P=8,m=8 and P=8,m=64.

Solution

With P stages and m micro-batches, one pass through the pipeline takes m+P−1 time steps: the last stage waits P−1 steps for the first micro-batch to arrive and then processes m micro-batches. Each stage is busy for m of those steps and idle for the other P−1, so the idle fraction is (P−1)/(m+P−1). For P=8,m=8: 7/15 ≈ 47% wasted. For P=8,m=64: 7/71 ≈ 10%. More micro-batches amortize the fill/drain bubble, which is why pipeline parallelism uses many micro-batches.

Exercise 6 (Pen & Paper)

Why is the Megatron FFN column-parallel then row-parallel? Show one all-reduce per FFN.

Solution

The FFN is up-projection, elementwise nonlinearity, then down-projection. Splitting the up-projection COLUMN-wise lets each GPU compute its slice of the (larger) hidden activation independently with no communication, and because the nonlinearity is elementwise it can be applied to each slice locally. Feeding those slices into a ROW-parallel down-projection means each GPU produces a partial sum of the output, and a single all-reduce combines them. This column-then-row layout requires exactly ONE all-reduce per FFN in the forward pass (and one per attention block), with a matching one in the backward pass, minimizing communication in tensor parallelism.

Exercise 7 (Pen & Paper)

Why is tensor parallelism confined to one node while pipeline parallelism can span nodes?

Solution

Tensor parallelism communicates (all-reduce) on EVERY layer, in the critical path of each forward/backward — it demands the very high bandwidth and low latency of intra-node interconnect (NVLink). Pipeline parallelism only passes activations BETWEEN stage boundaries (a few times per step), tolerating the lower bandwidth/higher latency of inter-node links (InfiniBand/Ethernet). So TP stays within a node; PP spans nodes — matching each scheme's communication frequency to the available interconnect.

Exercise 8 (Pen & Paper)

512 GPUs (64 nodes × 8): propose TP, PP, DP degrees for a 70B model and justify.

Solution

A common choice: TP=8 (within each node, exploiting NVLink for the per-layer all-reduces), PP=8 (across 8 nodes, passing activations over InfiniBand at stage boundaries), and DP=8 (8 such TP×PP groups, data-parallel with sharded optimizer state). 8×8×8 = 512. This keeps the bandwidth-hungry TP intra-node, spreads the model across nodes via PP, and uses DP for throughput — the canonical 3D-parallel layout.
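
One way to see why this layout keeps tensor parallelism inside a node is to map each global rank to its (DP, PP, TP) coordinates. The sketch below assumes ranks are numbered consecutively within a node and that TP is the fastest-varying dimension; both are conventions chosen for illustration, and the helper names `coords` and `node_of` are hypothetical.

```python
TP, PP, DP = 8, 8, 8
GPUS_PER_NODE = 8


def coords(rank: int) -> tuple[int, int, int]:
    """Map a global rank to (dp, pp, tp), with TP varying fastest."""
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp


def node_of(rank: int) -> int:
    """Node index, assuming ranks are numbered consecutively within a node."""
    return rank // GPUS_PER_NODE


# Group ranks that share (dp, pp): these are the tensor-parallel groups.
tp_groups: dict[tuple[int, int], list[int]] = {}
for rank in range(TP * PP * DP):
    dp, pp, _ = coords(rank)
    tp_groups.setdefault((dp, pp), []).append(rank)

# Every TP group sits on exactly one node, so its all-reduces stay on NVLink.
assert all(len({node_of(r) for r in group}) == 1 for group in tp_groups.values())
print(len(tp_groups), "TP groups of", TP, "GPUs each")
```

With this ordering, consecutive pipeline stages of one model replica land on consecutive nodes, so only stage-boundary activations cross the inter-node network.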

Exercise 9 (Pen & Paper)

Define MFU. Four reasons a real run hits only 40% despite the 6ND count.

Solution

Model FLOPs Utilization (MFU) = (useful model FLOPs achieved per second)/(hardware peak FLOPs per second), where the useful work of a run is 6ND for N parameters and D tokens, i.e. 6N FLOPs per token. Reasons for <100%: (1) memory-bandwidth-bound ops (LayerNorm, softmax, attention) that don't saturate the matmul units; (2) communication overhead (all-reduce, pipeline bubbles) stalling compute; (3) kernel launch / Python / non-overlapped overheads; (4) suboptimal kernels, padding, and load imbalance. Real large runs typically land at 30–50% MFU.
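
As a worked example of the definition, MFU can be computed from measured throughput. The sketch below is illustrative only: the throughput of 150,000 tokens/s is a made-up number, and 312e12 FLOPs/s is used as an example per-GPU peak (the dense bf16 figure published for the NVIDIA A100); substitute the values for your own hardware and run.

```python
def mfu(num_params: float, tokens_per_second: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization using the 6N-FLOPs-per-token approximation."""
    achieved_flops_per_second = 6 * num_params * tokens_per_second
    peak_flops_per_second = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical run: 70B parameters on 512 GPUs at 150k tokens/s overall.
example = mfu(num_params=70e9, tokens_per_second=150_000,
              num_gpus=512, peak_flops_per_gpu=312e12)
print(f"MFU = {example:.1%}")
```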

Exercise 10 (Pen & Paper)

Why does ZeRO-3 add communication ZeRO-1 doesn't? Estimate the extra cost.

Solution

ZeRO-3 shards the PARAMETERS themselves, so before each layer's forward (and again before its backward) the full parameters of that layer must be gathered (all-gather) from all GPUs and then discarded. ZeRO-1 keeps full parameters resident and avoids this. The extra traffic is roughly one more pass over the parameters per step: if gradient synchronization already moves about 2Ψ per GPU for Ψ parameters (a reduce-scatter plus an all-gather), ZeRO-3 moves about 3Ψ (a parameter all-gather in forward, another in backward, and a gradient reduce-scatter), i.e. about 1.5× the volume. That is the price of ZeRO-3's larger memory savings.

Exercise 11 (Code)

Data-parallel training via torch.distributed: manually all-reduce gradients; verify vs single-GPU.

Solution

After the backward pass, all-reduce each gradient across processes and divide by world size to get the averaged gradient, then step. With the same data and seed, the multi-process run matches single-GPU training to numerical precision — demonstrating that data parallelism is just gradient averaging.
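
The equivalence can be illustrated without any GPUs. The following is a minimal pure-Python sketch, not a torch.distributed program: `all_reduce_sum` is a stand-in for what `torch.distributed.all_reduce` with a SUM op would do across processes, the one-parameter model is chosen for brevity, and exact rational arithmetic is used so the comparison is exact. It assumes every rank receives an equal-sized shard.

```python
from fractions import Fraction


def grad(w, batch):
    """d/dw of the mean squared error for the one-parameter model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def all_reduce_sum(values):
    """Stand-in for an all-reduce with SUM: every rank receives the total."""
    total = sum(values)
    return [total] * len(values)


data = [(Fraction(x), Fraction(3 * x + 1)) for x in range(1, 9)]
w = Fraction(1, 2)
world_size = 4

# Each simulated rank sees an equal-sized, disjoint shard of the batch.
shards = [data[rank::world_size] for rank in range(world_size)]
local_grads = [grad(w, shard) for shard in shards]

# Sum across ranks, then divide by world size to get the averaged gradient.
averaged = [g / world_size for g in all_reduce_sum(local_grads)]
single_gpu = grad(w, data)

assert all(g == single_gpu for g in averaged)
print("averaged gradient matches single-process gradient:", single_gpu)
```

With unequal shard sizes the plain average would no longer equal the full-batch gradient, which is why the sketch keeps the shards the same size.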

Exercise 12 (Code)

Implement the five collectives from send/recv for 4 simulated ranks.

Solution

Broadcast = root sends to all; reduce = all send to root which sums; all-reduce = reduce then broadcast (or reduce-scatter + all-gather); all-gather = each rank shares its shard with all; reduce-scatter = sum then each keeps one shard. Building these from point-to-point primitives and verifying the results clarifies exactly what each collective computes (the basis of Exercise 3).
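
One possible way to build these is on top of an in-memory mailbox that plays the role of send/recv. This is a sequential simulation for 4 ranks, written for clarity rather than efficiency; the `mailbox`, `send`, and `recv` helpers are hypothetical stand-ins for real point-to-point calls.

```python
from collections import defaultdict, deque

WORLD = 4
mailbox = defaultdict(deque)  # (src, dst) -> queue of in-flight messages


def send(src, dst, msg):
    mailbox[(src, dst)].append(msg)


def recv(src, dst):
    return mailbox[(src, dst)].popleft()


def broadcast(values, root=0):
    """Root sends its value to every other rank."""
    for dst in range(WORLD):
        if dst != root:
            send(root, dst, values[root])
    return [values[root] if r == root else recv(root, r) for r in range(WORLD)]


def reduce(values, root=0):
    """Every rank sends to root, which sums; other ranks end with None."""
    for src in range(WORLD):
        if src != root:
            send(src, root, values[src])
    total = values[root]
    for src in range(WORLD):
        if src != root:
            total = total + recv(src, root)
    return [total if r == root else None for r in range(WORLD)]


def all_reduce(values):
    """Reduce to rank 0, then broadcast the result."""
    return broadcast(reduce(values, root=0), root=0)


def all_gather(values):
    """Every rank sends its value to every other rank."""
    for src in range(WORLD):
        for dst in range(WORLD):
            if src != dst:
                send(src, dst, values[src])
    return [[values[r] if src == r else recv(src, r) for src in range(WORLD)]
            for r in range(WORLD)]


def reduce_scatter(vectors):
    """vectors[r] has WORLD entries; rank r ends with the sum of entry r."""
    for src in range(WORLD):
        for dst in range(WORLD):
            if src != dst:
                send(src, dst, vectors[src][dst])
    return [vectors[r][r] + sum(recv(src, r) for src in range(WORLD) if src != r)
            for r in range(WORLD)]


vals = [1, 2, 3, 4]
print(broadcast(vals), reduce(vals), all_reduce(vals))
print(all_gather(vals))
print(reduce_scatter([[10 ** r * (i + 1) for i in range(WORLD)] for r in range(WORLD)]))
```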

Exercise 13 (Code)

Implement ring all-reduce from scratch; verify the sum; confirm ~2× data communication volume.

Solution

The ring performs P−1 reduce-scatter steps then P−1 all-gather steps, each passing 1/P of the data. Verifying the final result equals the true sum, and measuring that each GPU sends ~2×(data size), confirms both correctness and the GPU-count-independent volume of Exercise 3.
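
A single-process simulation makes both checks concrete. The sketch below assumes the buffer length is divisible by the number of ranks and uses one common chunk-indexing convention (rank r starts by sending chunk r); the function name and the `sent` counter are illustrative.

```python
def ring_all_reduce(buffers):
    """Simulate ring all-reduce; return (final buffers, elements sent per rank)."""
    P = len(buffers)
    n = len(buffers[0])
    assert n % P == 0, "sketch assumes the data splits evenly into P chunks"
    c = n // P
    bufs = [list(b) for b in buffers]
    sent = [0] * P

    def chunk(i):
        return slice(i * c, (i + 1) * c)

    for phase in ("reduce_scatter", "all_gather"):
        offset = 0 if phase == "reduce_scatter" else 1
        for step in range(P - 1):
            # All ranks send simultaneously, so snapshot the messages first.
            msgs = []
            for r in range(P):
                idx = (r + offset - step) % P
                msgs.append((idx, bufs[r][chunk(idx)]))
                sent[r] += c
            for r, (idx, payload) in enumerate(msgs):
                dst = (r + 1) % P
                if phase == "reduce_scatter":
                    # Accumulate the incoming chunk into the local copy.
                    bufs[dst][chunk(idx)] = [a + b for a, b in
                                             zip(bufs[dst][chunk(idx)], payload)]
                else:
                    # Overwrite with the already fully reduced chunk.
                    bufs[dst][chunk(idx)] = payload
    return bufs, sent


inputs = [[10 * r + i for i in range(8)] for r in range(4)]
outputs, sent = ring_all_reduce(inputs)
expected = [sum(col) for col in zip(*inputs)]
assert all(out == expected for out in outputs)
print("elements sent per rank:", sent, "of", len(expected))
```

Each rank sends 2(P−1)/P times the buffer size, which is 1.5× for P=4 and approaches 2× as P grows.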

Exercise 14 (Code Lab)

Wrap a small Transformer in FSDP with block-level auto-wrap; compare peak memory to DDP.

Solution

FSDP (PyTorch's ZeRO-3) shards parameters across GPUs and gathers per-block on demand. Compared to DDP on the same model, peak memory drops substantially (toward total/P), at the cost of extra all-gather communication — the practical realization of Exercises 2 and 10.

Exercise 15 (Code)

Implement column- and row-parallel linear layers; build a TP FFN across 2 GPUs; verify vs single-GPU.

Solution

The column-parallel up-projection splits output features across GPUs; the row-parallel down-projection splits input features and all-reduces the partial sums (Exercise 6). Verifying the 2-GPU tensor-parallel FFN produces the same output as a single-GPU FFN confirms the layout is numerically exact, not an approximation.
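
The exactness claim can be checked with a tiny integer example before moving to real GPUs. This is a pure-Python sketch with two simulated ranks; it assumes a ReLU nonlinearity and no biases, and the final elementwise sum stands in for the all-reduce.

```python
def matmul(A, B):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def relu(M):
    return [[max(0, v) for v in row] for row in M]


X = [[1, -2], [3, 4]]                     # (batch=2, d_model=2)
W1 = [[1, -1, 2, 0], [0, 3, -1, 1]]       # up-projection   (d_model=2, d_ff=4)
W2 = [[1, 0], [-1, 2], [0, 1], [2, -1]]   # down-projection (d_ff=4, d_model=2)

# Single-device reference.
reference = matmul(relu(matmul(X, W1)), W2)

# Column-parallel: rank k holds columns [2k, 2k+2) of W1.
W1_shards = [[row[0:2] for row in W1], [row[2:4] for row in W1]]
# Row-parallel: rank k holds the matching rows [2k, 2k+2) of W2.
W2_shards = [W2[0:2], W2[2:4]]

# Each rank computes its hidden slice and a partial output with no communication.
partials = [matmul(relu(matmul(X, W1_shards[k])), W2_shards[k]) for k in range(2)]

# The single all-reduce: sum the partial outputs across ranks.
output = [[a + b for a, b in zip(row0, row1)] for row0, row1 in zip(*partials)]

assert output == reference
print(output)
```

The split is exact because the nonlinearity acts elementwise on the hidden activation, so applying it per slice gives the same values as applying it to the full tensor.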

Exercise 16 (Code)

Simulate the pipeline bubble: count idle vs busy stage-steps; plot bubble fraction vs m; confirm the formula.

Solution

Simulating the fill/steady/drain phases and counting idle stage-steps reproduces the bubble fraction (P−1)/(m+P−1). Plotting it against m shows the bubble shrinking as micro-batches increase, matching Exercise 5's formula.
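
A minimal simulation is sketched below. It assumes every stage takes one time unit per micro-batch and that a stage can start a micro-batch only after the previous stage has finished it and the stage itself has finished the preceding micro-batch; the function name is illustrative, and a printed table stands in for the plot.

```python
from fractions import Fraction


def simulate_bubble(P: int, m: int) -> Fraction:
    """Return the idle fraction of stage-steps for P stages and m micro-batches."""
    finish = [[0] * m for _ in range(P)]  # finish[s][j]: time stage s completes j
    for s in range(P):
        for j in range(m):
            input_ready = finish[s - 1][j] if s > 0 else 0  # upstream stage done
            stage_free = finish[s][j - 1] if j > 0 else 0   # previous micro-batch done
            finish[s][j] = max(input_ready, stage_free) + 1
    makespan = finish[-1][-1]
    idle = P * (makespan - m)  # each stage is busy for exactly m steps
    return Fraction(idle, P * makespan)


for m in (1, 2, 4, 8, 16, 32, 64):
    simulated = simulate_bubble(8, m)
    assert simulated == Fraction(8 - 1, m + 8 - 1)  # matches (P-1)/(m+P-1)
    print(f"m={m:3d}  bubble={float(simulated):.3f}")
```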

Exercise 17 (Code Lab)

Implement a 1F1B pipeline schedule for a 2-stage model; verify gradients match non-pipelined.

Solution

The one-forward-one-backward schedule interleaves micro-batch forwards and backwards to limit in-flight activations. Verifying that the resulting gradients (and updated weights) match non-pipelined training confirms pipelining is an execution-order optimization, not a change to the math.
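
The ordering itself can be sketched separately from the training code. The generator below follows the commonly described non-interleaved 1F1B pattern (a warm-up of forwards, a steady state alternating one forward and one backward, then a cool-down of backwards); it only produces the per-stage operation order and counts in-flight activations, and does not run a model or check gradients. Function names are illustrative.

```python
def one_f_one_b(stage: int, num_stages: int, num_microbatches: int):
    """Per-stage op order as a list of ("F" | "B", micro-batch index)."""
    warmup = min(num_stages - stage - 1, num_microbatches)
    steady = num_microbatches - warmup
    ops = [("F", i) for i in range(warmup)]           # fill the pipeline
    for i in range(steady):
        ops.append(("F", warmup + i))                 # one forward...
        ops.append(("B", i))                          # ...then one backward
    ops += [("B", steady + i) for i in range(warmup)] # drain remaining backwards
    return ops


def peak_in_flight(ops) -> int:
    """Largest number of micro-batches whose activations are held at once."""
    live = peak = 0
    for kind, _ in ops:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak


stage0 = one_f_one_b(stage=0, num_stages=2, num_microbatches=4)
stage1 = one_f_one_b(stage=1, num_stages=2, num_microbatches=4)
print(stage0)
print(stage1)
print(peak_in_flight(stage0), peak_in_flight(stage1))
```

For comparison, running all four forwards before any backward would hold four micro-batches of activations on every stage, whereas here stage 0 holds at most two and stage 1 at most one.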

Exercise 18 (Code)

Measure communication/computation overlap in DDP: overlapped vs post-backward all-reduce.

Solution

Overlapping gradient all-reduce with the ongoing backward pass (bucketed, as DDP does by default) hides much of the communication behind computation, giving a measurable wall-clock speedup versus waiting for the full backward to finish before reducing — demonstrating why overlap is essential for data-parallel scaling.

Exercise 19 (Code)

Implement sharded asynchronous checkpointing; verify reconstruction and resume.

Solution

Each rank writes only its parameter/optimizer shard (in parallel, asynchronously), and a loader reassembles them. Verifying that the reconstructed model resumes training identically confirms sharded checkpointing correctly captures distributed state without gathering everything to one rank — essential at scale.
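
The write-in-parallel, reassemble-on-load pattern can be sketched with threads and JSON files. This is a toy illustration only: the state is a flat list of floats, the file naming scheme and the helpers `save_shard` and `load_full` are hypothetical, and a real trainer would use its framework's checkpoint format.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def shard_path(directory, rank, world_size):
    return os.path.join(directory, f"shard_{rank:05d}_of_{world_size:05d}.json")


def save_shard(directory, rank, world_size, shard):
    """Write one rank's shard; rename at the end so readers never see a partial file."""
    path = shard_path(directory, rank, world_size)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"rank": rank, "world_size": world_size, "state": shard}, f)
    os.replace(tmp, path)
    return path


def load_full(directory, world_size):
    """Reassemble the full state by concatenating shards in rank order."""
    flat = []
    for rank in range(world_size):
        with open(shard_path(directory, rank, world_size)) as f:
            payload = json.load(f)
        assert payload["rank"] == rank and payload["world_size"] == world_size
        flat.extend(payload["state"])
    return flat


full_state = [float(i) for i in range(10)]
world_size = 4
per_rank = -(-len(full_state) // world_size)  # ceiling division
shards = [full_state[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]

with tempfile.TemporaryDirectory() as directory:
    with ThreadPoolExecutor(max_workers=world_size) as pool:
        futures = [pool.submit(save_shard, directory, r, world_size, shards[r])
                   for r in range(world_size)]
        # Training could continue here while the writes are in flight.
        for future in futures:
            future.result()  # surface any write error before trusting the checkpoint
    restored = load_full(directory, world_size)

assert restored == full_state
```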

Exercise 20 (Code, Challenge)

Build a mini 2D-parallel trainer (TP=2, DP=2); verify equivalence to single-GPU; report memory/throughput.

Solution

Combining tensor parallelism (within a pair) and data parallelism (across pairs) over 4 processes, and verifying numerical equivalence to single-GPU training, demonstrates composed parallelism. Reporting memory and throughput per configuration shows TP reduces per-GPU memory while DP increases throughput — the building blocks of real large-scale training.
