Distributed Training
Detailed solutions for the exercises in Chapter 18. Try solving them yourself before checking the answers.
Solution
Per parameter: bf16 weight (2 bytes) + bf16 gradient (2) + fp32 master weight copy (4) + fp32 Adam first moment m (4) + fp32 Adam second moment v (4) = 16 bytes. So model + optimizer state needs ≈16 bytes per parameter — the rule of thumb for sizing training memory (a 1B-param model needs ~16 GB just for these, before activations).
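The byte count above can be tallied in a few lines. This is a minimal sketch; the dictionary keys and the function name `state_bytes` are illustrative, and 1 GB is taken as 10⁹ bytes.

```python
# Bytes of persistent training state per parameter (mixed-precision Adam).
BYTES_PER_PARAM = {
    "bf16_weight": 2,
    "bf16_gradient": 2,
    "fp32_master_weight": 4,
    "fp32_adam_m": 4,
    "fp32_adam_v": 4,
}

def state_bytes(num_params: int) -> int:
    """Bytes for weights, gradients, and optimizer state (activations excluded)."""
    return num_params * sum(BYTES_PER_PARAM.values())

print(sum(BYTES_PER_PARAM.values()))        # bytes per parameter
print(state_bytes(1_000_000_000) / 1e9)     # GB for a 1B-parameter model
```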
Solution
Total state = 16 bytes × 13×10⁹ ≈ 208 GB. DDP replicates all of it on every GPU → 208 GB each (infeasible on one device). ZeRO-1 shards the 12 bytes of optimizer state across the 32 GPUs and keeps the 4 bytes of bf16 weights and gradients replicated: ≈ 13e9×(4 + 12/32) ≈ 57 GB/GPU. ZeRO-2 also shards the 2-byte gradients: ≈ 13e9×(2 + 14/32) ≈ 32 GB/GPU. ZeRO-3 shards the parameters too: ≈ 208/32 ≈ 6.5 GB/GPU. Each stage trades more communication for less memory, reaching a linear 1/P reduction at ZeRO-3.
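The same arithmetic as a small calculator, assuming the 2 + 2 + 12 byte split used in this solution. The function name `zero_gb_per_gpu` is hypothetical, stage 0 stands for plain DDP, and 1 GB is taken as 10⁹ bytes.

```python
def zero_gb_per_gpu(num_params: float, num_gpus: int, stage: int) -> float:
    """Per-GPU state in GB for DDP (stage 0) and ZeRO stages 1, 2, 3."""
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= num_gpus    # ZeRO-1: shard optimizer state
    if stage >= 2:
        grads /= num_gpus    # ZeRO-2: also shard gradients
    if stage >= 3:
        weights /= num_gpus  # ZeRO-3: also shard parameters
    return num_params * (weights + grads + optim) / 1e9

for stage in range(4):
    print(stage, round(zero_gb_per_gpu(13e9, 32, stage), 1))
```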
Solution
All-reduce (every GPU ends with the sum) decomposes into reduce-scatter (each GPU ends owning the reduced value of one shard) followed by all-gather (every GPU collects all shards). The ring algorithm passes shards around a ring in P−1 steps per phase, each step moving 1/P of the data, so each GPU sends (and receives) 2×(data size)×(P−1)/P ≈ 2×(data size) in total, nearly independent of P. Communication VOLUME per GPU is therefore essentially constant regardless of how many GPUs participate, which is the property that makes ring all-reduce scale.
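A quick numeric check of the volume formula, as a sketch. The function name is illustrative, and the data size is normalized to 1 so the result reads as a multiple of the buffer size.

```python
def ring_allreduce_sent(data_size: float, num_gpus: int) -> float:
    """Amount each GPU sends in a ring all-reduce (reduce-scatter + all-gather)."""
    steps = 2 * (num_gpus - 1)       # P-1 steps in each of the two phases
    chunk = data_size / num_gpus     # one shard moves per step
    return steps * chunk

for p in (2, 8, 64, 1024):
    print(p, ring_allreduce_sent(1.0, p))  # approaches 2.0 from below as P grows
```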
Solution
DDP replicates the full model + optimizer state on every GPU, so 64 GPUs store 64 identical copies. For a 7B model at ≈16 bytes per parameter that is 64 × ≈112 GB, of which 63 copies are redundant. ZeRO eliminates this by SHARDING parameters, gradients, and optimizer state across the 64 GPUs (each holds 1/64) and gathering them on demand, trading some extra communication for an ~64× reduction in per-GPU state memory.
Solution
With P stages and m micro-batches, the pipeline takes m+P−1 time steps to finish: the last stage waits P−1 steps for its first micro-batch (the fill) and then processes m micro-batches. Each stage is busy for m steps and idle for the other P−1, so the idle fraction is (P−1)/(m+P−1). For P=8, m=8: 7/15 ≈ 47% wasted. For P=8, m=64: 7/71 ≈ 10%. More micro-batches amortize the fill/drain bubble, which is why pipeline parallelism uses many micro-batches.
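The formula is easy to evaluate for other configurations. A minimal sketch, with a hypothetical function name:

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle fraction (P-1)/(m+P-1) of a pipeline with P stages and m micro-batches."""
    return (stages - 1) / (micro_batches + stages - 1)

for m in (8, 64):
    print(m, round(bubble_fraction(8, m), 3))
```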
Solution
The FFN is up-projection then down-projection. Splitting the up-projection COLUMN-wise lets each GPU compute its slice of the (larger) hidden activation independently with no communication; because the activation function is elementwise, it can be applied to each slice locally. Feeding those slices into a ROW-parallel down-projection means each GPU produces a partial sum of the output, and a single all-reduce combines them. This column-then-row layout requires exactly ONE all-reduce per FFN in the forward pass (and one per attention block), minimizing communication in tensor parallelism.
Solution
Tensor parallelism communicates (all-reduce) on EVERY layer, in the critical path of each forward/backward — it demands the very high bandwidth and low latency of intra-node interconnect (NVLink). Pipeline parallelism only passes activations BETWEEN stage boundaries (a few times per step), tolerating the lower bandwidth/higher latency of inter-node links (InfiniBand/Ethernet). So TP stays within a node; PP spans nodes — matching each scheme's communication frequency to the available interconnect.
Solution
A common choice: TP=8 (within each node, exploiting NVLink for the per-layer all-reduces), PP=8 (across 8 nodes, passing activations over InfiniBand at stage boundaries), and DP=8 (8 such TP×PP groups, data-parallel with sharded optimizer state). 8×8×8 = 512. This keeps the bandwidth-hungry TP intra-node, spreads the model across nodes via PP, and uses DP for throughput — the canonical 3D-parallel layout.
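One way to picture this layout is as a mapping from a global rank to (DP, PP, TP) coordinates. The sketch below assumes 8 GPUs per node, consecutive ranks on the same node, and TP as the fastest-varying index. The ordering and the name `coords` are illustrative choices, not a fixed convention.

```python
TP, PP, DP = 8, 8, 8
GPUS_PER_NODE = 8  # assumption: one TP group fills exactly one node

def coords(rank: int) -> tuple:
    """Map a global rank to (dp, pp, tp), with tp varying fastest."""
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp

print(TP * PP * DP)   # total GPUs
print(coords(0))      # first rank
print(coords(511))    # last rank

# Ranks on the same node differ only in their tp coordinate,
# so every TP all-reduce stays inside one node.
same_node = [coords(r) for r in range(GPUS_PER_NODE)]
print(same_node)
```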
Solution
Model FLOPs Utilization = (useful FLOPs achieved per second)/(hardware peak FLOPs per second), where the useful FLOPs for training N parameters on D tokens are ≈ 6ND, i.e. ≈ 6N per token. Reasons for <100%: (1) memory-bandwidth-bound ops (LayerNorm, softmax, attention) that don't saturate the matmul units; (2) communication overhead (all-reduce, pipeline bubbles) stalling compute; (3) kernel launch / Python / non-overlapped overheads; (4) suboptimal kernels, padding, and load imbalance. Real large runs typically land at 30–50% MFU.
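A sketch of the calculation, using the 6N FLOPs-per-token approximation. All numbers in the example call (model size, throughput, GPU count, per-GPU peak) are hypothetical and chosen only to show the arithmetic.

```python
def mfu(num_params: float, tokens_per_second: float, peak_flops_per_second: float) -> float:
    """Model FLOPs Utilization with useful work approximated as 6N FLOPs per token."""
    achieved = 6 * num_params * tokens_per_second
    return achieved / peak_flops_per_second

# Hypothetical run: 7e9 parameters, 2e5 tokens/s, 64 GPUs at an assumed 312e12 FLOP/s each.
example = mfu(7e9, 2e5, 64 * 312e12)
print(round(example, 3))
```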
Solution
ZeRO-3 shards the PARAMETERS themselves, so before each layer's forward (and backward) the full parameters of that layer must be gathered (all-gather) from all GPUs and then discarded — communication ZeRO-1 (which keeps full params resident) avoids. The extra traffic is roughly an all-gather of the parameters each step (and again in backward), adding communication on the order of the model size per step — the price of ZeRO-3's larger memory savings.
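A rough per-step accounting makes the comparison concrete. This sketch counts only parameter and gradient collectives, measures volume in units of the model size, and ignores the (P−1)/P factor. It assumes the usual decomposition in which a gradient all-reduce costs about 2× the model size, and in which ZeRO-3 adds one parameter all-gather in forward and one in backward on top of a gradient reduce-scatter.

```python
def comm_per_step(model_size: float, zero3: bool) -> float:
    """Approximate per-GPU communication volume per training step."""
    if not zero3:
        # Gradient all-reduce = reduce-scatter + all-gather.
        return 2 * model_size
    forward_all_gather = model_size     # gather parameters before each layer's forward
    backward_all_gather = model_size    # gather them again for backward
    grad_reduce_scatter = model_size    # each rank keeps only its gradient shard
    return forward_all_gather + backward_all_gather + grad_reduce_scatter

print(comm_per_step(1.0, zero3=False))
print(comm_per_step(1.0, zero3=True))
```

Under these assumptions the extra traffic is one model-size worth of data per step, about 1.5× the baseline.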
Solution
After the backward pass, all-reduce each gradient across processes and divide by world size to get the averaged gradient, then step. With the same initialization and seed, and the same global batch split into equal per-process shards, the multi-process run matches single-GPU training up to floating-point rounding, demonstrating that data parallelism is just gradient averaging.
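The equivalence can be checked without any GPUs. The toy sketch below uses a scalar linear model with a mean-squared-error loss and simulates two ranks in a single process; the model, the data, and the name `grad_mse` are all illustrative.

```python
def grad_mse(w: float, xs: list, ys: list) -> float:
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5
world_size = 2

# Each simulated rank gets an equal-sized shard of the global batch.
shards = [(xs[r::world_size], ys[r::world_size]) for r in range(world_size)]
local_grads = [grad_mse(w, sx, sy) for sx, sy in shards]

averaged = sum(local_grads) / world_size  # all-reduce (sum), then divide by world size
single = grad_mse(w, xs, ys)              # single-process gradient on the full batch
print(averaged, single)
```

The shards must be equal in size for a plain average of local means to equal the global mean.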
Solution
Broadcast = root sends to all; reduce = all send to root which sums; all-reduce = reduce then broadcast (or reduce-scatter + all-gather); all-gather = each rank shares its shard with all; reduce-scatter = sum then each keeps one shard. Building these from point-to-point primitives and verifying the results clarifies exactly what each collective computes (the basis of Exercise 3).
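The definitions above can be pinned down with a single-process simulation in which a Python list stands in for the ranks (index r holds rank r's value). This is a sketch of what each collective computes, not of how a communication library implements it; all function names are illustrative.

```python
def broadcast(values, root=0):
    """Every rank ends with the root's value."""
    return [values[root] for _ in values]

def reduce(values, root=0):
    """Only the root ends with the sum; other ranks hold nothing (None)."""
    total = sum(values)
    return [total if r == root else None for r in range(len(values))]

def all_reduce(values):
    """Reduce to the root, then broadcast the result."""
    return broadcast(reduce(values))

def all_gather(shards):
    """Every rank ends with the list of all shards."""
    return [list(shards) for _ in shards]

def reduce_scatter(vectors):
    """vectors[r] has one element per rank; rank r keeps the sum of element r."""
    sums = [sum(column) for column in zip(*vectors)]
    return [sums[r] for r in range(len(vectors))]

print(all_reduce([1, 2, 3]))
print(reduce_scatter([[1, 2], [3, 4]]))
print(all_gather(reduce_scatter([[1, 2], [3, 4]])))  # same content as an all-reduce of the vectors
```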
Solution
The ring performs P−1 reduce-scatter steps then P−1 all-gather steps, each passing 1/P of the data. Verifying the final result equals the true sum, and measuring that each GPU sends ~2×(data size), confirms both correctness and the GPU-count-independent volume of Exercise 3.
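A single-process simulation of the two phases, as a sketch. It assumes each rank's buffer is already split into P chunks, and that in each step all ranks send simultaneously to their right-hand neighbour. The chunk indexing used here is one common convention, not the only valid one.

```python
def ring_all_reduce(data):
    """data[r][c] is chunk c (a list) on rank r. Returns (buffers, elements sent per rank)."""
    p = len(data)
    buf = [[list(chunk) for chunk in rank] for rank in data]
    sent = [0] * p

    # Phase 1: reduce-scatter. After P-1 steps rank r owns the full sum of chunk (r+1) % p.
    for step in range(p - 1):
        msgs = [(r, (r - step) % p, list(buf[r][(r - step) % p])) for r in range(p)]
        for r, c, payload in msgs:
            dst = (r + 1) % p
            buf[dst][c] = [a + b for a, b in zip(buf[dst][c], payload)]
            sent[r] += len(payload)

    # Phase 2: all-gather. Fully reduced chunks circulate and overwrite stale copies.
    for step in range(p - 1):
        msgs = [(r, (r + 1 - step) % p, list(buf[r][(r + 1 - step) % p])) for r in range(p)]
        for r, c, payload in msgs:
            dst = (r + 1) % p
            buf[dst][c] = payload
            sent[r] += len(payload)
    return buf, sent

p = 4
data = [[[r + c, r * c] for c in range(p)] for r in range(p)]  # 8 elements per rank
result, sent = ring_all_reduce(data)
expected = [[sum(r + c for r in range(p)), sum(r * c for r in range(p))] for c in range(p)]
print(all(rank == expected for rank in result), sent)
```

With 8 elements per rank and P = 4, each rank sends 2 × 8 × 3/4 = 12 elements, matching the volume formula.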
Solution
FSDP (PyTorch's ZeRO-3) shards parameters across GPUs and gathers per-block on demand. Compared to DDP on the same model, peak memory drops substantially (toward total/P), at the cost of extra all-gather communication — the practical realization of Exercises 2 and 10.
Solution
The column-parallel up-projection splits output features across GPUs; the row-parallel down-projection splits input features and all-reduces the partial sums (Exercise 6). Verifying the 2-GPU tensor-parallel FFN produces the same output as a single-GPU FFN confirms the layout is numerically exact, not an approximation.
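The equivalence can be checked on a toy example in one process. The sketch below uses plain Python lists, two simulated ranks, and ReLU as the activation; the activation choice, the matrix values, and the helper names are all assumptions made for illustration.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

x = [[1.0, -2.0], [0.5, 3.0]]                                  # (batch=2, d_model=2)
w_up = [[1.0, -1.0, 0.5, 2.0], [0.0, 1.0, -0.5, 1.0]]          # (2, hidden=4)
w_down = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0], [-1.0, 0.5]]    # (4, 2)

reference = matmul(relu(matmul(x, w_up)), w_down)  # single-device FFN

# Column-parallel up-projection: each rank holds 2 of the 4 output columns.
w_up_shards = [[row[0:2] for row in w_up], [row[2:4] for row in w_up]]
# Row-parallel down-projection: each rank holds the matching 2 of the 4 input rows.
w_down_shards = [w_down[0:2], w_down[2:4]]

# Each rank computes a partial output with no communication.
partials = [matmul(relu(matmul(x, wu)), wd) for wu, wd in zip(w_up_shards, w_down_shards)]
output = add(partials[0], partials[1])  # stands in for the single all-reduce

print(reference)
print(output)
```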
Solution
Simulating the fill/steady/drain phases and counting idle stage-steps reproduces the bubble fraction (P−1)/(m+P−1). Plotting it against m shows the bubble shrinking as micro-batches increase, matching Exercise 5's formula.
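A minimal version of such a simulation, assuming a simple schedule in which stage s works on micro-batch t − s at time step t and every stage-step costs the same. Function names are illustrative, and plotting is left out.

```python
def simulate_bubble(stages: int, micro_batches: int) -> float:
    """Count idle stage-steps on a (time step, stage) grid and return the idle fraction."""
    total_steps = micro_batches + stages - 1
    idle = 0
    for t in range(total_steps):
        for s in range(stages):
            j = t - s  # micro-batch that stage s would process at step t
            if not (0 <= j < micro_batches):
                idle += 1  # stage s is filling or draining
    return idle / (total_steps * stages)

def formula(stages: int, micro_batches: int) -> float:
    return (stages - 1) / (micro_batches + stages - 1)

for m in (1, 8, 64, 512):
    print(m, round(simulate_bubble(8, m), 4), round(formula(8, m), 4))
```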
Solution
The one-forward-one-backward schedule interleaves micro-batch forwards and backwards to limit in-flight activations. Verifying that the resulting gradients (and updated weights) match non-pipelined training confirms pipelining is an execution-order optimization, not a change to the math.
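The activation-limiting effect can be seen from the per-stage operation order alone. The sketch below assumes the common non-interleaved form of the schedule: stage s runs P − s − 1 warm-up forwards, then alternates one forward and one backward, then drains the remaining backwards. It models only the order of operations on one stage, not cross-stage timing or gradient values.

```python
def one_f_one_b(stage: int, stages: int, micro_batches: int) -> list:
    """Sequence of 'F'/'B' operations executed by one stage (0-indexed)."""
    warmup = min(stages - stage - 1, micro_batches)
    steady = micro_batches - warmup
    ops = ["F"] * warmup
    for _ in range(steady):
        ops += ["F", "B"]
    ops += ["B"] * warmup
    return ops

def peak_in_flight(ops: list) -> int:
    """Largest number of micro-batches whose activations are held at once."""
    live = peak = 0
    for op in ops:
        live += 1 if op == "F" else -1
        peak = max(peak, live)
    return peak

P, m = 4, 8
print([peak_in_flight(one_f_one_b(s, P, m)) for s in range(P)])
print(peak_in_flight(["F"] * m + ["B"] * m))  # all forwards first, then all backwards
```

Under these assumptions the peak depends on the number of stages rather than on the number of micro-batches.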
Solution
Overlapping gradient all-reduce with the ongoing backward pass (bucketed, as DDP does by default) hides much of the communication behind computation, giving a measurable wall-clock speedup versus waiting for the full backward to finish before reducing — demonstrating why overlap is essential for data-parallel scaling.
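A toy timing model shows where the speedup comes from. It assumes the backward pass finishes gradient buckets at a steady rate, each bucket's all-reduce takes a fixed time, and all-reduces run one at a time on a separate channel. The numbers are arbitrary units, not measurements.

```python
def step_time(num_buckets: int, bwd_per_bucket: float, comm_per_bucket: float, overlap: bool) -> float:
    """Time until the last gradient bucket has been reduced."""
    if not overlap:
        # Finish the whole backward pass, then reduce every bucket.
        return num_buckets * (bwd_per_bucket + comm_per_bucket)
    comm_free_at = 0.0
    for i in range(num_buckets):
        ready = (i + 1) * bwd_per_bucket  # backward finishes bucket i at this time
        comm_free_at = max(comm_free_at, ready) + comm_per_bucket
    return comm_free_at

print(step_time(4, 1.0, 0.5, overlap=False), step_time(4, 1.0, 0.5, overlap=True))
print(step_time(4, 1.0, 2.0, overlap=False), step_time(4, 1.0, 2.0, overlap=True))
```

When communication is cheaper than compute, only the last bucket's all-reduce remains exposed; when it is more expensive, the step becomes communication-bound but still beats the non-overlapped case.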
Solution
Each rank writes only its parameter/optimizer shard (in parallel, asynchronously), and a loader reassembles them. Verifying that the reconstructed model resumes training identically confirms sharded checkpointing correctly captures distributed state without gathering everything to one rank — essential at scale.
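A minimal single-process sketch of the save and reassemble round trip, using JSON files in a temporary directory. The file naming scheme and helper names are hypothetical. The loop over ranks stands in for what would be independent, parallel writes in a real run, and only a flat parameter list is saved here (no optimizer state).

```python
import json
import os
import tempfile

def save_shard(directory: str, rank: int, shard: list) -> None:
    """Write one rank's shard to its own file."""
    path = os.path.join(directory, f"shard_{rank:05d}.json")
    with open(path, "w") as f:
        json.dump({"rank": rank, "values": shard}, f)

def load_full(directory: str, world_size: int) -> list:
    """Reassemble the full state by concatenating shards in rank order."""
    full = []
    for rank in range(world_size):
        with open(os.path.join(directory, f"shard_{rank:05d}.json")) as f:
            full.extend(json.load(f)["values"])
    return full

params = [0.1 * i for i in range(10)]
world_size = 4
per_rank = -(-len(params) // world_size)  # ceiling division; last shard may be shorter
shards = [params[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]

with tempfile.TemporaryDirectory() as d:
    for rank, shard in enumerate(shards):
        save_shard(d, rank, shard)
    restored = load_full(d, world_size)

print(restored == params)
```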
Solution
Combining tensor parallelism (within a pair) and data parallelism (across pairs) over 4 processes, and verifying numerical equivalence to single-GPU training, demonstrates composed parallelism. Reporting memory and throughput per configuration shows TP reduces per-GPU memory while DP increases throughput — the building blocks of real large-scale training.
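The process grouping for this 2×2 setup can be written down explicitly. The sketch assumes adjacent ranks form a tensor-parallel pair, which is a common but not mandatory choice; variable names are illustrative.

```python
WORLD_SIZE, TP_SIZE = 4, 2
DP_SIZE = WORLD_SIZE // TP_SIZE

# Ranks in a TP group hold different slices of the same model replica.
tp_groups = [list(range(i * TP_SIZE, (i + 1) * TP_SIZE)) for i in range(DP_SIZE)]
# Ranks in a DP group hold the same slice and average gradients with each other.
dp_groups = [list(range(j, WORLD_SIZE, TP_SIZE)) for j in range(TP_SIZE)]

print(tp_groups)
print(dp_groups)
```

Each rank belongs to exactly one group of each kind: activations are all-reduced within its TP group and gradients within its DP group.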