Solutions Appendix
Chapter 16

Scaling Laws

18 Solutions

Detailed solutions for the exercises in Chapter 16. Try solving them yourself before checking the answers.

Exercise 1 (Pen & Paper)
Derive the 6ND rule; account for 2 FLOPs/param forward and 4 backward.

Solution

Each parameter participates in a multiply-accumulate (2 FLOPs: one multiply, one add) per token in the forward pass. The backward pass costs about twice the forward (computing gradients w.r.t. both inputs and weights), ≈ 4 FLOPs/param/token. Summing gives ~6 FLOPs per parameter per token, so total training compute ≈ 6·N·D for N parameters and D tokens — the workhorse estimate for training cost.

Exercise 2 (Pen & Paper)
13B params, 2T tokens: training FLOPs? A100-days at 156 TFLOP/s effective?

Solution

FLOPs = 6·N·D = 6·(13×10⁹)·(2×10¹²) = 1.56×10²³. At 156 TFLOP/s = 1.56×10¹⁴ FLOP/s effective: time = 1.56×10²³ / 1.56×10¹⁴ = 10⁹ seconds. Dividing by 86,400 s/day ≈ 11,600 A100-days — roughly 32 A100-years of compute, illustrating why frontier pretraining needs thousands of GPUs in parallel.
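The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the helper names `training_flops` and `gpu_days` are made up for illustration, and the 156 TFLOP/s throughput is the figure given in the exercise, not a measured value.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the 6*N*D rule."""
    return 6.0 * n_params * n_tokens


def gpu_days(total_flops: float, flops_per_second: float) -> float:
    """Days needed by one device sustaining `flops_per_second`."""
    seconds = total_flops / flops_per_second
    return seconds / 86_400  # seconds per day


flops = training_flops(13e9, 2e12)   # 13B parameters, 2T tokens
days = gpu_days(flops, 156e12)       # 156 TFLOP/s effective, from the exercise
print(f"{flops:.3g} FLOPs, {days:,.0f} A100-days, {days / 365:.1f} A100-years")
```

Dividing `days` by the number of GPUs gives the wall-clock time for a parallel run, assuming perfect scaling.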

Exercise 3 (Derive)
From L=E+A/N^α+B/D^β and C=6ND, derive that optimal N and D scale as √C.

Solution

Minimize L subject to D = C/(6N). Substituting gives L(N) = E + A/N^α + B·(6N/C)^β; setting dL/dN = 0 yields the stationary condition αA/N^α = βB/D^β, so the two reducible terms balance up to the factor α/β. Solving gives N ∝ C^{β/(α+β)} and D ∝ C^{α/(α+β)} for any α and β. With Chinchilla's near-equal exponents (α≈β) both exponents are ≈½, so N ∝ √C and D ∝ √C: compute-optimal scaling grows parameters and data together, each as the square root of compute.
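For readers who want the intermediate algebra, the steps can be written out as follows. The constant G is shorthand introduced here for convenience; it does not appear in the exercise statement.

```latex
\begin{align*}
L(N) &= E + A N^{-\alpha} + B \left(\frac{6N}{C}\right)^{\beta}
  && \text{substitute } D = \frac{C}{6N} \\
\frac{dL}{dN} &= -\alpha A N^{-\alpha-1}
  + \beta B \left(\frac{6}{C}\right)^{\beta} N^{\beta-1} = 0 \\
\alpha A N^{-\alpha} &= \beta B \left(\frac{6N}{C}\right)^{\beta}
  = \beta B D^{-\beta} \\
N^{\alpha+\beta} &= \frac{\alpha A}{\beta B} \left(\frac{C}{6}\right)^{\beta} \\
N_{\mathrm{opt}} &= G \left(\frac{C}{6}\right)^{\beta/(\alpha+\beta)},
  \qquad G = \left(\frac{\alpha A}{\beta B}\right)^{1/(\alpha+\beta)} \\
D_{\mathrm{opt}} &= \frac{C}{6 N_{\mathrm{opt}}}
  = G^{-1} \left(\frac{C}{6}\right)^{\alpha/(\alpha+\beta)}
\end{align*}
```

Setting α = β makes both exponents exactly ½.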

Exercise 4 (Pen & Paper)
Show the compute-optimal D/N is roughly constant; why the ~20:1 rule?

Solution

Since both N ∝ √C and D ∝ √C (Exercise 3), their ratio D/N is independent of the compute budget C: it is a constant set by the loss-curve coefficients A, B and the exponents. Strictly, D/N ∝ C^{(α−β)/(α+β)}, which is exactly constant only when α = β and drifts slowly with C otherwise. Empirically Chinchilla found that constant to be about 20 tokens per parameter, giving the ~20:1 rule: whatever your budget, train on roughly 20× as many tokens as you have parameters.

Exercise 5 (Pen & Paper)
GPT-3 (175B params, 300B tokens): tokens/param ratio? Over- or under-trained?

Solution

Ratio = (300×10⁹) / (175×10⁹) ≈ 1.7 tokens per parameter, far below the Chinchilla-optimal ~20:1. GPT-3 was therefore significantly UNDER-trained (over-parameterized for its data), short by a factor of about 12 in tokens per parameter (20 / 1.7 ≈ 11.7). A compute-matched Chinchilla-style model would be smaller and trained on far more tokens, achieving lower loss for the same compute. This insight reshaped how models were sized after 2022.
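To make "smaller and trained on far more tokens" concrete, the sketch below solves 6·N·D = C with D = 20·N for GPT-3's 6ND budget. The function name `compute_matched_size` is hypothetical, and the 20:1 ratio is the rule of thumb from Exercise 4 rather than an exact optimum.

```python
import math

TOKENS_PER_PARAM = 20.0  # assumed rule-of-thumb ratio from Exercise 4


def compute_matched_size(n_params, n_tokens, ratio=TOKENS_PER_PARAM):
    """Return (N, D) with D = ratio * N and the same 6*N*D budget."""
    budget = 6.0 * n_params * n_tokens
    # 6 * N * (ratio * N) = budget  =>  N = sqrt(budget / (6 * ratio))
    n_opt = math.sqrt(budget / (6.0 * ratio))
    return n_opt, ratio * n_opt


gpt3_n, gpt3_d = 175e9, 300e9
n_opt, d_opt = compute_matched_size(gpt3_n, gpt3_d)
print(f"GPT-3 ratio: {gpt3_d / gpt3_n:.2f} tokens/param")
print(f"Compute-matched at 20:1: {n_opt / 1e9:.0f}B params, {d_opt / 1e12:.2f}T tokens")
```

Under these assumptions the matched model has roughly 51B parameters and about 1T training tokens.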

Exercise 6 (Pen & Paper)
Why did Kaplan and Chinchilla reach different allocation conclusions?

Solution

Kaplan et al. (2020) concluded you should mostly grow the model (data secondary); Chinchilla (2022) found model and data should grow equally. The discrepancy came from methodology: Kaplan used a fixed/sub-optimal learning-rate schedule and didn't properly decay it for each run length, biasing the apparent returns toward size; Chinchilla varied both N and D with properly tuned schedules across many runs, revealing the balanced √C scaling. A subtle experimental-design difference flipped the conclusion.

Exercise 7 (Pen & Paper)
Why do LLaMA models train far past Chinchilla-optimal? Frame via lifetime compute.

Solution

Chinchilla optimizes TRAINING compute only. But a deployed model also incurs INFERENCE compute over its lifetime, and a smaller model is cheaper to serve forever. Training a smaller model on extra tokens (past 20:1) costs more upfront but yields a model that is cheaper at every inference call — so when expected inference volume is large, total lifetime compute is minimized by a smaller, 'over-trained' model. LLaMA deliberately makes this trade for deployability.

Exercise 8 (Pen & Paper)
L(C)=1.7+(1e15/C)^0.05 at C=1e21,1e23,1e25; compute to halve loss-above-irreducible.

Solution

At C=10²¹: (10⁻⁶)^{0.05}=10^{−0.3}≈0.501 → L≈2.20. At 10²³: 10^{−0.4}≈0.398 → L≈2.10. At 10²⁵: 10^{−0.5}≈0.316 → L≈2.02. Halving the reducible part at C=10²¹ from 0.501 to ≈0.25 requires (10¹⁵/C)^{0.05}=0.25 → 0.05(15−log₁₀C)=log₁₀ 0.25≈−0.602 → log₁₀C≈27.0, i.e. C≈1×10²⁷, about a MILLION-fold more compute. In general, halving the reducible loss multiplies compute by 2^{1/0.05}=2²⁰≈1.05×10⁶, whatever the starting point. The tiny 0.05 exponent is what makes each halving of the residual loss so expensive: the brutal economics of scaling.
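The same numbers can be reproduced programmatically. In this sketch the constants come from the exercise statement, while the function names `loss` and `compute_for_reducible` are illustrative choices.

```python
E, C_C, ALPHA = 1.7, 1e15, 0.05  # constants from the exercise statement


def loss(compute):
    """L(C) = E + (C_C / C) ** ALPHA."""
    return E + (C_C / compute) ** ALPHA


def compute_for_reducible(target):
    """Invert (C_C / C) ** ALPHA = target for C."""
    return C_C / target ** (1.0 / ALPHA)


for c in (1e21, 1e23, 1e25):
    print(f"C = {c:.0e}: L = {loss(c):.3f}")

base = 1e21
halved = compute_for_reducible((loss(base) - E) / 2.0)
print(f"halving the reducible loss needs C = {halved:.2e} ({halved / base:.3g}x more)")
```

The multiplier depends only on the exponent, so it is the same from any starting budget.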

Exercise 9 (Pen & Paper)
Explain the emergent-abilities debate; how a harsh metric fakes a sudden jump.

Solution

The debate: do abilities truly appear discontinuously at scale, or does the metric create the illusion? Example: a multi-step task scored by EXACT MATCH (all steps correct) can look flat and then jump, even if per-step accuracy improves smoothly. If steps succeed independently, exact-match = (per-step accuracy)^{steps}, which stays near 0 until per-step accuracy is high, then rises sharply: with 10 steps, 0.8 per step gives 0.8¹⁰ ≈ 0.11, while 0.98 per step gives 0.98¹⁰ ≈ 0.82. Under a smooth metric (e.g. per-step accuracy or log-likelihood) the same model improves gradually. Much apparent 'emergence' may therefore be a metric artifact.

Exercise 10 (Pen & Paper)
Describe the data wall; why does Chinchilla make it pressing; three responses.

Solution

The data wall is the finite supply of high-quality human text. Chinchilla-optimal scaling demands ~20 tokens/param, so as models grow, the required data outpaces what exists — making the ceiling pressing. Three responses: (1) synthetic/model-generated data (risky — can amplify errors); (2) higher sample efficiency / better algorithms (more capability per token); (3) new data modalities and sources (multimodal, interaction, code, proprietary corpora). The wall is shifting the field from brute-scale toward efficiency.

Exercise 11 (Code)
Implement the 6ND calculator; reproduce GPT-3, Chinchilla, LLaMA-2 training compute.

Solution

A function returning 6·N·D, fed each model's public N and D, comes close to the widely cited training-FLOP figures: GPT-3 (175B parameters, 300B tokens) ≈ 3.15×10²³, Chinchilla (70B, 1.4T) ≈ 5.9×10²³, and LLaMA-2 70B (2T tokens) ≈ 8.4×10²³. This validates the estimate and lets you sanity-check any model's compute from two numbers.
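One possible version of the calculator is sketched below. The parameter and token counts are the commonly reported public figures; the dictionary layout and the name `MODELS` are illustrative.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute with the 6*N*D rule."""
    return 6.0 * n_params * n_tokens


# (parameters, training tokens) as publicly reported for each model.
MODELS = {
    "GPT-3 175B": (175e9, 300e9),
    "Chinchilla 70B": (70e9, 1.4e12),
    "LLaMA-2 70B": (70e9, 2.0e12),
}

for name, (n, d) in MODELS.items():
    flops = training_flops(n, d)
    print(f"{name:>15}: {flops:.2e} FLOPs ({d / n:.1f} tokens/param)")
```

The tokens-per-parameter column also shows at a glance which models sit below, at, or beyond the ~20:1 ratio.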

Exercise 12 (Code)
Fit L(C)=E+(Cc/C)^alpha to synthetic (compute, loss) points; plot log-log; report exponent.

Solution

Fitting the power law (e.g. via least squares on the reducible part) recovers the irreducible loss E and the exponent alpha; on log-log axes the reducible loss is a straight line with slope −alpha. The fitted exponent matches the data-generating value, demonstrating how scaling laws are estimated from a ladder of runs.
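A dependency-free sketch of one way to do the fit: for each candidate E, regress log(L − E) on log C with ordinary least squares, and keep the E with the smallest residual. The grid-refinement strategy, the function names, and the noise-free synthetic data (generated with E = 1.7, Cc = 10¹⁵, alpha = 0.05) are all assumptions made for illustration; plotting is omitted, and noisy real data would usually call for a proper nonlinear optimizer.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares; returns (slope, intercept, sum of squared errors)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def fit_power_law(computes, losses, rounds=5, grid=200):
    """Fit L = E + (Cc / C) ** alpha by scanning E and refining the grid."""
    xs = [math.log(c) for c in computes]
    ceiling = min(losses) - 1e-9      # E must stay below every observed loss
    lo, hi = 0.0, ceiling
    best = None
    for _ in range(rounds):
        step = (hi - lo) / grid
        scored = []
        for i in range(grid + 1):
            e = lo + i * step
            ys = [math.log(l - e) for l in losses]
            slope, intercept, sse = fit_line(xs, ys)
            scored.append((sse, e, slope, intercept))
        best = min(scored)            # smallest residual wins
        lo = max(0.0, best[1] - step)
        hi = min(ceiling, best[1] + step)
    _, e, slope, intercept = best
    alpha = -slope                    # slope of the log-log line is -alpha
    c_c = math.exp(intercept / alpha) # intercept equals alpha * ln(Cc)
    return e, alpha, c_c


computes = [10.0 ** k for k in range(19, 26)]
losses = [1.7 + (1e15 / c) ** 0.05 for c in computes]  # synthetic, noise-free
e_fit, alpha_fit, cc_fit = fit_power_law(computes, losses)
print(f"E = {e_fit:.4f}, alpha = {alpha_fit:.4f}, Cc = {cc_fit:.3e}")
```

Because the exponent is small, the fitted alpha is sensitive to the estimate of E, which is why the search over E is refined to a fine resolution.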

Exercise 13 (Code Lab)
Compute-optimal allocation: given a budget, solve for N,D minimizing Chinchilla loss; verify ~20:1.

Solution

Numerically minimizing L = E + A/N^α + B/D^β subject to 6ND = C across budgets yields N, D whose ratio D/N stays roughly constant when α ≈ β, and sits near 20 for coefficients calibrated to the Chinchilla result, confirming Exercises 3–4. The solver is the practical tool for sizing a model to a compute budget.
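A minimal solver sketch follows, using golden-section search over log N. The coefficients are illustrative placeholders, not the published Chinchilla fit: the exponents are set equal and B is chosen so that the analytic optimum sits at exactly 20 tokens per parameter, which lets the numerical result be checked against Exercise 4.

```python
import math

# Illustrative coefficients only (NOT the published fit): equal exponents,
# with B chosen so the analytic optimum is D/N = (B/A)**(1/ALPHA) = 20.
E, ALPHA, BETA = 1.7, 0.3, 0.3
A = 400.0
B = A * 20.0 ** ALPHA


def loss(n, d):
    return E + A / n ** ALPHA + B / d ** BETA


def optimal_allocation(budget, iters=200):
    """Golden-section search over log N, with D = budget / (6 N)."""
    def objective(log_n):
        n = math.exp(log_n)
        return loss(n, budget / (6.0 * n))

    lo, hi = math.log(1e3), math.log(budget / 6.0)  # upper bound keeps D >= 1
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    f1, f2 = objective(x1), objective(x2)
    for _ in range(iters):
        if f1 < f2:                   # minimum lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = objective(x1)
        else:                         # minimum lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = objective(x2)
    n = math.exp((lo + hi) / 2.0)
    return n, budget / (6.0 * n)


for k in range(19, 26):
    n, d = optimal_allocation(10.0 ** k)
    print(f"C = 1e{k}: N = {n:.3e}, D = {d:.3e}, D/N = {d / n:.2f}")
```

Searching in log N keeps the objective convex, since both reducible terms become exponentials of the search variable. With unequal exponents the same solver works, but the printed ratio then drifts with the budget.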

Exercise 14 (Code)
Reproduce the Chinchilla table for budgets 1e19–1e25; plot N,D vs compute (log-log).

Solution

Computing optimal N, D, and predicted loss across the budget range and plotting on log-log axes gives two parallel lines of slope ≈½ (both ∝ √C), with the loss steadily decreasing — reproducing the Chinchilla allocation table and visualizing balanced scaling.
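The table can also be generated from the closed-form optimum N = G·(C/6)^{β/(α+β)}, D = (C/6)^{α/(α+β)}/G with G = (αA/(βB))^{1/(α+β)}, which follows from the stationary condition of Exercise 3. The sketch below prints the table and estimates the log-log slopes numerically instead of plotting. The coefficients are hypothetical placeholders (equal exponents, tuned to a 20:1 ratio), not the published fit.

```python
import math

# Hypothetical coefficients for illustration, not the published fit.
E, ALPHA, BETA = 1.7, 0.3, 0.3
A = 400.0
B = A * 20.0 ** ALPHA


def closed_form_optimum(budget):
    """Return (N, D, predicted loss) from the stationary condition."""
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n = g * (budget / 6.0) ** (BETA / (ALPHA + BETA))
    d = (budget / 6.0) ** (ALPHA / (ALPHA + BETA)) / g
    predicted = E + A / n ** ALPHA + B / d ** BETA
    return n, d, predicted


def loglog_slope(xs, ys):
    """Slope between the first and last points on log-log axes."""
    return (math.log(ys[-1]) - math.log(ys[0])) / (math.log(xs[-1]) - math.log(xs[0]))


budgets = [10.0 ** k for k in range(19, 26)]
table = [closed_form_optimum(c) for c in budgets]

print(f"{'C':>8} {'N':>10} {'D':>10} {'loss':>7}")
for c, (n, d, predicted) in zip(budgets, table):
    print(f"{c:8.0e} {n:10.3e} {d:10.3e} {predicted:7.3f}")

slope_n = loglog_slope(budgets, [row[0] for row in table])
slope_d = loglog_slope(budgets, [row[1] for row in table])
print(f"slope of N: {slope_n:.3f}, slope of D: {slope_d:.3f}")
```

Passing the `budgets` list and the N and D columns to any plotting library with logarithmic axes gives the two parallel lines described above.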

Exercise 15 (Code)
Build the planning tool: for $10k–$10M, report optimal model size, data, predicted perplexity.

Solution

Converting dollar budgets to FLOPs (via $/GPU-hour and effective throughput) and feeding them through the compute-optimal solver yields, for each budget, a recommended parameter count, token count, and predicted perplexity — a concrete bridge from money to a training plan.
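A sketch of the dollars-to-plan pipeline is shown below. Every constant is a placeholder: the $2 per GPU-hour price is invented, the throughput is reused from Exercise 2, the loss coefficients are illustrative, and perplexity = exp(loss) assumes the loss is measured in nats per token. The sizing step uses the 20:1 rule directly instead of a full solver.

```python
import math

# All constants below are hypothetical placeholders, not quoted prices,
# benchmarks, or published loss coefficients.
DOLLARS_PER_GPU_HOUR = 2.0
EFFECTIVE_FLOPS_PER_SEC = 156e12   # per-GPU throughput reused from Exercise 2
TOKENS_PER_PARAM = 20.0            # rule of thumb from Exercise 4
E, A, ALPHA = 1.7, 400.0, 0.3
B, BETA = A * TOKENS_PER_PARAM ** ALPHA, 0.3


def plan(budget_dollars):
    """Map a dollar budget to FLOPs, model size, tokens, and perplexity."""
    gpu_hours = budget_dollars / DOLLARS_PER_GPU_HOUR
    flops = gpu_hours * 3600.0 * EFFECTIVE_FLOPS_PER_SEC
    # 6 * N * (20 * N) = flops  =>  N = sqrt(flops / 120)
    n = math.sqrt(flops / (6.0 * TOKENS_PER_PARAM))
    d = TOKENS_PER_PARAM * n
    predicted_loss = E + A / n ** ALPHA + B / d ** BETA
    return {
        "flops": flops,
        "params": n,
        "tokens": d,
        "perplexity": math.exp(predicted_loss),  # assumes loss in nats/token
    }


for dollars in (1e4, 1e5, 1e6, 1e7):
    p = plan(dollars)
    print(f"${dollars:>12,.0f}: {p['params'] / 1e9:6.1f}B params, "
          f"{p['tokens'] / 1e9:8.0f}B tokens, perplexity {p['perplexity']:.1f}")
```

Swapping in real prices, measured throughput, and fitted coefficients changes the numbers but not the structure of the calculation.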

Exercise 16 (Code Lab)
Train a mini scaling ladder (4–5 sizes); fit a scaling law; extrapolate and compare to the next size.

Solution

Training several small models, fitting L(N) at fixed data (or L(C)), and extrapolating predicts the loss of the next-larger model; training that model and comparing shows the prediction is usually close — a hands-on demonstration that scaling laws genuinely forecast performance, the basis for planning large runs.

Exercise 17 (Code)
Demonstrate emergence-as-artifact: smooth per-token accuracy vs discontinuous exact-match.

Solution

Simulating a model whose per-token accuracy rises smoothly, then plotting both per-token accuracy and exact-match (= per-token^{steps}) shows the exact-match curve staying flat then jumping sharply — reproducing the illusion of Exercise 9 and showing emergence can be a metric artifact.
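The simulation needs only a few lines. In this sketch the logistic curve for per-token accuracy, its midpoint at 10²¹ FLOPs, and the 20-token answer length are all invented for illustration; exact-match is computed as per-token accuracy raised to the number of tokens, which assumes tokens succeed independently.

```python
import math

STEPS = 20  # assumed number of tokens that must all be correct


def per_token_accuracy(log10_compute):
    """Hypothetical smooth curve: a logistic in log10(compute)."""
    return 1.0 / (1.0 + math.exp(-(log10_compute - 21.0)))


def exact_match(log10_compute, steps=STEPS):
    """All `steps` tokens correct, assuming independent per-token success."""
    return per_token_accuracy(log10_compute) ** steps


for log_c in range(18, 27):
    print(f"C = 1e{log_c}: per-token {per_token_accuracy(log_c):.3f}, "
          f"exact-match {exact_match(log_c):.3f}")
```

Per-token accuracy climbs steadily across the whole range, while exact-match stays near zero for most of it and then rises over the last few decades of compute.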

Exercise 18 (Code, Challenge)
Training-optimal vs inference-aware sizing: minimize total lifetime compute; show optimum shrinks with inference volume.

Solution

Adding inference cost (≈ 2·N per generated token × expected tokens served) to training cost (6ND) and minimizing the TOTAL over N shows the optimal model is smaller than Chinchilla-optimal, and shrinks further as projected inference volume grows — quantifying why heavily-served models (Exercise 7) are deliberately over-trained and undersized relative to training-only optimality.
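One way to set up the minimization is to fix a target loss, compute for each N the training tokens needed to reach it, and then minimize training plus inference FLOPs over N. That formulation is an assumption made here; the exercise does not prescribe it. The loss coefficients and the target of 2.3 are likewise hypothetical.

```python
import math

# Hypothetical coefficients and quality bar, for illustration only.
E, ALPHA, BETA = 1.7, 0.3, 0.3
A = 400.0
B = A * 20.0 ** ALPHA
TARGET_LOSS = 2.3


def tokens_needed(n):
    """Training tokens D with loss(n, D) == TARGET_LOSS, or None if unreachable."""
    gap = TARGET_LOSS - E - A / n ** ALPHA
    if gap <= 0:
        return None  # model too small to reach the target with any amount of data
    return (B / gap) ** (1.0 / BETA)


def lifetime_flops(n, inference_tokens):
    """Training cost 6*N*D plus inference cost 2*N per served token."""
    d = tokens_needed(n)
    if d is None:
        return math.inf
    return 6.0 * n * d + 2.0 * n * inference_tokens


def best_size(inference_tokens, points_per_decade=400):
    """Grid search over N from 1e8 to 1e13 parameters."""
    candidates = [10.0 ** (8 + i / points_per_decade)
                  for i in range(5 * points_per_decade + 1)]
    return min(candidates, key=lambda n: lifetime_flops(n, inference_tokens))


for served in (0.0, 1e11, 1e12, 1e13, 1e14):
    n = best_size(served)
    print(f"inference tokens {served:.0e}: N = {n:.3e}, "
          f"training tokens/param = {tokens_needed(n) / n:.0f}")
```

With zero inference volume the search recovers the training-optimal size at 20 tokens per parameter; as served tokens grow, the chosen model shrinks and its training tokens per parameter rise.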