Scaling Laws
Detailed solutions for the exercises in Chapter 16. Try solving them yourself before checking the answers.
Solution
Each parameter participates in one multiply-accumulate (2 FLOPs: one multiply, one add) per token in the forward pass. The backward pass costs about twice the forward pass because it computes gradients with respect to both inputs and weights, adding ≈ 4 FLOPs per parameter per token. Summing gives ~6 FLOPs per parameter per token, so total training compute ≈ 6·N·D for N parameters and D tokens. This is the workhorse estimate for training cost.
Solution
FLOPs = 6·N·D = 6·(13×10⁹)·(2×10¹²) = 1.56×10²³. At 156 TFLOP/s = 1.56×10¹⁴ FLOP/s effective: time = 1.56×10²³ / 1.56×10¹⁴ = 10⁹ seconds. Dividing by 86,400 s/day ≈ 11,600 A100-days — roughly 32 A100-years of compute, illustrating why frontier pretraining needs thousands of GPUs in parallel.
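The arithmetic above can be reproduced in a few lines. This is a minimal sketch: the function names and the 1,024-GPU cluster size are illustrative assumptions, and the wall-clock figure assumes perfect linear scaling across devices, which real clusters do not achieve.

```python
SECONDS_PER_DAY = 86_400

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute using the 6*N*D rule."""
    return 6.0 * n_params * n_tokens

def gpu_days(total_flops: float, flops_per_second: float) -> float:
    """Single-device days needed at a given sustained throughput."""
    return total_flops / flops_per_second / SECONDS_PER_DAY

flops = training_flops(13e9, 2e12)   # 13B parameters, 2T tokens
days = gpu_days(flops, 156e12)       # 156 TFLOP/s effective per A100

# Hypothetical cluster size, assuming perfect linear scaling (an idealization).
cluster_size = 1024
wall_clock_days = days / cluster_size

print(f"{flops:.3g} FLOPs, {days:,.0f} A100-days, "
      f"{wall_clock_days:.1f} days on {cluster_size} GPUs")
```

Dividing the device-days by the cluster size is what turns decades of single-GPU work into a run of days or weeks.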
Solution
Minimize L = E + A/N^α + B/D^β subject to D = C/(6N). Substituting gives L(N) = E + A/N^α + B·(6N/C)^β; setting dL/dN = 0 yields α·A/N^α = β·B/D^β, so at the optimum the two reducible terms are balanced up to the factor α/β. Solving gives N ∝ C^{β/(α+β)} and D ∝ C^{α/(α+β)}. With Chinchilla's near-equal exponents (α≈β) both powers are ≈½, so N ∝ √C and D ∝ √C. Compute-optimal scaling grows parameters and data together, each as the square root of compute.
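For readers who want the intermediate steps, the stationary condition can be written out as follows. This is a sketch that assumes the parametric form L = E + A/N^α + B/D^β used in Exercise 12, with the constraint D = C/(6N) substituted in.

```latex
\begin{aligned}
L(N) &= E + A N^{-\alpha} + B\left(\frac{6N}{C}\right)^{\beta},
  \qquad D = \frac{C}{6N} \\
\frac{dL}{dN} &= -\alpha A N^{-\alpha-1}
  + \beta B \left(\frac{6}{C}\right)^{\beta} N^{\beta-1} = 0 \\
\Rightarrow\quad \alpha A N^{-\alpha} &= \beta B D^{-\beta} \\
\Rightarrow\quad N^{\alpha+\beta} &= \frac{\alpha A}{\beta B}\left(\frac{C}{6}\right)^{\beta}
  \quad\Rightarrow\quad
  N \propto C^{\beta/(\alpha+\beta)}, \qquad
  D = \frac{C}{6N} \propto C^{\alpha/(\alpha+\beta)}
\end{aligned}
```

Setting α = β in the last line gives exponents of exactly ½ for both N and D.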
Solution
Since both N ∝ √C and D ∝ √C (Exercise 3), their ratio D/N is independent of the compute budget C — a constant set by the loss-curve coefficients A, B and exponents. Empirically Chinchilla found that constant to be about 20 tokens per parameter, giving the ~20:1 rule: whatever your budget, train on roughly 20× as many tokens as you have parameters.
Solution
Ratio = 300×10⁹ / 175×10⁹ ≈ 1.7 tokens per parameter, far below the Chinchilla-optimal ~20:1. GPT-3 was therefore significantly under-trained (over-parameterized for its data): at its parameter count it saw about 12× fewer tokens than the 20:1 rule calls for (20 / 1.7 ≈ 12). A compute-matched Chinchilla-style model would be smaller and trained on far more tokens, achieving lower loss for the same compute. This insight reshaped how models were sized after 2022.
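To make the comparison concrete, the sketch below sizes a compute-matched model by solving 6·N·D = C together with D = 20·N. The helper name is an assumption, and the 20:1 ratio is the rule of thumb from Exercise 4 rather than an exact optimum.

```python
import math

TOKENS_PER_PARAM = 20.0  # Chinchilla rule of thumb from Exercise 4

def compute_matched_size(compute_flops: float, ratio: float = TOKENS_PER_PARAM):
    """Solve 6*N*D = C with D = ratio*N and return (N, D)."""
    n_params = math.sqrt(compute_flops / (6.0 * ratio))
    return n_params, ratio * n_params

gpt3_params, gpt3_tokens = 175e9, 300e9
gpt3_compute = 6.0 * gpt3_params * gpt3_tokens  # 6*N*D estimate for GPT-3

n_opt, d_opt = compute_matched_size(gpt3_compute)
print(f"GPT-3 ratio: {gpt3_tokens / gpt3_params:.2f} tokens/param")
print(f"Compute-matched: {n_opt / 1e9:.0f}B params, {d_opt / 1e12:.2f}T tokens")
```

Under these assumptions the same budget buys a model with well under a third of GPT-3's parameters, trained on more than three times as many tokens.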
Solution
Kaplan et al. (2020) concluded you should mostly grow the model (data secondary); Chinchilla (2022) found model and data should grow equally. The discrepancy came from methodology: Kaplan used a fixed/sub-optimal learning-rate schedule and didn't properly decay it for each run length, biasing the apparent returns toward size; Chinchilla varied both N and D with properly tuned schedules across many runs, revealing the balanced √C scaling. A subtle experimental-design difference flipped the conclusion.
Solution
Chinchilla optimizes TRAINING compute only. But a deployed model also incurs INFERENCE compute over its lifetime, and a smaller model is cheaper to serve forever. Training a smaller model on extra tokens (past 20:1) costs more upfront but yields a model that is cheaper at every inference call — so when expected inference volume is large, total lifetime compute is minimized by a smaller, 'over-trained' model. LLaMA deliberately makes this trade for deployability.
Solution
At C=10²¹: (10¹⁵/C)^{0.05} = (10⁻⁶)^{0.05} = 10^{−0.3} ≈ 0.501 → L≈2.20. At 10²³: 10^{−0.4}≈0.398 → L≈2.10. At 10²⁵: 10^{−0.5}≈0.316 → L≈2.02. To halve the reducible part from 0.501 to 0.25 requires (10¹⁵/C)^{0.05}=0.25 → 0.05(15−log₁₀C)=log₁₀ 0.25 ≈ −0.602 → log₁₀C≈27.0, i.e. C≈10²⁷, about a million-fold more compute than at 10²¹. In general, halving the reducible term multiplies compute by 2^{1/0.05} = 2²⁰ ≈ 1.05×10⁶, whatever the starting budget. The tiny 0.05 exponent makes each halving of the residual loss astronomically expensive: the brutal economics of scaling.
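The numbers above can be checked with a short script. It is a sketch that evaluates only the reducible term (10¹⁵/C)^0.05 from the solution; the irreducible constant of the exercise is left out, and the names are illustrative.

```python
C_REF = 1e15     # reference compute inside the power law
EXPONENT = 0.05  # compute exponent used in the solution

def reducible_loss(compute_flops: float) -> float:
    """Reducible term (C_REF / C) ** EXPONENT of the loss curve."""
    return (C_REF / compute_flops) ** EXPONENT

for c in (1e21, 1e23, 1e25):
    print(f"C = {c:.0e}: reducible loss = {reducible_loss(c):.3f}")

# Halving the reducible term multiplies compute by 2 ** (1 / EXPONENT),
# independent of the starting budget.
halving_factor = 2.0 ** (1.0 / EXPONENT)
print(f"compute multiplier to halve the reducible loss: {halving_factor:,.0f}")
```

Because the multiplier depends only on the exponent, every further halving costs the same enormous factor again.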
Solution
The debate: do abilities truly appear discontinuously at scale, or does the metric create the illusion? Example: a multi-step task scored by EXACT MATCH (all steps correct) can look flat then jump, even if per-step accuracy improves smoothly, because exact-match = (per-step accuracy)^{steps} when steps succeed independently, which stays near 0 until per-step accuracy is high, then rises sharply. Under a smooth metric (e.g. per-step accuracy or log-likelihood) the same model improves gradually. Much apparent 'emergence' is therefore at least partly a metric artifact.
Solution
The data wall is the finite supply of high-quality human text. Chinchilla-optimal scaling demands ~20 tokens/param, so as models grow, the required data outpaces what exists — making the ceiling pressing. Three responses: (1) synthetic/model-generated data (risky — can amplify errors); (2) higher sample efficiency / better algorithms (more capability per token); (3) new data modalities and sources (multimodal, interaction, code, proprietary corpora). The wall is shifting the field from brute-scale toward efficiency.
Solution
A function returning 6·N·D, fed each model's public N and D, reproduces the widely cited training-FLOP figures to within a few percent (GPT-3 ≈3.1×10²³, Chinchilla ≈5.8×10²³, etc.), validating the estimate and letting you sanity-check any model's compute from two numbers.
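One possible version of such a function is sketched below. The parameter and token counts are the commonly reported ones (175B/300B for GPT-3, 70B/1.4T for Chinchilla); the dictionary layout and names are illustrative choices.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via the 6*N*D rule."""
    return 6.0 * n_params * n_tokens

# (parameters, training tokens) as commonly reported for each model.
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

estimates = {name: training_flops(n, d) for name, (n, d) in MODELS.items()}
for name, flops in estimates.items():
    print(f"{name}: {flops:.2e} FLOPs")
```

The estimates land within a few percent of the quoted figures, which is about as much agreement as a two-number rule can be expected to give.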
Solution
Fitting the power law (e.g. via least squares on the reducible part) recovers the irreducible loss E and the exponent α; on log-log axes the reducible loss is a straight line with slope −α. The fitted exponent matches the data-generating value, demonstrating how scaling laws are estimated from a ladder of runs.
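One way to carry out this fit with only the standard library is sketched here. It scans candidate values of E and, for each, runs ordinary least squares on log(L − E) against log N, keeping the E with the smallest residual. The generating values (E = 1.7, A = 400, α = 0.34), the noise-free synthetic data, and the grid step are all assumptions; a nonlinear solver would be a reasonable alternative.

```python
import math

# Synthetic "ladder of runs"; the generating values are assumptions.
TRUE_E, TRUE_A, TRUE_ALPHA = 1.7, 400.0, 0.34
sizes = [1e7, 3e7, 1e8, 3e8, 1e9, 3e9]
losses = [TRUE_E + TRUE_A / n ** TRUE_ALPHA for n in sizes]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope*x + intercept; returns (slope, intercept, sse)."""
    count = len(xs)
    mean_x, mean_y = sum(xs) / count, sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse

def fit_power_law(sizes, losses, step=1e-3):
    """Scan candidate E values; for each, fit log(L - E) against log N."""
    log_n = [math.log(n) for n in sizes]
    best = None
    e = 0.0
    while e < min(losses) - step:  # keeps every L - e strictly positive
        slope, intercept, sse = fit_line(log_n, [math.log(l - e) for l in losses])
        if best is None or sse < best[0]:
            best = (sse, e, math.exp(intercept), -slope)
        e += step
    _, e_hat, a_hat, alpha_hat = best
    return e_hat, a_hat, alpha_hat

e_hat, a_hat, alpha_hat = fit_power_law(sizes, losses)
print(f"E = {e_hat:.3f}, A = {a_hat:.1f}, alpha = {alpha_hat:.3f}")
```

With real, noisy losses the recovered E is much less certain than the slope, which is why the ladder should span as wide a range of sizes as the budget allows.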
Solution
Numerically minimizing L = E + A/N^α + B/D^β subject to 6ND = C across budgets yields N and D whose ratio D/N stays near 20 regardless of C when α ≈ β, confirming Exercises 3–4. (With unequal exponents the ratio drifts slowly, as C^{(α−β)/(α+β)}.) The solver is the practical tool for sizing a model to a compute budget.
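A minimal version of such a solver is sketched below, using golden-section search over log N (the objective is convex in log N, so a one-dimensional bracket search suffices). The coefficients are illustrative assumptions, not fitted values: α = β and B/A = 20^α are chosen deliberately so that the optimal D/N equals 20.

```python
import math

# Illustrative coefficients (assumptions, not fitted values): alpha == beta and
# B / A = 20 ** alpha are chosen so that the optimal D / N equals 20.
E, A, ALPHA, BETA = 1.7, 400.0, 0.3, 0.3
B = A * 20.0 ** ALPHA

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA

def optimal_allocation(compute_flops: float, iters: int = 200):
    """Golden-section search over log N with D tied to N by 6*N*D = C."""
    def objective(log_n: float) -> float:
        n = math.exp(log_n)
        return loss(n, compute_flops / (6.0 * n))

    lo, hi = math.log(1e6), math.log(1e14)  # search bracket for N
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        x1 = hi - inv_phi * (hi - lo)
        x2 = lo + inv_phi * (hi - lo)
        if objective(x1) < objective(x2):
            hi = x2
        else:
            lo = x1
    n_opt = math.exp((lo + hi) / 2.0)
    d_opt = compute_flops / (6.0 * n_opt)
    return n_opt, d_opt, loss(n_opt, d_opt)

for c in (1e21, 1e23, 1e25):
    n_opt, d_opt, l_opt = optimal_allocation(c)
    print(f"C = {c:.0e}: N = {n_opt:.3g}, D = {d_opt:.3g}, "
          f"D/N = {d_opt / n_opt:.1f}, L = {l_opt:.3f}")
```

Swapping in coefficients with α ≠ β makes the printed ratio drift with C, which is a quick way to see how much the 20:1 rule depends on the fit.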
Solution
Computing optimal N, D, and predicted loss across the budget range and plotting on log-log axes gives two parallel lines of slope ≈½ (both ∝ √C), with the loss steadily decreasing — reproducing the Chinchilla allocation table and visualizing balanced scaling.
Solution
Converting dollar budgets to FLOPs (via $/GPU-hour and effective throughput) and feeding them through the compute-optimal solver yields, for each budget, a recommended parameter count, token count, and predicted perplexity — a concrete bridge from money to a training plan.
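The conversion might look like the sketch below. All three constants are assumptions: the $2 per GPU-hour price is hypothetical, the throughput reuses the 156 TFLOP/s figure from Exercise 2, and the 20:1 rule stands in for a full compute-optimal solver. Predicted perplexity would additionally need a fitted loss law and is omitted here.

```python
import math

# All three constants are assumptions for illustration.
PRICE_PER_GPU_HOUR = 2.0          # hypothetical USD per GPU-hour
EFFECTIVE_FLOPS_PER_SEC = 156e12  # sustained per-GPU throughput, as in Exercise 2
TOKENS_PER_PARAM = 20.0           # Chinchilla rule of thumb

def budget_to_flops(dollars: float) -> float:
    """Convert a dollar budget into total training FLOPs."""
    gpu_hours = dollars / PRICE_PER_GPU_HOUR
    return gpu_hours * 3600.0 * EFFECTIVE_FLOPS_PER_SEC

def plan(dollars: float):
    """Return (compute, params, tokens) for a dollar budget under the 20:1 rule."""
    compute = budget_to_flops(dollars)
    n_params = math.sqrt(compute / (6.0 * TOKENS_PER_PARAM))
    return compute, n_params, TOKENS_PER_PARAM * n_params

for dollars in (1e4, 1e6, 1e8):
    compute, n_params, n_tokens = plan(dollars)
    print(f"${dollars:,.0f}: C = {compute:.2e} FLOPs, "
          f"N = {n_params:.2e}, D = {n_tokens:.2e}")
```

Since N and D each grow as √C, a 100× larger budget buys only a 10× larger model under these assumptions.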
Solution
Training several small models, fitting L(N) at fixed data (or L(C)), and extrapolating predicts the loss of the next-larger model; training that model and comparing shows the prediction is usually close — a hands-on demonstration that scaling laws genuinely forecast performance, the basis for planning large runs.
Solution
Simulating a model whose per-token accuracy rises smoothly, then plotting both per-token accuracy and exact-match (= (per-token accuracy)^{steps}, treating each token as one step) shows the exact-match curve staying flat then jumping sharply, reproducing the illusion of Exercise 9 and showing emergence can be a metric artifact.
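A minimal simulation of this effect is sketched below, printing the two curves instead of plotting them. The step count of 10, the linear accuracy ramp from 0.50 to 0.99, and the independence of steps are all assumptions made for illustration.

```python
STEPS = 10  # hypothetical number of tokens/steps that must all be correct

def exact_match(per_token_accuracy: float, steps: int = STEPS) -> float:
    """Probability that every step is correct, assuming independent steps."""
    return per_token_accuracy ** steps

# Per-token accuracy rising linearly from 0.50 to 0.99 across ten model scales.
per_token = [0.50 + 0.49 * i / 9 for i in range(10)]
curve = [(p, exact_match(p)) for p in per_token]

for p, em in curve:
    print(f"per-token = {p:.3f}  exact-match = {em:.3f}")
```

The per-token column climbs in even increments, while most of the exact-match gain is concentrated in the last few rows.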
Solution
Adding inference cost (≈ 2·N per generated token × expected tokens served) to training cost (6ND) and minimizing the TOTAL over N shows the optimal model is smaller than Chinchilla-optimal, and shrinks further as projected inference volume grows — quantifying why heavily-served models (Exercise 7) are deliberately over-trained and undersized relative to training-only optimality.
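One way to set this up is sketched below: fix a target loss, compute the tokens each model size needs to reach it, and pick the size with the lowest training-plus-inference compute. The coefficients, the target loss of 2.2, and the inference volumes are illustrative assumptions, and the log-spaced grid search is a simple stand-in for a proper optimizer.

```python
import math

# Illustrative coefficients (assumptions) for L = E + A/N^alpha + B/D^beta.
E, A, ALPHA, BETA = 1.7, 400.0, 0.3, 0.3
B = A * 20.0 ** ALPHA
TARGET_LOSS = 2.2  # hypothetical quality target

def tokens_needed(n_params: float) -> float:
    """Training tokens required for a model of size N to reach TARGET_LOSS."""
    gap = TARGET_LOSS - E - A / n_params ** ALPHA
    if gap <= 0.0:
        return math.inf  # too small to reach the target with any amount of data
    return (B / gap) ** (1.0 / BETA)

def total_compute(n_params: float, inference_tokens: float) -> float:
    training = 6.0 * n_params * tokens_needed(n_params)
    inference = 2.0 * n_params * inference_tokens
    return training + inference

def best_size(inference_tokens: float) -> float:
    """Grid search over log-spaced N (1e8 to 1e13) for the lowest lifetime compute."""
    candidates = [10.0 ** (8.0 + 0.001 * i) for i in range(5001)]
    return min(candidates, key=lambda n: total_compute(n, inference_tokens))

for served in (0.0, 1e12, 1e13):
    n_best = best_size(served)
    print(f"inference tokens = {served:.0e}: N = {n_best:.3g}, "
          f"D/N = {tokens_needed(n_best) / n_best:.1f}")
```

With zero inference volume the search recovers the training-only optimum (D/N near 20 for these coefficients); as the served volume grows, the chosen N shrinks and the tokens-per-parameter ratio rises.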