Scaling Laws
One of the most consequential discoveries in modern machine learning is also one of the simplest to state: as you scale up model size, data, and compute, the loss of a language model decreases as a smooth, predictable power law. This regularity is so reliable that you can forecast the performance of a model costing millions of dollars from a handful of small, cheap experiments. Scaling laws turned the question 'how big should the model be?' from guesswork into engineering.
What a Power Law Looks Like
A power law relates loss L to a resource X (parameters, data, or compute) as L ≈ (X₀/X)^α plus an irreducible floor. On a log-log plot, the reducible part of the loss (the loss minus that floor) is a straight line with slope −α, and the empirical loss of language models falls almost exactly on such lines across many orders of magnitude. This is the central empirical fact of the chapter.
L(N) ≈ (N_c / N)^α_N # loss vs model size N (params)
L(D) ≈ (D_c / D)^α_D # loss vs dataset size D (tokens)
L(C) ≈ (C_c / C)^α_C # loss vs compute C (FLOPs)
# Each α ≈ 0.05-0.10: a small but relentless improvement with scale
Before relating compute to loss, we need to measure compute. The fundamental unit is the floating-point operation (FLOP), and there is a beautifully simple rule for how many FLOPs it takes to train a Transformer: approximately 6 times the number of parameters, per token of training data.
C ≈ 6 · N · D
C = total training compute (FLOPs)
N = number of model parameters
D = number of training tokens
Where the 6 Comes From
The factor of 6 decomposes cleanly. Each parameter participates in roughly 2 FLOPs per token in the forward pass (one multiply, one add in the matrix multiplications). The backward pass costs about twice the forward pass — it computes gradients with respect to both inputs and weights — adding 4 FLOPs. Total: 2 (forward) + 4 (backward) = 6 FLOPs per parameter per token.
| Pass | FLOPs per param per token | Why |
|---|---|---|
| Forward | 2 | One multiply + one add per weight |
| Backward (input grad) | 2 | Gradient w.r.t. activations |
| Backward (weight grad) | 2 | Gradient w.r.t. weights |
| Total | 6 | 2 forward + 4 backward |
def training_flops(n_params, n_tokens):
    """Approximate total training FLOPs via the 6ND rule."""
    return 6 * n_params * n_tokens
# GPT-3: 175B params, 300B tokens
C = training_flops(175e9, 300e9)
print(f"GPT-3: {C:.2e} FLOPs")  # 3.15e+23 FLOPs
# How long on 1024 A100s at 50% utilization?
# A100 peak: ~312 TFLOP/s (bf16); effective ~156 TFLOP/s at 50%
gpu_flops_per_sec = 156e12
n_gpus = 1024
seconds = C / (gpu_flops_per_sec * n_gpus)
print(f"Time: {seconds/86400:.1f} days of wall-clock")
# Time: 22.8 days on this hypothetical 1024-GPU cluster
# Cost estimate at ~$1/GPU-hour (an optimistic, illustrative price)
cost = n_gpus * (seconds / 3600) * 1.0
print(f"Approx cost: ${cost/1e6:.1f}M")  # ~$0.6M of compute
Kaplan et al. (2020) trained hundreds of models across orders of magnitude in size and data, and fit power laws to the results. Their findings established the quantitative foundation of large-model development and held up remarkably well, with one important correction we cover next.
The Key Findings
The Combined Law
Kaplan proposed a combined formula capturing how loss depends jointly on model size and data, with the loss limited by whichever is the binding constraint:
L(N, D) ≈ [ (N_c/N)^(α_N/α_D) + D_c/D ]^α_D
# Loss is high if EITHER N is small OR D is small.
# To improve, you must grow the binding constraint.
Kaplan's compute-allocation conclusion: given a 10× increase in compute, spend most of it on a bigger model and relatively little on more data. This recommendation ('make the model much bigger, train on modestly more data') guided the GPT-3 generation. It was reasonable given the data available, but it turned out to be suboptimal, as Chinchilla revealed.
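To see the "binding constraint" behaviour of the combined law, a minimal sketch can evaluate it for a fixed model size while the dataset grows. The constants `N_C`, `D_C`, `ALPHA_N`, and `ALPHA_D` are assumptions: approximate fitted values as commonly quoted from Kaplan et al. (2020), not values given in this chapter. The function names are illustrative.

```python
# Assumed constants (approximate values commonly quoted from Kaplan et al., 2020).
ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13  # parameters, tokens


def kaplan_loss(n_params, n_tokens):
    """Combined law L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return (model_term + data_term) ** ALPHA_D


def model_limited_floor(n_params):
    """Loss in the limit D -> infinity, where only the model term remains."""
    return (N_C / n_params) ** ALPHA_N


# Hold the model at 1B parameters and keep adding data.
n = 1e9
for d in [1e9, 1e10, 1e11, 1e12, 1e13]:
    print(f"N={n:.0e}, D={d:.0e}: L={kaplan_loss(n, d):.3f}")
print(f"Floor set by model size alone: {model_limited_floor(n):.3f}")
```

Once the data term is small compared with the model term, further data barely moves the loss: the model size has become the binding constraint.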
Hoffmann et al. (2022) revisited the compute-allocation question with more careful methodology and reached a landmark conclusion: at a fixed compute budget, model size N and dataset size D should scale roughly in equal proportion. The implication was that the flagship models of the era were drastically under-trained — too big for the amount of data they saw.
The Chinchilla Loss Law
L(N, D) = E + A/N^α + B/D^β
E ≈ 1.69 # irreducible loss (entropy of natural language)
A, α # model-size term (α ≈ 0.34)
B, β # data-size term (β ≈ 0.28)
This additive form is more interpretable than Kaplan's. There is an irreducible loss E: the entropy of language itself, which no model can beat. Above that, two terms decrease with model size and data size respectively. Minimizing the loss for a fixed compute budget C = 6ND becomes a clean constrained optimization.
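As a rough illustration of how the three terms trade off, the sketch below evaluates the law for two configurations of similar training compute: a larger model on less data and a smaller model on more data. The constants A = 406.4 and B = 410.7 are the fitted values this chapter uses in its later code; the helper names are hypothetical.

```python
# Fitted constants of the Chinchilla parametric law, as used in this chapter.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def chinchilla_terms(n_params, n_tokens):
    """Return (irreducible, model-size term, data-size term) of the loss."""
    return E, A / n_params**ALPHA, B / n_tokens**BETA


def chinchilla_loss(n_params, n_tokens):
    """Total predicted loss: the sum of the three terms."""
    return sum(chinchilla_terms(n_params, n_tokens))


configs = {
    "280B params, 300B tokens": (280e9, 300e9),
    "70B params, 1.4T tokens": (70e9, 1.4e12),
}
for name, (n, d) in configs.items():
    _, model_term, data_term = chinchilla_terms(n, d)
    flops = 6 * n * d  # training compute via C = 6ND
    print(f"{name}: C={flops:.2e} FLOPs, L={chinchilla_loss(n, d):.3f} "
          f"(model term {model_term:.3f}, data term {data_term:.3f})")
```

Under this fit, the larger model's loss is dominated by its data term, which is the sense in which it is under-trained.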
The Headline Result: ~20 Tokens per Parameter
Solving the optimization yields the compute-optimal recipe: at the compute budgets Hoffmann et al. studied, train on roughly 20 tokens for every parameter. A compute-optimal 10B-parameter model should see about 200B tokens; a 70B model about 1.4T tokens. This 20:1 token-to-parameter ratio is the single most cited number in modern LLM training.
| Model | Params | Tokens (actual) | Tokens/param |
|---|---|---|---|
| GPT-3 (2020) | 175B | 300B | ~1.7 (under-trained) |
| Gopher (2021) | 280B | 300B | ~1.1 (under-trained) |
| Chinchilla (2022) | 70B | 1.4T | 20 (compute-optimal) |
| LLaMA-1 (2023) | 65B | 1.4T | ~22 |
| LLaMA-2 (2023) | 70B | 2.0T | ~29 (past optimal) |
| LLaMA-3 (2024) | 70B | 15T | ~214 (far past optimal) |
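The ratios in the table are simple arithmetic, and a short sketch makes them easy to recompute for any other model. The parameter and token counts are copied from the table above; `CHINCHILLA_RATIO` and the function name are illustrative.

```python
# (parameters, training tokens) as listed in the table above
models = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA-1": (65e9, 1.4e12),
    "LLaMA-2": (70e9, 2.0e12),
    "LLaMA-3": (70e9, 15e12),
}

CHINCHILLA_RATIO = 20  # the ~20 tokens-per-parameter rule of thumb


def tokens_per_param(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


for name, (n, d) in models.items():
    ratio = tokens_per_param(n, d)
    print(f"{name:<12} {ratio:6.1f} tokens/param "
          f"({ratio / CHINCHILLA_RATIO:.2f}x the 20:1 rule)")
```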
The core practical use of scaling laws is allocation: given a fixed compute budget C, how do you split it between a bigger model (more N) and more training (more D)? The Chinchilla law turns this into a solvable optimization, and the answer is a specific N and D for every budget.
The Optimization
minimize L(N, D) = E + A/N^α + B/D^β
subject to C = 6 N D
Solution: N_opt ∝ C^a, D_opt ∝ C^b with a ≈ b ≈ 0.5
⇒ split each 10× of compute as ~3.2× bigger model, ~3.2× more data
The key result: both the optimal model size and the optimal dataset size grow as roughly the square root of compute. Each time you get 10× more compute, you should make the model about 3.2× bigger AND train on about 3.2× more data, rather than pouring it all into size as Kaplan suggested.
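The constrained problem also has a closed-form solution, sketched here under the assumption that the parametric law and C = 6ND hold exactly. Substituting D = C/(6N) and setting the derivative with respect to N to zero gives a = β/(α+β) and b = α/(α+β). With the fitted values used in this chapter (α ≈ 0.34, β ≈ 0.28), that is a ≈ 0.45 and b ≈ 0.55: close to, but not exactly, the square-root split. The function names are hypothetical.

```python
# Fitted constants of the Chinchilla parametric law, as used in this chapter.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def loss(n_params, n_tokens):
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def closed_form_allocation(compute):
    """Minimize the parametric loss subject to compute = 6 * N * D.

    N_opt = G * (C/6)^a and D_opt = (C/6)^b / G, with
    a = beta/(alpha+beta), b = alpha/(alpha+beta),
    G = (alpha*A / (beta*B))^(1/(alpha+beta)).
    """
    a = BETA / (ALPHA + BETA)
    b = ALPHA / (ALPHA + BETA)
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n_opt = g * (compute / 6.0) ** a
    d_opt = (compute / 6.0) ** b / g
    return n_opt, d_opt, a, b


budget = 1e23
n_opt, d_opt, a, b = closed_form_allocation(budget)
print(f"Exponents: a={a:.3f}, b={b:.3f}")
print(f"N_opt={n_opt:.3e}, D_opt={d_opt:.3e}, check 6ND={6 * n_opt * d_opt:.3e}")
```

Because a and b sum to one, the product 6·N_opt·D_opt recovers the budget exactly, which is a quick sanity check on the algebra.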
import numpy as np
from scipy.optimize import minimize_scalar
# Chinchilla parametric loss
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
def loss(N, D):
    return E + A/N**alpha + B/D**beta
def optimal_allocation(compute):
    """Given compute budget C, find N, D minimizing loss s.t. C=6ND."""
    # Parametrize by log10(N); D is then determined by the constraint
    def loss_at(log_N):
        N = 10**log_N
        D = compute / (6 * N)  # from C = 6ND
        return loss(N, D)
    res = minimize_scalar(loss_at, bounds=(7, 12), method='bounded')
    N_opt = 10**res.x
    D_opt = compute / (6 * N_opt)
    return N_opt, D_opt
# Allocate three compute budgets
for C in [1e21, 1e23, 1e25]:
    N, D = optimal_allocation(C)
    print(f"C={C:.0e}: N={N/1e9:.1f}B params, D={D/1e9:.0f}B tokens, ratio={D/N:.0f}")
# Approximate output with these fitted constants:
# C=1e+21: N≈1.8B params, D≈91B tokens, ratio≈50
# C=1e+23: N≈14.6B params, D≈1140B tokens, ratio≈78
# C=1e+25: N≈117B params, D≈14300B tokens, ratio≈122
# With alpha=0.34 and beta=0.28 the exponents are a = beta/(alpha+beta) ≈ 0.45 and
# b ≈ 0.55, so under this fit the ratio drifts upward with compute rather than
# staying at 20. The ~20 tokens/param figure is closer to what Chinchilla's other
# estimation approaches (a ≈ b ≈ 0.5) give near Chinchilla's own compute scale.
The most powerful practical application of scaling laws is forecasting. Before committing to a huge, expensive training run, you train a series of small models, fit a scaling law to their losses, and extrapolate to predict the loss of the large model. This de-risks enormous investments and is now standard practice at every frontier lab.
The Forecasting Procedure
# 1. Train a ladder of small models at increasing scale
for scale in [tiny, small, medium, ...]:
    train compute-optimally; record (compute, loss)
# 2. Fit a power law to the (compute, loss) points
L(C) ≈ E + (C_c / C)^α_C # nonlinear least-squares fit
# 3. Extrapolate to the target large-scale compute
predicted_loss = L(C_target)
# 4. Decide: is the predicted gain worth the cost?
import numpy as np
from scipy.optimize import curve_fit
# Suppose we trained 5 small models and measured their loss
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
loss = np.array([3.21, 2.88, 2.61, 2.39, 2.21])
# Fit L(C) = E + (Cc/C)^alpha, parametrized by log10(C) and log10(Cc)
# so the optimizer works with well-scaled numbers
def scaling_law(log10_C, E, log10_Cc, alpha):
    return E + 10.0 ** (alpha * (log10_Cc - log10_C))
params, _ = curve_fit(scaling_law, np.log10(compute), loss,
                      p0=[1.7, 15.0, 0.05], maxfev=100000)
E, log10_Cc, alpha = params
print(f"Fit: E={E:.2f}, alpha={alpha:.3f}")
# Extrapolate to a run 1000x larger than our biggest experiment
C_target = 1e25
predicted = scaling_law(np.log10(C_target), *params)
print(f"Predicted loss at C={C_target:.0e}: {predicted:.3f}")
# For these illustrative points the extrapolation lands at roughly 1.85
# Labs have reported that forecasts of this kind can track the loss of much larger runs closely.
Scaling laws tell us that three things reliably move the loss: parameters, data, and compute. Equally important is what does NOT much affect the loss, because that tells you where not to spend your effort.
| Factor | Effect on loss |
|---|---|
| Parameters (N) | Strong power-law improvement |
| Training tokens (D) | Strong power-law improvement |
| Compute (C) | Strong power-law (the master variable) |
| Depth vs width | Weak — many shapes give similar loss at fixed N |
| Number of attention heads | Weak within a sensible range |
| Exact activation/norm choice | Weak — a few % at most |
| Data QUALITY | Strong, but hard to put on the same axis |
The Quality Dimension
The classic scaling laws treat all tokens as equal, but data quality matters enormously. Better-filtered, deduplicated, higher-quality data shifts the entire scaling curve downward — you reach a lower loss at the same compute. The Phi model series demonstrated that carefully curated 'textbook-quality' data can let a small model punch far above its parameter count. Data curation (Chapter 17) is the lever that scaling laws assume away but practitioners obsess over.
Scaling laws predict loss smoothly. Yet some capabilities — multi-step arithmetic, instruction following, chain-of-thought reasoning — appear to switch on suddenly at a certain scale, absent below it and present above. These 'emergent abilities' (Wei et al., 2022) are among the most discussed and contested phenomena in the field.
The Two Sides of the Debate
| Emergence is real | Emergence is a measurement artifact |
|---|---|
| Some tasks show sharp capability jumps | Sharp jumps come from harsh metrics |
| Below threshold: ~0%; above: high | Exact-match accuracy is all-or-nothing |
| Qualitatively new behaviour at scale | Smooth metrics reveal smooth improvement |
| Hard to predict from small models | Per-token loss improves continuously |
| Wei et al. (2022) | Schaeffer et al. (2023): 'a mirage' |
The skeptical view (Schaeffer et al., 2023) is compelling: many 'emergent' jumps are artifacts of the metric. If you score a multi-step task with exact-match (all steps correct or zero credit), then a model whose per-step accuracy improves smoothly will show a sudden jump in exact-match once per-step accuracy crosses a threshold. Switch to a smooth, partial-credit metric and the emergence often dissolves into a smooth curve.
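The metric argument can be reproduced with a few lines of arithmetic. This sketch assumes a hypothetical 10-step task whose steps succeed independently with the same per-step accuracy; both the task length and the independence are simplifying assumptions, not claims about any real benchmark.

```python
def exact_match_accuracy(per_step_accuracy, n_steps):
    """Probability that all n_steps are correct, assuming independent steps."""
    return per_step_accuracy ** n_steps


N_STEPS = 10  # hypothetical task length

# Per-step accuracy improves smoothly; exact-match stays near zero, then jumps.
for p in [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]:
    em = exact_match_accuracy(p, N_STEPS)
    print(f"per-step accuracy {p:.2f} -> exact-match {em:.3f}")
```

The per-step column is the smooth, partial-credit view; the exact-match column is the all-or-nothing view of the same underlying improvement.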
Scaling laws are power laws, and power laws have a sobering property: improvements shrink as you climb. Each halving of the reducible loss (the part above the irreducible floor) multiplies the required compute by the same large factor, so compute grows exponentially with the number of halvings. And there are hard limits (finite data, finite compute, the irreducible entropy floor) that bound how far pure scaling can go.
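How steep is that climb? If the reducible loss scales as C^(-α), halving it requires multiplying compute by 2^(1/α). The sketch below evaluates that factor for exponents in the 0.05 to 0.10 range quoted earlier in the chapter; the function name is illustrative.

```python
def compute_multiplier_to_halve(alpha):
    """Factor by which compute must grow to halve the reducible loss,
    assuming (L - E) is proportional to C^(-alpha)."""
    return 2.0 ** (1.0 / alpha)


for alpha in [0.05, 0.075, 0.10]:
    factor = compute_multiplier_to_halve(alpha)
    print(f"alpha={alpha:.3f}: halving the reducible loss needs ~{factor:,.0f}x compute")
```

At α = 0.10 the factor is about a thousand; at α = 0.05 it is about a million.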
The Data Wall
The most discussed limit is data. Chinchilla-optimal training of ever-larger models demands ever more high-quality tokens — but the supply of high-quality human text is finite. Estimates (Villalobos et al., 2022) suggest the stock of high-quality public text could be exhausted by training runs in the mid-to-late 2020s. This 'data wall' is driving intense interest in synthetic data, multi-epoch training, and data efficiency.
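To get a feel for the demand side of the data wall, this sketch computes how many tokens a run would need at a given compute budget, assuming the 20:1 rule and C = 6ND both hold. It deliberately says nothing about the supply of text, which is an empirical question; the function name is hypothetical.

```python
import math

TOKENS_PER_PARAM = 20  # the ~20:1 rule of thumb, treated here as exact


def compute_optimal_tokens(compute, tokens_per_param=TOKENS_PER_PARAM):
    """Tokens D satisfying C = 6*N*D with D = r*N, i.e. D = sqrt(r*C/6)."""
    return math.sqrt(tokens_per_param * compute / 6.0)


for c in [1e24, 1e25, 1e26, 1e27]:
    d = compute_optimal_tokens(c)
    n = d / TOKENS_PER_PARAM
    print(f"C={c:.0e} FLOPs -> D ~ {d / 1e12:.1f}T tokens, N ~ {n / 1e9:.0f}B params")
```

Token demand grows as the square root of compute, so every 100× increase in budget asks for 10× more text.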
| Limit | Nature | Response |
|---|---|---|
| Irreducible loss E | Entropy of language itself | Cannot beat; aim to approach it |
| Data wall | Finite high-quality text | Synthetic data, multi-epoch, quality |
| Compute cost | Exponential for linear loss gain | Efficiency, better algorithms |
| Diminishing returns | Power-law flattening | New capabilities beyond loss |
| Inference economics | Serving cost at scale | Smaller models, distillation, MoE |
These limits are reshaping the field's direction. The frontier is shifting from pure scale toward better data (Chapter 17), test-time compute and reasoning (Chapter 24), sparse architectures like Mixture-of-Experts (Chapter 32), and post-training techniques (Part V) that extract more capability from a fixed base model. Scaling laws are not dead, but the era of 'just make it bigger' is giving way to a more multidimensional optimization.
Scaling laws are not just descriptive science — they are a decision-making tool. Here is how a practitioner actually uses them, from a fixed budget to a concrete training plan.
The Decision Workflow
| Question | How scaling laws answer it |
|---|---|
| How big a model can I afford? | From budget C and target tokens D: N = C/(6D) |
| How much data do I need? | Chinchilla: D ≈ 20N for compute-optimal training |
| What loss will I get? | Fit a small-scale ladder, extrapolate L(C) |
| Should I train longer or bigger? | Compute-optimal split: grow both as √C |
| Is over-training worth it? | Yes if inference volume is high (cheaper serving) |
| Will it be good enough? | Predict loss; map loss to downstream metrics |
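The over-training row can be made concrete with a break-even estimate. This is a minimal sketch assuming the Chinchilla parametric law holds exactly and that inference costs about 2 FLOPs per parameter per token (the forward-pass figure from earlier in the chapter). It compares a compute-optimal model with a model half its size trained on enough extra data to reach the same predicted loss. All function names are hypothetical.

```python
# Fitted constants of the Chinchilla parametric law, as used in this chapter.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def loss(n, d):
    return E + A / n**ALPHA + B / d**BETA


def optimal_allocation(compute):
    """Closed-form minimizer of the parametric loss subject to C = 6ND."""
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n = g * (compute / 6.0) ** (BETA / (ALPHA + BETA))
    return n, compute / (6.0 * n)


def tokens_to_match(target_loss, n):
    """Tokens a model of size n needs to reach target_loss (None if unreachable)."""
    gap = target_loss - E - A / n**ALPHA
    if gap <= 0:
        return None  # the model is too small to ever reach this loss
    return (B / gap) ** (1.0 / BETA)


def break_even_inference_tokens(compute, shrink=0.5):
    """Inference tokens after which the smaller, over-trained model is cheaper overall."""
    n_big, d_big = optimal_allocation(compute)
    n_small = shrink * n_big
    d_small = tokens_to_match(loss(n_big, d_big), n_small)
    if d_small is None:
        return None
    extra_training = 6 * n_small * d_small - compute  # extra FLOPs paid once
    saved_per_token = 2 * (n_big - n_small)           # FLOPs saved per served token
    return extra_training / saved_per_token, n_small, d_small


result = break_even_inference_tokens(1e23)
if result is not None:
    tokens, n_small, d_small = result
    print(f"Smaller model: {n_small / 1e9:.1f}B params on {d_small / 1e12:.2f}T tokens")
    print(f"Break-even after serving ~{tokens:.2e} tokens")
```

Past the break-even point every additional served token favors the smaller model, which is why high inference volume justifies training well beyond the compute-optimal token count.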
import numpy as np
def plan_training_run(budget_usd, gpu_flops_per_sec=156e12,
                      gpu_cost_per_hour=1.0, tokens_per_param=20):
    """From a dollar budget, derive a compute-optimal training plan."""
    # 1. Budget -> compute
    gpu_hours = budget_usd / gpu_cost_per_hour
    C = gpu_hours * 3600 * gpu_flops_per_sec  # total FLOPs
    # 2. Compute-optimal N, D with C = 6ND and D = 20N
    #    => C = 6 * N * 20N = 120 N^2  =>  N = sqrt(C/120)
    N = np.sqrt(C / (6 * tokens_per_param))
    D = tokens_per_param * N
    # 3. Predicted loss (Chinchilla parametric)
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    L = E + A/N**alpha + B/D**beta
    print(f"Budget: ${budget_usd:,.0f}")
    print(f"Compute: {C:.1e} FLOPs")
    print(f"Optimal model: {N/1e9:.1f}B params")
    print(f"Optimal data: {D/1e9:.0f}B tokens")
    print(f"Predicted loss: {L:.3f} (ppl {np.exp(L):.1f})")
plan_training_run(100_000)  # a $100k budget
# Budget: $100,000
# Compute: 5.6e+22 FLOPs
# Optimal model: 21.6B params
# Optimal data: 433B tokens
# Predicted loss: 2.041 (ppl 7.7)
# These figures rest on the optimistic defaults above ($1/GPU-hour, 50% utilization).
Scaling Laws Quick-Reference
| Concept | Formula / value | Use |
|---|---|---|
| Training FLOPs | C ≈ 6 N D | Budget any training run |
| Loss vs scale | L = E + A/N^α + B/D^β | Chinchilla parametric law |
| Irreducible loss | E ≈ 1.69 nats | Floor; entropy of language |
| Compute-optimal ratio | D ≈ 20 N | ~20 tokens per parameter |
| Optimal scaling | N, D ∝ √C | Grow both with compute |
| Forecasting | Fit ladder, extrapolate L(C) | Predict before training |
| Over-training | D ≫ 20N | Cheaper inference, pay once |
Exercises
Exercises 1–10 are pen-and-paper or derivations; 11–18 require code.
Further reading: “Scaling Laws for Neural Language Models” (Kaplan et al., 2020) — the founding paper. “Training Compute-Optimal Large Language Models” (Hoffmann et al., 2022) — the Chinchilla paper. “Emergent Abilities of Large Language Models” (Wei et al., 2022) and “Are Emergent Abilities a Mirage?” (Schaeffer et al., 2023) for the emergence debate. “Will we run out of data?” (Villalobos et al., 2022) on the data wall. Sutton's “The Bitter Lesson” (2019) for the philosophical backdrop.
Part III Complete: Deep Learning & the Transformer
| Chapter | Title | Summary |
|---|---|---|
| Ch. 10 | Neural Network Fundamentals | perceptrons to MLPs, activations, init, normalization, dropout: the components. |
| Ch. 11 | Backpropagation in Depth | computational graphs and a from-scratch autograd engine: how gradients flow. |
| Ch. 12 | Attention Mechanisms | scaled dot-product, self-attention, multi-head: the Transformer's core operation. |
| Ch. 13 | The Transformer Architecture | positional encodings, residual stream, the full model built from scratch. |
| Ch. 14 | Tokenization | BPE and friends: how text becomes the tokens the model consumes. |
| Ch. 15 | Training Transformers | warmup, AdamW, clipping, mixed precision: the recipe for a stable run. |
| Ch. 16 | Scaling Laws | Chinchilla, the 6ND rule, compute-optimal allocation: sizing as a science. |
You have built the Transformer from first principles, learned to train it, and learned the scaling laws that tell you how big to build. But everything so far has lived on a single machine, on modest data. Part IV — Pretraining LLMs — confronts the realities of frontier-scale training: where the trillions of tokens come from and how they are curated (Chapter 17), how training is distributed across thousands of GPUs (Chapter 18), the architecture variants that make large models efficient (Chapter 19), the techniques that squeeze more out of every FLOP (Chapter 20), and how progress is measured during a months-long run (Chapter 21). The clean training loop of Chapter 15 becomes a distributed, fault-tolerant, data-hungry industrial process — and you now have the foundation to understand every part of it.