Data Collection & Curation
Detailed solutions for the exercises in Chapter 17. Try solving them yourself before checking the answers.
Solution
Assuming roughly 0.75 words per token, 15×10¹² tokens × 0.75 ≈ 1.1×10¹³ words; dividing by 100,000 words per book gives ≈ 112 million books. No human-curated corpus approaches this: all the books ever written number in the low hundreds of millions, and most are not digitally available. Frontier pretraining must therefore draw on the vast but messy open web, which is why curation (this chapter) matters so much.
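The arithmetic can be checked with a few lines of Python. This is only a sketch: the constants are the ones used in the solution (0.75 words per token, 100,000 words per book), and the function name `books_equivalent` is made up for illustration.

```python
# Constants taken from the solution text; both are rough rules of thumb.
TOKENS = 15e12            # 15 trillion training tokens
WORDS_PER_TOKEN = 0.75    # approximate English words per token
WORDS_PER_BOOK = 100_000  # assumed length of a typical book

def books_equivalent(tokens, words_per_token=WORDS_PER_TOKEN,
                     words_per_book=WORDS_PER_BOOK):
    """Convert a token budget into an equivalent number of books."""
    return tokens * words_per_token / words_per_book

books = books_equivalent(TOKENS)
print(f"{books / 1e6:.1f} million books")
```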
Solution
(1) It reduces memorization: duplicated passages are over-weighted and tend to be regurgitated verbatim. (2) It improves generalization: the model spends capacity on diverse patterns rather than re-learning copies. (3) It reduces benchmark contamination and train/test leakage. Deduplication changes WHAT the model learns, not merely how fast, so deduplicated data yields better models at an equal token budget.
Solution
Apply a random permutation to the universe of elements. The minimum-hash of a set is the element of that set appearing earliest in the permutation. The two sets share their minimum exactly when the globally-earliest element among A∪B lies in A∩B. Since the permutation is uniform, every element of A∪B is equally likely to be earliest, so the probability that it falls in A∩B is |A∩B|/|A∪B| = Jaccard(A,B). Averaging many independent hashes estimates the Jaccard similarity.
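The argument above can be tried numerically. The sketch below is one possible implementation, not the chapter's: it approximates each random permutation with a seeded 64-bit hash from `hashlib.blake2b` (an assumption), and the helper names and the example sets `A` and `B` are hypothetical.

```python
import hashlib

def _hash(seed: int, item: str) -> int:
    """Seeded 64-bit hash; each seed stands in for one random permutation."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(items, k=512):
    """For each of k seeded hashes, keep the smallest hash value in the set."""
    return [min(_hash(seed, x) for x in items) for seed in range(k)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures share their minimum."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(a, b):
    return len(a & b) / len(a | b)

# Two sets of 80 elements that overlap in 40, so the exact Jaccard is 40/120.
A = {f"w{i}" for i in range(0, 80)}
B = {f"w{i}" for i in range(40, 120)}
estimate = estimate_jaccard(minhash_signature(A), minhash_signature(B))
print(f"exact={exact_jaccard(A, B):.3f} estimate={estimate:.3f}")
```

The estimate should land near 1/3, and raising `k` tightens it at the cost of longer signatures.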
Solution
With b = 20 bands of r = 5 rows each, the probability that two documents with Jaccard similarity s collide in at least one band is 1−(1−s^r)^b, an S-curve. The steep transition (threshold) is near s ≈ (1/b)^{1/r} = (1/20)^{1/5} ≈ 0.55. Well below ~0.55 the collision probability is low; well above it, the probability approaches 1. Tuning b and r positions this threshold to separate near-duplicates (high s) from distinct documents (low s) cheaply, avoiding O(N²) all-pairs comparison.
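To see the S-curve concretely, the formula can be evaluated at a few similarity values. This is a small sketch using b = 20 and r = 5 from the solution; the function names are made up for illustration.

```python
def collision_probability(s, b=20, r=5):
    """Probability that two documents with Jaccard s share at least one band."""
    return 1.0 - (1.0 - s ** r) ** b

def approx_threshold(b=20, r=5):
    """Rule-of-thumb location of the steep part of the S-curve."""
    return (1.0 / b) ** (1.0 / r)

print(f"threshold ~ {approx_threshold():.3f}")
for s in (0.2, 0.4, 0.55, 0.7, 0.9):
    print(f"s={s:.2f}  P(candidate)={collision_probability(s):.4f}")
```

Changing `b` and `r` while keeping their product fixed shifts the threshold without changing the signature length.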
Solution
(1) Document length bounds: remove documents that are too short or too long. (2) Mean word length in a sensible range: catches gibberish and encoding errors. (3) Symbol-to-word ratio cap: removes code dumps, tables, and spam. (4) Fraction of lines starting with a bullet or ending with an ellipsis: catches navigation and boilerplate. (5) Stop-word presence (the text must contain common words like 'the' and 'and'): removes non-prose or non-natural-language text. Each targets a distinct signature of low-quality web content.
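The five rules can be sketched as a single function that reports which checks a document fails. Everything specific here is an assumption for illustration: the thresholds, the stop-word list, the choice of `#` and ellipses as "symbols", and the reading of rule (4) as lines that start with a bullet marker or end with an ellipsis.

```python
# Illustrative stop-word list; a real filter would use a fuller one.
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}

def failed_rules(text, min_words=50, max_words=100_000,
                 min_mean_len=3.0, max_mean_len=10.0,
                 max_symbol_ratio=0.1, max_bullet_frac=0.9,
                 max_ellipsis_frac=0.3, min_stop_words=2):
    """Return the names of the heuristic rules that `text` violates."""
    words = text.split()
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not words:
        return ["empty"]
    failed = []
    # (1) Document length bounds, measured in whitespace-separated words.
    if not (min_words <= len(words) <= max_words):
        failed.append("length")
    # (2) Mean word length in a sensible range.
    mean_len = sum(len(w) for w in words) / len(words)
    if not (min_mean_len <= mean_len <= max_mean_len):
        failed.append("mean_word_length")
    # (3) Symbol-to-word ratio cap (hash signs and ellipses count as symbols).
    symbols = text.count("#") + text.count("...") + text.count("…")
    if symbols / len(words) > max_symbol_ratio:
        failed.append("symbol_ratio")
    # (4) Share of lines that start with a bullet or end with an ellipsis.
    bullet_frac = sum(ln.startswith(("-", "*", "•")) for ln in lines) / len(lines)
    ellipsis_frac = sum(ln.endswith(("...", "…")) for ln in lines) / len(lines)
    if bullet_frac > max_bullet_frac or ellipsis_frac > max_ellipsis_frac:
        failed.append("bullet_or_ellipsis_lines")
    # (5) Stop-word presence.
    vocab = {w.strip(".,;:!?\"'()").lower() for w in words}
    if len(vocab & STOP_WORDS) < min_stop_words:
        failed.append("stop_words")
    return failed

clean = "The quick brown fox jumps over the lazy dog and runs to the river with a friend. " * 6
spam = "# # # # buy now # # # #"
print(failed_rules(clean))
print(failed_rules(spam))
```

Returning rule names rather than a single boolean makes it easier to see which heuristic is responsible for each rejection.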
Solution
Training a 'quality' classifier to prefer Wikipedia-like text teaches it that formal, standard-dialect English is 'good', so text in African-American Vernacular English, regional dialects, or non-standard registers gets scored as low-quality and filtered out — silently erasing those voices from the training data. Mitigation: use multiple reference corpora spanning dialects/registers, or filter on concrete quality signals (coherence, not style) rather than similarity to one privileged source.
Solution
If benchmark test items leak into training, scores are inflated (the model memorized the answers), invalidating evaluation. N-gram decontamination flags training documents that share long exact n-grams (e.g. 13-grams) with test items and removes them. It can miss PARAPHRASED contamination — a reworded version of a test question shares the meaning but not the exact n-grams — so the model can still have effectively seen the test without triggering the filter.
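A minimal sketch of exact n-gram decontamination follows, assuming whitespace tokenization and lowercasing (real pipelines normalize more carefully). The function names and the example strings are hypothetical.

```python
def ngrams(text, n=13):
    """Set of word n-grams in `text` (empty if the text has fewer than n words)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_test_index(test_items, n=13):
    """Collect every n-gram that appears in any benchmark test item."""
    index = set()
    for item in test_items:
        index |= ngrams(item, n)
    return index

def is_contaminated(doc, test_index, n=13):
    """Flag a training document that shares at least one n-gram with the test set."""
    return not ngrams(doc, n).isdisjoint(test_index)

test_q = ("what is the capital city of the country that hosted "
          "the summer olympic games in 2016")
index = build_test_index([test_q])

verbatim = "forum post: " + test_q + " answer brasilia"
paraphrase = ("which city serves as the capital of the nation where "
              "the 2016 summer olympics took place")
print(is_contaminated(verbatim, index))
print(is_contaminated(paraphrase, index))
```

The paraphrased document carries the same question but shares no 13-gram with the test item, which is the blind spot described above.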
Solution
If a domain is repeated for E epochs, its contribution to the 15T training tokens is E×its unique-token count. So Wikipedia at 3 epochs contributes 3× its unique size while web at 1 epoch contributes 1×. Given the mixing weights, divide each domain's share of the 15T by its epoch count to recover unique tokens: a domain seen 3× has only one-third as many unique tokens as its training share suggests. The point: training-token share ≠ unique-data size when domains are repeated.
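The division can be written out directly. The mixture below is purely hypothetical (the weights and epoch counts are invented for illustration and are not the exercise's numbers); only the 15T total comes from the text.

```python
TOTAL_TRAINING_TOKENS = 15e12

# Hypothetical mixture: share of training tokens and number of epochs per domain.
mixture = {
    "web":       {"weight": 0.80, "epochs": 1},
    "code":      {"weight": 0.15, "epochs": 2},
    "wikipedia": {"weight": 0.05, "epochs": 3},
}

def unique_tokens(total, mixture):
    """Unique tokens per domain = training tokens for that domain / its epoch count."""
    return {name: total * cfg["weight"] / cfg["epochs"]
            for name, cfg in mixture.items()}

for name, tokens in unique_tokens(TOTAL_TRAINING_TOKENS, mixture).items():
    share = mixture[name]["weight"]
    print(f"{name:10s} training share={share:.0%}  unique={tokens / 1e12:.3f}T")
```

Under these made-up numbers, Wikipedia supplies 5% of the training tokens but only a third of that amount in unique text.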
Solution
Model collapse is the degradation that occurs when models are trained on data generated by previous models, recursively. Each generation samples from the prior model, under-representing the rare events in the tails (low-probability content is sampled less and estimation error compounds), so the distribution's tails progressively shrink toward the mean — diversity collapses. Mitigation: anchor training with a substantial fraction of real human data each generation, preserving the true distribution's tails.
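A toy simulation can make the shrinking tails visible. This is an illustrative assumption, not the chapter's experiment: each "model" is just a Gaussian fitted to a small sample drawn from the previous generation's Gaussian, with no real data mixed back in.

```python
import random
import statistics

def simulate_collapse(generations=200, samples_per_generation=20, seed=0):
    """Recursively refit a Gaussian to samples drawn from the previous fit."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0          # generation 0: the "real" distribution
    history = [std]
    for _ in range(generations):
        data = [rng.gauss(mean, std) for _ in range(samples_per_generation)]
        mean = statistics.fmean(data)   # fit the next model to synthetic data only
        std = statistics.pstdev(data)
        history.append(std)
    return history

history = simulate_collapse()
print(f"std at generation 0:   {history[0]:.3f}")
print(f"std at generation 200: {history[-1]:.3g}")
```

Because each finite sample under-represents the tails, the fitted spread drifts downward over generations; mixing fresh real samples into `data` at every step is the toy analogue of the mitigation.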
Solution
A model that never saw toxic content can't recognize, refuse, or de-escalate it well — it lacks the representation needed to handle toxicity safely (and may be blindsided by it). Aggressive filtering also removes legitimate discussion of sensitive topics. Part V (alignment) changes the calculus: you can pretrain on a broad distribution (so the model UNDERSTANDS toxic content) and then use SFT/RLHF to teach it not to PRODUCE it — separating knowledge from behavior, rather than trying to enforce safety by data omission alone.
Solution
trafilatura extracts the main article text, discarding navigation, ads, and boilerplate. Compared to raw HTML (full of markup) and naive tag-stripping (which leaves menu/footer text and concatenated junk), trafilatura yields clean, readable prose — demonstrating why dedicated extraction, not regex tag removal, is the first curation step.
Solution
Applying the length, symbol-ratio, stop-word, and mean-word-length rules (Exercise 5) to a labeled mix of clean text and spam separates the two with high precision and recall. Reporting the confusion matrix shows that the heuristics catch most junk while rarely rejecting clean prose, which makes them a cheap, effective first-pass quality gate.
Solution
With k hash functions, the fraction of matching min-hashes estimates Jaccard (Exercise 3). Plotting the estimate against exact Jaccard for 100 pairs shows close agreement, with the standard error shrinking roughly as 1/√k as k grows. This confirms MinHash as an unbiased, tunable similarity estimator.
Solution
Banding the MinHash signatures (Exercise 4) buckets likely-duplicate documents together, so only within-bucket pairs are compared. On a corpus with planted near-duplicates, LSH recovers the large majority of true duplicate pairs at a tiny fraction of the O(N²) comparisons — the scalability that makes web-scale dedup feasible.
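The bucketing step might look like the sketch below. It assumes b = 20 bands of r = 5 rows (so signatures of length 100) and reuses a seeded-hash MinHash as a stand-in for random permutations; all function names and the three toy documents are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def _hash(seed, item):
    """Seeded 64-bit hash standing in for one random permutation."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(items, k):
    return [min(_hash(seed, x) for x in items) for seed in range(k)]

def lsh_candidate_pairs(signatures, b=20, r=5):
    """Return pairs of document ids that share a bucket in at least one band."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        if len(sig) != b * r:
            raise ValueError("signature length must equal b * r")
        for band in range(b):
            # Documents collide in a band only if all r rows of that band agree.
            key = (band, tuple(sig[band * r:(band + 1) * r]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Toy corpus: "b" is a near-duplicate of "a"; "c" shares nothing with either.
docs = {
    "a": {f"s{i}" for i in range(100)},
    "b": {f"s{i}" for i in range(2, 100)} | {"x0", "x1"},
    "c": {f"t{i}" for i in range(100)},
}
signatures = {doc_id: minhash_signature(items, k=100) for doc_id, items in docs.items()}
candidates = lsh_candidate_pairs(signatures)
print(sorted(candidates))
```

Only the candidate pairs need an exact similarity check afterwards, which is where the savings over all-pairs comparison come from.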
Solution
Regex catches structured PII (emails, phone numbers, IPs) with high precision but misses names; adding a NER model catches person names that have no fixed pattern. Comparing coverage shows the two are complementary — regex for formats, NER for entities — and together they substantially raise PII removal recall.
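The regex half of this comparison can be sketched as follows. The patterns are deliberately simple assumptions (a US-style phone format, IPv4 only) rather than production-grade rules, and the NER half is omitted because it depends on an external model.

```python
import re

# Illustrative patterns only; real pipelines use broader, locale-aware rules.
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("IP", re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b")),
    ("PHONE", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
]

def redact_structured_pii(text):
    """Replace each pattern match with a placeholder and count the replacements."""
    counts = {}
    for label, pattern in PATTERNS:
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts

sample = "Contact Jane Doe at jane.doe@example.com or 555-123-4567 from 192.168.0.1."
redacted, counts = redact_structured_pii(sample)
print(redacted)
print(counts)
```

The name "Jane Doe" passes through untouched, which is the gap a NER model is meant to close.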
Solution
Sampling domains in proportion to their mixing weights and tallying many draws yields empirical frequencies converging to the configured weights — confirming the sampler correctly realizes the intended data mixture (the practical mechanism behind Exercise 8's epoch math).
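A short sketch of such a check, using `random.choices` from the standard library. The mixture weights and the function name `sample_domains` are hypothetical.

```python
import random
from collections import Counter

def sample_domains(weights, n_draws, seed=0):
    """Draw domains in proportion to `weights` and return empirical frequencies."""
    rng = random.Random(seed)
    names = list(weights)
    draws = rng.choices(names, weights=[weights[n] for n in names], k=n_draws)
    counts = Counter(draws)
    return {name: counts[name] / n_draws for name in names}

# Hypothetical mixing weights, for illustration only.
weights = {"web": 0.80, "code": 0.15, "wikipedia": 0.05}
freqs = sample_domains(weights, n_draws=100_000)
for name in weights:
    print(f"{name:10s} configured={weights[name]:.3f} empirical={freqs[name]:.3f}")
```

With more draws, the gap between configured and empirical frequencies should keep narrowing.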
Solution
Chaining extraction → quality filtering → dedup → PII redaction → decontamination and logging the surviving token count at each stage produces a 'funnel' showing how much data each step removes (often the majority overall). The funnel is the standard way to monitor and debug a curation pipeline.
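The funnel logging can be sketched with a small driver that records counts after every stage. The stages here are toy stand-ins (a word-count quality filter, exact dedup, and email redaction); extraction and decontamination are left out for brevity, and every function name is hypothetical.

```python
import re

def count_tokens(docs):
    """Whitespace tokens as a cheap proxy for real tokenizer counts."""
    return sum(len(d.split()) for d in docs)

def quality_filter(docs, min_words=3):
    return [d for d in docs if len(d.split()) >= min_words]

def exact_dedup(docs):
    seen, kept = set(), []
    for d in docs:
        key = " ".join(d.lower().split())   # normalize case and whitespace
        if key not in seen:
            seen.add(key)
            kept.append(d)
    return kept

def redact_emails(docs):
    return [re.sub(r"\S+@\S+", "[EMAIL]", d) for d in docs]

def run_funnel(docs, stages):
    """Apply stages in order, logging (stage, documents, tokens) after each."""
    funnel = [("raw", len(docs), count_tokens(docs))]
    for name, stage in stages:
        docs = stage(docs)
        funnel.append((name, len(docs), count_tokens(docs)))
    return docs, funnel

corpus = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "buy now",
    "mail me at a@b.com please today",
]
stages = [("quality", quality_filter), ("dedup", exact_dedup), ("pii", redact_emails)]
survivors, funnel = run_funnel(corpus, stages)
for name, n_docs, n_tokens in funnel:
    print(f"{name:8s} docs={n_docs} tokens={n_tokens}")
```

A stage that suddenly removes far more or far less than usual shows up immediately in this log, which is what makes the funnel useful for debugging.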
Solution
Training two identical small models — one on raw web text, one on the curated version at equal token count — shows the curated model achieves lower validation perplexity and produces noticeably more coherent samples. This directly demonstrates the chapter's thesis: curation improves the model, not just compute efficiency (Exercise 2).