Solutions Appendix
Chapter 17

Data Collection & Curation

18 Solutions

Detailed solutions for the exercises in Chapter 17. Try solving them yourself before checking the answers.

Exercise 1 (Pen & Paper)
A 70B-parameter model is trained on 15T tokens at about 0.75 words per token. How many 100,000-word books is that equivalent to, and why does no curated corpus of this size exist?

Solution

15×10¹² tokens × 0.75 words/token ≈ 1.1×10¹³ words; dividing by 100,000 words per book gives ≈ 112 million books. No human-curated corpus approaches this: all the books ever written number in the low hundreds of millions at most, and most of them are not available as clean digital text. Frontier pretraining must therefore draw on the vast but messy open web, which is why curation (this chapter) matters so much.
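The arithmetic can be checked in a few lines. This is only a sketch of the calculation above; the constant names are illustrative and the inputs are the figures given in the exercise.

```python
# Back-of-envelope conversion from training tokens to book equivalents.
TOKENS = 15e12            # 15T training tokens (from the exercise)
WORDS_PER_TOKEN = 0.75    # words per token (from the exercise)
WORDS_PER_BOOK = 100_000  # book length assumed by the exercise

words = TOKENS * WORDS_PER_TOKEN
books = words / WORDS_PER_BOOK

print(f"{words:.3e} words")
print(f"{books / 1e6:.1f} million books")
```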

Exercise 2 (Pen & Paper)
Why does deduplication improve model quality rather than just save compute? Give three mechanisms.

Solution

(1) It prevents memorization — duplicated passages get over-weighted and are regurgitated verbatim. (2) It improves generalization — the model spends capacity on diverse patterns rather than re-learning copies. (3) It reduces benchmark contamination and train/test leakage. Deduplication changes WHAT the model learns, not merely how fast, so dedup'd data yields better models at equal token budget.

Exercise 3 (Pen & Paper)
Derive why MinHash estimates Jaccard similarity: P(min hash of A = min hash of B) = |A∩B|/|A∪B|.

Solution

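Apply a uniformly random permutation to the universe of elements (an ideal hash function with no collisions plays this role). The min-hash of a set is the element of that set appearing earliest in the permutation. The two sets share their min-hash exactly when the earliest element of A∪B lies in A∩B: if it lies in only one of the sets, that set's minimum is strictly earlier than the other's. Since the permutation is uniform, every element of A∪B is equally likely to be earliest, so the probability that it falls in A∩B is |A∩B|/|A∪B| = Jaccard(A,B). Each hash function therefore gives an unbiased 0/1 vote, and the fraction of agreeing min-hashes over many independent hash functions estimates the Jaccard similarity.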
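The same argument can be written compactly. The notation is introduced here for illustration and is not from the chapter: π is the random permutation, m_π(S) is the element of S that π places earliest, and k is the number of independent hash functions.

```latex
\begin{align*}
\Pr\big[m_\pi(A) = m_\pi(B)\big]
  &= \Pr\Big[\operatorname*{arg\,min}_{x \in A \cup B} \pi(x) \in A \cap B\Big] \\
  &= \sum_{x \in A \cap B} \frac{1}{|A \cup B|}
   = \frac{|A \cap B|}{|A \cup B|} = J(A, B), \\[4pt]
\hat{J}_k
  &= \frac{1}{k} \sum_{i=1}^{k} \mathbf{1}\big[m_{\pi_i}(A) = m_{\pi_i}(B)\big],
\qquad
\mathbb{E}\big[\hat{J}_k\big] = J,
\qquad
\operatorname{Var}\big[\hat{J}_k\big] = \frac{J(1 - J)}{k}.
\end{align*}
```

The variance line follows because each indicator is a Bernoulli(J) draw, which is why the estimate tightens as k grows.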

Exercise 4 (Pen & Paper)
Sketch the LSH S-curve 1−(1−s^r)^b for b=20 bands of r=5 rows and find its threshold.

Solution

The probability that two documents with Jaccard similarity s collide in at least one band is 1−(1−s^r)^b, an S-curve in s. The steep transition (threshold) is near s ≈ (1/b)^{1/r} = (1/20)^{1/5} ≈ 0.55, where the collision probability is about 1−1/e ≈ 0.64. Well below the threshold the probability is small (about 0.05 at s = 0.3); well above it, it is close to 1 (above 0.999 at s = 0.8). Tuning b and r positions this threshold to separate near-duplicates (high s) from distinct documents (low s) cheaply, avoiding O(N²) all-pairs comparison.
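To sketch the curve, it helps to tabulate a few points. The snippet below is a minimal sketch; the function names are illustrative, and the threshold formula is the usual approximation rather than an exact inflection point.

```python
def collision_probability(s, b=20, r=5):
    """Probability that two documents with Jaccard s share at least one band."""
    return 1.0 - (1.0 - s ** r) ** b


def approx_threshold(b=20, r=5):
    """Approximate location of the steep part of the S-curve."""
    return (1.0 / b) ** (1.0 / r)


print(f"approximate threshold: {approx_threshold():.3f}")
for s in (0.2, 0.3, 0.4, 0.5, 0.55, 0.6, 0.7, 0.8, 0.9):
    print(f"s = {s:.2f}  P(candidate) = {collision_probability(s):.4f}")
```

Plotting these points by hand already shows the S shape: nearly flat below 0.4, steep between roughly 0.45 and 0.7, and saturated above 0.8.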

Exercise 5 (Pen & Paper)
List five Gopher-style heuristic quality filters and what each targets.

Solution

(1) Document length bounds: remove too-short or too-long junk. (2) Mean word length in a sensible range: catches gibberish and encoding errors. (3) Symbol-to-word ratio cap: removes code dumps, tables, and spam. (4) Fraction of lines starting with a bullet point or ending in an ellipsis: catches navigation menus, truncated snippets, and other boilerplate. (5) Stop-word presence (the text must contain common words such as 'the' and 'and'): removes non-prose or non-natural-language text. Each targets a distinct signature of low-quality web content.

Exercise 6 (Pen & Paper)
How can a Wikipedia-based quality classifier encode bias against minority dialects? Propose a mitigation.

Solution

Training a 'quality' classifier to prefer Wikipedia-like text teaches it that formal, standard-dialect English is 'good', so text in African-American Vernacular English, regional dialects, or non-standard registers gets scored as low-quality and filtered out — silently erasing those voices from the training data. Mitigation: use multiple reference corpora spanning dialects/registers, or filter on concrete quality signals (coherence, not style) rather than similarity to one privileged source.

Exercise 7 (Pen & Paper)
Why is benchmark decontamination necessary? How does n-gram overlap work and why can it miss contamination?

Solution

If benchmark test items leak into training, scores are inflated (the model memorized the answers), invalidating evaluation. N-gram decontamination flags training documents that share long exact n-grams (e.g. 13-grams) with test items and removes them. It can miss PARAPHRASED contamination — a reworded version of a test question shares the meaning but not the exact n-grams — so the model can still have effectively seen the test without triggering the filter.
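The mechanism, and its blind spot, can be illustrated with a small sketch. The tokenizer (lowercased word characters), the function names, and the example strings are all assumptions made for this illustration; real pipelines differ in tokenization and in what they do with a flagged document.

```python
import re


def ngrams(text, n):
    """Set of word n-grams after lowercasing and dropping punctuation."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(train_doc, test_items, n=13):
    """Flag a training document that shares any exact n-gram with a test item."""
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(item, n) for item in test_items)


test_item = ("If a train leaves the station at noon traveling sixty miles per hour, "
             "how far has it gone by three in the afternoon?")
verbatim_leak = "Forum post: " + test_item + " Answer: 180 miles."
paraphrased_leak = ("A train departs at 12:00 going 60 mph. "
                    "What distance has it covered by 3 pm? Answer: 180 miles.")

print("verbatim flagged:   ", is_contaminated(verbatim_leak, [test_item]))
print("paraphrase flagged: ", is_contaminated(paraphrased_leak, [test_item]))
```

The paraphrase carries the same question and answer but shares no 13-gram with the test item, so it passes the filter untouched.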

Exercise 8 (Pen & Paper)
A data mixture up-weights Wikipedia to 3 epochs and keeps web text at 1 epoch, for 15T training tokens in total. Estimate the unique tokens per domain.

Solution

If a domain is repeated for E epochs, its contribution to the 15T training tokens is E × its unique-token count, so unique tokens = (domain's share of training tokens × 15T) / E. Wikipedia at 3 epochs contributes 3× its unique size, while web at 1 epoch contributes exactly its unique size. The exercise does not fix the mixing weights, so the answer is a recipe: take each domain's share of the 15T and divide by its epoch count. A domain seen 3× has only one-third as many unique tokens as its training share suggests. The point: training-token share ≠ unique-data size when domains are repeated.
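A worked instance makes the division concrete. The 1% / 99% split below is a hypothetical mixture chosen only for illustration, since the exercise does not state the shares; only the 15T total and the epoch counts come from the exercise.

```python
TOTAL_TRAINING_TOKENS = 15e12  # 15T, from the exercise

# Hypothetical shares of the training-token budget (assumption, not from the text).
mixture = {
    "wikipedia": {"share": 0.01, "epochs": 3},
    "web": {"share": 0.99, "epochs": 1},
}


def unique_tokens(total, share, epochs):
    """Unique tokens in a domain = its training tokens divided by its epoch count."""
    return total * share / epochs


unique = {
    name: unique_tokens(TOTAL_TRAINING_TOKENS, cfg["share"], cfg["epochs"])
    for name, cfg in mixture.items()
}

for name, value in unique.items():
    print(f"{name}: {value / 1e12:.2f}T unique tokens")
```

Under these assumed shares, Wikipedia supplies 0.15T training tokens but only 0.05T unique tokens, while web supplies 14.85T of both.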

Exercise 9 (Pen & Paper)
Explain model collapse. Why does recursive synthetic training shrink the tails of the distribution? Propose a mitigation.

Solution

Model collapse is the degradation that occurs when models are trained on data generated by previous models, recursively. Each generation samples from the prior model, under-representing the rare events in the tails (low-probability content is sampled less and estimation error compounds), so the distribution's tails progressively shrink toward the mean — diversity collapses. Mitigation: anchor training with a substantial fraction of real human data each generation, preserving the true distribution's tails.
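A toy simulation shows the effect. Here the "model" is just a Gaussian fitted to a small sample and the "real data" is a standard normal; both are stand-ins chosen for illustration, and the function name and parameters are hypothetical rather than taken from the chapter.

```python
import random
import statistics


def simulate(generations=200, sample_size=20, real_fraction=0.0, seed=0):
    """Refit a Gaussian each generation on samples from the previous fit.

    real_fraction of every generation's data is drawn from the true
    distribution N(0, 1) instead of from the previous model.
    Returns the fitted standard deviation after each generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    n_real = round(sample_size * real_fraction)
    history = [sigma]
    for _ in range(generations):
        synthetic = [rng.gauss(mu, sigma) for _ in range(sample_size - n_real)]
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        data = synthetic + real
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)  # maximum-likelihood estimate
        history.append(sigma)
    return history


pure = simulate(real_fraction=0.0)
anchored = simulate(real_fraction=0.5)
print(f"purely synthetic, final std: {pure[-1]:.6f}")
print(f"50% real data,    final std: {anchored[-1]:.6f}")
```

With purely synthetic data the fitted spread drifts toward zero because each refit slightly underestimates the variance and the errors compound; mixing in real samples each generation keeps the spread near its true value.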

Exercise 10 (Pen & Paper)
Toxicity filtering trade-off: why can removing all toxic text make a model WORSE at handling it? How does Part V change the calculus?

Solution

A model that never saw toxic content can't recognize, refuse, or de-escalate it well — it lacks the representation needed to handle toxicity safely (and may be blindsided by it). Aggressive filtering also removes legitimate discussion of sensitive topics. Part V (alignment) changes the calculus: you can pretrain on a broad distribution (so the model UNDERSTANDS toxic content) and then use SFT/RLHF to teach it not to PRODUCE it — separating knowledge from behavior, rather than trying to enforce safety by data omission alone.

Exercise 11 (Code)
Implement WARC text extraction with trafilatura; compare to raw HTML and naive tag-stripping.

Solution

Iterate over the WARC file's HTTP response records, decode each HTML payload, and pass it to trafilatura, which extracts the main article text and discards navigation, ads, and boilerplate. Compared to raw HTML (full of markup) and naive tag-stripping (which leaves menu and footer text and runs unrelated fragments together), trafilatura yields clean, readable prose. This shows why dedicated extraction, not regex tag removal, is the first curation step.
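The comparison can be sketched on a single page. The sample HTML and helper names below are made up for illustration, and reading records out of a WARC file is omitted. The naive baseline uses only the standard library; the trafilatura call is guarded so the sketch still runs where the package is not installed, and on very short pages it may return nothing.

```python
from html.parser import HTMLParser


class NaiveStripper(HTMLParser):
    """Naive baseline: keep every text node except script/style contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def naive_strip(html):
    parser = NaiveStripper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def extract_main_text(html):
    """Main-content extraction via trafilatura, or None if unavailable."""
    try:
        import trafilatura
    except ImportError:
        return None
    return trafilatura.extract(html)


SAMPLE_HTML = """<html><head><title>Dedup notes</title><style>body{margin:0}</style></head>
<body><nav><a href="/">Home</a> <a href="/about">About</a></nav>
<article><h1>Deduplication</h1><p>Near-duplicate pages inflate web corpora.</p></article>
<footer>Copyright Example Site</footer><script>console.log("ad");</script></body></html>"""

naive_text = naive_strip(SAMPLE_HTML)
print("naive:      ", naive_text)
print("trafilatura:", extract_main_text(SAMPLE_HTML))
```

The naive output keeps the menu links and the footer alongside the article sentence, which is exactly the residue a dedicated extractor is meant to drop.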

Exercise 12 (Code)
Implement the Gopher heuristic filter; report precision/recall on a mix of clean text and synthetic spam.

Solution

Applying the length, symbol-ratio, stop-word, and mean-word-length rules (Exercise 5) to a labeled mix of clean text and synthetic spam should separate the two with high precision and recall, because synthetic spam tends to violate at least one rule outright. Reporting the confusion matrix shows the heuristics catch most junk while rarely rejecting clean prose, which makes them a cheap, effective first-pass quality gate.
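One possible shape for such a filter is sketched below. The thresholds are the commonly cited Gopher-style values and should be treated as assumptions to tune, not as the chapter's reference settings; the five labeled examples are invented, and the minimum length is lowered so that single sentences can serve as test documents.

```python
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}


def gopher_keep(text, min_words=50, max_words=100_000):
    """Return True if the document passes every heuristic (assumed thresholds)."""
    words = text.split()
    n = len(words)
    if n == 0 or not (min_words <= n <= max_words):
        return False
    mean_len = sum(len(w) for w in words) / n
    if not (3 <= mean_len <= 10):
        return False
    symbols = text.count("#") + text.count("...") + text.count("…")
    if symbols / n > 0.1:
        return False
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if sum(ln.startswith(("-", "*", "•")) for ln in lines) / len(lines) > 0.9:
        return False
    if sum(ln.endswith(("...", "…")) for ln in lines) / len(lines) > 0.3:
        return False
    if sum(any(c.isalpha() for c in w) for w in words) / n < 0.8:
        return False
    present = {w.lower().strip(".,;:!?\"'()") for w in words} & STOP_WORDS
    return len(present) >= 2


def precision_recall(labeled_docs, **filter_kwargs):
    """Treat 'spam' as the positive class; a rejected document is a prediction of spam."""
    tp = fp = fn = 0
    for text, is_spam in labeled_docs:
        flagged = not gopher_keep(text, **filter_kwargs)
        tp += flagged and is_spam
        fp += flagged and not is_spam
        fn += (not flagged) and is_spam
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


LABELED = [
    ("The committee agreed to publish the results of the survey and that was the end of it.", False),
    ("We have tested the filter with a sample of pages and the results were good.", False),
    ("BUY NOW ### CHEAP ### PILLS ### CLICK ### HERE ###", True),
    ("xk qz 1 2 3 4 5 6 7 8 9 0 1 2", True),
    ("casino casino casino casino casino casino bonus bonus bonus", True),
]

precision, recall = precision_recall(LABELED, min_words=5)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

On real web data the numbers will be lower than on this hand-made set, and the per-rule rejection counts are usually more informative than the headline precision and recall.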

Exercise 13 (Code)
Implement MinHash + Jaccard estimation; verify against exact Jaccard on 100 pairs.

Solution

With k hash functions, the fraction of matching min-hashes estimates Jaccard (Exercise 3). Plotting the estimate against exact Jaccard for 100 pairs shows close agreement, with the standard error shrinking as √(J(1−J)/k) as k grows. This confirms MinHash as an unbiased, tunable similarity estimator.
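A minimal sketch of the estimator follows. It swaps true random permutations for a family of linear hash functions (a·h + b mod p) over an MD5-derived base hash, which is a common practical approximation rather than an exactly min-wise independent family; all names and the two example sets are illustrative, and non-empty sets are assumed.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # Mersenne prime used as the modulus


def base_hash(token):
    """Deterministic 64-bit hash of a string (Python's hash() is salted per run)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def make_hash_params(k, seed=0):
    """k random (a, b) pairs, one per simulated permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]


def minhash_signature(items, params):
    """Signature of a non-empty set: the minimum hashed value under each function."""
    hashed = [base_hash(x) for x in items]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b):
    return len(a & b) / len(a | b)


A = {f"shingle-{i}" for i in range(0, 100)}
B = {f"shingle-{i}" for i in range(50, 150)}

params = make_hash_params(k=512)
sig_a = minhash_signature(A, params)
sig_b = minhash_signature(B, params)
print(f"exact = {exact_jaccard(A, B):.3f}, estimate = {estimate_jaccard(sig_a, sig_b):.3f}")
```

Repeating this over 100 random pairs and several values of k gives the scatter plot the exercise asks for.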

Exercise 14 (Code Lab)
Implement LSH bucketing on MinHash; measure duplicate recall against exhaustive O(N²) comparison.

Solution

Banding the MinHash signatures (Exercise 4) buckets likely-duplicate documents together, so only within-bucket pairs are compared. On a corpus with planted near-duplicates, LSH recovers the large majority of true duplicate pairs at a tiny fraction of the O(N²) comparisons — the scalability that makes web-scale dedup feasible.
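The banding step can be sketched as follows. The four-document corpus, the word-trigram shingling, and the linear hash family are assumptions made to keep the example self-contained; b = 20 and r = 5 follow Exercise 4.

```python
import hashlib
import random
from collections import defaultdict
from itertools import combinations

PRIME = (1 << 61) - 1
BANDS, ROWS = 20, 5

_rng = random.Random(0)
PARAMS = [(_rng.randrange(1, PRIME), _rng.randrange(PRIME)) for _ in range(BANDS * ROWS)]


def shingles(text, n=3):
    """Set of word n-grams; very short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def signature(items, params):
    hashed = [int.from_bytes(hashlib.md5(x.encode("utf-8")).digest()[:8], "big") for x in items]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in params]


def lsh_candidates(signatures, bands, rows):
    """Pairs of documents whose signatures agree on every row of at least one band."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


DOCS = {
    "d0": "the quick brown fox jumps over the lazy dog near the quiet river bank at dawn",
    "d1": "the quick brown fox jumps over the lazy dog near the quiet river bank at dawn today",
    "d2": "stock markets closed higher on friday after a volatile week of trading in asia",
    "d3": "bake the bread for forty minutes until the crust turns a deep golden colour",
}

signatures = {doc_id: signature(shingles(text), PARAMS) for doc_id, text in DOCS.items()}
candidates = lsh_candidates(signatures, BANDS, ROWS)
total_pairs = len(DOCS) * (len(DOCS) - 1) // 2
print(f"candidate pairs: {sorted(candidates)} ({len(candidates)} of {total_pairs} possible)")
```

Duplicate recall is then the fraction of true near-duplicate pairs (found by exhaustive comparison on a small corpus) that appear among the candidates; each candidate pair should still be verified with an exact or estimated Jaccard check.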

Exercise 15 (Code)
Implement PII redaction (regex for emails/phones/IPs); add NER name detection; compare coverage.

Solution

Regex catches structured PII (emails, phone numbers, IPs) with high precision but misses names; adding a NER model catches person names that have no fixed pattern. Comparing coverage shows the two are complementary — regex for formats, NER for entities — and together they substantially raise PII removal recall.
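The regex half can be sketched with the standard library alone. The patterns below are deliberately simple illustrations (US-style phone numbers, IPv4 only) and will both miss and over-match cases in real data; the NER half is not shown because it depends on an external model.

```python
import re

# Order matters: IPs are replaced before phone numbers so digit runs are not confused.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "IP": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "PHONE": re.compile(
        r"(?<!\w)(?:\+?\d{1,3}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\w)"
    ),
}


def redact(text):
    """Replace each match with a [LABEL] placeholder and count replacements."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts


SAMPLE = "Contact Jane Doe at jane.doe@example.com or 555-123-4567 from 192.168.0.1."
redacted, counts = redact(SAMPLE)
print(redacted)
print(counts)
```

The person's name survives redaction, which is the gap a NER pass is meant to close.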

Exercise 16 (Code)
Implement a weighted domain sampler; verify sampled frequencies match the configured weights.

Solution

Sampling domains in proportion to their mixing weights and tallying many draws yields empirical frequencies converging to the configured weights — confirming the sampler correctly realizes the intended data mixture (the practical mechanism behind Exercise 8's epoch math).
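A minimal version of the check is sketched below. The domain names and weights are hypothetical, and the sampler draws domain labels only; a real loader would then pull the next document from the chosen domain.

```python
import random
from collections import Counter


def sample_domain_frequencies(weights, n_draws, seed=0):
    """Draw n_draws domains in proportion to weights; return empirical frequencies."""
    rng = random.Random(seed)
    domains = list(weights)
    draws = rng.choices(domains, weights=[weights[d] for d in domains], k=n_draws)
    counts = Counter(draws)
    return {d: counts[d] / n_draws for d in domains}


# Hypothetical mixing weights (assumption for illustration).
WEIGHTS = {"web": 0.70, "code": 0.15, "books": 0.10, "wikipedia": 0.05}

frequencies = sample_domain_frequencies(WEIGHTS, n_draws=100_000)
for domain, target in WEIGHTS.items():
    print(f"{domain:10s} target = {target:.3f}  observed = {frequencies[domain]:.3f}")
```

The gap between target and observed frequency shrinks roughly as 1/√n with the number of draws, so a loose tolerance is appropriate for short runs.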

Exercise 17 (Code Lab)
Build the single-node curation pipeline; report funnel statistics at each stage.

Solution

Chaining extraction → quality filtering → dedup → PII redaction → decontamination and logging the surviving token count at each stage produces a 'funnel' showing how much data each step removes (often the majority overall). The funnel is the standard way to monitor and debug a curation pipeline.
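The funnel logging itself is simple. In the sketch below the stages are toy stand-ins (a word-count filter, exact-match dedup, an email regex) and whitespace-separated words stand in for tokens; only the pattern of recording counts after each stage is the point.

```python
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def count_tokens(docs):
    """Whitespace word count as a stand-in for a real tokenizer."""
    return sum(len(d.split()) for d in docs)


def quality_filter(docs, min_words=5):
    return [d for d in docs if len(d.split()) >= min_words]


def exact_dedup(docs):
    return list(dict.fromkeys(docs))  # keeps first occurrence, preserves order


def redact_pii(docs):
    return [EMAIL.sub("[EMAIL]", d) for d in docs]


def run_funnel(docs, stages):
    """Apply stages in order, recording (stage, documents, tokens) after each."""
    report = [("input", len(docs), count_tokens(docs))]
    for name, stage in stages:
        docs = stage(docs)
        report.append((name, len(docs), count_tokens(docs)))
    return docs, report


RAW_DOCS = [
    "the cat sat on the mat and looked at the bird",
    "the cat sat on the mat and looked at the bird",
    "buy now",
    "write to alice@example.com for the full report on the data",
]
STAGES = [("quality", quality_filter), ("dedup", exact_dedup), ("pii", redact_pii)]

final_docs, report = run_funnel(RAW_DOCS, STAGES)
for name, n_docs, n_tokens in report:
    print(f"{name:8s} docs = {n_docs:3d}  tokens = {n_tokens:4d}")
```

A stage that suddenly removes far more or far less than usual is the first thing the funnel makes visible.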

Exercise 18 (Code, Challenge)
Data-quality ablation: train tiny LMs on unfiltered vs curated text; compare perplexity and samples.

Solution

Training two identical small models — one on raw web text, one on the curated version at equal token count — shows the curated model achieves lower validation perplexity and produces noticeably more coherent samples. This directly demonstrates the chapter's thesis: curation improves the model, not just compute efficiency (Exercise 2).