IMPLEMENTING LLMs · ZERO TO HERO

Introduction

On the difference between reading and knowing

There is a particular kind of disappointment that everyone learning about large language models eventually meets. You read a clear explanation of attention. You nod. You feel that warm glow of comprehension settle over you like a blanket. And then someone asks you to actually build it — to turn those tidy equations into code that runs — and the glow evaporates. The formula that seemed obvious on the page refuses to become a function. The intuition that felt solid turns out to be a thin film stretched over a void.

This book is built around a stubborn conviction: that the void is the whole point. You do not understand a Transformer because you have read about Transformers. You understand it the moment you have built one with your own hands, watched it fail in ways the textbook never warned you about, hunted the bug at an unreasonable hour, and finally watched the thing produce a sentence that makes sense. Understanding is not a state you arrive at by reading. It is the residue left behind by doing.

The journey, honestly described

The subtitle is not marketing. Zero to hero is a literal description of the path this book walks. It begins where the real foundations are — the linear algebra, calculus, probability, and information theory that everything else stands on — because a model is, underneath all the hype, an enormous and beautifully orchestrated application of those four subjects. From there it climbs, deliberately and without skipping rungs, through classical machine learning, the mechanics of neural networks and backpropagation, and then to the invention that changed everything: attention, and the Transformer built around it.

Having built the architecture, you learn to feed it (the unglamorous, decisive craft of data curation) and to train it across many machines without it falling over. You learn the laws that govern how models scale, and then the quieter art of alignment: teaching a raw predictor of text to actually be helpful, honest, and safe. You learn to make it fast and cheap to run, to give it tools and memory and the ability to retrieve, to let it see and hear, and finally to serve it to millions. The last part stands at the frontier itself, among mixtures of experts, million-token contexts, and autonomous agents, and ends, fittingly, with the open problems that no one has solved yet, you included.

Thirty-five chapters, seven parts, one continuous staircase from the first dot product to the edge of what is known. Nothing is assumed except curiosity and a willingness to do the work.

Why the exercises are the spine, not the appendix

Here is the part most books get backwards. In many technical books the exercises are an afterthought — a dutiful handful tacked onto the end of a chapter, half of them unanswered, there to make the book feel rigorous rather than to make you capable. In this book they are the spine. The prose exists to prepare you for them; they are where the learning actually happens.

There are 676 exercises and reflections across the thirty-five chapters, and they are not filler. They come in deliberate kinds. The pen-and-paper problems force the mathematics through your own hand: you do not really know why we subtract the max before a softmax, or why WᵀW is always symmetric, until you have proven it yourself. The derivations make you rebuild the load-bearing results (backpropagation through LayerNorm, the DPO loss falling out of the Bradley-Terry model, the C ≈ 6ND estimate of training compute) from first principles. The code labs make you implement the real thing: a working autograd engine, multi-head attention, a KV cache, a LoRA layer, a tiny GPT you train from scratch on Shakespeare. The challenges stretch you toward genuine projects. And the reflections ask you to think about whether models truly reason, about what interpretability would change, about what you will build next.
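To give a taste of the two pen-and-paper facts mentioned above, here is a minimal sketch in plain Python. It is not taken from the book's labs; the function names (`softmax_naive`, `softmax_stable`, `gram`, `is_symmetric`) are made up for this illustration. It shows what subtracting the max buys you in a softmax, and checks numerically that WᵀW comes out symmetric. A numerical check is not a proof, so the exercise still stands.

```python
import math


def softmax_naive(xs):
    # Direct translation of the formula. math.exp raises OverflowError
    # once an input is large enough (around 710 for 64-bit floats).
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_stable(xs):
    # Subtracting the max makes every exponent <= 0, so each exp() is at
    # most 1 and cannot overflow. The shift cancels in the ratio, so the
    # result is mathematically the same distribution.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gram(W):
    # Computes W^T W for a matrix W given as a list of rows.
    rows, cols = len(W), len(W[0])
    return [
        [sum(W[k][i] * W[k][j] for k in range(rows)) for j in range(cols)]
        for i in range(cols)
    ]


def is_symmetric(G):
    # Entry (i, j) must equal entry (j, i) for every pair of indices.
    n = len(G)
    return all(G[i][j] == G[j][i] for i in range(n) for j in range(n))
```

Calling `softmax_naive([1000.0, 1001.0, 1002.0])` raises `OverflowError`, while `softmax_stable` on the same input returns the same probabilities it gives for `[0.0, 1.0, 2.0]`. For the second fact, `is_symmetric(gram(W))` holds for any rectangular `W` you try, because entry (i, j) and entry (j, i) are the same sum of products.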

Do them. Not all at once, not joylessly, but do them. The chapters can be read on a train; the exercises cannot be faked. They are the difference between a person who can talk about language models and a person who can make them.

And why every one of them has a solution

Being stuck, alone, with no way to check whether you are right — this is where most self-teaching quietly dies. You attempt a derivation, arrive at something, and have no idea whether it is correct or subtly broken. The momentum bleeds away. The book closes and does not reopen.

So every single one of those 676 exercises has a worked solution. The companion Solutions Appendix — seven volumes, one per part — carries a complete, detailed answer to every problem in the book: the full reasoning for each derivation with the key equations laid out, the numbers carried through for every pen-and-paper question, the approach and the critical lines of code for every lab, and a thoughtful, two-sided discussion for every reflection. When you finish wrestling with an exercise, the answer is there, waiting, so you can confirm that the result you fought for is the right one.

A word on how to use them, because it matters. The solutions are not there to be copied; they are there to be earned against. The honest sequence is always the same: attempt the problem first, sit in the discomfort of being stuck, push until you have something — and then look. The understanding you keep is the understanding you reached just before you turned the page, not the understanding you read off it. For the coding exercises in particular, treat the solutions as guides rather than gospel: there are many correct implementations, and the value was always in the building, not in the matching.

How to read this book

Linearly, if you can. The staircase is built so each step rests on the last, and the rewards compound. But the book is also written so that a practitioner can drop into the part they need — alignment, say, or inference — and find it self-contained enough to be useful, with pointers back to the foundations when they matter.

Keep a terminal open. Keep paper nearby. Expect to be confused, repeatedly, because confusion is not the absence of progress — it is the feeling of progress happening. And when you reach the end, you will not have read a book about large language models. You will have built one, from the mathematics up, and you will know — in the only way that counts, the way that survives being asked to prove it — exactly how these strange and remarkable systems work.

Turn the page. Start with a vector. We will get to the frontier together.

— Let's begin.