Part III: The Transformer
Chapter 13

Positional Encoding

Sinusoidal, learned, RoPE, ALiBi — encoding position
20 Exercises
13.1

You have built every component: linear layers, activations, normalization, dropout, residuals (Chapter 10), the autograd to train them (Chapter 11), and multi-head attention (Chapter 12). This chapter assembles them into the complete Transformer. We start with the bird's-eye view, then build each piece, then write the whole thing from scratch.

The Three-Stage Structure

Every Transformer, regardless of variant, has the same three-stage structure: an input stage that turns tokens into vectors, a stack of identical processing blocks, and an output stage that turns vectors back into token predictions.

Arch Stack: the three stages of a decoder-only Transformer (output at top, input at bottom)

| Stage | Shape / role |
|---|---|
| Logits over vocabulary | (T, V) |
| Unembedding / LM head | (d → V) |
| Final LayerNorm | normalize |
| Transformer Block × N | the stack |
| + Positional encoding | inject order |
| Token embedding | (V → d) |
| Input token IDs | (T,) |

The middle stage — the stack of N identical blocks — does the heavy lifting. Each block refines the representation, mixing context via attention and processing it via the feed-forward network. The input and output stages are comparatively simple: a lookup table in, a linear projection out.

Same Block, Stacked Deep
A Transformer is mostly the same block repeated. GPT-2 small stacks 12 identical blocks; GPT-3 stacks 96; the largest models stack over 100. The block's design is what this chapter is about — once you understand one block, you understand the whole network, because the rest is just repetition.
This uniformity is also why Transformers scale so predictably (Chapter 16): adding capacity mostly means adding more of the same block or widening it, and the empirical scaling laws describe how loss improves as you do.
13.2

The input stage converts a sequence of integer token IDs into a sequence of vectors the network can process. This requires two things: a token embedding that maps each token to a learned vector, and a positional encoding that tells the network where each token sits in the sequence.

Token Embeddings

The token embedding is a lookup table: a matrix E of shape (V, d) where row i is the embedding of token i. Looking up a sequence of token IDs gives a sequence of d-dimensional vectors. This is exactly the embedding idea from Chapter 8, now learned end-to-end with the rest of the model.

Why Position Must Be Injected

Here is a subtle but critical fact: self-attention is permutation-equivariant. If you shuffle the input tokens, the outputs shuffle the same way, because attention has no inherent notion of order. Without position information, 'dog bites man' and 'man bites dog' produce the same per-token outputs, merely reordered, so the model cannot tell who bit whom. Position must be added explicitly.
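The equivariance claim is easy to check numerically. The sketch below uses a hypothetical single-head `self_attention` with random weights and no mask (not the Chapter 12 class), and compares the output for a shuffled input with the shuffled output for the original input.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no mask and no positional signal."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(T)                      # a random reordering of the T tokens
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Shuffling the inputs shuffles the outputs in exactly the same way.
equivariant = np.allclose(out_perm, out[perm])
print("permutation-equivariant:", equivariant)
```

Adding a positional encoding to `X` before the call breaks this symmetry, which is the point of the input stage.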


Five Ways to Encode Position

| Method | How | Used in |
|---|---|---|
| Sinusoidal | Fixed sin/cos of varying frequency | Original Transformer (2017) |
| Learned absolute | A learned vector per position | BERT, GPT-2 |
| Rotary (RoPE) | Rotate Q, K by position-dependent angle | LLaMA, GPT-NeoX, most modern LLMs |
| ALiBi | Linear bias on attention scores by distance | BLOOM, some long-context models |
| Relative | Encode pairwise position differences | T5, Transformer-XL |

Sinusoidal Positional Encoding

The original Transformer used fixed sinusoids of geometrically increasing wavelength. Each dimension of the encoding oscillates at a different frequency, giving every position a unique fingerprint and — crucially — letting the model attend to relative positions via linear combinations.

Text: Sinusoidal positional encoding
PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

# pos = position, i = dimension index, d = model dimension
# Low dimensions oscillate fast, high dimensions oscillate slowly
Python: Token + sinusoidal positional embeddings from scratch
import numpy as np

def sinusoidal_encoding(T, d):
    """Returns (T, d) fixed positional encodings."""
    pos = np.arange(T)[:, None]              # (T, 1)
    i   = np.arange(d)[None, :]              # (1, d)
    angle = pos / np.power(10000, (2*(i//2)) / d)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angle[:, 0::2])  # even dims
    pe[:, 1::2] = np.cos(angle[:, 1::2])  # odd dims
    return pe

class EmbeddingLayer:
    def __init__(self, vocab, d, seed=0):
        rng = np.random.default_rng(seed)
        self.E = rng.normal(0, 0.02, (vocab, d))  # token table
        self.d = d

    def forward(self, token_ids):  # (T,) integer IDs
        T = len(token_ids)
        tok = self.E[token_ids]             # (T, d) token embeddings
        pos = sinusoidal_encoding(T, self.d)  # (T, d) position
        return tok + pos                    # add them: (T, d)

# Token embedding answers 'what token is this?'
# Positional encoding answers 'where in the sequence is it?'
# Their SUM carries both, and the network learns to disentangle them.
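The claim that sinusoids expose relative position through linear maps can be verified numerically. In the sketch below, `shift_matrix` is a hypothetical helper (not part of the chapter's code) that builds a block-diagonal matrix of 2×2 rotations, one per frequency; it assumes an even `d`. Applying it to PE(pos) should reproduce PE(pos + k) for every pos.

```python
import numpy as np

def sinusoidal_encoding(T, d):
    pos = np.arange(T)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe

def shift_matrix(k, d):
    """(d, d) matrix M_k such that PE(pos + k) = M_k @ PE(pos). Assumes even d."""
    M = np.zeros((d, d))
    for j in range(d // 2):
        w = 1.0 / 10000 ** (2 * j / d)         # frequency of the j-th (sin, cos) pair
        c, s = np.cos(w * k), np.sin(w * k)
        # Angle-addition identities for sin(w(pos+k)) and cos(w(pos+k))
        M[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return M

T, d, k = 32, 16, 5
pe = sinusoidal_encoding(T, d)
M = shift_matrix(k, d)

shifted = pe[:-k] @ M.T                        # apply M_k to every row PE(pos)
max_err = np.abs(shifted - pe[k:]).max()       # compare with PE(pos + k)
print("max error:", max_err)
```

Note that `M` depends only on the offset `k`, never on `pos`, which is what lets a linear layer pick out "k positions back" regardless of where it sits in the sequence.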
ML Connection: RoPE Dominates Modern LLMs
Most LLMs since 2022 (LLaMA, GPT-NeoX, Mistral, Gemma) use Rotary Position Embedding (RoPE). Instead of adding a position vector, RoPE rotates the query and key vectors by an angle proportional to their position, so that the dot product q·k naturally depends on the relative distance between positions.
RoPE's advantages: it encodes relative position directly in the attention score, adds no parameters, and tends to adapt to sequences longer than the training length more readily than learned absolute embeddings (in practice usually with additional frequency-scaling techniques). You will implement RoPE in the exercises and meet it again in Chapter 33 on long context.
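A minimal sketch of the rotation, to make the relative-distance property concrete. The `rope` function here is illustrative and rotates consecutive pairs of dimensions; real implementations differ in how they pair dimensions, so treat the layout as an assumption.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of x (shape (d,), d even) by angles pos * theta_j."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # (d/2,) one frequency per pair
    ang = pos * theta
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

score_a = rope(q, 3) @ rope(k, 1)              # positions (3, 1): offset 2
score_b = rope(q, 103) @ rope(k, 101)          # positions (103, 101): same offset 2
score_c = rope(q, 3) @ rope(k, 2)              # offset 1: generally a different score

same_offset_same_score = np.isclose(score_a, score_b)
norm_preserved = np.isclose(np.linalg.norm(rope(q, 7)), np.linalg.norm(q))
print(same_offset_same_score, norm_preserved, score_a, score_c)
```

Because each pair is rotated, vector norms are unchanged and only the angle between q and k carries position, which is why no extra parameters are needed.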
13.3

The residual connections introduced in Chapter 10 are more than an optimization trick in the Transformer — they form the residual stream, a powerful conceptual lens (Elhage et al., 2021) for understanding how the whole network communicates. Every block reads from and writes to this shared stream.

Reframing the Block as Read-Write

Each sublayer (attention or FFN) does not replace the representation — it adds to it. The running sum x flows unchanged through every residual connection; each sublayer reads the current x, computes an update, and adds it back. The stream is a shared workspace that all blocks contribute to.

Text: The residual stream view
x₀ = embedding + position           # initial stream
x₁ = x₀ + Attn(LN(x₀))              # attention writes to stream
x₂ = x₁ + FFN(LN(x₁))               # FFN writes to stream
...  (repeat for N blocks)
logits = Unembed(LN(x_final))       # read final stream
Intuition: Why Residuals Make Deep Transformers Trainable
Because each sublayer ADDS to the stream rather than replacing it, the gradient has a direct path from the loss back to every layer: the identity path x → x carries gradient unchanged. This is the same idea as the LSTM cell-state highway from Chapter 9, applied across depth instead of time.
Without residuals, a 96-layer Transformer would suffer the vanishing-gradient problem just as a 96-step RNN does. The residual stream is what lets gradients reach the earliest layers, making very deep Transformers trainable.

Pre-LN vs Post-LN

Where you place the LayerNorm relative to the residual add is one of the most consequential design choices. The original Transformer used Post-LN; nearly all modern LLMs use Pre-LN, which keeps the residual stream clean and stabilizes training at depth.

| Post-LN (original, 2017) | Pre-LN (modern) |
|---|---|
| x = LN(x + Sublayer(x)) | x = x + Sublayer(LN(x)) |
| Norm AFTER the residual add | Norm BEFORE the sublayer |
| Residual stream gets normalized | Residual stream stays clean |
| Needs careful LR warmup | Trains stably without warmup tricks |
| Can diverge at great depth | Scales to 100+ layers reliably |
| Used in: original Transformer, BERT | Used in: GPT-2+, LLaMA, most LLMs |
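To see what "the residual stream stays clean" means in practice, the sketch below runs both orderings through 24 layers. The `sublayer` function is a hypothetical stand-in (a single tanh layer) for attention or the FFN, and the LayerNorm has no learned gain or bias; only the placement of the norm is being illustrated.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer(x, W):
    """Hypothetical stand-in for attention or the FFN: any map (T, d) -> (T, d)."""
    return np.tanh(x @ W)

rng = np.random.default_rng(0)
T, d, n_layers = 4, 16, 24
x0 = rng.normal(size=(T, d))
Ws = [rng.normal(0, 1 / np.sqrt(d), (d, d)) for _ in range(n_layers)]

x_pre, x_post = x0.copy(), x0.copy()
total_update = np.zeros_like(x0)
for W in Ws:
    update = sublayer(layer_norm(x_pre), W)    # Pre-LN: norm only the sublayer input
    total_update += update
    x_pre = x_pre + update                     # the stream itself is never normalized
    x_post = layer_norm(x_post + sublayer(x_post, W))  # Post-LN: norm the stream

# Pre-LN: the final stream is the input plus the plain sum of every update.
pre_is_plain_sum = np.allclose(x_pre, x0 + total_update)
# Post-LN: each layer rescales the whole stream, so every row ends with unit variance.
post_row_var = x_post.var(-1)
print(pre_is_plain_sum, post_row_var)
```

In the Pre-LN case the input `x0` survives untouched inside the final stream, which is the identity path that gradients follow.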
13.4

After attention mixes information across positions, the feed-forward network (FFN) processes each position independently. It is deceptively simple (two linear layers with a nonlinearity), yet with the usual 4× expansion it holds roughly two-thirds of each block's parameters and is increasingly understood to store much of the model's factual knowledge.

Text: The standard FFN (per position)
FFN(x) = W₂ · act(W₁ x + b₁) + b₂

W₁: (d → 4d)    expand to a wider hidden dimension
act: ReLU (original Transformer) or GELU; SwiGLU is a gated variant that adds a third weight matrix
W₂: (4d → d)    project back to model dimension

The FFN expands the representation to a wider hidden dimension (typically 4× the model dimension), applies a nonlinearity, then projects back. The expansion gives the network room to compute rich nonlinear features per position; the projection returns to the residual-stream dimension so the output can be added back.
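The "two-thirds" figure follows from simple counting: attention has four d×d matrices, while the FFN has two d×4d matrices. The helper below is an illustrative sketch (the name `block_param_counts` is not from the chapter) that counts weights and biases and ignores LayerNorm.

```python
def block_param_counts(d, ffn_mult=4, bias=True):
    """Parameter counts for one block's attention and FFN (LayerNorm excluded)."""
    attn = 4 * d * d + (4 * d if bias else 0)             # W_Q, W_K, W_V, W_O
    hidden = ffn_mult * d
    ffn = 2 * d * hidden + ((hidden + d) if bias else 0)  # W1 (+b1), W2 (+b2)
    return attn, ffn

attn, ffn = block_param_counts(768)            # d = 768, as in GPT-2 small
ffn_fraction = ffn / (attn + ffn)
print(f"attention: {attn:,}  ffn: {ffn:,}  ffn share: {ffn_fraction:.3f}")
```

With a different expansion factor the share changes accordingly, so the two-thirds rule is specific to the 4× convention.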

ML Connection: FFNs as Key-Value Memories
Geva et al. (2021) showed that FFN layers act like key-value memories: the first weight matrix W₁ detects patterns (keys), and the second matrix W₂ writes associated information (values) into the residual stream. Specific neurons activate for specific concepts — a 'Canada' neuron, a 'past tense' neuron.
This is why FFNs are thought to store factual knowledge. Editing model facts (the ROME and MEMIT methods) works by surgically modifying FFN weights. The FFN is not just a generic nonlinearity — it is the Transformer's long-term memory.
Python: Feed-forward network (standard and SwiGLU)
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

class FeedForward:  # standard GELU FFN
    def __init__(self, d, hidden=None, seed=0):
        hidden = hidden or 4 * d              # 4x expansion
        rng = np.random.default_rng(seed)
        # Scale each matrix by 1/sqrt(fan_in) so activations keep a similar variance
        self.W1 = rng.normal(0, 1 / np.sqrt(d), (d, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 1 / np.sqrt(hidden), (hidden, d))
        self.b2 = np.zeros(d)

    def forward(self, x):  # (T, d) -> (T, d)
        return gelu(x @ self.W1 + self.b1) @ self.W2 + self.b2

class SwiGLU_FFN:  # gated FFN used in LLaMA (no biases)
    def __init__(self, d, seed=0):
        hidden = int(8 / 3 * d)               # three matrices: param-matched to a 4d standard FFN
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 1 / np.sqrt(d), (d, hidden))        # gate
        self.V = rng.normal(0, 1 / np.sqrt(d), (d, hidden))        # value
        self.W2 = rng.normal(0, 1 / np.sqrt(hidden), (hidden, d))  # output projection

    def forward(self, x):  # (T, d) -> (T, d)
        gate = x @ self.W
        swish = gate / (1 + np.exp(-gate))    # Swish/SiLU: gate * sigmoid(gate)
        return (swish * (x @ self.V)) @ self.W2  # gated, then project
13.5

We now have every piece. The Transformer block combines a multi-head attention sublayer and a feed-forward sublayer, each wrapped in a Pre-LN residual connection. This is the unit that gets stacked N times.

Text: Pre-LN Transformer block (Pseudocode)
function TransformerBlock(x, mask):
    # Sublayer 1: multi-head self-attention
    a = MultiHeadAttention(LayerNorm(x), mask)
    x = x + Dropout(a)              # residual add

    # Sublayer 2: position-wise feed-forward
    f = FeedForward(LayerNorm(x))
    x = x + Dropout(f)              # residual add

    return x
Python: The complete Transformer block from scratch
import numpy as np

# Relies on MultiHeadAttention (Chapter 12) and FeedForward (Section 13.4).
# As the call below implies, MultiHeadAttention.forward(x, mask) is taken to
# return a pair (output, attention_weights). Dropout is omitted for brevity.

class LayerNorm:
    def __init__(self, d, eps=1e-5):
        self.g = np.ones(d)     # learned gain
        self.b = np.zeros(d)    # learned bias
        self.eps = eps

    def forward(self, x):  # normalize each position over its d features
        mu = x.mean(-1, keepdims=True)
        var = x.var(-1, keepdims=True)
        return self.g * (x - mu) / np.sqrt(var + self.eps) + self.b

class TransformerBlock:
    def __init__(self, d, n_heads, seed=0):
        self.attn = MultiHeadAttention(d, n_heads, seed=seed)
        self.ffn  = FeedForward(d, seed=seed+1)
        self.ln1  = LayerNorm(d)   # before attention
        self.ln2  = LayerNorm(d)   # before FFN

    def forward(self, x, mask=None):  # x: (T, d)
        # Pre-LN: normalize, sublayer, residual add
        attn_out, _ = self.attn.forward(self.ln1.forward(x), mask)
        x = x + attn_out                  # residual 1
        x = x + self.ffn.forward(self.ln2.forward(x))  # residual 2
        return x

# This is the entire block. Stack N of these and you have a Transformer body.

Shape Trace: Data through one block (T=16, d=512, H=8)

| Operation | Shape | Note |
|---|---|---|
| input x | (16, 512) | from previous block |
| LayerNorm(x) | (16, 512) | normalized, shape unchanged |
| MultiHeadAttention | (16, 512) | context mixed across positions |
| x + attn_out | (16, 512) | residual add |
| LayerNorm(x) | (16, 512) | normalized again |
| FeedForward | (16, 512) | per-position processing |
| x + ffn_out | (16, 512) | residual add → next block |
13.6

The same block can be arranged in three ways, giving three model families. The difference comes down to two choices: is the self-attention masked (causal) or not, and is there a separate encoder feeding the decoder via cross-attention?

| Architecture | Attention | Best for | Examples |
|---|---|---|---|
| Encoder-only | Bidirectional (no mask) | Understanding, classification | BERT, RoBERTa |
| Decoder-only | Causal (masked) | Generation, language modeling | GPT, LLaMA, Claude |
| Encoder-decoder | Encoder bidir. + decoder causal + cross-attn | Seq-to-seq (translation) | T5, BART, original Transformer |

Why Decoder-Only Won for LLMs

The largest and most capable language models (GPT, LLaMA, Claude, Gemini) are decoder-only. This was not obvious in 2018–2019, when encoder-only (BERT) and encoder-decoder (T5) models were at least as prominent. The decoder-only design won for LLMs because of its simplicity and the power of the next-token objective.

One objective, one architecture: next-token prediction needs only a causal decoder — no separate encoder, no cross-attention, no masking scheme beyond causal.
Unified interface: every task (translation, Q&A, summarization, code) becomes 'continue this text', so one model serves all tasks via prompting.
Efficient training: every token position provides a training signal (predict the next one), making maximal use of every sequence.
Clean scaling: the uniform stack scales predictably, and the scaling laws of Chapter 16 were established on decoder-only models.
History: The Architecture Convergence
In 2018–2020, the field was split: Google bet on encoder-only (BERT) and encoder-decoder (T5), while OpenAI bet on decoder-only (GPT). By 2023, nearly the entire frontier had converged on decoder-only: LLaMA, Mistral, and Gemini are documented as decoder-only, and GPT-4 and Claude are widely understood to be as well, although their architectures have not been published in detail.
Encoder-only models remain dominant for embeddings and classification where you need to encode a fixed input. But for the generative, general-purpose models that define the LLM era, decoder-only is the unchallenged design.
13.7

After the last block, the residual stream holds a rich representation of each position. The output stage converts this back into a probability distribution over the vocabulary: a final LayerNorm, then a linear projection (the unembedding or 'LM head') from model dimension d to vocabulary size V, then softmax.

Text: The output stage
x_final = LayerNorm(x_N)            # (T, d)
logits  = x_final @ W_U             # (T, V)   unembedding
probs   = softmax(logits)           # (T, V)   per-position distributions

Weight Tying

A common trick: tie the unembedding matrix W_U to the token embedding matrix E (using Eᵀ as the unembedding). This saves V×d parameters (about 38.6 million for GPT-2, with V = 50,257 and d = 768) and often improves quality, since the same vocabulary geometry serves both reading and writing tokens.

Train Note: Weight Tying Saves Millions of Parameters
For a 50,000-token vocabulary and d=768, the embedding and unembedding matrices each hold about 38.4M parameters, or about 77M combined. Tying them halves this to about 38.4M in total and ties the 'meaning' of a token on input to its prediction on output.
GPT-2 and many models use weight tying. Some very large models untie them, finding that at scale the extra parameters help. Like many architecture choices, the right answer depends on scale.
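The saving is plain arithmetic, and in NumPy tying costs nothing extra because a transpose is a view onto the same memory. The sketch below uses GPT-2's published sizes for the count and a tiny stand-in table (8 tokens, 4 dimensions, chosen only for illustration) to show the sharing.

```python
import numpy as np

V, d = 50257, 768              # GPT-2 vocabulary size and model dimension
saved = V * d                  # parameters saved by not storing a separate W_U
print(f"parameters saved by tying: {saved:,}")

E = np.zeros((8, 4))           # tiny stand-in embedding table: (V=8, d=4)
W_U = E.T                      # transposed view of E, shape (d, V); no copy is made
E[3, 1] = 1.0                  # an update to the embedding table ...
tied = W_U[1, 3] == 1.0        # ... is immediately visible through W_U
shares = np.shares_memory(E, W_U)
print(tied, shares)
```

One consequence is that any in-place update to `E` during training also updates the unembedding, so gradients from both uses must be accumulated into the same array.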
Python: The output stage and a complete forward pass
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class LMHead:  # uses LayerNorm from Section 13.5
    def __init__(self, embedding_matrix):       # embedding_matrix E: (V, d)
        self.W_U = embedding_matrix.T   # weight tying: (d, V), a view that shares memory with E
        self.ln_f = LayerNorm(embedding_matrix.shape[1])

    def forward(self, x):  # x: (T, d)
        x = self.ln_f.forward(x)        # final norm
        logits = x @ self.W_U           # (T, V)
        return logits

# The logits at position t are the model's prediction for token t+1.
# Apply softmax for probabilities, or argmax for greedy decoding.
13.8

We now assemble everything into a complete, working decoder-only Transformer — a miniature GPT. This is the culmination of Part III: a model you understand line by line, from token IDs to output logits.

Python: Code Lab: a complete GPT from scratch (forward pass)
import numpy as np

class GPT:
    """A minimal decoder-only Transformer (GPT-style)."""
    def __init__(self, vocab, d=256, n_layers=6, n_heads=8, seed=0):
        rng = np.random.default_rng(seed)
        self.E = rng.normal(0, 0.02, (vocab, d))   # token embeddings
        self.d = d
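        # Offset the seeds so no two components draw from the same random stream:
        # the embedding table uses `seed`; block i uses seed+1+2i (attention)
        # and seed+2+2i (FFN, set inside TransformerBlock as seed+1).
        self.blocks = [TransformerBlock(d, n_heads, seed=seed + 1 + 2 * i)
                       for i in range(n_layers)]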
        self.head = LMHead(self.E)           # weight-tied output

    def forward(self, token_ids):  # (T,) integer IDs
        T = len(token_ids)
        # 1. Input stage: embed + position
        x = self.E[token_ids] + sinusoidal_encoding(T, self.d)  # (T, d)
        # 2. Causal mask (decoder-only)
        mask = np.tril(np.ones((T, T), dtype=bool))
        # 3. The block stack
        for block in self.blocks:
            x = block.forward(x, mask)        # (T, d)
        # 4. Output stage
        return self.head.forward(x)           # (T, V) logits

# Instantiate and run a forward pass
model = GPT(vocab=1000, d=256, n_layers=6, n_heads=8)
tokens = np.array([5, 42, 17, 3, 99])
logits = model.forward(tokens)
print(f"logits shape: {logits.shape}")  # (5, 1000)
print(f"next-token prediction at pos 0: {logits[0].argmax()}")
# Each row predicts the next token. Untrained, predictions are random;
# after training on text (Chapter 15), they become coherent language.

Shape Trace: Full GPT forward pass (T=5, d=256, V=1000)

| Operation | Shape | Note |
|---|---|---|
| token_ids | (5,) | integer token IDs |
| E[token_ids] | (5, 256) | token embeddings |
| + positional | (5, 256) | order injected |
| block × 6 | (5, 256) | context mixed and processed |
| final LayerNorm | (5, 256) | normalize |
| @ W_U | (5, 1000) | logits over vocab |
You Have Built a Transformer
The GPT class above, together with the components from this chapter and Chapter 12, is a complete Transformer. It is the same architecture as GPT-2, GPT-3, and LLaMA — those models differ only in scale (more layers, wider d, larger vocab), a few refinements (RoPE, RMSNorm, SwiGLU), and the enormous training data and compute of Chapters 15–16.
Everything from here — pretraining, scaling, alignment, inference — builds on this architecture. You now understand, from first principles, the machine at the center of the modern AI revolution.
13.9

A trained decoder-only Transformer generates text autoregressively: it predicts the next token, appends it to the sequence, and repeats. The causal mask ensures each prediction depends only on previous tokens, so this loop is consistent with how the model was trained.

Text: Autoregressive generation (Pseudocode)
function generate(model, prompt, max_tokens):
    tokens = prompt
    for _ in range(max_tokens):
        logits = model.forward(tokens)    # (T, V)
        next_logits = logits[-1]          # last position predicts next
        next_token = sample(next_logits)   # greedy / top-k / nucleus
        tokens = tokens + [next_token]      # append and repeat
        if next_token == EOS: break
    return tokens

Sampling Strategies

| Strategy | How | Effect |
|---|---|---|
| Greedy | Always pick argmax | Deterministic; can be repetitive |
| Temperature | Scale logits by 1/τ before softmax (τ is the temperature, not the sequence length T) | τ > 1 more random, τ < 1 more peaked |
| Top-k | Sample from the k highest-probability tokens | Cuts off the long tail |
| Nucleus (top-p) | Sample from the smallest set whose cumulative probability reaches p | Adapts cutoff to confidence |
| Beam search | Track the b best sequences | Better for translation, not open text |
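The first four strategies in the table fit into one small function. This is a sketch under stated assumptions: `sample` and its parameter names are illustrative, `temperature=0` is used as a convention for greedy decoding, and ties at the top-k boundary are all kept.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample(logits, rng, temperature=1.0, top_k=None, top_p=None):
    """Sample one token ID from a 1-D logits vector."""
    if temperature == 0:
        return int(np.argmax(logits))                      # greedy
    logits = logits / temperature                          # temperature scaling
    if top_k is not None:
        kth = np.sort(logits)[-top_k]                      # k-th largest logit
        logits = np.where(logits < kth, -np.inf, logits)   # drop everything below it
    probs = softmax(logits)
    if top_p is not None:
        order = np.argsort(-probs)                         # most likely first
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]    # smallest set reaching top_p
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs = probs * mask
        probs = probs / probs.sum()                        # renormalize the nucleus
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
greedy = sample(logits, rng, temperature=0)
topk_draws = {sample(logits, rng, top_k=2) for _ in range(200)}
topp_draws = {sample(logits, rng, top_p=0.5) for _ in range(200)}
print(greedy, topk_draws, topp_draws)
```

With these logits, token 0 alone carries more than half the probability mass, so a nucleus of p = 0.5 contains only that token, while top-k with k = 2 can also return token 1.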
Train Note: The KV-Cache Makes Generation Fast
Naive generation recomputes attention over the whole sequence at every step — O(T²) per token, O(T³) total. The KV-cache stores the keys and values of past tokens so each new token only computes its own Q, K, V and attends to the cache: O(T) per token.
This is why the causal mask matters for efficiency, not just correctness: because position t never attends to the future, the keys and values of past tokens never change and can be cached. Chapter 27 covers inference optimization in depth.
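A single-head sketch shows why caching is safe: feeding tokens one at a time and reusing stored keys and values gives the same outputs as recomputing causal attention over the whole sequence. The `KVCache` class is hypothetical and covers one head of one layer; a full model would keep one such cache per head and per layer.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(X, Wq, Wk, Wv):
    """Full recomputation: single-head causal attention over all T positions."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    T = X.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    return softmax(scores) @ V

class KVCache:
    """Hypothetical minimal cache for one attention head."""
    def __init__(self):
        self.K, self.V = [], []

    def step(self, x_new, Wq, Wk, Wv):
        # Only the newest token's q, k, v are computed; older k, v are reused.
        q = x_new @ Wq
        self.K.append(x_new @ Wk)
        self.V.append(x_new @ Wv)
        K, V = np.stack(self.K), np.stack(self.V)          # (t, d) each
        weights = softmax(K @ q / np.sqrt(q.shape[-1]))    # (t,) over cached positions
        return weights @ V

rng = np.random.default_rng(0)
T, d = 6, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

full = causal_attention(X, Wq, Wk, Wv)
cache = KVCache()
incremental = np.stack([cache.step(x, Wq, Wk, Wv) for x in X])
matches = np.allclose(full, incremental)
print("cached output matches full recomputation:", matches)
```

Each `step` does work proportional to the number of cached positions, which is the O(T) per-token cost described above.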
13.10

Every real Transformer is a specific configuration of the architecture you just built. Here is how the major models fill in the blanks — the same skeleton, different sizes and refinements.

| Model | Type | Layers | d_model | Refinements |
|---|---|---|---|---|
| GPT-2 small | Decoder | 12 | 768 | Learned pos, GELU, LayerNorm, tied |
| GPT-3 | Decoder | 96 | 12288 | Same as GPT-2, vastly scaled |
| BERT-base | Encoder | 12 | 768 | Learned pos, bidirectional, MLM |
| T5-base | Enc-Dec | 12+12 | 768 | Relative pos, cross-attention |
| LLaMA-2 7B | Decoder | 32 | 4096 | RoPE, RMSNorm, SwiGLU, untied |
| LLaMA-2 70B | Decoder | 80 | 8192 | RoPE, RMSNorm, SwiGLU, GQA |

Notice the pattern: the architecture is stable, but two things change over time. Scale grows relentlessly (12 → 96 layers; 768 → 12288 dimensions), and a handful of refinements accumulate (RoPE replaces learned positions; RMSNorm replaces LayerNorm; SwiGLU replaces GELU; GQA reduces KV memory). The core, a stack of residual blocks that alternate attention and FFN, is unchanged from 2017; only the placement of the norm has moved from Post-LN to Pre-LN.
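The layer count and width in the table are enough for a back-of-envelope size estimate. The sketch below uses the common approximation of 12·d² parameters per block (4d² for attention plus 8d² for a 4× FFN) and ignores biases, norms, and positional tables. The vocabulary size of 50,257 for GPT-3 is an assumption carried over from GPT-2's tokenizer, and `approx_params` is an illustrative name.

```python
def approx_params(n_layers, d, vocab=0, tied=True):
    """Rough parameter count: 12*d^2 per block plus the embedding table(s)."""
    embed = vocab * d * (1 if tied else 2)
    return 12 * n_layers * d * d + embed

gpt3 = approx_params(n_layers=96, d=12288, vocab=50257)
print(f"GPT-3 estimate: {gpt3 / 1e9:.1f}B parameters")
```

The estimate lands close to GPT-3's quoted 175B, which shows how little besides depth and width determines model size. The 12·d² rule does not apply directly to the LLaMA rows, whose SwiGLU FFN uses three matrices and a different hidden width.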

ML Connection: The Modern LLM Recipe
The 2024 'default' decoder-only LLM combines: RoPE positional encoding, RMSNorm (Pre-LN), SwiGLU feed-forward, grouped-query attention (GQA) for efficient inference, and untied embeddings at large scale. This recipe — LLaMA-style — is the de facto standard that most open models follow.
Each refinement is a small, empirically-validated improvement over the original 2017 design. None changes the fundamental architecture; together they represent seven years of incremental engineering on a remarkably durable foundation.
13.11

A Transformer can have a subtle bug and still train — just worse — making errors hard to spot. Here are the most common implementation pitfalls and how to catch them.

| Pitfall | Symptom | Fix |
|---|---|---|
| Forgot √d_k scaling | Slow/stuck training | Divide scores by √d_k |
| Wrong mask broadcasting | Model peeks at future | Verify causal mask shape (T, T) |
| Forgot positional encoding | Bag-of-words behaviour | Add position before block 1 |
| Post-LN at depth | Divergence past ~12 layers | Use Pre-LN |
| No gradient clipping | NaN loss spikes | Clip grad norm to ~1.0 |
| Mask applied after softmax | Future leaks in | Mask BEFORE softmax (set -∞) |
| Tied weights, wrong transpose | Shape error or garbage | W_U = Eᵀ, check shapes |
Pitfall: The Most Dangerous Bug: A Mask That Almost Works
A causal mask applied with the wrong broadcasting can leak a little future information — enough to inflate training metrics (the model 'cheats') but invisible until evaluation, when generation quality is mysteriously poor. The model trained fine; it just learned to rely on information it won't have at inference.
Always test generation, not just training loss. A model that cheats during training will have suspiciously low training loss and disappointing generation. The overfit-one-batch test from Chapter 10 plus a generation sanity-check catches most architecture bugs.
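A cheap automated check complements the generation test: perturb the tokens after position t and confirm that outputs at positions up to t do not move. The sketch below applies that check to a deliberately simplified single-head attention (one shared projection `W`, a hypothetical simplification) in a correct version and in the "mask after softmax" version from the pitfall table.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W, mask_before_softmax=True):
    """Single-head causal attention with one shared projection, for brevity."""
    T = X.shape[0]
    causal = np.tril(np.ones((T, T), dtype=bool))
    Q = K = V = X @ W
    scores = Q @ K.T / np.sqrt(X.shape[1])
    if mask_before_softmax:
        weights = softmax(np.where(causal, scores, -np.inf))
    else:
        # BUG: the softmax normalizer has already seen the future scores
        weights = softmax(scores) * causal
    return weights @ V

def leaks_future(fn, X, t, rng):
    """Perturb tokens after position t; outputs at positions <= t must not change."""
    X2 = X.copy()
    X2[t + 1:] += rng.normal(size=X2[t + 1:].shape)
    return not np.allclose(fn(X)[: t + 1], fn(X2)[: t + 1])

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
W = rng.normal(size=(8, 8)) / np.sqrt(8)

good_leaks = leaks_future(lambda x: attention(x, W, True), X, 2, rng)
bad_leaks = leaks_future(lambda x: attention(x, W, False), X, 2, rng)
print("correct mask leaks:", good_leaks, "| mask-after-softmax leaks:", bad_leaks)
```

The same `leaks_future` check can be pointed at a whole model's forward pass, where it would also catch a mask that broadcasts with the wrong shape.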
Train Note: Validate Against a Reference
When implementing a Transformer from scratch, validate against a trusted reference (Hugging Face transformers, nanoGPT). Load the same weights into both, run the same input, and confirm the logits match to within floating-point tolerance.
This single test — numerical equivalence to a reference on a fixed input — catches nearly every architecture bug. It is the gold standard for verifying a from-scratch implementation is correct before investing in training.
13.12

Architecture Quick-Reference

| Component | Role | Modern choice |
|---|---|---|
| Token embedding | Token → vector | Learned table (V, d) |
| Positional encoding | Inject order | RoPE |
| Self-attention | Mix across positions | Multi-head + causal mask |
| Feed-forward | Process per position | SwiGLU, ~8/3 d hidden |
| Normalization | Stabilize | RMSNorm, Pre-LN |
| Residual stream | Communication channel | Identity skip connections |
| LM head | Vector → logits | Linear (often weight-tied) |

Exercises

Exercises 1–10 are pen-and-paper or derivations; 11–20 require code.

Exercise 1: Pen & Paper
Prove that self-attention without positional encoding is permutation-equivariant: permuting the input rows permutes the output rows identically. Why does this make positional encoding necessary?
Exercise 2: Derive
Show that the sinusoidal encoding lets the model attend to relative positions: express PE(pos+k) as a linear function of PE(pos).
Exercise 3: Pen & Paper
Count the parameters in a single Transformer block (d=512, H=8, FFN hidden=2048). Break down attention vs FFN. What fraction is in the FFN?
Exercise 4: Pen & Paper
Explain the residual stream view. Why does writing x = x + Sublayer(LN(x)) (Pre-LN) keep the stream cleaner than x = LN(x + Sublayer(x)) (Post-LN)?
Exercise 5: Pen & Paper
For a decoder-only model with vocab V=50257, d=768, 12 layers, estimate the total parameter count. Where do most parameters live?
Exercise 6: Pen & Paper
Explain weight tying. How many parameters does it save for V=50257, d=768? What is the conceptual justification?
Exercise 7: Pen & Paper
Compare encoder-only, decoder-only, and encoder-decoder in terms of (a) attention masking, (b) training objective, (c) typical tasks. Why did decoder-only win for LLMs?
Exercise 8: Derive
Implement RoPE on paper: show how rotating q and k by angle proportional to position makes q·k depend only on the relative position difference.
Exercise 9: Pen & Paper
Why does the KV-cache work only because of the causal mask? What property of causal attention makes past keys and values reusable?
Exercise 10: Pen & Paper
A model trains with low loss but generates poorly. List three architecture bugs (from Section 13.11) that could cause this and how to distinguish them.
Exercise 11: Code
Implement sinusoidal positional encoding and visualize it as a heatmap (position × dimension). Confirm low dimensions oscillate fast, high dimensions slowly.
Exercise 12: Code
Implement and compare sinusoidal, learned, and rotary positional encodings on a small sequence task. Which extrapolates best to sequences longer than training length?
Exercise 13: Code
Implement the complete Transformer block (Pre-LN, MHA, FFN, residuals) from scratch. Verify the output shape equals the input shape for any sequence length.
Exercise 14: Code Lab
Build the complete GPT class from Section 13.8. Verify shapes flow correctly end-to-end. Confirm it produces (T, V) logits for any token sequence.
Exercise 15: Code
Implement weight tying: share the embedding matrix between input and output. Verify the parameter count drops by V×d and the forward pass still works.
Exercise 16: Code
Implement autoregressive generation with greedy, temperature, top-k, and nucleus sampling. On an untrained model, confirm the sampling distributions differ as expected.
Exercise 17: Code Lab
Implement the KV-cache for generation. Measure the speedup vs naive recomputation for generating 100 tokens. Confirm it produces identical outputs.
Exercise 18: Code
Convert your decoder-only model to an encoder by removing the causal mask. Verify that with no mask, every position attends to every other (full attention matrix).
Exercise 19: Code
Load GPT-2 weights from Hugging Face into a from-scratch implementation. Verify your logits match the reference to within 1e-4 on a fixed input — the gold-standard correctness test.
Exercise 20: Code (Challenge)
Build a complete trainable nanoGPT: implement the full forward AND backward pass (using your Chapter 11 autograd or PyTorch), train it on a small text corpus (e.g., Shakespeare), and generate samples. This is the capstone of Part III — a Transformer you built and trained from scratch.

Further reading: “Attention Is All You Need” (Vaswani et al., 2017) — the original architecture. “Language Models are Unsupervised Multitask Learners” (Radford et al., 2019) for GPT-2's decoder-only design. “RoFormer” (Su et al., 2021) for RoPE. “LLaMA” (Touvron et al., 2023) for the modern recipe. Andrej Karpathy's nanoGPT and his 'Let's build GPT' video — the best hands-on companion to this chapter. “A Mathematical Framework for Transformer Circuits” (Elhage et al., 2021) for the residual-stream view.


Next → Chapter 14: Tokenization

You have built a complete Transformer that maps token IDs to predictions — but where do token IDs come from? Chapter 14 fills the one remaining gap: how raw text becomes the integer sequences the model consumes. We will build Byte-Pair Encoding from scratch, compare it to WordPiece and Unigram, explore the surprising ways tokenization shapes model behaviour (arithmetic, multilingual fairness, the 'SolidGoldMagikarp' glitch tokens), and understand why tokenization is both essential and a persistent source of model quirks.
