Solutions Appendix

Chapter 10

Neural Network Fundamentals

20 Solutions

Detailed solutions for the exercises in Chapter 10. Try solving them yourself before checking the answers.

Exercise 1 (Pen & Paper)

Prove a single perceptron cannot represent XOR.

Solution

XOR requires output 1 for (0,1) and (1,0), and 0 for (0,0) and (1,1). A perceptron fires when w₁x₁+w₂x₂+b > 0. The four constraints are: b ≤ 0 (for (0,0)→0); w₁+b > 0 and w₂+b > 0 (for the two positives); w₁+w₂+b ≤ 0 (for (1,1)→0). Adding the two positive inequalities gives w₁+w₂+2b > 0. Subtracting the last constraint, w₁+w₂+b ≤ 0, leaves b > 0, which contradicts the first constraint b ≤ 0. No linear boundary separates XOR; you need a nonlinearity and a hidden layer.

Exercise 2 (Pen & Paper)

Show a linear-activation MLP collapses to one linear layer. Why are nonlinearities essential?

Solution

Stacking linear maps gives W₂(W₁x+b₁)+b₂ = (W₂W₁)x + (W₂b₁+b₂), which is itself an affine map W'x+b'. Any depth of linear layers is therefore equivalent to a single linear layer — no extra expressive power. Nonlinearities between layers break this collapse, letting the network compose features and approximate non-linear functions (universal approximation). Without them, depth is pointless.
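The collapse can be checked numerically. The sketch below uses small random matrices (the 3→4→2 shapes, the seed, and the helper names are illustrative assumptions) and compares the two-layer output with the single merged layer W' = W₂W₁, b' = W₂b₁+b₂.

```python
import random

def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product A @ B for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]

rng = random.Random(0)  # fixed seed, chosen arbitrarily

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical layer sizes: 3 inputs -> 4 hidden -> 2 outputs.
W1, b1 = rand_matrix(4, 3), [rng.uniform(-1, 1) for _ in range(4)]
W2, b2 = rand_matrix(2, 4), [rng.uniform(-1, 1) for _ in range(2)]
x = [rng.uniform(-1, 1) for _ in range(3)]

# Two stacked linear layers with no activation in between.
two_layer = vec_add(matvec(W2, vec_add(matvec(W1, x), b1)), b2)

# The single equivalent layer: W' = W2 W1, b' = W2 b1 + b2.
W_eff = matmul(W2, W1)
b_eff = vec_add(matvec(W2, b1), b2)
one_layer = vec_add(matvec(W_eff, x), b_eff)

max_diff = max(abs(a - b) for a, b in zip(two_layer, one_layer))
print(max_diff)  # difference is at floating-point rounding level
```

Inserting any nonlinearity between the two `matvec` calls breaks the equality, which is the point of the exercise.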

Exercise 3 (Pen & Paper)

Construct explicit ReLU 2→2→1 weights computing XOR.

Solution

Let the hidden units be h₁ = ReLU(x₁+x₂−0.5) (an OR-like unit) and h₂ = ReLU(x₁+x₂−1.5) (an AND-like unit), and set the output to y = ReLU(h₁ − 3h₂). For (0,0), h=(0,0) and y=0; for (0,1) and (1,0), h=(0.5,0) and y=0.5; for (1,1), h=(1.5,0.5) and y=1.5−1.5=0. The output is positive exactly on the single-1 inputs, and doubling the output weights (y = ReLU(2h₁ − 6h₂)) makes it exactly 1 there. An output weight of −2 on h₂ is not enough with these biases, since (1,1) would give 1.5−1.0=0.5. The construction shows one hidden unit detecting 'at least one' and the other 'both', with a weighted difference of the two giving XOR: the canonical demonstration that one hidden layer suffices.
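A construction like this is easy to verify by evaluating all four inputs. The sketch below keeps the two hidden units defined above and assumes output weights (2, −6) with zero output bias, an illustrative choice made here so that the (1,1) case cancels exactly and the single-1 cases output 1.

```python
def relu(z):
    return max(0.0, z)

def xor_net(x1, x2):
    """2 -> 2 -> 1 ReLU network with hand-set weights."""
    h1 = relu(x1 + x2 - 0.5)  # OR-like: active when at least one input is 1
    h2 = relu(x1 + x2 - 1.5)  # AND-like: active only when both inputs are 1
    # Assumed output weights (2, -6): 2*1.5 - 6*0.5 = 0 for the (1,1) input.
    return relu(2.0 * h1 - 6.0 * h2)

outputs = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
for inputs, y in outputs.items():
    print(inputs, y)
```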

Exercise 4 (Pen & Paper)

Derive the gradient of GELU(x)=x·Φ(x).

Solution

By the product rule, GELU'(x) = Φ(x) + x·Φ'(x). Since Φ is the standard normal CDF, Φ'(x) = φ(x) = (1/√(2π))e^{−x²/2}, the standard normal density. So GELU'(x) = Φ(x) + x·φ(x). Unlike ReLU's step-shaped derivative, this is smooth, and it dips slightly below zero for x below roughly −0.75 (reaching about −0.13 at x = −√2), giving GELU its gentle, differentiable gating.
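The closed form can be sanity-checked against a finite-difference estimate. This is a minimal sketch using `math.erf` for Φ; the test points and step size are arbitrary choices.

```python
import math

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gelu(x):
    return x * std_normal_cdf(x)

def gelu_grad(x):
    # Product rule: d/dx [x * Phi(x)] = Phi(x) + x * phi(x)
    return std_normal_cdf(x) + x * std_normal_pdf(x)

def central_diff(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2.0 * h)

points = [-3.0, -1.0, -0.5, 0.0, 0.5, 2.0]
max_err = max(abs(gelu_grad(x) - central_diff(gelu, x)) for x in points)
print(max_err)          # analytic and numerical gradients agree closely
print(gelu_grad(-1.0))  # negative: the derivative dips below zero here
```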

Exercise 5 (Derive)

Derive He initialization Var(W)=2/n_in for a ReLU layer.

Solution

For a pre-activation z = Σ w_i x_i with n_in independent terms and zero-mean weights, Var(z) = n_in·Var(w)·E[x²]. To keep the signal scale stable across layers, we want E[x²] at a layer's output to match E[x²] at its input. When z is symmetric about zero, ReLU zeroes the negative half, so exactly half of the second moment survives: E[ReLU(z)²] = ½Var(z) = ½·n_in·Var(w)·E[x²]. Setting this equal to E[x²] requires Var(w) = 2/n_in, which is He initialization. (Xavier's 1/n_in assumes a symmetric activation that is roughly linear near zero; ReLU needs the extra factor of 2.)

Exercise 6 (Pen & Paper)

Show LayerNorm is invariant to rescaling and shifting its input: LN(αx+β)=LN(x) for α>0 (before the learned γ, β are applied). Why is this useful?

Solution

LayerNorm subtracts the mean and divides by the standard deviation across features. Replacing x by αx+β shifts the mean to αμ+β and scales the std to |α|σ, so the normalized value becomes α(x−μ)/(|α|σ) = sign(α)·(x−μ)/σ. For α>0 both α and β cancel and the result is exactly the normalized x (for α<0 it is negated; the small ε in the denominator is ignored here). This scale/shift invariance stabilizes training: the layer's output distribution is fixed regardless of the magnitude of incoming activations, so gradients don't explode or vanish from drifting activation scales.
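A small numerical check of the invariance, as a sketch: the feature vector and the values of α and β are arbitrary, and ε is set to 0 so the cancellation is exact up to rounding. The last comparison shows what happens for a negative α, where the normalized output changes sign.

```python
import math

def layer_norm(x, eps=0.0):
    """Normalize one feature vector to zero mean and unit variance."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

x = [0.3, -1.2, 2.5, 0.7, -0.4]  # arbitrary example features
alpha, beta = 3.0, 10.0          # arbitrary scale and shift

base = layer_norm(x)
scaled = layer_norm([alpha * v + beta for v in x])
negated = layer_norm([-alpha * v + beta for v in x])

# Positive scale: identical output. Negative scale: output flips sign.
max_diff_pos = max(abs(a - b) for a, b in zip(base, scaled))
max_diff_neg = max(abs(a + b) for a, b in zip(base, negated))
print(max_diff_pos, max_diff_neg)
```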

Exercise 7 (Pen & Paper)

Prove inverted dropout's 1/(1−p) scaling preserves E[y]=E[x].

Solution

With keep-probability (1−p), each unit is kept with that probability and scaled by 1/(1−p): E[y] = (1−p)·(x/(1−p)) + p·0 = x. So the expected activation equals x. Because the expectation is preserved at training time, no rescaling is needed at inference — you simply run the network with all units active and no scaling, which is why inverted dropout is the standard form.

Exercise 8 (Pen & Paper)

Compare ReLU FFN (d→4d→d) vs SwiGLU param counts. What hidden dim matches them?

Solution

The standard FFN has two weight matrices d×4d and 4d×d: ≈ 8d² parameters. SwiGLU uses three matrices (gate, up, down), each d×d_ff: ≈ 3·d·d_ff. Setting 3·d·d_ff = 8d² gives d_ff = (8/3)d ≈ 2.67d. This is why SwiGLU FFNs use a hidden dimension of about ⅔·4d — to stay parameter-matched to the standard FFN despite having three matrices.
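The arithmetic can be reproduced for a concrete width. The sketch below ignores bias terms, as the counts above do, and uses d = 4096 purely as an example value.

```python
def ffn_params(d, mult=4):
    """Standard FFN: d x (mult*d) up-projection plus (mult*d) x d down-projection."""
    return 2 * d * (mult * d)

def swiglu_params(d, d_ff):
    """SwiGLU FFN: gate, up, and down projections, each d x d_ff."""
    return 3 * d * d_ff

d = 4096                  # example model width (an assumption)
d_ff = round(8 * d / 3)   # parameter-matched hidden size, about 2.67 * d

standard = ffn_params(d)
swiglu = swiglu_params(d, d_ff)
ratio = swiglu / standard
print(d_ff, standard, swiglu, ratio)
```

In practice the matched hidden size is often rounded to a hardware-friendly multiple, so real models land near (8/3)d rather than exactly on it.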

Exercise 9 (Pen & Paper)

Why does BatchNorm fail for a batch-size-1 variable-length Transformer while LayerNorm works?

Solution

BatchNorm normalizes each feature across the batch dimension, so with batch size 1 there is no batch to estimate statistics from (variance is zero/undefined), and with variable-length sequences the per-position batch statistics are ill-defined. LayerNorm normalizes across the feature dimension within each token independently, so it needs no batch and is unaffected by sequence length or batch size — which is exactly why Transformers use LayerNorm.

Exercise 10 (Pen & Paper)

Estimate the gradient reaching layer 1 of a 50-layer sigmoid net vs ReLU.

Solution

Sigmoid's derivative peaks at 0.25, so backprop through 50 layers multiplies by at most 0.25⁵⁰ ≈ 10⁻³⁰ — the gradient at layer 1 is astronomically smaller than at the output (catastrophic vanishing). ReLU's derivative is 1 on the active path, so gradients reach layer 1 at order 1 (no systematic decay, modulo the weight matrices). This is the quantitative reason ReLU-family activations replaced sigmoids in deep networks.
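The estimate is a one-line computation. The sketch below considers only the activation-derivative factors and leaves out the weight matrices, as the bound above does.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

depth = 50
peak = sigmoid_grad(0.0)        # the derivative is largest at x = 0
sigmoid_bound = peak ** depth   # best case: every layer sits at its peak
relu_active = 1.0 ** depth      # ReLU passes gradient unchanged when active

print(peak, sigmoid_bound, math.log10(sigmoid_bound), relu_active)
```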

Exercise 11 (Code)

Implement a perceptron; confirm it learns AND/OR/NAND but fails XOR; plot boundaries.

Solution

The perceptron learning rule converges for the linearly-separable AND, OR, NAND (the boundary is a single line that cleanly separates the classes), but on XOR it never converges — the weights oscillate because no line separates the data (Exercise 1). Plotting shows three clean separating lines and one perpetually-failing case.
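A minimal sketch of the experiment, without the plotting step. The zero initialization, learning rate of 1, and 100-epoch cap are assumptions; convergence is declared when a full pass over the four points produces no updates.

```python
def train_perceptron(data, epochs=100, lr=1.0):
    """Perceptron learning rule; returns (weights, converged flag)."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in data:
            pred = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
            delta = target - pred
            if delta != 0:
                # Move the boundary toward the misclassified point.
                w1 += lr * delta * x1
                w2 += lr * delta * x2
                b += lr * delta
                errors += 1
        if errors == 0:
            return (w1, w2, b), True
    return (w1, w2, b), False

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
gates = {
    "AND": [0, 0, 0, 1],
    "OR": [0, 1, 1, 1],
    "NAND": [1, 1, 1, 0],
    "XOR": [0, 1, 1, 0],
}

results = {}
for name, targets in gates.items():
    weights, converged = train_perceptron(list(zip(inputs, targets)))
    results[name] = converged
    print(name, weights, converged)
```

For the three separable gates, the returned weights define the line w₁x₁+w₂x₂+b = 0 that would be drawn as the decision boundary.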

Exercise 12 (Code)

Implement an MLP solving XOR; visualize that the hidden representation is linearly separable.

Solution

A 2→2→1 ReLU MLP trained on XOR drives the loss to ~0. Plotting the 4 input points in the 2-D hidden space shows the two XOR-positive points have been moved so a single line now separates them from the negatives — the hidden layer learned a representation in which the problem became linearly separable, the visual essence of why depth helps.

Exercise 13 (Code)

Plot 7 activations and their derivatives on [−5,5]; annotate max gradient and saturation.

Solution

Plotting sigmoid, tanh, ReLU, LeakyReLU, ELU, GELU, and SiLU (the activation inside SwiGLU) with their derivatives shows: sigmoid and tanh saturate (derivative → 0) for large |x|, with maximum gradients of 0.25 and 1 respectively; ReLU has derivative 1 for x>0 and 0 for x<0 (the dead region); GELU and SiLU are smooth, with derivatives that peak slightly above 1 (about 1.1) and stay non-zero for small negative x. The annotations make clear why saturating activations cause vanishing gradients.
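The annotated values can be computed before any plotting. The sketch below evaluates each derivative on a grid over [−5, 5]; the LeakyReLU slope of 0.01 and ELU α of 1 are assumed defaults, and the ReLU-family derivatives at exactly x = 0 follow the x ≤ 0 branch by convention.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

LEAKY_SLOPE = 0.01  # assumed negative-side slope
ELU_ALPHA = 1.0     # assumed ELU scale

derivatives = {
    "sigmoid": lambda x: sigmoid(x) * (1.0 - sigmoid(x)),
    "tanh": lambda x: 1.0 - math.tanh(x) ** 2,
    "relu": lambda x: 1.0 if x > 0 else 0.0,
    "leaky_relu": lambda x: 1.0 if x > 0 else LEAKY_SLOPE,
    "elu": lambda x: 1.0 if x > 0 else ELU_ALPHA * math.exp(x),
    "gelu": lambda x: std_normal_cdf(x) + x * std_normal_pdf(x),
    "silu": lambda x: sigmoid(x) * (1.0 + x * (1.0 - sigmoid(x))),
}

grid = [-5.0 + 0.01 * i for i in range(1001)]  # covers [-5, 5]

# Maximum gradient on the grid, and the gradient at the left edge (saturation).
max_grad = {name: max(d(x) for x in grid) for name, d in derivatives.items()}
edge_grad = {name: d(-5.0) for name, d in derivatives.items()}

for name in derivatives:
    print(f"{name:10s} max={max_grad[name]:.4f} grad(-5)={edge_grad[name]:.4f}")
```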

Exercise 14 (Code Lab)

Signal propagation through 50 ReLU layers with tiny/huge/He init; plot activation std vs depth.

Solution

With too-small init the activation std shrinks geometrically toward 0 (signal dies); with too-large init it explodes; with He init (Var=2/n_in) the std stays roughly constant across all 50 layers. The three curves — decaying, exploding, flat — are the empirical confirmation of the He derivation in Exercise 5.
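A compact version of the experiment, as a sketch under stated assumptions: the width of 64, a single random input vector, and the scale factors 0.1 and 20 for the "tiny" and "huge" cases are all illustrative choices. Weight variance is set to scale / n_in, so scale = 2 is He initialization. With a layer this narrow the He curve is noisy, but it stays within a few orders of magnitude while the other two move by dozens.

```python
import math
import random
import statistics

def propagate(scale, depth=50, width=64, seed=0):
    """Push one random vector through `depth` ReLU layers.

    Weights are drawn fresh per layer with Var(w) = scale / width.
    Returns the activation standard deviation after each layer.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    sigma = math.sqrt(scale / width)
    stds = []
    for _ in range(depth):
        # Each output unit is ReLU of a fresh random linear combination of x.
        x = [max(0.0, sum(rng.gauss(0.0, sigma) * xi for xi in x))
             for _ in range(width)]
        stds.append(statistics.pstdev(x))
    return stds

curves = {
    "tiny": propagate(0.1),   # variance far below 2 / n_in
    "he": propagate(2.0),     # He initialization
    "huge": propagate(20.0),  # variance far above 2 / n_in
}

for name, stds in curves.items():
    print(f"{name:5s} layer 1: {stds[0]:.3e}   layer 50: {stds[-1]:.3e}")
```

Plotting each list in `curves` against layer index on a log axis gives the three curves described above.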

Exercise 15 (Code)

Implement BatchNorm, LayerNorm, RMSNorm; apply to a (4,768) tensor with large mean; verify.

Solution

BatchNorm normalizes per-feature across the batch; LayerNorm per-token across features (removing the large mean); RMSNorm divides by the root-mean-square without subtracting the mean (so it is scale- but not shift-invariant). Checking outputs confirms that BatchNorm gives zero mean and unit variance for each feature across the batch, LayerNorm gives zero mean and unit variance for each token across features, and RMSNorm only normalizes magnitude, leaving the mean in place. RMSNorm is cheaper and is used in LLaMA.
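The three normalizations differ only in which axis they reduce over and whether the mean is subtracted. The sketch below omits the learned scale and shift parameters and BatchNorm's running statistics; the input is a (4, 768) list of lists drawn around a mean of 100, and ε = 1e-5 is an assumed default.

```python
import math
import random

EPS = 1e-5  # assumed numerical-stability constant

def mean(values):
    return sum(values) / len(values)

def standardize(values):
    """Zero-mean, unit-variance version of a 1-D list."""
    mu = mean(values)
    var = mean([(v - mu) ** 2 for v in values])
    return [(v - mu) / math.sqrt(var + EPS) for v in values]

def layer_norm(X):
    # Normalize each row (token) across its features.
    return [standardize(row) for row in X]

def batch_norm(X):
    # Normalize each column (feature) across the batch, then transpose back.
    cols = [standardize(list(col)) for col in zip(*X)]
    return [list(row) for row in zip(*cols)]

def rms_norm(X):
    # Divide each row by its root-mean-square; the mean is not removed.
    out = []
    for row in X:
        rms = math.sqrt(mean([v * v for v in row]) + EPS)
        out.append([v / rms for v in row])
    return out

rng = random.Random(0)
X = [[rng.gauss(100.0, 1.0) for _ in range(768)] for _ in range(4)]

ln, bn, rn = layer_norm(X), batch_norm(X), rms_norm(X)

ln_row_means = [mean(row) for row in ln]                # about 0 per token
bn_col_means = [mean(list(col)) for col in zip(*bn)]    # about 0 per feature
rn_row_means = [mean(row) for row in rn]                # about 1: mean survives
rn_row_ms = [mean([v * v for v in row]) for row in rn]  # mean square about 1

print(ln_row_means[0], bn_col_means[0], rn_row_means[0], rn_row_ms[0])
```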

Exercise 16 (Code)

Implement inverted dropout; verify expected activation preserved over 10,000 trials at p=0.1,0.3,0.5.

Solution

Averaging the dropped-and-rescaled activations over many trials yields a mean essentially equal to the input for every p, confirming E[y]=E[x] (Exercise 7). The empirical mean matches to within Monte-Carlo noise, validating that inference needs no rescaling.
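A minimal Monte-Carlo sketch of this check; the four-element input vector and the seed are arbitrary choices.

```python
import random

def inverted_dropout(x, p, rng):
    """Zero each unit with probability p; scale survivors by 1 / (1 - p)."""
    keep = 1.0 - p
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]

rng = random.Random(0)
x = [1.0, 2.0, -3.0, 0.5]  # arbitrary activations
trials = 10_000

empirical_means = {}
for p in (0.1, 0.3, 0.5):
    sums = [0.0] * len(x)
    for _ in range(trials):
        y = inverted_dropout(x, p, rng)
        sums = [s + yi for s, yi in zip(sums, y)]
    empirical_means[p] = [s / trials for s in sums]
    print(p, empirical_means[p])
```

The spread around x grows with p, since each surviving unit is scaled up more and dropped more often, but the average stays centered on the input.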

Exercise 17 (Code Lab)

Train the MLP on MNIST to >97%; plot loss curves and identify overfitting onset.

Solution

A 2–3 layer MLP with ReLU, good init, and dropout reaches >97% test accuracy. The train loss keeps falling while the validation loss flattens and then rises. The epoch where validation loss reaches its minimum marks the onset of overfitting, the cue for early stopping or stronger regularization.

Exercise 18 (Code)

Ablation: MNIST MLP with/without He init, dropout, LayerNorm; report effects.

Solution

Removing He init slows or destabilizes early training (poor signal propagation); removing dropout raises the train/test gap (more overfitting); removing LayerNorm makes training less stable and slower to converge. Each component contributes a measurable improvement in final accuracy and/or stability — quantifying the 'bag of tricks' that makes deep nets train.

Exercise 19 (Code)

Overfit-one-batch test; then introduce a gradient-sign bug and show the test catches it.

Solution

A correct network can drive the loss on a tiny 10-example batch to ~0 (it has more than enough capacity to memorize). With a flipped gradient sign the loss fails to decrease (or increases), so the overfit-one-batch test immediately flags the bug — the single most useful sanity check before any real training run.
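The mechanics of the test can be shown with a much smaller model. In the sketch below a three-feature linear regressor stands in for the network (an assumption made to keep the example dependency-free), trained by plain gradient descent on one fixed batch of 10 examples. The `sign` argument is a hypothetical switch that injects the bug: +1 is the correct update, −1 flips the gradient.

```python
import random

def overfit_one_batch(sign, steps, lr=0.1, seed=0):
    """Train on a single 10-example batch; return the loss at every step."""
    rng = random.Random(seed)
    xs = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(10)]
    true_w, true_b = [1.5, -2.0, 0.5], 0.3  # arbitrary target function
    ys = [sum(w * x for w, x in zip(true_w, row)) + true_b for row in xs]

    w, b = [0.0, 0.0, 0.0], 0.0
    losses = []
    n = len(xs)
    for _ in range(steps):
        errs = [sum(wj * xj for wj, xj in zip(w, row)) + b - y
                for row, y in zip(xs, ys)]
        losses.append(sum(e * e for e in errs) / n)  # mean squared error
        grad_w = [2.0 * sum(e * row[j] for e, row in zip(errs, xs)) / n
                  for j in range(3)]
        grad_b = 2.0 * sum(errs) / n
        # sign=+1 descends the loss; sign=-1 is the injected bug.
        w = [wj - sign * lr * gj for wj, gj in zip(w, grad_w)]
        b = b - sign * lr * grad_b
    return losses

correct = overfit_one_batch(sign=+1, steps=1000)
buggy = overfit_one_batch(sign=-1, steps=50)

print("correct:", correct[0], "->", correct[-1])
print("buggy:  ", buggy[0], "->", buggy[-1])
```

A check such as `correct[-1] < 1e-3` passes for the correct update and would fail for the buggy one, which is how the test flags the problem.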

Exercise 20 (Code, Challenge)

Build the Pre-LN block skeleton with identity attention; verify shapes for (8,64,256).

Solution

Wiring LayerNorm → (placeholder) attention → residual add → LayerNorm → FFN → residual add, with dropout, and checking that an (8,64,256) tensor passes through unchanged in shape, validates the block plumbing. Swapping in real attention later (Chapter 12) requires no structural change, confirming the modular Pre-LN design.
