Sequence Models: RNNs & LSTMs
Detailed solutions for the exercises in Chapter 9. Try solving them yourself before checking the answers.
Solution
∂L/∂W_hh = Σ_{t} ∂L/∂h_t · (Π_{k<≤t} ∂h_ℓ/∂h_{ℓ−1}) · ∂h_k/∂W_hh, summed over time steps t and their contributing earlier steps k. The dangerous factor is the product of Jacobians Π ∂h_ℓ/∂h_{ℓ−1} = Π diag(tanh')·W_hh. Repeated multiplication of these terms shrinks the gradient toward zero (vanishing) or blows it up (exploding) as the gap t−k grows — the core RNN training problem.
Solution
Since tanh ranges in (−1,1), tanh² ∈ [0,1), so 1−tanh² ∈ (0,1], maximized at 1 when a=0 and →0 as |a|→∞ (saturation). In BPTT each backward step multiplies by these ≤ 1 factors (and by W_hh); over many steps the product of sub-1 derivatives drives the gradient exponentially toward zero, so distant time steps receive almost no learning signal — the vanishing-gradient mechanism.
Solution
The direct path gives ∂c_t/∂c_{t−1} = f_t (element-wise), ignoring the smaller indirect contributions through the gates. When the forget gate f_t ≈ 1, this derivative is ≈ 1, so gradients flow back through the cell state essentially undiminished — the 'constant error carousel'. Contrast the vanilla RNN's ∂h_t/∂h_{t−1} = diag(tanh')W_hh, whose sub-1 factors cause vanishing. The additive, gated cell state is the LSTM's gradient highway.
Solution
An LSTM has 4 gates (input, forget, output, candidate), each with a W (H×D), a U (H×H), and a bias: total 4(HD + H² + H). A GRU has 3 gates (reset, update, candidate): 3(HD + H² + H). Comparing the recurrent (H²) terms, 3H² vs 4H² is exactly 25% fewer — the GRU's efficiency comes from merging the LSTM's cell/hidden state and dropping one gate.
Solution
Then c_t = 1·c_{t−1} + 0·c̃_t = c_{t−1}, so by induction c_t = c_0 for all t — the cell state is perfectly preserved across arbitrarily many steps. This is the 'constant error carousel': information (and gradient) can be carried forward indefinitely without decay, which is precisely how LSTMs retain long-range dependencies that vanilla RNNs forget.
Solution
The alignment weights α_{tj} = softmax(e_{tj}) are non-negative and sum to 1 over j. The context c_t = Σ_j α_{tj} h_j is therefore a convex combination of the encoder states {h_j}, which by definition lies in their convex hull. So attention can only produce blends 'between' the encoder representations — it reweights existing information rather than inventing new directions.
Solution
Vanilla RNN: O(|i−j|) — information must traverse every intermediate step. Bidirectional RNN: still O(|i−j|) per direction (though both ends are reachable). Transformer self-attention: O(1) — any two positions interact directly in one layer. Shorter path length means fewer multiplicative Jacobian factors between distant tokens, so gradients survive and long-range dependencies are far easier to learn — a key reason Transformers outperform RNNs.
Solution
The encoder must compress a T-token sentence into a single fixed H-dimensional vector of bounded capacity. As T grows, the amount of information to store exceeds what H dimensions can faithfully represent, so detail is inevitably lost — the fixed bottleneck cannot scale with input length. Translation quality therefore degrades for long sentences, which is exactly the bottleneck attention removes by letting the decoder revisit all encoder states.
Solution
For the score a = sᵀh, ∂a/∂s = h and ∂a/∂h = s — trivial gradients with no extra parameters. Bahdanau's additive score vᵀtanh(W_s s + W_h h) requires learned matrices W_s, W_h, a vector v, and a tanh nonlinearity, so it is more expensive in both compute and parameters. The dot-product form (later central to Transformer attention) is cheaper because alignment is just an inner product.
Solution
Truncated BPTT only backpropagates gradients k steps before cutting the graph. A dependency spanning more than k steps has no gradient path connecting the distant cause to the effect within a single backward pass, so the weights responsible for carrying that information receive no learning signal for it. The forward hidden state still passes information forward indefinitely, but learning to USE long-range information requires a gradient path, which truncation severs beyond k.
Solution
Train a small char-level RNN to predict the next character on bracketed text. It learns local statistics but, due to vanishing gradients, fails to close parentheses opened more than ~15 characters back — the closing-bracket prediction degrades with distance. This empirically demonstrates the limited effective memory of vanilla RNNs.
Solution
Implement the four gate equations forward, then derive the backward pass (gradients through the gates and the additive cell-state path). The Chapter-5 finite-difference checker should confirm all gate and weight gradients to ~1e−5, validating the hand-derived LSTM backprop — including the f_t gradient highway from Exercise 3.
Solution
On the copy task (reproduce a sequence after an N-step delay), the GRU maintains high accuracy to much larger N than the vanilla RNN, whose accuracy collapses once N exceeds its short effective memory (~10–15). The plot shows the gating mechanism extending usable memory dramatically — the practical payoff of gated recurrence.
Solution
A plain encoder–decoder solves short sequences but its accuracy falls off as source length grows, because the single fixed context vector saturates (Exercise 8). The declining accuracy-vs-length curve is the visible signature of the fixed-bottleneck problem.
Solution
With attention the decoder can look back at all encoder states, so the accuracy-vs-length curve flattens — long sequences no longer degrade. The attention heatmap shows a roughly diagonal alignment (and meaningful re-orderings for tasks like reversal), visually confirming that attention dissolves the bottleneck by learning which source positions to attend to at each output step.
Solution
For a task whose output depends on the input at t=1 across an 80-step gap, the LSTM maintains a non-negligible gradient norm reaching t=1 (thanks to the cell-state highway), while the vanilla RNN's gradient at t=1 decays to near zero. The plot makes the vanishing-vs-preserved gradient contrast concrete.
Solution
A bidirectional LSTM reads the sentence both ways, so each tag decision sees future context. F1 improves over the unidirectional model, with the largest gains on tags that depend on what follows — e.g. disambiguating a word as noun vs verb often requires the next word. Right-context-dependent categories benefit most.
Solution
Build encoder–decoder with Bahdanau/Luong attention, train with teacher forcing, decode with beam search, and visualize alignments; report BLEU. The reflection should note the limitations encountered — sequential (non-parallel) training, limited long-range modeling, and the lingering compression in the recurrent decoder — and explain how the Transformer addresses each: full parallelism over positions, O(1) attention path length, and no recurrent bottleneck (Chapters 12–13).