Sequence Models & RNNs
How should a computer represent the word “cat”? The obvious answer — a one-hot vector with a 1 in position 8,432 of a 50,000-dimensional vector — turns out to be nearly useless. This section explains why, and motivates the dense representations that power all of modern NLP.
The Problem with One-Hot Vectors
| One-hot encoding | Dense embedding |
|---|---|
| Dimension = vocabulary size (50k–200k) | Dimension = 50–1024 (chosen freely) |
| Every pair of words is orthogonal | Similar words are close in space |
| cos(cat, kitten) = 0 = cos(cat, carburettor) | cos(cat, kitten) ≈ 0.8 >> cos(cat, carburettor) |
| Sparse: 1 non-zero entry | Dense: every dimension carries information |
| No generalisation between words | Statistical strength shared across similar words |
| Input width grows with vocabulary size | Input width fixed by the embedding dimension |
The fatal flaw of one-hot vectors: every pair of distinct words has dot product zero. The encoding contains no information about meaning whatsoever. A model that sees “cat” in “the cat sat on the mat” learns nothing it can reuse for “kitten” in “the kitten sat on the rug”, even though the sentences are nearly synonymous.
import numpy as np

# One-hot: every word pair is orthogonal
V = 50000  # vocabulary size
cat = np.zeros(V); cat[8432] = 1
kitten = np.zeros(V); kitten[8433] = 1
carb = np.zeros(V); carb[31007] = 1  # 'carburettor'
print(cat @ kitten)  # 0.0 -- 'cat' and 'kitten' unrelated?!
print(cat @ carb)    # 0.0 -- same similarity as 'carburettor'

# Dense embeddings (hand-written 4-d vectors standing in for pre-trained
# GloVe-style embeddings; illustrative only)
emb = {
    'cat':    np.array([0.61, 0.21, -0.43, 0.18]),
    'kitten': np.array([0.58, 0.25, -0.39, 0.22]),
    'carb':   np.array([-0.31, 0.74, 0.12, -0.55]),
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"cos(cat, kitten) = {cos(emb['cat'], emb['kitten']):.3f}")
print(f"cos(cat, carb) = {cos(emb['cat'], emb['carb']):.3f}")
# cos(cat, kitten) = 0.996  <- semantically similar = geometrically close
# cos(cat, carb) = -0.236   <- unrelated = far apart

Every embedding method in this chapter, and every language model since, rests on a single linguistic insight, the distributional hypothesis: words that occur in similar contexts have similar meanings. “Cat” and “kitten” both appear near “purr”, “fur”, “meow”, and “vet”; that contextual overlap is what the embedding captures.
Co-occurrence Counts: The Raw Signal
The simplest way to operationalise the hypothesis: build a co-occurrence matrix X where X[i][j] counts how many times word j appears within a window of w words around word i, summed over a corpus.
import numpy as np

corpus = """the cat sat on the mat . the dog sat on the rug .
the cat chased the dog . the kitten purred softly .""".split()

def build_cooccurrence(tokens, window=2):
    """Count, for every token, the words within `window` positions of it."""
    vocab = sorted(set(tokens))
    w2i = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    X = np.zeros((V, V))
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                X[w2i[word], w2i[tokens[j]]] += 1
    return X, vocab, w2i

X, vocab, w2i = build_cooccurrence(corpus, window=2)

# 'cat' and 'dog' co-occur with similar words -> similar rows
def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

print(f"cos(cat, dog) = {cos(X[w2i['cat']], X[w2i['dog']]):.3f}")
print(f"cos(cat, softly) = {cos(X[w2i['cat']], X[w2i['softly']]):.3f}")
# cos(cat, dog) = 0.971    -- both are 'sat on / chased' animals
# cos(cat, softly) = 0.160 -- the only shared context is the sentence-final '.'

From Counts to Embeddings: Three Strategies
| Strategy | Method | Representative |
|---|---|---|
| Count + factorise | Build co-occurrence matrix, reduce with SVD | LSA (1990), HAL (1996) |
| Predict context | Train a classifier to predict context words | Word2Vec (2013), fastText (2016) |
| Count + weighted fit | Fit log co-occurrence counts directly | GloVe (2014) |
The remarkable result (Levy & Goldberg, 2014): these three families are mathematically connected. Skip-gram with negative sampling implicitly factorises a shifted PMI (pointwise mutual information) matrix. Counting and predicting converge on the same geometry.
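The count-and-factorise route can be sketched in a few lines. This is a minimal illustration, assuming a small dense word-by-context count matrix: it computes positive PMI and then a truncated SVD. The helper names `ppmi` and `svd_embed`, the `shift` argument (a stand-in for the log k shift that Levy & Goldberg associate with k negative samples), and the toy counts are all assumptions made for this example.

```python
import numpy as np

def ppmi(X, shift=0.0):
    """Positive (optionally shifted) PMI of a count matrix X (words x contexts)."""
    X = np.asarray(X, dtype=float)
    total = X.sum()
    row = X.sum(axis=1, keepdims=True)   # marginal counts of the words
    col = X.sum(axis=0, keepdims=True)   # marginal counts of the contexts
    expected = row @ col / total         # counts expected under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(X / expected)
    pmi[~np.isfinite(pmi)] = 0.0         # zero counts give log 0; treat as no evidence
    return np.maximum(pmi - shift, 0.0)

def svd_embed(M, dim):
    """Truncated SVD; rows of the result are dim-dimensional word vectors."""
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :dim] * np.sqrt(S[:dim])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

# Hypothetical counts. Rows: cat, kitten, car. Columns: fur, purr, wheel, engine.
X_toy = np.array([[6, 4, 0, 0],
                  [5, 5, 0, 0],
                  [0, 0, 7, 6]])
M = ppmi(X_toy)
E_toy = svd_embed(M, dim=2)
print(f"cos(cat, kitten) = {cosine(E_toy[0], E_toy[1]):.2f}")
print(f"cos(cat, car)    = {cosine(E_toy[0], E_toy[2]):.2f}")
```

Words that share contexts ("cat" and "kitten") end up pointing the same way after factorisation, while "car" lands on a separate axis.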
Word2Vec (Mikolov et al., 2013) reframes embedding learning as a prediction task: train a shallow network to predict context words from a target word (Skip-gram) or a target word from its context (CBOW). The embeddings are the learned weights — the prediction task itself is discarded after training.
Two Architectures
| Skip-gram: predict context from word | CBOW: predict word from context |
|---|---|
| Input: centre word w | Input: average of context words |
| Output: each context word c in window | Output: centre word w |
| One training pair per (w, c) combination | One training pair per centre position |
| Better for rare words (more updates each) | Faster training (fewer updates) |
| Better on semantic analogy tasks | Slightly better on syntactic tasks and frequent words |
| Default choice in practice | Used when speed matters most |
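The difference between the two architectures is easiest to see in the training examples they generate. The sketch below uses two illustrative helpers, `skipgram_pairs` and `cbow_examples` (names assumed here, not part of Word2Vec), on a six-token sentence.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (centre, context) pair per neighbour inside the window."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs += [(centre, tokens[j]) for j in range(lo, hi) if j != i]
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: one (context words, centre) example per position."""
    examples = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, centre))
    return examples

sentence = "the cat sat on the mat".split()
sg_pairs = skipgram_pairs(sentence, window=1)
cb_examples = cbow_examples(sentence, window=1)
print(len(sg_pairs), len(cb_examples))  # 10 6
print(sg_pairs[:3])    # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
print(cb_examples[1])  # (['the', 'sat'], 'cat')
```

Skip-gram produces more, smaller updates per position, which is why rare words receive more gradient signal; CBOW produces one averaged update per position, which is why it trains faster.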
The Skip-gram Objective
Given a corpus of T tokens, maximise the log-probability of context words within a window of size m around each position:
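In its standard form (Mikolov et al., 2013b), with v_w the centre vector and u_c the context vector of a word, the objective and the softmax that defines p can be written as follows. The symbol θ, standing for all the vectors being learned, is notation assumed here.

```latex
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p(w_{t+j} \mid w_t),
\qquad
p(c \mid w) = \frac{\exp(u_c^\top v_w)}{\sum_{w' \in V} \exp(u_{w'}^\top v_w)}
```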
Note the two embedding matrices: v (input/centre embeddings) and u (output/context embeddings). Each word has two vectors during training; typically only v is kept (or v and u are averaged).
import numpy as np

class SkipGram:
    """Skip-gram with a full softmax, trained one (centre, context) pair at a time."""

    def __init__(self, V, dim=50, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.V_in = rng.normal(0, 0.1, (V, dim))   # centre embeddings v_w
        self.V_out = rng.normal(0, 0.1, (V, dim))  # context embeddings u_c
        self.lr = lr

    def step(self, centre, context):  # one (w, c) training pair
        v = self.V_in[centre]              # (dim,)
        scores = self.V_out @ v            # (V,) logits over vocab
        scores -= scores.max()             # stable softmax
        p = np.exp(scores); p /= p.sum()   # (V,) probabilities
        # Gradient of cross-entropy w.r.t. the logits: p - onehot(context)
        grad = p.copy(); grad[context] -= 1.0
        # Backprop into both embedding matrices
        dV_in = self.V_out.T @ grad        # (dim,)
        dV_out = np.outer(grad, v)         # (V, dim)
        self.V_in[centre] -= self.lr * dV_in
        self.V_out -= self.lr * dV_out
        return -np.log(p[context] + 1e-12)  # loss for monitoring

# Generate training pairs from the corpus
# (reuses `corpus`, `vocab` and `w2i` from the co-occurrence example)
def make_pairs(tokens, w2i, window=2):
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                pairs.append((w2i[w], w2i[tokens[j]]))
    return pairs

pairs = make_pairs(corpus, w2i)
sg = SkipGram(V=len(vocab), dim=20)
for epoch in range(200):
    loss = np.mean([sg.step(w, c) for w, c in pairs])
    if epoch % 50 == 0:
        print(f"epoch {epoch}: loss {loss:.3f}")
# The loss starts near ln(12) ≈ 2.48 (a uniform guess over the 12-word
# vocabulary) and falls as training proceeds; exact values depend on the seed.

The full softmax above touches every row of V_out for every training pair, a cost of O(V·dim) per step. Negative sampling (Mikolov et al., 2013b) sidesteps the softmax entirely. Instead of asking “which word in V is the context?”, it asks a binary question: “is this (word, context) pair real or random?” For each true pair, draw k random “negative” contexts and train a logistic classifier to distinguish them.
The SGNS Objective
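For one true pair (w, c), the negative-sampling objective of Mikolov et al. (2013b) is usually written as below and is maximised. Here σ is the logistic sigmoid, v_w and u_c are the centre and context vectors used earlier, and Pₙ is the noise distribution; the label J_SGNS is notation assumed here.

```latex
J_{\text{SGNS}}(w, c) = \log \sigma\!\left(u_c^\top v_w\right)
  + \sum_{i=1}^{k} \mathbb{E}_{n_i \sim P_n}\!\left[ \log \sigma\!\left(-u_{n_i}^\top v_w\right) \right]
```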
The first term pushes the true pair's dot product up; the k negative terms push random pairs' dot products down. The noise distribution is Pₙ(w) ∝ count(w)^{3/4}: the 3/4 power down-weights very frequent words like “the” while still sampling them more often than rare words.
for each (centre w, context c) pair in corpus:
    acc ← 0                        # accumulated gradient for v_w
    # Positive update: this pair is real
    g ← σ(u_c · v_w) − 1           # gradient of −log σ(score)
    acc ← acc + g · u_c            # accumulate before u_c is changed
    u_c ← u_c − lr · g · v_w
    # Negative updates: k random fake pairs
    for i = 1…k:
        n ← sample from Pₙ(w) ∝ count(w)^0.75
        g ← σ(u_n · v_w)           # want score → −∞, so σ → 0
        acc ← acc + g · u_n
        u_n ← u_n − lr · g · v_w
    v_w ← v_w − lr · acc           # single update for the centre word

import numpy as np
from collections import Counter

class SGNS:
    def __init__(self, tokens, dim=100, k=5, lr=0.025, seed=0):
        self.vocab = sorted(set(tokens))
        self.w2i = {w: i for i, w in enumerate(self.vocab)}
        V = len(self.vocab)
        rng = np.random.default_rng(seed)
        self.W_in = rng.uniform(-0.5 / dim, 0.5 / dim, (V, dim))
        self.W_out = np.zeros((V, dim))  # standard word2vec init
        self.k, self.lr = k, lr
        # Unigram^0.75 noise distribution
        counts = Counter(tokens)
        freqs = np.array([counts[w] for w in self.vocab], dtype=float)
        self.noise = freqs**0.75
        self.noise /= self.noise.sum()
        self.rng = rng

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -20, 20)))

    def step(self, centre, context):
        v = self.W_in[centre]
        acc = np.zeros_like(v)
        # --- Positive pair ---
        u = self.W_out[context]
        g = self._sigmoid(u @ v) - 1.0  # ∂/∂score of -log σ(score)
        acc += g * u
        self.W_out[context] -= self.lr * g * v
        # --- k negative samples ---
        negs = self.rng.choice(len(self.vocab), self.k, p=self.noise)
        for n in negs:
            if n == context:
                continue  # never treat the true context as a negative
            u = self.W_out[n]
            g = self._sigmoid(u @ v)    # want σ → 0 for fakes
            acc += g * u
            self.W_out[n] -= self.lr * g * v
        self.W_in[centre] -= self.lr * acc

    def most_similar(self, word, topn=5):
        v = self.W_in[self.w2i[word]]
        sims = self.W_in @ v / (np.linalg.norm(self.W_in, axis=1) * np.linalg.norm(v) + 1e-9)
        best = np.argsort(-sims)[1:topn + 1]  # skip self
        return [(self.vocab[i], sims[i]) for i in best]

# Cost per pair: O((k+1)·dim) instead of O(V·dim).
# For V=1M, k=5 that is roughly V/(k+1) ≈ 167,000x less work per pair.
# This is what made Word2Vec feasible.

Frequent Word Subsampling
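A second crucial trick: very frequent words (“the”, “of”, “and”) provide little signal but dominate the training pairs. Word2Vec discards each occurrence of word w with probability P(w) = 1 − √(t / f(w)), where f(w) is the relative frequency of w in the corpus and t is a threshold (around 10⁻⁵ in Mikolov et al., 2013b). Words with f(w) ≤ t are never discarded.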
This both speeds up training and improves embedding quality — rare-word vectors get relatively more updates, and context windows effectively widen (discarded words don't block more distant content words).
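A minimal sketch of subsampling, assuming the discard rule as stated in the Word2Vec paper (the released C implementation uses a slightly different expression). The function name `subsample`, the toy corpus, and the unusually large threshold `t=0.05` (chosen only so the effect is visible on 1,000 tokens) are assumptions for illustration.

```python
import numpy as np
from collections import Counter

def subsample(tokens, t=1e-5, seed=0):
    """Randomly drop occurrences of frequent words before pair generation."""
    rng = np.random.default_rng(seed)
    counts = Counter(tokens)
    total = len(tokens)
    # P(discard w) = max(0, 1 - sqrt(t / f(w))), with f(w) the relative frequency
    p_discard = {w: max(0.0, 1.0 - np.sqrt(t / (c / total)))
                 for w, c in counts.items()}
    kept = [w for w in tokens if rng.random() >= p_discard[w]]
    return kept, p_discard

toy_tokens = ["the"] * 900 + ["cat"] * 90 + ["purrs"] * 10
kept, p_discard = subsample(toy_tokens, t=0.05)
for w in ("the", "cat", "purrs"):
    print(f"{w:6s} P(discard) = {p_discard[w]:.3f}   kept {kept.count(w)} of {toy_tokens.count(w)}")
```

The most frequent word loses most of its occurrences, while the rarest word is never dropped, so the surviving pairs are spread more evenly across the vocabulary.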
Word2Vec streams through the corpus pair by pair, never aggregating global statistics. GloVe (Pennington, Socher & Manning, 2014) takes the opposite approach: first build the full co-occurrence matrix X, then fit embeddings so that dot products reproduce the log co-occurrence counts.
The Key Insight: Ratios of Co-occurrence Probabilities
Consider words i = “ice” and j = “steam”. The ratio P(k|ice)/P(k|steam) discriminates meaning: for k = “solid” the ratio is large; for k = “gas” it is small; for k = “water” (related to both) or k = “fashion” (related to neither), the ratio is near 1. GloVe constructs embeddings whose differences encode these ratios.
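The ratio argument can be checked numerically. The counts below are invented for illustration (they are not the figures reported in the GloVe paper), and the final "(other)" column is a hypothetical catch-all so that each row sums to 1,000.

```python
import numpy as np

probes = ["solid", "gas", "water", "fashion"]
# Hypothetical co-occurrence counts; columns: solid, gas, water, fashion, (other)
X_toy = np.array([[80,  4, 300, 2, 614],    # row for "ice"
                  [ 5, 60, 280, 2, 653]],   # row for "steam"
                 dtype=float)

P = X_toy / X_toy.sum(axis=1, keepdims=True)   # P(k | word)
ratio = P[0] / P[1]                            # P(k | ice) / P(k | steam)

for k, r in zip(probes, ratio):
    print(f"k = {k:8s} ratio = {r:6.2f}   log ratio = {np.log(r):+.2f}")
```

The log ratio is large and positive for "solid", large and negative for "gas", and close to zero for "water" and "fashion". GloVe asks the difference of the two word vectors, dotted with the probe's context vector, to reproduce that log ratio.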
The GloVe Objective
Each term is a weighted squared error: the dot product of word vector wᵢ and context vector w̃ⱼ (plus the biases bᵢ and b̃ⱼ) should equal the log co-occurrence count log Xᵢⱼ. The weighting function f grows as (x / x_max)^α and saturates at 1 once x reaches x_max, which caps the influence of extremely frequent pairs.
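Written out with the symbols used in this section, the objective and weighting function from Pennington et al. (2014) take the following form; the paper's defaults are x_max = 100 and α = 3/4, and the sum is in practice taken over the non-zero entries of X only.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
```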
import numpy as np

class GloVe:
    def __init__(self, X, dim=50, x_max=100, alpha=0.75, lr=0.05, seed=0):
        V = X.shape[0]
        self.rng = np.random.default_rng(seed)
        # Two embedding matrices + two bias vectors
        self.W = self.rng.normal(0, 0.1, (V, dim))   # word vectors
        self.Wc = self.rng.normal(0, 0.1, (V, dim))  # context vectors
        self.b = np.zeros(V)
        self.bc = np.zeros(V)
        self.X, self.x_max, self.alpha, self.lr = X, x_max, alpha, lr
        # Pre-compute the non-zero co-occurrence entries
        self.nz = np.argwhere(X > 0)

    def _weight(self, x):
        return np.minimum((x / self.x_max)**self.alpha, 1.0)

    def train_epoch(self):
        total = 0.0
        self.rng.shuffle(self.nz)  # visit entries in a random (seeded) order
        for i, j in self.nz:
            xij = self.X[i, j]
            # Prediction error: w_i·w̃_j + b_i + b̃_j − log X_ij
            diff = self.W[i] @ self.Wc[j] + self.b[i] + self.bc[j] - np.log(xij)
            fw = self._weight(xij)
            total += fw * diff**2
            # Gradients (weighted squared loss)
            g = 2 * fw * diff
            gW, gWc = g * self.Wc[j], g * self.W[i]
            self.W[i] -= self.lr * gW
            self.Wc[j] -= self.lr * gWc
            self.b[i] -= self.lr * g
            self.bc[j] -= self.lr * g
        return total / len(self.nz)

    @property
    def embeddings(self):
        return self.W + self.Wc  # GloVe paper: sum the two matrices

glove = GloVe(X, dim=20)  # X is the co-occurrence matrix built earlier
for epoch in range(100):
    loss = glove.train_epoch()
    if epoch % 25 == 0:
        print(f"epoch {epoch}: weighted loss {loss:.4f}")
# The weighted loss should fall from epoch to epoch as the dot products
# approach log X_ij; exact values depend on the seed and the corpus.

Word2Vec vs. GloVe
| Word2Vec (SGNS) | GloVe |
|---|---|
| Streaming: one pair at a time | Batch: pre-computed co-occurrence matrix |
| Implicit matrix factorisation (shifted PMI) | Explicit weighted factorisation of log X |
| Memory: O(V·dim) — corpus streamed | Memory: O(nnz(X)) — matrix must fit |
| Local context only | Global corpus statistics |
| Stochastic; sensitive to pair order | Objective fixed given X; SGD order and init still vary |
| Slightly better on analogy tasks | Slightly better on similarity tasks |
In practice, both produce embeddings of comparable quality on most benchmarks. The systematic study by Levy, Goldberg & Dagan (2015) found that hyperparameters (window size, subsampling, vector dimension) matter more than the choice between SGNS and GloVe.
Word2Vec and GloVe share a blind spot: each word is an atomic unit. “Run”, “running”, and “runner” get independent vectors that share no parameters — the model cannot exploit their obvious morphological relationship. Worse, any word not seen in training has no vector at all (the OOV problem).
fastText (Bojanowski et al., 2016) fixes both problems with one idea: represent each word as the sum of its character n-gram embeddings.
The Subword Decomposition
The word “where” with n-grams of length 3–6, including boundary markers < and >:
def char_ngrams(word, n_min=3, n_max=6):
    """fastText-style subword decomposition with boundary markers."""
    w = '<' + word + '>'
    grams = []
    for n in range(n_min, min(n_max, len(w)) + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i+n])
    if w not in grams:   # avoid a duplicate when n_max already covers the whole word
        grams.append(w)  # the full word is also a 'gram'
    return grams

print(char_ngrams('where', 3, 4))
# ['<wh', 'whe', 'her', 'ere', 're>',   <- 3-grams
#  '<whe', 'wher', 'here', 'ere>',      <- 4-grams
#  '<where>']                           <- full word

# Word vector = sum of n-gram vectors:
#   v(where) = z(<wh) + z(whe) + z(her) + ... + z(<where>)
# 'her' inside 'where' shares the SAME z(her) used in 'her', 'here', 'there'

Why Subwords Win
| Problem | Word2Vec / GloVe | fastText |
|---|---|---|
| OOV words | No vector — fail or use <UNK> | Sum the n-gram vectors → usable embedding |
| Rare words | Poor vectors (few updates) | Share n-grams with frequent words |
| Morphology (run/running) | Independent vectors | Shared subwords link related forms |
| Typos (recieve) | No vector | Mostly-overlapping n-grams → close to 'receive' |
| Agglutinative languages | Vocabulary explosion | Subwords keep vocab manageable |
# pip install gensim
from gensim.models import FastText

# Train on a small corpus (in practice: Wikipedia, Common Crawl)
sentences = [s.split() for s in [
    'the cat sat on the mat',
    'the kitten purred softly',
    'dogs and cats are running in the park',
]]
ft = FastText(sentences, vector_size=50, window=3, min_count=1,
              min_n=3, max_n=6, epochs=100)

# OOV word: 'catlike' was NEVER in training data
v_oov = ft.wv['catlike']  # works! built from <ca, cat, atl, ... n-grams
print(ft.wv.similarity('catlike', 'cat'))
# e.g. 0.71 -- shares 'cat' n-grams, lands near 'cat' in embedding space
# (exact value varies from run to run on a corpus this small)

# Typo robustness
print(ft.wv.similarity('runing', 'running'))
# e.g. 0.89 -- overlapping n-grams make typos land near correct spellings

The most celebrated property of word embeddings: vector arithmetic captures semantic relationships. The canonical example, v(king) − v(man) + v(woman) ≈ v(queen), suggests that the difference vector v(king) − v(man) encodes a reusable “royalty” direction: what is left of “king” once maleness is subtracted.
The Analogy Mechanism
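To answer a : b :: c : ?, the usual recipe is to form v(b) − v(a) + v(c) from unit-normalised vectors and return the nearest word by cosine similarity, excluding the three query words. The sketch below implements that recipe on hand-made 2-d vectors; the function name `analogy` and the toy vectors (axis 0 roughly "royalty", axis 1 roughly "gender") are assumptions for illustration.

```python
import numpy as np

def analogy(E, vocab, a, b, c):
    """Solve a : b :: c : ? by the nearest cosine neighbour of v(b) - v(a) + v(c)."""
    w2i = {w: i for i, w in enumerate(vocab)}
    En = E / np.linalg.norm(E, axis=1, keepdims=True)   # unit-normalise every row
    target = En[w2i[b]] - En[w2i[a]] + En[w2i[c]]
    sims = En @ (target / np.linalg.norm(target))       # cosine to every word
    for w in (a, b, c):
        sims[w2i[w]] = -np.inf                          # exclude the query words
    return vocab[int(np.argmax(sims))]

# Hypothetical 2-d embeddings
toy_vocab = ["man", "woman", "king", "queen", "apple"]
toy_E = np.array([[0.1, 1.0],     # man
                  [0.1, -1.0],    # woman
                  [1.0, 1.0],     # king
                  [1.0, -1.0],    # queen
                  [-1.0, 0.1]])   # apple

print(analogy(toy_E, toy_vocab, "man", "king", "woman"))  # queen
```

Excluding the query words matters in practice: without it, the nearest neighbour of the target vector is often one of the inputs.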
import numpy as np
import gensim.downloader as api

# Load 100-d GloVe trained on Wikipedia + Gigaword (400k-word vocabulary)
glv = api.load('glove-wiki-gigaword-100')

# --- The classic ---
print(glv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))
# [('queen', 0.770), ('monarch', 0.684), ('throne', 0.676)]

# --- Syntactic analogies work too ---
print(glv.most_similar(positive=['walking', 'swam'], negative=['walked'], topn=1))
# [('swimming', 0.789)]   walked:walking :: swam:swimming

# --- Geographic ---
print(glv.most_similar(positive=['paris', 'italy'], negative=['france'], topn=1))
# [('rome', 0.842)]   france:paris :: italy:rome

# --- Nearest neighbours reveal local structure ---
print(glv.most_similar('transformer', topn=5))
# [('transformers',...), ('amplifier',...), ('voltage',...)]
# 2014 vectors: 'transformer' is electrical, not neural. Embeddings freeze
# the corpus's snapshot of meaning -- a key limitation.

Anisotropy: The Narrow Cone Problem
Trained embeddings are not spread uniformly over the sphere. They concentrate in a narrow cone, which makes most cosine similarities positive and inflated. Post-processing helps: mean-centering plus removing the top principal components (the “all-but-the-top” method, Mu & Viswanath 2018) measurably improves similarity benchmarks.
import numpy as np

E = glv.vectors  # (V, 100) all embeddings

# Average pairwise cosine of random word pairs (should be ~0 if isotropic)
rng = np.random.default_rng(0)
idx = rng.choice(len(E), (2000, 2))

def cos_rows(A, B):
    return np.sum(A * B, axis=1) / (np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1) + 1e-9)

avg_cos = cos_rows(E[idx[:, 0]], E[idx[:, 1]]).mean()
print(f"Mean random-pair cosine (raw): {avg_cos:.3f}")
# 0.318 -- far from 0: vectors live in a narrow cone

# All-but-the-top: centre, then remove top-k principal components
E_c = E - E.mean(axis=0)
U, S, Vt = np.linalg.svd(E_c, full_matrices=False)
k = 3  # remove top 3 PCs
E_iso = E_c - (E_c @ Vt[:k].T) @ Vt[:k]
avg_cos_iso = cos_rows(E_iso[idx[:, 0]], E_iso[idx[:, 1]]).mean()
print(f"Mean random-pair cosine (corrected): {avg_cos_iso:.3f}")
# 0.011 -- near-isotropic; similarity scores are now meaningful

How do you know if one set of embeddings is better than another? Two evaluation families exist, intrinsic and extrinsic, and they frequently disagree, a fact with important practical consequences.
| Type | Benchmark | What it measures |
|---|---|---|
| Intrinsic | WordSim-353, SimLex-999 | Correlation of cos(w₁,w₂) with human similarity ratings |
| Intrinsic | Google analogy set (19,544 questions) | Accuracy of a:b :: c:? vector arithmetic |
| Intrinsic | MEN, RareWord (RW) | Similarity on frequent / rare word pairs |
| Extrinsic | Text classification (e.g., SST-2) | Accuracy when embeddings feed a downstream classifier |
| Extrinsic | Named-entity recognition (CoNLL) | F1 with embeddings as input features |
| Extrinsic | Machine translation (BLEU) | Quality when used to initialise encoder embeddings |
import numpy as np
from scipy.stats import spearmanr

# SimLex-999 style: (word1, word2, human_score 0-10)
pairs = [
    ('old', 'new', 1.58),            # antonyms: related but NOT similar
    ('smart', 'intelligent', 9.20),
    ('hard', 'difficult', 8.77),
    ('happy', 'cheerful', 9.55),
    ('coast', 'shore', 9.00),
    ('movie', 'popcorn', 2.50),      # associated, not similar
]
model_sims = [glv.similarity(a, b) for a, b, _ in pairs]
human_sims = [h for _, _, h in pairs]
rho, p = spearmanr(model_sims, human_sims)
print(f"Spearman ρ = {rho:.3f} (p={p:.3f})")
# Spearman ρ = 0.829
# Note: GloVe scores 'old'/'new' as quite similar (they share contexts!)
# but humans rate them dissimilar. Distributional similarity conflates
# similarity with relatedness -- a fundamental limitation.

Every method in this chapter assigns each word exactly one vector. But “bank” means a financial institution, a river edge, a turning manoeuvre, and a pool shot. A single static vector must average these senses into one point, which may be close to none of them.
The Superposition of Senses
Arora et al. (2018) showed that a polysemous word's static vector is approximately a frequency-weighted linear combination of its sense vectors:
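For a word such as “bank”, the relationship has roughly the following shape. The subscripted sense vectors and the coefficients α are notation assumed here; in Arora et al.'s analysis the coefficients are non-negative and grow with the relative frequency of each sense.

```latex
v_{\text{bank}} \approx \alpha_1 \, v_{\text{bank}_1} + \alpha_2 \, v_{\text{bank}_2} + \alpha_3 \, v_{\text{bank}_3} + \cdots,
\qquad \alpha_s \ge 0
```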
Remarkably, the individual senses can be partially recovered from the combined vector using sparse coding — evidence that the information is superimposed, not destroyed. But for any specific sentence, the static vector still cannot tell you which sense is active.
# Static embeddings: one vector regardless of context
print(glv.most_similar('bank', topn=6))
# [('banks',.79), ('banking',.74), ('credit',.69),
#  ('financial',.68), ('lending',.67), ('lender',.66)]
# The financial sense dominates -- river banks are invisible.

# Contextual embeddings (BERT): different vector per occurrence
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased').eval()

def word_vec_in_context(sentence, word):
    """Return BERT's contextual embedding of `word` inside `sentence`."""
    enc = tok(sentence, return_tensors='pt')
    with torch.no_grad():
        h = model(**enc).last_hidden_state[0]  # (T, 768)
    # Assumes `word` is a single WordPiece token that occurs in the sentence
    w_id = tok.convert_tokens_to_ids(word)
    pos = (enc.input_ids[0] == w_id).nonzero()[0, 0]
    return h[pos]

v_fin = word_vec_in_context('I deposited cash at the bank', 'bank')
v_river = word_vec_in_context('We fished from the river bank', 'bank')
v_fin2 = word_vec_in_context('The bank approved my loan', 'bank')

cos = lambda a, b: torch.cosine_similarity(a, b, dim=0).item()
print(f"finance vs finance: {cos(v_fin, v_fin2):.3f}")
print(f"finance vs river: {cos(v_fin, v_river):.3f}")
# finance vs finance: 0.85  <- same sense, similar vectors
# finance vs river: 0.46    <- different senses, BERT separates them

The Conceptual Bridge to Part III
Contextual embeddings are not a different idea — they are the same distributional hypothesis, applied dynamically. A Transformer computes a fresh embedding for every token occurrence, conditioned on its entire surrounding sentence, by repeatedly mixing each token's vector with its context through attention. The static embedding table of Word2Vec becomes merely the input layer; the real representation emerges through the network.
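As a preview of the mechanism, here is a minimal single-head self-attention step in NumPy. It is a sketch only: the projection matrices are random rather than trained, the 8-d static vectors are invented, and the names `self_attention` and `contextual` are assumptions for this example. It shows that the same static vector for “bank” comes out different depending on its neighbour.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Mix each token's vector with its context (one head, no masking)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])        # (T, T) token-to-token affinities
    scores -= scores.max(axis=1, keepdims=True)   # stable softmax
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)             # each row sums to 1
    return A @ V, A

rng = np.random.default_rng(0)
# Hypothetical static embedding table
static = {w: rng.normal(size=8) for w in ["river", "loan", "bank"]}
# Untrained random projections, for illustration only
Wq, Wk, Wv = (rng.normal(scale=0.5, size=(8, 8)) for _ in range(3))

def contextual(sentence):
    X = np.stack([static[w] for w in sentence])   # same input vector per word type
    return self_attention(X, Wq, Wk, Wv)

out_river, A_river = contextual(["river", "bank"])
out_loan, A_loan = contextual(["loan", "bank"])
bank_river, bank_loan = out_river[1], out_loan[1]
print(np.allclose(bank_river, bank_loan))  # False: the output depends on context
```

A trained Transformer stacks many such layers with learned projections, which is what turns this context dependence into sense-aware representations.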
| Era | Method | Representation |
|---|---|---|
| 1990 | LSA | One vector per word; SVD of term-document counts |
| 2013 | Word2Vec | One vector per word; learned by context prediction |
| 2016 | fastText | One vector per word; composed from subword n-grams |
| 2018 | ELMo | One vector per occurrence; bidirectional LSTM states |
| 2018 | BERT / GPT | One vector per occurrence per layer; Transformer attention |
Method Quick-Reference
| Method | Objective | Strength | Weakness |
|---|---|---|---|
| Word2Vec SGNS | Binary: real pair vs k noise pairs | Fast, scalable, strong analogies | OOV; one vector per word |
| GloVe | Weighted fit of wᵢ·w̃ⱼ ≈ log Xᵢⱼ | Global statistics; deterministic | Matrix memory; OOV |
| fastText | SGNS over char n-gram sums | OOV + morphology + typos | Slightly blurrier vectors |
| LSA | Truncated SVD of counts | Simple; interpretable factors | Linear; weaker quality |
| BERT (preview) | Masked LM over Transformers | Contextual; sense-aware | Heavy; needs Part III |
Exercises
Exercises 1–10 are pen-and-paper; 11–18 require code.
Further reading: “Efficient Estimation of Word Representations in Vector Space” (Mikolov et al., 2013) and “Distributed Representations of Words and Phrases” (Mikolov et al., 2013b) — the original Word2Vec pair. “GloVe: Global Vectors for Word Representation” (Pennington et al., 2014). “Enriching Word Vectors with Subword Information” (Bojanowski et al., 2016) for fastText. “Neural Word Embedding as Implicit Matrix Factorization” (Levy & Goldberg, 2014) for the unifying theory. Jurafsky & Martin, “Speech and Language Processing” (3rd ed.) Chapter 6 for a textbook treatment.
Next → Chapter 9: Sequence Models: RNNs & LSTMs
Static embeddings give each word a fixed point in space, ignoring order entirely — “dog bites man” and “man bites dog” have identical bags of vectors. Chapter 9 introduces recurrent networks that read sequences token by token, maintaining a hidden state that accumulates context. We will meet the vanishing gradient problem head-on, see how LSTM gating solves it, and build the seq2seq-with-attention architecture whose limitations directly motivated the Transformer.
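The order-blindness of a bag of static vectors is easy to confirm. The sketch below uses random 4-d vectors as a stand-in for any static embedding table; `bag_of_vectors` is an illustrative helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical static embeddings for three words
toy_emb = {w: rng.normal(size=4) for w in ["dog", "bites", "man"]}

def bag_of_vectors(sentence):
    """Sentence representation as the average of its word vectors."""
    return np.mean([toy_emb[w] for w in sentence.split()], axis=0)

a = bag_of_vectors("dog bites man")
b = bag_of_vectors("man bites dog")
print(np.allclose(a, b))  # True
```

Averaging discards position, so the two sentences are indistinguishable; a model that reads tokens in sequence does not have this limitation.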