Linear Algebra for Machine Learning
Detailed solutions for the exercises in Chapter 1. Try solving them yourself before checking the answers.
Solution
The dot product is a·b = (1)(3)+(2)(0) = 3. The magnitudes are |a| = √(1+4) = √5 ≈ 2.236 and |b| = 3. The cosine of the angle is cosθ = (a·b)/(|a||b|) = 3/(2.236·3) = 0.447, so θ = arccos(0.447) ≈ 63.4°. Geometrically, b points along the positive x-axis and a points up and to the right, separated by about 63°.
Solution
L1 = |3|+|−4|+0+|5|+|−2| = 14. L2 = √(9+16+0+25+4) = √54 ≈ 7.35. L∞ = max|v_i| = 5. The L1 norm is largest and L∞ is smallest — always true for a fixed vector, since L1 ≥ L2 ≥ L∞ (the more components you sum in full, the larger the total).
Solution
|v| = √(1+1+1+1) = 2, so the unit vector is v/|v| = [0.5, 0.5, 0.5, 0.5]. Its L2 norm is √(4·0.25) = √1 = 1. ✓
Solution
a·b = 0+0+1 = 1; |a| = |b| = √2. cos = 1/(√2·√2) = 1/2, so the angle is 60°. The vectors are moderately similar: they share the third coordinate but differ in the first two, placing them 60° apart.
Solution
The projection onto unit direction d=[1,0] is (u·d)d = 4·[1,0] = [4,0]. The perpendicular component is u − [4,0] = [0,3]. Together [4,0]+[0,3] reconstructs u, decomposing it into x and y parts.
Solution
Each entry is a row of A dotted with a column of B:
Solution
Rank 1. Row 2 = 2×row 1 and row 3 = 3×row 1, so all rows are multiples of [1,2,3] — there is only one linearly independent row. The matrix is the outer product [1,2,3]ᵀ scaled column-wise, hence rank one.
Solution
Work entry-wise. ((AB)ᵀ)_{ij} = (AB)_{ji} = Σ_k A_{jk}B_{ki}. Meanwhile (BᵀAᵀ)_{ij} = Σ_k (Bᵀ)_{ik}(Aᵀ)_{kj} = Σ_k B_{ki}A_{jk}. The two sums are identical term by term, so (AB)ᵀ = BᵀAᵀ. The order reverses because transposing swaps which index is summed against which.
Solution
A matrix M is symmetric iff Mᵀ = M. Here (WᵀW)ᵀ = Wᵀ(Wᵀ)ᵀ = WᵀW, using (Wᵀ)ᵀ = W and the reversal rule from Exercise 8. Since it equals its own transpose, WᵀW is symmetric for any W (even non-square).
Solution
Element-wise: A⊙B = [[5,0],[0,20]]. Matrix product: AB = [[5,10],[15,20]]. Because B = 5I (a scalar multiple of the identity), AB = 5A — multiplying by a scalar matrix simply scales A, whereas the Hadamard product only touches the diagonal where B is nonzero.
Solution
det([[2−λ,1],[1,2−λ]]) = (2−λ)² − 1 = 0, so (2−λ) = ±1, giving λ = 1 and λ = 3. The eigenvectors are [1,−1]/√2 (for λ=1) and [1,1]/√2 (for λ=3) — a symmetric matrix with orthogonal eigenvectors, as expected.
Solution
The matrix is the outer product [2,1]ᵀ·[1,2], so it has a single nonzero singular value. With u = [2,1]/√5 and v = [1,2]/√5, the singular value is σ = |[2,1]|·|[1,2]| = √5·√5 = 5. Thus A = 5·u vᵀ, with σ₁ = 5 and σ₂ = 0, confirming rank 1.
Solution
Broadcasting handles the (4,) mean against the (100,4) matrix automatically:
X - X.mean(axis=0) # mean over rows -> shape (4,), broadcast to (100,4)Each column is centered to zero mean; this is exactly the centering step inside layer/batch normalization and PCA.
Solution
Normalize each row to unit length, then the cosine matrix is a single matrix product; mask the diagonal so a vector isn't matched with itself:
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
S = Xn @ Xn.T
np.fill_diagonal(S, -np.inf)
i, j = np.unravel_index(S.argmax(), S.shape)The pair (i,j) with the largest off-diagonal cosine is the most similar. This is the core operation behind dense retrieval (Chapter 29).
Solution
Sample from the covariance, center the data, form the empirical covariance, and take the eigenvector of its largest eigenvalue. The covariance [[4,2],[2,1]] is rank-deficient in direction (its eigenvector for the large eigenvalue is the [2,1] direction), and [2,1]/√5 = [0.894, 0.447] — matching the first principal component up to sign.
Xc = X - X.mean(0)
w, V = np.linalg.eigh(np.cov(Xc.T))
pc1 = V[:, -1] # eigenvector of the largest eigenvalueSolution
Truncate the SVD to the top k singular values. The reconstruction error in Frobenius norm equals the square root of the sum of the dropped squared singular values, √(Σ_{i>k} σ_i²). The storage ratio is (256k + k + k·256)/(256·256) ≈ 2·256·k/256². For k=10 the ratio is ≈7.8%, for k=50 ≈39%. For a random matrix the spectrum is roughly flat, so error stays high; for structured matrices it drops fast — the basis of low-rank compression and LoRA.
Solution
The Einstein-summation string names the shared index m and sums over it per batch b:
torch.einsum('bnm,bmk->bnk', A, B) # == torch.bmm(A, B)einsum makes the contraction explicit and matches torch.bmm to floating-point precision.
Solution
Global norm clipping scales the whole tensor by min(1, max_norm/‖g‖):
g * min(1.0, 1.0 / (g.norm() + 1e-12))If the norm already exceeds 1 the factor shrinks it to exactly 1; otherwise the gradient is left unchanged. This is the standard guard against exploding gradients (Chapter 15).
Solution
Energy is measured by squared singular values. Compute the cumulative fraction Σ_{i≤k}σ_i² / Σ_iσ_i² and find the smallest k reaching 0.90. Pretrained weight matrices typically have rapidly decaying spectra, so a small fraction of the rank often captures most of the energy — the empirical justification for low-rank adapters (LoRA): the update can live in a low-dimensional subspace.
Solution
Repeatedly apply A to a random vector and renormalize; the iterate converges to the dominant eigenvector, and the Rayleigh quotient gives the eigenvalue:
for _ in range(1000):
x = A @ x; x /= np.linalg.norm(x)
lam = x @ A @ xlam converges to the largest eigenvalue from np.linalg.eigvalsh. Convergence speed depends on the ratio of the top two eigenvalues.