Solutions Appendix
Chapter 1

Linear Algebra for Machine Learning

20 Solutions

Detailed solutions for the exercises in Chapter 1. Try solving them yourself before checking the answers.

Exercise 1Geometry
Draw vectors a=[1,2] and b=[3,0]; compute their dot product and the angle between them.

Solution

The dot product is a·b = (1)(3)+(2)(0) = 3. The magnitudes are |a| = √(1+4) = √5 ≈ 2.236 and |b| = 3. The cosine of the angle is cosθ = (a·b)/(|a||b|) = 3/(2.236·3) = 0.447, so θ = arccos(0.447) ≈ 63.4°. Geometrically, b points along the positive x-axis and a points up and to the right, separated by about 63°.

Exercise 2Norms
Compute L1, L2, and L∞ norms of v=[3,-4,0,5,-2]. Which is largest? Smallest?

Solution

L1 = |3|+|−4|+0+|5|+|−2| = 14. L2 = √(9+16+0+25+4) = √54 ≈ 7.35. L∞ = max|v_i| = 5. The L1 norm is largest and L∞ is smallest — always true for a fixed vector, since L1 ≥ L2 ≥ L∞ (the more components you sum in full, the larger the total).

Exercise 3Unit Vectors
Normalize v=[1,1,1,1] to a unit vector and verify its L2 norm is 1.

Solution

|v| = √(1+1+1+1) = 2, so the unit vector is v/|v| = [0.5, 0.5, 0.5, 0.5]. Its L2 norm is √(4·0.25) = √1 = 1. ✓

Exercise 4Cosine
Compute cosine similarity between a=[1,0,1] and b=[0,1,1]; interpret geometrically.

Solution

a·b = 0+0+1 = 1; |a| = |b| = √2. cos = 1/(√2·√2) = 1/2, so the angle is 60°. The vectors are moderately similar: they share the third coordinate but differ in the first two, placing them 60° apart.

Exercise 5Projection
Project u=[4,3] onto the x-axis direction [1,0]; find the perpendicular component.

Solution

The projection onto unit direction d=[1,0] is (u·d)d = 4·[1,0] = [4,0]. The perpendicular component is u − [4,0] = [0,3]. Together [4,0]+[0,3] reconstructs u, decomposing it into x and y parts.

Exercise 6Matrix Multiply
By hand, compute AB where A=[[1,2],[3,4]], B=[[5,6],[7,8]].

Solution

Each entry is a row of A dotted with a column of B:

AB = [[1·5+2·7, 1·6+2·8], [3·5+4·7, 3·6+4·8]] = [[19, 22], [43, 50]]
Exercise 7Rank
What is the rank of [[1,2,3],[2,4,6],[3,6,9]]? Why?

Solution

Rank 1. Row 2 = 2×row 1 and row 3 = 3×row 1, so all rows are multiples of [1,2,3] — there is only one linearly independent row. The matrix is the outer product [1,2,3]ᵀ scaled column-wise, hence rank one.

Exercise 8Transpose
Show that (AB)ᵀ = BᵀAᵀ for 2×2 matrices A and B.

Solution

Work entry-wise. ((AB)ᵀ)_{ij} = (AB)_{ji} = Σ_k A_{jk}B_{ki}. Meanwhile (BᵀAᵀ)_{ij} = Σ_k (Bᵀ)_{ik}(Aᵀ)_{kj} = Σ_k B_{ki}A_{jk}. The two sums are identical term by term, so (AB)ᵀ = BᵀAᵀ. The order reverses because transposing swaps which index is summed against which.

Exercise 9Symmetric
Prove that WᵀW is always symmetric.

Solution

A matrix M is symmetric iff Mᵀ = M. Here (WᵀW)ᵀ = Wᵀ(Wᵀ)ᵀ = WᵀW, using (Wᵀ)ᵀ = W and the reversal rule from Exercise 8. Since it equals its own transpose, WᵀW is symmetric for any W (even non-square).

Exercise 10Hadamard vs Matmul
For A=[[1,2],[3,4]] and B=[[5,0],[0,5]], compute A⊙B and AB. What do you notice?

Solution

Element-wise: A⊙B = [[5,0],[0,20]]. Matrix product: AB = [[5,10],[15,20]]. Because B = 5I (a scalar multiple of the identity), AB = 5A — multiplying by a scalar matrix simply scales A, whereas the Hadamard product only touches the diagonal where B is nonzero.

Exercise 11Eigenvalues
Find eigenvalues of [[2,1],[1,2]] by solving det(A−λI)=0.

Solution

det([[2−λ,1],[1,2−λ]]) = (2−λ)² − 1 = 0, so (2−λ) = ±1, giving λ = 1 and λ = 3. The eigenvectors are [1,−1]/√2 (for λ=1) and [1,1]/√2 (for λ=3) — a symmetric matrix with orthogonal eigenvectors, as expected.

Exercise 12SVD by Hand
SVD-decompose the rank-1 matrix [[2,4],[1,2]].

Solution

The matrix is the outer product [2,1]ᵀ·[1,2], so it has a single nonzero singular value. With u = [2,1]/√5 and v = [1,2]/√5, the singular value is σ = |[2,1]|·|[1,2]| = √5·√5 = 5. Thus A = 5·u vᵀ, with σ₁ = 5 and σ₂ = 0, confirming rank 1.

Exercise 13Code: Broadcasting
Subtract the per-column mean of a (100,4) activation matrix in one line of NumPy.

Solution

Broadcasting handles the (4,) mean against the (100,4) matrix automatically:

X - X.mean(axis=0)   # mean over rows -> shape (4,), broadcast to (100,4)

Each column is centered to zero mean; this is exactly the centering step inside layer/batch normalization and PCA.

Exercise 14Code: Cosine Search
Embed 10 random words as 8-d vectors, build a cosine-similarity matrix, find the most similar pair.

Solution

Normalize each row to unit length, then the cosine matrix is a single matrix product; mask the diagonal so a vector isn't matched with itself:

Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
S  = Xn @ Xn.T
np.fill_diagonal(S, -np.inf)
i, j = np.unravel_index(S.argmax(), S.shape)

The pair (i,j) with the largest off-diagonal cosine is the most similar. This is the core operation behind dense retrieval (Chapter 29).

Exercise 15Code: PCA
Generate 500 points with covariance [[4,2],[2,1]], run PCA, verify the first PC aligns with [0.894, 0.447].

Solution

Sample from the covariance, center the data, form the empirical covariance, and take the eigenvector of its largest eigenvalue. The covariance [[4,2],[2,1]] is rank-deficient in direction (its eigenvector for the large eigenvalue is the [2,1] direction), and [2,1]/√5 = [0.894, 0.447] — matching the first principal component up to sign.

Xc = X - X.mean(0)
w, V = np.linalg.eigh(np.cov(Xc.T))
pc1 = V[:, -1]   # eigenvector of the largest eigenvalue
Exercise 16Code: SVD compression
Compress a 256×256 matrix with k=10, 50; report error and compression ratio.

Solution

Truncate the SVD to the top k singular values. The reconstruction error in Frobenius norm equals the square root of the sum of the dropped squared singular values, √(Σ_{i>k} σ_i²). The storage ratio is (256k + k + k·256)/(256·256) ≈ 2·256·k/256². For k=10 the ratio is ≈7.8%, for k=50 ≈39%. For a random matrix the spectrum is roughly flat, so error stays high; for structured matrices it drops fast — the basis of low-rank compression and LoRA.

Exercise 17Code: Einsum
Implement batched matmul (B,N,M)×(B,M,K)→(B,N,K) with einsum; verify against torch.bmm.

Solution

The Einstein-summation string names the shared index m and sums over it per batch b:

torch.einsum('bnm,bmk->bnk', A, B)   # == torch.bmm(A, B)

einsum makes the contraction explicit and matches torch.bmm to floating-point precision.

Exercise 18Code: Gradient norm
Compute the L2 norm of a (768,3072) gradient and clip it to max_norm=1.0 in one line.

Solution

Global norm clipping scales the whole tensor by min(1, max_norm/‖g‖):

g * min(1.0, 1.0 / (g.norm() + 1e-12))

If the norm already exceeds 1 the factor shrinks it to exactly 1; otherwise the gradient is left unchanged. This is the standard guard against exploding gradients (Chapter 15).

Exercise 19Code: LoRA rank analysis
SVD a pretrained nn.Linear weight; how many singular values capture 90% of the energy?

Solution

Energy is measured by squared singular values. Compute the cumulative fraction Σ_{i≤k}σ_i² / Σ_iσ_i² and find the smallest k reaching 0.90. Pretrained weight matrices typically have rapidly decaying spectra, so a small fraction of the rank often captures most of the energy — the empirical justification for low-rank adapters (LoRA): the update can live in a low-dimensional subspace.

Exercise 20Challenge: Power Iteration
Implement power iteration for the largest eigenvalue of a 100×100 symmetric matrix; compare to eigvalsh.

Solution

Repeatedly apply A to a random vector and renormalize; the iterate converges to the dominant eigenvector, and the Rayleigh quotient gives the eigenvalue:

for _ in range(1000):
    x = A @ x; x /= np.linalg.norm(x)
lam = x @ A @ x

lam converges to the largest eigenvalue from np.linalg.eigvalsh. Convergence speed depends on the ratio of the top two eigenvalues.