Solutions Appendix
Chapter 6

Discriminative Models

18 Solutions

Detailed solutions for the exercises in Chapter 6. Try solving them yourself before checking the answers.

Exercise 1Pen & Paper
Derive the gradient of binary cross-entropy w.r.t. w in logistic regression; show it equals Xᵀ(p−y)/N.

Solution

With p = σ(Xw+b) and L = −(1/N)Σ[y·log p + (1−y)·log(1−p)], use the chain rule through z = Xw+b. The key simplification is ∂L/∂z = (p − y)/N per sample (the sigmoid and log-loss derivatives combine cleanly). Since ∂z/∂w = X, we get ∂L/∂w = Xᵀ(p−y)/N. The gradient is the input matrix times the prediction error — the same predicted-minus-true structure as softmax cross-entropy.

Exercise 2Pen & Paper
Prove the SVM hard-margin objective minimize ½‖w‖² s.t. y_i(w·x_i+b)≥1 maximizes the margin 2/‖w‖.

Solution

The signed distance from a point to the hyperplane w·x+b=0 is (w·x+b)/‖w‖. The constraint forces the closest points of each class to satisfy |w·x+b| = 1, so their distance to the boundary is 1/‖w‖ and the gap between the two class margins is 2/‖w‖. Maximizing 2/‖w‖ is equivalent to minimizing ‖w‖, hence ½‖w‖² (a convenient convex, differentiable form). The constraints fix the scale so the margin is well-defined.

Exercise 3Pen & Paper
Compute information gain for a binary split on XOR. Which feature gives higher gain?

Solution

The XOR labels are {(0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0}. Splitting on feature 1 puts {0,1} in one branch and {1,0} in the other — each branch is still a 50/50 mix, so entropy stays at 1 bit and the information gain is 0. Splitting on feature 2 gives the same. Neither feature gives any gain individually: this is exactly why a single-feature (axis-aligned) split, and a linear model, cannot solve XOR — the signal lives only in the interaction of the two features.

Exercise 4Pen & Paper
Show bagging reduces variance to (ρ + (1−ρ)/B)σ². What happens as B→∞?

Solution

Averaging B identically-distributed trees each with variance σ² and pairwise correlation ρ gives Var = ρσ² + (1−ρ)σ²/B. The second term (the independent part) vanishes as B grows, but the first term ρσ² remains. As B→∞, Var → ρσ² — a floor set by the correlation between trees. This is why random forests inject randomness (feature subsampling) to lower ρ: adding more trees alone cannot beat the ρσ² floor.

Exercise 5Pen & Paper
Spam filter: 95% recall, 80% precision, 5% spam base rate. What is its accuracy?

Solution

Per 100 emails, 5 are spam. Recall 0.95 → TP = 4.75, FN = 0.25. Precision 0.80 → FP = TP·(1−0.8)/0.8 = 1.19. TN = 95 − 1.19 = 93.81. Accuracy = (TP+TN)/100 = (4.75+93.81)/100 ≈ 98.6%. Note how high accuracy can coexist with imperfect precision/recall when the negative class dominates — accuracy alone is misleading on imbalanced data.

Exercise 6Pen & Paper
Why does L2 correspond to a Gaussian prior in MAP? What prior corresponds to L1?

Solution

MAP adds −log p(θ) to the loss. A zero-mean Gaussian prior gives −log p(θ) ∝ ‖θ‖² → L2 (ridge). A zero-mean Laplace (double-exponential) prior gives −log p(θ) ∝ ‖θ‖₁ → L1 (lasso). The Laplace prior is sharply peaked at zero with heavier tails, which is why L1 drives many weights exactly to zero (sparsity) while L2 merely shrinks them.

Exercise 7Pen & Paper
Gini impurity G=1−Σp_k²: compute for a pure node and a maximally impure binary node. Relation to entropy?

Solution

Pure node (all one class): G = 1 − 1² = 0. Maximally impure binary (50/50): G = 1 − (0.25+0.25) = 0.5. Gini and entropy both measure impurity, are both zero for pure nodes, and both maximized at the uniform split; Gini is essentially a quadratic (Taylor) approximation of entropy and is cheaper to compute, which is why CART uses it. They almost always pick the same splits.

Exercise 8Pen & Paper
In gradient boosting with squared error, show the pseudo-residuals equal the negative gradient.

Solution

With loss L = ½(y − F)², the gradient w.r.t. the prediction F is ∂L/∂F = −(y − F). The negative gradient is therefore (y − F) = r, exactly the residual y − F_{m−1}(x). So fitting the next tree to the residuals is fitting it to the negative gradient — which is why gradient boosting generalizes: for other losses you fit the negative gradient, not the literal residual.

Exercise 9Pen & Paper
Why can't logistic regression solve XOR geometrically? Minimum fix?

Solution

Logistic regression draws a single straight decision boundary, but XOR's two positive points sit at opposite corners of the square, so no line separates positives from negatives. The minimum fix is one interaction feature, x₁·x₂ (or a small basis expansion): in the augmented space the classes become linearly separable. This is the toy version of why we need nonlinear features / hidden layers.

Exercise 10Pen & Paper
In one-vs-rest with K classes, describe a case where every binary classifier abstains (<0.5).

Solution

If a test point lies in a region that every classifier was trained to treat as 'rest' — for instance, an outlier far from all class clusters, or a point in a gap between classes — each binary model can output < 0.5 for its own class. With no class exceeding 0.5, all abstain and one-vs-rest has no confident winner (you fall back to argmax of the raw scores). This exposes one-vs-rest's lack of a calibrated joint distribution over classes.

Exercise 11Code
Implement logistic regression from scratch; compare accuracy/time to sklearn on breast cancer.

Solution

Gradient descent on BCE with the Xᵀ(p−y)/N gradient from Exercise 1 converges to essentially the same weights and ~95–97% accuracy as sklearn's LogisticRegression. Your hand-written version is slower (sklearn uses optimized solvers like LBFGS) but matches accuracy, confirming the implementation.

Exercise 12Code
Kernel-trick logistic regression with RBF on the circles dataset; compare to linear.

Solution

Replacing w·x with Σ_i α_i K(x_i, x) using an RBF kernel lets the model learn a nonlinear (circular) boundary, separating the concentric circles that linear LR cannot. Accuracy jumps from ~50% (linear, chance) to near 100% (RBF), illustrating how kernels implicitly lift data into a higher-dimensional separable space.

Exercise 13Code
Build a decision tree from scratch (Gini, early stopping) on iris; plot train/test accuracy vs max_depth.

Solution

Recursively split on the feature/threshold that maximizes Gini gain, stopping at max_depth or pure nodes. Train accuracy rises monotonically with depth toward 100%, but test accuracy peaks at a moderate depth (2–3 for iris) then declines as the tree overfits — the classic depth/overfitting curve, and the rationale for pruning and ensembles.

Exercise 14Code Lab
Random forest vs gradient-boosted trees on adult income; feature importances, ROC-AUC, PR-AUC.

Solution

Both ensembles strongly outperform a single tree. Gradient boosting usually edges out random forest on tabular AUC. Feature importances highlight capital-gain, marital-status, education. On this imbalanced dataset PR-AUC is the more informative metric than ROC-AUC, because ROC can look optimistic when negatives dominate — PR-AUC focuses on performance on the rare positive (high-income) class.

Exercise 15Code
Calibration: SVM and random forest on 10:1 imbalanced data; reliability diagrams; isotonic calibration.

Solution

Raw SVM scores (margins) are not probabilities and are poorly calibrated; random forest probabilities are biased toward 0.5. Reliability diagrams show both deviating from the diagonal. Isotonic regression (or Platt scaling) on a held-out set pulls predicted probabilities onto the diagonal, so a predicted 0.8 truly means ~80% — essential when downstream decisions use the probabilities.

Exercise 16Code
Gradient boosting from scratch for classification (log-loss, pseudo-residuals y−p); compare to sklearn.

Solution

Initialize with the log-odds of the base rate, then iteratively fit regression trees to the pseudo-residuals (y − p), where p = σ(F), and add them with a learning rate. The loss curve and accuracy track sklearn's GradientBoostingClassifier closely, validating that fitting trees to negative-gradient residuals implements boosting for log-loss.

Exercise 17Code Lab
Linear probing: BERT [CLS] embeddings on SST-2 at layers 1,4,8,12; accuracy vs depth.

Solution

Training a logistic-regression probe on frozen [CLS] embeddings shows accuracy rising with depth: early layers encode shallow/lexical features (lower probe accuracy), middle-to-late layers encode task-relevant semantics (higher accuracy), often peaking around the upper-middle layers before the top layers specialize to the pretraining objective. This probes where sentiment information lives in the network.

Exercise 18Code (Challenge)
Full tabular pipeline: clean, encode, scale; train LR/RF/XGBoost; proper splits; calibration plot.

Solution

A complete pipeline imputes missing values, one-hot/ordinal-encodes categoricals, scales numerics (for LR), and fits all three models with a train/validation/test split (validation for tuning, test untouched until the end). XGBoost typically wins on tabular data; the final calibration plot for the best model, after isotonic/Platt calibration, should hug the diagonal — demonstrating an end-to-end, leak-free, well-calibrated workflow.