Discriminative Models
Detailed solutions for the exercises in Chapter 6. Try solving them yourself before checking the answers.
Solution
With p = σ(Xw+b) and L = −(1/N)Σ[y·log p + (1−y)·log(1−p)], use the chain rule through z = Xw+b. The key simplification is ∂L/∂z = (p − y)/N per sample (the sigmoid and log-loss derivatives combine cleanly). Since ∂z/∂w = X, we get ∂L/∂w = Xᵀ(p−y)/N. The gradient is the input matrix times the prediction error — the same predicted-minus-true structure as softmax cross-entropy.
Solution
The signed distance from a point to the hyperplane w·x+b=0 is (w·x+b)/‖w‖. The constraint forces the closest points of each class to satisfy |w·x+b| = 1, so their distance to the boundary is 1/‖w‖ and the gap between the two class margins is 2/‖w‖. Maximizing 2/‖w‖ is equivalent to minimizing ‖w‖, hence ½‖w‖² (a convenient convex, differentiable form). The constraints fix the scale so the margin is well-defined.
Solution
The XOR labels are {(0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0}. Splitting on feature 1 puts {0,1} in one branch and {1,0} in the other — each branch is still a 50/50 mix, so entropy stays at 1 bit and the information gain is 0. Splitting on feature 2 gives the same. Neither feature gives any gain individually: this is exactly why a single-feature (axis-aligned) split, and a linear model, cannot solve XOR — the signal lives only in the interaction of the two features.
Solution
Averaging B identically-distributed trees each with variance σ² and pairwise correlation ρ gives Var = ρσ² + (1−ρ)σ²/B. The second term (the independent part) vanishes as B grows, but the first term ρσ² remains. As B→∞, Var → ρσ² — a floor set by the correlation between trees. This is why random forests inject randomness (feature subsampling) to lower ρ: adding more trees alone cannot beat the ρσ² floor.
Solution
Per 100 emails, 5 are spam. Recall 0.95 → TP = 4.75, FN = 0.25. Precision 0.80 → FP = TP·(1−0.8)/0.8 = 1.19. TN = 95 − 1.19 = 93.81. Accuracy = (TP+TN)/100 = (4.75+93.81)/100 ≈ 98.6%. Note how high accuracy can coexist with imperfect precision/recall when the negative class dominates — accuracy alone is misleading on imbalanced data.
Solution
MAP adds −log p(θ) to the loss. A zero-mean Gaussian prior gives −log p(θ) ∝ ‖θ‖² → L2 (ridge). A zero-mean Laplace (double-exponential) prior gives −log p(θ) ∝ ‖θ‖₁ → L1 (lasso). The Laplace prior is sharply peaked at zero with heavier tails, which is why L1 drives many weights exactly to zero (sparsity) while L2 merely shrinks them.
Solution
Pure node (all one class): G = 1 − 1² = 0. Maximally impure binary (50/50): G = 1 − (0.25+0.25) = 0.5. Gini and entropy both measure impurity, are both zero for pure nodes, and both maximized at the uniform split; Gini is essentially a quadratic (Taylor) approximation of entropy and is cheaper to compute, which is why CART uses it. They almost always pick the same splits.
Solution
With loss L = ½(y − F)², the gradient w.r.t. the prediction F is ∂L/∂F = −(y − F). The negative gradient is therefore (y − F) = r, exactly the residual y − F_{m−1}(x). So fitting the next tree to the residuals is fitting it to the negative gradient — which is why gradient boosting generalizes: for other losses you fit the negative gradient, not the literal residual.
Solution
Logistic regression draws a single straight decision boundary, but XOR's two positive points sit at opposite corners of the square, so no line separates positives from negatives. The minimum fix is one interaction feature, x₁·x₂ (or a small basis expansion): in the augmented space the classes become linearly separable. This is the toy version of why we need nonlinear features / hidden layers.
Solution
If a test point lies in a region that every classifier was trained to treat as 'rest' — for instance, an outlier far from all class clusters, or a point in a gap between classes — each binary model can output < 0.5 for its own class. With no class exceeding 0.5, all abstain and one-vs-rest has no confident winner (you fall back to argmax of the raw scores). This exposes one-vs-rest's lack of a calibrated joint distribution over classes.
Solution
Gradient descent on BCE with the Xᵀ(p−y)/N gradient from Exercise 1 converges to essentially the same weights and ~95–97% accuracy as sklearn's LogisticRegression. Your hand-written version is slower (sklearn uses optimized solvers like LBFGS) but matches accuracy, confirming the implementation.
Solution
Replacing w·x with Σ_i α_i K(x_i, x) using an RBF kernel lets the model learn a nonlinear (circular) boundary, separating the concentric circles that linear LR cannot. Accuracy jumps from ~50% (linear, chance) to near 100% (RBF), illustrating how kernels implicitly lift data into a higher-dimensional separable space.
Solution
Recursively split on the feature/threshold that maximizes Gini gain, stopping at max_depth or pure nodes. Train accuracy rises monotonically with depth toward 100%, but test accuracy peaks at a moderate depth (2–3 for iris) then declines as the tree overfits — the classic depth/overfitting curve, and the rationale for pruning and ensembles.
Solution
Both ensembles strongly outperform a single tree. Gradient boosting usually edges out random forest on tabular AUC. Feature importances highlight capital-gain, marital-status, education. On this imbalanced dataset PR-AUC is the more informative metric than ROC-AUC, because ROC can look optimistic when negatives dominate — PR-AUC focuses on performance on the rare positive (high-income) class.
Solution
Raw SVM scores (margins) are not probabilities and are poorly calibrated; random forest probabilities are biased toward 0.5. Reliability diagrams show both deviating from the diagonal. Isotonic regression (or Platt scaling) on a held-out set pulls predicted probabilities onto the diagonal, so a predicted 0.8 truly means ~80% — essential when downstream decisions use the probabilities.
Solution
Initialize with the log-odds of the base rate, then iteratively fit regression trees to the pseudo-residuals (y − p), where p = σ(F), and add them with a learning rate. The loss curve and accuracy track sklearn's GradientBoostingClassifier closely, validating that fitting trees to negative-gradient residuals implements boosting for log-loss.
Solution
Training a logistic-regression probe on frozen [CLS] embeddings shows accuracy rising with depth: early layers encode shallow/lexical features (lower probe accuracy), middle-to-late layers encode task-relevant semantics (higher accuracy), often peaking around the upper-middle layers before the top layers specialize to the pretraining objective. This probes where sentiment information lives in the network.
Solution
A complete pipeline imputes missing values, one-hot/ordinal-encodes categoricals, scales numerics (for LR), and fits all three models with a train/validation/test split (validation for tuning, test untouched until the end). XGBoost typically wins on tabular data; the final calibration plot for the best model, after isotonic/Platt calibration, should hug the diagonal — demonstrating an end-to-end, leak-free, well-calibrated workflow.