Part II: Classical ML & Representations
Chapter 6

Classical Machine Learning

Linear models, regularization, and the bias-variance tradeoff
18 Exercises
6.1

Before building any classifier, we must answer a fundamental question: should our model learn the full joint distribution P(X, Y), or should it go straight for the conditional P(Y|X)? This choice shapes everything — architecture, training objective, data requirements, and what questions the model can answer.

Discriminative P(Y | X)Generative P(X, Y) = P(X | Y) P(Y)
Models the boundary between classes directlyModels how data is generated for each class
Typically needs less data to fit wellNeeds more data; learns richer structure
Cannot generate new samples of XCan sample new X given Y
Cannot answer P(X) queriesCan compute P(X) and detect anomalies
Examples: logistic regression, SVM, neural netsExamples: Naïve Bayes, GMM, VAE, GPT
Training: minimise cross-entropy / hinge lossTraining: maximise log-likelihood P(X, Y; θ)

For most classification tasks — spam detection, sentiment analysis, named-entity recognition — we only need P(Y|X). Discriminative models learn a direct mapping and typically outperform generative models when labelled data is plentiful. Generative models shine when you have limited labels, want to generate new data, or need anomaly detection.

Why LLMs Are Both
A language model like GPT is trained generatively: it learns P(x₁, x₂, …, xₙ) = ∏ P(xₜ | x<t). But it is used discriminatively at inference time when performing classification: given a prompt and possible labels, the model returns P(label | prompt).
This duality — generative training, discriminative use — is one of the deepest and most powerful ideas in modern ML. You will see it again in Chapters 8, 17, and 23.
6.2

Logistic regression is the workhorse of binary classification and the building block of deep learning output layers. Despite its name, it is a classification model, not a regression model. Understanding it deeply — geometrically, probabilistically, and computationally — is essential before studying neural networks.

The Model

We want to predict P(Y=1 | X=x). The sigmoid function squashes a linear combination of features into [0, 1]:

P(Y=1 | x; w, b) = σ(w·x + b) = 1 / (1 + exp(-(w·x + b)))

The linear part w·x + b is called the log-odds or logit. The decision boundary is the hyperplane where the logit equals zero — where the model is equally uncertain between the two classes.

Intuition: Geometric Interpretation
Logistic regression finds a hyperplane that separates the two classes. Points on one side get probability > 0.5; points on the other get < 0.5. The magnitude of the weight vector ‖w‖ controls how quickly probability drops off as you move away from the boundary.
Every neuron in a neural network's hidden layer is performing logistic regression on a learned representation. The depth of the network allows the final logistic layer to operate in a space where the classes are linearly separable, even if they aren't in the original input space.

Training: Maximum Likelihood = Cross-Entropy Loss

From Chapter 3, the MLE objective for logistic regression is to maximise the log-likelihood of the training labels. With N examples:

L(w, b) = (1/N) Σᵢ [ yᵢ log σ(zᵢ) + (1-yᵢ) log(1-σ(zᵢ)) ]
Loss = -L(w, b) = Binary Cross-Entropy
PythonLogistic regression from scratch
import numpy as np

class LogisticRegression:  # from scratch, no sklearn
    def __init__(self, lr=0.01, n_iter=1000, lam=0.0):
        self.lr, self.n_iter, self.lam = lr, n_iter, lam
        self.w = self.b = None

    def _sigmoid(self, z):  # stable sigmoid
        return np.where(z >= 0, 1 / (1 + np.exp(-z)),
                   np.exp(z) / (1 + np.exp(z)))

    def fit(self, X, y):  # X: (N, D)  y: (N,) binary
        N, D = X.shape
        self.w = np.zeros(D); self.b = 0.0
        for _ in range(self.n_iter):
            z    = X @ self.w + self.b            # (N,) logits
            p    = self._sigmoid(z)          # (N,) probabilities
            err  = p - y                    # (N,) residual
            dw   = (X.T @ err) / N + self.lam * self.w # gradient + L2 reg
            db   = err.mean()
            self.w -= self.lr * dw
            self.b -= self.lr * db

    def predict_proba(self, X):
        return self._sigmoid(X @ self.w + self.b)

    def predict(self, X, threshold=0.5): 
        return (self.predict_proba(X) >= threshold).astype(int)

# ---- Demo on linearly separable 2D data ----
np.random.seed(42)
X0 = np.random.randn(100, 2) - 1  # class 0 centred at (-1,-1)
X1 = np.random.randn(100, 2) + 1  # class 1 centred at (+1,+1)
X  = np.vstack([X0, X1])
y  = np.array([0]*100 + [1]*100)

model = LogisticRegression(lr=0.1, n_iter=500)
model.fit(X, y)
acc = (model.predict(X) == y).mean()
print(f"Train accuracy: {acc:.1%}")
# Train accuracy: 98.5%

The Decision Boundary

The decision boundary is where σ(w·x + b) = 0.5, which means w·x + b = 0. This is a hyperplane in feature space. In 2D it's a line; in 3D a plane; in 768D (BERT embeddings) it's a 767-dimensional hyperplane.

PythonVisualising the decision boundary
import matplotlib.pyplot as plt

# Plot points and decision boundary
fig, ax = plt.subplots(figsize=(6, 5)
ax.scatter(X0[:, 0], X0[:, 1], c='steelblue', label='Class 0', alpha=0.6)
ax.scatter(X1[:, 0], X1[:, 1], c='crimson', label='Class 1', alpha=0.6)

# Decision boundary: w[0]*x + w[1]*y + b = 0  =>  y = -(w[0]*x + b)/w[1]
x_line = np.linspace(-4, 4, 100)
y_line = -(model.w[0] * x_line + model.b) / model.w[1]
ax.plot(x_line, y_line, 'k-', lw=2, label='Decision boundary')

# Probability contours (shows the sigmoid 'squash')
xx, yy = np.meshgrid(np.linspace(-4, 4, 200), np.linspace(-4, 4, 200)
Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
ax.contourf(xx, yy, Z, alpha=0.15, cmap='RdBu_r')
ax.legend(); plt.tight_layout(); plt.savefig('boundary.png', dpi=150)

Regularisation: L1 and L2

Without regularisation, logistic regression can overfit to training data by growing its weights without bound. L2 regularisation (ridge) penalises large weights and corresponds to a Gaussian prior on w. L1 (lasso) promotes sparsity and corresponds to a Laplace prior.

Loss_L2 = BCE + λ‖w‖² / 2 Loss_L1 = BCE + λ‖w‖₁
PythonRegularisation path — how λ shapes the weights
from sklearn.linear_model import LogisticRegression as SklearnLR
import numpy as np

# Train with different L2 regularisation strengths (C = 1/λ in sklearn)
lambdas = np.logspace(-3, 3, 50)
coef_norms = []
for lam in lambdas:
    m = SklearnLR(C=1/lam, max_iter=1000, solver='lbfgs')
    m.fit(X, y)
    coef_norms.append(np.linalg.norm(m.coef_))  # tracks L2 norm of weights

# High λ → small weights → underfitting (high bias)
# Low  λ → large weights → overfitting (high variance)
print(f"λ=0.001: ‖w‖={coef_norms[0]:.3f}")
# λ=0.001: ‖w‖=2.847  (almost unregularised)
print(f"λ=1000:  ‖w‖={coef_norms[-1]:.3f}")
# λ=1000:  ‖w‖=0.001  (heavily penalised → near-zero weights)

Multiclass Extension: Softmax Regression

Logistic regression extends to K classes through the softmax function. Each class gets its own weight vector wₖ; the probability of class k is softmax(Wₓ)ₖ. This is exactly what a neural network's final linear layer does.

P(Y=k | x; W) = exp(wₖ·x) / Σⱼ exp(wⱼ·x) (one-vs-all: single matrix multiply)
ML Connection: Every LLM Output Layer Is Softmax Regression
The final layer of every autoregressive language model is softmax regression over the vocabulary: given the hidden state h, compute logits z = W_vocab · h and return softmax(z) as the next-token probability distribution.
The vocabulary matrix W_vocab ∈ ℝ^{V×D} is the largest single tensor in most LLMs (50,257 × 768 = 38.6M parameters for GPT-2 small). It is sometimes tied to the input embedding matrix — the same weights serve double duty.
6.3

SVMs take the logistic regression idea — find a separating hyperplane — and ask a deeper question: among all hyperplanes that correctly separate the training data, which one will generalise best to unseen points? The answer: the one with the largest margin.

The Maximum-Margin Objective

The margin is the distance from the decision boundary to the nearest training points on each side (the support vectors). SVMs maximise this margin. Intuitively, a wider margin means more room for test points to land on the correct side even if they differ slightly from training points.

minimise (1/2)‖w‖² subject to yᵢ(w·xᵢ + b) ≥ 1 for all i

The constraint yᵢ(w·xᵢ + b) ≥ 1 says every training point must be at least 1/‖w‖ away from the boundary on the correct side. Minimising ‖w‖² maximises this margin. The points where the constraint is tight (equality holds) are the support vectors.

Intuition: Why Maximum Margin Generalises
Larger margin → smaller VC dimension → tighter generalisation bound. The SVM is not just heuristically appealing; it has formal PAC-learning theory backing its preference for large margins.
In practice, the margin idea survives into deep learning as the intuition behind large logit gaps. A model that assigns 99% probability to the correct class has a large 'margin' in the softmax sense and typically generalises better than one that assigns 60%.

Soft-Margin SVMs: Allowing Mistakes

Real data is rarely linearly separable. The soft-margin SVM introduces slack variables ξᵢ ≥ 0 that allow points to violate the margin. C controls the trade-off: large C penalises violations heavily (smaller margin, less overfitting tolerance); small C allows violations (larger margin, more robust).

minimise (1/2)‖w‖² + C Σᵢ ξᵢ subject to yᵢ(w·xᵢ + b) ≥ 1 - ξᵢ, ξᵢ ≥ 0
PythonHard vs. soft margin SVM from scratch
import numpy as np

class LinearSVM:  # subgradient descent, primal form
    def __init__(self, C=1.0, lr=0.001, n_iter=1000):
        self.C, self.lr, self.n_iter = C, lr, n_iter
        self.w = self.b = None

    def fit(self, X, y):  # y in {-1, +1}
        N, D = X.shape
        self.w = np.zeros(D); self.b = 0.0
        for _ in range(self.n_iter):
            for i in range(N):  # stochastic
                margin = y[i] * (X[i] @ self.w + self.b)
                if margin >= 1:  # correctly classified with margin
                    self.w -= self.lr * self.w  # regularisation only
                else:  # hinge loss active
                    self.w -= self.lr * (self.w - self.C * y[i] * X[i])
                    self.b += self.lr * self.C * y[i]

    def predict(self, X):
        return np.sign(X @ self.w + self.b)

# Compare C values on a noisy dataset
y_pm = (2 * y - 1)  # convert 0/1 → -1/+1
for C in [0.01, 1.0, 100.0]:
    svm = LinearSVM(C=C, lr=0.0005, n_iter=2000)
    svm.fit(X, y_pm)
    acc = (svm.predict(X) == y_pm).mean()
    margin = 1 / np.linalg.norm(svm.w)
    print(f"C={C:6.2f}  acc={acc:.3f}  margin={margin:.3f}  ‖w‖={np.linalg.norm(svm.w):.3f}")
# C=  0.01  acc=0.870  margin=2.731  ‖w‖=0.366  (wide margin, misses some)
# C=  1.00  acc=0.985  margin=0.942  ‖w‖=1.061  (balanced)
# C=100.00  acc=0.990  margin=0.421  ‖w‖=2.374  (narrow margin, fits training)

The Kernel Trick

Linear SVMs can only find linear boundaries. The kernel trick implicitly maps data into a high-dimensional feature space Φ(x) without computing the mapping explicitly. The dual SVM depends only on dot products xᵢ·xⱼ, which are replaced by K(xᵢ, xⱼ) = Φ(xᵢ)·Φ(xⱼ).

KernelFormula K(x, z)Feature spaceWhen to use
Linearx·zℝᴰ (no change)Linearly separable data
Polynomial(x·z + r)ᵈAll degree-d monomialsStructured interactions
RBF / Gaussianexp(-γ‖x-z‖²)Infinite-dimensionalGeneral, most popular
Sigmoidtanh(α x·z + c)Approximates 1-layer NNNLP (historical)
String kernel∑ sub-sequence matchesSequence feature spaceText without vectorising
PythonKernel SVM with sklearn
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
import numpy as np

# Non-linearly separable: concentric circles
X_circ, y_circ = make_circles(n_samples=400, noise=0.1, factor=0.5, random_state=0)

for kernel in ['linear', 'rbf', 'poly']:
    svc = SVC(kernel=kernel, C=1.0, gamma='scale')
    cv  = cross_val_score(svc, X_circ, y_circ, cv=5, scoring='accuracy')
    print(f"{kernel:8s}: {cv.mean():.3f} ± {cv.std():.3f}")
# linear:   0.508 ± 0.028  (circles not linearly separable)
# rbf:      0.975 ± 0.011  (RBF maps to sphere → linearly separable)
# poly:     0.847 ± 0.031  (degree-3 poly: better than linear, worse than RBF)
ML Connection: Kernel Methods and Attention
Attention in the Transformer can be viewed as a form of kernel regression. The attention score K(q, k) = exp(q·k/√d) is a positive-definite kernel (the softmax-normalized RBF kernel). Each attention head is performing kernel-weighted averaging of values.
This interpretation (Tsai et al., 2019; Choromanski et al., 2021) led to Performer and other linear-attention models that approximate the attention kernel efficiently, reducing complexity from O(T²) to O(T).
6.4

Decision trees partition the feature space into rectangular regions using a sequence of binary splits. Each internal node tests a feature; each leaf assigns a class. They are inherently interpretable, handle mixed feature types, and require no normalisation — but they overfit easily and have high variance.

Information Gain: Choosing the Best Split

At each node, we choose the feature and threshold that maximises the reduction in label uncertainty after the split. We measure uncertainty with entropy H and call the reduction information gain:

IG(S, feature f, threshold t) = H(S) - [ |Sₗ|/|S| · H(Sₗ) + |Sᵣ|/|S| · H(Sᵣ) ]

Alternative impurity measures: Gini impurity G = 1 - Σₖ pₖ² (slightly faster to compute than entropy; similar results). Mean squared error for regression trees.

textCART: Classification and Regression Trees (Breiman et al., 1984) (Pseudocode)
function BuildTree(S, depth):
    if stopping_condition(S, depth):
        return Leaf(majority_class(S))

    best_gain ← 0;  best_split ← None
    for each feature f  and  threshold t:
        Sₗ, Sᵣ ← split(S, f, t)
        gain ← IG(S, Sₗ, Sᵣ)
        if gain > best_gain:
            best_gain ← gain;  best_split ← (f, t)

    Sₗ, Sᵣ ← split(S, best_split)
    return Node(best_split,
               left  = BuildTree(Sₗ, depth+1),
               right = BuildTree(Sᵣ, depth+1))
PythonDecision tree from scratch
import numpy as np

class Node:  # a single node in the tree
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, value=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.value = left, right, value

def entropy(y):  # Shannon entropy of label array
    counts = np.bincount(y); probs = counts[counts > 0] / len(y)
    return -np.sum(probs * np.log2(probs))

def info_gain(parent, left, right):
    n = len(parent); nl, nr = len(left), len(right)
    return entropy(parent) - (nl/n * entropy(left) + nr/n * entropy(right))

class DecisionTreeClassifier:
    def __init__(self, max_depth=10, min_samples=2): 
        self.max_depth = max_depth; self.min_samples = min_samples

    def _best_split(self, X, y):
        best = {'gain': -1}
        for f in range(X.shape[1]):  # each feature
            for thresh in np.unique(X[:, f]):  # each unique value
                lm = X[:, f] <= thresh  # left mask
                if lm.sum() < self.min_samples or (~lm).sum() < self.min_samples: continue
                gain = info_gain(y, y[lm], y[~lm])
                if gain > best['gain']: best = {'gain': gain, 'f': f, 't': thresh}
        return best

    def _build(self, X, y, depth=0): 
        if depth >= self.max_depth or len(np.unique(y)) == 1 or len(y) < self.min_samples:
            return Node(value=np.bincount(y).argmax())  # leaf
        split = self._best_split(X, y)
        if split['gain'] <= 0: return Node(value=np.bincount(y).argmax())
        lm = X[:, split['f']] <= split['t']
        return Node(split['f'], split['t'],
                    left=self._build(X[lm], y[lm], depth+1),
                    right=self._build(X[~lm], y[~lm], depth+1))

    def fit(self, X, y):   self.root = self._build(X, y)
    def _predict_one(self, x, node):
        if node.value is not None: return node.value
        branch = node.left if x[node.feature] <= node.threshold else node.right
        return self._predict_one(x, branch)
    def predict(self, X): return np.array([self._predict_one(x, self.root) for x in X])

Overfitting and Pruning

A fully grown tree memorises the training data. Pruning controls complexity: pre-pruning stops growth early (max_depth, min_samples); post-pruning grows the full tree then prunes sub-trees that don’t improve validation performance. Regularisation via cost-complexity pruning adds a penalty α per leaf.

⚠️
Pitfall: The Variance Trap
A depth-20 tree on 1,000 training points will achieve ~100% training accuracy and often <70% test accuracy. Decision trees have extremely high variance — small changes in training data produce very different trees.
The solution is ensembling (Section 6.5). Random forests and gradient boosting dramatically reduce variance by averaging many trees, each trained on a slightly different view of the data.
6.5

A single decision tree has high variance — small changes in training data produce wildly different trees. Ensemble methods combine many trees, each slightly different, to average out this variance. Two strategies dominate: bagging (train independently, then average) and boosting (train sequentially, each correcting the previous).

Random Forests: Bagging + Feature Randomness

Random forests train B trees, each on a bootstrap sample (sampling N points with replacement) and at each split only considering a random subset of features (√D for classification, D/3 for regression). The two sources of randomness decorrelate the trees, making their average much more stable.

PythonRandom forest from scratch
import numpy as np
from collections import Counter

class RandomForest:
    def __init__(self, n_trees=100, max_depth=10, max_features='sqrt'):
        self.n_trees, self.max_depth, self.max_features = n_trees, max_depth, max_features
        self.trees = []

    def fit(self, X, y):
        N, D = X.shape
        k = int(np.sqrt(D)) if self.max_features == 'sqrt' else D
        for _ in range(self.n_trees):
            # 1. Bootstrap sample (sample N rows with replacement)
            idx  = np.random.choice(N, N, replace=True)
            Xb, yb = X[idx], y[idx]
            # 2. Build a tree with random feature subsets at each split
            tree = DecisionTreeClassifier(max_depth=self.max_depth)
            tree.max_features_k = k   # (set attr for tree to read)
            tree.fit(Xb, yb)
            self.trees.append(tree)

    def predict(self, X):
        # Majority vote across all trees
        votes = np.stack([t.predict(X) for t in self.trees], axis=1)
        return np.array([Counter(row).most_common(1)[0][0] for row in votes])

# Compare single tree vs. forest on wine dataset
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
Xw, yw = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(Xw, yw, test_size=0.3, random_state=42)

dt = DecisionTreeClassifier(max_depth=5); dt.fit(X_tr, y_tr)
print(f"Single tree: {(dt.predict(X_te)==y_te).mean():.3f}")
# Single tree: 0.907  (overfits a little)

rf = RandomForest(n_trees=200, max_depth=5); rf.fit(X_tr, y_tr)
print(f"Random forest: {(rf.predict(X_te)==y_te).mean():.3f}")
# Random forest: 0.981  (ensembling reduces variance dramatically)

Gradient Boosting: Sequential Error Correction

Gradient boosting builds trees sequentially, each fitting the residuals (negative gradient of the loss) of the previous ensemble. Where random forests reduce variance by averaging independent models, gradient boosting reduces bias by iteratively correcting systematic errors.

F_m(x) = F_{m-1}(x) + η · h_m(x) where h_m fits the pseudo-residuals rᵢ = -∂L/∂F_{m-1}(xᵢ)
PythonGradient boosting for regression from scratch
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class GradientBoostingRegressor:
    def __init__(self, n_trees=100, lr=0.1, max_depth=3):
        self.n_trees, self.lr, self.max_depth = n_trees, lr, max_depth
        self.trees = []; self.f0 = None

    def fit(self, X, y):
        self.f0 = y.mean()            # initial prediction: mean of targets
        F = np.full(len(y), self.f0)  # running ensemble prediction
        for _ in range(self.n_trees):
            residuals = y - F          # negative gradient of MSE
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)     # fit the residuals (not y!)
            F += self.lr * tree.predict(X)  # update ensemble
            self.trees.append(tree)

    def predict(self, X):
        pred = np.full(len(X), self.f0)
        for tree in self.trees:
            pred += self.lr * tree.predict(X)
        return pred

# Why max_depth=3? Shallow trees = high bias, low variance.
# Boosting supplies the bias reduction; trees should be weak learners.
# XGBoost, LightGBM, CatBoost all use this idea with massive engineering.
Compare: Bagging vs. Boosting
Bagging (Random Forest): trains trees in parallel on bootstrap samples. Reduces VARIANCE. Each tree is an independent strong learner. Works well even with max_depth=∞. Adding more trees monotonically reduces test error until saturation.
Boosting (GBT, XGBoost): trains trees sequentially. Reduces BIAS. Each tree is a weak learner (max_depth=3). Sensitive to outliers and noisy labels. Can overfit with too many trees or high learning rate. Use early stopping on a validation set.
6.6

Accuracy is a seductive but dangerous metric. On a dataset where 99% of examples are class 0, a classifier that always predicts class 0 achieves 99% accuracy while being completely useless. This section builds the correct evaluation toolkit.

The Confusion Matrix

PythonComplete evaluation toolkit
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             classification_report, confusion_matrix)

def evaluate_classifier(y_true, y_pred, y_prob, name='Model'):
    TP = ((y_pred==1) & (y_true==1)).sum()
    TN = ((y_pred==0) & (y_true==0)).sum()
    FP = ((y_pred==1) & (y_true==0)).sum()
    FN = ((y_pred==0) & (y_true==1)).sum()

    precision  = TP / (TP + FP + 1e-9)
    recall     = TP / (TP + FN + 1e-9)
    f1         = 2 * precision * recall / (precision + recall + 1e-9)
    specificity= TN / (TN + FP + 1e-9)
    roc_auc    = roc_auc_score(y_true, y_prob)
    pr_auc     = average_precision_score(y_true, y_prob)

    print(f"\n{'='*40}\n{name}\n{'='*40}")
    print(f"Precision:   {precision:.3f}  (of predicted positives, how many were right?)")
    print(f"Recall:      {recall:.3f}  (of actual positives, how many did we find?)")
    print(f"F1-score:    {f1:.3f}  (harmonic mean of precision and recall)")
    print(f"Specificity: {specificity:.3f}  (TN rate)")
    print(f"ROC-AUC:     {roc_auc:.3f}  (rank order quality, threshold-independent)")
    print(f"PR-AUC:      {pr_auc:.3f}  (better for imbalanced datasets)")

# Key insight: precision-recall are the right metrics for imbalanced data.
# ROC-AUC can be misleadingly high when negative class dominates.

Calibration: Do Probabilities Mean What We Think?

A model predicts P(Y=1|x) = 0.7 for 1,000 points. If it is well-calibrated, roughly 700 of those should actually be class 1. Many models (especially SVMs and random forests) produce uncalibrated probabilities that require post-hoc correction via isotonic regression or Platt scaling.

PythonChecking and fixing calibration
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.svm import SVC
import numpy as np

# SVM does not output calibrated probabilities by default
svm_raw = SVC(probability=False, C=1.0); svm_raw.fit(X_tr, y_tr)

# Calibrated version: Platt scaling (logistic fit over SVM decision function)
svm_cal = CalibratedClassifierCV(SVC(C=1.0), cv=5, method='sigmoid')
svm_cal.fit(X_tr, y_tr)

# Calibration curve: fraction of positives vs. mean predicted prob
prob_cal = svm_cal.predict_proba(X_te)[:, 1]
fraction, mean_pred = calibration_curve(y_te, prob_cal, n_bins=10)
print(f"Calibration error (ECE): {np.mean(np.abs(fraction - mean_pred)):.4f}")
# Well-calibrated model: fraction ≈ mean_pred for all bins
6.7

Classical discriminative models operate on fixed-length feature vectors. How you engineer those features determines the ceiling on model performance far more than the choice of algorithm. This section covers the techniques that practitioners used before deep learning automated feature extraction.

Numerical Features

Standardisation (z-score): subtract mean, divide by std. Essential for SVMs and logistic regression; trees are invariant.
Min-max scaling: map to [0, 1]. Use when the distribution is not Gaussian and you need bounded inputs.
Log transform: compresses right-skewed distributions (prices, counts, frequencies). Apply before standardisation.
Polynomial features: add xᵢxⱼ and xᵢ² terms to capture interactions. Combinatorial explosion for high D.
Quantile transform: maps to uniform or Gaussian distribution, robust to outliers.

Categorical Features

One-hot encoding: K binary columns for K categories. Sparse for high-cardinality features.
Ordinal encoding: assign integers 1…K when there is a natural order (size: small=1, medium=2, large=3).
Target encoding: replace each category with the mean target value in that category. Powerful but leaks labels — always use out-of-fold estimates.
Embedding lookup: learnable dense vectors (Chapter 8). The deep learning replacement for one-hot.
PythonFeature engineering pipeline with sklearn
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np; import pandas as pd

# Example: Titanic-style dataset with mixed feature types
num_cols = ['age', 'fare']  # numerical
cat_cols = ['sex', 'embarked']  # categorical

num_pipe = Pipeline([('impute', SimpleImputer(strategy='median')),
                        ('scale',  StandardScaler())])

cat_pipe = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                        ('ohe',   OneHotEncoder(handle_unknown='ignore', sparse_output=False))])

preproc = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols),
])

model = Pipeline([
    ('preproc', preproc),
    ('clf',     GradientBoostingClassifier(n_estimators=200, max_depth=3, lr=0.05)),
])
# model.fit(X_train, y_train) -- handles NaNs, encodes cats, scales nums
6.8

Multiclass Strategies

Logistic regression and neural networks handle multiclass natively via softmax. For binary classifiers (SVM, basic logistic), two strategies extend to K classes:

StrategyHowModels trainedPrediction
One-vs-Rest (OvR)Train K binary classifiers, each class vs. all othersKargmax of K scores
One-vs-One (OvO)Train K(K-1)/2 binary classifiers, each pair of classesK(K-1)/2Majority vote
Softmax (native)Single model with K output units + cross-entropy1argmax(softmax(z))

Multi-label Classification

Multi-label problems allow multiple labels per example (a news article can be about both politics and technology). The standard approach: train K independent binary classifiers (binary relevance), one per label. More sophisticated: classifier chains (each classifier sees previous label predictions as features) or label powerset (treat each unique label combination as a class).

PythonMulti-label classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import jaccard_score

# Create a dataset with 4 labels, 2-3 labels per sample on average
X_ml, y_ml = make_multilabel_classification(n_samples=1000, n_features=20
                                                  n_classes=4, n_labels=2, random_state=0)

# Binary relevance: K independent binary classifiers
mo_clf = MultiOutputClassifier(LogisticRegression())
mo_clf.fit(X_ml[:800], y_ml[:800)
preds  = mo_clf.predict(X_ml[800:])

# Jaccard score (intersection over union) is standard for multi-label
jacc = jaccard_score(y_ml[800:], preds, average='samples')
print(f"Jaccard score: {jacc:.3f}")
# Jaccard score: 0.742  (fraction of correct labels in union of true+pred)
6.9

Classical discriminative models require human-engineered features. Neural networks replace the feature engineering step with learned representations. But the final classification layer is still logistic regression (binary) or softmax regression (multiclass) — the models in this chapter are not replaced, they are elevated.

The Unifying View
A neural network is a composition of functions: feature extractor f_θ followed by a linear classifier. Formally: P(Y=k|x) = softmax(W · f_θ(x)). Everything from Chapter 6 applies to this final layer:
• L2 regularisation on W is weight decay. • Temperature scaling of logits calibrates probabilities. • The decision boundary is a hyperplane in the representation space f_θ(x). • Transfer learning freezes f_θ and fine-tunes only W — exactly re-training logistic regression on new features.

Feature Space Geometry

Deep learning training can be understood as learning a representation space where the original classification problem becomes linearly separable. Visualising this with t-SNE or UMAP reveals whether the network has learned a useful geometry.

PythonProbing LLM representations with linear classifiers
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np

# Linear probe: can a logistic regression on BERT embeddings
# classify sentiment? If yes, BERT has learned a useful representation.

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model     = AutoModel.from_pretrained('bert-base-uncased').eval()

def get_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state[:, 0, :].squeeze().numpy()
    # [CLS] token embedding: 768-dimensional summary of the sentence

# Simulate sentiment data (in practice, use SST-2 or IMDB)
sentences = ['I love this film', 'Terrible movie', 'Amazing!', 'Awful']
labels    = [1, 0, 1, 0]  # 1=positive, 0=negative
embeddings = np.stack([get_embedding(s) for s in sentences])

# Train logistic regression on the [CLS] embeddings
probe = LogisticRegression(); probe.fit(embeddings, labels)
print(f"Linear probe accuracy: {probe.score(embeddings, labels):.1%}")
# With enough data: ~90%+ on SST-2 -- BERT's representations are linearly separable
6.10

Reference: Which Classifier to Use When

SituationRecommended modelWhy
Linearly separable, need probabilitiesLogistic RegressionFast, calibrated, interpretable coefficients
Non-linear boundary, medium datasetRBF SVMKernel trick; good generalisation
Tabular mixed features, no DLGradient Boosted TreesBest general-purpose tabular model
Need feature importanceRandom ForestOut-of-bag importance + low tuning
Very high-dimensional (text TF-IDF)Logistic Regression (L1)Sparse features → sparse solution
Imbalanced classesGBT + PR-AUC metricHandles imbalance; PR-AUC measures it
Need to generate examplesUse generative model (Ch. 7)Discriminative models can't generate X
Rich unstructured input (images, text)Neural network (Parts III–IV)Learns features automatically

Exercises

Exercises 1–10 are pen-and-paper; 11–18 require code.

Exercise 1: Pen & Paper
Derive the gradient of binary cross-entropy with respect to the weight vector w in logistic regression. Show it equals Xᵀ(p - y)/N where p = σ(Xw + b).
Exercise 2: Pen & Paper
Prove that the SVM hard-margin objective (minimise ½‖w‖² s.t. yᵢ(w·xᵢ+b)≥1) is equivalent to maximising the margin 2/‖w‖.
Exercise 3: Pen & Paper
Compute information gain for a binary split on the XOR dataset {(0,0,0), (0,1,1), (1,0,1), (1,1,0)}. Which feature gives higher gain?
Exercise 4: Pen & Paper
Show that bagging (averaging B trees with correlation ρ and variance σ²) reduces variance to (ρ + (1-ρ)/B)σ². What happens as B→∞?
Exercise 5: Pen & Paper
A spam filter has 95% recall and 80% precision. In a dataset where 5% of emails are spam, what is its accuracy?
Exercise 6: Pen & Paper
Explain why L2 regularisation corresponds to a Gaussian prior on weights in the MAP estimation framework. What prior corresponds to L1?
Exercise 7: Pen & Paper
The Gini impurity is G = 1 - Σₖ pₖ². Compute G for a pure node (all one class) and a maximally impure node (equal split). What is Gini's relationship to entropy?
Exercise 8: Pen & Paper
Prove that in gradient boosting with squared-error loss, the pseudo-residuals rᵢ = yᵢ - F_{m-1}(xᵢ) equal the negative gradient of the loss w.r.t. predictions.
Exercise 9: Pen & Paper
Logistic regression cannot solve the XOR problem. Explain geometrically why, and describe the minimum feature engineering needed to fix it.
Exercise 10: Pen & Paper
In multiclass one-vs-rest with K classes, we train K models. Describe a pathological situation where no class gets > 0.5 probability from its binary classifier and all models abstain.
Exercise 11: Code
Implement logistic regression from scratch and compare accuracy and training time to sklearn on the breast cancer dataset.
Exercise 12: Code
Implement the kernel trick for logistic regression using the RBF kernel: replace w·x with Σᵢ αᵢ K(xᵢ, x). Train on the circles dataset and compare to linear LR.
Exercise 13: Code
Build a full decision tree from scratch with early stopping and Gini impurity. Evaluate on iris dataset. Plot train vs. test accuracy as a function of max_depth.
Exercise 14: Code Lab
Train a random forest and gradient boosted trees on the adult income dataset. Plot feature importances for both. Evaluate with ROC-AUC and PR-AUC. Which is more appropriate here and why?
Exercise 15: Code
Demonstrate calibration: train an SVM and a random forest on an imbalanced dataset (10:1 ratio). Plot reliability diagrams for both. Apply isotonic regression calibration and re-plot.
Exercise 16: Code
Implement gradient boosting from scratch for classification (use log-loss instead of MSE, pseudo-residuals = y - p). Compare to sklearn's GradientBoostingClassifier.
Exercise 17: Code Lab
Linear probing experiment: extract [CLS] embeddings from BERT-base-uncased on the SST-2 dataset. Train logistic regression probes on embeddings from layers 1, 4, 8, 12. How does classification accuracy change with depth?
Exercise 18: Code (Challenge)
Implement a complete pipeline: (a) load any real-world tabular dataset, (b) handle missing values, encode categoricals, scale numerics, (c) train logistic regression, random forest, and XGBoost, (d) evaluate with proper train/val/test splits, (e) produce a calibration plot for the best model.

Further reading: “The Elements of Statistical Learning” (Hastie, Tibshirani, Friedman) Chapters 4, 9, 10, 15 — the authoritative reference. “Pattern Recognition and Machine Learning” (Bishop) Chapter 7 for SVMs. scikit-learn documentation for practical API reference.


Next → Chapter 7: Generative Models (Pre-deep learning)

Chapter 7 flips the modelling question: instead of learning P(Y|X) directly, we learn how data is generated — P(X|Y) and P(Y). Naïve Bayes, Gaussian Mixture Models, and Hidden Markov Models show why generative approaches excel when labels are scarce, enabling models to share statistical strength across classes and generate new samples from the learned distribution.

18 Exercises in this chapter
Attempt each exercise before checking the worked solutions.
View Solutions →