Solutions Appendix
Chapter 26

Constitutional AI & Safety

20 Solutions

Detailed solutions for the exercises in Chapter 26. Try solving them yourself before checking the answers.

Exercise 1Pen & Paper
Why is pure helpfulness training dangerous? Example of unacceptable maximally-helpful behavior.

Solution

A model trained only to be maximally helpful will help with ANYTHING, including harmful requests — it has no countervailing objective to refuse. Example: asked for detailed instructions to synthesize a dangerous weapon or to write convincing disinformation, a pure-helpfulness model would comply enthusiastically. Helpfulness must be balanced by harmlessness, or the model becomes a willing accomplice to harm.

Exercise 2Pen & Paper
Describe the HHH framework; a scenario where each pair conflicts.

Solution

HHH = Helpful, Harmless, Honest. Conflicts: Helpful-vs-Harmless — a user asks how to do something dangerous; being helpful conflicts with avoiding harm. Helpful-vs-Honest — a user wants reassurance that a flawed plan is great; flattering them is 'helpful' but dishonest. Harmless-vs-Honest — an honest answer about a sensitive topic might be upsetting or misusable, conflicting with harmlessness. Alignment is largely about navigating these tensions sensibly, not maximizing one at the others' expense.

Exercise 3Pen & Paper
Four problems with human labeling of harmful content; how Constitutional AI addresses each.

Solution

(1) Psychological harm to labelers exposed to toxic content — CAI uses AI to generate critiques, reducing human exposure. (2) Inconsistency across labelers — a written constitution gives explicit, consistent principles. (3) Cost/scale — AI feedback scales cheaply where human labeling doesn't. (4) Opacity of implicit values — CAI makes the values explicit and auditable in the constitution. CAI replaces much human harm-labeling with principle-guided AI self-critique.

Exercise 4Pen & Paper
What is a 'constitution' and why are explicit written values better than implicit labels?

Solution

A constitution is an explicit set of written principles (e.g. 'choose the response that is least harmful and most helpful') that guide the model's self-critique and preference judgments. Explicit values beat implicit labels because they are transparent (anyone can read and debate them), consistent (applied uniformly, not subject to individual labeler whim), auditable, and easily updated — versus implicit values buried opaquely in thousands of human labels.

Exercise 5Pen & Paper
Describe the two stages of Constitutional AI; what each produces; how they fit together.

Solution

Stage 1 (supervised): the model critiques and revises its own responses against the constitution, producing a dataset of improved responses, on which it is then SFT-ed. Stage 2 (RL): the model judges pairs of responses by the constitution to produce AI preference data, which trains a reward model (or DPO objective) for RLAIF. Stage 1 instills the behavior via supervised revision; Stage 2 refines it via preference optimization — together replacing human harm-labeling with AI feedback.

Exercise 6Pen & Paper
Explain the CAI Stage 1 self-critique loop; what asymmetry lets the model improve its own responses?

Solution

The loop: generate a response, then prompt the model to critique it against a constitutional principle, then revise based on the critique. It exploits the generate-evaluate asymmetry — the model is better at JUDGING whether a response violates a principle than at AVOIDING the violation in one shot. So it can spot and fix its own flaws even though it produced them, iteratively improving the response.

Exercise 7Pen & Paper
Compare RLAIF and RLHF; what changes, what stays the same; why does it let safety scale?

Solution

RLAIF replaces the HUMAN preference labels with AI-generated preferences (a model judging responses by the constitution); everything else — the reward model, the RL optimization, the KL penalty — stays the same. It lets safety alignment scale because generating preference labels no longer requires slow, costly, potentially-harmful human labeling: the AI can produce vast preference data cheaply and consistently, so safety training is limited by compute, not human annotation.

Exercise 8Pen & Paper
Why do jailbreaks work? Why is safety training 'shallow', and what would deep safety require?

Solution

Jailbreaks work because safety training mostly teaches the model to refuse certain SURFACE patterns (overt harmful requests), not to deeply understand and resist harm in all framings. It is 'shallow' in that role-play, hypotheticals, or obfuscation can route around the learned refusal triggers, since the underlying capability and knowledge remain. Deep safety would require the model to robustly recognize harmful INTENT regardless of framing — tied to genuine understanding and perhaps interpretability-verified internal goals, not just pattern-matched refusals.

Exercise 9Pen & Paper
Describe the helpfulness-harmlessness trade-off and both failure modes; why is over-refusal serious?

Solution

The trade-off: pushing harmlessness up tends to push helpfulness down, and vice versa. Failure modes: UNDER-refusal (complies with harmful requests — unsafe) and OVER-refusal (refuses legitimate requests — useless and patronizing). Over-refusal is a serious failure, not a safe default, because a model that refuses safety questions, medical information, or security research frustrates legitimate users, drives them to worse sources, and undermines trust — excessive caution has real costs.

Exercise 10Pen & Paper
What does well-calibrated refusal look like? Why is the 'ambiguous middle' hardest and keyword filtering insufficient?

Solution

Well-calibrated refusal means refusing clearly-harmful requests, helping with clearly-legitimate ones, and handling the ambiguous middle thoughtfully — often by addressing the legitimate need while declining the harmful part. The ambiguous middle (e.g. 'how do locks work' — curiosity or burglary?) is hardest because intent is unclear and context-dependent. Keyword filtering fails because the same words appear in both benign and malicious contexts; safety requires understanding intent and context, not matching surface terms.

Exercise 11Code
Implement the constitutional self-critique loop (generate, critique, revise); show improvement.

Solution

Generating a response, prompting the model to critique it against a principle, and revising produces a measurably safer/better response than the original (Exercise 6) — the Stage 1 mechanism, demonstrating the model improving its own output via self-critique.

Exercise 12Code
Build a 5-principle constitution; run critique-revise over adversarial prompts; collect an SFT dataset.

Solution

Running the critique-revise loop with a small constitution over adversarial prompts yields revised, safer responses; collecting (prompt, revised-response) pairs builds the Stage 1 SFT dataset — the supervised half of CAI, produced with minimal human involvement.

Exercise 13Code
Implement AI-feedback preference generation: model judges which of two responses better follows a principle.

Solution

Prompting a model to choose, per principle, which of two responses is better produces preference pairs without human labelers — the RLAIF data source of Exercise 7, demonstrating AI feedback replacing human preference annotation.

Exercise 14Code Lab
Run CAI Stage 1 (SFT on revisions) then Stage 2 (DPO on AI prefs); compare harmfulness before/after.

Solution

Applying Stage 1 then Stage 2 to a small model and evaluating on held-out adversarial prompts shows reduced harmful compliance after each stage — the full Constitutional AI pipeline lowering harmfulness using AI feedback, with measurable before/after improvement.

Exercise 15Code
Build an automated red-teaming loop: one model attacks another; log successful unsafe elicitations.

Solution

Using an attacker model to generate adversarial prompts against a target and logging which elicit unsafe responses automates red-teaming — surfacing the target's vulnerabilities at scale, the discovery step that feeds back into training (Exercise 20).

Exercise 16Code
Test jailbreaks: role-play and hypothetical wrappers around a harmful request; measure behavior change.

Solution

Wrapping a harmful request in role-play ('pretend you are...') or hypothetical framing often changes a safety-trained model's behavior — demonstrating the shallowness of surface-pattern safety training (Exercise 8) and why robust safety must generalize across framings.

Exercise 17Code
Measure over-refusal: legitimate-but-sensitive prompts; how often the model wrongly refuses.

Solution

Testing a model on legitimate prompts about safety, medicine, or security research and counting wrongful refusals quantifies over-refusal (Exercise 9) — revealing the cost of overly aggressive safety tuning and the need to balance the trade-off.

Exercise 18Code
Build a refusal-calibration eval (clearly-harmful, clearly-legitimate, ambiguous); score under- and over-refusal.

Solution

Scoring a model on a labeled set across the three categories yields both under-refusal (complying with harmful) and over-refusal (refusing legitimate) rates, with the ambiguous middle revealing calibration quality (Exercise 10) — the two-sided metric for safety tuning.

Exercise 19Code
Implement a 'helpful even in refusal' transform: offer a safe alternative or address the legitimate need.

Solution

Prompting the model, when it must refuse, to also offer a safe alternative or address the underlying legitimate need turns a bare refusal into a constructive response — improving the helpfulness side of the trade-off (Exercise 9) without compromising harmlessness.

Exercise 20Code (Challenge)
End-to-end safety pipeline: constitution → CAI Stage 1+2 → red-team → add failures → evaluate the trade-off curve.

Solution

Writing a constitution, running both CAI stages, red-teaming to find residual failures, adding those to training, and re-evaluating shifts the under-refusal/over-refusal trade-off curve favorably at each iteration. The write-up should report where the balance was set and why — acknowledging that the 'right' point depends on the application's risk tolerance, the mature view of safety from the chapter.