Constitutional AI & Safety
Detailed solutions for the exercises in Chapter 26. Try solving them yourself before checking the answers.
Solution
A model trained only to be maximally helpful will help with ANYTHING, including harmful requests — it has no countervailing objective to refuse. Example: asked for detailed instructions to synthesize a dangerous weapon or to write convincing disinformation, a pure-helpfulness model would comply enthusiastically. Helpfulness must be balanced by harmlessness, or the model becomes a willing accomplice to harm.
Solution
HHH = Helpful, Harmless, Honest. Conflicts: Helpful-vs-Harmless — a user asks how to do something dangerous; being helpful conflicts with avoiding harm. Helpful-vs-Honest — a user wants reassurance that a flawed plan is great; flattering them is 'helpful' but dishonest. Harmless-vs-Honest — an honest answer about a sensitive topic might be upsetting or misusable, conflicting with harmlessness. Alignment is largely about navigating these tensions sensibly, not maximizing one at the others' expense.
Solution
(1) Psychological harm to labelers exposed to toxic content — CAI uses AI to generate critiques, reducing human exposure. (2) Inconsistency across labelers — a written constitution gives explicit, consistent principles. (3) Cost/scale — AI feedback scales cheaply where human labeling doesn't. (4) Opacity of implicit values — CAI makes the values explicit and auditable in the constitution. CAI replaces much human harm-labeling with principle-guided AI self-critique.
Solution
A constitution is an explicit set of written principles (e.g. 'choose the response that is least harmful and most helpful') that guide the model's self-critique and preference judgments. Explicit values beat implicit labels because they are transparent (anyone can read and debate them), consistent (applied uniformly, not subject to individual labeler whim), auditable, and easily updated — versus implicit values buried opaquely in thousands of human labels.
Solution
Stage 1 (supervised): the model critiques and revises its own responses against the constitution, producing a dataset of improved responses, on which it is then SFT-ed. Stage 2 (RL): the model judges pairs of responses by the constitution to produce AI preference data, which trains a reward model (or DPO objective) for RLAIF. Stage 1 instills the behavior via supervised revision; Stage 2 refines it via preference optimization — together replacing human harm-labeling with AI feedback.
Solution
The loop: generate a response, then prompt the model to critique it against a constitutional principle, then revise based on the critique. It exploits the generate-evaluate asymmetry — the model is better at JUDGING whether a response violates a principle than at AVOIDING the violation in one shot. So it can spot and fix its own flaws even though it produced them, iteratively improving the response.
Solution
RLAIF replaces the HUMAN preference labels with AI-generated preferences (a model judging responses by the constitution); everything else — the reward model, the RL optimization, the KL penalty — stays the same. It lets safety alignment scale because generating preference labels no longer requires slow, costly, potentially-harmful human labeling: the AI can produce vast preference data cheaply and consistently, so safety training is limited by compute, not human annotation.
Solution
Jailbreaks work because safety training mostly teaches the model to refuse certain SURFACE patterns (overt harmful requests), not to deeply understand and resist harm in all framings. It is 'shallow' in that role-play, hypotheticals, or obfuscation can route around the learned refusal triggers, since the underlying capability and knowledge remain. Deep safety would require the model to robustly recognize harmful INTENT regardless of framing — tied to genuine understanding and perhaps interpretability-verified internal goals, not just pattern-matched refusals.
Solution
The trade-off: pushing harmlessness up tends to push helpfulness down, and vice versa. Failure modes: UNDER-refusal (complies with harmful requests — unsafe) and OVER-refusal (refuses legitimate requests — useless and patronizing). Over-refusal is a serious failure, not a safe default, because a model that refuses safety questions, medical information, or security research frustrates legitimate users, drives them to worse sources, and undermines trust — excessive caution has real costs.
Solution
Well-calibrated refusal means refusing clearly-harmful requests, helping with clearly-legitimate ones, and handling the ambiguous middle thoughtfully — often by addressing the legitimate need while declining the harmful part. The ambiguous middle (e.g. 'how do locks work' — curiosity or burglary?) is hardest because intent is unclear and context-dependent. Keyword filtering fails because the same words appear in both benign and malicious contexts; safety requires understanding intent and context, not matching surface terms.
Solution
Generating a response, prompting the model to critique it against a principle, and revising produces a measurably safer/better response than the original (Exercise 6) — the Stage 1 mechanism, demonstrating the model improving its own output via self-critique.
Solution
Running the critique-revise loop with a small constitution over adversarial prompts yields revised, safer responses; collecting (prompt, revised-response) pairs builds the Stage 1 SFT dataset — the supervised half of CAI, produced with minimal human involvement.
Solution
Prompting a model to choose, per principle, which of two responses is better produces preference pairs without human labelers — the RLAIF data source of Exercise 7, demonstrating AI feedback replacing human preference annotation.
Solution
Applying Stage 1 then Stage 2 to a small model and evaluating on held-out adversarial prompts shows reduced harmful compliance after each stage — the full Constitutional AI pipeline lowering harmfulness using AI feedback, with measurable before/after improvement.
Solution
Using an attacker model to generate adversarial prompts against a target and logging which elicit unsafe responses automates red-teaming — surfacing the target's vulnerabilities at scale, the discovery step that feeds back into training (Exercise 20).
Solution
Wrapping a harmful request in role-play ('pretend you are...') or hypothetical framing often changes a safety-trained model's behavior — demonstrating the shallowness of surface-pattern safety training (Exercise 8) and why robust safety must generalize across framings.
Solution
Testing a model on legitimate prompts about safety, medicine, or security research and counting wrongful refusals quantifies over-refusal (Exercise 9) — revealing the cost of overly aggressive safety tuning and the need to balance the trade-off.
Solution
Scoring a model on a labeled set across the three categories yields both under-refusal (complying with harmful) and over-refusal (refusing legitimate) rates, with the ambiguous middle revealing calibration quality (Exercise 10) — the two-sided metric for safety tuning.
Solution
Prompting the model, when it must refuse, to also offer a safe alternative or address the underlying legitimate need turns a bare refusal into a constructive response — improving the helpfulness side of the trade-off (Exercise 9) without compromising harmlessness.
Solution
Writing a constitution, running both CAI stages, red-teaming to find residual failures, adding those to training, and re-evaluating shifts the under-refusal/over-refusal trade-off curve favorably at each iteration. The write-up should report where the balance was set and why — acknowledging that the 'right' point depends on the application's risk tolerance, the mature view of safety from the chapter.