TL;DR
This paper localizes the policy routing mechanism in alignment-trained language models, showing that an intermediate-layer attention gate and deeper amplifier heads control refusal behavior.
Contribution
It identifies and characterizes the policy routing circuit, demonstrating its causal role and how it can be modulated or bypassed to influence model responses.
Findings
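The intermediate-layer attention gate is causally necessary for refusal, even though it contributes under 1% of output DLA.
Interchange screening at n >= 120 detects the same routing motif in twelve models from six labs (2B to 72B), though the specific heads differ by lab.
Modulating the detection-layer signal continuously shifts the model's policy from hard refusal through evasion to factual answering.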
Abstract
We localize the policy routing mechanism in alignment-trained language models. An intermediate-layer attention gate reads detected content and triggers deeper amplifier heads that boost the signal toward refusal. In smaller models the gate and amplifier are single heads; at larger scale they become bands of heads across adjacent layers. The gate contributes under 1% of output DLA, yet interchange testing (p < 0.001) and knockout cascade confirm it is causally necessary. Interchange screening at n >= 120 detects the same motif in twelve models from six labs (2B to 72B), though specific heads differ by lab. Per-head ablation weakens up to 58x at 72B and misses gates that interchange identifies; at scale, interchange is the only reliable audit. Modulating the detection-layer signal continuously controls policy from hard refusal through evasion to factual answering. On safety prompts the…
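The contrast the abstract draws between a small DLA share and a large interchange effect can be illustrated with a toy model. The sketch below is not the paper's code, models, or numbers: the two-stage `forward` function, its weights, and every variable name are hypothetical, chosen so that the gate writes almost nothing directly to the refusal logit while the amplifier depends on it entirely. It also takes DLA to mean direct logit attribution, which is an assumption, since the abstract does not expand the acronym.

```python
# Toy illustration only: a hypothetical two-stage router, not the paper's models.
# All weights below are made up to show why a component with a tiny direct
# contribution can still be causally necessary.

def forward(detect, gate_patch=None):
    """Return (refusal_logit, activations) for a toy gate -> amplifier circuit.

    detect     : 1.0 for a flagged prompt, 0.0 for a benign one (assumed encoding)
    gate_patch : if given, overwrite the gate activation with this value
                 (this is the interchange intervention)
    """
    gate = 1.0 * detect if gate_patch is None else gate_patch  # gate reads detected content
    amp = max(0.0, 8.0 * gate)     # amplifier reads the gate and boosts the signal
    direct_gate = 0.05 * gate      # the gate's own direct write to the logit
    logit = direct_gate + amp - 4.0
    return logit, {"gate": gate, "amp": amp, "direct_gate": direct_gate}


# Clean runs on a flagged and a benign prompt.
clean_logit, clean_acts = forward(detect=1.0)
benign_logit, benign_acts = forward(detect=0.0)

# Direct attribution: share of the direct logit writes that comes from the gate.
gate_dla_share = clean_acts["direct_gate"] / (
    clean_acts["direct_gate"] + clean_acts["amp"]
)

# Interchange: rerun the flagged prompt with the gate activation swapped in
# from the benign run, leaving everything else untouched.
patched_logit, _ = forward(detect=1.0, gate_patch=benign_acts["gate"])
interchange_effect = clean_logit - patched_logit

print(f"gate share of direct attribution: {gate_dla_share:.4f}")
print(f"clean logit: {clean_logit:.2f}, patched logit: {patched_logit:.2f}")
print(f"interchange effect: {interchange_effect:.2f}")
```

In this toy the gate accounts for well under 1% of the direct writes to the logit, yet swapping its activation flips the output from refusal to non-refusal, because the amplifier's contribution disappears along with it. A real audit would patch cached activations of specific attention heads across many prompt pairs and test the effect statistically; none of that is modeled here.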
