Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails
Gregory N. Frank

TL;DR
Current alignment evaluation overlooks the routing mechanism that connects concept detection to behavior, even though that mechanism is central to understanding model censorship and alignment.
Contribution
It demonstrates the importance of analyzing routing mechanisms in models, showing that detection and refusal metrics alone are insufficient for alignment assessment.
Findings
Probe accuracy alone is non-diagnostic; held-out category generalization is the informative test.
Surgical ablation reveals lab-specific routing: removing the political-sensitivity direction eliminates censorship in most models tested.
Refusal is no longer the main censorship mechanism; narrative steering dominates.
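The first finding can be illustrated on synthetic data. The following is a minimal sketch, not the paper's setup: the feature dimension, sample counts, signal strength, and the helper names `fit_linear_probe` and `probe_accuracy` are assumptions made for illustration, and a fresh synthetic draw stands in for held-out categories. It fits a least-squares linear probe once on real labels and once on permuted labels, then compares training accuracy with held-out accuracy.

```python
import numpy as np


def fit_linear_probe(X, y):
    """Least-squares linear probe with a bias term; y holds labels in {0, 1}."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    # lstsq returns the minimum-norm solution when the system is underdetermined.
    w, *_ = np.linalg.lstsq(Xb, 2.0 * y - 1.0, rcond=None)
    return w


def probe_accuracy(w, X, y):
    """Fraction of examples whose predicted sign matches the label."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean((Xb @ w > 0) == (y == 1)))


rng = np.random.default_rng(0)
n_train, n_heldout, dim = 40, 200, 256  # assumed sizes; dim > n_train on purpose

# Hypothetical "concept direction": labels shift the features along axis 0.
signal = np.zeros(dim)
signal[0] = 1.0


def make_split(n):
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, dim)) + 6.0 * np.outer(2 * y - 1, signal)
    return X, y


X_train, y_train = make_split(n_train)
X_heldout, y_heldout = make_split(n_heldout)
y_perm = rng.permutation(y_train)  # permutation baseline: labels shuffled

w_real = fit_linear_probe(X_train, y_train)
w_perm = fit_linear_probe(X_train, y_perm)

train_acc_real = probe_accuracy(w_real, X_train, y_train)
train_acc_perm = probe_accuracy(w_perm, X_train, y_perm)
heldout_acc_real = probe_accuracy(w_real, X_heldout, y_heldout)
heldout_acc_perm = probe_accuracy(w_perm, X_heldout, y_heldout)

print(f"train:    real={train_acc_real:.2f}  permuted={train_acc_perm:.2f}")
print(f"held-out: real={heldout_acc_real:.2f}  permuted={heldout_acc_perm:.2f}")
```

In this setup both probes fit their training labels perfectly, because the probe has more parameters than training examples. Only the held-out comparison separates the probe that tracks the synthetic concept from the one that memorized shuffled labels.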
Abstract
Current alignment evaluation mostly measures whether models encode dangerous concepts and whether they refuse harmful requests. Both miss the layer where alignment often operates: routing from concept detection to behavioral policy. We study political censorship in Chinese-origin language models as a natural experiment, using probes, surgical ablations, and behavioral tests across nine open-weight models from five labs. Three findings follow. First, probe accuracy alone is non-diagnostic: political probes, null controls, and permutation baselines can all reach 100%, so held-out category generalization is the informative test. Second, surgical ablation reveals lab-specific routing. Removing the political-sensitivity direction eliminates censorship and restores accurate factual output in most models tested, while one model confabulates because its architecture entangles factual knowledge…
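Removing a direction from activations can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the abstract does not say how the political-sensitivity direction is estimated or where it is removed, so the sketch uses a difference-of-means estimate and a plain orthogonal projection on synthetic activations. The names `difference_of_means_direction` and `ablate_direction`, and all sizes, are hypothetical.

```python
import numpy as np


def difference_of_means_direction(sensitive_acts, neutral_acts):
    """Unit vector from the neutral mean to the sensitive mean (assumed estimator)."""
    d = sensitive_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return d / np.linalg.norm(d)


def ablate_direction(acts, direction):
    """Project each activation onto the subspace orthogonal to `direction`."""
    return acts - np.outer(acts @ direction, direction)


rng = np.random.default_rng(0)
dim = 64  # assumed hidden size for the synthetic example

# Synthetic activations: "sensitive" prompts are shifted along a hidden axis.
hidden_axis = rng.normal(size=dim)
hidden_axis /= np.linalg.norm(hidden_axis)
neutral_acts = rng.normal(size=(100, dim))
sensitive_acts = rng.normal(size=(100, dim)) + 3.0 * hidden_axis

direction = difference_of_means_direction(sensitive_acts, neutral_acts)
ablated = ablate_direction(sensitive_acts, direction)

# After ablation, no activation has any component left along `direction`.
max_residual = float(np.abs(ablated @ direction).max())
print(f"largest remaining projection: {max_residual:.2e}")
```

The projection is idempotent, so applying `ablate_direction` a second time leaves the activations unchanged. Everything orthogonal to the estimated direction is preserved, which is what makes this kind of intervention narrow.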
