Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails

Gregory N. Frank

arXiv:2603.18280 · cs.LG · May 4, 2026

TL;DR

This paper argues that current alignment evaluation overlooks routing, the step that connects concept detection to behavioral policy, and that this step is central to understanding model censorship and alignment.

Contribution

It shows that detection and refusal metrics alone are insufficient for alignment assessment, and that the routing from detected concepts to behavior has to be analyzed directly.

Findings

01

Probe accuracy alone is non-diagnostic: null controls and permutation baselines can also reach 100%, so held-out category generalization is the informative test.

02

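Surgical ablation reveals lab-specific routing: removing the political-sensitivity direction eliminates censorship and restores accurate factual output in most models tested.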

03

Refusal is no longer the main censorship mechanism; narrative steering dominates.
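
The first finding can be illustrated on synthetic data. The sketch below is not the paper's code or data: the Gaussian "activations", the dimensions, the signal strength, and the minimum-norm least-squares probe are all assumptions chosen for illustration. It shows why training accuracy cannot separate a real concept probe from a null control when the feature dimension exceeds the number of examples, and why accuracy on held-out samples (standing in here for a held-out category) can.

```python
import numpy as np

def fit_probe(X, y):
    """Minimum-norm least-squares linear probe without a bias term (assumed probe type)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def accuracy(w, X, y):
    """Fraction of examples whose predicted sign matches the +/-1 label."""
    return float(np.mean(np.sign(X @ w) == y))

def make_data(rng, n, d, direction, strength):
    """Synthetic activations: Gaussian noise plus a label-dependent shift along `direction`."""
    y = rng.choice([-1.0, 1.0], size=n)
    X = rng.standard_normal((n, d)) + strength * y[:, None] * direction
    return X, y

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 200, 512  # hypothetical sizes; d > n_train is what matters
direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)

# "Real" probe: labels are tied to a shared direction in activation space.
X_train, y_train = make_data(rng, n_train, d, direction, strength=8.0)
X_test, y_test = make_data(rng, n_test, d, direction, strength=8.0)
real_probe = fit_probe(X_train, y_train)

# Null control: random labels on activations that carry no signal at all.
X_null, y_null = make_data(rng, n_train, d, direction, strength=0.0)
X_null_test, y_null_test = make_data(rng, n_test, d, direction, strength=0.0)
null_probe = fit_probe(X_null, y_null)

train_acc_real = accuracy(real_probe, X_train, y_train)
train_acc_null = accuracy(null_probe, X_null, y_null)
heldout_acc_real = accuracy(real_probe, X_test, y_test)
heldout_acc_null = accuracy(null_probe, X_null_test, y_null_test)

print(f"train   real={train_acc_real:.2f}  null={train_acc_null:.2f}")
print(f"heldout real={heldout_acc_real:.2f}  null={heldout_acc_null:.2f}")
```

In this toy setting both probes fit their training labels perfectly, because a linear probe with more features than examples can interpolate any labeling. Only the probe trained on real signal stays accurate on held-out samples.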

Abstract

Current alignment evaluation mostly measures whether models encode dangerous concepts and whether they refuse harmful requests. Both miss the layer where alignment often operates: routing from concept detection to behavioral policy. We study political censorship in Chinese-origin language models as a natural experiment, using probes, surgical ablations, and behavioral tests across nine open-weight models from five labs. Three findings follow. First, probe accuracy alone is non-diagnostic: political probes, null controls, and permutation baselines can all reach 100%, so held-out category generalization is the informative test. Second, surgical ablation reveals lab-specific routing. Removing the political-sensitivity direction eliminates censorship and restores accurate factual output in most models tested, while one model confabulates because its architecture entangles factual knowledge…
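
The abstract describes removing a political-sensitivity direction but, as excerpted here, does not say how that direction is estimated or removed. The sketch below shows one common approach, projecting a unit direction out of hidden states, on synthetic vectors. The difference-of-means estimate, the function names, and the dimensions are assumptions for illustration, not the paper's procedure.

```python
import numpy as np

def difference_of_means_direction(pos, neg):
    """Unit vector from the mean of `neg` toward the mean of `pos` (assumed estimator)."""
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)

def ablate_direction(hidden, v):
    """Remove the component of each row of `hidden` along the unit vector `v`."""
    return hidden - np.outer(hidden @ v, v)

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden size
concept = np.zeros(d)
concept[0] = 1.0

# Synthetic stand-ins for hidden states on neutral vs. sensitive prompts.
neutral = rng.standard_normal((100, d))
sensitive = rng.standard_normal((100, d)) + 5.0 * concept

v = difference_of_means_direction(sensitive, neutral)
ablated = ablate_direction(sensitive, v)

# After ablation, no hidden state has a component along v.
max_residual = float(np.max(np.abs(ablated @ v)))

# Components orthogonal to v are left untouched.
u = rng.standard_normal(d)
u -= (u @ v) * v
u /= np.linalg.norm(u)
orthogonal_change = float(np.max(np.abs(ablated @ u - sensitive @ u)))

print(max_residual, orthogonal_change)
```

The projection changes only the one-dimensional component along `v`, which is why this kind of edit is called surgical. Whether everything else the model encodes survives depends on how cleanly that direction separates from other features. The abstract's account of the one model that confabulates concerns exactly that separation.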
