Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing
Bryce Hinkley, Peyman Najafirad

TL;DR
This paper introduces Residual Paving, a routed residual editing method for frozen instruction-tuned transformers. It sharply reduces refusal on designated edit prompts while preserving benign behavior and harmful refusals outside the edit set, and it offers a diagnostic for the routing bottleneck in selective refusal editing.
Contribution
The paper proposes a routed residual editing approach that separates route selectivity (whether to intervene) from residual-edit capacity (what edit to apply), which makes it possible to diagnose and improve selective refusal editing in transformers.
Findings
Learned Residual Paving reduces edit refusal from 88.6% to 4.0% on the primary Gemma-3-4B-IT held-out split.
Preserves the benign and harmful distributions at 95.5% and 87.3%, respectively.
Oracle routing improves diagnostic scores across six backbones.
Abstract
We study selective refusal editing as a three-way control problem: induce non-refusal on designated edit prompts while preserving benign behavior and harmful refusals outside the edit set. We introduce Residual Paving, a routed residual editing method for frozen instruction-tuned transformers that separates route selectivity, whether to intervene, from residual-edit capacity, what edit to apply. An early-layer router predicts a scalar gate and expert mixture; when active, prompt-conditioned bottleneck residual experts apply later-layer residual updates while leaving the backbone unchanged. This decomposition supports an oracle-routing diagnostic where only the learned scalar gate is replaced with the held-out edit/keep label, leaving the residual editor and frozen backbone fixed. On the primary Gemma-3-4B-IT held-out split, learned Residual Paving reduces edit refusal from 88.6% to…
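The routing decomposition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the dimensions, random parameters, activation functions, and the names `route`, `residual_edit`, and `paved_forward` are all assumptions made for illustration, and the experts here see only the later-layer hidden state rather than a richer prompt conditioning. The sketch shows an early-layer router emitting a scalar gate and an expert mixture, bottleneck experts producing a residual update at a later layer, and an oracle mode that replaces only the scalar gate with the edit/keep label.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_bottleneck, n_experts = 16, 4, 3  # hypothetical sizes

# Hypothetical router and expert parameters (random, for illustration only).
W_gate = rng.normal(size=(d_model,)) * 0.1
W_mix = rng.normal(size=(d_model, n_experts)) * 0.1
W_down = rng.normal(size=(n_experts, d_model, d_bottleneck)) * 0.1
W_up = rng.normal(size=(n_experts, d_bottleneck, d_model)) * 0.1


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def route(h_early):
    """Router on an early-layer state: scalar gate plus expert mixture weights."""
    gate = sigmoid(h_early @ W_gate)   # route selectivity: whether to intervene
    mix = softmax(h_early @ W_mix)     # which experts to blend
    return gate, mix


def residual_edit(h_late, mix):
    """Residual-edit capacity: mixture of bottleneck experts (down, tanh, up)."""
    hidden = np.tanh(np.einsum("d,edb->eb", h_late, W_down))
    updates = np.einsum("eb,ebd->ed", hidden, W_up)
    return mix @ updates


def paved_forward(h_early, h_late, oracle_label=None):
    """Apply the gated residual update; the backbone state h_late is not retrained."""
    gate, mix = route(h_early)
    if oracle_label is not None:
        # Oracle-routing diagnostic: swap only the scalar gate for the label,
        # keeping the expert mixture and the residual editor fixed.
        gate = float(oracle_label)
    return h_late + gate * residual_edit(h_late, mix)
```

With `oracle_label=0` the later-layer state passes through unchanged, and with `oracle_label=1` the full residual edit is applied, so any gap between learned and oracle routing in this sketch is attributable to the gate alone.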
