Surgical Refusal Ablation: Disentangling Safety from Intelligence via Concept-Guided Spectral Cleaning
Tony Cristofano

TL;DR
This paper introduces Surgical Refusal Ablation (SRA), a spectral cleaning method that disentangles the refusal signal from core capabilities in safety-aligned language models, so that refusal can be ablated while model performance is preserved.
Contribution
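SRA uses ridge-regularized spectral residualization to orthogonalize the raw refusal vector against a registry of Concept Atoms (protected capabilities and stylistic confounds), minimizing the collateral damage and distribution drift that ablation otherwise causes in safety-aligned language models.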
Findings
SRA achieves deep refusal reduction with negligible perplexity impact.
SRA preserves the model's original output distribution while reducing refusals to near zero.
Standard ablation causes severe distribution drift, which SRA avoids.
Abstract
Safety-aligned language models systematically refuse harmful requests. While activation steering can modulate refusal, ablating the raw "refusal vector" calculated from contrastive harmful and harmless prompts often causes collateral damage and distribution drift. We argue this degradation occurs because the raw vector is polysemantic, entangling the refusal signal with core capability circuits and linguistic style. We introduce Surgical Refusal Ablation (SRA) to distill these steering directions. SRA constructs a registry of independent Concept Atoms representing protected capabilities and stylistic confounds, then uses ridge-regularized spectral residualization to orthogonalize the refusal vector against these directions. This yields a clean refusal direction that targets refusal-relevant structure while minimizing disruption to the model's semantic geometry. Across five models…
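The pipeline the abstract describes (a contrastive refusal vector, residualization against concept directions, then ablation) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. It assumes the raw vector is a difference of mean activations, that each Concept Atom is a single direction in activation space, and that the residualization is a plain ridge regression solve; any spectral step the paper applies is not reproduced here. All function names, the `ridge` value, and the synthetic data are hypothetical.

```python
import numpy as np

def raw_refusal_vector(harmful_acts, harmless_acts):
    """Difference of mean activations (harmful minus harmless); assumed form."""
    return harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)

def residualize(refusal, atoms, ridge=1e-3):
    """Subtract the ridge-regularized fit of `refusal` onto the atom directions.

    atoms: (k, d) array with one concept direction per row (hypothetical layout).
    Returns a unit-norm cleaned direction.
    """
    gram = atoms @ atoms.T                               # (k, k) Gram matrix
    coeffs = np.linalg.solve(gram + ridge * np.eye(len(atoms)), atoms @ refusal)
    cleaned = refusal - atoms.T @ coeffs                 # remove the fitted component
    return cleaned / np.linalg.norm(cleaned)

def ablate(hidden, direction):
    """Project the unit vector `direction` out of every row of `hidden`."""
    return hidden - np.outer(hidden @ direction, direction)

# Synthetic demonstration: the "harmful" activations are shifted along a
# refusal direction AND along the first concept atom, so the raw vector is entangled.
rng = np.random.default_rng(0)
d = 64
atoms = rng.normal(size=(3, d))
atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)
true_refusal = rng.normal(size=d)
harmful = rng.normal(size=(200, d)) + true_refusal + 2.0 * atoms[0]
harmless = rng.normal(size=(200, d))

raw = raw_refusal_vector(harmful, harmless)
clean = residualize(raw, atoms)
print("overlap with atoms before:", np.round(atoms @ (raw / np.linalg.norm(raw)), 3))
print("overlap with atoms after: ", np.round(atoms @ clean, 3))
```

The ridge term keeps the solve stable when atoms are nearly collinear, at the cost of leaving a small residual overlap with each atom. As `ridge` approaches zero the cleaned direction becomes exactly orthogonal to the span of the atoms.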
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)
