TL;DR
This paper introduces contrastive neuron attribution (CNA), a gradient-free method that identifies the small set of MLP neurons whose activations distinguish harmful from benign prompts. Ablating these neurons in instruct models cuts refusal rates by over 50% without degrading output fluency.
Contribution
The paper presents CNA, a gradient-free technique that locates the 0.1% of MLP neurons whose activations most distinguish harmful from benign prompts, and shows that intervening on these neurons steers refusal behavior while preserving fluency.
Findings
Ablating the identified neurons in instruct models reduces refusal rates by over 50% on a standard jailbreak benchmark
CNA requires only forward passes, no gradients or training
Base models contain late-layer discrimination structures similar to those in instruct models
Abstract
Language models are instruction-tuned to refuse harmful requests, but the mechanisms underlying this behavior remain poorly understood. Popular steering methods operate on the residual stream and degrade output coherence at high intervention strengths, limiting their practical use. We introduce contrastive neuron attribution (CNA), which identifies the 0.1% of MLP neurons whose activations most distinguish harmful from benign prompts, requiring only forward passes with no gradients or auxiliary training. In instruct models, ablating the discovered circuit reduces refusal rates by over 50% on a standard jailbreak benchmark while preserving fluency and non-degeneracy across all steering strengths. Applying CNA to matched base and instruct models across Llama and Qwen architectures (from 1B to 72B parameters), we find that base models contain similar late-layer discrimination structures…
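The selection step described in the abstract can be illustrated with a minimal sketch. It assumes MLP activations have already been collected from forward passes as plain lists, one vector per prompt. The scoring rule used here (absolute difference of per-neuron mean activations) and the scaling form of ablation are assumptions made for illustration; the abstract does not specify the exact statistic or intervention. The function names `contrastive_scores`, `top_fraction`, and `ablate` are hypothetical.

```python
def contrastive_scores(harmful_acts, benign_acts):
    """Score each neuron by |mean harmful activation - mean benign activation|.

    Each argument is a list of per-prompt activation vectors of equal length.
    This scoring rule is an assumed stand-in, not the paper's confirmed metric.
    """
    n_neurons = len(harmful_acts[0])
    scores = []
    for j in range(n_neurons):
        h = sum(row[j] for row in harmful_acts) / len(harmful_acts)
        b = sum(row[j] for row in benign_acts) / len(benign_acts)
        scores.append(abs(h - b))
    return scores


def top_fraction(scores, fraction=0.001):
    """Return sorted indices of the highest-scoring neurons (0.1% by default)."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def ablate(acts, neuron_ids, strength=1.0):
    """Scale the selected neurons by (1 - strength); strength=1.0 zeroes them.

    Returns a new list and leaves the input untouched.
    """
    out = list(acts)
    for j in neuron_ids:
        out[j] = acts[j] * (1.0 - strength)
    return out


# Toy data: 4 neurons, 2 prompts per class. Only neuron 2 separates the classes.
harmful = [[1.0, 0.0, 5.0, 2.0], [1.0, 0.2, 7.0, 2.0]]
benign = [[1.0, 0.1, 0.0, 2.0], [1.0, 0.1, 1.0, 2.0]]

scores = contrastive_scores(harmful, benign)
selected = top_fraction(scores, fraction=0.25)  # 25% of 4 neurons = 1 neuron
steered = ablate([1.0, 2.0, 3.0, 4.0], selected)
```

Only forward-pass activations enter the computation, which matches the gradient-free framing. In a real model the activations would come from hooks on the MLP layers, and the `strength` parameter is one assumed way to express the varying steering strengths the abstract mentions.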
