Targeted Neuron Modulation via Contrastive Pair Search

Sam Herring; Jake Naviasky; Karan Malhotra

arXiv:2605.12290 · cs.LG · May 13, 2026

TL;DR

This paper introduces contrastive neuron attribution (CNA), a forward-pass-only method that identifies the roughly 0.1% of MLP neurons whose activations most distinguish harmful from benign prompts. Ablating these neurons in instruct models cuts refusal rates by more than half while preserving output fluency.

Contribution

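The paper presents CNA, a gradient-free technique for locating the MLP neurons that separate harmful from benign prompts and modulating them directly. This steers refusal behavior without the loss of coherence that residual-stream steering methods show at high intervention strengths.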

Findings

01

In instruct models, ablating the identified neurons reduces refusal rates by over 50% on a standard jailbreak benchmark

02

CNA requires only forward passes, with no gradients or auxiliary training

03

Base models contain late-layer discrimination structures similar to those in instruct models

Abstract

Language models are instruction-tuned to refuse harmful requests, but the mechanisms underlying this behavior remain poorly understood. Popular steering methods operate on the residual stream and degrade output coherence at high intervention strengths, limiting their practical use. We introduce contrastive neuron attribution (CNA), which identifies the 0.1% of MLP neurons whose activations most distinguish harmful from benign prompts, requiring only forward passes with no gradients or auxiliary training. In instruct models, ablating the discovered circuit reduces refusal rates by over 50% on a standard jailbreak benchmark while preserving fluency and non-degeneracy across all steering strengths. Applying CNA to matched base and instruct models across Llama and Qwen architectures (from 1B to 72B parameters), we find that base models contain similar late-layer discrimination structures…
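The selection step described in the abstract can be illustrated with a minimal sketch. It assumes MLP activations have already been collected from forward passes into arrays of shape (num_prompts, num_neurons), and it scores each neuron by the absolute difference of its mean activation between the two prompt sets. That scoring rule, the function names, and the `strength` parameter are illustrative assumptions, not the authors' implementation; the synthetic data stands in for real model activations.

```python
import numpy as np

def select_contrastive_neurons(harmful_acts, benign_acts, fraction=0.001):
    """Pick the top `fraction` of neurons by contrastive score.

    Assumption: the score is |mean(harmful) - mean(benign)| per neuron.
    The paper's exact scoring rule is not given in the abstract.
    """
    score = np.abs(harmful_acts.mean(axis=0) - benign_acts.mean(axis=0))
    k = max(1, int(round(fraction * score.size)))
    top = np.argpartition(score, -k)[-k:]  # indices of the k largest scores
    return np.sort(top)

def ablate(acts, neuron_idx, strength=1.0):
    """Scale the selected neurons by (1 - strength); strength=1.0 zeroes them.

    `strength` is a hypothetical knob standing in for steering strength.
    """
    out = acts.copy()
    out[..., neuron_idx] *= (1.0 - strength)
    return out

# Synthetic stand-in for activations gathered with forward passes only.
rng = np.random.default_rng(0)
num_neurons = 4000
benign = rng.normal(size=(64, num_neurons))
harmful = rng.normal(size=(64, num_neurons))
planted = np.array([7, 1234, 2500, 3999])
harmful[:, planted] += 5.0  # plant four strongly discriminating neurons

selected = select_contrastive_neurons(harmful, benign)  # 0.1% of 4000 = 4
ablated = ablate(harmful, selected)
```

Nothing here needs gradients or training: selection uses only statistics of recorded activations, and the intervention is a per-neuron rescaling. In a real model the same rescaling would be applied to MLP activations during generation, for example through a forward hook.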

Code & Models

Repositories

nousresearch/neural-steering (GitHub)
