Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
Hongliang Liu, Tung-Ling Li, Yuhao Wu

TL;DR
Perturbation probing generates causal hypotheses about FFN neurons in large language models, enabling targeted interventions on behaviors such as safety refusal and language switching without retraining.
Contribution
This work introduces a two-pass perturbation probing method that identifies and manipulates FFN circuits in LLMs, advancing mechanistic understanding and control.
Findings
Two circuit structures, opposition circuits and routing circuits, organize LLM behaviors.
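Ablating about 50 neurons (0.014% of all neurons) changes the response format on 80% of 520 AdvBench prompts while producing near-zero harmful compliance (3 of 520 cases, all with disclaimers).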
Interventions improve factual accuracy and language switching in specific models.
Abstract
Perturbation probing generates task-specific causal hypotheses for FFN neurons in large language models using two forward passes per prompt and no backpropagation, followed by a one-time intervention sweep of about 150 passes amortized across all identified neurons. Across eight behavioral circuits, 13 models, and four architecture families, we identify two circuit structures that organize LLM behavior. Opposition circuits appear when RLHF suppresses a pre-training tendency. In safety refusal, about 50 neurons, or 0.014 percent of all neurons, control the refusal template; ablating them changes 80 percent of response formats on 520 AdvBench prompts while producing near-zero harmful compliance, 3 of 520 cases, all with disclaimers. Routing circuits appear for pre-training behaviors distributed through attention. For language selection, residual-stream direction injection switches English…
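The two-pass idea in the abstract can be illustrated with a toy sketch. The abstract does not state which perturbation is applied or how neurons are scored, so everything here is an assumption made for illustration: a single ReLU FFN layer in pure Python, a perturbation that shifts one input dimension, and a score equal to the summed absolute change in each neuron's activation between the clean and perturbed pass. The names `ffn_hidden`, `ffn_output`, `rank_neurons`, and `perturb` are hypothetical and do not come from the paper.

```python
import random

def ffn_hidden(w_in, x):
    """Hidden activations of a toy FFN layer: relu(w_in @ x)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]

def ffn_output(w_out, h, ablate=frozenset()):
    """Scalar readout; neurons listed in `ablate` contribute nothing."""
    return sum(w * a for i, (w, a) in enumerate(zip(w_out, h)) if i not in ablate)

def rank_neurons(w_in, prompts, perturb, k):
    """Two forward passes per prompt (clean and perturbed), no gradients.

    Assumed scoring rule: summed absolute activation change per neuron.
    """
    scores = [0.0] * len(w_in)
    for x in prompts:
        clean = ffn_hidden(w_in, x)
        pert = ffn_hidden(w_in, perturb(x))
        for i, (c, p) in enumerate(zip(clean, pert)):
            scores[i] += abs(p - c)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy setup: 64 neurons over 8 input dimensions. Only neurons 3 and 7
# read input dimension 0, which is the dimension the perturbation shifts.
rng = random.Random(0)
D, N = 8, 64
w_in = [[0.0] + [rng.uniform(-1, 1) for _ in range(D - 1)] for _ in range(N)]
w_in[3] = [2.0] + [0.0] * (D - 1)
w_in[7] = [3.0] + [0.0] * (D - 1)
w_out = [1.0] * N
prompts = [[rng.uniform(0.1, 1.0) for _ in range(D)] for _ in range(20)]

def perturb(x):
    """Hypothetical perturbation: shift input dimension 0 by a constant."""
    return [x[0] + 1.0] + x[1:]

# Pass 1 and pass 2 per prompt yield candidate neurons; ablation then
# tests whether removing them actually changes the output.
top = rank_neurons(w_in, prompts, perturb, k=2)
h = ffn_hidden(w_in, prompts[0])
base = ffn_output(w_out, h)
ablated = ffn_output(w_out, h, ablate=frozenset(top))
```

In this toy, the ranking step only proposes candidates; the ablation comparison between `base` and `ablated` plays the role of the separate intervention sweep that the abstract describes as confirming the hypotheses.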
