Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
Jianhui Chen, Xiaozhi Wang, Zijun Yao, Yushi Bai, Lei Hou, Juanzi Li

TL;DR
This paper investigates the internal safety mechanisms of large language models by identifying safety neurons and demonstrating how activating or patching them can significantly improve safety without impairing overall performance.
Contribution
The paper introduces a mechanistic interpretability approach for locating safety neurons in LLMs and shows that patching the activations of these neurons causally improves model safety.
Findings
Approximately 5% of neurons are safety neurons across models.
Patching safety neuron activations restores over 90% of safety performance.
Safety neurons overlap with helpfulness neurons but require different activation patterns.
Abstract
Large language models (LLMs) excel in various capabilities but pose safety risks such as generating harmful content and misinformation, even after safety alignment. In this paper, we explore the inner mechanisms of safety alignment through the lens of mechanistic interpretability, focusing on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose inference-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects on model safety. Experiments on multiple prevalent LLMs demonstrate that we can consistently identify about 5% safety neurons, and by only patching their activations we can restore over 90% of the safety performance across various red-teaming benchmarks without influencing general ability. The finding of safety neurons also helps explain the "alignment tax"…
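The two steps named in the abstract, contrasting activations to locate candidate neurons and then patching those neurons, can be illustrated with a toy sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the function names (change_scores, top_fraction, patch), the root-mean-square scoring rule, the made-up activation values, and the 25% selection fraction chosen only because the toy layer has four neurons. In a real setting the activations would be recorded from an aligned and an unaligned model on the same prompts, and patching would be repeated at every generation step.

```python
import math

def change_scores(aligned_acts, base_acts):
    """Per-neuron RMS difference between two models' activations.

    Both arguments are lists of samples; each sample is a list holding one
    activation value per neuron. (Assumed scoring rule, for illustration.)
    """
    n_samples = len(aligned_acts)
    n_neurons = len(aligned_acts[0])
    scores = []
    for j in range(n_neurons):
        sq = sum((aligned_acts[i][j] - base_acts[i][j]) ** 2
                 for i in range(n_samples))
        scores.append(math.sqrt(sq / n_samples))
    return scores

def top_fraction(scores, fraction):
    """Indices of the highest-scoring neurons, keeping the given fraction."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

def patch(base_act, aligned_act, neuron_ids):
    """Copy the aligned model's activations into the base model's
    activation vector, but only at the selected neuron indices."""
    patched = list(base_act)
    for j in neuron_ids:
        patched[j] = aligned_act[j]
    return patched

# Made-up activations: 2 samples x 4 neurons (hypothetical values).
aligned = [[0.1, 2.0, 0.5, 0.0], [0.2, 1.8, 0.5, 0.1]]
base = [[0.1, 0.0, 0.4, 0.0], [0.2, 0.2, 0.6, 0.1]]

scores = change_scores(aligned, base)
safety_ids = top_fraction(scores, 0.25)        # toy fraction, not the paper's 5%
patched = patch(base[0], aligned[0], safety_ids)
print(safety_ids, patched)
```

In this toy data only neuron 1 differs substantially between the two models, so it is the one selected and overwritten; the other three activations stay at the base model's values.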
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Activation Patching
