Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons

Jianhui Chen; Xiaozhi Wang; Zijun Yao; Yushi Bai; Lei Hou; Juanzi Li

arXiv:2406.14144·cs.CL·October 24, 2025·1 cites

TL;DR

This paper investigates the internal safety mechanisms of large language models by identifying safety neurons and showing that patching their activations alone restores most of the safety performance without impairing general ability.

Contribution

It introduces a mechanistic interpretability approach that locates safety neurons in LLMs with inference-time activation contrasting and measures their causal effect on safety with dynamic activation patching.

Findings

01. About 5% of neurons are identified as safety neurons, consistently across the models tested.

02. Patching only the safety neurons' activations restores over 90% of safety performance.

03. Safety neurons overlap with helpfulness neurons but require different activation patterns.

Abstract

Large language models (LLMs) excel in various capabilities but pose safety risks such as generating harmful content and misinformation, even after safety alignment. In this paper, we explore the inner mechanisms of safety alignment through the lens of mechanistic interpretability, focusing on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose inference-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects on model safety. Experiments on multiple prevalent LLMs demonstrate that we can consistently identify about 5% safety neurons, and by only patching their activations we can restore over 90% of the safety performance across various red-teaming benchmarks without influencing general ability. The finding of safety neurons also helps explain the "alignment tax"…
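The two techniques named in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the paper's implementation: it assumes a single ReLU hidden layer in place of an LLM, uses the per-neuron RMS activation difference between an "aligned" and a "base" network as the contrast score, and patches in a single forward pass rather than at every generation step. All function names, the toy weights, and the scoring rule are illustrative assumptions.

```python
import math
import random


def hidden_activations(weights, x):
    """ReLU activations of one hidden layer (one weight row per neuron)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def contrast_scores(base_w, aligned_w, inputs):
    """Per-neuron RMS difference between aligned and base activations."""
    sq = [0.0] * len(base_w)
    for x in inputs:
        a_base = hidden_activations(base_w, x)
        a_aligned = hidden_activations(aligned_w, x)
        for i, (b, a) in enumerate(zip(a_base, a_aligned)):
            sq[i] += (a - b) ** 2
    return [math.sqrt(s / len(inputs)) for s in sq]


def top_fraction(scores, fraction):
    """Indices of the highest-scoring neurons (at least one)."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def output(base_w, aligned_w, out_w, x, patched=frozenset()):
    """Base-model output with the `patched` neurons taken from the aligned model."""
    a_base = hidden_activations(base_w, x)
    a_aligned = hidden_activations(aligned_w, x)
    acts = [a_aligned[i] if i in patched else a_base[i] for i in range(len(a_base))]
    return sum(a * w for a, w in zip(acts, out_w))


# Toy data (assumption): "alignment" shifts 5 of 100 neurons strongly
# and perturbs the remaining neurons only slightly.
rng = random.Random(0)
n_hidden, n_in = 100, 8
base_w = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)]
out_w = [rng.gauss(0, 1) for _ in range(n_hidden)]
changed = set(rng.sample(range(n_hidden), 5))
aligned_w = [
    [w + rng.gauss(0, 2.0 if i in changed else 0.01) for w in row]
    for i, row in enumerate(base_w)
]
inputs = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(50)]

# Step 1: activation contrasting selects the top 5% of neurons.
safety_neurons = top_fraction(contrast_scores(base_w, aligned_w, inputs), 0.05)

# Step 2: activation patching measures how much of the output gap they close.
all_neurons = set(range(n_hidden))
gap = residual = 0.0
for x in inputs:
    aligned_out = output(base_w, aligned_w, out_w, x, all_neurons)
    gap += abs(aligned_out - output(base_w, aligned_w, out_w, x))
    residual += abs(aligned_out - output(base_w, aligned_w, out_w, x, safety_neurons))
print(f"selected {len(safety_neurons)} neurons; gap closed: {1 - residual / gap:.1%}")
```

The toy is built so that the difference between the two networks is concentrated in a few neurons, which is why patching them closes most of the gap here. Whether that holds in a real LLM is the empirical question the paper addresses.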

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Activation Patching