Who Transfers Safety? Identifying and Targeting Cross-Lingual Shared Safety Neurons
Xianhui Zhang, Chengyu Xie, Linxia Zhu, Yonghui Yang, Weixiang Zhao, Zifeng Cheng, Cong Wang, Fei Shen, Tat-Seng Chua

TL;DR
This paper identifies a small set of cross-lingual safety neurons in multilingual language models that are crucial for safety transfer, and proposes a neuron-focused training method to improve safety in low-resource languages.
Contribution
The study uncovers shared safety neurons across languages and introduces a neuron-targeted training strategy that outperforms existing methods in enhancing multilingual safety.
Findings
Shared safety neurons regulate safety across languages.
Suppressing these neurons reduces safety in non-high-resource languages.
Reinforcing these neurons improves safety transfer and consistency.
Abstract
Multilingual safety remains significantly imbalanced, leaving non-high-resource (NHR) languages vulnerable compared to robust high-resource (HR) ones. Moreover, the neural mechanisms driving safety alignment remain unclear despite observed cross-lingual representation transfer. In this paper, we find that LLMs contain a set of cross-lingual shared safety neurons (SS-Neurons), a remarkably small yet critical neuronal subset that jointly regulates safety behavior across languages. We first identify monolingual safety neurons (MS-Neurons) and validate their causal role in safety refusal behavior through targeted activation and suppression. Our cross-lingual analyses then identify SS-Neurons as the subset of MS-Neurons shared between HR and NHR languages, serving as a bridge to transfer safety capabilities from HR to NHR domains. We observe that suppressing these neurons causes…
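The abstract's definition of SS-Neurons as the subset of MS-Neurons shared between HR and NHR languages can be illustrated with a small set-based sketch. This is not the authors' implementation. It assumes MS-Neurons have already been identified for each language (the identification criterion is not given in this excerpt), that "shared" means present in every listed language, and that suppression means zeroing an activation. The function names, the (layer, index) encoding, the language codes, and the toy data are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumed encoding for a neuron: (layer index, neuron index within that layer).
Neuron = Tuple[int, int]


def shared_safety_neurons(
    ms_neurons: Dict[str, Set[Neuron]],
    hr_langs: List[str],
    nhr_langs: List[str],
) -> Set[Neuron]:
    """Return the neurons found in the MS-Neuron set of every listed language.

    Assumption: "shared" is read as a strict intersection over all HR and
    NHR languages; the paper may use a different or softer criterion.
    """
    langs = hr_langs + nhr_langs
    if not langs:
        return set()
    shared = set(ms_neurons[langs[0]])  # copy so the input is not mutated
    for lang in langs[1:]:
        shared &= ms_neurons[lang]
    return shared


def suppress(
    activations: Dict[Neuron, float], targets: Set[Neuron]
) -> Dict[Neuron, float]:
    """Zero the activations of target neurons and leave the rest unchanged."""
    return {n: (0.0 if n in targets else a) for n, a in activations.items()}


# Toy MS-Neuron sets, invented for illustration only.
ms = {
    "en": {(3, 10), (3, 42), (7, 5)},
    "zh": {(3, 42), (7, 5), (9, 1)},
    "sw": {(3, 42), (7, 5), (11, 8)},
}
ss = shared_safety_neurons(ms, hr_langs=["en", "zh"], nhr_langs=["sw"])
print(sorted(ss))  # [(3, 42), (7, 5)]
```

In a real model, suppression would typically be applied to hidden activations during the forward pass rather than to a dictionary; the dictionary form here only keeps the sketch self-contained.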
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
