Who Transfers Safety? Identifying and Targeting Cross-Lingual Shared Safety Neurons

Xianhui Zhang; Chengyu Xie; Linxia Zhu; Yonghui Yang; Weixiang Zhao; Zifeng Cheng; Cong Wang; Fei Shen; Tat-Seng Chua

arXiv:2602.01283 · cs.CV · February 3, 2026

TL;DR

This paper identifies a small set of cross-lingual safety neurons in multilingual language models that are crucial for safety transfer, and proposes a neuron-focused training method to improve safety in low-resource languages.

Contribution

The study uncovers shared safety neurons across languages and introduces a neuron-targeted training strategy that outperforms existing methods in enhancing multilingual safety.

Findings

01. Shared safety neurons regulate safety across languages.

02. Suppressing these neurons reduces safety in non-high-resource languages.

03. Reinforcing these neurons improves safety transfer and consistency.
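
The suppression and reinforcement effects listed above can be pictured as scaling the activations of a chosen set of neurons. The following is only a minimal sketch of an activation-level intervention, not the paper's training method: the function name `scale_neurons`, the flat list representation of a layer's activations, and the scaling factors are all assumptions made for illustration.

```python
from typing import Iterable, List


def scale_neurons(
    activations: List[float], neuron_ids: Iterable[int], factor: float
) -> List[float]:
    """Return a copy of one layer's activations with the selected neurons scaled.

    factor = 0.0 suppresses the neurons; factor > 1.0 amplifies them.
    """
    selected = set(neuron_ids)
    return [a * factor if i in selected else a for i, a in enumerate(activations)]


# Toy activations for a single layer (made-up values, not model outputs).
layer_out = [0.5, -1.2, 2.0, 0.3]

# Hypothetical safety-neuron indices 1 and 2.
suppressed = scale_neurons(layer_out, neuron_ids=[1, 2], factor=0.0)
amplified = scale_neurons(layer_out, neuron_ids=[1, 2], factor=2.0)
```

In a real model, an intervention like this would typically be applied inside the forward pass (for example through a hook on the relevant layer), and its effect would be measured by comparing refusal behavior with and without it.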

Abstract

Multilingual safety remains significantly imbalanced, leaving non-high-resource (NHR) languages vulnerable compared to robust high-resource (HR) ones. Moreover, the neural mechanisms driving safety alignment remain unclear despite observed cross-lingual representation transfer. In this paper, we find that LLMs contain a set of cross-lingual shared safety neurons (SS-Neurons), a remarkably small yet critical neuronal subset that jointly regulates safety behavior across languages. We first identify monolingual safety neurons (MS-Neurons) and validate their causal role in safety refusal behavior through targeted activation and suppression. Our cross-lingual analyses then identify SS-Neurons as the subset of MS-Neurons shared between HR and NHR languages, serving as a bridge to transfer safety capabilities from HR to NHR domains. We observe that suppressing these neurons causes…
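
The relationship between MS-Neurons and SS-Neurons described in the abstract can be sketched as a set intersection. This is an illustration under stated assumptions rather than the authors' procedure: the `(layer, index)` neuron identifiers, the language codes, the toy neuron sets, and the function name `shared_safety_neurons` are hypothetical, and the excerpt does not say how MS-Neurons are scored or whether sharing is required across all languages or only between language pairs. The sketch assumes sharing across every listed HR and NHR language.

```python
from typing import Dict, Set, Tuple

# Hypothetical identifier scheme: (layer index, neuron index within the layer).
Neuron = Tuple[int, int]


def shared_safety_neurons(
    ms_neurons: Dict[str, Set[Neuron]],
    hr_langs: Set[str],
    nhr_langs: Set[str],
) -> Set[Neuron]:
    """Return neurons that are monolingual safety neurons in every HR and NHR language given."""
    if not hr_langs or not nhr_langs:
        raise ValueError("need at least one HR and one NHR language")
    langs = sorted(hr_langs | nhr_langs)
    shared = set(ms_neurons[langs[0]])  # copy so the input sets are not mutated
    for lang in langs[1:]:
        shared &= ms_neurons[lang]
    return shared


# Toy MS-Neuron sets per language (made-up values for illustration only).
ms = {
    "en": {(3, 17), (5, 2), (9, 40)},
    "zh": {(3, 17), (5, 2), (7, 8)},
    "sw": {(3, 17), (9, 40), (5, 2)},
}

ss = shared_safety_neurons(ms, hr_langs={"en", "zh"}, nhr_langs={"sw"})
print(sorted(ss))
```

With these toy sets, only the neurons present in all three languages survive the intersection, which mirrors the abstract's description of SS-Neurons as a small subset of MS-Neurons.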

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)