Multilingual Safety Alignment via Self-Distillation
Ruiyang Qin, Qingzhuo Wang, Dongrui Liu, Qiang Li, Zhihua Wei, Wen Shen

TL;DR
This paper introduces a cross-lingual safety transfer framework for large language models that enhances safety in low-resource languages without requiring response data, using self-distillation and a novel divergence measure.
Contribution
The paper proposes Multilingual Self-Distillation, a novel method for transferring safety capabilities from high-resource to low-resource languages without response data, and introduces Dual-Perspective Safety Weighting.
Findings
Achieves superior safety performance across diverse multilingual benchmarks.
Effectively generalizes to unseen languages and challenging datasets.
Maintains the model's utility while improving safety in low-resource languages.
Abstract
Large language models (LLMs) exhibit severe multilingual safety misalignment: they possess strong safeguards in high-resource languages but remain highly vulnerable to jailbreak attacks in low-resource languages. Current safety alignment methods generally rely on high-quality response data for each target language, which is expensive and difficult to generate. In this paper, we propose a cross-lingual safeguard transfer framework named Multilingual Self-Distillation (MSD). This framework transfers an LLM's inherent safety capabilities from high-resource (e.g., English) to low-resource (e.g., Javanese) languages, overcoming the need for response data in any language. Our framework is flexible and can be integrated with different self-distillation strategies. Specifically, we implement two concrete methods -- on-policy MSD and off-policy MSD -- both of which enable effective cross-lingual…
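The core idea in the abstract, one model acting as its own teacher across languages, can be illustrated with a toy loss. This is a minimal sketch under stated assumptions, not the paper's objective: it assumes the teacher signal is the model's next-token distribution when given a high-resource-language prompt, the student signal is the same model's distribution when given the low-resource translation, and the two are matched with a forward KL divergence over the same response tokens. The function names, the KL direction, and the optional `weights` argument (a stand-in for any per-token weighting scheme) are all hypothetical; the excerpt does not specify the paper's divergence measure or how Dual-Perspective Safety Weighting is computed.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))

def distillation_loss(teacher_seq, student_seq, weights=None):
    """Weighted mean of per-token KL over a response.

    teacher_seq: logits from the model given the high-resource prompt (assumed frozen).
    student_seq: logits from the same model given the low-resource prompt.
    weights:     hypothetical per-token weights; uniform if omitted.
    """
    kls = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    if weights is None:
        weights = [1.0] * len(kls)
    return sum(w * k for w, k in zip(weights, kls)) / sum(weights)

# Toy example: two response tokens, vocabulary of size three.
teacher_seq = [[4.0, 0.0, 0.0], [3.0, 1.0, 0.0]]  # confident distribution on the English prompt
student_seq = [[0.0, 0.0, 4.0], [1.0, 3.0, 0.0]]  # diverging distribution on the translated prompt

mismatch_loss = distillation_loss(teacher_seq, student_seq)
matched_loss = distillation_loss(teacher_seq, teacher_seq)
```

The loss is zero when the model behaves identically in both languages and grows as the low-resource behavior drifts from the high-resource behavior. In the distillation literature, "on-policy" usually means the response tokens are sampled from the student and "off-policy" means they come from a fixed source such as the teacher; whether the paper uses the terms this way cannot be confirmed from the truncated abstract.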
