Multilingual Safety Alignment via Self-Distillation

Ruiyang Qin; Qingzhuo Wang; Dongrui Liu; Qiang Li; Zhihua Wei; Wen Shen

arXiv:2605.02971 · cs.LG · May 11, 2026

TL;DR

This paper introduces a cross-lingual safety transfer framework for large language models that enhances safety in low-resource languages without requiring response data, using self-distillation and a novel divergence measure.

Contribution

The paper proposes Multilingual Self-Distillation, a novel method for transferring safety capabilities from high-resource to low-resource languages without response data, and introduces Dual-Perspective Safety Weighting.

Findings

1. Achieves superior safety performance across diverse multilingual benchmarks.

2. Effectively generalizes to unseen languages and challenging datasets.

3. Maintains the model's utility while improving safety in low-resource languages.

Abstract

Large language models (LLMs) exhibit severe multilingual safety misalignment: they possess strong safeguards in high-resource languages but remain highly vulnerable to jailbreak attacks in low-resource languages. Current safety alignment methods generally rely on high-quality response data for each target language, which is expensive and difficult to generate. In this paper, we propose a cross-lingual safeguard transfer framework named Multilingual Self-Distillation (MSD). This framework transfers an LLM's inherent safety capabilities from high-resource (e.g., English) to low-resource (e.g., Javanese) languages, overcoming the need for response data in any language. Our framework is flexible and can be integrated with different self-distillation strategies. Specifically, we implement two concrete methods -- on-policy MSD and off-policy MSD -- both of which enable effective cross-lingual…
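
To make the idea of cross-lingual self-distillation concrete, here is a minimal sketch of one plausible training signal. It is not the paper's implementation: the excerpt above does not specify the MSD objective, the divergence measure, or Dual-Perspective Safety Weighting, so the plain token-level KL divergence, the toy three-token vocabulary, and all function and variable names below are assumptions made for illustration. The sketch assumes the same model plays both roles: as a frozen teacher it scores a response given the high-resource (English) prompt, and as a student it scores the same response given the low-resource (e.g., Javanese) version of that prompt. In common distillation usage, the response would be sampled from the student in an on-policy setup and from a fixed source in an off-policy one; whether the paper defines its two variants this way is not stated in the excerpt.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def distillation_loss(teacher_seq, student_seq):
    """Mean per-token KL over a response (hypothetical objective, not the paper's)."""
    if len(teacher_seq) != len(student_seq):
        raise ValueError("teacher and student must score the same response tokens")
    per_token = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token)


# Toy logits over a hypothetical vocabulary: [refuse, comply, other].
# Teacher: model conditioned on the English prompt (treated as fixed).
teacher = [[4.0, 0.5, 0.1], [3.5, 0.2, 0.3]]
# Student: same model conditioned on the low-resource prompt.
student = [[0.8, 2.5, 0.4], [1.0, 2.0, 0.5]]

loss = distillation_loss(teacher, student)
print(f"distillation loss: {loss:.4f}")
```

In this toy setup the loss is zero only when the student's token distributions match the teacher's, so minimizing it pulls the low-resource behavior toward the high-resource behavior without needing any reference response written in the target language.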
