TL;DR
This study explores using knowledge distillation to improve multilingual safety in large language models, revealing challenges and potential in aligning safety across languages.
Contribution
It demonstrates how standard fine-tuning on safety data can unintentionally increase jailbreak success, and proposes methods to mitigate safety degradation.
Findings
Fine-tuning on safety data can increase jailbreak success rate by up to 16.6%.
Distillation effects vary across different languages and models.
Removing boundary refusals can mitigate safety declines, with some trade-offs.
Abstract
Large language models (LLMs) are increasingly deployed worldwide, yet their safety alignment remains predominantly English-centric. This allows for vulnerabilities in non-English contexts, especially with low-resource languages. We introduce a novel application of knowledge distillation (KD) in the context of multilingual jailbreak prevention, examining its efficacy. We distill the refusal behaviors of a proprietary teacher model (OpenAI o1-mini) with Low-Rank Adaptation (LoRA) into three open-source student models: Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B, using ~28,000 multilingual jailbreak prompts from XSafety via black-box response-based, parameter-efficient fine-tuning (PEFT). Evaluation on the MultiJail benchmark reveals a counterintuitive behavior: standard fine-tuning on the teacher's ``safe'' refusal data inadvertently increases Jailbreak Success Rate (JSR) for…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
