Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety

Max Zhang; Derek Liu; Kai Zhang; Joshua Franco; Haihao Liu

arXiv:2602.11157·cs.CL·April 27, 2026

TL;DR

This study applies response-based knowledge distillation to multilingual jailbreak prevention in large language models and finds that fine-tuning on a teacher model's refusal data can degrade safety rather than improve it.

Contribution

It shows that standard fine-tuning on a teacher model's refusal data can unintentionally increase jailbreak success rate, and it explores mitigations for this safety degradation, such as removing boundary refusals.

Findings

01

Fine-tuning on safety data can increase jailbreak success rate by up to 16.6%.

02

Distillation effects vary across different languages and models.

03

Removing boundary refusals can mitigate safety declines, with some trade-offs.

Abstract

Large language models (LLMs) are increasingly deployed worldwide, yet their safety alignment remains predominantly English-centric. This allows for vulnerabilities in non-English contexts, especially with low-resource languages. We introduce a novel application of knowledge distillation (KD) in the context of multilingual jailbreak prevention, examining its efficacy. We distill the refusal behaviors of a proprietary teacher model (OpenAI o1-mini) with Low-Rank Adaptation (LoRA) into three open-source student models: Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B, using ~28,000 multilingual jailbreak prompts from XSafety via black-box response-based, parameter-efficient fine-tuning (PEFT). Evaluation on the MultiJail benchmark reveals a counterintuitive behavior: standard fine-tuning on the teacher's "safe" refusal data inadvertently increases Jailbreak Success Rate (JSR) for…
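For readers unfamiliar with the metric, the following is a minimal sketch of how a per-language JSR and its change after fine-tuning could be computed. It assumes JSR is the percentage of jailbreak prompts whose response is judged unsafe; the record format, the function names, and the language codes are illustrative and are not taken from the paper.

```python
from collections import defaultdict


def jailbreak_success_rate(records):
    """Return {language: percentage of prompts whose response was judged unsafe}.

    Each record is a (language, is_unsafe) pair. The is_unsafe flag is a
    hypothetical boolean label produced by some safety judge.
    """
    totals = defaultdict(int)
    unsafe = defaultdict(int)
    for language, is_unsafe in records:
        totals[language] += 1
        unsafe[language] += int(is_unsafe)
    return {lang: 100.0 * unsafe[lang] / totals[lang] for lang in totals}


def jsr_delta(before, after):
    """Return the change in JSR (percentage points) for languages present in both."""
    return {lang: after[lang] - before[lang] for lang in before if lang in after}


# Illustrative labels only; these are not results from the paper.
base_records = [("en", False), ("en", False), ("zh", True), ("zh", False)]
tuned_records = [("en", True), ("en", False), ("zh", True), ("zh", True)]

base_jsr = jailbreak_success_rate(base_records)
tuned_jsr = jailbreak_success_rate(tuned_records)
delta = jsr_delta(base_jsr, tuned_jsr)
```

A positive value in `delta` means the fine-tuned model was jailbroken more often than the base model for that language, which is the direction of change the abstract describes as counterintuitive.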

