Multilingual Collaborative Defense for Large Language Models

Hongliang Li; Jinan Xu; Gengping Cui; Changhao Guan; Fengran Mo; Kaiyu Huang

arXiv:2505.11835 · cs.CL · September 16, 2025

TL;DR

This paper introduces Multilingual Collaborative Defense (MCD), a method that automatically optimizes a continuous soft safety prompt to improve the safety of large language models across multiple languages, including transfer to languages not seen during optimization.

Contribution

The paper proposes MCD, a new learning approach that improves multilingual safety of LLMs, addressing language safety misalignment and enhancing generalization across underrepresented languages.

Findings

01. MCD outperforms existing safeguarding methods on multilingual jailbreak benchmarks.

02. MCD demonstrates strong language transferability in zero-shot scenarios.

03. MCD maintains low false refusal rates while improving safety across languages.

Abstract

The robustness and security of large language models (LLMs) have become a prominent research area. One notable vulnerability is the ability to bypass LLM safeguards by translating harmful queries into rare or underrepresented languages, a simple yet effective method of "jailbreaking" these models. Despite the growing concern, there has been limited research addressing the safeguarding of LLMs in multilingual scenarios, highlighting an urgent need to enhance multilingual safety. In this work, we investigate the correlation between various attack features across different languages and propose Multilingual Collaborative Defense (MCD), a novel learning method that optimizes a continuous, soft safety prompt automatically to facilitate multilingual safeguarding of LLMs. The MCD approach offers three advantages: First, it effectively improves safeguarding performance across multiple languages.…
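
To make the idea of a "continuous, soft safety prompt" concrete, the following is a minimal sketch of generic soft prompt tuning. It is not the authors' MCD implementation. Everything in it is an assumption made for illustration: the frozen "model" is a toy mean-pooling logistic scorer, the inputs are hand-made 2-dimensional vectors standing in for token embeddings in a high-resource and a low-resource language, and the names `refuse_prob` and `train_soft_prompt` are hypothetical. The only point it shows is that the model weights stay fixed while a small set of prepended vectors is optimized on harmful and benign examples from several languages at once.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def refuse_prob(prompt, tokens, w, b):
    """Frozen toy scorer (assumption): mean-pool prompt + token vectors,
    then apply a fixed logistic layer to get P(refuse)."""
    vectors = prompt + tokens
    pooled = [sum(v[i] for v in vectors) / len(vectors) for i in range(len(w))]
    return sigmoid(sum(wi * pi for wi, pi in zip(w, pooled)) + b)


def train_soft_prompt(data, w, b, prompt_len=2, steps=1000, lr=1.0, seed=0):
    """Gradient descent on the soft prompt only; w and b are never updated."""
    rng = random.Random(seed)
    dim = len(w)
    prompt = [[rng.uniform(-0.01, 0.01) for _ in range(dim)]
              for _ in range(prompt_len)]
    for _ in range(steps):
        grad = [[0.0] * dim for _ in range(prompt_len)]
        for tokens, label in data:
            p = refuse_prob(prompt, tokens, w, b)
            # Binary cross-entropy: dL/dlogit = p - label.
            # Mean pooling: dlogit/dprompt[j][i] = w[i] / (number of vectors).
            scale = (p - label) / (prompt_len + len(tokens))
            for row in grad:
                for i in range(dim):
                    row[i] += scale * w[i]
        for row, g in zip(prompt, grad):
            for i in range(dim):
                row[i] -= lr * g[i] / len(data)
    return prompt


# Frozen scorer weights (hypothetical). Feature 0 plays the role of a
# "harmfulness" signal that is weak for the low-resource language.
W, B = [1.0, 0.0], 0.0

harmful_high = [[3.0, 0.2], [3.0, -0.1]]    # harmful query, high-resource language
harmful_low = [[-0.5, 0.4], [-0.5, 0.0]]    # same intent, low-resource language
benign_high = [[-3.0, 0.1], [-3.0, 0.3]]
benign_low = [[-3.0, -0.2], [-3.0, 0.5]]

# Label 1.0 = should refuse, 0.0 = should answer; languages are trained jointly.
data = [(harmful_high, 1.0), (harmful_low, 1.0),
        (benign_high, 0.0), (benign_low, 0.0)]

prompt = train_soft_prompt(data, W, B)
before = refuse_prob([], harmful_low, W, B)
after = refuse_prob(prompt, harmful_low, W, B)
print("P(refuse) low-resource harmful, no prompt:  ", round(before, 3))
print("P(refuse) low-resource harmful, soft prompt:", round(after, 3))
print("P(refuse) low-resource benign, soft prompt: ",
      round(refuse_prob(prompt, benign_low, W, B), 3))
```

Because the toy scorer uses mean pooling, the learned prompt can only shift the decision boundary. In a real transformer the prepended vectors would interact with the input through attention, and the prompt would be optimized by backpropagation through the frozen network.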

Code & Models

Repositories

hliang-lee/mcd (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling