Multilingual Collaborative Defense for Large Language Models
Hongliang Li, Jinan Xu, Gengping Cui, Changhao Guan, Fengran Mo, Kaiyu Huang

TL;DR
This paper introduces Multilingual Collaborative Defense (MCD), a method that automatically optimizes a continuous soft safety prompt to improve the safeguarding of large language models across multiple languages and the transferability of that safeguarding between languages.
Contribution
The paper proposes MCD, a learning approach that improves the multilingual safety of LLMs by addressing the misalignment of safety behavior across languages and by generalizing better to underrepresented languages.
Findings
MCD outperforms existing safeguarding methods on multilingual jailbreak benchmarks.
MCD demonstrates strong language transferability in zero-shot scenarios.
MCD maintains low false refusal rates while improving safety across languages.
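The false refusal rate mentioned above can be made concrete with a short sketch. It assumes the common definition (the fraction of benign prompts that the model refuses), computed separately for each language; the record layout and the function name are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def false_refusal_rate_by_language(records):
    """Return {language: refused benign prompts / all benign prompts}.

    records: iterable of (language, is_harmful, was_refused) tuples.
    This layout is an assumption made for illustration only.
    """
    benign = defaultdict(int)
    refused = defaultdict(int)
    for language, is_harmful, was_refused in records:
        if not is_harmful:  # only benign prompts count toward false refusals
            benign[language] += 1
            refused[language] += int(was_refused)
    return {lang: refused[lang] / benign[lang] for lang in benign}
```

A safeguard that simply refuses everything would score well on jailbreak benchmarks but poorly on this metric, which is why the two are reported together.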
Abstract
The robustness and security of large language models (LLMs) have become a prominent research area. One notable vulnerability is the ability to bypass LLM safeguards by translating harmful queries into rare or underrepresented languages, a simple yet effective method of "jailbreaking" these models. Despite the growing concern, there has been limited research addressing the safeguarding of LLMs in multilingual scenarios, highlighting an urgent need to enhance multilingual safety. In this work, we investigate the correlation between various attack features across different languages and propose Multilingual Collaborative Defense (MCD), a novel learning method that optimizes a continuous, soft safety prompt automatically to facilitate multilingual safeguarding of LLMs. The MCD approach offers three advantages: First, it effectively improves safeguarding performance across multiple languages.…
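To illustrate what optimizing a continuous, soft safety prompt means in general, here is a toy sketch of soft prompt tuning in plain Python. It is not the MCD objective or the authors' implementation: the frozen two-weight "model", the embeddings, the language tags, and all names are hypothetical. The only ideas it is meant to show are that learnable vectors are prepended to the input embeddings, that the model weights stay frozen, and that the loss is averaged over examples from several languages.

```python
import math

# Frozen toy "model": refusal logit = B + W . sum(token vectors).
# W and B stand in for a frozen LLM and are never updated (assumption).
W = (1.0, -1.0)
B = -2.0

# Hypothetical data: (language, token embeddings, label); label 1 = should refuse.
DATA = [
    ("en", [(2.0, 0.0), (1.0, 0.0)], 1),  # harmful, high-resource language
    ("xx", [(1.0, 0.5), (0.5, 0.0)], 1),  # harmful, low-resource language
    ("en", [(0.0, 0.5), (0.0, 0.5)], 0),  # benign
    ("xx", [(0.0, 1.5), (0.0, 1.5)], 0),  # benign
]


def refusal_prob(soft_prompt, tokens):
    """Refusal probability with the soft prompt prepended to the input."""
    seq = soft_prompt + tokens
    logit = B + sum(W[0] * v[0] + W[1] * v[1] for v in seq)
    return 1.0 / (1.0 + math.exp(-logit))


def mean_loss(soft_prompt):
    """Binary cross-entropy averaged over all languages."""
    total = 0.0
    for _, tokens, label in DATA:
        p = refusal_prob(soft_prompt, tokens)
        total -= math.log(p) if label == 1 else math.log(1.0 - p)
    return total / len(DATA)


def train(num_vectors=2, steps=200, lr=0.25):
    """Gradient descent on the soft prompt only; W and B stay fixed."""
    soft_prompt = [[0.0, 0.0] for _ in range(num_vectors)]
    for _ in range(steps):
        # d(loss)/d(logit) for binary cross-entropy is (p - label).
        g = sum(refusal_prob(soft_prompt, t) - y for _, t, y in DATA) / len(DATA)
        for vec in soft_prompt:
            # d(logit)/d(vec) = W, so each prompt vector gets the same update.
            vec[0] -= lr * g * W[0]
            vec[1] -= lr * g * W[1]
    return soft_prompt
```

In this toy, the untrained prompt leaves the low-resource harmful example below the refusal threshold, and the learned prompt pushes it above the threshold while the benign examples stay below it. Because the toy model only sums its inputs, the prompt acts as a learned shift; in a real transformer the prepended vectors would interact with the input through attention.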
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling
