DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails

Yihe Deng; Yu Yang; Junkai Zhang; Wei Wang; Bo Li

arXiv:2502.05163·cs.CL·February 10, 2025

TL;DR

DuoGuard introduces a two-player RL framework in which a generator and a guardrail model co-evolve adversarially to create synthetic multilingual safety data, significantly improving safety performance across languages while remaining efficient and scalable.

Contribution

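We propose a novel adversarial RL framework for multilingual guardrail training, formalize the generator-guardrail interaction as a two-player game with proven convergence to a Nash equilibrium, and show that the resulting model outperforms existing guardrail models on safety benchmarks.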

Findings

1. Achieves nearly 10% improvement over LlamaGuard3 (8B) on English safety benchmarks.
2. Outperforms state-of-the-art models while being smaller and 4.5x faster at inference.
3. Effectively addresses the imbalance of safety data in low-resource languages.

Abstract

The rapid advancement of large language models (LLMs) has increased the need for guardrail models to ensure responsible use, particularly in detecting unsafe and illegal content. While substantial safety data exist in English, multilingual guardrail modeling remains underexplored due to the scarcity of open-source safety data in other languages. To address this gap, we propose a novel two-player Reinforcement Learning (RL) framework, where a generator and a guardrail model co-evolve adversarially to produce high-quality synthetic data for multilingual guardrail training. We theoretically formalize this interaction as a two-player game, proving convergence to a Nash equilibrium. Empirical evaluations show that our model, DuoGuard, outperforms state-of-the-art models, achieving nearly 10% improvement over LlamaGuard3 (8B) on English benchmarks while being 4.5x faster at inference with a…
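To make the co-evolution idea in the abstract concrete, here is a deliberately tiny sketch. It is not the paper's algorithm: the real system trains LLMs with RL, whereas this toy replaces prompts with numbers in [0, 1), the guardrail with a single threshold, and the generator with a sampler that keeps only the examples the guardrail currently gets wrong. Every name below (`true_label`, `guard_predict`, `generator_propose`, `guard_update`, `co_evolve`) and the 0.5 ground-truth boundary are assumptions made for illustration.

```python
import random


def true_label(x: float) -> int:
    """Hypothetical ground truth: a score of 0.5 or more counts as unsafe (1)."""
    return 1 if x >= 0.5 else 0


def guard_predict(threshold: float, x: float) -> int:
    """Toy guardrail: flag x as unsafe when it reaches the learned threshold."""
    return 1 if x >= threshold else 0


def generator_propose(threshold: float, rng: random.Random, n: int = 64) -> list[float]:
    """Toy generator: sample candidates and keep those the guardrail misclassifies."""
    candidates = [rng.random() for _ in range(n)]
    return [x for x in candidates if guard_predict(threshold, x) != true_label(x)]


def guard_update(threshold: float, hard: list[float], lr: float = 0.5) -> float:
    """Toy guardrail step: move the threshold toward the mean hard example."""
    if not hard:
        return threshold  # nothing was misclassified this round
    mean_hard = sum(hard) / len(hard)
    return threshold + lr * (mean_hard - threshold)


def co_evolve(rounds: int = 30, seed: int = 0, start: float = 0.9) -> list[float]:
    """Alternate generator and guardrail moves; return the threshold after each round."""
    rng = random.Random(seed)
    history = [start]
    for _ in range(rounds):
        hard = generator_propose(history[-1], rng)
        history.append(guard_update(history[-1], hard))
    return history


if __name__ == "__main__":
    trace = co_evolve()
    print(f"start={trace[0]:.3f} end={trace[-1]:.3f}")
```

In this toy the threshold drifts from its starting value toward the assumed boundary of 0.5, and the generator finds fewer hard examples as the guardrail improves. That loosely mirrors the equilibrium intuition in the abstract, but it says nothing about the paper's actual objective, models, or convergence proof.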

Code & Models

Repositories

yihedeng9/duoguard (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques