Preference Tuning For Toxicity Mitigation Generalizes Across Languages
Xiaochen Li, Zheng-Xin Yong, Stephen H. Bach

TL;DR
This paper demonstrates that preference tuning with English data can effectively reduce toxicity in multilingual LLMs across various languages, showing strong cross-lingual generalization and explaining it through mechanistic interpretability.
Contribution
The paper shows that DPO preference tuning on English-only data generalizes to multiple languages, and it explains this through the dual multilinguality of MLP layers.
Findings
For mGPT-1.3B, the probability of generating toxic continuations drops from 46.8% to 3.9% across 17 languages after English-only DPO training.
The result extends to other multilingual LLMs, such as BLOOM, Llama3, and Aya-23.
Bilingual sentence retrieval predicts cross-lingual transferability.
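As a rough illustration of what a bilingual sentence retrieval score measures, the sketch below matches each English sentence vector to its nearest neighbor among the translated sentence vectors and reports how often the true translation is retrieved. The listing does not say how the paper computes this score, so the use of cosine similarity, the toy vectors, and the names `cosine` and `retrieval_accuracy` are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieval_accuracy(src_vecs, tgt_vecs):
    """Fraction of source sentences whose nearest target vector is their translation.

    Assumes src_vecs[i] and tgt_vecs[i] represent the same sentence in two languages.
    """
    if len(src_vecs) != len(tgt_vecs) or not src_vecs:
        raise ValueError("need two non-empty, equally sized lists of vectors")
    correct = 0
    for i, u in enumerate(src_vecs):
        # Index of the target vector most similar to this source vector.
        best = max(range(len(tgt_vecs)), key=lambda j: cosine(u, tgt_vecs[j]))
        if best == i:
            correct += 1
    return correct / len(src_vecs)


# Toy, hand-made vectors standing in for model representations (hypothetical data).
english = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
other_language = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]]
accuracy = retrieval_accuracy(english, other_language)
```

Under this reading, a higher accuracy means the two languages are represented more similarly, which is the kind of signal the finding links to transferability.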
Abstract
Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their increasing global use. In this work, we explore zero-shot cross-lingual generalization of preference tuning in detoxifying LLMs. Unlike previous studies that show limited cross-lingual generalization for other safety tasks, we demonstrate that Direct Preference Optimization (DPO) training with only English data can significantly reduce toxicity in multilingual open-ended generations. For example, the probability of mGPT-1.3B generating toxic continuations drops from 46.8% to 3.9% across 17 different languages after training. Our results also extend to other multilingual LLMs, such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools like causal intervention and activation analysis, we identified the dual multilinguality property of MLP layers in LLMs, which explains the cross-lingual…
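For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-example DPO objective, which rewards the policy for raising the log-probability of the preferred (here, less toxic) continuation relative to a frozen reference model. It is not taken from the paper's code: the function name `dpo_loss`, the default `beta=0.1`, and the example log-probabilities are assumptions for illustration.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen_margin - rejected_margin)).

    Each argument is the summed log-probability of a full continuation given
    the prompt, under either the trained policy or the frozen reference model.
    beta is an assumed value, not one reported in this listing.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-logits)


# Hypothetical log-probabilities for one (prompt, non-toxic, toxic) triple.
loss_at_init = dpo_loss(-12.0, -10.0, -12.0, -10.0)   # policy equals reference
loss_improved = dpo_loss(-11.0, -14.0, -12.0, -10.0)  # policy now prefers non-toxic
```

When the policy equals the reference, the loss is log 2; it falls as the policy shifts probability toward the preferred continuation and away from the rejected one.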
Taxonomy
Topics: Computational Drug Discovery Methods
Methods: Direct Preference Optimization · BLOOM
