Preference Tuning For Toxicity Mitigation Generalizes Across Languages

Xiaochen Li; Zheng-Xin Yong; Stephen H. Bach

arXiv:2406.16235 · cs.CL · November 11, 2024

TL;DR

This paper demonstrates that preference tuning with English data can effectively reduce toxicity in multilingual LLMs across various languages, showing strong cross-lingual generalization and explaining it through mechanistic interpretability.

Contribution

It reveals that DPO preference tuning trained on English data generalizes to multiple languages and explains this phenomenon via the dual multilinguality of MLP layers.

Findings

01

For mGPT-1.3B, the probability of generating a toxic continuation drops from 46.8% to 3.9% across 17 languages after DPO training on English data only.

02

The detoxification effect of English-only DPO extends to other multilingual LLMs, including BLOOM, Llama3, and Aya-23.

03

Bilingual sentence retrieval predicts cross-lingual transferability.
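To make the third finding concrete, here is a minimal sketch of how a bilingual sentence retrieval accuracy could be computed. It assumes that each sentence already has an embedding (a plain list of floats) and that index i in both lists refers to translations of the same sentence. This page does not say how the embeddings are obtained, so the function names and toy vectors below are hypothetical, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieval_accuracy(src_embs, tgt_embs):
    """Fraction of source sentences whose nearest target is their own translation.

    Assumption: src_embs[i] and tgt_embs[i] embed the same sentence
    in two different languages (a parallel corpus).
    """
    correct = 0
    for i, src in enumerate(src_embs):
        # Retrieve the target sentence with the highest cosine similarity.
        best = max(range(len(tgt_embs)), key=lambda j: cosine(src, tgt_embs[j]))
        correct += int(best == i)
    return correct / len(src_embs)


# Toy, made-up embeddings for three parallel sentences.
english = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
other = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.7]]
aligned_accuracy = retrieval_accuracy(english, other)
```

Under this reading, a language pair with higher retrieval accuracy is better aligned with English in the embedding space, which is the quantity the finding relates to cross-lingual transfer.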

Abstract

Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their increasing global use. In this work, we explore zero-shot cross-lingual generalization of preference tuning in detoxifying LLMs. Unlike previous studies that show limited cross-lingual generalization for other safety tasks, we demonstrate that Direct Preference Optimization (DPO) training with only English data can significantly reduce toxicity in multilingual open-ended generations. For example, the probability of mGPT-1.3B generating toxic continuations drops from 46.8% to 3.9% across 17 different languages after training. Our results also extend to other multilingual LLMs, such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools like causal intervention and activation analysis, we identified the dual multilinguality property of MLP layers in LLMs, which explains the cross-lingual…
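For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-example DPO loss, computed from sequence log-probabilities. It assumes the preferred continuation is the non-toxic one and the rejected continuation is the toxic one, which is an illustrative reading of the abstract. The function name, argument names, and the default `beta` are hypothetical and are not taken from the authors' code.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    loss = -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))))
    where *_w is the log-probability of the preferred continuation and
    *_l that of the rejected one (assumed here: non-toxic vs. toxic).
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is zero and the loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Policy that raises the preferred and lowers the rejected continuation.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

The loss falls as the policy assigns relatively more probability to the preferred continuation than the frozen reference model does, so `improved` is smaller than `baseline`.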

Code & Models

Repositories

batsresearch/cross-lingual-detox (PyTorch, official)

Videos

Preference Tuning For Toxicity Mitigation Generalizes Across Languages (Underline)

Taxonomy

Topics: Computational Drug Discovery Methods

Methods: Direct Preference Optimization · BLOOM