Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification

Himanshu Beniwal; Youngwoo Kim; Maarten Sap; Soham Dan; Thomas Hartvigsen

arXiv:2505.16722 · cs.CL · October 24, 2025

TL;DR

This paper introduces a cross-lingual detoxification method for large language models, enabling toxicity mitigation across diverse languages and scripts, while analyzing its impact on model performance and safety trade-offs.

Contribution

The paper proposes a cross-lingual detoxification approach based on supervised fine-tuning and evaluates it across 392 settings, targeting toxicity in multilingual LLMs.

Findings

1. Effective toxicity reduction in multiple languages
2. Trade-offs between safety and knowledge retention
3. Robust performance across diverse linguistic settings

Abstract

As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore "Cross-lingual Detoxification", a cross-lingual paradigm that mitigates toxicity, enabling detoxification capabilities to transfer between high and low-resource languages across different script families. We analyze cross-lingual detoxification's effectiveness through 392 extensive settings to evaluate toxicity reduction in cross-distribution settings with limited data and investigate how mitigation impacts model performance on non-toxic tasks, revealing trade-offs between safety and knowledge preservation. Our code and dataset are publicly available at https://github.com/himanshubeniwal/Breaking-mBad.

Code & Models

Repositories

himanshubeniwal/breaking-mbad (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling