MuTox: Universal MUltilingual Audio-based TOXicity Dataset and Zero-shot Detector
Marta R. Costa-jussà, Mariano Coria Meglioli, Pierre Andrews, David Dale, Prangthip Hansanti, Elahe Kalbassi, Alex Mourachko, Christophe Ropers, Carleigh Wood

TL;DR
MuTox introduces a multilingual audio toxicity dataset and a zero-shot audio-based toxicity classifier that outperforms existing text-based trainable classifiers by more than 1% AUC while expanding language coverage more than tenfold.
Contribution
The paper presents MuTox, the first highly multilingual audio-based dataset with toxicity labels, together with an audio-based classifier trained on it that supports zero-shot toxicity detection across a wide range of languages.
Findings
The MuTox dataset comprises 20,000 audio utterances for English and Spanish and 4,000 for the other 19 languages (21 languages in total).
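The zero-shot classifier outperforms existing text-based trainable classifiers by more than 1% AUC while expanding language coverage more than tenfold.
Compared with a wordlist-based classifier that covers a similar number of languages, MuTox improves precision and recall by approximately 2.5 times.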
Abstract
Research in toxicity detection in natural language processing for the speech modality (audio-based) is quite limited, particularly for languages other than English. To address these limitations and lay the groundwork for truly multilingual audio-based toxicity detection, we introduce MuTox, the first highly multilingual audio-based dataset with toxicity labels. The dataset comprises 20,000 audio utterances for English and Spanish, and 4,000 for the other 19 languages. To demonstrate the quality of this dataset, we trained the MuTox audio-based toxicity classifier, which enables zero-shot toxicity detection across a wide range of languages. This classifier outperforms existing text-based trainable classifiers by more than 1% AUC, while expanding the language coverage more than tenfold. When compared to a wordlist-based classifier that covers a similar number of languages, MuTox improves…
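The comparisons above rest on two kinds of metric: AUC for classifiers that output a score, and precision and recall for classifiers that output a hard decision, such as a wordlist match. The following is a minimal sketch of how these metrics are typically computed, not the paper's evaluation code. The function names and the toy labels, scores, and predictions are hypothetical and chosen only for illustration.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a randomly chosen toxic item
    receives a higher score than a randomly chosen non-toxic item
    (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def precision_recall(labels, preds):
    """Precision and recall for binary predictions (1 = toxic)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: 1 = toxic, 0 = non-toxic.
labels = [1, 0, 1, 0, 0, 1]
# Hypothetical scores from a score-based classifier.
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]
# Hypothetical hard decisions from a wordlist match.
wordlist_preds = [1, 0, 0, 1, 0, 0]

auc = roc_auc(labels, scores)
precision, recall = precision_recall(labels, wordlist_preds)
print(f"AUC: {auc:.3f}")
print(f"precision: {precision:.3f}, recall: {recall:.3f}")
```

The pairwise AUC formulation is quadratic in the number of samples, which is fine for a toy example; a rank-based implementation would be preferable at the scale of a real evaluation set.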
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Infrastructure Maintenance and Monitoring
