SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators
Daniil Moskovskiy, Nikita Sushko, Sergey Pletenev, Elena Tutubalina, Alexander Panchenko

TL;DR
This paper presents SynthDetoxM, a multilingual parallel detoxification dataset of 16,000 sentence pairs generated with modern open-source LLMs. Models trained on it outperform models trained on the human-annotated MultiParaDetox dataset, even in a data-limited setting.
Contribution
The paper introduces a pipeline for generating multilingual parallel detoxification data with open-source LLMs in a few-shot setting, together with SynthDetoxM, the dataset built with that pipeline for training detoxification models.
Findings
Models trained on SynthDetoxM outperform those trained on the human-annotated MultiParaDetox dataset, even in a data-limited setting.
Models trained on SynthDetoxM outperform all evaluated LLMs prompted in a few-shot setting.
The dataset contains 16,000 high-quality detoxification sentence pairs across four languages: German, French, Spanish and Russian.
Abstract
Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for the generation of multilingual parallel detoxification data. We also introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in a few-shot setting. Our experiments demonstrate that models trained on the produced synthetic datasets outperform those trained on the human-annotated MultiParaDetox dataset, even in a data-limited setting. Models trained on SynthDetoxM outperform all evaluated LLMs in a few-shot setting. We release our…
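The few-shot rewriting step described in the abstract can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the function name `build_few_shot_prompt`, the instruction wording, the "Toxic:" / "Neutral:" labels, and the example pairs are hypothetical and are not taken from the paper. The sketch only shows how a handful of parallel pairs might be placed in front of a new toxic sentence before the prompt is sent to an LLM.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    toxic_text: str,
    language: str,
) -> str:
    """Assemble a few-shot detoxification prompt (hypothetical format).

    `examples` holds (toxic, neutral) demonstration pairs; `toxic_text`
    is the sentence the LLM is asked to rewrite.
    """
    # Instruction wording is an assumption, not the paper's prompt.
    lines = [
        f"Rewrite the {language} text so that it is non-toxic "
        "while preserving its meaning.",
        "",
    ]
    # Each demonstration pair becomes one "Toxic / Neutral" block.
    for toxic, neutral in examples:
        lines.append(f"Toxic: {toxic}")
        lines.append(f"Neutral: {neutral}")
        lines.append("")
    # The final block is left open for the model to complete.
    lines.append(f"Toxic: {toxic_text}")
    lines.append("Neutral:")
    return "\n".join(lines)


# Illustrative demonstration pairs (made up for this sketch).
demo_pairs = [
    ("This idea is damn stupid.", "This idea is not good."),
    ("Shut up, nobody cares.", "Please stop, this is not relevant."),
]
prompt = build_few_shot_prompt(demo_pairs, "What a dumb question.", "English")
print(prompt)
```

In a setup like this, the model's completion after the final "Neutral:" would be taken as the candidate detoxified sentence. Which demonstrations are chosen and how the outputs are filtered are separate design decisions that this sketch does not cover.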
