SmurfCat at PAN 2024 TextDetox: Alignment of Multilingual Transformers for Text Detoxification

Elisei Rykov; Konstantin Zaytsev; Ivan Anisimov; Alexandr Voronin

arXiv:2407.05449·cs.CL·July 11, 2024

TL;DR

This paper introduces a multilingual text detoxification method using data augmentation, fine-tuning of sequence-to-sequence models, and alignment techniques, achieving top results in the PAN-2024 competition.

Contribution

It builds an additional multilingual parallel detoxification dataset through machine translation and filtering, fine-tunes multilingual sequence-to-sequence models (mT0 and Aya) on it, and applies ORPO alignment to the final model.

Findings

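01. Achieved state-of-the-art results for Ukrainian detoxification and near state-of-the-art results for the other languages.

02. Secured first place in the automated evaluation at PAN 2024 with a score of 0.52.

03. Obtained second place in the final human evaluation at PAN 2024 with a score of 0.74.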

Abstract

This paper presents the SmurfCat team's solution for the Multilingual Text Detoxification task in the PAN-2024 competition. Using data augmentation through machine translation and a special filtering procedure, we collected an additional multilingual parallel dataset for text detoxification. With this data, we fine-tuned several multilingual sequence-to-sequence models, such as mT0 and Aya, on the text detoxification task, and then applied the ORPO alignment technique to the final model. The final model has only 3.7 billion parameters and achieves state-of-the-art results for Ukrainian and near state-of-the-art results for the other languages. In the competition, our team took first place in the automated evaluation with a score of 0.52 and second place in the final human evaluation with a score of 0.74.
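
The abstract names ORPO as the alignment step but gives no formula. As a rough illustration only, the sketch below computes the standard ORPO objective (a supervised negative log-likelihood term plus a weighted odds-ratio penalty) from length-normalized log-probabilities of a preferred and a dispreferred output. The function names, the weight `lam = 0.1`, and the example numbers are assumptions made for this sketch; the abstract does not say how the team built preference pairs or which hyperparameters they used.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logp), with avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float, lam: float = 0.1) -> float:
    """Sketch of the ORPO objective for a single preference pair.

    avg_logp_chosen / avg_logp_rejected are the mean per-token
    log-probabilities the model assigns to the preferred and the
    dispreferred output. lam is a hypothetical weight, not a value
    taken from the paper.
    """
    # Supervised term: negative log-likelihood of the preferred output.
    nll = -avg_logp_chosen
    # Log odds ratio between the preferred and dispreferred outputs.
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x)).
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))
    return nll + lam * odds_ratio_term


# Illustrative numbers only: the loss is lower when the model already
# favors the preferred output over the dispreferred one.
good = orpo_loss(avg_logp_chosen=-0.5, avg_logp_rejected=-2.0)
bad = orpo_loss(avg_logp_chosen=-2.0, avg_logp_rejected=-0.5)
print(good < bad)
```

In a real training run the two log-probabilities would come from the sequence-to-sequence model's forward pass on each pair, and the loss would be averaged over a batch.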

Code & Models

Repositories

s-nlp/multilingual-transformer-detoxification
PyTorch · Official

Datasets

malexandersalazar/toxicity-multilingual-binary-classification-dataset
dataset · 27 dl

Taxonomy

Topics: Natural Language Processing Techniques

Methods: mT0