Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification
Vitaly Protasov, Nikolay Babakov, Daryna Dementieva, Alexander Panchenko

TL;DR
This paper introduces a comprehensive nine-language benchmark for evaluating text detoxification, comparing automatic metrics and LLM-based judgments to improve assessment reliability in multilingual text style transfer.
Contribution
The paper presents the first multilingual benchmarking study of evaluation metrics for text detoxification, covering nine languages. It compares neural metrics with LLM judgments and draws out insights for building robust evaluation pipelines.
Findings
Proposed metrics correlate better with human judgments than baselines.
Multilingual evaluation reveals language-specific challenges.
The paper offers guidelines for building reliable multilingual TST evaluation pipelines.
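The first finding depends on measuring how well an automatic metric tracks human judgments. As a rough illustration only, the sketch below computes Spearman rank correlation between human ratings and two metrics. The scores, the names `human`, `metric_a`, and `metric_b`, and the choice of Spearman correlation are all assumptions made for this example; they are not the paper's data or necessarily its exact protocol.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


# Hypothetical scores for five detoxified outputs (illustrative only).
human = [1, 2, 3, 4, 5]                  # human quality ratings
metric_a = [0.1, 0.3, 0.2, 0.8, 0.9]     # a metric that mostly agrees
metric_b = [0.5, 0.4, 0.6, 0.3, 0.7]     # a metric that agrees less

for name, scores in [("metric_a", metric_a), ("metric_b", metric_b)]:
    print(name, round(spearman(human, scores), 3))
```

In a multilingual setting, a comparison like this would be run separately for each language, since a metric that tracks human ratings well in one language may not do so in another.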
Abstract
Despite notable advances in large language models (LLMs), reliable evaluation of text generation tasks such as text style transfer (TST) remains an open challenge. Existing research has shown that automatic metrics often correlate poorly with human judgments (Dementieva et al., 2024; Pauli et al., 2025), limiting our ability to assess model performance accurately. Furthermore, most prior work has focused primarily on English, while the evaluation of multilingual TST systems, particularly for text detoxification, remains largely underexplored. In this paper, we present the first comprehensive multilingual benchmarking study of evaluation metrics for text detoxification across nine languages: Arabic, Amharic, Chinese, English, German, Hindi, Russian, Spanish, and Ukrainian. Drawing inspiration from machine translation evaluation, we compare neural-based automatic metrics with…
Taxonomy
Topics: Authorship Attribution and Profiling · Text Readability and Simplification · Topic Modeling
