Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification

Vitaly Protasov; Nikolay Babakov; Daryna Dementieva; Alexander Panchenko

arXiv:2507.15557 · cs.CL · March 5, 2026

Open Access

TL;DR

This paper introduces a comprehensive nine-language benchmark for evaluating text detoxification, comparing automatic metrics and LLM-based judgments to improve assessment reliability in multilingual text style transfer.

Contribution

It presents the first multilingual benchmarking study for text detoxification evaluation across nine languages, comparing neural metrics and LLM judgments, with insights for robust evaluation pipelines.

Findings

01 The proposed metrics correlate better with human judgments than baselines.

02 Multilingual evaluation reveals language-specific challenges.

03 The study provides guidelines for building reliable multilingual TST evaluation pipelines.
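
The claim that one metric "correlates better with human judgments" than another is usually checked with a rank correlation between metric scores and human ratings on the same outputs. The sketch below illustrates that idea with Spearman's rho in plain Python. It is not the paper's evaluation code: the score lists and the names `metric_a` and `metric_b` are made up for illustration, and the choice of Spearman's rho is an assumption rather than something this page states.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: human ratings and two automatic metrics
# scored on the same five detoxified outputs.
human = [1, 2, 3, 4, 5]
metric_a = [0.1, 0.3, 0.2, 0.8, 0.9]
metric_b = [0.5, 0.1, 0.9, 0.2, 0.4]

if __name__ == "__main__":
    print("metric_a vs human:", round(spearman(human, metric_a), 3))
    print("metric_b vs human:", round(spearman(human, metric_b), 3))
```

On this toy data `metric_a` tracks the human ranking far more closely than `metric_b`. In a multilingual setting, the same comparison would be repeated per language, since a metric that ranks well in one language may not do so in another.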

Abstract

Despite notable advances in large language models (LLMs), reliable evaluation of text generation tasks such as text style transfer (TST) remains an open challenge. Existing research has shown that automatic metrics often correlate poorly with human judgments (Dementieva et al., 2024; Pauli et al., 2025), limiting our ability to assess model performance accurately. Furthermore, most prior work has focused primarily on English, while the evaluation of multilingual TST systems, particularly for text detoxification, remains largely underexplored. In this paper, we present the first comprehensive multilingual benchmarking study of evaluation metrics for text detoxification evaluation across nine languages: Arabic, Amharic, Chinese, English, German, Hindi, Russian, Spanish, and Ukrainian. Drawing inspiration from machine translation evaluation, we compare neural-based automatic metrics with…

Taxonomy

Topics: Authorship Attribution and Profiling · Text Readability and Simplification · Topic Modeling