Evaluating Text Style Transfer Evaluation: Are There Any Reliable Metrics?

Sourabrata Mukherjee; Atul Kr. Ojha; John P. McCrae; Ondrej Dusek

arXiv:2502.04718·cs.CL·April 24, 2025

TL;DR

This paper assesses the reliability of various automatic metrics, including large language models, for evaluating text style transfer tasks across multiple languages, aiming to find more effective evaluation methods.

Contribution

It systematically evaluates existing and novel metrics, including LLM-based approaches, for TST evaluation and demonstrates their effectiveness through correlation with human judgments.
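
Meta-evaluation by correlation with human judgments can be illustrated with a small sketch. The listing above does not say which correlation coefficient the paper uses, so the choice of Spearman rank correlation here is an assumption, and the ratings and scores are invented toy values, not results from the paper. The function names (`average_ranks`, `pearson`, `spearman`) are hypothetical helpers written for this example.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (invented for illustration): one human rating and one automatic
# metric score per system output.
human_ratings = [1, 2, 2, 4, 5]
metric_scores = [0.10, 0.35, 0.30, 0.70, 0.90]

rho = spearman(metric_scores, human_ratings)
print(f"Spearman correlation with human ratings: {rho:.3f}")
```

Under this setup, a metric whose scores rank the outputs in nearly the same order as the human ratings gets a correlation close to 1, and comparing this value across candidate metrics is one way to decide which of them tracks human judgment best.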

Findings

01. Advanced NLP metrics improve evaluation accuracy.

02. LLM-based evaluations outperform traditional metrics.

03. Ensemble approaches enhance reliability of TST assessment.

Abstract

Text style transfer (TST) is the task of transforming a text to reflect a particular style while preserving its original content. Evaluating TST outputs is a multidimensional challenge, requiring the assessment of style transfer accuracy, content preservation, and naturalness. Human evaluation is ideal but costly, as in other natural language processing (NLP) tasks; however, automatic metrics for TST have not received as much attention as metrics for, e.g., machine translation or summarization. In this paper, we examine a set of both existing and novel metrics from broader NLP tasks for TST evaluation, focusing on two popular subtasks, sentiment transfer and detoxification, in a multilingual context comprising English, Hindi, and Bengali. By conducting meta-evaluation through correlation with human judgments, we demonstrate the effectiveness of these metrics when used…

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training