Evaluating Text Style Transfer Evaluation: Are There Any Reliable Metrics?
Sourabrata Mukherjee, Atul Kr. Ojha, John P. McCrae, Ondrej Dusek

TL;DR
This paper assesses the reliability of various automatic metrics, including large language models, for evaluating text style transfer tasks across multiple languages, aiming to find more effective evaluation methods.
Contribution
It systematically evaluates existing and novel metrics, including LLM-based approaches, for TST evaluation and demonstrates their effectiveness through correlation with human judgments.
Findings
Advanced NLP metrics improve evaluation accuracy.
LLM-based evaluations outperform traditional metrics.
Ensemble approaches enhance reliability of TST assessment.
Abstract
Text style transfer (TST) is the task of transforming a text to reflect a particular style while preserving its original content. Evaluating TST outputs is a multidimensional challenge, requiring the assessment of style transfer accuracy, content preservation, and naturalness. Human evaluation is ideal but costly, as is common in other natural language processing (NLP) tasks; however, automatic metrics for TST have not received as much attention as metrics for, e.g., machine translation or summarization. In this paper, we examine a set of both existing and novel metrics from broader NLP tasks for TST evaluation, focusing on two popular subtasks, sentiment transfer and detoxification, in a multilingual context comprising English, Hindi, and Bengali. By conducting meta-evaluation through correlation with human judgments, we demonstrate the effectiveness of these metrics when used…
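The meta-evaluation idea mentioned in the abstract, scoring a metric by how well it correlates with human judgments, can be illustrated with a minimal sketch. The abstract does not say which correlation coefficient the authors use; Spearman rank correlation is assumed here purely for illustration, and the ratings, metric names (`metric_a`, `metric_b`), and scores below are hypothetical, not data from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical human ratings (e.g. content preservation on a 1-5 scale)
# and hypothetical scores from two automatic metrics on the same outputs.
human = [1, 2, 3, 4, 5]
metric_a = [0.10, 0.35, 0.30, 0.70, 0.90]
metric_b = [0.90, 0.10, 0.50, 0.20, 0.40]

for name, scores in [("metric_a", metric_a), ("metric_b", metric_b)]:
    print(f"{name}: Spearman vs. human = {spearman(human, scores):.2f}")
```

Under this setup, a metric whose ranking of system outputs tracks the human ranking more closely gets a higher coefficient, so it would be judged the more reliable proxy for that evaluation dimension.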
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training
