Mind the Style Gap: Meta-Evaluation of Style and Attribute Transfer Metrics
Amalie Brogaard Pauli, Isabelle Augenstein, Ira Assent

TL;DR
This paper critically evaluates existing metrics for text style transfer, shows why they fall short when measuring content preservation, and introduces a new test dataset together with a style-aware method that aligns better with human judgments.
Contribution
It presents a large meta-evaluation of style transfer metrics, introduces a challenging new dataset, and proposes a style-aware evaluation method using small language models.
Findings
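Widely used metrics correlate highly with human judgments on existing datasets even though they are unsuitable for evaluating content preservation, because they do not abstract from style changes.
Meta-evaluation on existing datasets can therefore lead to misleading conclusions about which metrics are suitable.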
The new dataset reveals that style-aware metrics better match human judgments.
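To make the meta-evaluation idea concrete, the following minimal sketch scores a metric by its rank correlation with human ratings. Everything in it is an assumption made for illustration: the Jaccard token overlap is a stand-in for a generic lexical metric, the function names are hypothetical, and the three example pairs and their ratings are invented toy data, not taken from the paper or its dataset.

```python
from typing import List, Sequence


def jaccard_overlap(source: str, rewrite: str) -> float:
    """Toy lexical metric (an assumption, not the paper's): shared / all tokens."""
    a, b = set(source.lower().split()), set(rewrite.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented toy data: (source, rewrite, human content-preservation rating).
toy_pairs = [
    ("send me the report", "send me the report now", 5.0),
    ("send me the report", "could you kindly share the report with me", 4.0),
    ("send me the report", "the weather is nice today", 1.0),
]
metric_scores = [jaccard_overlap(s, r) for s, r, _ in toy_pairs]
human_scores = [h for _, _, h in toy_pairs]
correlation = spearman(metric_scores, human_scores)
```

On real data, `correlation` would be computed per metric and per dataset. The second toy pair hints at the concern raised above: a polite rewrite that keeps the content shares few tokens with the source, so an overlap-based metric scores it low even when a human would rate content preservation highly.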
Abstract
Large language models (LLMs) make it easy to rewrite a text in any style -- e.g. to make it more polite, persuasive, or more positive -- but evaluation thereof is not straightforward. A challenge lies in measuring content preservation: that content not attributable to style change is retained. This paper presents a large meta-evaluation of metrics for evaluating style and attribute transfer, focusing on content preservation. We find that meta-evaluation studies on existing datasets lead to misleading conclusions about the suitability of metrics for content preservation. Widely used metrics show a high correlation with human judgments despite being deemed unsuitable for the task -- because they do not abstract from style changes when evaluating content preservation. We show that the overly high correlations with human judgment stem from the nature of the test data. To address this issue,…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Digital Humanities and Scholarship · Topic Modeling
Methods: Sparse Evolutionary Training · Focus
