Evaluating Style Transfer for Text
Remi Mir, Bjarke Felbo, Nick Obradovich, Iyad Rahwan

TL;DR
This paper addresses the lack of standard evaluation practices in text style transfer. It proposes improved human and automated evaluation metrics, along with best practices identified experimentally on a Yelp sentiment dataset, so that results are more reliable and easier to compare across models.
Contribution
The paper introduces automated evaluation metrics for three aspects of style transfer (style transfer intensity, content preservation, and naturalness), measures how well they correlate with human judgment, and provides guidance for assessing tradeoffs between those aspects.
Findings
The proposed automated metrics are more strongly correlated with, and in closer agreement with, human judgments.
The three examined models exhibit tradeoffs among style transfer intensity, content preservation, and naturalness.
Software tools for evaluation are publicly released.
Abstract
Research in the area of style transfer for text is currently bottlenecked by a lack of standard evaluation practices. This paper aims to alleviate this issue by experimentally identifying best practices with a Yelp sentiment dataset. We specify three aspects of interest (style transfer intensity, content preservation, and naturalness) and show how to obtain more reliable measures of them from human evaluation than in previous work. We propose a set of metrics for automated evaluation and demonstrate that they are more strongly correlated and in agreement with human judgment: direction-corrected Earth Mover's Distance, Word Mover's Distance on style-masked texts, and adversarial classification for the respective aspects. We also show that the three examined models exhibit tradeoffs between aspects of interest, demonstrating the importance of evaluating style transfer models at specific…
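To make two of the metrics named in the abstract more concrete, here is a minimal Python sketch. It is not the authors' released code, and it rests on several assumptions: the style classifier outputs are given as plain probability lists, distinct style classes are treated as one unit apart (so the Earth Mover's Distance reduces to total variation distance), "direction correction" is taken to mean negating the distance when the output carries less target-style probability than the input, and style masking is done with a hypothetical hand-supplied lexicon. The function names and the `<mask>` token are illustrative only.

```python
from typing import Iterable, Sequence


def emd_unit_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """EMD between two categorical distributions, assuming every pair of
    distinct style classes is one unit apart (equals total variation)."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same style classes")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def direction_corrected_emd(
    p_in: Sequence[float], p_out: Sequence[float], target: int
) -> float:
    """Signed style transfer intensity (assumed form of the correction):
    positive when the output gained target-style probability, negative
    when it moved away from the target style."""
    distance = emd_unit_distance(p_in, p_out)
    return distance if p_out[target] >= p_in[target] else -distance


def mask_style_words(
    text: str, style_lexicon: Iterable[str], mask: str = "<mask>"
) -> str:
    """Replace style-bearing words with a placeholder so that a content
    metric (e.g. Word Mover's Distance) compares only the remaining content.
    The lexicon here is hypothetical and supplied by the caller."""
    lexicon = {word.lower() for word in style_lexicon}
    return " ".join(
        mask if token.lower() in lexicon else token for token in text.split()
    )


# Class order assumed to be [negative, positive]; target style is positive.
intensity = direction_corrected_emd([0.75, 0.25], [0.25, 0.75], target=1)
wrong_way = direction_corrected_emd([0.25, 0.75], [0.75, 0.25], target=1)
masked = mask_style_words("The food was terrible", ["terrible", "great"])
```

In this sketch a transfer that moves the classifier's distribution toward the target style scores positively and one that moves away scores negatively, while the masked sentence could then be passed to any Word Mover's Distance implementation to compare content independently of style words.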
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Sentiment Analysis and Opinion Mining
