A Call for Standardization and Validation of Text Style Transfer Evaluation
Phil Ostheimer, Mayank Nagda, Marius Kloft, Sophie Fellenz

TL;DR
This paper highlights the significant lack of standardization and validation in Text Style Transfer evaluation, analyzing existing methods and proposing requirements to improve consistency and reliability in future research.
Contribution
It provides a comprehensive meta-analysis of TST evaluation practices, identifying key gaps in standardization and validation, and offers guidelines for future research to address these issues.
Findings
Substantial standardization gap in evaluation methods
Few automated metrics validated with human experiments
Pitfalls arising from current evaluation inconsistencies
Abstract
Text Style Transfer (TST) evaluation is, in practice, inconsistent. Therefore, we conduct a meta-analysis on human and automated TST evaluation and experimentation that thoroughly examines existing literature in the field. The meta-analysis reveals a substantial standardization gap in human and automated evaluation. In addition, we find a validation gap: only a few automated metrics have been validated using human experiments. We thoroughly scrutinize both the standardization and validation gaps and reveal the resulting pitfalls. This work also paves the way to close both gaps in TST evaluation by calling out requirements to be met by future research.
Taxonomy
Topics: Speech and dialogue systems · Multimedia Communication and Technology · Speech Recognition and Synthesis
