TL;DR
This paper reviews the human evaluation practices described in 97 style transfer papers and finds that protocols are often underspecified and not standardized, which hampers reproducibility and progress in the field.
Contribution
It provides a comprehensive summary of current human evaluation practices in style transfer research and discusses challenges in standardization and reproducibility.
Findings
Human evaluation protocols are often underspecified.
Lack of standardization hampers reproducibility.
Underspecified, non-standardized protocols also slow progress toward better human and automatic evaluation methods.
Abstract
This paper reviews and summarizes human evaluation practices described in 97 style transfer papers with respect to three main evaluation aspects: style transfer, meaning preservation, and fluency. In principle, evaluations by human raters should be the most reliable. However, in style transfer papers, we find that protocols for human evaluations are often underspecified and not standardized, which hampers the reproducibility of research in this field and progress toward better human and automatic evaluation methods.
