Spurious Correlations in Reference-Free Evaluation of Text Generation
Esin Durmus, Faisal Ladhak, Tatsunori Hashimoto

TL;DR
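This paper examines the reliability of model-based, reference-free evaluation metrics for text generation. It finds evidence that these metrics may rely on spurious correlations with measures such as word overlap, perplexity, and length, and that they have high error rates when ranking state-of-the-art abstractive summarization systems. Explicitly designing metrics to avoid spurious features mitigates these errors.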
Contribution
The study presents evidence that current reference-free metrics rely on spurious correlations and demonstrates that the resulting errors can be mitigated by explicitly designing metrics to avoid spurious features.
Findings
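Reference-free metrics for summarization and dialog generation may rely on spurious correlations with measures such as word overlap, perplexity, and length.
For text summarization, these metrics show high error rates when ranking current state-of-the-art abstractive summarization systems.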
Designing evaluation metrics to avoid spurious features improves their reliability.
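One way to probe the first finding is to check how strongly a metric's scores track simple surface features of the outputs. The sketch below is illustrative only and is not the authors' procedure: the feature definitions (`word_overlap`, `token_length`), the whitespace tokenization, and the choice of Spearman rank correlation are all assumptions made for this example.

```python
from statistics import mean


def word_overlap(source: str, output: str) -> float:
    """Fraction of output tokens that also appear in the source (assumed definition)."""
    src = set(source.lower().split())
    out = output.lower().split()
    if not out:
        return 0.0
    return sum(tok in src for tok in out) / len(out)


def token_length(output: str) -> int:
    """Output length in whitespace-separated tokens (assumed definition)."""
    return len(output.split())


def ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def spurious_report(sources, outputs, metric_scores):
    """Correlate metric scores with each surface feature."""
    overlaps = [word_overlap(s, o) for s, o in zip(sources, outputs)]
    lengths = [token_length(o) for o in outputs]
    return {
        "word_overlap": spearman(metric_scores, overlaps),
        "length": spearman(metric_scores, lengths),
    }
```

A high absolute correlation with a surface feature does not by itself prove that the metric is wrong, but it flags a feature worth controlling for before trusting the metric's rankings.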
Abstract
Model-based, reference-free evaluation metrics have been proposed as a fast and cost-effective approach to evaluate Natural Language Generation (NLG) systems. Despite promising recent results, we find evidence that reference-free evaluation metrics of summarization and dialog generation may be relying on spurious correlations with measures such as word overlap, perplexity, and length. We further observe that for text summarization, these metrics have high error rates when ranking current state-of-the-art abstractive summarization systems. We demonstrate that these errors can be mitigated by explicitly designing evaluation metrics to avoid spurious features in reference-free evaluation.
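The ranking errors mentioned in the abstract can be illustrated with a pairwise error rate: the fraction of system pairs that a metric orders differently from human judgments. This is a minimal sketch under assumptions; the summary here does not specify the paper's exact protocol, and the function name `pairwise_ranking_error`, the system names, and the scores in the usage example are hypothetical toy values, not results from the paper.

```python
from itertools import combinations


def pairwise_ranking_error(human_scores, metric_scores):
    """Fraction of system pairs ordered differently by the metric than by humans.

    Both arguments map a system name to its average score. Pairs that humans
    rate as tied are skipped; a metric tie on a non-tied pair counts as an error.
    """
    disagreements = 0
    total = 0
    for a, b in combinations(sorted(human_scores), 2):
        human_diff = human_scores[a] - human_scores[b]
        metric_diff = metric_scores[a] - metric_scores[b]
        if human_diff == 0:
            continue  # no human preference to compare against
        total += 1
        if human_diff * metric_diff <= 0:
            disagreements += 1
    return disagreements / total if total else 0.0


# Toy usage with made-up scores for three hypothetical systems.
human = {"A": 0.8, "B": 0.6, "C": 0.4}
metric = {"A": 0.5, "B": 0.7, "C": 0.3}
error_rate = pairwise_ranking_error(human, metric)
```

In this toy example the metric swaps systems A and B but orders the other two pairs correctly, so one of the three pairs is misranked.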
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research
