Spurious Correlations in Reference-Free Evaluation of Text Generation

Esin Durmus; Faisal Ladhak; Tatsunori Hashimoto

arXiv:2204.09890·cs.CL·April 22, 2022

TL;DR

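This paper examines the reliability of model-based, reference-free evaluation metrics for text generation. It finds evidence that these metrics may rely on spurious correlations with word overlap, perplexity, and length, and that they have high error rates when ranking state-of-the-art abstractive summarization systems. Explicitly designing metrics to avoid spurious features mitigates these errors.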

Contribution

The study presents evidence that current reference-free metrics for summarization and dialog generation rely on spurious correlations, and demonstrates that metrics explicitly designed to avoid spurious features produce fewer evaluation errors.

Findings

01

Reference-free metrics for summarization and dialog generation may depend on spurious features such as word overlap, perplexity, and length.

02

Current metrics show high error rates when ranking state-of-the-art abstractive summarization systems.

03

Designing evaluation metrics to avoid spurious features improves their reliability.
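
The second finding concerns how often a metric orders systems incorrectly. One simple way to quantify this is a pairwise ranking error: the fraction of system pairs that the metric orders differently from human judgments. The sketch below is illustrative only. The function name `pairwise_ranking_error`, the dictionary inputs, and the treatment of ties are assumptions and are not taken from the paper, which may measure ranking errors differently.

```python
from itertools import combinations


def pairwise_ranking_error(metric_scores: dict, human_scores: dict) -> float:
    """Fraction of system pairs that the metric orders differently from humans.

    Both arguments map a system name to one system-level score (higher is better).
    Hypothetical helper for illustration; not the paper's exact procedure.
    """
    systems = sorted(set(metric_scores) & set(human_scores))
    errors = 0
    total = 0
    for a, b in combinations(systems, 2):
        human_diff = human_scores[a] - human_scores[b]
        metric_diff = metric_scores[a] - metric_scores[b]
        if human_diff == 0:
            # Humans express no preference, so the pair cannot be mis-ranked.
            continue
        total += 1
        # Count an error when the metric disagrees in sign or reports a tie.
        if human_diff * metric_diff <= 0:
            errors += 1
    return errors / total if total else 0.0


# Toy, made-up scores for three hypothetical systems.
metric = {"A": 0.9, "B": 0.8, "C": 0.7}
human = {"A": 1.0, "B": 3.0, "C": 2.0}
print(pairwise_ranking_error(metric, human))
```

In this toy input the metric prefers A over both B and C while the human scores prefer B and C over A, so two of the three pairs are mis-ranked.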

Abstract

Model-based, reference-free evaluation metrics have been proposed as a fast and cost-effective approach to evaluate Natural Language Generation (NLG) systems. Despite promising recent results, we find evidence that reference-free evaluation metrics of summarization and dialog generation may be relying on spurious correlations with measures such as word overlap, perplexity, and length. We further observe that for text summarization, these metrics have high error rates when ranking current state-of-the-art abstractive summarization systems. We demonstrate that these errors can be mitigated by explicitly designing evaluation metrics to avoid spurious features in reference-free evaluation.
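
To make the idea of a spurious correlation concrete, the following minimal sketch checks how strongly a metric's scores track two surface features named in the abstract: word overlap with the source and output length. It assumes whitespace tokenization and Pearson correlation. The helper names (`word_overlap`, `pearson`, `spurious_feature_report`) and the toy data are hypothetical and are not the paper's implementation. Perplexity is omitted because it would require a language model.

```python
from math import sqrt


def word_overlap(source: str, summary: str) -> float:
    """Fraction of summary tokens that also appear in the source."""
    source_vocab = set(source.lower().split())
    tokens = summary.lower().split()
    if not tokens:
        return 0.0
    return sum(tok in source_vocab for tok in tokens) / len(tokens)


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def spurious_feature_report(sources, summaries, metric_scores) -> dict:
    """Correlate metric scores with surface features of the outputs."""
    overlaps = [word_overlap(s, h) for s, h in zip(sources, summaries)]
    lengths = [len(h.split()) for h in summaries]
    return {
        "word_overlap": pearson(overlaps, metric_scores),
        "length": pearson(lengths, metric_scores),
    }


# Toy, made-up inputs; real use would pass a metric's scores on a benchmark.
sources = ["a b c d", "a b c d", "a b c d"]
summaries = ["a b", "a x y", "x y z w"]
scores = [0.9, 0.5, 0.1]
print(spurious_feature_report(sources, summaries, scores))
```

A correlation of large magnitude with either feature does not by itself prove that the metric is flawed, but it signals that the metric's scores could be largely reproduced from that surface feature alone.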

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research