On the Limitations of Reference-Free Evaluations of Generated Text

Daniel Deutsch; Rotem Dror; Dan Roth

arXiv:2210.12563·cs.CL·October 25, 2022·1 cites

Open Access

TL;DR

This paper critically examines reference-free evaluation metrics for generated text, revealing their inherent biases and limitations, and argues they should be used for diagnostic purposes rather than performance measurement.

Contribution

The paper shows that a reference-free metric amounts to using one generation model to evaluate another, demonstrates the biases and limitations that follow from this, and advocates using such metrics as diagnostic tools instead of performance measures.

Findings

01. Reference-free metrics are equivalent to using one generation model to evaluate another.

02. These metrics can be optimized at test time to find the approximate best-possible output.

03. They are inherently biased toward models that are more similar to their own, and they can be biased against higher-quality outputs.

Abstract

There is significant interest in developing evaluation metrics which accurately estimate the quality of generated text without the aid of a human-written reference text, which can be time consuming and expensive to collect or entirely unavailable in online applications. However, in this work, we demonstrate that these reference-free metrics are inherently biased and limited in their ability to evaluate generated text, and we argue that they should not be used to measure progress on tasks like machine translation or summarization. We show how reference-free metrics are equivalent to using one generation model to evaluate another, which has several limitations: (1) the metrics can be optimized at test time to find the approximate best-possible output, (2) they are inherently biased toward models which are more similar to their own, and (3) they can be biased against higher-quality…
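Limitation (1) can be pictured with a small reranking sketch: because a reference-free metric needs only the source and a candidate, any system can score its own candidates at test time and return the metric's favorite. The code below is an illustrative assumption, not the paper's procedure. `toy_metric` is a hypothetical token-overlap score standing in for a real learned metric, and `rerank` is a hypothetical helper name.

```python
from typing import Callable, Sequence


def toy_metric(source: str, candidate: str) -> float:
    """Hypothetical reference-free metric: Jaccard overlap of lowercased tokens.

    This is a stand-in for illustration only; real reference-free metrics
    are learned models, not token overlap.
    """
    src = set(source.lower().split())
    cand = set(candidate.lower().split())
    if not src or not cand:
        return 0.0
    return len(src & cand) / len(src | cand)


def rerank(
    source: str,
    candidates: Sequence[str],
    metric: Callable[[str, str], float],
) -> str:
    """Return the candidate that the metric scores highest for this source."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=lambda c: metric(source, c))


source = "the cat sat on the mat"
candidates = [
    "a cat was sitting on a mat",  # a plausible rewrite
    "the cat sat on the mat",      # a verbatim copy of the source
    "dogs bark loudly",            # unrelated text
]

# The metric is available at test time, so it can be used to pick the output.
best = rerank(source, candidates, toy_metric)
print(best, toy_metric(source, best))
```

With this toy metric the verbatim copy of the source wins, which hints at why a score that can be optimized directly at test time is a weak measure of task performance. How a real metric behaves under such optimization depends on the metric itself.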

Taxonomy

Topics: Topic Modeling · Software Engineering Research · Scientific Computing and Data Management

Methods: Test