On the Limitations of Reference-Free Evaluations of Generated Text
Daniel Deutsch, Rotem Dror, Dan Roth

TL;DR
This paper critically examines reference-free evaluation metrics for generated text, revealing their inherent biases and limitations, and argues they should be used for diagnostic purposes rather than performance measurement.
Contribution
The paper demonstrates the biases and limitations of reference-free metrics and advocates for their use as diagnostic tools instead of performance measures.
Findings
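Reference-free metrics are equivalent to using one generation model to evaluate another.
These metrics can be optimized directly at test time to find the approximate best-possible output under the metric.
They are inherently biased toward models that are more similar to the metric's own underlying model, and they can be biased against higher-quality outputs.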
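The test-time optimization finding can be illustrated with a minimal sketch. Because a reference-free metric needs only the source and a candidate output, a system can score its own candidates with it and return whichever one scores highest. The names `toy_metric` and `rerank` are hypothetical, and the token-overlap scorer is only a stand-in chosen to keep the example self-contained; it is not a metric studied in the paper.

```python
from typing import Callable, Sequence

# Hypothetical signature: a reference-free metric sees only the source text
# and a candidate output, never a human-written reference.
Metric = Callable[[str, str], float]


def toy_metric(source: str, candidate: str) -> float:
    """Toy stand-in metric (an assumption for illustration): the fraction of
    distinct source tokens that also appear in the candidate."""
    src = set(source.lower().split())
    cand = set(candidate.lower().split())
    return len(src & cand) / len(src) if src else 0.0


def rerank(source: str, candidates: Sequence[str], metric: Metric) -> str:
    """Return the candidate the metric scores highest.

    This is the test-time optimization step: the metric itself is used
    to choose the output that will later be scored by that same metric.
    """
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=lambda c: metric(source, c))


source = "the cat sat on the mat"
candidates = [
    "a dog ran",
    "the cat sat",
    "the cat sat on the mat today",
]
best = rerank(source, candidates, toy_metric)
print(best, toy_metric(source, best))
```

By construction, the selected output receives the highest score among the candidates, whether or not a human would judge it the best one. That is why a high score from such a metric says little about real progress when systems are free to optimize against it.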
Abstract
There is significant interest in developing evaluation metrics which accurately estimate the quality of generated text without the aid of a human-written reference text, which can be time consuming and expensive to collect or entirely unavailable in online applications. However, in this work, we demonstrate that these reference-free metrics are inherently biased and limited in their ability to evaluate generated text, and we argue that they should not be used to measure progress on tasks like machine translation or summarization. We show how reference-free metrics are equivalent to using one generation model to evaluate another, which has several limitations: (1) the metrics can be optimized at test time to find the approximate best-possible output, (2) they are inherently biased toward models which are more similar to their own, and (3) they can be biased against higher-quality…
Taxonomy
Topics: Topic Modeling · Software Engineering Research · Scientific Computing and Data Management
Methods: Test
