On the Effectiveness of Automated Metrics for Text Generation Systems
Pius von Däniken, Jan Deriu, Don Tuggener, Mark Cieliebak

TL;DR
This paper proposes a first step towards a theory for evaluating text generation systems that accounts for sources of uncertainty such as imperfect automated metrics and insufficiently sized test sets, and demonstrates its application on real evaluation data.
Contribution
It introduces a novel theory incorporating uncertainties in automated metrics and test set sizes, guiding more reliable evaluation of text generation systems.
Findings
The theory helps determine the sample size needed for reliable system comparison.
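Applying the theory to the WMT 21 and Spot-The-Bot evaluation data illustrates how it can improve the reliability, robustness, and significance of evaluation outcomes.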
Guidelines for enhancing evaluation protocols are outlined.
Abstract
A major challenge in the field of Text Generation is evaluation because we lack a sound theory that can be leveraged to extract guidelines for evaluation campaigns. In this work, we propose a first step towards such a theory that incorporates different sources of uncertainty, such as imperfect automated metrics and insufficiently sized test sets. The theory has practical applications, such as determining the number of samples needed to reliably distinguish the performance of a set of Text Generation systems in a given setting. We showcase the application of the theory on the WMT 21 and Spot-The-Bot evaluation data and outline how it can be leveraged to improve the evaluation protocol regarding the reliability, robustness, and significance of the evaluation outcome.
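To make the sample-size question concrete, here is a minimal sketch in Python. It is not the paper's theory: it uses the textbook normal-approximation sample size for comparing two proportions, and it assumes a hypothetical binary metric whose agreement with human judgment is summarized by a true positive rate (tpr) and a false positive rate (fpr). The function names and all numbers are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def observed_rate(p, tpr, fpr):
    """Rate reported by a noisy binary metric when the true success rate is p.

    Assumption: the metric flags a truly good output with probability tpr
    and a truly bad output with probability fpr.
    """
    return tpr * p + fpr * (1.0 - p)


def samples_per_system(p1, p2, alpha=0.05, power=0.8):
    """Textbook sample size per system for a two-sided two-proportion z-test.

    This is a generic approximation, not the formula derived in the paper.
    """
    if p1 == p2:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Two hypothetical systems with true success rates 0.60 and 0.55.
n_perfect = samples_per_system(0.60, 0.55)

# The same systems scored by an imperfect metric (tpr=0.8, fpr=0.2):
# the observed gap shrinks by a factor of (tpr - fpr), so more samples are needed.
n_noisy = samples_per_system(
    observed_rate(0.60, 0.8, 0.2),
    observed_rate(0.55, 0.8, 0.2),
)

print(n_perfect, n_noisy)
```

Under these assumptions, the noisy metric compresses the gap between the two systems, so telling them apart with the same confidence takes a noticeably larger test set. This is the kind of trade-off between metric quality and test set size that the abstract describes.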
Taxonomy
MethodsTest
