A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods
Daniel Deutsch, Rotem Dror, Dan Roth

TL;DR
This paper introduces resampling-based methods to assess the statistical reliability of summarization evaluation metrics, revealing high uncertainty and identifying some metrics that outperform ROUGE.
Contribution
The paper proposes methods for computing confidence intervals and running hypothesis tests on the correlations between summarization metrics and human judgments, using two resampling techniques: bootstrapping and permutation.
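As a rough illustration of the bootstrapping idea, the sketch below computes a percentile confidence interval for the Pearson correlation between a metric's scores and human scores. It is a minimal example under stated assumptions, not necessarily the authors' exact procedure: the function name `bootstrap_corr_ci`, its parameters, and the choice to resample paired observations (rather than systems or input documents separately) are all hypothetical choices made for this illustration.

```python
import numpy as np


def bootstrap_corr_ci(metric_scores, human_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the Pearson correlation (illustrative sketch).

    Assumption: each position i holds one paired observation (metric score,
    human score), and pairs are resampled with replacement.
    """
    x = np.asarray(metric_scores, dtype=float)
    y = np.asarray(human_scores, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("inputs must be 1-D arrays of equal length")

    rng = np.random.default_rng(seed)
    n = len(x)
    correlations = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # sample indices with replacement
        xs, ys = x[idx], y[idx]
        if xs.std() == 0 or ys.std() == 0:
            continue  # correlation is undefined for a constant sample
        correlations.append(np.corrcoef(xs, ys)[0, 1])

    # The interval is read off the empirical quantiles of the resampled statistic.
    lower, upper = np.percentile(correlations, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lower), float(upper)
```

A wide interval returned by a function like this would correspond to the high uncertainty reported in the findings; a narrow one would indicate a more precisely estimated correlation.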
Findings
Confidence intervals are wide, indicating high uncertainty.
Some metrics like QAEval and BERTScore outperform ROUGE in certain settings.
Many metrics do not show significant improvement over ROUGE.
Abstract
The quality of a summarization evaluation metric is quantified by calculating the correlation between its scores and human annotations across a large number of summaries. Currently, it is unclear how precise these correlation estimates are, or whether a difference between two metrics' correlations reflects a true difference or is due to mere chance. In this work, we address these two problems by proposing methods for calculating confidence intervals and running hypothesis tests for correlations using two resampling methods, bootstrapping and permutation. After evaluating which of the proposed methods is most appropriate for summarization through two simulation experiments, we analyze the results of applying these methods to several different automatic evaluation metrics across three sets of human annotations. We find that the confidence intervals are rather wide, demonstrating high…
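The hypothesis-testing side can be sketched in a similar way. The example below is a paired permutation test of whether metric A correlates with human scores more strongly than metric B. It is a simplified illustration, not the paper's exact test: the function name `permutation_test_corr_diff`, the z-score standardization applied so that swapped scores are on a comparable scale, and the one-sided alternative are assumptions made here.

```python
import numpy as np


def _standardize(values):
    # Assumption: z-scoring makes the two metrics' scores exchangeable in scale.
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / v.std()


def permutation_test_corr_diff(metric_a, metric_b, human, n_permutations=1000, seed=0):
    """One-sided paired permutation test for r(A, human) > r(B, human).

    Returns the observed correlation difference and an estimated p-value.
    """
    a = _standardize(metric_a)
    b = _standardize(metric_b)
    h = np.asarray(human, dtype=float)
    rng = np.random.default_rng(seed)

    def corr_diff(u, v):
        return np.corrcoef(u, h)[0, 1] - np.corrcoef(v, h)[0, 1]

    observed = corr_diff(a, b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the two metrics are interchangeable,
        # so each observation's pair of scores is swapped with probability 0.5.
        swap = rng.random(len(h)) < 0.5
        permuted_a = np.where(swap, b, a)
        permuted_b = np.where(swap, a, b)
        if corr_diff(permuted_a, permuted_b) >= observed:
            at_least_as_extreme += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return float(observed), float(p_value)
```

A small p-value suggests the observed difference in correlations is unlikely to be due to chance alone; a large one is consistent with the finding that many metrics do not show a significant improvement over ROUGE.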
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
