A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods

Daniel Deutsch; Rotem Dror; Dan Roth

arXiv:2104.00054·cs.CL·July 28, 2021·5 cites

TL;DR

This paper introduces resampling-based methods to assess the statistical reliability of summarization evaluation metrics, revealing high uncertainty and identifying some metrics that outperform ROUGE.

Contribution

It proposes confidence interval and hypothesis testing techniques for correlation estimates in summarization metrics using bootstrapping and permutation methods.
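As a rough illustration of the confidence-interval idea, the sketch below computes a percentile bootstrap interval for the Pearson correlation between a metric's scores and human scores. It is a simplified example, not the paper's exact procedure: it resamples individual summaries only, the data are synthetic, and the names `pearson`, `bootstrap_ci`, `metric_scores`, and `human_scores` are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists (nan if a list is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def bootstrap_ci(metric, human, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the correlation, resampling summaries with replacement."""
    rng = random.Random(seed)
    n = len(metric)
    stats = []
    for _ in range(n_resamples):
        # Draw n summary indices with replacement and recompute the correlation.
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson([metric[i] for i in idx], [human[i] for i in idx])
        if not math.isnan(r):
            stats.append(r)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Synthetic data (assumption): human scores plus noise stand in for a metric.
data_rng = random.Random(42)
human_scores = [data_rng.gauss(0, 1) for _ in range(50)]
metric_scores = [h + data_rng.gauss(0, 1) for h in human_scores]

point = pearson(metric_scores, human_scores)
lo, hi = bootstrap_ci(metric_scores, human_scores)
print(f"r = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

With only 50 summaries the interval is typically wide, which is the kind of uncertainty the findings below refer to.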

Findings

01. Confidence intervals for the metric-human correlations are wide, indicating high uncertainty.

02. Some metrics, such as QAEval and BERTScore, outperform ROUGE in certain settings.

03. Many metrics do not show a statistically significant improvement over ROUGE.

Abstract

The quality of a summarization evaluation metric is quantified by calculating the correlation between its scores and human annotations across a large number of summaries. Currently, it is unclear how precise these correlation estimates are, or whether a difference between two metrics' correlations reflects a true difference or is due to mere chance. In this work, we address these two problems by proposing methods for calculating confidence intervals and running hypothesis tests for correlations using two resampling methods, bootstrapping and permutation. After evaluating which of the proposed methods is most appropriate for summarization through two simulation experiments, we analyze the results of applying these methods to several different automatic evaluation metrics across three sets of human annotations. We find that the confidence intervals are rather wide, demonstrating high…
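The hypothesis-testing side of the abstract can be sketched as a paired permutation test on the difference between two metrics' correlations with human scores. This is a minimal sketch under stated assumptions, not the authors' implementation: it swaps the two metrics' (standardized) scores per summary with probability 0.5, uses a one-sided alternative, and runs on synthetic data. The names `permutation_test`, `standardize`, `metric_a`, and `metric_b` are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists (nan if a list is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def standardize(x):
    """Rescale to zero mean and unit variance so two metrics' scores are exchangeable."""
    n = len(x)
    m = sum(x) / n
    sd = math.sqrt(sum((a - m) ** 2 for a in x) / n)
    return [(a - m) / sd for a in x]


def permutation_test(metric_a, metric_b, human, n_permutations=1000, seed=0):
    """One-sided test of H1: corr(A, human) > corr(B, human)."""
    rng = random.Random(seed)
    a, b = standardize(metric_a), standardize(metric_b)
    observed = pearson(a, human) - pearson(b, human)
    count = 0
    for _ in range(n_permutations):
        perm_a, perm_b = [], []
        for x, y in zip(a, b):
            # Under H0 the two metrics are interchangeable, so swap at random.
            if rng.random() < 0.5:
                x, y = y, x
            perm_a.append(x)
            perm_b.append(y)
        delta = pearson(perm_a, human) - pearson(perm_b, human)
        if delta >= observed:
            count += 1
    # Add-one smoothing keeps the p-value strictly positive.
    return observed, (count + 1) / (n_permutations + 1)


# Synthetic data (assumption): metric A tracks human scores more closely than B.
data_rng = random.Random(7)
human = [data_rng.gauss(0, 1) for _ in range(60)]
metric_a = [h + data_rng.gauss(0, 0.5) for h in human]
metric_b = [h + data_rng.gauss(0, 2.0) for h in human]

delta, p_value = permutation_test(metric_a, metric_b, human)
print(f"delta = {delta:.3f}, p = {p_value:.4f}")
```

A small p-value suggests the gap between the two correlations is unlikely under the null hypothesis that the metrics are interchangeable; a large one means the observed gap could plausibly be chance.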

Code & Models

Repositories

CogComp/stat-analysis-experiments (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques