TL;DR
This paper questions whether correlation with human scores is the best way to judge summary quality measures, and explores an alternative criterion for selecting a measure that appears to hold across very different summary styles.
Contribution
It explores a criterion for selecting the best summary quality measure from a family of measures without relying on correlation with human scores, challenging the traditional assessment approach.
Findings
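Observations for the BLANC family of measures suggest that the criterion is universal across very different summary styles.
Correlation with human scores may not be the best indicator of how good a summary quality measure is.
The paper explores an alternative criterion for selecting the best measure from a family, rather than proposing a new measure.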
Abstract
Normally, summary quality measures are compared with quality scores produced by human annotators. A higher correlation with human scores is considered to be a fair indicator of a better measure. We discuss observations that cast doubt on this view. We attempt to show a possibility of an alternative indicator. Given a family of measures, we explore a criterion of selecting the best measure not relying on correlations with human scores. Our observations for the BLANC family of measures suggest that the criterion is universal across very different styles of summaries.
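For context, the conventional procedure that the abstract questions can be sketched as follows: score the same summaries with each candidate measure, correlate those scores with human scores, and prefer the measure with the higher correlation. This is a minimal illustration only, not the paper's proposed criterion. The choice of Spearman rank correlation, the names `measure_a` and `measure_b`, and all numbers are assumptions made up for the example.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


def best_by_human_correlation(measures, human_scores):
    """Pick the measure whose scores correlate best with human scores."""
    return max(measures, key=lambda name: spearman(measures[name], human_scores))


# Toy, made-up scores for five summaries (illustrative only).
human_scores = [1, 2, 3, 4, 5]
measures = {
    "measure_a": [0.1, 0.2, 0.3, 0.4, 0.5],  # hypothetical measure
    "measure_b": [0.5, 0.1, 0.4, 0.2, 0.3],  # hypothetical measure
}

for name, scores in measures.items():
    print(name, round(spearman(scores, human_scores), 3))
print("selected:", best_by_human_correlation(measures, human_scores))
```

Under this convention the human scores act as the ground truth. The paper's point is that this may not be a reliable basis for choosing a measure, which is why it explores a selection criterion that does not use human scores at all.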
Taxonomy
Methods: BLANC
