Is human scoring the best criteria for summary evaluation?

Oleg Vasilyev; John Bohannon

arXiv:2012.14602 · cs.CL · January 1, 2021

TL;DR

This paper questions whether human scoring is the best way to evaluate summaries and explores an alternative criterion that might be more universally applicable across different summary styles.

Contribution

It explores a criterion for selecting the best measure from a family of summary quality measures without relying on correlation with human scores, challenging the conventional way such measures are assessed.

Findings

01

Observations for the BLANC family of measures suggest the criterion is universal across very different summary styles

02

Correlation with human scores may not be the best evaluation indicator

03

Explores an alternative indicator for choosing among summary quality measures

Abstract

Normally, summary quality measures are compared with quality scores produced by human annotators. A higher correlation with human scores is considered to be a fair indicator of a better measure. We discuss observations that cast doubt on this view. We attempt to show a possibility of an alternative indicator. Given a family of measures, we explore a criterion of selecting the best measure not relying on correlations with human scores. Our observations for the BLANC family of measures suggest that the criterion is universal across very different styles of summaries.
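For readers unfamiliar with the conventional procedure the abstract questions, the following is a minimal sketch of it: each candidate measure is scored against human annotations by rank correlation, and the measure with the higher correlation is preferred. This is not the authors' code and not the BLANC API. The helper names (`ranks`, `pearson`, `spearman`) and all numbers are hypothetical and chosen only for illustration, and Spearman correlation is assumed here as one common choice of correlation.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data: five summaries, one human score each,
# plus scores from two made-up automatic measures.
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
measure_a = [0.10, 0.20, 0.35, 0.50, 0.90]  # orders summaries as humans do
measure_b = [0.30, 0.10, 0.40, 0.20, 0.50]  # partially disagrees with humans

corr_a = spearman(human_scores, measure_a)
corr_b = spearman(human_scores, measure_b)

# Under the conventional view, the measure with the higher correlation wins.
preferred = "measure_a" if corr_a > corr_b else "measure_b"
print(preferred, round(corr_a, 3), round(corr_b, 3))
```

The paper's point is that this selection rule may not be a fair indicator, so the sketch shows only the baseline being questioned, not the alternative criterion the authors explore.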

Code & Models

Repositories

PrimerAI/blanc
PyTorch · Official

Taxonomy

Methods: BLANC