Perturbation CheckLists for Evaluating NLG Evaluation Metrics

Ananya B. Sai; Tanay Dixit; Dev Yashpal Sheth; Sreyas Mohan; Mitesh M. Khapra

arXiv:2109.05771·cs.CL·September 14, 2021

TL;DR

This paper highlights the inadequacy of current NLG evaluation metrics by showing their poor correlation with human judgments across multiple criteria and proposes CheckLists with targeted perturbations for more nuanced assessment.

Contribution

The authors introduce CheckLists, a novel framework with templates for fine-grained evaluation of automatic NLG metrics through specific perturbations.
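
As a rough illustration of the idea, the sketch below perturbs a candidate text and measures how far a metric's score falls. Everything in it is an assumption made for illustration: `unigram_f1` is a toy stand-in metric, and `jumble_words` and `drop_tail` are hypothetical perturbations aimed at fluency and coverage respectively. None of these names come from the paper or its repository.

```python
import random
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Toy stand-in metric (an assumption): F1 over unigram overlap."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def jumble_words(text: str, seed: int = 0) -> str:
    """Hypothetical fluency perturbation: shuffle the word order."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def drop_tail(text: str, keep: float = 0.5) -> str:
    """Hypothetical coverage perturbation: keep only the leading words."""
    words = text.split()
    return " ".join(words[: max(1, int(len(words) * keep))])


def score_drop(metric, perturb, candidate: str, reference: str) -> float:
    """Amount by which the metric falls once the candidate is perturbed."""
    return metric(candidate, reference) - metric(perturb(candidate), reference)


if __name__ == "__main__":
    sentence = "the cat sat on the mat near the door"
    for perturb in (jumble_words, drop_tail):
        drop = score_drop(unigram_f1, perturb, sentence, sentence)
        print(f"{perturb.__name__}: score drop = {drop:.3f}")
```

Because the toy metric only counts words, shuffling leaves its score unchanged even though the text is no longer fluent, while truncation lowers it. A metric that tracks a given criterion should drop under a perturbation that targets that criterion.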

Findings

01 Existing metrics poorly correlate with human scores across criteria.

02 Most metrics are not robust against simple, targeted perturbations.

03 CheckLists reveal limitations of current evaluation metrics.

Abstract

Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment of multiple desirable criteria, e.g., fluency, coherency, coverage, relevance, adequacy, overall quality, etc. Across existing datasets for 6 NLG tasks, we observe that the human evaluation scores on these multiple criteria are often not correlated. For example, there is a very low correlation between human scores on fluency and data coverage for the task of structured data to text generation. This suggests that the current recipe of proposing new automatic evaluation metrics for NLG by showing that they correlate well with scores assigned by humans for a single criterion (overall quality) alone is inadequate. Indeed, our extensive study involving 25 automatic evaluation metrics across 6 different tasks and 18 different evaluation criteria shows that there is no single metric which correlates well…
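
The kind of check described here, whether human scores on different criteria move together, can be sketched as a pairwise correlation over per-output ratings. This is a minimal sketch under stated assumptions: Pearson correlation is an assumed choice of statistic, the function names are hypothetical, and the ratings in `toy_scores` are made-up placeholders rather than data from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length score lists.

    Assumes both lists have non-zero variance; otherwise this divides by zero.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def criteria_correlations(scores):
    """Correlation for every pair of criteria in a {criterion: ratings} dict."""
    return {
        (a, b): pearson(scores[a], scores[b])
        for a, b in combinations(scores, 2)
    }


# Made-up ratings for five hypothetical system outputs (not from the paper).
toy_scores = {
    "fluency": [5, 4, 5, 2, 3],
    "coverage": [2, 5, 3, 4, 3],
}

if __name__ == "__main__":
    for pair, r in criteria_correlations(toy_scores).items():
        print(pair, round(r, 3))
```

A low or negative value for a pair of criteria would indicate that a metric validated against one of them says little about the other.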

Code & Models

Repositories

iitmnlp/evaleval (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research