Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets

Philippe Laban; Chien-Sheng Wu; Wenhao Liu; Caiming Xiong

arXiv:2205.06871 · cs.CL · November 10, 2022

TL;DR

This paper introduces Near-Negative Distinction (NND), an automatic evaluation method for NLG that leverages existing human annotations to provide a cost-effective, reproducible, and accurate alternative to traditional human evaluation.

Contribution

The paper proposes NND, a novel automatic evaluation approach that repurposes existing human annotations for reliable, low-cost assessment of NLG models, and reports higher correlation with human judgments than standard automatic metrics.

Findings

01

NND correlates better with human judgments than traditional metrics.

02

NND supports fine-grained model analysis and the study of training dynamics.

03

NND effectively reuses existing human annotations for evaluation.

Abstract

Precisely assessing the progress in natural language generation (NLG) tasks is challenging, and human evaluation to establish a preference in a model's output over another is often necessary. However, human evaluation is usually costly, difficult to reproduce, and non-reusable. In this paper, we propose a new and simple automatic evaluation method for NLG called Near-Negative Distinction (NND) that repurposes prior human annotations into NND tests. In an NND test, an NLG model must place a higher likelihood on a high-quality output candidate than on a near-negative candidate with a known error. Model performance is established by the number of NND tests a model passes, as well as the distribution over task-specific errors the model fails on. Through experiments on three NLG tasks (question generation, question answering, and summarization), we show that NND achieves a higher correlation…
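The NND test described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in salesforce/nnd_evaluation: the names `Candidate`, `NNDTest`, `build_nnd_tests`, `run_nnd`, and `toy_score` are hypothetical. The sketch assumes each annotated candidate is either high-quality (`error is None`) or carries a single error label, and that every high-quality candidate is paired with every near-negative for the same input. In practice `score` would return the NLG model's log-likelihood of the output given the input; here a toy length-based scorer stands in so the example runs without a model.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    context: str          # model input, e.g. a document or passage
    output: str           # candidate output text
    error: Optional[str]  # None = high-quality; otherwise the annotated error type


@dataclass(frozen=True)
class NNDTest:
    context: str
    positive: str  # high-quality output
    negative: str  # near-negative output with a known error
    error: str     # error type of the near-negative


def build_nnd_tests(candidates: List[Candidate]) -> List[NNDTest]:
    """Hypothetical helper: pair each high-quality candidate with each
    near-negative candidate that shares the same context."""
    by_context: Dict[str, List[Candidate]] = {}
    for c in candidates:
        by_context.setdefault(c.context, []).append(c)
    tests = []
    for context, group in by_context.items():
        positives = [c for c in group if c.error is None]
        negatives = [c for c in group if c.error is not None]
        for pos, neg in product(positives, negatives):
            tests.append(NNDTest(context, pos.output, neg.output, neg.error))
    return tests


def run_nnd(tests: List[NNDTest],
            score: Callable[[str, str], float]) -> Tuple[float, Counter]:
    """A test passes when the positive scores strictly higher than the
    near-negative. Returns the pass rate and failure counts per error type."""
    failures: Counter = Counter()
    passed = 0
    for t in tests:
        if score(t.context, t.positive) > score(t.context, t.negative):
            passed += 1
        else:
            failures[t.error] += 1
    pass_rate = passed / len(tests) if tests else 0.0
    return pass_rate, failures


def toy_score(context: str, output: str) -> float:
    # Hypothetical stand-in for a model log-likelihood: prefers shorter outputs.
    return -float(len(output))


candidates = [
    Candidate("doc1", "A cat sat.", None),
    Candidate("doc1", "A dog sat on the mat.", "hallucination"),
    Candidate("doc1", "Cat.", "ungrammatical"),
]
rate, fails = run_nnd(build_nnd_tests(candidates), toy_score)
print(rate, dict(fails))  # prints: 0.5 {'ungrammatical': 1}
```

Returning both the pass rate and the per-error failure counts mirrors the two signals the abstract names: the number of NND tests a model passes and the distribution over task-specific errors it fails on. Using a strict inequality counts ties as failures, which is a design choice of this sketch rather than something the source specifies.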

Code & Models

Repositories

salesforce/nnd_evaluation (Official)


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications