Evaluation Discrepancy Discovery: A Sentence Compression Case-study

Yevgeniy Puzikov

arXiv:2101.09079 · cs.CL · January 25, 2021

TL;DR

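Using sentence compression as a case study, this paper shows that neither automatic metrics nor conventional human evaluation is always sufficient to judge system performance: a system can game a well-established dataset to reach state-of-the-art scores, and high metric scores may reflect a better fit to the data rather than outputs that humans perceive as better.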

Contribution

The paper highlights the discrepancy between metric scores and human judgements in sentence compression evaluation, and demonstrates how a system can raise its metric scores without producing outputs that humans judge to be better.

Findings

01. High metric scores do not always correlate with human judgments.

02. Systems can exploit datasets to achieve misleadingly high performance.

03. Evaluation protocols need to be more robust and aligned with human perception.

Abstract

Reliable evaluation protocols are of utmost importance for reproducible NLP research. In this work, we show that sometimes neither metric nor conventional human evaluation is sufficient to draw conclusions about system performance. Using sentence compression as an example task, we demonstrate how a system can game a well-established dataset to achieve state-of-the-art results. In contrast with the results reported in previous work that showed correlation between human judgements and metric scores, our manual analysis of state-of-the-art system outputs demonstrates that high metric scores may only indicate a better fit to the data, but not better outputs, as perceived by humans.
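To make the abstract's central claim concrete, the sketch below shows one way to check whether a metric tracks human judgements. It is not taken from the paper or its repository. It assumes a deletion-based compression setup in which each output is a 0/1 keep mask over the source tokens, uses token-level F1 against a gold mask as the metric, and compares the scores with human ratings using Pearson correlation. The function names, the toy masks, and the ratings are all hypothetical and made up for illustration.

```python
from math import sqrt


def token_f1(pred_keep, gold_keep):
    """Token-level F1 over keep decisions (equal-length lists of 0/1)."""
    true_pos = sum(1 for p, g in zip(pred_keep, gold_keep) if p == 1 and g == 1)
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(pred_keep)
    recall = true_pos / sum(gold_keep)
    return 2 * precision * recall / (precision + recall)


def pearson(xs, ys):
    """Pearson correlation coefficient; returns NaN if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: one gold keep mask and three system outputs.
gold = [1, 0, 1, 1, 0, 1]
system_outputs = [
    [1, 0, 1, 1, 0, 1],  # reproduces the gold mask exactly
    [1, 0, 1, 0, 0, 1],  # drops one more token than the gold
    [1, 1, 1, 1, 1, 1],  # keeps every token
]
# Hypothetical human ratings for the same three outputs (made up).
human_ratings = [3.0, 4.5, 2.0]

f1_scores = [token_f1(output, gold) for output in system_outputs]
r = pearson(f1_scores, human_ratings)

print("F1 scores:", [round(score, 3) for score in f1_scores])
print("Pearson r between F1 and human ratings:", round(r, 3))
```

In this toy setup the output that matches the gold mask exactly gets the highest F1 even though the made-up ratings prefer a different output, which is the kind of gap between data fit and perceived quality that the abstract describes. A real analysis would need many more outputs and would likely use a rank correlation as well.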

Code & Models

Repositories

UKPLab/arxiv2021-evaluation-discrepancy-nsc
tf · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications