Evaluation Discrepancy Discovery: A Sentence Compression Case-study
Yevgeniy Puzikov

TL;DR
This paper investigates the limitations of current evaluation methods in NLP, using sentence compression as a case study. It shows that high metric scores may not reflect human-perceived quality and that a system can exploit a dataset to appear better than it is.
Contribution
It highlights the discrepancy between metric scores and human judgments in NLP evaluation and demonstrates how systems can artificially improve metrics without genuine quality gains.
Findings
High metric scores do not always correlate with human judgments.
Systems can exploit datasets to achieve misleadingly high performance.
Evaluation protocols need to be more robust and aligned with human perception.
Abstract
Reliable evaluation protocols are of utmost importance for reproducible NLP research. In this work, we show that sometimes neither metric nor conventional human evaluation is sufficient to draw conclusions about system performance. Using sentence compression as an example task, we demonstrate how a system can game a well-established dataset to achieve state-of-the-art results. In contrast with the results reported in previous work that showed correlation between human judgements and metric scores, our manual analysis of state-of-the-art system outputs demonstrates that high metric scores may only indicate a better fit to the data, but not better outputs, as perceived by humans.
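One way to probe the claimed gap between metric scores and human judgements is to measure how well the two rank the same set of system outputs. The following is a minimal sketch, not the paper's own protocol: it computes Spearman rank correlation in plain Python. The function names and the choice of Spearman correlation are assumptions made for illustration, and any scores passed in would be hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of items tied with the item at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores, human_ratings):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_ratings):
        raise ValueError("both sequences must have the same length")
    return pearson(average_ranks(metric_scores), average_ranks(human_ratings))
```

Calling `spearman(metric_scores, human_ratings)` on per-output metric scores and the corresponding human ratings yields a value in [-1, 1]. A value near zero or below would be consistent with the paper's observation that a better metric score need not mean a better output as perceived by humans.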
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
