Why We Need New Evaluation Metrics for NLG

Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser

arXiv:1707.06875·cs.CL·September 18, 2017

TL;DR

This paper shows that current automatic evaluation metrics for NLG correlate only weakly with human judgments and that their performance depends on the system and the data, which motivates the need for new evaluation approaches.

Contribution

The paper systematically evaluates a wide range of metrics, revealing their limitations and advocating for the development of more reliable, system- and data-independent NLG evaluation methods.

Findings

01

Current metrics weakly reflect human judgments.

02

Metrics are system- and data-specific.

03

Metrics perform reliably at the system level and can support system development by flagging cases where a system performs poorly.

Abstract

The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: We investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at system-level and can support system development by finding cases where a system performs poorly.
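
To make "weakly reflect human judgements" concrete, the sketch below computes a rank correlation between per-output metric scores and human ratings. It is only an illustration: the choice of Spearman's rho is an assumption (the abstract does not name a correlation measure), the helper names are hypothetical, and the numbers are made up rather than taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one automatic metric score and one human rating
# per system output. These values are invented for illustration.
metric_scores = [0.42, 0.35, 0.58, 0.61, 0.30]
human_ratings = [4, 5, 3, 5, 2]

rho = spearman(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 would mean the metric orders outputs much as human raters do; values near 0 indicate that the metric says little about human preferences for individual outputs. The same function could be applied to per-system averages to look at system-level agreement.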

Code & Models

Repositories

jeknov/EMNLP_17_submission
Official
