TL;DR
This paper shows that current automatic evaluation metrics for NLG are inadequate: they correlate only weakly with human judgments and their performance depends on the system and the dataset, which motivates the need for new evaluation approaches.
Contribution
The paper systematically evaluates a wide range of metrics, revealing their limitations and advocating for the development of more reliable, system- and data-independent NLG evaluation methods.
Findings
Current metrics weakly reflect human judgments.
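Metric performance is system- and data-specific.
At system level, the results suggest that metrics perform reliably and can support system development by finding cases where a system performs poorly.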
Abstract
The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: We investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at system-level and can support system development by finding cases where a system performs poorly.
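As a rough illustration of what "weakly reflect human judgements" means in practice, the sketch below computes a Spearman rank correlation between per-output metric scores and human ratings. This is an assumption about the general kind of analysis involved, not the paper's actual procedure; the function names and all numbers are hypothetical and made up for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next sorted value ties with the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one automatic metric score and one human rating
# per system output. These are NOT results from the paper.
metric_scores = [0.42, 0.35, 0.58, 0.61, 0.30, 0.47]
human_ratings = [4, 5, 3, 5, 2, 3]

rho = spearman(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f}")
```

A value of rho near 1 would mean the metric ranks outputs much as humans do, while a value near 0 would mean the metric says little about human preference for individual outputs. Running the same computation separately per system or per dataset is one way to see whether a metric's agreement with humans is system- or data-specific.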
