A Survey of Evaluation Metrics Used for NLG Systems

Ananya B. Sai; Akash Kumar Mohankumar; Mitesh M. Khapra

arXiv:2008.12009 · cs.CL · October 6, 2020

TL;DR

This survey reviews recent developments in automatic evaluation metrics for NLG systems. It covers the challenges of evaluating NLG, a taxonomy of metrics, the key metrics and their shortcomings, and directions for future work.

Contribution

It provides a comprehensive taxonomy and detailed analysis of NLG evaluation metrics developed since 2014, including heuristic and transformer-based approaches.

Findings

1. Current metrics often fail to capture NLG nuances
2. Shift from heuristic to trained transformer models
3. Recommendations for future metric development

Abstract

The success of Deep Learning has created a surge of interest in a wide range of Natural Language Generation (NLG) tasks. Deep Learning has not only pushed the state of the art in several existing NLG tasks but has also enabled researchers to explore newer NLG tasks such as image captioning. Such rapid progress has necessitated the development of accurate automatic evaluation metrics that allow us to track progress in the field. However, unlike classification tasks, automatically evaluating NLG systems is itself a huge challenge. Several works have shown that early heuristic-based metrics such as BLEU and ROUGE are inadequate for capturing the nuances of the different NLG tasks. The expanding number of NLG models and the shortcomings of the current metrics have led to a rapid surge in the number of evaluation metrics proposed since 2014. Moreover,…
