A Survey of Evaluation Metrics Used for NLG Systems
Ananya B. Sai, Akash Kumar Mohankumar, Mitesh M. Khapra

TL;DR
This survey reviews recent developments in automatic evaluation metrics for NLG systems, covering the challenges of evaluation, a taxonomy of metrics, the key metrics and their shortcomings, and future directions for improving evaluation accuracy.
Contribution
It provides a comprehensive taxonomy and detailed analysis of NLG evaluation metrics developed since 2014, including heuristic and transformer-based approaches.
Findings
Current metrics often fail to capture NLG nuances
Shift from heuristic to trained transformer models
Recommendations for future metric development
Abstract
The success of Deep Learning has created a surge of interest in a wide range of Natural Language Generation (NLG) tasks. Deep Learning has not only pushed the state of the art in several existing NLG tasks but has also enabled researchers to explore various newer NLG tasks such as image captioning. Such rapid progress in NLG has necessitated the development of accurate automatic evaluation metrics that would allow us to track the progress in the field of NLG. However, unlike classification tasks, automatically evaluating NLG systems is itself a huge challenge. Several works have shown that early heuristic-based metrics such as BLEU and ROUGE are inadequate for capturing the nuances in the different NLG tasks. The expanding number of NLG models and the shortcomings of the current metrics have led to a rapid surge in the number of evaluation metrics proposed since 2014. Moreover,…
