Towards a Better Metric for Evaluating Question Generation Systems
Preksha Nema, Mitesh M. Khapra

TL;DR
This paper critically examines the effectiveness of n-gram based metrics like BLEU for evaluating question generation systems, proposing an improved scoring function that better aligns with human judgments of answerability.
Contribution
It introduces a new answerability scoring function and demonstrates how integrating it with existing metrics enhances their correlation with human evaluations.
Findings
Current n-gram based metrics correlate poorly with human judgments of answerability.
The proposed scoring function improves metric correlation with human assessments.
Integration of the new score with existing metrics enhances evaluation accuracy.
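The findings above can be illustrated with a minimal sketch, assuming the answerability score is blended with an existing metric by simple linear interpolation. The names `toy_answerability`, `combined_score`, the `key_terms` argument, and the weight `delta` are hypothetical choices made for this illustration; the summary above does not specify the paper's actual scoring function, so this should not be read as the authors' formula.

```python
# Illustrative only: a toy answerability score blended with a base metric.
# All names and the interpolation weight are assumptions, not the paper's definitions.

QUESTION_WORDS = {"who", "what", "when", "where", "why", "which", "how"}


def tokenize(text):
    """Lowercase, drop question marks, and split on whitespace."""
    return text.lower().replace("?", "").split()


def toy_answerability(hypothesis, reference, key_terms):
    """Fraction of the reference's question words and key terms kept in the hypothesis."""
    hyp = set(tokenize(hypothesis))
    ref = set(tokenize(reference))
    must_have = (ref & QUESTION_WORDS) | (ref & set(key_terms))
    if not must_have:
        return 0.0
    return len(must_have & hyp) / len(must_have)


def combined_score(answerability, base_metric, delta=0.5):
    """Linear interpolation between an answerability score and an existing metric."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    return delta * answerability + (1.0 - delta) * base_metric


if __name__ == "__main__":
    reference = "What is the capital of France?"
    hypothesis = "is the capital of"
    ans = toy_answerability(hypothesis, reference, {"capital", "france"})
    # base_metric stands in for any existing score such as BLEU (value made up here).
    print(combined_score(ans, base_metric=0.8, delta=0.5))
```

With `delta = 0` the combined score reduces to the existing metric, and with `delta = 1` it depends only on answerability, so the weight controls how strongly missing question words or entities are penalized.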
Abstract
There has always been criticism for using n-gram based similarity metrics, such as BLEU, NIST, etc., for evaluating the performance of NLG systems. However, these metrics continue to remain popular and are recently being used for evaluating the performance of systems which automatically generate questions from documents, knowledge graphs, images, etc. Given the rising interest in such automatic question generation (AQG) systems, it is important to objectively examine whether these metrics are suitable for this task. In particular, it is important to verify whether such metrics used for evaluating AQG systems focus on answerability of the generated question by preferring questions which contain all relevant information such as question type (Wh-types), entities, relations, etc. In this work, we show that current automatic evaluation metrics based on n-gram similarity do not always…
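The concern raised in the abstract can be made concrete with a small sketch of clipped n-gram precision, the building block of BLEU-style metrics. The function names and the example questions are invented for illustration, and this is a simplified precision without BLEU's brevity penalty or multi-order averaging.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(hypothesis, reference, n=1):
    """Clipped n-gram precision of a hypothesis against a single reference."""
    hyp = ngrams(hypothesis.lower().split(), n)
    if not hyp:
        return 0.0
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    hyp_counts = Counter(hyp)
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / len(hyp)


if __name__ == "__main__":
    reference = "who was the director of titanic"
    wrong_type = "what was the director of titanic"   # question type changed
    no_entity = "who was the director of the film"    # key entity dropped
    print(ngram_precision(wrong_type, reference, n=1))
    print(ngram_precision(no_entity, reference, n=1))
```

Both hypotheses share most of their unigrams with the reference, so they score well even though one asks for the wrong kind of answer and the other no longer names the entity the question is about.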
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
