QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation
Weiping Fu, Bifan Wei, Jianxiang Hu, Zhongmin Cai, Jun Liu

TL;DR
QGEval introduces a comprehensive multi-dimensional benchmark for evaluating question generation quality, addressing the lack of unified human evaluation criteria and revealing gaps in current automatic metrics.
Contribution
The paper proposes an evaluation framework with seven dimensions for assessing question quality and analyzes the correlations among them, with the aim of making QG model assessments more reliable.
Findings
Most QG models perform poorly on answerability and answer consistency.
Existing automatic metrics do not align well with human judgments across the seven dimensions.
QGEval provides a standardized way to evaluate and improve QG models and metrics.
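As a rough illustration of what "aligning with human judgments" means in practice, the sketch below correlates one automatic metric's scores with human ratings on each dimension. It is a minimal sketch under stated assumptions, not the paper's code: the choice of Pearson correlation, the data layout, the function names (`pearson`, `metric_alignment`), and the toy numbers are all hypothetical. Only the dimension names come from the abstract.

```python
from math import sqrt

# The seven QGEval dimensions, as listed in the abstract.
DIMENSIONS = [
    "fluency", "clarity", "conciseness", "relevance",
    "consistency", "answerability", "answer_consistency",
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when one side has no variance.
        return float("nan")
    return cov / sqrt(var_x * var_y)


def metric_alignment(human_ratings, metric_scores):
    """Correlate one metric's scores with human ratings on each dimension.

    human_ratings: dict mapping dimension name -> list of human scores,
                   one per generated question (assumed layout).
    metric_scores: list of automatic metric scores, in the same question order.
    """
    return {
        dim: pearson(human_ratings[dim], metric_scores)
        for dim in DIMENSIONS
        if dim in human_ratings
    }


# Toy data, invented purely for illustration (not taken from the paper).
human = {
    "fluency": [3, 3, 2, 3],
    "answerability": [1, 3, 2, 1],
}
metric = [0.2, 0.4, 0.6, 0.8]

alignment = metric_alignment(human, metric)
for dim, r in alignment.items():
    print(f"{dim}: r = {r:.3f}")
```

A metric that tracked human judgment well would show a correlation close to 1 on the dimension it is meant to capture. Values near zero on every dimension would correspond to the kind of misalignment the findings describe.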
Abstract
Automatically generated questions often suffer from problems such as unclear expression or factual inaccuracies, requiring a reliable and comprehensive evaluation of their quality. Human evaluation is widely used in the field of question generation (QG) and serves as the gold standard for automatic metrics. However, there is a lack of unified human evaluation criteria, which hampers consistent and reliable evaluations of both QG models and automatic metrics. To address this, we propose QGEval, a multi-dimensional Evaluation benchmark for Question Generation, which evaluates both generated questions and existing automatic metrics across 7 dimensions: fluency, clarity, conciseness, relevance, consistency, answerability, and answer consistency. We demonstrate the appropriateness of these dimensions by examining their correlations and distinctions. Through consistent evaluations of QG…
Taxonomy
Topics: Educational Technology and Assessment
Methods: ALIGN
