A Step Towards Mixture of Grader: Statistical Analysis of Existing Automatic Evaluation Metrics
Yun Joon Soh, Jishen Zhao

TL;DR
This paper statistically analyzes existing automatic QA evaluation metrics, shows their limitations, and discusses a Mixture Of Grader approach as a potential way to better approximate human judgment.
Contribution
It provides a statistical analysis of current metrics and introduces the concept of a Mixture Of Grader to enhance evaluation accuracy.
Findings
Existing metrics correlate highly with one another within a question type (e.g., single word, single phrase)
No single metric reliably estimates human judgment
Mixture Of Grader could improve evaluation quality
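The first two findings rest on correlating each metric with a human-like score, broken down by question type. The following is a minimal sketch of that kind of analysis, not the authors' code. It assumes Pearson correlation (the summary does not say which coefficient was used), and the record layout, the metric names `exact_match` and `f1`, and all toy scores are invented for illustration.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant scores
    return cov / sqrt(var_x * var_y)


def correlation_by_question_type(records, metric_names, human_key="human"):
    """Correlate each metric with the human-like score, per question type."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["question_type"]].append(record)
    result = {}
    for qtype, rows in grouped.items():
        human = [row[human_key] for row in rows]
        result[qtype] = {
            name: pearson([row[name] for row in rows], human)
            for name in metric_names
        }
    return result


# Toy data (hypothetical, for illustration only).
records = [
    {"question_type": "single_word", "exact_match": 1.0, "f1": 1.0, "human": 1.0},
    {"question_type": "single_word", "exact_match": 0.0, "f1": 0.0, "human": 0.0},
    {"question_type": "single_word", "exact_match": 1.0, "f1": 1.0, "human": 1.0},
    {"question_type": "single_phrase", "exact_match": 0.0, "f1": 0.8, "human": 1.0},
    {"question_type": "single_phrase", "exact_match": 0.0, "f1": 0.2, "human": 0.0},
    {"question_type": "single_phrase", "exact_match": 1.0, "f1": 1.0, "human": 1.0},
]

table = correlation_by_question_type(records, ["exact_match", "f1"])
for qtype, scores in table.items():
    print(qtype, scores)
```

In this toy data both metrics track the human-like score perfectly on single-word answers, while on single-phrase answers exact match falls behind token-level F1. That is the kind of per-type difference such a table is meant to expose.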
Abstract
The explosion of open-source models and Question-Answering (QA) datasets emphasizes the importance of automated QA evaluation. To better understand their limitations, we studied the statistics of existing evaluation metrics. By measuring the correlation coefficient between each evaluation metric and a human-like evaluation score, we observed the following: (1) existing metrics are highly correlated with one another with respect to question type (e.g., single word, single phrase, etc.), and (2) no single metric can adequately estimate the human-like evaluation. As a potential solution, we discuss how a Mixture Of Grader could improve the quality of automatic QA evaluators.
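One way to picture a Mixture Of Grader is as a router that picks a grader according to question type. The sketch below is only an illustration of that idea under stated assumptions, since the abstract does not specify a design. The length-based `question_type` rule, the `GRADERS` mapping, and the choice of exact match and token-level F1 as graders are all hypothetical. In practice the per-type grader would presumably be chosen from how well each metric correlates with human-like scores.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the answers match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction, reference):
    """Token-level F1 over whitespace-separated, lowercased tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def question_type(reference):
    """Hypothetical rule: classify by the length of the reference answer."""
    return "single_word" if len(reference.split()) == 1 else "multi_word"


# Hypothetical assignment of one grader per question type.
GRADERS = {
    "single_word": exact_match,
    "multi_word": token_f1,
}


def mixture_of_grader(prediction, reference):
    """Route the answer pair to the grader chosen for its question type."""
    grader = GRADERS[question_type(reference)]
    return grader(prediction, reference)


print(mixture_of_grader("Paris", "paris"))
print(mixture_of_grader("the Eiffel Tower", "Eiffel Tower"))
```

A hard routing table is the simplest possible mixture. A weighted combination of several graders per question type would be a natural variant, but nothing in the abstract indicates which form the authors have in mind.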
Taxonomy
Topics: Advanced Statistical Methods and Models
