A Step Towards Mixture of Grader: Statistical Analysis of Existing Automatic Evaluation Metrics

Yun Joon Soh; Jishen Zhao

arXiv:2410.10030·cs.CL·October 15, 2024

TL;DR

This paper statistically analyzes existing automatic QA evaluation metrics, shows that no single metric adequately approximates human-like judgment, and discusses a Mixture Of Grader approach as a potential remedy.

Contribution

It provides a statistical analysis of current automatic QA evaluation metrics, based on their correlation with human-like evaluation scores, and discusses a Mixture Of Grader as a potential way to improve automatic evaluation quality.

Findings

01 Existing metrics correlate highly with one another, depending on the question type (e.g., single word, single phrase)

02 No single metric adequately estimates human-like evaluation

03 A Mixture Of Grader could improve automatic evaluation quality
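The first two findings rest on correlation coefficients computed between metric scores and a human-like evaluation score, broken down by question type. A minimal sketch of that kind of analysis follows. It assumes Pearson correlation (this page does not say which coefficient the paper uses), and the record layout, the field names `question_type` and `human`, and the function names are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when one series is constant.
        return float("nan")
    return cov / sqrt(var_x * var_y)


def correlation_by_type(records, metric_names):
    """Correlate each metric with the human-like score, per question type.

    `records` is a list of dicts holding a 'question_type' label, a 'human'
    score, and one score per metric name (a hypothetical layout).
    """
    groups = {}
    for record in records:
        groups.setdefault(record["question_type"], []).append(record)

    result = {}
    for qtype, rows in groups.items():
        human = [row["human"] for row in rows]
        result[qtype] = {
            name: pearson([row[name] for row in rows], human)
            for name in metric_names
        }
    return result
```

Calling `correlation_by_type(records, ["metric_a", "metric_b"])` on scored QA pairs returns one coefficient per metric and question type, which is enough to see whether a metric that tracks the human-like score on one question type also does so on another.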

Abstract

The explosion of open-source models and Question-Answering (QA) datasets emphasizes the importance of automated QA evaluation. We studied the statistics of existing evaluation metrics to better understand their limitations. By measuring the correlation coefficient of each evaluation metric with a human-like evaluation score, we observed the following: (1) existing metrics correlate highly with one another, depending on the question type (e.g., single word, single phrase, etc.), and (2) no single metric can adequately estimate the human-like evaluation. As a potential solution, we discuss how a Mixture Of Grader could improve the quality of automatic QA evaluators.
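The abstract only discusses the Mixture Of Grader idea and this page gives no design details, so the following is an illustrative sketch rather than the authors' method. It assumes the mixture routes each QA pair to a grader chosen by question type. The routing rule (word count of the reference answer), the choice of exact match and token-level F1 as graders, and every name in the code are assumptions made for this example.

```python
from collections import Counter


def exact_match(prediction, reference):
    """Return 1.0 if the answers match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction, reference):
    """Token-overlap F1 between whitespace-tokenized, lowercased answers."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def question_type(reference):
    """Hypothetical router: classify by the length of the reference answer."""
    return "single_word" if len(reference.split()) == 1 else "phrase"


# Hypothetical assignment of one grader per question type.
GRADERS = {
    "single_word": exact_match,
    "phrase": token_f1,
}


def mixture_grade(prediction, reference):
    """Score a prediction with the grader selected for its question type."""
    grader = GRADERS[question_type(reference)]
    return grader(prediction, reference)
```

For instance, `mixture_grade("Paris", "paris")` is scored by exact match, while `mixture_grade("the red car", "a red car")` is scored by token F1. In practice the grader assigned to each type would be picked from per-type correlation with human-like scores rather than fixed by hand.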

Taxonomy

Topics: Advanced Statistical Methods and Models