Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory
Ziang Xiao, Susu Zhang, Vivian Lai, Q. Vera Liao

TL;DR
This paper introduces MetricEval, a measurement theory-based framework for analyzing and improving the reliability and validity of evaluation metrics in Natural Language Generation, addressing current limitations and uncertainties.
Contribution
The paper presents a novel framework based on measurement theory to evaluate and quantify the reliability and validity of NLG evaluation metrics.
Findings
Identified issues in existing summarization metrics related to validity and reliability.
Demonstrated how MetricEval can quantify uncertainty in evaluation metrics.
Analyzed human and LLM-based metrics to reveal measurement errors.
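As a rough illustration of what quantifying reliability and uncertainty can look like in practice, the sketch below computes two standard measurement-theory statistics: Cronbach's alpha (internal consistency) and the standard error of measurement. This summary does not say which statistics MetricEval actually uses, so the choice of statistics, the function names, and the toy ratings are all assumptions made for illustration, not the authors' implementation.

```python
from statistics import variance, stdev
from math import sqrt

def cronbach_alpha(scores):
    """Cronbach's alpha for a table of ratings.

    scores: list of rows, one per evaluated output; each row holds k ratings
    (e.g. from k annotators or k rubric items). Hypothetical input layout.
    """
    k = len(scores[0])
    # Sample variance of each rating column (one column per rater/item).
    item_vars = [variance(col) for col in zip(*scores)]
    # Sample variance of the per-output total score.
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def standard_error_of_measurement(scores):
    """SEM = SD(total score) * sqrt(1 - reliability), using alpha as reliability."""
    alpha = cronbach_alpha(scores)
    total_sd = stdev([sum(row) for row in scores])
    # Clamp at zero to guard against tiny negative values from rounding.
    return total_sd * sqrt(max(0.0, 1 - alpha))

# Toy data (invented): three summaries, each rated by two annotators.
ratings = [[1, 2], [2, 1], [3, 3]]
alpha = cronbach_alpha(ratings)
sem = standard_error_of_measurement(ratings)
```

A higher alpha means the raters or items agree more closely, and a smaller SEM means an observed score is a tighter estimate of the underlying quality being measured.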
Abstract
We address a fundamental challenge in Natural Language Generation (NLG) model evaluation -- the design and evaluation of evaluation metrics. Recognizing the limitations of existing automatic metrics and noises from how current human evaluation was conducted, we propose MetricEval, a framework informed by measurement theory, the foundation of educational test design, for conceptualizing and evaluating the reliability and validity of NLG evaluation metrics. The framework formalizes the source of measurement error and offers statistical tools for evaluating evaluation metrics based on empirical data. With our framework, one can quantify the uncertainty of the metrics to better interpret the result. To exemplify the use of our framework in practice, we analyzed a set of evaluation metrics for summarization and identified issues related to conflated validity structure in human-eval and…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Test
