Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory

Ziang Xiao; Susu Zhang; Vivian Lai; Q. Vera Liao

arXiv:2305.14889 · cs.CL · October 24, 2023 · 1 citation

TL;DR

This paper introduces MetricEval, a framework grounded in measurement theory for analyzing and improving the reliability and validity of Natural Language Generation evaluation metrics and for quantifying the uncertainty in their scores.

Contribution

The paper presents a novel framework based on measurement theory to evaluate and quantify the reliability and validity of NLG evaluation metrics.

Findings

1. Identified issues in existing summarization metrics related to validity and reliability.
2. Demonstrated how MetricEval can quantify uncertainty in evaluation metrics.
3. Analyzed human and LLM-based metrics to reveal measurement errors.

Abstract

We address a fundamental challenge in Natural Language Generation (NLG) model evaluation -- the design and evaluation of evaluation metrics. Recognizing the limitations of existing automatic metrics and noises from how current human evaluation was conducted, we propose MetricEval, a framework informed by measurement theory, the foundation of educational test design, for conceptualizing and evaluating the reliability and validity of NLG evaluation metrics. The framework formalizes the source of measurement error and offers statistical tools for evaluating evaluation metrics based on empirical data. With our framework, one can quantify the uncertainty of the metrics to better interpret the result. To exemplify the use of our framework in practice, we analyzed a set of evaluation metrics for summarization and identified issues related to conflated validity structure in human-eval and…
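As a rough illustration of what "quantifying reliability" can look like in practice, the sketch below computes Cronbach's alpha, a standard internal-consistency coefficient from measurement theory, over a small table of ratings. It is not taken from the paper or the MetricEval repository; the function name `cronbach_alpha` and the toy ratings are assumptions made for this example, and the framework itself may use different or additional statistics.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores (illustrative helper, not MetricEval's API).

    scores: list of rows, one row per evaluated output; each row holds one
    score per rater (or per rubric item). All rows must have equal length.
    """
    n_rows = len(scores)
    k = len(scores[0])
    if n_rows < 2 or k < 2:
        raise ValueError("need at least 2 outputs and 2 raters/items")
    if any(len(row) != k for row in scores):
        raise ValueError("all rows must have the same number of scores")

    # Sample variance of each rater/item column across outputs.
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of the per-output total score.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")

    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings: 4 summaries, each scored by 3 raters on a 1-5 scale.
ratings = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [1, 2, 1],
]
alpha = cronbach_alpha(ratings)
print(f"Cronbach's alpha: {alpha:.3f}")
```

Values near 1 indicate that the raters (or items) rank outputs consistently, while low values suggest that much of the observed score variation is measurement error rather than real differences between outputs.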

Code & Models

Repositories

isle-dev/metriceval · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Test