Beyond correlation: The Impact of Human Uncertainty in Measuring the Effectiveness of Automatic Evaluation and LLM-as-a-Judge

Aparna Elangovan; Lei Xu; Jongwoo Ko; Mahsa Elyasi; Ling Liu; Sravan Bodapati; Dan Roth

arXiv:2410.03775 · cs.HC · January 28, 2025

TL;DR

This paper critically examines how reliance on correlation metrics can misrepresent the effectiveness of automatic evaluation methods for generative models, especially when human label uncertainty is high, and proposes new metrics and stratification techniques for better assessment.

Contribution

It introduces stratification by human label uncertainty and a new binned Jensen-Shannon Divergence metric to improve evaluation robustness and interpretability.
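As a rough illustration of what a binned Jensen-Shannon Divergence could look like, the sketch below bins items by the median human label, pools the human and machine labels within each bin, and takes a size-weighted average of the per-bin divergences. The binning rule, the weighting, the function names, and the toy data are all assumptions made for illustration; the summary above does not specify the paper's exact definition, so consult the paper or the official repository for the authors' formulation.

```python
import math
from collections import Counter, defaultdict
from statistics import median_low


def distribution(labels, categories):
    """Relative frequency of each category in a list of labels."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def binned_jsd(human_labels, machine_labels, categories):
    """Size-weighted average JSD between pooled human and machine labels per bin.

    Assumption: items are binned by the (low) median of their human labels.
    """
    bins = defaultdict(list)
    for i, labels in enumerate(human_labels):
        bins[median_low(labels)].append(i)

    total = len(human_labels)
    score = 0.0
    for idx in bins.values():
        human_pool = [l for i in idx for l in human_labels[i]]
        machine_pool = [l for i in idx for l in machine_labels[i]]
        p = distribution(human_pool, categories)
        q = distribution(machine_pool, categories)
        score += len(idx) / total * js_divergence(p, q)
    return score


# Hypothetical toy data: three human labels and three machine labels per item.
categories = [1, 2, 3]
human = [[1, 1, 1], [1, 2, 3], [3, 3, 3], [2, 2, 3]]
machine = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [2, 2, 2]]

score = binned_jsd(human, machine, categories)
self_score = binned_jsd(human, human, categories)
```

With base-2 logarithms the divergence lies between 0 and 1, and comparing the human labels with themselves yields 0, which gives a convenient sanity check for any implementation.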

Findings

01

Correlation scores can be misleading when human label uncertainty is high.

02

Under certain conditions, automatic evaluation can appear to correlate with human labels better than humans correlate with one another.

03

Stratifying data by human label uncertainty gives a clearer picture of evaluation performance than a single aggregate score.
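To make the stratification idea concrete, here is a minimal sketch that splits items into those where all human annotators agree ("unanimous") and those where they do not ("mixed"), then reports how often a machine label matches the human majority label in each stratum. The two-stratum split, the use of simple percent agreement, the function names, and the data are hypothetical choices for illustration, not the paper's procedure.

```python
from collections import Counter


def majority(labels):
    """Most frequent label among the human annotations for one item."""
    return Counter(labels).most_common(1)[0][0]


def agreement_by_stratum(human_labels, machine_labels):
    """Percent agreement with the human majority, overall and per stratum."""
    hits = {"all": [], "unanimous": [], "mixed": []}
    for humans, machine in zip(human_labels, machine_labels):
        match = machine == majority(humans)
        # Assumption: an item is "uncertain" if annotators disagree at all.
        stratum = "unanimous" if len(set(humans)) == 1 else "mixed"
        hits["all"].append(match)
        hits[stratum].append(match)
    return {name: sum(v) / len(v) for name, v in hits.items() if v}


# Hypothetical data: three human labels and one machine label per item.
human = [
    ["good", "good", "good"],
    ["bad", "bad", "bad"],
    ["good", "good", "bad"],
    ["bad", "good", "bad"],
    ["good", "bad", "good"],
    ["bad", "bad", "good"],
]
machine = ["bad", "good", "good", "bad", "good", "bad"]

result = agreement_by_stratum(human, machine)
print(result)
```

In this made-up dataset the machine agrees with the human majority on every mixed item but on none of the unanimous ones, so the aggregate figure of roughly two thirds hides a complete failure on exactly the items where humans are certain.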

Abstract

The effectiveness of automatic evaluation of generative models is typically measured by comparing the labels generated via automation with labels by humans using correlation metrics. However, metrics like Krippendorff's $α$ and Randolph's $κ$ were originally designed to measure the reliability of human labeling and thus make assumptions about typical human labeling behavior, and these assumptions may not be applicable to machine-generated labels. In this paper, we show how *relying on a single aggregate correlation score* can obscure fundamental differences between human labels and those from automatic evaluation, including LLM-as-a-Judge. Specifically, we demonstrate that when the proportion of samples with variation or uncertainty in human-assigned labels is relatively high, machine labels (generated by automatic evaluation methods) may superficially appear to have similar or…

Code & Models

Repositories

amazon-science/beyondcorrelation (Official)

Taxonomy

Topics: Evaluation and Performance Assessment