Beyond correlation: The Impact of Human Uncertainty in Measuring the Effectiveness of Automatic Evaluation and LLM-as-a-Judge
Aparna Elangovan, Lei Xu, Jongwoo Ko, Mahsa Elyasi, Ling Liu, Sravan Bodapati, Dan Roth

TL;DR
This paper critically examines how reliance on correlation metrics can misrepresent the effectiveness of automatic evaluation methods for generative models, especially when human label uncertainty is high, and proposes new metrics and stratification techniques for better assessment.
Contribution
It introduces stratification by human label uncertainty and a new binned Jensen-Shannon Divergence metric to improve evaluation robustness and interpretability.
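As a rough illustration of what a binned Jensen-Shannon Divergence could look like, the sketch below groups items by the human majority label, compares the pooled human and machine label distributions inside each bin, and averages the per-bin divergences weighted by bin size. The binning rule, the weighting, and the names `distribution`, `js_divergence`, and `binned_jsd` are assumptions made for this example; the summary above does not spell out the paper's exact definition.

```python
import math
from collections import Counter

def distribution(labels, categories):
    """Relative frequency of each category in a flat list of labels."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so the value lies in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def binned_jsd(human, machine, categories):
    """Hypothetical binned JSD.

    human, machine: one list of labels per item (several annotators or
    several machine samples per item). Assumption: items are binned by the
    human majority label, and bins are weighted by their share of items.
    """
    bins = {}
    for h, m in zip(human, machine):
        key = Counter(h).most_common(1)[0][0]  # human majority label
        h_pool, m_pool, count = bins.get(key, ([], [], 0))
        bins[key] = (h_pool + list(h), m_pool + list(m), count + 1)
    n = len(human)
    return sum(
        (count / n) * js_divergence(distribution(h_pool, categories),
                                    distribution(m_pool, categories))
        for h_pool, m_pool, count in bins.values()
    )
```

A value of 0 means the machine label distribution matches the human one in every bin; larger values indicate that the machine labels are distributed differently from the human labels for items humans tend to rate the same way.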
Findings
Correlation scores can be misleading when human label uncertainty is high.
Machine-to-human correlation may appear similar to or better than human-to-human correlation when a large share of samples have uncertain human labels.
Stratifying data by uncertainty reveals true evaluation performance.
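The stratification idea in the findings above can be sketched as follows. This minimal example splits items into those where all human annotators agree and those where they do not, then reports how often the machine label matches the human majority label in each stratum. The two-way split, the use of simple agreement in place of a correlation coefficient, and the function name `agreement_by_stratum` are assumptions chosen to keep the example short; they are not taken from the paper.

```python
from collections import Counter

def agreement_by_stratum(human, machine):
    """Machine-vs-human-majority agreement, split by human uncertainty.

    human: one list of annotator labels per item.
    machine: one machine label per item.
    Assumption: an item is "unanimous" when every annotator gave the same
    label and "split" otherwise.
    """
    strata = {"unanimous": [], "split": []}
    for h, m in zip(human, machine):
        majority = Counter(h).most_common(1)[0][0]
        key = "unanimous" if len(set(h)) == 1 else "split"
        strata[key].append(m == majority)
    # None marks an empty stratum rather than a misleading 0.0.
    return {k: (sum(v) / len(v) if v else None) for k, v in strata.items()}

human = [["a", "a", "a"], ["a", "a", "b"], ["b", "b", "b"], ["a", "b", "b"]]
machine = ["a", "b", "a", "b"]
scores = agreement_by_stratum(human, machine)
```

Reporting the two strata side by side, rather than a single aggregate, makes it visible when apparent agreement is driven mostly by items on which the humans themselves disagree.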
Abstract
The effectiveness of automatic evaluation of generative models is typically measured by comparing the labels generated via automation with labels by humans using correlation metrics. However, metrics like Krippendorff's α and Randolph's κ were originally designed to measure the reliability of human labeling and thus make assumptions about typical human labeling behavior; these assumptions may not be applicable to machine generated labels. In this paper, we show how *relying on a single aggregate correlation score* can obscure fundamental differences between human labels and those from automatic evaluation, including LLM-as-a-Judge. Specifically, we demonstrate that when the proportion of samples with variation or uncertainty in human assigned labels is relatively high, machine labels (generated by automatic evaluation methods) may superficially appear to have similar or…
Taxonomy
Topics: Evaluation and Performance Assessment
