Handling Missing Responses under Cluster Dependence with Applications to Language Model Evaluation

Zhenghao Zeng; David Arbour; Avi Feller; Ishita Dasgupta; Atanu R Sinha; Edward H. Kennedy

arXiv:2510.20928·stat.ME·October 27, 2025


TL;DR

This paper analyzes the doubly robust estimator for estimating average human annotation scores in language model evaluation when some annotations are missing and responses from the same user are correlated.

Contribution

It establishes new theoretical properties of the doubly robust estimator under cluster dependence and illustrates them through simulations and a real-world conversation quality dataset.

Findings

01. Incorporating cluster dependence improves inference accuracy.

02. The doubly robust estimator remains reliable under complex dependence structures.

03. Empirical results validate the theoretical advantages of the proposed approach.

Abstract

Human annotations play a crucial role in evaluating the performance of GenAI models. Two common challenges in practice, however, are missing annotations (the response variable of interest) and cluster dependence among human-AI interactions (e.g., questions asked by the same user may be highly correlated). Reliable inference must address both these issues to achieve unbiased estimation and appropriately quantify uncertainty when estimating average scores from human annotations. In this paper, we analyze the doubly robust estimator, a widely used method in missing data analysis and causal inference, applied to this setting and establish novel theoretical properties under cluster dependence. We further illustrate our findings through simulations and a real-world conversation quality dataset. Our theoretical and empirical results underscore the importance of incorporating cluster dependence…
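The abstract does not spell out the estimator, so the following is only a minimal sketch of the textbook doubly robust (augmented inverse probability weighting) estimate of a mean with missing responses, paired with a standard cluster-robust standard error that sums centered contributions within each user. It is not the authors' implementation. The function name `dr_mean_cluster_se` and its arguments are hypothetical, and the nuisance estimates `pi_hat` (probability that a response is observed) and `mu_hat` (predicted response) are assumed to have been fitted elsewhere.

```python
import numpy as np


def dr_mean_cluster_se(y, r, pi_hat, mu_hat, cluster):
    """Doubly robust (AIPW) mean estimate with a cluster-robust standard error.

    y       : responses (any placeholder, e.g. NaN, where missing)
    r       : 1 if the response is observed, 0 if missing
    pi_hat  : assumed pre-fitted probabilities of observing the response
    mu_hat  : assumed pre-fitted outcome-model predictions
    cluster : cluster label per observation (e.g. a user id)
    """
    y = np.asarray(y, dtype=float)
    r = np.asarray(r, dtype=float)
    pi_hat = np.asarray(pi_hat, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)

    # Replace unobserved responses with 0 so NaN placeholders do not propagate.
    y_filled = np.where(r == 1, y, 0.0)

    # Pseudo-outcome: model prediction plus an inverse-probability-weighted
    # residual that is nonzero only where the response was observed.
    psi = mu_hat + r / pi_hat * (y_filled - mu_hat)
    estimate = psi.mean()

    # Sum centered pseudo-outcomes within each cluster, so that correlation
    # among observations from the same cluster enters the variance.
    _, cluster_idx = np.unique(np.asarray(cluster), return_inverse=True)
    cluster_sums = np.bincount(cluster_idx, weights=psi - estimate)

    n = y.shape[0]
    se = np.sqrt(np.sum(cluster_sums ** 2)) / n
    return estimate, se


# Toy illustration with made-up numbers: two users, two questions each,
# one missing annotation.
est, se = dr_mean_cluster_se(
    y=[1.0, 0.0, np.nan, 1.0],
    r=[1, 1, 0, 1],
    pi_hat=[0.75, 0.75, 0.75, 0.75],
    mu_hat=[0.5, 0.5, 0.5, 0.5],
    cluster=["u1", "u1", "u2", "u2"],
)
```

Passing a unique label per observation as `cluster` recovers the usual independent-observations standard error, which makes it easy to compare the two on the same data. The sketch omits sample splitting and any finite-sample correction, both of which a real analysis would likely need.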
