How Many Ratings per Item are Necessary for Reliable Significance Testing?

Christopher Homan, Flip Korn, Deepak Pandita, and Chris Welty

arXiv:2412.02968·cs.LG·January 30, 2026

TL;DR

This paper investigates the number of responses needed per item to ensure reliable significance testing in AI model evaluation, revealing that many existing datasets lack sufficient responses for trustworthy statistical conclusions.

Contribution

It adapts a reliability assessment method to determine the minimum responses per item required for valid significance testing in AI evaluation datasets.

Findings

01. 5-10 responses per item are often insufficient for reliable testing

02. Existing gold standard datasets frequently lack enough responses per item

03. The proposed method guides better data collection for AI evaluation

Abstract

A cornerstone of machine learning evaluation is the (often hidden) assumption that model and human responses are reliable enough to evaluate models against unitary, authoritative, "gold standard" data, via simple metrics such as accuracy, precision, and recall. The generative AI revolution would seem to explode this assumption, given the critical role stochastic inference plays. Yet, in spite of public demand for more transparency in AI, along with strong evidence that humans are unreliable judges, estimates of model reliability are conventionally based on, at most, a few output responses per input item. We adapt a method, previously used to evaluate the reliability of various metrics and estimators for machine learning evaluation, to determine whether an (existing or planned) dataset has enough responses per item to assure reliable null hypothesis statistical testing. We show…
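
The question the abstract raises, whether a given number of responses per item supports reliable null hypothesis testing, can be illustrated with a small simulation. The sketch below is not the authors' method and does not use the google-research/vet code. It is a generic power simulation under assumptions made up for illustration: binary ratings, a hypothetical per-item success probability drawn uniformly from [0.3, 0.7], a fixed true gap `effect` between two models, and a paired z-test (normal approximation) on per-item mean ratings. The function names `simulate_p_value` and `rejection_rate` are likewise hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev


def simulate_p_value(rng, n_items, k_responses, effect):
    """Simulate one experiment and return a two-sided p-value.

    Each item gets k_responses binary ratings per model; the test is a
    paired z-test (normal approximation) on the per-item mean differences.
    """
    diffs = []
    for _ in range(n_items):
        # Assumed item-level probability of a positive rating for model A.
        p_a = rng.uniform(0.3, 0.7)
        # Model B is better than model A by `effect` on every item.
        p_b = min(1.0, p_a + effect)
        mean_a = sum(rng.random() < p_a for _ in range(k_responses)) / k_responses
        mean_b = sum(rng.random() < p_b for _ in range(k_responses)) / k_responses
        diffs.append(mean_b - mean_a)
    sd = stdev(diffs)
    if sd == 0.0:
        # Degenerate case: all differences identical.
        return 1.0 if mean(diffs) == 0.0 else 0.0
    z = mean(diffs) / (sd / n_items ** 0.5)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def rejection_rate(n_items, k_responses, effect, trials=200, alpha=0.05, seed=0):
    """Fraction of simulated experiments that reject the null at level alpha.

    With effect > 0 this estimates power; with effect == 0 it estimates
    the false positive rate.
    """
    rng = random.Random(seed)
    rejections = sum(
        simulate_p_value(rng, n_items, k_responses, effect) < alpha
        for _ in range(trials)
    )
    return rejections / trials


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(k, rejection_rate(n_items=100, k_responses=k, effect=0.05))
```

Running the loop at the bottom prints the estimated rejection rate for several values of responses per item at a fixed number of items. Under these assumptions, the rate at which a true difference is detected rises as more responses are collected per item, because averaging more ratings reduces the noise in each per-item mean.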

Code & Models

Repositories

google-research/vet (Official)

Videos

How Many Ratings per Item are Necessary for Reliable Significance Testing? · underline

Taxonomy

Topics: Risk and Safety Analysis