How Many Ratings per Item are Necessary for Reliable Significance Testing?
Christopher Homan, Flip Korn, Deepak Pandita, and Chris Welty

TL;DR
This paper investigates the number of responses needed per item to ensure reliable significance testing in AI model evaluation, revealing that many existing datasets lack sufficient responses for trustworthy statistical conclusions.
Contribution
It adapts a reliability assessment method to determine the minimum responses per item required for valid significance testing in AI evaluation datasets.
Findings
5-10 responses per item are often insufficient for reliable testing
Existing gold standard datasets frequently lack enough responses per item
The proposed method guides better data collection for AI evaluation
Abstract
A cornerstone of machine learning evaluation is the (often hidden) assumption that model and human responses are reliable enough to evaluate models against unitary, authoritative, "gold standard" data, via simple metrics such as accuracy, precision, and recall. The generative AI revolution would seem to explode this assumption, given the critical role stochastic inference plays. Yet, in spite of public demand for more transparency in AI (along with strong evidence that humans are unreliable judges), estimates of model reliability are conventionally based on, at most, a few output responses per input item. We adapt a method, previously used to evaluate the reliability of various metrics and estimators for machine learning evaluation, to determine whether an (existing or planned) dataset has enough responses per item to assure reliable null hypothesis statistical testing. We show…
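To make the question in the abstract concrete, the following is a minimal simulation sketch. It is not the authors' method: it assumes binary responses, two hypothetical systems with per-item response probabilities `p_a` and `p_b`, and a paired sign-flip permutation test as the significance test. All function names, parameters, and numeric settings are illustrative assumptions. The sketch estimates how often a test rejects the null hypothesis as the number of responses per item, `k`, grows.

```python
import numpy as np

def paired_permutation_pvalue(diffs, rng, n_perm=300):
    """Two-sided sign-flip permutation test on per-item score differences."""
    observed = abs(diffs.mean())
    # Randomly flip the sign of each item's difference, n_perm times.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    permuted = np.abs((signs * diffs).mean(axis=1))
    # Add-one smoothing keeps the p-value strictly positive.
    return (1 + np.sum(permuted >= observed)) / (n_perm + 1)

def rejection_rate(p_a, p_b, k, rng, n_trials=200, alpha=0.05):
    """Fraction of simulated evaluations that reject H0 with k responses per item.

    p_a, p_b: hypothetical per-item probabilities of a positive response
    for the two systems being compared (assumed, not taken from the paper).
    """
    rejections = 0
    for _ in range(n_trials):
        # Draw k binary responses per item and average them into item scores.
        scores_a = rng.binomial(k, p_a) / k
        scores_b = rng.binomial(k, p_b) / k
        if paired_permutation_pvalue(scores_a - scores_b, rng) < alpha:
            rejections += 1
    return rejections / n_trials

rng = np.random.default_rng(0)
n_items = 100
p_a = rng.uniform(0.2, 0.8, size=n_items)
p_b = np.clip(p_a + 0.05, 0.0, 1.0)  # assumed small true advantage for system B

results = {}
for k in (1, 5, 10, 50):
    power = rejection_rate(p_a, p_b, k, rng)      # true difference exists
    false_pos = rejection_rate(p_a, p_a, k, rng)  # no true difference
    results[k] = (power, false_pos)
    print(f"k={k:>2}  rejection rate with effect={power:.2f}  under null={false_pos:.2f}")
```

Under these assumptions, the rejection rate when a true difference exists should rise with `k`, while the rate under the null should stay near `alpha`. A real analysis would replace the synthetic probabilities with responses resampled from the dataset under study.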
Taxonomy
Topics: Risk and Safety Analysis
