How Many Human Judgments Are Enough? Feasibility Limits of Human Preference Evaluation
Wilson Y. Lee

TL;DR
This paper investigates the feasibility limits of human preference evaluation for model comparison. It finds that most comparisons have small preference margins and therefore require far more judgments than are typically collected, and that reducing prompt-induced variability improves detectability.
Contribution
The paper shows that proportional allocation is minimax-optimal in diffuse preference regimes, where all prompt types are similarly informative, and highlights the importance of controlling prompt variability for reliable human evaluation.
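As a rough illustration of what proportional allocation means in practice, the sketch below splits a fixed judgment budget across prompt types in proportion to their weights. It is not taken from the paper: the function name `proportional_allocation`, the prompt-type labels, and the largest-remainder rounding rule are all assumptions made for this example.

```python
from math import floor


def proportional_allocation(budget, weights):
    """Split `budget` judgments across prompt types in proportion to `weights`.

    `weights` maps a (hypothetical) prompt-type label to a positive weight,
    for example the number of prompts of that type in the evaluation pool.
    """
    total = sum(weights.values())
    # Ideal, possibly fractional, share of the budget for each prompt type.
    exact = {k: budget * w / total for k, w in weights.items()}
    # Round down first, then hand leftover judgments to the types with the
    # largest fractional parts (largest-remainder rounding).
    alloc = {k: floor(x) for k, x in exact.items()}
    leftover = budget - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


if __name__ == "__main__":
    # Hypothetical prompt-type counts, for illustration only.
    print(proportional_allocation(1000, {"chat": 5, "code": 3, "image": 2}))
```

The rounding step only matters when the budget does not divide evenly; the allocation always sums to the full budget.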
Findings
Most comparisons have small preference margins and require far more judgments than are typically collected.
Reducing prompt variability increases detectability of small effects.
Inconclusive results often stem from underpowered evaluations rather than true model similarity.
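To give a sense of scale for the findings above, the sketch below estimates how many pairwise judgments a standard two-sided z-test would need to detect a given preference margin. This is a textbook normal-approximation power calculation, not the paper's analysis. It assumes independent judgments with no ties, and it defines the margin as the win rate minus 0.5, which may differ from the paper's definition; the function name `judgments_needed` is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def judgments_needed(margin, alpha=0.05, power=0.80):
    """Approximate judgments needed to detect a win rate of 0.5 + margin.

    Tests the null hypothesis of a 0.5 win rate with a two-sided z-test.
    Assumes independent judgments and no ties (an assumption of this sketch).
    """
    if not 0 < margin < 0.5:
        raise ValueError("margin must be strictly between 0 and 0.5")
    p = 0.5 + margin
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value under the null
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    # Standard one-sample proportion formula: the null standard deviation is 0.5,
    # the alternative standard deviation is sqrt(p * (1 - p)).
    n = ((z_alpha * 0.5 + z_power * sqrt(p * (1 - p))) / margin) ** 2
    return ceil(n)


if __name__ == "__main__":
    for m in (0.05, 0.02, 0.01):
        print(f"margin={m:.2f}: about {judgments_needed(m)} judgments")
```

Under this approximation the required count scales roughly with the inverse square of the margin, so halving the margin about quadruples the number of judgments needed.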
Abstract
Human preference evaluations are widely used to compare generative models, yet it remains unclear how many judgments are required to reliably detect small improvements. We show that when preference signal is diffuse across prompts (i.e., all prompt types are similarly informative), proportional allocation is minimax-optimal: no allocation strategy substantially improves detectability. Empirical analysis of large-scale human preference datasets shows that most comparisons fall into this diffuse regime, exhibiting small preference margins that require far more judgments than typically collected, even in well-sampled comparisons. These limits persist across evaluation protocols and modalities, including chat, image generation, and code generation with execution feedback. In contrast, curated benchmarks that reduce prompt-induced variability systematically induce larger margins and improve…
Taxonomy
Topics: Innovative Human-Technology Interaction · Mobile Crowdsensing and Crowdsourcing · Ethics and Social Impacts of AI
