Forest vs Tree: The $(N, K)$ Trade-off in Reproducible ML Evaluation

Deepak Pandita; Flip Korn; Chris Welty; Christopher M. Homan

arXiv:2508.03663·cs.LG·December 12, 2025

TL;DR

This paper investigates the optimal balance between the number of items and responses per item in ML evaluation, emphasizing the importance of accounting for human disagreement within budget constraints to improve reliability.

Contribution

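The paper introduces a method for determining the optimal $(N, K)$ configuration for evaluation data collection, where $N$ is the number of items and $K$ is the number of responses per item. The method accounts for human disagreement and for the choice of metric, with the goal of maximizing reliability within a fixed budget.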
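To make the fixed-budget constraint concrete, here is a minimal sketch that lists the $(N, K)$ pairs affordable under a total annotation budget. It only illustrates the search space, not the paper's method for scoring reliability. The function name `feasible_configs`, the budget of 1000, and the candidate values of $K$ are assumptions chosen for illustration.

```python
def feasible_configs(budget, k_values):
    """Return (N, K) pairs whose total annotation count N * K fits the budget.

    Hypothetical helper: for each candidate K (responses per item), take the
    largest N (number of items) that the budget can pay for.
    """
    configs = []
    for k in k_values:
        n = budget // k  # largest item count affordable at k responses per item
        if n >= 1:       # skip K values that the budget cannot cover even once
            configs.append((n, k))
    return configs


# Illustrative values only: a budget of 1000 annotations and a few choices of K.
configs = feasible_configs(1000, [1, 5, 10, 20, 50])
print(configs)  # [(1000, 1), (200, 5), (100, 10), (50, 20), (20, 50)]
```

Each pair spends the same budget differently: many items with few responses each (the "forest") or fewer items with many responses each (the "tree"). A reliability measure of the kind the paper studies would then be computed for each pair to pick the best one.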

Findings

01

Optimal $(N, K)$ often occurs with $K > 10$.

02

Budget of $N \times K$ typically should not exceed 1000.

03

Metrics sensitive to response distribution perform better at higher $K$.

Abstract

Reproducibility is a cornerstone of scientific validation and of the authority it confers on its results. Reproducibility in machine learning evaluations leads to greater trust, confidence, and value. However, the ground truth responses used in machine learning often necessarily come from humans, among whom disagreement is prevalent, and surprisingly little research has studied the impact of effectively ignoring disagreement in these responses, as is typically the case. One reason for the lack of research is that budgets for collecting human-annotated evaluation data are limited, and obtaining more samples from multiple raters for each example greatly increases the per-item annotation costs. We investigate the trade-off between the number of items ($N$) and the number of responses per item ($K$) needed for reliable machine learning evaluation. We analyze a diverse collection of…
