TL;DR
This paper investigates the optimal balance between the number of items and responses per item in ML evaluation, emphasizing the importance of accounting for human disagreement within budget constraints to improve reliability.
Contribution
It introduces a method to determine the optimal $(N, K)$ configuration for evaluation data collection, considering human disagreement and different metrics, to maximize reliability within a fixed budget.
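The budget constraint can be illustrated with a minimal sketch. It is not the paper's method: it only enumerates the candidate $(N, K)$ pairs whose total $N \times K$ fits a fixed budget, which is the search space such a method would have to rank. The function name `feasible_configs`, the budget of 1000, and the candidate $K$ values are assumptions chosen for illustration.

```python
def feasible_configs(budget, k_values):
    """Return (N, K) pairs whose total responses N * K fit within `budget`.

    Hypothetical helper for illustration; not taken from the paper.
    """
    configs = []
    for k in k_values:
        n = budget // k  # largest number of items affordable at k responses each
        if n > 0:
            configs.append((n, k))
    return configs


# Assumed budget and candidate K values, for illustration only.
for n, k in feasible_configs(1000, [1, 5, 10, 20, 50]):
    print(f"N={n}  K={k}  total={n * k}")
```

Each pair spends about the same budget but splits it differently between items and responses per item. Ranking the pairs would additionally require a reliability measure for the chosen metric, which this sketch leaves out.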
Findings
Optimal $(N, K)$ often occurs with $K > 10$.
Budget of $N \times K$ typically should not exceed 1000.
Metrics sensitive to response distribution perform better at higher $K$.
Abstract
Reproducibility is a cornerstone of scientific validation and of the authority it confers on its results. Reproducibility in machine learning evaluations leads to greater trust, confidence, and value. However, the ground truth responses used in machine learning often necessarily come from humans, among whom disagreement is prevalent, and surprisingly little research has studied the impact of effectively ignoring disagreement in these responses, as is typically the case. One reason for the lack of research is that budgets for collecting human-annotated evaluation data are limited, and obtaining more samples from multiple raters for each example greatly increases the per-item annotation costs. We investigate the trade-off between the number of items ($N$) and the number of responses per item ($K$) needed for reliable machine learning evaluation. We analyze a diverse collection of…