Principled Evaluation with Human Labels: One Rater at a Time and Rater Equivalence
Paul Resnick, Yuqing Kong, Grant Schoenebeck, Tim Weninger

TL;DR
This paper proposes a principled framework for evaluating classifiers with human labels, emphasizing single-rater scoring and the concept of rater equivalence for benchmarking.
Contribution
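It introduces a utility-based scoring method that scores a classifier against one rater at a time, defines rater equivalence as the smallest number of human raters whose combined judgment matches the classifier's performance, and provides a provably optimal algorithm for combining the raters' labels.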
Findings
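Scoring a classifier against each rater individually and averaging the scores is better justified than scoring against the panel's majority vote, which is not justified when objectivity or equanimity fails.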
Rater equivalence expresses a classifier's performance as the smallest number of human raters whose combined judgment matches it, which gives a benchmark for comparison.
An optimal algorithm for combining human labels is provided and validated through case studies.
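To make the first finding concrete, here is a minimal sketch that contrasts the two scoring schemes on a toy panel. It assumes binary labels, an odd number of raters (so the majority vote has no ties), and plain agreement rate as a stand-in for the paper's utility-based score. The function names and the data are hypothetical, not taken from the paper.

```python
from collections import Counter


def agreement(preds, labels):
    """Fraction of items where the prediction equals the label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def score_one_rater_at_a_time(preds, ratings):
    """Score against each rater separately, then average across raters.

    ratings[i][j] is the label that rater j gave to item i.
    """
    n_raters = len(ratings[0])
    per_rater = [
        agreement(preds, [row[j] for row in ratings]) for j in range(n_raters)
    ]
    return sum(per_rater) / n_raters


def score_against_majority(preds, ratings):
    """Score against the panel's majority label for each item."""
    majority = [Counter(row).most_common(1)[0][0] for row in ratings]
    return agreement(preds, majority)


# Hypothetical toy data: 4 items, 3 raters, binary labels.
ratings = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
]
preds = [1, 0, 0, 1]

print(score_one_rater_at_a_time(preds, ratings))
print(score_against_majority(preds, ratings))
```

On this toy panel the classifier agrees with the majority label on every item, yet it disagrees with individual raters on some items, so the averaged per-rater score is lower. The majority-vote score hides the rater disagreement that the per-rater average keeps visible.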
Abstract
In many classification tasks, there is no definitive ground truth, only human judgments that may disagree. We address two challenges that arise in such settings: (1) how to use human raters to score classifiers, and (2) how to use them for comparison benchmarks. For the first, the common practice is to score classifiers against the majority vote of an evaluation panel of several human raters. We argue that this is not justified when either of two properties fails: objectivity or equanimity. Instead, under a utility model appropriate for such settings, scoring against one rater at a time and averaging the scores across raters is a more principled approach. For the second, we introduce the concept of rater equivalence: the smallest number of human raters whose combined judgment matches the classifier's performance. We provide a provably optimal algorithm for combining benchmark panel…
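The rater-equivalence definition above can be illustrated with a rough sketch. Several things here are assumptions rather than the paper's method: panels are combined by simple majority vote (not the provably optimal combiner the abstract mentions), the score is plain agreement with a held-out reference rater, labels are binary, and only odd panel sizes are tried so that majority votes cannot tie. All names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def agreement(preds, labels):
    """Fraction of items where the prediction equals the label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def column(ratings, j):
    """Labels that rater j gave to every item."""
    return [row[j] for row in ratings]


def classifier_score(preds, ratings):
    """Average agreement of the classifier with each rater in turn."""
    n_raters = len(ratings[0])
    return sum(agreement(preds, column(ratings, j)) for j in range(n_raters)) / n_raters


def panel_score(ratings, k):
    """Average agreement of k-rater majority votes with a held-out rater."""
    n_raters = len(ratings[0])
    scores = []
    for ref in range(n_raters):
        others = [j for j in range(n_raters) if j != ref]
        ref_labels = column(ratings, ref)
        for panel in combinations(others, k):
            # Stand-in combiner: majority vote over the k panel members.
            combined = [
                Counter(row[j] for j in panel).most_common(1)[0][0]
                for row in ratings
            ]
            scores.append(agreement(combined, ref_labels))
    return sum(scores) / len(scores)


def rater_equivalence(preds, ratings):
    """Smallest odd panel size whose score reaches the classifier's score.

    Returns None if no panel of the available raters reaches it.
    """
    target = classifier_score(preds, ratings)
    n_raters = len(ratings[0])
    for k in range(1, n_raters, 2):  # odd sizes only, one rater held out
        if panel_score(ratings, k) >= target:
            return k
    return None


# Hypothetical toy data: 6 items, 4 raters, binary labels.
ratings = [
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(rater_equivalence([1, 0, 1, 1, 1, 0], ratings))
print(rater_equivalence([0, 0, 0, 0, 0, 0], ratings))
```

In this sketch a single rater agrees with a held-out rater on about two thirds of items, and a three-rater majority on about five sixths, so a classifier scoring 0.75 needs three raters to be matched while a classifier scoring 0.5 is matched by one.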
