STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems
Akash Bonagiri, Gerard Janno Anderias, Saee Patil, Angelina Lai, Devang Borkar, Gezheng Kang, Ishant Gandhi, Setareh Rafatirad, Houman Homayoun

TL;DR
STABLEVAL is a new evaluation framework for AI systems that models annotator disagreement and item ambiguity to produce more stable and reliable system rankings.
Contribution
The paper introduces a disagreement-aware evaluation method that accounts for annotator reliability and item ambiguity, improving ranking stability over traditional majority-vote aggregation.
Findings
STABLEVAL yields more stable system rankings under annotator heterogeneity.
Majority-vote scoring shows higher error and instability when annotations are noisy or heterogeneous.
Modeling disagreement improves robustness and reproducibility of AI evaluations.
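
One way to make "stable under annotator heterogeneity" concrete is to rescore the systems on different annotator subsets and check how much the ranking moves. The sketch below is illustrative only and is not taken from the paper: the function names, the `votes[system][item][annotator]` layout with binary votes, and the choice of mean pairwise Kendall tau as the stability measure are all assumptions.

```python
from itertools import combinations

def majority_scores(votes, annotators):
    """Mean majority-vote credit per system, using only the given annotators.

    votes[system] is a list of items; each item is a list of 0/1 votes
    indexed by annotator (hypothetical layout, not the paper's format).
    """
    scores = {}
    for system, items in votes.items():
        credits = []
        for item_votes in items:
            kept = [item_votes[a] for a in annotators]
            # Credit 1.0 only when strictly more than half of the kept votes are 1.
            credits.append(1.0 if 2 * sum(kept) > len(kept) else 0.0)
        scores[system] = sum(credits) / len(credits)
    return scores

def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score dicts over the same systems (tied pairs count as 0)."""
    pairs = list(combinations(sorted(scores_a), 2))
    total = 0
    for s, t in pairs:
        agreement = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        total += (agreement > 0) - (agreement < 0)
    return total / len(pairs)

def ranking_stability(votes, n_annotators, subset_size):
    """Mean pairwise Kendall tau across all annotator subsets of the given size."""
    runs = [
        majority_scores(votes, subset)
        for subset in combinations(range(n_annotators), subset_size)
    ]
    taus = [kendall_tau(a, b) for a, b in combinations(runs, 2)]
    return sum(taus) / len(taus)

# Toy data: three annotators, two systems.
unanimous = {"A": [[1, 1, 1], [1, 1, 1]], "B": [[0, 0, 0], [1, 1, 1]]}
split = {"A": [[1, 0, 0]], "B": [[0, 1, 1]]}
stable = ranking_stability(unanimous, n_annotators=3, subset_size=2)
fragile = ranking_stability(split, n_annotators=3, subset_size=1)
```

With unanimous annotators every subset produces the same ranking, so the stability value is at its maximum of 1; when annotators split, the ranking can flip depending on who is sampled and the value drops. The same harness could be pointed at any other aggregation rule by swapping out `majority_scores`.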
Abstract
Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator reliability and item-level ambiguity, often yielding unstable comparisons across annotator subsets. We introduce STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. Unlike label-denoising approaches such as Dawid-Skene, STABLEVAL is explicitly designed for stable and uncertainty-aware system evaluation rather than hard label recovery. We formalize ranking stability as a first-class evaluation objective and analyze how aggregation methods preserve or distort underlying annotator behavior. Across controlled…
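
The contrast the abstract draws between hard majority vote and posterior expected item credit can be illustrated with a small worked example. This is a minimal sketch under strong assumptions, not the STABLEVAL model itself: it assumes binary correct/incorrect votes, conditionally independent annotators, and per-annotator sensitivity and specificity that are already known, whereas the paper models latent correctness and annotator confusion patterns whose details are not given here. All names and numbers are hypothetical.

```python
from math import prod

def majority_vote(votes):
    """Hard credit: 1.0 if strictly more than half of the votes are 1, else 0.0."""
    return 1.0 if 2 * sum(votes) > len(votes) else 0.0

def posterior_credit(votes, sensitivity, specificity, prior=0.5):
    """Posterior probability that an item is correct, given binary votes.

    sensitivity[j] = P(annotator j votes 1 | item is correct)
    specificity[j] = P(annotator j votes 0 | item is incorrect)
    Assumes annotators vote independently given the latent label.
    """
    like_correct = prod(
        s if v == 1 else 1 - s for v, s in zip(votes, sensitivity)
    )
    like_incorrect = prod(
        1 - t if v == 1 else t for v, t in zip(votes, specificity)
    )
    numerator = prior * like_correct
    return numerator / (numerator + (1 - prior) * like_incorrect)

def agent_score(item_credits):
    """Agent-level score as the mean of per-item credits."""
    return sum(item_credits) / len(item_credits)

# One reliable annotator says "correct"; two noisy annotators say "incorrect".
votes = [1, 0, 0]
sensitivity = [0.95, 0.60, 0.60]  # hypothetical reliabilities
specificity = [0.95, 0.60, 0.60]

hard = majority_vote(votes)
soft = posterior_credit(votes, sensitivity, specificity)
```

Here majority vote assigns zero credit because two of three annotators voted "incorrect", while the posterior credit stays above one half because the single dissenting vote comes from a much more reliable annotator. Averaging such soft credits with `agent_score` gives a system-level score that changes gradually, rather than flipping, when a noisy annotator is added or removed.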
