Robust AI Evaluation through Maximal Lotteries

Hadi Khalaf; Serena L. Wang; Daniel Halpern; Itai Shapira; Flavio du Pin Calmon; Ariel D. Procaccia

arXiv:2602.21297·cs.LG·February 26, 2026

Open Access

TL;DR

This paper introduces robust lotteries for evaluating language models. Unlike a single Bradley-Terry ranking, they account for heterogeneous preferences and optimize worst-case performance under plausible shifts in the preference data, giving more reliable comparisons across tasks and user subpopulations.

Contribution

It proposes robust lotteries that improve the stability and fairness of model evaluation by optimizing for worst-case preferences, advancing social choice methods in AI evaluation.

Findings

01. Robust lotteries outperform traditional rankings in stability across diverse preferences.

02. They provide more reliable win rate guarantees for models.

03. The approach supports an ecosystem of AI systems serving varied human preferences.

Abstract

The standard way to evaluate language models on subjective tasks is through pairwise comparisons: an annotator chooses the "better" of two responses to a prompt. Leaderboards aggregate these comparisons into a single Bradley-Terry (BT) ranking, forcing heterogeneous preferences into a total order and violating basic social-choice desiderata. In contrast, social choice theory provides an alternative approach called maximal lotteries, which aggregates pairwise preferences without imposing any assumptions on their structure. However, we show that maximal lotteries are highly sensitive to preference heterogeneity and can favor models that severely underperform on specific tasks or user subpopulations. We introduce robust lotteries that optimize worst-case performance under plausible shifts in the preference data. On large-scale preference datasets, robust lotteries provide more reliable win…
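To make the idea of a maximal lottery concrete, here is a minimal sketch in Python. It covers only the classical maximal lottery, not the paper's robust variant, whose formulation is not given in the text above. A maximal lottery is a probability distribution over models that no single model beats in expectation, which is a maximin strategy of the symmetric zero-sum game defined by the pairwise win margins. The solver used here (multiplicative-weights self-play, returning the averaged strategy) is an assumption chosen to keep the example dependency-free; it only approximates the lottery. The names `maximal_lottery`, `worst_case_margin`, and the `win_rate` matrix are hypothetical.

```python
import math


def maximal_lottery(win_rate, steps=50000):
    """Approximate a maximal lottery from a pairwise win-rate matrix.

    win_rate[i][j] is the fraction of comparisons in which model i is
    preferred over model j (assumed: win_rate[i][j] + win_rate[j][i] == 1).
    Illustrative solver: multiplicative-weights self-play, averaged.
    """
    n = len(win_rate)
    # Skew-symmetric margin matrix: positive when i beats j on balance.
    margin = [[win_rate[i][j] - win_rate[j][i] for j in range(n)]
              for i in range(n)]
    eta = math.sqrt(math.log(n) / steps)  # standard Hedge step size
    log_w = [0.0] * n   # log-weights, kept in log space for stability
    avg = [0.0] * n     # running average of the played lotteries
    for _ in range(steps):
        top = max(log_w)
        w = [math.exp(v - top) for v in log_w]
        total = sum(w)
        p = [x / total for x in w]
        for i in range(n):
            avg[i] += p[i] / steps
            # Reward model i by its expected margin against lottery p.
            log_w[i] += eta * sum(margin[i][j] * p[j] for j in range(n))
    return avg


def worst_case_margin(lottery, win_rate):
    """Smallest expected margin of the lottery against any single model."""
    n = len(win_rate)
    return min(
        sum(lottery[i] * (win_rate[i][j] - win_rate[j][i]) for i in range(n))
        for j in range(n)
    )


# Hypothetical cyclic preferences: 0 beats 1, 1 beats 2, 2 beats 0.
cycle = [
    [0.50, 0.55, 0.35],
    [0.45, 0.50, 0.60],
    [0.65, 0.40, 0.50],
]
lottery = maximal_lottery(cycle)
print([round(x, 3) for x in lottery], worst_case_margin(lottery, cycle))
```

For this cyclic example no total order is consistent with the data, yet the exact maximal lottery is well defined: it places probability 1/3, 1/2, and 1/6 on the three models, and the sketch should return something close to that with a worst-case margin near zero. When one model beats every other head to head, the lottery concentrates on that model.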

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Ethics and Social Impacts of AI · Game Theory and Voting Systems