Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML
Jai Moondra, Ayela Chughtai, Bhargavi Lanka, Swati Gupta

TL;DR
This paper argues that global LLM leaderboards are misleading because human preferences are strongly heterogeneous across language, task, and time, and it proposes small, targeted model portfolios as a better alternative.
Contribution
It introduces the $(eta, u)$-portfolio framework to address heterogeneity, providing algorithms that produce small, effective model sets covering most user preferences.
Findings
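Under the best-fit global Bradley-Terry (BT) ranking, the top 50 models are statistically indistinguishable: pairwise win probabilities among them are at most 0.53.
Grouping votes by language (and language family) greatly increases vote agreement, yielding about two orders of magnitude higher spread in Elo scores.
Small portfolios cover over 96% of votes, outperforming global rankings.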
Abstract
Ranking LLMs via pairwise human feedback underpins current leaderboards for open-ended tasks, such as creative writing and problem-solving. We analyze ~89K comparisons in 116 languages from 52 LLMs from Arena, and show that the best-fit global Bradley-Terry (BT) ranking is misleading. Nearly 2/3 of the decisive votes cancel out, and even the top 50 models according to the global BT ranking are statistically indistinguishable (pairwise win probabilities are at most 0.53 within the top 50 models). We trace this failure to strong, structured heterogeneity of opinions across language, task, and time. Moreover, we find an important characteristic: *language* plays a key role. Grouping by language (and families) increases the agreement of votes massively, resulting in two orders of magnitude higher spread in the Elo scores (i.e., very consistent rankings). What appears as global noise is in…
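To give a sense of how small a 0.53 win probability is, the following minimal sketch converts it to a rating gap. It assumes the standard Bradley-Terry form, in which the probability that model i beats model j is a logistic function of the difference in their log-strength scores, and the conventional Elo scale (base 10, 400 points). Neither the scale nor the function names below come from the paper; they are illustrative assumptions.

```python
import math


def bt_win_probability(score_i: float, score_j: float) -> float:
    """P(i beats j) under Bradley-Terry, with scores as natural-log strengths."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def elo_gap_for_win_probability(p: float) -> float:
    """Rating gap implied by win probability p on the conventional
    Elo scale (base 10, 400 points). The scale is an assumption here."""
    return 400.0 * math.log10(p / (1.0 - p))


# Gap implied by the 0.53 upper bound quoted in the abstract.
gap = elo_gap_for_win_probability(0.53)
print(f"Elo gap for p = 0.53: {gap:.1f}")

# Round trip: convert the Elo gap back to a natural-log score difference.
score_diff = gap * math.log(10) / 400.0
print(f"Recovered win probability: {bt_win_probability(score_diff, 0.0):.2f}")
```

On this scale a win probability of 0.53 corresponds to a gap of roughly 21 rating points, which is consistent with the abstract's description of the top 50 models as statistically indistinguishable.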
