Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings
Jenny Y. Huang, Yunyi Shen, Dennis Wei, Tamara Broderick

TL;DR
This paper introduces a fast, practical method for checking whether dropping a very small fraction of preference data changes large language model rankings, and finds that popular benchmarks are highly sensitive to such removals.
Contribution
The authors develop a computationally efficient robustness check for Bradley--Terry-based LLM rankings and apply it to real-world benchmarks, showing that top rankings can hinge on a very small fraction of preferences.
Findings
Dropping just 0.003% of human preferences can change the top-ranked model on Chatbot Arena.
Rankings derived from MT-bench preferences are notably more robust than those from Chatbot Arena.
Different evaluation methods show varying sensitivity levels.
Abstract
We propose a method for evaluating the robustness of widely used LLM ranking systems -- variants of a Bradley--Terry model -- to dropping a worst-case very small fraction of preference data. Our approach is computationally fast and easy to adopt. When we apply our method to matchups from popular LLM ranking platforms, including Chatbot Arena and derivatives, we find that the rankings of top-performing models can be remarkably sensitive to the removal of a small fraction of preferences; for instance, dropping just 0.003% of human preferences can change the top-ranked model on Chatbot Arena. Our robustness check identifies the specific preferences most responsible for such ranking flips, allowing for inspection of these influential preferences. We observe that the rankings derived from MT-bench preferences are notably more robust than those from Chatbot Arena, likely due to MT-bench's use…
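The kind of check the abstract describes can be sketched as follows. This is a minimal illustration on synthetic matchups, not the authors' implementation: it fits a plain Bradley--Terry model as a logistic regression, uses a first-order influence approximation to predict which preferences most shrink the gap between the top two models when dropped, and then refits without them to see whether the ranking flips. All function names, the synthetic scores, and the choice of model 0 as the reference are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def design(matches, n_models):
    """One row per matchup: +1 for the first model, -1 for the second.
    Model 0 is the reference (score fixed at 0), so its column is dropped."""
    X = np.zeros((len(matches), n_models))
    rows = np.arange(len(matches))
    X[rows, matches[:, 0]] = 1.0
    X[rows, matches[:, 1]] = -1.0
    return X[:, 1:]

def fit_bt(X, y, iters=50):
    """Maximum-likelihood BT scores via Newton's method on the logistic loss."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ theta)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        theta = theta - np.linalg.solve(hess, X.T @ (p - y))
    return theta

def check_top1(matches, y, n_models):
    """Return (indices to drop, gap before, gap after refit) for the top two models."""
    X = design(matches, n_models)
    theta = fit_bt(X, y)
    scores = np.concatenate([[0.0], theta])
    a, b = np.argsort(-scores)[:2]           # current first and second place
    gap_before = scores[a] - scores[b]

    # First-order estimate of how the gap changes if each matchup is dropped.
    c = np.zeros(n_models)
    c[a], c[b] = 1.0, -1.0
    p = sigmoid(X @ theta)
    hess = (X * (p * (1 - p))[:, None]).T @ X
    influence = (X @ np.linalg.solve(hess, c[1:])) * (p - y)

    # Drop the most gap-reducing matchups until the predicted gap turns negative.
    order = np.argsort(influence)
    predicted_gap = gap_before + np.cumsum(influence[order])
    hits = np.flatnonzero(predicted_gap < 0)
    if hits.size == 0:
        return np.array([], dtype=int), gap_before, gap_before
    drop = order[: hits[0] + 1]

    # Verify the prediction by refitting without the dropped matchups.
    keep = np.ones(len(y), dtype=bool)
    keep[drop] = False
    scores_after = np.concatenate([[0.0], fit_bt(X[keep], y[keep])])
    return drop, gap_before, scores_after[a] - scores_after[b]

# Synthetic arena (assumed values): three models, the top two nearly tied.
rng = np.random.default_rng(0)
true_scores = np.array([0.0, 0.5, 0.55])
n_matches = 3000
first = rng.integers(0, 3, n_matches)
second = (first + rng.integers(1, 3, n_matches)) % 3   # always a different model
matches = np.column_stack([first, second])
y = (rng.random(n_matches) < sigmoid(true_scores[first] - true_scores[second])).astype(float)

drop, gap_before, gap_after = check_top1(matches, y, 3)
print(f"dropped {len(drop)} of {n_matches} matchups")
print(f"gap before: {gap_before:.4f}, gap after refit: {gap_after:.4f}")
```

A negative gap after the refit means the top-ranked model actually changed. Because the selection step relies on a linear approximation, the refit is what confirms or rejects the predicted flip.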
Peer Reviews
Decision·ICLR 2026 Poster
The main strengths of the paper are as follows:
1. It studies a timely problem, as benchmarking platforms such as LM Arena become the industry standard for evaluating the performance of large language models. Hence, the results will be of interest to the ICLR community.
2. It develops a relatively simple method building upon prior work in statistics for identifying small sets of pairwise comparisons that significantly affect the resulting model rankings.
3. Its analysis reveals a surprising insi
I believe that the paper has some room for improvement in terms of its presentation of certain definitions and results and its experimental evaluation. Specifically:
* The definitions on page 4 are somewhat confusing. It is unclear why there is a need to introduce Definition 2.2 regarding top-1 robustness in two-player arenas, as it is immediately followed by the more general definition of top-k robustness in arenas with more than two players. Definition 2.2 doesn't seem to be used anywhere. Mor
- The general contextualization and introduction of the problem in Sections 1 and 2 are very clear and accessible. The authors make the paper easy to follow and introduce the necessary tools when needed (i.e., the main idea behind BT). The data-dropping setup and notation are also clearly communicated. Similarly, the theoretical results in the main text (most notably Prop. 3.1), while simple, are sound and well explained.
- The experiments conducted by the authors are comprehensive and effectiv
Perhaps the key point I would like to inquire about is how the authors’ approach relates to other uncertainty quantification methods in the BT model. In particular, there are classical references that quantify uncertainty in BT coefficients, both in Bayesian and frequentist settings, for instance:
- Gao et al., "Uncertainty quantification in the Bradley-Terry-Luce model"
- Hunter "MM algorithms for generalized Bradley–Terry models"
- Leonard "An Alternative Bayesian Approach to the Bradley-Terr
**Originality**: The originality of this work lies in its novel research question and its surprising findings, rather than in the invention of a new statistical method. As far as I know, it is the first paper to systematically apply the concept of worst-case data-dropping robustness to the domain of major LLM leaderboards.
**Quality**: The paper shows high quality through its careful and valid application of statistical methods to real-world data. They verify the ranking flip by refitting the B
Limited depth of explanation: The paper hypothesizes why MT-Bench is more robust (expert annotators, curated prompts) and why rankings are fragile (small BT-score margins), but these remain hypotheses. A more controlled experiment to disentangle these factors would be needed for a definitive causal claim.
Lack of methodological novelty: The weakness is that the core algorithm (AMIP) is adapted from prior work in statistics. The contribution is in the application and the discovery, not the inven
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
