A Unified Perturbation Framework for Analyzing Leaderboard Stability and Manipulation
Hosna Oyarhoseini, Jimmy Lin, Amir-Hossein Karimi

TL;DR
This paper introduces a unified framework to analyze the robustness of leaderboards in benchmarking large language models, revealing their vulnerability to small targeted data perturbations.
Contribution
It presents a novel influence-based perturbation framework for assessing and manipulating leaderboard rankings, highlighting their non-robustness and proposing more reliable evaluation methods.
Findings
Modern leaderboards are highly sensitive to minimal targeted perturbations.
Small data modifications can significantly alter top-k membership, global ranking consistency (Kendall's tau), and confidence intervals.
The framework enables efficient targeted manipulations to promote or demote models.
Abstract
Evaluation leaderboards such as LMArena play a central role in benchmarking large language models by aggregating pairwise human preferences into model rankings, yet the robustness of these rankings remains poorly understood. We present a unified perturbation framework for analyzing Bradley-Terry leaderboards under structured data modifications using influence-based approximations. Our framework studies three match-level perturbations -- Drop, Add, and Flip -- together with player removal, and evaluates their effects on top-k membership, global ranking consistency via Kendall's tau, and confidence-interval-based uncertainty. Across Chatbot Arena and six additional pairwise-comparison datasets, we show that modern leaderboards are non-robust across all three objectives: sub-1% targeted perturbations can change the top-ranked model, degrade Kendall's tau, and alter confidence intervals.…
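To make the Flip perturbation and the Kendall's tau objective concrete, here is a minimal Python sketch. It is not the authors' method: instead of their influence-based approximation it simply refits a Bradley-Terry model from scratch after the perturbation, using the standard minorization-maximization update. The function names (`fit_bradley_terry`, `kendall_tau`, `flip`), the small pseudo-count `prior`, and the three-player toy data are all assumptions made for illustration. The toy set is far too small to reflect the sub-1% perturbation budgets reported in the abstract.

```python
import math
from itertools import combinations


def fit_bradley_terry(n, matches, iters=200, prior=0.1):
    """Return log-strengths for n players from (winner, loser) pairs.

    Uses the classic MM update. `prior` is a hypothetical symmetric
    pseudo-count added to every pair so the estimates stay finite.
    """
    wins = [[0.0 if i == j else prior for j in range(n)] for i in range(n)]
    for winner, loser in matches:
        wins[winner][loser] += 1.0
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom)
        norm = sum(new_p)
        p = [x / norm for x in new_p]
    return [math.log(x) for x in p]


def kendall_tau(a, b):
    """Kendall's tau-a between two score vectors (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) / 2
    return (concordant - discordant) / n_pairs


def flip(matches, idx):
    """Flip perturbation: swap winner and loser of the match at idx."""
    out = list(matches)
    winner, loser = out[idx]
    out[idx] = (loser, winner)
    return out


# Toy data (invented): every pair plays 5 games, player 0 narrowly leads.
matches = (
    [(0, 1)] * 3 + [(1, 0)] * 2
    + [(0, 2)] * 3 + [(2, 0)] * 2
    + [(1, 2)] * 3 + [(2, 1)] * 2
)

base = fit_bradley_terry(3, matches)
flipped = fit_bradley_terry(3, flip(matches, 0))  # flip one 0-beats-1 game

top_before = max(range(3), key=lambda i: base[i])
top_after = max(range(3), key=lambda i: flipped[i])
tau = kendall_tau(base, flipped)

print("top before:", top_before, "top after:", top_after)
print("Kendall's tau between rankings:", round(tau, 3))
```

In this toy setting a single flipped match is enough to change the top-ranked player. Refitting for every candidate perturbation scales poorly with the number of matches, which is the cost that an influence-based approximation is meant to avoid.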
