Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards
Yangsibo Huang, Milad Nasr, Anastasios Angelopoulos, Nicholas Carlini, Wei-Lin Chiang, Christopher A. Choquette-Choo, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Ken Ziyu Liu, Ion Stoica, Florian Tramer, Chiyuan Zhang

TL;DR
This paper shows that voting-based LLM benchmarks such as Chatbot Arena are vulnerable to adversarial manipulation, and proposes defenses that improve robustness against such attacks.
Contribution
It demonstrates how an attacker can manipulate a leaderboard with roughly a thousand votes, and introduces mitigations that improve the security and integrity of voting-based benchmarks.
Findings
Attack can alter leaderboard with roughly 1,000 votes
Model identification accuracy exceeds 95%
Implemented defenses increase attack costs significantly
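To give a feel for why a modest number of one-sided votes can reorder a pairwise leaderboard, here is a minimal sketch. It assumes an Elo-style online rating update, which is an illustrative choice and not necessarily the rating system the platform uses. The model names, starting ratings, K-factor, and the helper `apply_biased_votes` are all hypothetical, and the sketch skips the model-identification step the paper's attack relies on.

```python
def expected_score(rating_a, rating_b):
    """Elo-style probability that A beats B (illustrative assumption)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, a_wins, k=4.0):
    """Return updated (rating_a, rating_b) after one pairwise vote."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


def apply_biased_votes(ratings, target, n_votes, k=4.0):
    """Hypothetical helper: cast n_votes, each one favoring `target`.

    Opponents are cycled in a fixed order to keep the sketch deterministic;
    a real platform samples the pair at random.
    """
    ratings = dict(ratings)  # do not mutate the caller's dict
    opponents = [name for name in ratings if name != target]
    for i in range(n_votes):
        opponent = opponents[i % len(opponents)]
        ratings[target], ratings[opponent] = elo_update(
            ratings[target], ratings[opponent], a_wins=True, k=k
        )
    return ratings


# Hypothetical starting leaderboard; model_c is ranked last.
ratings = {"model_a": 1050.0, "model_b": 1020.0, "model_c": 1000.0}
after = apply_biased_votes(ratings, target="model_c", n_votes=100)

for name in sorted(after, key=after.get, reverse=True):
    print(f"{name}: {after[name]:.1f}")
```

In this toy setting every adversarial vote lands on a battle that involves the target. On a real platform the attacker must first work out which anonymous response belongs to the target model, which is why the identification accuracy reported above matters for the attack's cost.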
Abstract
It is now common to evaluate Large Language Models (LLMs) by having humans manually vote to evaluate model outputs, in contrast to typical benchmarks that evaluate knowledge or skill at some particular task. Chatbot Arena, the most popular benchmark of this type, ranks models by asking users to select the better response between two randomly selected models (without revealing which model was responsible for the generations). These platforms are widely trusted as a fair and accurate measure of LLM capabilities. In this paper, we show that if bot protection and other defenses are not implemented, these voting-based benchmarks are potentially vulnerable to adversarial manipulation. Specifically, we show that an attacker can alter the leaderboard (to promote their favorite model or demote competitors) at the cost of roughly a thousand votes (verified in a simulated, offline version of…
