Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards

Yangsibo Huang; Milad Nasr; Anastasios Angelopoulos; Nicholas Carlini; Wei-Lin Chiang; Christopher A. Choquette-Choo; Daphne Ippolito; Matthew Jagielski; Katherine Lee; Ken Ziyu Liu; Ion Stoica; Florian Tramer; Chiyuan Zhang

arXiv:2501.07493·cs.LG·January 14, 2025

TL;DR

This paper reveals vulnerabilities in voting-based LLM benchmarks like Chatbot Arena to adversarial manipulation and proposes defenses to improve robustness against such attacks.

Contribution

The paper demonstrates that attackers can manipulate leaderboards with a small number of votes (roughly a thousand in the authors' simulated setting) and introduces mitigations that strengthen the security and integrity of voting-based benchmarks.

Findings

01

An attack can alter the leaderboard with roughly 1,000 votes

02

Attackers can identify which model generated a response with over 95% accuracy

03

Implemented defenses increase attack costs significantly

Abstract

It is now common to evaluate Large Language Models (LLMs) by having humans manually vote to evaluate model outputs, in contrast to typical benchmarks that evaluate knowledge or skill at some particular task. Chatbot Arena, the most popular benchmark of this type, ranks models by asking users to select the better response between two randomly selected models (without revealing which model was responsible for the generations). These platforms are widely trusted as a fair and accurate measure of LLM capabilities. In this paper, we show that if bot protection and other defenses are not implemented, these voting-based benchmarks are potentially vulnerable to adversarial manipulation. Specifically, we show that an attacker can alter the leaderboard (to promote their favorite model or demote competitors) at the cost of roughly a thousand votes (verified in a simulated, offline version of…
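To make the mechanism concrete, here is a minimal sketch of how a stream of one-sided pairwise votes can move a model up a rating-based leaderboard. It is an illustration under stated assumptions, not the authors' attack or the rating method of any particular platform: the Elo-style update, the K-factor of 4, the starting ratings, and the model names (`model_a`, `model_b`, `target`) are all hypothetical. It also assumes the attacker always knows which anonymous response came from the target model, a simplification of the identification step summarized under Findings.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that the model rated r_a beats the model rated r_b (Elo-style)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(ratings: dict, winner: str, loser: str, k: float = 4.0) -> None:
    """Apply one pairwise vote in place: the winner gains what the loser gives up."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


def leaderboard(ratings: dict) -> list:
    """Model names sorted from highest to lowest rating."""
    return sorted(ratings, key=ratings.get, reverse=True)


# Hypothetical starting ratings; "target" is the model the attacker promotes.
ratings = {"model_a": 1020.0, "model_b": 1000.0, "target": 990.0}
before = leaderboard(ratings)

# Assumption: the attacker identifies the target in every battle
# and always votes for it, regardless of response quality.
for _ in range(50):
    elo_update(ratings, "target", "model_a")
    elo_update(ratings, "target", "model_b")

after = leaderboard(ratings)
print("before:", before)
print("after: ", after)
```

In this toy setting, 100 one-sided votes are enough to move `target` from last place to first. A real leaderboard aggregates many honest votes across many more models, so the number of votes needed would differ from this example.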
