Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

Xiaosen Zheng; Tianyu Pang; Chao Du; Qian Liu; Jing Jiang; Min Lin

arXiv:2410.07137 · cs.CL · March 4, 2025

TL;DR

This paper demonstrates that even null models can cheat automatic LLM benchmarks, achieving high win rates and exposing vulnerabilities that threaten the reliability of current evaluation methods.

Contribution

It reveals that simple null models can outperform genuine models on benchmarks, highlighting the need for anti-cheating mechanisms in automatic LLM evaluations.

Findings

01. Null models achieve high win rates on benchmarks
02. Cheating responses are transferable and hard to detect
03. Current benchmarks are vulnerable to manipulation

Abstract

Automatic LLM benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench, have become popular for evaluating language models due to their cost-effectiveness and scalability compared to human evaluation. Achieving high win rates on these benchmarks can significantly boost the promotional impact of newly released language models. This promotional benefit may motivate tricks, such as manipulating model output length or style to game win rates, even though several mechanisms have been developed to control length and disentangle style to reduce gameability. Nonetheless, we show that even a "null model" that always outputs a constant response (irrelevant to input instructions) can cheat automatic benchmarks and achieve top-ranked win rates: an 86.5% LC win rate on AlpacaEval 2.0; an 83.0 score on Arena-Hard-Auto; and a 9.55 score on MT-Bench. Moreover, the crafted cheating outputs are…
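To make the "null model" idea concrete, here is a minimal Python sketch of a model that ignores its input, together with a raw pairwise win-rate calculation. Everything in it is an assumption made for illustration: `CONSTANT_RESPONSE` is a placeholder (the paper's crafted response is not reproduced here), and `toy_baseline` and `toy_length_judge` are hypothetical stand-ins rather than any benchmark's real baseline or auto-annotator. The win rate computed is the plain fraction of wins, not AlpacaEval 2.0's length-controlled (LC) win rate.

```python
from typing import Callable, List

# Placeholder only: the actual crafted cheating response is NOT shown here.
CONSTANT_RESPONSE = "placeholder " * 20

Model = Callable[[str], str]
# A judge returns True when it prefers the first response over the second.
Judge = Callable[[str, str, str], bool]


def null_model(instruction: str) -> str:
    """Ignore the instruction and return the same text every time."""
    return CONSTANT_RESPONSE


def toy_baseline(instruction: str) -> str:
    """Hypothetical stand-in for a real reference model."""
    return "Answer to: " + instruction


def toy_length_judge(instruction: str, first: str, second: str) -> bool:
    """Hypothetical, deliberately flawed judge that prefers longer text."""
    return len(first) > len(second)


def win_rate(model: Model, baseline: Model, judge: Judge,
             instructions: List[str]) -> float:
    """Raw win rate in percent: share of instructions where `model` wins."""
    wins = sum(judge(x, model(x), baseline(x)) for x in instructions)
    return 100.0 * wins / len(instructions)


if __name__ == "__main__":
    prompts = ["What is 2 + 2?", "Name a prime number."]
    print(win_rate(null_model, toy_baseline, toy_length_judge, prompts))
```

The toy judge is gameable by length alone, which is a far weaker setting than the LLM judges the paper attacks; the sketch only shows how an input-independent response plugs into a pairwise evaluation loop.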

Peer Reviews

Decision · ICLR 2025 Oral

Reviewer 01 · Rating 8 · Confidence 4

Strengths

Pros:
- I think it’s a great paper that brings up the important point about the possibility of overfitting to LLM judges.
- It’s also very interesting to see that one can fool LLMs with the same (completely non-sensical) response across different requests. This clearly highlights how brittle LLM as judges are.
- On the other hand, as, e.g., Figure 2 illustrates, coming up with an effective null model is not straightforward. In particular, a simple adversarial suffix isn’t effective.
- The evalua

Weaknesses

(Minor) weaknesses:
- It’s probably not too surprising that LLMs are vulnerable to these attacks given all the literature on adversarial examples, prompt injections, and jailbreaks for LLMs. But on the other hand, I believe there should be a clear and systematic reference that illustrates this fact for LLMs as judges that play a key role now for LLM evaluation, data filtering, etc. And this paper does a good job at this.
- The name “null model” is slightly confusing since it’s not really a model

Reviewer 02 · Rating 10 · Confidence 4

Strengths

1. The paper highlights the vulnerability of using LLM judges, motivating why they are becoming more common, existing issues and how the issue they demonstrate is much more glaring.
2. The attack succeeds (much better than SOTA model performance) under a strong threat model of giving a constant output to the LLM judge.
3. The attack transfers across 3 popular benchmarks (alpaca-eval, arena-hard-auto, mt-bench)
4. The attack is extensively ablated, showing the importance of using both the structu

Weaknesses

1. It's unclear why the structured response doesn't work for llama. Could you back the claim of lower instruction following capabilities being the reason with a) Instruction Following eval scores for GPT-4, Llama-3-70B Instruct, b) showing a plot of structured response success vs instruction following results for different auto-annotator models?
2. Currently there are no comparisons to attacks on LLM-as-a-judge performed by prior work. It would be informative to see how existing attacks on LLM-

Reviewer 03 · Rating 10 · Confidence 4

Strengths

- The idea that Automatic LLM benchmarks can be cheated is clearly demonstrated.
- Using the proposed method, SOTA scores are achieved for all the benchmarks tested, highlighting the core goal of the paper.
- The jailbreak method exhibits some robustness because of the optimized prefix strategy, making it transferable as pointed out in the text.
- The authors discuss potential anti-cheating strategies.

Weaknesses

- The paper could benefit from additional comparator methods.

Code & Models

Repositories

sail-sg/Cheating-LLM-Benchmarks (Official)

Taxonomy

Topics: Digital Rights Management and Security · Law, Economics, and Judicial Systems · Artificial Intelligence in Law