Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin

TL;DR
This paper demonstrates that even null models can cheat automatic LLM benchmarks, achieving high win rates and exposing vulnerabilities that threaten the reliability of current evaluation methods.
Contribution
The paper shows that a simple null model, one that returns the same constant response to every instruction, can outscore genuine models on automatic benchmarks, highlighting the need for anti-cheating mechanisms in automatic LLM evaluation.
Findings
Null models achieve high win rates on benchmarks
Cheating responses are transferable and hard to detect
Current benchmarks are vulnerable to manipulation
Abstract
Automatic LLM benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench, have become popular for evaluating language models due to their cost-effectiveness and scalability compared to human evaluation. Achieving high win rates on these benchmarks can significantly boost the promotional impact of newly released language models. This promotional benefit may motivate tricks, such as manipulating model output length or style to game win rates, even though several mechanisms have been developed to control length and disentangle style to reduce gameability. Nonetheless, we show that even a "null model" that always outputs a constant response (irrelevant to input instructions) can cheat automatic benchmarks and achieve top-ranked win rates: an 86.5% LC win rate on AlpacaEval 2.0; an 83.0 score on Arena-Hard-Auto; and a 9.55 score on MT-Bench. Moreover, the crafted cheating outputs are…
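To make the "null model" idea concrete, here is a minimal sketch of a model that ignores its input and a pairwise win-rate loop of the kind these benchmarks rely on. Everything in it is an illustrative assumption rather than the authors' code: `CONSTANT_RESPONSE` is a placeholder string (not the crafted cheating output from the paper), `baseline_model` is a stand-in reference model, and `toy_length_judge` is a deliberately naive judge that prefers the longer response, standing in for a real LLM auto-annotator.

```python
from typing import Callable, Sequence

# Hypothetical placeholder text; the paper's actual crafted response is not reproduced here.
CONSTANT_RESPONSE = "This is a fixed reply that never depends on the instruction. " * 3


def null_model(instruction: str) -> str:
    """Return the same constant string regardless of the instruction."""
    return CONSTANT_RESPONSE


def baseline_model(instruction: str) -> str:
    """Stand-in reference model (an assumption for this sketch)."""
    return f"Answer to: {instruction}"


def toy_length_judge(instruction: str, response_a: str, response_b: str) -> bool:
    """Toy judge that prefers the longer response. NOT a real LLM judge."""
    return len(response_a) > len(response_b)


def win_rate(
    model: Callable[[str], str],
    reference: Callable[[str], str],
    judge: Callable[[str, str, str], bool],
    instructions: Sequence[str],
) -> float:
    """Fraction of instructions on which the judge prefers `model` over `reference`."""
    if not instructions:
        raise ValueError("instructions must be non-empty")
    wins = sum(
        1 for inst in instructions if judge(inst, model(inst), reference(inst))
    )
    return wins / len(instructions)


instructions = [
    "Summarize the plot of Hamlet.",
    "Write a haiku about rain.",
    "What is 2 + 2?",
]
rate = win_rate(null_model, baseline_model, toy_length_judge, instructions)
print(f"Null model win rate under the toy judge: {rate:.2f}")
```

The point of the sketch is only that a judge with an exploitable bias can reward a response that carries no information about the instruction. The length bias used here is a simplification; real benchmarks use LLM judges, and some already control for length.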
Peer Reviews
Decision: ICLR 2025 Oral
Pros:
- I think it’s a great paper that brings up the important point about the possibility of overfitting to LLM judges.
- It’s also very interesting to see that one can fool LLMs with the same (completely non-sensical) response across different requests. This clearly highlights how brittle LLM as judges are.
- On the other hand, as, e.g., Figure 2 illustrates, coming up with an effective null model is not straightforward. In particular, a simple adversarial suffix isn’t effective.
- The evalua
(Minor) weaknesses:
- It’s probably not too surprising that LLMs are vulnerable to these attacks given all the literature on adversarial examples, prompt injections, and jailbreaks for LLMs. But on the other hand, I believe there should be a clear and systematic reference that illustrates this fact for LLMs as judges that play a key role now for LLM evaluation, data filtering, etc. And this paper does a good job at this.
- The name “null model” is slightly confusing since it’s not really a model
1. The paper highlights the vulnerability of using LLM judges, motivating why they are becoming more common, existing issues and how the issue they demonstrate is much more glaring.
2. The attack succeeds (much better than SOTA model performance) under a strong threat model of giving a constant output to the LLM judge.
3. The attack transfers across 3 popular benchmarks (alpaca-eval, arena-hard-auto, mt-bench)
4. The attack is extensively ablated, showing the importance of using both the structu
1. It's unclear why the structured response doesn't work for llama. Could you back the claim of lower instruction following capabilities being the reason with a) Instruction Following eval scores for GPT-4, Llama-3-70B Instruct, b) showing a plot of structured response success vs instruction following results for different auto-annotator models?
2. Currently there are no comparisons to attacks on LLM-as-a-judge performed by prior work. It would be informative to see how existing attacks on LLM-
- The idea that Automatic LLM benchmarks can be cheated is clearly demonstrated.
- Using the proposed method, SOTA scores are achieved for all the benchmarks tested, highlighting the core goal of the paper.
- The jailbreak method exhibits some robustness because of the optimized prefix strategy, making it transferable as pointed out in the text.
- The authors discuss potential anti-cheating strategies.
- The paper could benefit from additional comparator methods.
Taxonomy
Topics: Digital Rights Management and Security · Law, Economics, and Judicial Systems · Artificial Intelligence in Law
