Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
Hao Wang, Hanchen Li, Qiuyang Mang, Alvin Cheung, Koushik Sen, Dawn Song

TL;DR
This paper introduces BenchJack, an automated system that audits AI agent benchmarks for reward-hacking vulnerabilities and improves their robustness through iterative flaw discovery and patching.
Contribution
We develop BenchJack, an automated red-teaming tool that detects and patches reward-hacking flaws in AI benchmarks, enhancing their security and reliability.
Findings
BenchJack identified 219 distinct reward-hacking flaws across 10 benchmarks.
Iterative patching reduced the hackable-task ratio from nearly 100% to under 10%.
The WebArena and OSWorld benchmarks were fully patched within three iterations.
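To make the hackable-task ratio and the audit-then-patch loop concrete, here is a minimal toy sketch. It is not BenchJack's implementation: the `Task` class, the `audit` and `patch` helpers, and the flaw labels are all hypothetical stand-ins, and the "auditor" simply reads flaws off a list rather than driving a coding agent. It only illustrates how the ratio might be tracked across patching rounds.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """A benchmark task with a set of hypothetical, still-unpatched flaw labels."""
    name: str
    flaws: set[str] = field(default_factory=set)


def hackable_ratio(tasks):
    """Fraction of tasks that still have at least one exploitable flaw."""
    if not tasks:
        return 0.0
    return sum(1 for t in tasks if t.flaws) / len(tasks)


def audit(task):
    """Toy auditor (assumption): reports at most one flaw per round."""
    return sorted(task.flaws)[:1]


def patch(task, found):
    """Toy patcher (assumption): removes exactly the reported flaws."""
    task.flaws -= set(found)


def audit_and_patch(tasks, max_iterations=3):
    """Alternate auditing and patching, recording the ratio after each round."""
    history = [hackable_ratio(tasks)]
    for _ in range(max_iterations):
        for t in tasks:
            patch(t, audit(t))
        history.append(hackable_ratio(tasks))
        if history[-1] == 0.0:  # nothing left to exploit
            break
    return history


# Illustrative data only; these flaw names are not the paper's taxonomy.
tasks = [
    Task("t1", {"leaked_answer"}),
    Task("t2", {"leaked_answer", "weak_checker"}),
    Task("t3", {"weak_checker"}),
    Task("t4", set()),
]
history = audit_and_patch(tasks)
print(history)
```

In this toy run the ratio falls each round because the auditor surfaces one flaw per task and the patcher removes it; a real audit would have to rediscover exploits against the patched benchmark rather than consult a known list.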
Abstract
Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue that benchmarks must be secure by design. From past incidents of reward hacks, we derive a taxonomy of eight recurring flaw patterns and compile them into the Agent-Eval Checklist for benchmark designers. We condense the insights into BenchJack, an automated red-teaming system that drives coding agents to audit benchmarks and identify possible reward-hacking exploits in a clairvoyant manner. Moreover, we extend BenchJack to an iterative generative-adversarial pipeline that discovers new flaws and patches them iteratively to improve benchmark robustness. We apply BenchJack to 10 popular agent…
