SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents
Bingchen Zhao, Dhruv Srikanth, Yuxiang Wu, Zhengyao Jiang

TL;DR
SpecBench is a benchmark for measuring reward hacking in long-horizon coding agents by comparing pass rates on visible and holdout test suites across diverse software tasks.
Contribution
The paper introduces SpecBench, a benchmark of 30 tasks that quantifies reward hacking in coding agents, and shows that the problem persists, especially on longer tasks and with smaller models.
Findings
Reward hacking increases with task length, growing by 28 percentage points per tenfold increase in code size.
All frontier agents saturate visible test suites but fail on holdout tests, indicating reward hacking.
Smaller models exhibit larger gaps in pass rates on holdout suites, showing more reward hacking.
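To make the first finding concrete, here is a minimal Python sketch of what "28 percentage points per tenfold increase in code size" implies if the trend is read as log-linear. That reading is an assumption: the summary gives only the slope, not the fitted model, its intercept, or the unit of code size. The function name and the example sizes are hypothetical.

```python
import math

# Slope quoted in the finding: percentage points of gap per 10x code size.
PP_PER_TENFOLD = 28.0


def expected_gap_increase(size_from: float, size_to: float,
                          pp_per_tenfold: float = PP_PER_TENFOLD) -> float:
    """Change in the visible-vs-holdout gap (percentage points) when code
    size grows from size_from to size_to, assuming a log-linear trend.

    Only the difference is computed, because no intercept is reported.
    """
    if size_from <= 0 or size_to <= 0:
        raise ValueError("code sizes must be positive")
    return pp_per_tenfold * math.log10(size_to / size_from)


# Hypothetical sizes: a tenfold and a hundredfold increase.
tenfold = expected_gap_increase(1_000, 10_000)
hundredfold = expected_gap_increase(1_000, 100_000)
```

Under this assumption a tenfold increase adds 28 points to the gap and a hundredfold increase adds 56. The extrapolation is illustrative only, since a pass-rate gap cannot exceed 100 points.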
Abstract
As long-horizon coding agents produce more code than any developer can review, oversight collapses onto a single surface: the automated test suite. Reward hacking naturally arises in this setup, as the agent optimizes for passing tests while deviating from the user's true goal. We study this reward hacking phenomenon by decomposing software engineering tasks into three parts: (i) a natural language description of the specification, (ii) visible validation tests that exercise specified features in isolation, and (iii) held-out tests that compose those same features to simulate real-world usage. Based on the specification and the visible validation test suites, a genuine agent would be able to generate a solution that can also pass all of the held-out tests. Therefore, we use the gap in pass rates on these two suites to quantify reward hacking. Based on this methodology, we introduce…
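The gap metric described in the abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's implementation: the `SuiteResult` type, the `reward_hacking_gap` function, and the example counts are hypothetical, and the gap is taken as a simple difference of pass rates in percentage points.

```python
from dataclasses import dataclass


@dataclass
class SuiteResult:
    """Outcome of running one test suite (hypothetical container)."""
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        if self.total <= 0:
            raise ValueError("suite must contain at least one test")
        return self.passed / self.total


def reward_hacking_gap(visible: SuiteResult, holdout: SuiteResult) -> float:
    """Visible pass rate minus holdout pass rate, in percentage points.

    A solution that genuinely follows the specification should score near
    zero; a large positive value suggests it was fit to the visible tests.
    """
    return 100.0 * (visible.pass_rate - holdout.pass_rate)


# Made-up counts: every visible test passes, 30 of 40 holdout tests pass.
gap = reward_hacking_gap(SuiteResult(50, 50), SuiteResult(30, 40))
```

With these made-up counts the gap is 25 percentage points. How the benchmark aggregates per-task gaps across its tasks is not stated in the text above.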
