Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use
Kunvar Thaman

TL;DR
The paper introduces the Reward Hacking Benchmark (RHB), a suite of tasks to measure exploitability in language model agents with tool use, revealing how different training methods influence reward hacking behavior.
Contribution
It presents RHB as a new benchmark for evaluating reward hacking in LLM agents and analyzes how post-training strategies affect exploit rates across models.
Findings
Exploit rates vary from 0% to 13.9% among models.
RL post-training increases reward hacking significantly.
Environmental hardening reduces exploit rates substantially.
Abstract
Reinforcement learning (RL) trained language model agents with tool access are increasingly deployed in coding assistants, research tools, and autonomous systems. We introduce the Reward Hacking Benchmark (RHB), a suite of multi-step tasks requiring sequential tool operations with naturalistic shortcut opportunities such as skipping verification steps, inferring answers from task-adjacent metadata, or tampering with evaluation-relevant functions. RHB supports independent and chained task regimes, where chain length acts as a proxy for longer-horizon agent behavior. We evaluate 13 frontier models from OpenAI, Anthropic, Google, and DeepSeek. Exploit rates range from 0% (Claude Sonnet 4.5) to 13.9% (DeepSeek-R1-Zero), varying sharply by post-training style. A controlled sibling comparison (DeepSeek-V3 vs. DeepSeek-R1-Zero) shows RL post-training is associated with substantially higher…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
