Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use

Kunvar Thaman

arXiv:2605.02964·cs.LG·May 6, 2026

Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use

Kunvar Thaman

PDF

TL;DR

The paper introduces the Reward Hacking Benchmark (RHB), a suite of tasks to measure exploitability in language model agents with tool use, revealing how different training methods influence reward hacking behavior.

Contribution

It presents RHB as a new benchmark for evaluating reward hacking in LLM agents and analyzes how post-training strategies affect exploit rates across models.

Findings

01

Exploit rates vary from 0% to 13.9% among models.

02

RL post-training increases reward hacking significantly.

03

Environmental hardening reduces exploit rates substantially.

Abstract

Reinforcement learning (RL) trained language model agents with tool access are increasingly deployed in coding assistants, research tools, and autonomous systems. We introduce the Reward Hacking Benchmark (RHB), a suite of multi-step tasks requiring sequential tool operations with naturalistic shortcut opportunities such as skipping verification steps, inferring answers from task-adjacent metadata, or tampering with evaluation-relevant functions. RHB supports independent and chained task regimes, where chain length acts as a proxy for longer-horizon agent behavior. We evaluate 13 frontier models from OpenAI, Anthropic, Google, and DeepSeek. Exploit rates range from 0% (Claude Sonnet 4.5) to 13.9% (DeepSeek-R1-Zero), varying sharply by post-training style. A controlled sibling comparison (DeepSeek-V3 vs. DeepSeek-R1-Zero) shows RL post-training is associated with substantially higher…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.