TL;DR
This study systematically compares synthetic and in-the-wild reward hacking behaviors in code generation, revealing that synthetic data may not accurately reflect natural hacking and emphasizing the need for real-world data.
Contribution
It introduces a method to collect in-the-wild hacking trajectories and demonstrates that models trained on synthetic data do not generalize well to real-world hacking behaviors.
Findings
Synthetic-trained monitors fail to detect in-the-wild hacking behaviors.
Monitors trained on in-the-wild data generalize better to unseen hacking types.
Abstract
Reward hacking in code generation, where models exploit evaluation loopholes to obtain full reward without correctly solving the tasks, poses a critical challenge for Reinforcement Learning (RL) and the deployment of reasoning models. Existing studies have been conducted primarily on synthetic hacking trajectories. However, whether these synthetic behaviors faithfully represent naturally emerging hacking in the wild remains unclear. In this work, we present a systematic analysis of the synthetic vs. in-the-wild discrepancy in reward hacking. We examine to what extent hacking behaviors induced by prompting resemble those emerging during RL training, and whether monitors trained on synthetic trajectories generalize to naturally arising but previously unseen hacking. To scale up the curation of in-the-wild reward hacking trajectories, we modified Group Relative Policy Optimization (GRPO)…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
