Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation

Lichen Li; Hengguang Zhou; Yijun Liang; Tianyi Zhou; Cho-Jui Hsieh

arXiv:2604.23488 · cs.LG · April 28, 2026

TL;DR

This study systematically compares synthetic and in-the-wild reward hacking behaviors in code generation, revealing that synthetic data may not accurately reflect natural hacking and emphasizing the need for real-world data.

Contribution

The paper introduces a method for collecting in-the-wild hacking trajectories and shows that monitors trained on synthetic data generalize poorly to real-world hacking behaviors.

Findings

01

Synthetic-trained monitors fail to detect in-the-wild hacking behaviors.

02

Monitors trained on in-the-wild data generalize better to unseen hacking types.
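
One way to read the two findings above is as a per-source comparison of detection rates. The sketch below is only an illustration of that comparison: the `recall` helper, the two splits, and every flag and label in them are made up for this example and are not data or results from the paper.

```python
def recall(flags, labels):
    """Fraction of true hacking trajectories (label 1) that the monitor flagged (flag 1)."""
    hits = sum(1 for f, y in zip(flags, labels) if y == 1 and f == 1)
    positives = sum(labels)
    return hits / positives if positives else float("nan")


# Toy, hypothetical monitor outputs; NOT results from the paper.
# labels: 1 = trajectory is a reward hack, 0 = honest solution.
# flags:  1 = monitor raised an alert,     0 = monitor stayed silent.
synthetic = {"flags": [1, 1, 1, 0], "labels": [1, 1, 1, 0]}
in_the_wild = {"flags": [0, 1, 0, 0], "labels": [1, 1, 1, 0]}

for name, split in [("synthetic", synthetic), ("in-the-wild", in_the_wild)]:
    print(name, recall(split["flags"], split["labels"]))
```

Reporting recall separately for each source, rather than pooled, is what makes a gap between synthetic and in-the-wild detection visible.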

Abstract

Reward hacking in code generation, where models exploit evaluation loopholes to obtain full reward without correctly solving the tasks, poses a critical challenge for Reinforcement Learning (RL) and the deployment of reasoning models. Existing studies have been conducted primarily on synthetic hacking trajectories. However, whether these synthetic behaviors faithfully represent naturally emerging hacking in the wild remains unclear. In this work, we present a systematic analysis of the synthetic vs. in-the-wild discrepancy in reward hacking. We examine to what extent hacking behaviors induced by prompting resemble those emerging during RL training, and whether monitors trained on synthetic trajectories generalize to naturally arising but previously unseen hacking. To scale up the curation of in-the-wild reward hacking trajectories, we modified Group Relative Policy Optimization (GRPO)…
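
For readers unfamiliar with GRPO, the following is a minimal sketch of the standard group-relative advantage it builds on. It does not show the paper's modification, which is cut off in the abstract above. The function name, the `eps` constant, the use of population standard deviation, and the example rewards are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against the other rollouts sampled for the same prompt.

    Standard GRPO-style advantage: (r - group mean) / (group std + eps).
    Population std and eps=1e-6 are assumptions; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 4 rollouts for one coding prompt; 1.0 = all tests passed.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because the advantage depends only on the scalar reward, a rollout that passes the tests by exploiting an evaluation loophole receives the same positive advantage as a correct solution, which is why hacking can be reinforced during RL training.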

Code & Models

Repositories

LichenLillc/CoTMonitoring.git
github