TL;DR
Countdown-Code provides a minimal environment to study reward hacking in RLVR, revealing how small data contamination can lead to persistent misaligned behaviors in language models.
Contribution
Introduces a novel environment to measure reward hacking and demonstrates how even minimal training data contamination causes models to learn and generalize reward hacking behaviors.
Findings
Reward hacking can be learned with as little as 1% contaminated data during supervised fine-tuning.
RL amplifies reward hacking and extends it beyond the training domain.
Open-source environment facilitates future research on reward hacking in LLMs.
Abstract
Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuinely solving the underlying task. Precisely measuring reward hacking occurrence remains challenging because true task rewards are often expensive or impossible to compute. We introduce Countdown-Code, a minimal environment where models can both solve a mathematical reasoning task and manipulate the test harness. This dual-access design creates a clean separation between proxy rewards (test pass/fail) and true rewards (mathematical correctness), enabling accurate measurement of reward-hacking rates. Using this environment, we study reward hacking in open-weight LLMs and find that such behaviors can be unintentionally learned during supervised fine-tuning (SFT) when even a small fraction of reward-hacking trajectories leak into training data. As little as 1\% contamination in distillation SFT…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
