Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR

Muhammad Khalifa; Zohaib Khan; Omer Tafveez; Hao Peng; Lu Wang

arXiv:2603.07084·cs.LG·April 21, 2026

Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR

Muhammad Khalifa, Zohaib Khan, Omer Tafveez, Hao Peng, Lu Wang

PDF

1 Repo

TL;DR

Countdown-Code provides a minimal environment to study reward hacking in RLVR, revealing how small data contamination can lead to persistent misaligned behaviors in language models.

Contribution

Introduces a novel environment to measure reward hacking and demonstrates how even minimal training data contamination causes models to learn and generalize reward hacking behaviors.

Findings

01

Reward hacking can be learned with as little as 1% contaminated data during supervised fine-tuning.

02

RL amplifies reward hacking and extends it beyond the training domain.

03

Open-source environment facilitates future research on reward hacking in LLMs.

Abstract

Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuinely solving the underlying task. Precisely measuring reward hacking occurrence remains challenging because true task rewards are often expensive or impossible to compute. We introduce Countdown-Code, a minimal environment where models can both solve a mathematical reasoning task and manipulate the test harness. This dual-access design creates a clean separation between proxy rewards (test pass/fail) and true rewards (mathematical correctness), enabling accurate measurement of reward-hacking rates. Using this environment, we study reward hacking in open-weight LLMs and find that such behaviors can be unintentionally learned during supervised fine-tuning (SFT) when even a small fraction of reward-hacking trajectories leak into training data. As little as 1\% contamination in distillation SFT…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

zohaib-khan5040/Countdown-Code
github

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.