Reward Hacking in Rubric-Based Reinforcement Learning
Anas Mahmoud, MohammadHossein Rezaei, Zihao Wang, Anisha Gunjal, Bing Liu, Yunzhong He

TL;DR
This paper investigates reward hacking in rubric-based reinforcement learning, revealing how verifier strength influences exploitation and proposing diagnostics to detect divergence from true policy quality.
Contribution
The paper introduces a framework that separates two sources of divergence, verifier failure and rubric-design limitations, and proposes a verifier-free diagnostic to monitor policy alignment with true quality.
Findings
Weak verifiers lead to proxy rewards that do not transfer to reference verifiers.
Stronger verifiers reduce but do not eliminate reward hacking.
Even strong rubric-based verifiers can favor responses that rubric-free judges rate worse overall, which points to limitations of the rubric design itself.
Abstract
Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, though many open-ended settings rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verifier but evaluated against a cross-family panel of three frontier judges, reducing dependence on any single evaluator. Our framework separates two sources of divergence: verifier failure, where the training verifier credits rubric criteria that reference verifiers reject, and rubric-design limitations, where even strong rubric-based verifiers favor responses that rubric-free judges rate worse overall. Across medical and science domains, weak verifiers produce large proxy-reward gains that do not transfer to the reference verifiers; exploitation grows over training and concentrates in recurring failures such as…
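The notion of verifier failure described above (the training verifier credits a rubric criterion that the reference verifiers reject) can be made concrete with a small sketch. This is an illustration only, not the authors' implementation: the `CriterionVerdict` type, the `summarize` function, the metric names, and the use of a simple majority vote to aggregate the three reference judges are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CriterionVerdict:
    """Hypothetical record of verdicts for one rubric criterion."""
    train: bool        # verdict of the training verifier
    panel: List[bool]  # verdicts of the reference judges (e.g. three)


def panel_accepts(v: CriterionVerdict) -> bool:
    # Assumption: the reference panel is aggregated by strict majority vote.
    return 2 * sum(v.panel) > len(v.panel)


def summarize(verdicts: List[CriterionVerdict]) -> Dict[str, float]:
    """Compare proxy reward with reference reward over a set of criteria."""
    if not verdicts:
        raise ValueError("need at least one criterion verdict")
    n = len(verdicts)
    # Fraction of criteria the training verifier credits.
    proxy = sum(v.train for v in verdicts) / n
    # Fraction of criteria the reference panel credits.
    reference = sum(panel_accepts(v) for v in verdicts) / n
    # Fraction credited by the training verifier but rejected by the panel.
    exploited = sum(v.train and not panel_accepts(v) for v in verdicts) / n
    return {
        "proxy_reward": proxy,
        "reference_reward": reference,
        "exploit_rate": exploited,
    }


example = [
    CriterionVerdict(train=True, panel=[True, True, False]),
    CriterionVerdict(train=True, panel=[False, False, True]),
    CriterionVerdict(train=False, panel=[False, False, False]),
    CriterionVerdict(train=True, panel=[False, False, False]),
]
stats = summarize(example)
```

In this toy input the training verifier credits three of four criteria while the panel credits one, so the proxy reward overstates the reference reward and half of the criteria count as exploited. Tracking such a gap across training checkpoints is one plausible way to watch exploitation grow, under the assumptions stated above.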
