Detecting and Suppressing Reward Hacking with Gradient Fingerprints

Songtao Wang; Quang Hieu Pham; Fangcong Yin; Xinpeng Wang; Jocelyn Qiaochu Chen; Greg Durrett; Xi Ye

arXiv:2604.16242 · cs.LG · April 20, 2026

TL;DR

The paper introduces Gradient Fingerprint (GRIFT), a method that detects reward hacking in reinforcement learning from a model's internal gradient computations rather than from the text of its reasoning alone, and improves detection accuracy over existing methods.

Contribution

GRIFT is a gradient-based approach for identifying reward-hacking behavior in reasoning tasks. It outperforms existing detection methods, and integrating it reduces reward hacking and improves task performance.

Findings

01 GRIFT achieves over 25% relative improvement in reward hacking detection compared with existing methods.

02 Integrating GRIFT reduces reward hacking and improves task performance.

03 Gradient representations are effective for assessing the quality of reasoning traces.

Abstract

Reinforcement learning with verifiable rewards (RLVR) typically optimizes for outcome rewards without imposing constraints on intermediate reasoning. This leaves training susceptible to reward hacking, where models exploit loopholes (e.g., spurious patterns in training data) in the reward function to achieve high scores without solving the intended task. These reward-hacking behaviors are often implicit, as the intermediate chain-of-thought (CoT) may appear plausible on the surface, limiting the effectiveness of purely text-based monitoring. We propose Gradient Fingerprint (GRIFT), a method for detecting reward hacking using models' internal computations. Given a prompt and a model-generated CoT, GRIFT computes gradients of the CoT conditioned on the prompt and compresses them into a compact representation, which is then used to assess whether the CoT reflects reward hacking behavior.…
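The pipeline the abstract describes (take gradients of the CoT conditioned on the prompt, then compress them into a compact representation) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the model is a tiny bigram softmax model rather than an LLM, the compression is a fixed random Gaussian projection, and the cosine comparison at the end is only one hypothetical way to compare two fingerprints. The names `cot_gradient`, `fingerprint`, and `cosine` are hypothetical.

```python
import math
import random

VOCAB = 5  # toy vocabulary size (assumption)


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cot_gradient(weights, prompt, cot):
    """Gradient of -log p(cot | prompt) w.r.t. the bigram logits, flattened.

    Only CoT tokens contribute to the loss; the prompt supplies context.
    """
    grad = [[0.0] * VOCAB for _ in range(VOCAB)]
    prev = prompt[-1]  # a bigram model sees only the previous token
    for tok in cot:
        probs = softmax(weights[prev])
        for j in range(VOCAB):
            # d(-log p[tok]) / d(logit j) = p[j] - 1[j == tok]
            grad[prev][j] += probs[j] - (1.0 if j == tok else 0.0)
        prev = tok
    return [g for row in grad for g in row]


def fingerprint(grad, dim=4, seed=0):
    """Compress a flat gradient with a fixed random Gaussian projection."""
    rng = random.Random(seed)  # fixed seed: same projection for every trace
    scale = 1.0 / math.sqrt(dim)
    return [scale * sum(rng.gauss(0.0, 1.0) * g for g in grad) for _ in range(dim)]


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


weights = [[0.0] * VOCAB for _ in range(VOCAB)]  # untrained toy model
prompt = [0, 1]
fp_a = fingerprint(cot_gradient(weights, prompt, [2, 3]))
fp_b = fingerprint(cot_gradient(weights, prompt, [4, 4]))
similarity = cosine(fp_a, fp_b)  # hypothetical comparison of two traces
```

Because the projection seed is fixed, every trace is mapped through the same matrix, so fingerprints from different CoTs live in a shared low-dimensional space and can be compared directly. How GRIFT itself selects parameters, compresses gradients, and scores the result is not stated in the excerpt above.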

Code & Models

Repositories

songtao-x/reward_hack (GitHub)
