TL;DR
The paper introduces Gradient Fingerprint (GRIFT), a method that detects reward hacking in reinforcement learning from a model's internal gradient computations rather than from the text of its reasoning, with a reported relative improvement of over 25% in detection.
Contribution
GRIFT is a gradient-based approach for identifying reward hacking in reasoning tasks. It outperforms existing detection methods and, when integrated into training, reduces reward hacking.
Findings
GRIFT achieves over 25% relative improvement in reward hacking detection compared with existing methods.
Integrating GRIFT reduces reward hacking and improves task performance.
Gradient-based representations are an effective signal for assessing the quality of a reasoning trace.
Abstract
Reinforcement learning with verifiable rewards (RLVR) typically optimizes for outcome rewards without imposing constraints on intermediate reasoning. This leaves training susceptible to reward hacking, where models exploit loopholes (e.g., spurious patterns in training data) in the reward function to achieve high scores without solving the intended task. These reward-hacking behaviors are often implicit, as the intermediate chain-of-thought (CoT) may appear plausible on the surface, limiting the effectiveness of purely text-based monitoring. We propose Gradient Fingerprint (GRIFT), a method for detecting reward hacking using models' internal computations. Given a prompt and a model-generated CoT, GRIFT computes gradients of the CoT conditioned on the prompt and compresses them into a compact representation, which is then used to assess whether the CoT reflects reward hacking behavior.…
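The three steps named in the abstract (gradient of the CoT given the prompt, compression, assessment) can be illustrated with a toy sketch. Everything concrete below is an assumption made for illustration and is not taken from the paper: the bigram "model" standing in for a language model, the Gaussian random projection used as the compression step, the cosine comparison between fingerprints, and all names (`cot_gradient`, `fingerprint`, `VOCAB`, `DIM`).

```python
import math
import random

VOCAB = 6  # toy vocabulary size (assumption)
DIM = 4    # fingerprint size after compression (assumption)


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cot_gradient(weights, prompt, cot):
    """Gradient of log p(cot | prompt) w.r.t. a bigram logit table.

    Only CoT tokens contribute to the objective; the prompt supplies
    context (in this toy bigram model, just its last token).
    """
    grad = [[0.0] * VOCAB for _ in range(VOCAB)]
    prev = prompt[-1]
    for tok in cot:
        probs = softmax(weights[prev])
        for j in range(VOCAB):
            # d/d logit_j of log softmax(logits)[tok] = 1[j == tok] - p_j
            grad[prev][j] += (1.0 if j == tok else 0.0) - probs[j]
        prev = tok
    return [g for row in grad for g in row]  # flatten to one vector


def make_projection(seed=0):
    """Fixed random matrix used as a stand-in compression step (assumption)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(DIM)
    return [[rng.gauss(0.0, scale) for _ in range(VOCAB * VOCAB)]
            for _ in range(DIM)]


def fingerprint(weights, prompt, cot, projection):
    """Compress the flattened gradient into a DIM-dimensional vector."""
    flat = cot_gradient(weights, prompt, cot)
    return [sum(p * g for p, g in zip(row, flat)) for row in projection]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Toy usage: two different CoTs for the same prompt.
rng = random.Random(1)
weights = [[rng.gauss(0.0, 1.0) for _ in range(VOCAB)] for _ in range(VOCAB)]
projection = make_projection(seed=0)
prompt = [0, 1]
fp_a = fingerprint(weights, prompt, [2, 3, 4], projection)
fp_b = fingerprint(weights, prompt, [2, 2, 2], projection)
print(len(fp_a), cosine(fp_a, fp_b))
```

In a real setting the gradient would presumably come from automatic differentiation over a large model's parameters, and the assessment step would need labeled or clustered examples. Comparing fingerprints by cosine similarity here only shows how compact gradient vectors might be compared once computed.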
