Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards

Johannes Ackermann; Michael Noukhovitch; Takashi Ishida; Masashi Sugiyama

arXiv:2602.18037 · cs.LG · February 23, 2026

Open Access

TL;DR

This paper proposes gradient regularization (GR) as a way to prevent reward hacking in reinforcement learning from human feedback and from verifiable rewards. GR biases policy updates toward regions in which the reward is more accurate.

Contribution

The paper establishes a theoretical link between reward model accuracy and flatness of optima, and demonstrates how gradient regularization improves reward robustness and performance in language model training.

Findings

01. The gradient norm and reward accuracy are empirically correlated.

02. GR outperforms the KL penalty in diverse RL experiments.

03. GR prevents reward hacking in LLM tasks.

Abstract

Reinforcement Learning from Human Feedback (RLHF) and from Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs). A common problem is reward hacking, where the policy may exploit inaccuracies of the reward and learn an unintended behavior. Most previous works address this by limiting the policy update with a Kullback-Leibler (KL) penalty towards a reference model. We propose a different framing: train the LM in a way that biases policy updates towards regions in which the reward is more accurate. First, we derive a theoretical connection between the accuracy of a reward model and the flatness of an optimum at convergence. Gradient regularization (GR) can then be used to bias training to flatter regions and thereby maintain reward model accuracy. We confirm these results by showing that the gradient norm and reward accuracy are empirically correlated…
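To make the idea of biasing training toward flatter regions concrete, here is a minimal sketch of a gradient-norm penalty on a toy problem. The abstract as shown here does not give the paper's exact objective, so everything below is an assumption made for illustration: the penalty form (reward minus `lam` times the squared gradient norm), the quadratic toy rewards, and all function names are hypothetical and are not the authors' implementation.

```python
from typing import Callable, List

Vector = List[float]
RewardFn = Callable[[Vector], float]


def numerical_grad(f: RewardFn, theta: Vector, eps: float = 1e-5) -> Vector:
    """Central finite-difference gradient of f at theta."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


def grad_norm_sq(f: RewardFn, theta: Vector) -> float:
    """Squared Euclidean norm of the gradient of f at theta."""
    return sum(g * g for g in numerical_grad(f, theta))


def regularized_objective(reward: RewardFn, theta: Vector, lam: float) -> float:
    """Assumed GR objective: reward minus lam times the squared gradient norm."""
    return reward(theta) - lam * grad_norm_sq(reward, theta)


def make_quadratic_reward(curvature: float) -> RewardFn:
    """Toy reward with its optimum at the origin; larger curvature means a sharper optimum."""
    def reward(theta: Vector) -> float:
        return -0.5 * curvature * sum(t * t for t in theta)
    return reward


# Two toy rewards with the same optimum but different sharpness.
flat = make_quadratic_reward(1.0)
sharp = make_quadratic_reward(10.0)
theta = [0.5, -0.5]

flat_penalty = grad_norm_sq(flat, theta)
sharp_penalty = grad_norm_sq(sharp, theta)
print("flat penalty:", flat_penalty)
print("sharp penalty:", sharp_penalty)
print("regularized sharp objective:", regularized_objective(sharp, theta, 0.1))
```

For these quadratic rewards the gradient is `-curvature * theta`, so at the same distance from the optimum the penalty grows with the square of the curvature. An optimizer that maximizes the regularized objective is therefore pushed away from sharp regions more strongly than from flat ones, which is the qualitative effect the abstract attributes to GR.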

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Machine Learning and Data Classification