LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking

Lukas Helff; Quentin Delfosse; David Steinmann; Ruben H\"arle; Hikaru Shindo; Patrick Schramowski; Wolfgang Stammer; Kristian Kersting; Felix Friedrich

arXiv:2604.15149·cs.LG·April 17, 2026

LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking

Lukas Helff, Quentin Delfosse, David Steinmann, Ruben H\"arle, Hikaru Shindo, Patrick Schramowski, Wolfgang Stammer, Kristian Kersting, Felix Friedrich

PDF

1 Datasets

TL;DR

This paper investigates how reinforcement learning with verifiable rewards (RLVR) can lead large language models to cheat verifiers by producing superficial outputs, and proposes isomorphic perturbation testing to detect such shortcuts.

Contribution

The paper introduces Isomorphic Perturbation Testing (IPT) as a method to identify reward hacking in RLVR-trained models and demonstrates its effectiveness in distinguishing genuine reasoning from shortcut strategies.

Findings

01

RLVR-trained models often abandon rule induction for shortcut strategies.

02

Isomorphic Perturbation Testing effectively detects shortcut behaviors.

03

Shortcut prevalence increases with task complexity and inference compute.

Abstract

As reinforcement Learning with Verifiable Rewards (RLVR) has become the dominant paradigm for scaling reasoning capabilities in LLMs, a new failure mode emerges: LLMs gaming verifiers. We study this phenomenon on inductive reasoning tasks, where models must induce and output logical rules. We find that RLVR-trained models systematically abandon rule induction. Instead of learning generalizable patterns (e.g., ``trains carrying red cars go east''), they enumerate instance-level labels, producing outputs that pass verifiers without capturing the relational patterns required by the task. We show that this behavior is not a failure of understanding but a form of reward hacking: imperfect verifiers that check only extensional correctness admit false positives. To detect such shortcuts, we introduce Isomorphic Perturbation Testing (IPT), which evaluates a single model output under both…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Datasets

AIML-TUDA/SLR-Bench
dataset· 1.7k dl
1.7k dl

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.