Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward

Xuexiang Wen, Hang Yu, Linchao Zhu, and Gaoang Wang

arXiv:2605.09920·cs.LG·May 12, 2026

TL;DR

VIGOR introduces a verifier-free intrinsic reward for LLM reinforcement learning, based on gradient norms, improving performance and stability without domain-specific verifiers.

Contribution

The paper proposes an intrinsic reward based on gradient norms that improves LLM training efficiency and transferability without relying on external verifiers.

Findings

01

VIGOR outperforms RLIF on mathematical reasoning benchmarks.

02

VIGOR improves math accuracy by 3.31% and code accuracy by 1.91%.

03

VIGOR exhibits more stable training dynamics.

Abstract

While Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising post-training paradigm for Large Language Models (LLMs), its dependency on the gold label or domain-specific verifiers limits its scalability to new tasks and domains. In this work, we propose Verifier-free Intrinsic Gradient-Norm Reward (VIGOR), a simple reward that uses only the policy model itself. Given a prompt, VIGOR samples a group of completions and assigns higher within-group rewards to outputs that induce smaller $ℓ_{2}$ norms of the teacher-forced negative log-likelihood gradients under the current parameters. Intuitively, lower gradient norms suggest the completion aligns better with the current policy, serving as an intrinsic preference signal for policy optimization. To make this intrinsic signal practical for RL, we correct the systematic length bias of averaged token-level…
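
The reward described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a context-free softmax policy over a three-token vocabulary so that the gradient of the mean token-level negative log-likelihood has a closed form, and it assumes a z-score normalisation inside each group, which the abstract does not specify. The function names `nll_grad_norm` and `group_rewards` are hypothetical, and the length-bias correction is omitted because the abstract is truncated before describing it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll_grad_norm(logits, completion):
    """L2 norm of d(mean token NLL)/d(logits) for a toy context-free policy.

    Assumption: every token is drawn from the same softmax(logits), so the
    per-token gradient is probs - onehot(token), and the mean over tokens is
    probs minus the empirical token frequencies of the completion.
    """
    probs = softmax(logits)
    n = len(completion)
    grad = list(probs)
    for tok in completion:
        grad[tok] -= 1.0 / n
    return math.sqrt(sum(g * g for g in grad))


def group_rewards(logits, completions, eps=1e-8):
    """Within-group rewards: smaller gradient norm gives a higher reward.

    Assumption: rewards are the negated norms, z-scored inside the group.
    """
    raw = [-nll_grad_norm(logits, c) for c in completions]
    mean = sum(raw) / len(raw)
    std = math.sqrt(sum((r - mean) ** 2 for r in raw) / len(raw))
    return [(r - mean) / (std + eps) for r in raw]


# Hypothetical group of three sampled completions (token ids) for one prompt.
logits = [2.0, 0.0, -1.0]
group = [[0, 0, 0], [0, 1, 0], [2, 2, 1]]
rewards = group_rewards(logits, group)
print(rewards)
```

In this toy setting the completion built from tokens the policy already favours receives the highest reward and the completion built from unlikely tokens receives the lowest. With a real LLM the gradient would instead be taken with respect to the model parameters by backpropagating the teacher-forced loss.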

Code & Models

Repositories

ZJUSCL/VIGOR (GitHub)