TL;DR
VIGOR introduces a verifier-free intrinsic reward for LLM reinforcement learning, based on gradient norms, improving performance and stability without domain-specific verifiers.
Contribution
The paper proposes an intrinsic reward based on gradient norms that improves LLM training efficiency and transferability without relying on external verifiers.
Findings
VIGOR outperforms RLIF on mathematical reasoning benchmarks.
It improves math accuracy by 3.31% and code accuracy by 1.91%.
VIGOR exhibits more stable training dynamics.
Abstract
While Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising post-training paradigm for Large Language Models (LLMs), its dependency on the gold label or domain-specific verifiers limits its scalability to new tasks and domains. In this work, we propose Verifier-free Intrinsic Gradient-Norm Reward (VIGOR), a simple reward that uses only the policy model itself. Given a prompt, VIGOR samples a group of completions and assigns higher within-group rewards to outputs that induce smaller norms of the teacher-forced negative log-likelihood gradients under the current parameters. Intuitively, lower gradient norms suggest the completion aligns better with the current policy, serving as an intrinsic preference signal for policy optimization. To make this intrinsic signal practical for RL, we correct the systematic length bias of averaged token-level…
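The reward described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy bigram "policy" whose only parameters are a logit table `W`, so the gradient of the teacher-forced negative log-likelihood can be written by hand. The names `nll_grad_norm` and `group_rewards` are hypothetical, the within-group z-score normalization is an assumption borrowed from common group-based RL setups, and the length-bias correction mentioned in the abstract is omitted because its details are not given in the text above.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def nll_grad_norm(W, prompt_last, completion):
    """L2 norm of the gradient of the mean teacher-forced NLL w.r.t. W.

    Assumption: W[prev][nxt] is the logit of token `nxt` following token
    `prev` (a toy bigram policy standing in for an LLM).
    """
    V = len(W)
    T = len(completion)
    grad = [[0.0] * V for _ in range(V)]
    prev = prompt_last
    for tok in completion:
        p = softmax(W[prev])
        # d(-log p[tok]) / d logits = p - onehot(tok), averaged over T tokens
        for j in range(V):
            grad[prev][j] += (p[j] - (1.0 if j == tok else 0.0)) / T
        prev = tok  # teacher forcing: condition on the given token
    return math.sqrt(sum(g * g for row in grad for g in row))

def group_rewards(W, prompt_last, completions):
    """Smaller gradient norm gives a higher reward, standardized within the group."""
    raw = [-nll_grad_norm(W, prompt_last, c) for c in completions]
    mean = sum(raw) / len(raw)
    std = math.sqrt(sum((r - mean) ** 2 for r in raw) / len(raw))
    return [(r - mean) / (std + 1e-8) for r in raw]

# Toy policy over 3 tokens that strongly prefers the cycle 0 -> 1 -> 2 -> 0.
W = [[0.0, 4.0, 0.0],
     [0.0, 0.0, 4.0],
     [4.0, 0.0, 0.0]]
group = [[1, 2, 0], [2, 1, 0], [1, 2, 1]]  # sampled completions after token 0
rewards = group_rewards(W, prompt_last=0, completions=group)
```

In this toy setting the completion `[1, 2, 0]` follows the policy's preferred transitions, so it induces the smallest gradient norm and receives the highest reward in its group, while `[2, 1, 0]` receives the lowest. With a real LLM the gradient would be obtained by automatic differentiation rather than by hand.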
