Value-Gradient Hypothesis of RL for LLMs

Arip Asadulaev; Daniil Ognev; Karim Salta; Martin Takac

arXiv:2605.21654 · cs.LG · May 22, 2026

TL;DR

This paper introduces a value-gradient perspective to explain why critic-free reinforcement learning methods like PPO and GRPO are effective for fine-tuning large language models, and to identify when RL should yield the largest gains.

Contribution

It develops a theoretical framework connecting value gradients to critic-free RL updates and provides empirical insights into when RL yields the greatest improvements in LLMs.

Findings

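01. Under a differentiable rollout and additive-noise parameterization, the actor update is value-gradient-like in expectation.

02. For discrete transformer policies, autodifferentiation through attention approximates the value signal, with an error controlled by the sampling gap and policy entropy.

03. The paper proposes a criterion for when RL has the greatest impact along the pretraining trajectory.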

Abstract

Reinforcement learning substantially improves pretrained language models, but it remains understudied why critic-free methods such as PPO and GRPO work as well as they do, and when they should provide the largest gains. We develop a value-gradient perspective of critic-free RL for LLM post-training. First, under a differentiable rollout and additive-noise parameterization, we show that the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. Second, for discrete transformer policies, we show that autodifferentiation through attention produces empirical costates that approximate this value signal, with an error controlled by the sampling gap and policy entropy. These results motivate a decomposition of RL impact into value gradient signal and reachable reward headroom, yielding a criterion for…
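
The first claim in the abstract (costates from the backward pass match the value gradient in conditional expectation) can be illustrated on a toy problem. The sketch below is not the paper's setup: it assumes a hypothetical scalar closed-loop system s_{t+1} = c * s_t + w_t with additive Gaussian noise w_t and per-step reward r_t = -s_t^2, and all function names are invented for illustration. It writes the backward pass by hand and compares the Monte Carlo average of the costate at t = 0 with the closed-form value gradient for this toy case.

```python
import random


def rollout_costate(s0, c, noise_std, horizon, rng):
    """Sample one rollout and return the costate dR/ds_0 along that path.

    Assumed toy model (not from the paper): s_{t+1} = c * s_t + w_t,
    w_t ~ N(0, noise_std^2), reward r_t = -s_t**2 for t = 0..horizon.
    """
    # Forward pass: simulate the closed-loop trajectory.
    states = [s0]
    for _ in range(horizon):
        states.append(c * states[-1] + rng.gauss(0.0, noise_std))

    # Backward pass: lambda_t = dr_t/ds_t + c * lambda_{t+1}, lambda_{T+1} = 0.
    costate = 0.0
    for s in reversed(states):
        costate = -2.0 * s + c * costate
    return costate


def value_gradient(s0, c, horizon):
    """Closed-form dV/ds_0 for the toy model: -2 * s0 * sum_t c^(2t)."""
    return -2.0 * s0 * sum(c ** (2 * t) for t in range(horizon + 1))


def mean_costate(s0, c, noise_std, horizon, n_samples, seed=0):
    """Monte Carlo estimate of E[lambda_0 | s_0 = s0]."""
    rng = random.Random(seed)
    total = sum(
        rollout_costate(s0, c, noise_std, horizon, rng) for _ in range(n_samples)
    )
    return total / n_samples


if __name__ == "__main__":
    estimate = mean_costate(s0=1.0, c=0.8, noise_std=0.1, horizon=5, n_samples=20000)
    exact = value_gradient(s0=1.0, c=0.8, horizon=5)
    print(f"mean costate: {estimate:.4f}  value gradient: {exact:.4f}")
```

Each sampled costate is noisy, but its average over rollouts from the same start state should approach the value gradient. The linear-quadratic choice is only there so the value gradient has a closed form to compare against; the paper's transformer setting has no such closed form.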
