Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective

Shenghua He; Tian Xia; Xuan Zhou; Hui Wei

arXiv:2506.02553 · cs.LG · June 4, 2025

Open Access

TL;DR

This paper demonstrates that response-level rewards are sufficient for unbiased policy gradient estimation in online reinforcement learning for large language models, simplifying reward design and enabling more practical fine-tuning.

Contribution

The paper introduces the Trajectory Policy Gradient Theorem, which establishes that response-level rewards suffice for unbiased estimation of token-level policy gradients, and proposes the TRePO algorithm.

Findings

1. Response-level rewards enable unbiased token-level policy gradient estimation.

2. Popular algorithms like PPO and GRPO inherently model token-level rewards.

3. TRePO is a simpler, memory-efficient alternative to PPO with strong theoretical backing.

Abstract

We study a common challenge in reinforcement learning for large language models (LLMs): the Zero-Reward Assumption, where non-terminal actions (i.e., intermediate token generations) receive zero task-specific immediate reward, while only the final token receives a reward for the entire response. This assumption arises frequently in practice, as precise token-level rewards are often difficult or infeasible to obtain in LLM applications. In this work, we provide a unifying theoretical perspective. We introduce the Trajectory Policy Gradient Theorem, which shows that the policy gradient based on true, unknown token-level rewards can be unbiasedly estimated using only a response-level reward model, regardless of whether the Zero-Reward Assumption holds or not, for algorithms in the REINFORCE and Actor-Critic families. This result reveals that widely used methods such as PPO, GRPO, ReMax,…
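The abstract's central claim can be checked numerically in a toy setting. The sketch below is not the paper's TRePO algorithm, nor a statement of the Trajectory Policy Gradient Theorem; it only illustrates the standard REINFORCE fact that weighting every token's score by the full response reward gives the same expected gradient as weighting it by the token-level reward-to-go. It assumes a small tabular policy conditioned on position and previous token, and it assumes the response-level reward is the sum of hidden token-level rewards. The names `exact_gradients`, `theta`, and `token_reward` are hypothetical and chosen for illustration.

```python
import itertools
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def exact_gradients(theta, token_reward):
    """Enumerate every response and return two exact policy gradients.

    theta[t, prev, a]        : logit of token a at step t given previous token prev
    token_reward[t, prev, a] : hidden token-level reward (assumed, for illustration)
    """
    T, V, _ = theta.shape
    probs = softmax(theta)
    grad_response = np.zeros_like(theta)  # uses only the response-level reward
    grad_token = np.zeros_like(theta)     # uses token-level reward-to-go

    for y in itertools.product(range(V), repeat=T):
        prev, p_y = 0, 1.0                # token 0 doubles as the start context
        rewards, scores = [], []
        for t, a in enumerate(y):
            p_y *= probs[t, prev, a]
            rewards.append(token_reward[t, prev, a])
            # Gradient of log softmax w.r.t. the logits: onehot(a) - probs.
            score = -probs[t, prev].copy()
            score[a] += 1.0
            scores.append((t, prev, score))
            prev = a
        rewards = np.array(rewards)
        total = rewards.sum()                         # response-level reward
        reward_to_go = rewards[::-1].cumsum()[::-1]   # sum of rewards from step t on
        for (t, s, score), g in zip(scores, reward_to_go):
            grad_response[t, s] += p_y * total * score
            grad_token[t, s] += p_y * g * score
    return grad_response, grad_token

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 3, 3))
token_reward = rng.normal(size=(3, 3, 3))
g_resp, g_tok = exact_gradients(theta, token_reward)
print(np.allclose(g_resp, g_tok))
```

The two gradients agree because rewards earned before step t depend only on the prefix, and the expected score of the token at step t given that prefix is zero. The response-level estimator therefore needs no access to `token_reward` beyond its sum, although in sampled (non-enumerated) form it typically has higher variance.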

Taxonomy

Topics: Viral Infectious Diseases and Gene Expression in Insects · Open Source Software Innovations

Methods: Entropy Regularization · Focus · Proximal Policy Optimization · REINFORCE