Off-Policy Value-Based Reinforcement Learning for Large Language Models

Peng-Yuan Wang; Ziniu Li; Tian Xu; Bohan Yang; Tian-Shuo Liu; ChenYang Wang; Xiong-Hui Chen; Yi-Chen Li; Tianyun Yang; Congliang Chen; Yang Yu

arXiv:2603.23355·cs.LG·March 25, 2026

TL;DR

This paper introduces ReVal, a value-based off-policy reinforcement learning method for large language models, improving data efficiency and performance on reasoning benchmarks compared to traditional on-policy methods.

Contribution

ReVal is a novel Bellman-update-based framework enabling off-policy learning and efficient reuse of past trajectories for LLM training.

Findings

1. ReVal converges faster than on-policy methods.

2. ReVal outperforms GRPO on reasoning benchmarks.

3. ReVal improves training efficiency on large language models.

Abstract

Improving data utilization efficiency is critical for scaling reinforcement learning (RL) to long-horizon tasks where generating trajectories is expensive. However, the dominant RL methods for LLMs are largely on-policy: they update on each batch of data only once, discard it, and then collect fresh samples, resulting in poor sample efficiency. In this work, we explore an alternative value-based RL framework for LLMs that naturally enables off-policy learning. We propose ReVal, a Bellman-update-based method that combines stepwise signals capturing internal consistency with trajectory-level signals derived from outcome verification. ReVal naturally supports replay-buffer-based training, allowing efficient reuse of past trajectories. Experiments on standard mathematical reasoning benchmarks show that ReVal not only converges faster but also outperforms GRPO in final performance. On…
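The abstract contrasts on-policy training, where a batch is used once and discarded, with replay-buffer-based training driven by Bellman updates. The sketch below illustrates that general idea only. It is a generic textbook tabular Q-learning loop on a toy two-step chain, not ReVal's actual objective, and every name in it (`ReplayBuffer`, `bellman_target`, `q_update`, the toy transitions) is a hypothetical choice made for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO store of transitions (hypothetical helper, not from the paper)."""

    def __init__(self, capacity, seed=0):
        self.storage = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling; the same stored transition can be drawn again on later calls.
        k = min(batch_size, len(self.storage))
        return self.rng.sample(list(self.storage), k)


def bellman_target(reward, next_values, done, gamma=1.0):
    """One-step Bellman target: r + gamma * max_a' Q(s', a'), or just r at a terminal step."""
    if done:
        return reward
    return reward + gamma * max(next_values)


def q_update(q_table, batch, n_actions, lr=0.5, gamma=1.0):
    """Generic tabular Q-learning update over a replayed batch (assumed form, not ReVal's)."""
    for state, action, reward, next_state, done in batch:
        next_values = [q_table.get((next_state, a), 0.0) for a in range(n_actions)]
        target = bellman_target(reward, next_values, done, gamma)
        old = q_table.get((state, action), 0.0)
        q_table[(state, action)] = old + lr * (target - old)
    return q_table


# Toy chain: state 0 -> state 1 (reward 0), then state 1 -> terminal (reward 1).
buffer = ReplayBuffer(capacity=100)
buffer.add(0, 1, 0.0, 1, False)
buffer.add(1, 1, 1.0, 2, True)

q = {}
for _ in range(50):
    # The same two stored transitions are replayed on every pass.
    q_update(q, buffer.sample(2), n_actions=2)
```

Only two transitions are ever collected here, yet replaying them lets the terminal reward propagate back to the first state. An on-policy loop that discarded each batch after one update would need fresh rollouts to achieve the same propagation, which is the sample-efficiency gap the abstract describes.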

Taxonomy

Topics: Topic Modeling · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning