Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning

Yurun Yuan; Fan Chen; Zeyu Jia; Alexander Rakhlin; Tengyang Xie

arXiv:2505.15311 · cs.LG · November 13, 2025

Open Access

TL;DR

This paper introduces TBRM, a simple value-based, off-policy RL method for LLM reasoning. It outperforms policy-based methods on mathematical-reasoning benchmarks while dispensing with critics, importance-sampling ratios, and clipping.

Contribution

We propose TBRM, a trajectory-level Bellman residual minimization algorithm for LLMs, prove that it converges to the near-optimal KL-regularized policy, and show that it outperforms policy-based methods.

Findings

01. TBRM outperforms PPO and GRPO on standard mathematical-reasoning benchmarks.

02. TBRM requires only one rollout per prompt, reducing computational overhead.

03. The method provably converges to the near-optimal KL-regularized policy from arbitrary off-policy data.

Abstract

Policy-based methods currently dominate reinforcement learning (RL) pipelines for large language model (LLM) reasoning, leaving value-based approaches largely unexplored. We revisit the classical paradigm of Bellman Residual Minimization and introduce Trajectory Bellman Residual Minimization (TBRM), an algorithm that naturally adapts this idea to LLMs, yielding a simple yet effective off-policy algorithm that optimizes a single trajectory-level Bellman objective using the model's own logits as $Q$-values. TBRM removes the need for critics, importance-sampling ratios, or clipping, and operates with only one rollout per prompt. We prove convergence to the near-optimal KL-regularized policy from arbitrary off-policy data via an improved change-of-trajectory-measure analysis. Experiments on standard mathematical-reasoning benchmarks show that TBRM consistently outperforms policy-based…
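
To make the idea of a trajectory-level Bellman objective with logits as $Q$-values more concrete, here is a minimal sketch. It is not taken from the paper: the abstract does not state the exact objective, so the code assumes a generic soft (KL-regularized) Bellman residual with deterministic transitions, sums the per-token residuals along one rollout, and squares the total. The function names, the argument layout, and the temperature `beta` are all hypothetical.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_value(q_row, ref_logp_row, beta):
    """Assumed soft value: V(s) = beta * log sum_a pi_ref(a|s) * exp(Q(s,a) / beta)."""
    return beta * logsumexp([lp + q / beta for q, lp in zip(q_row, ref_logp_row)])


def trajectory_residual_loss(logits, ref_logprobs, actions, rewards, beta=1.0):
    """Squared sum of per-step soft Bellman residuals along one trajectory.

    logits[h][a]       : model logit at step h for token a, read as Q(s_h, a)
    ref_logprobs[h][a] : log-probability of token a under a reference policy
    actions[h]         : token actually generated at step h
    rewards[h]         : reward at step h (often zero except at the last step)

    This is an illustrative construction, not the paper's stated objective.
    """
    horizon = len(actions)
    total = 0.0
    for h in range(horizon):
        q_taken = logits[h][actions[h]]
        # The value after the final token is taken to be zero (terminal state).
        if h + 1 < horizon:
            v_next = soft_value(logits[h + 1], ref_logprobs[h + 1], beta)
        else:
            v_next = 0.0
        total += q_taken - rewards[h] - v_next
    return total ** 2
```

Under these assumptions the loss needs only the logits of a single rollout and a scalar reward: no separate critic network, no importance-sampling ratio, and no clipping appear anywhere in the computation. In practice the same quantity would be computed on batched tensors and minimized by gradient descent on the model parameters.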

Taxonomy

Topics: Natural Language Processing Techniques · Formal Methods in Verification

Methods: Entropy Regularization · Proximal Policy Optimization