Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning
Yurun Yuan, Fan Chen, Zeyu Jia, Alexander Rakhlin, Tengyang Xie

TL;DR
This paper introduces Trajectory Bellman Residual Minimization (TBRM), a simple value-based, off-policy RL method for LLM reasoning that outperforms policy-based methods on mathematical-reasoning benchmarks without critics, importance-sampling ratios, or clipping.
Contribution
We propose TBRM, a trajectory-level Bellman residual minimization algorithm for LLMs, prove that it converges to the near-optimal KL-regularized policy, and show that it outperforms policy-based methods on mathematical-reasoning benchmarks.
Findings
TBRM outperforms PPO and GRPO on mathematical-reasoning benchmarks.
TBRM requires only one rollout per prompt, reducing computational overhead.
The method provably converges to the near-optimal KL-regularized policy from arbitrary off-policy data.
Abstract
Policy-based methods currently dominate reinforcement learning (RL) pipelines for large language model (LLM) reasoning, leaving value-based approaches largely unexplored. We revisit the classical paradigm of Bellman Residual Minimization and introduce Trajectory Bellman Residual Minimization (TBRM), an algorithm that naturally adapts this idea to LLMs, yielding a simple yet effective off-policy algorithm that optimizes a single trajectory-level Bellman objective using the model's own logits as Q-values. TBRM removes the need for critics, importance-sampling ratios, or clipping, and operates with only one rollout per prompt. We prove convergence to the near-optimal KL-regularized policy from arbitrary off-policy data via an improved change-of-trajectory-measure analysis. Experiments on standard mathematical-reasoning benchmarks show that TBRM consistently outperforms policy-based…
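To make the idea of a single trajectory-level Bellman objective concrete, here is a minimal sketch of one plausible form of such a loss for a single rollout. It is an illustration under stated assumptions, not the paper's exact objective: it assumes the standard KL-regularized soft value V(s) = β log Σ_a π_ref(a|s) exp(Q(s,a)/β), treats per-token logits as Q-values, sums the per-step residuals Q(s_h, a_h) − r_h − V(s_{h+1}) along the trajectory, and squares the sum. The names `soft_value`, `trajectory_bellman_residual`, `q_rows`, `ref_logprob_rows`, and `beta` are hypothetical.

```python
import math


def soft_value(q_row, ref_logprobs, beta):
    """Soft value V(s) = beta * log sum_a pi_ref(a|s) * exp(Q(s,a) / beta).

    Assumed form for a KL-regularized objective; computed with the
    log-sum-exp trick for numerical stability.
    """
    terms = [lp + q / beta for q, lp in zip(q_row, ref_logprobs)]
    m = max(terms)
    return beta * (m + math.log(sum(math.exp(t - m) for t in terms)))


def trajectory_bellman_residual(q_rows, ref_logprob_rows, actions, rewards, beta):
    """Squared sum of per-step soft Bellman residuals along one trajectory.

    q_rows[h][a]           -- logit of token a at step h, read as Q(s_h, a)
    ref_logprob_rows[h][a] -- log pi_ref(a | s_h) under a reference policy
    actions[h], rewards[h] -- token taken and reward received at step h
    """
    horizon = len(actions)
    total = 0.0
    for h in range(horizon):
        q_sa = q_rows[h][actions[h]]
        if h + 1 < horizon:
            v_next = soft_value(q_rows[h + 1], ref_logprob_rows[h + 1], beta)
        else:
            v_next = 0.0  # assumption: value after the final token is zero
        total += q_sa - rewards[h] - v_next
    return total ** 2


# Toy two-step rollout over a two-token vocabulary with a uniform reference.
uniform = [math.log(0.5), math.log(0.5)]
loss = trajectory_bellman_residual(
    q_rows=[[1.0, 0.0], [1.0, 1.0]],
    ref_logprob_rows=[uniform, uniform],
    actions=[0, 0],
    rewards=[0.0, 1.0],  # sparse reward at the end of the rollout
    beta=1.0,
)
print(loss)
```

In this toy rollout the Q-values chosen along the trajectory are consistent with the rewards, so the residual is zero up to floating-point error. Note that the loss needs only one rollout and no importance weights, which is the property the abstract highlights; in practice one would compute it over batched logits in an autodiff framework rather than with Python lists.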
Taxonomy
Topics: Natural Language Processing Techniques · Formal Methods in Verification
Methods: Entropy Regularization · Proximal Policy Optimization
