TL;DR
This paper analyzes the statistical stability and generalization of Bellman residual minimization (BRM) in offline reinforcement learning, deriving O(1/n) on-average stability and excess risk bounds for SGDA-based BRM.
Contribution
It introduces a Lyapunov potential to analyze SGDA stability, achieving the first O(1/n) excess risk bounds for BRM without additional assumptions.
Findings
Achieves O(1/n) stability and excess risk bounds for BRM.
Results hold for neural networks and minibatch SGD.
Improves understanding of BRM's statistical behavior in offline RL.
Abstract
Offline reinforcement learning and offline inverse reinforcement learning aim to recover near-optimal value functions or reward models from a fixed batch of logged trajectories, yet current practice still struggles to enforce Bellman consistency. Bellman residual minimization (BRM) has emerged as an attractive remedy, as a globally convergent stochastic gradient descent-ascent (SGDA) based method for BRM has recently been discovered. However, its statistical behavior in the offline setting remains largely unexplored. In this paper, we close this statistical gap. Our analysis introduces a single Lyapunov potential that couples SGDA runs on neighbouring datasets and yields an O(1/n) on-average argument-stability bound, doubling the best known sample-complexity exponent for convex-concave saddle problems. The same stability constant translates into an O(1/n) excess risk bound for BRM, without…
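As a rough illustration of what SGDA on a BRM objective can look like, the sketch below runs single-sample SGDA on a toy tabular policy-evaluation problem. It assumes the common conjugate reformulation of the squared Bellman residual, min over V and max over nu of mean_i[2 nu(s_i) delta_i - nu(s_i)^2], with delta_i = r_i + gamma V(s'_i) - V(s_i). This formulation, the three-state deterministic chain, the step sizes, and all names (`sgda_brm`, `lr_primal`, `lr_dual`) are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def sgda_brm(states, rewards, next_states, n_states, gamma=0.5,
             lr_primal=0.02, lr_dual=0.05, steps=50_000, seed=0):
    """Toy SGDA on the saddle objective mean_i[2*nu[s_i]*delta_i - nu[s_i]**2].

    Hypothetical helper for illustration only; not the paper's algorithm.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)    # primal variable: tabular value estimate
    nu = np.zeros(n_states)   # dual variable: tracks the mean residual per state
    for _ in range(steps):
        i = rng.integers(len(states))        # minibatch of size one
        s, r, s2 = states[i], rewards[i], next_states[i]
        delta = r + gamma * V[s2] - V[s]     # sampled Bellman residual
        nu_s = nu[s]                         # freeze the dual value: simultaneous update
        # Ascent step on the dual: d/dnu[s] = 2*delta - 2*nu[s]
        nu[s] += lr_dual * (2.0 * delta - 2.0 * nu_s)
        # Descent step on the primal: d/dV[s2] = 2*gamma*nu[s], d/dV[s] = -2*nu[s]
        V[s2] -= lr_primal * 2.0 * gamma * nu_s
        V[s] += lr_primal * 2.0 * nu_s
    return V, nu


# Assumed toy data: deterministic chain 0 -> 1 -> 2 -> 0, reward 1 when leaving state 0.
n_states, n = 3, 300
states = np.arange(n) % n_states
next_states = (states + 1) % n_states
rewards = (states == 0).astype(float)

V, nu = sgda_brm(states, rewards, next_states, n_states)
# Exact solution of V = r + gamma * P V for this chain with gamma = 0.5.
V_true = np.array([8 / 7, 2 / 7, 4 / 7])
print("max abs error:", np.max(np.abs(V - V_true)))
```

The transitions are deterministic so that the sampled residuals vanish at the solution and the run is easy to check; with stochastic transitions a constant step size would leave the iterates hovering around the saddle point rather than converging exactly.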
Peer Reviews
Decision·Submitted to ICLR 2026
The paper is (to my knowledge) the first to consider a stability analysis that exploits the PL condition and strong concavity of the respective problem to bound generalization error in the primal and dual gaps. There are a lot of algebraic manipulations that deftly use the ghost index and the contraction properties of the outer and inner problems to establish bounds on generalization error. The application to Bellman residual optimization is noteworthy, although it borrows heavily from prior work.
1) My first concern is that results from Kang et al. 2025 are quoted inadequately, in a way that misleads the reader. Lines 230 and 231 say that Kang et al. 2025 proved that the PL condition is satisfied with respect to the parameters of the Q function (primal variables) when it is parameterized by a neural network. I read the prior paper. There are many caveats to the neural network result: it traces back to the result in https://arxiv.org/pdf/2003.00307, where the authors show that wide and deep neural nets sa
The problem of analyzing the excess risk bound for Bellman residual minimization does seem open so far.
- The comparison to existing work on approximate dynamic programming methods, e.g. projected Bellman equation-based approaches, seems inadequate. Is Bellman residual minimization the only way to address the difficulty of enforcing Bellman consistency? What other risk bounds exist when function approximation is incorporated, and how do these results compare?
- The techniques used seem to be standard, e.g. PL for analyzing SGDA etc. It seems unclear from the manuscript what are the tec
- Without requiring independence assumptions on the sample indices or variance reduction, the paper establishes an $O(1/n)$ on-average stability bound and, via stability-to-generalization transfer, an $O(1/n)$ generalization bound for BRM, doubling the exponent from $1/2$ to $1$ over prior work.
- The population excess risk is cleanly decomposed into an optimization term that decays with training and a sample-size-dominated statistical term, naturally aligning with standard minibatch SGDA.
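A decomposition of this kind can be sketched with the standard telescoping argument. The notation is assumed here for illustration and may differ from the paper's: $F$ is the population risk, $F_S$ the empirical risk on a dataset $S$ of size $n$ (taken to be a plain average of per-sample losses), $\theta_T$ the output after $T$ steps, $\hat\theta_S$ a minimizer of $F_S$, and $\theta^\star$ a minimizer of $F$.

$$
\begin{aligned}
F(\theta_T) - F(\theta^\star)
&= \underbrace{F(\theta_T) - F_S(\theta_T)}_{\text{generalization gap}}
 + \underbrace{F_S(\theta_T) - F_S(\hat\theta_S)}_{\text{optimization error}} \\
&\quad + \underbrace{F_S(\hat\theta_S) - F_S(\theta^\star)}_{\le\, 0}
 + \underbrace{F_S(\theta^\star) - F(\theta^\star)}_{\text{zero mean over } S}.
\end{aligned}
$$

Taking expectations over $S$ and the algorithm's randomness gives

$$
\mathbb{E}\big[F(\theta_T) - F(\theta^\star)\big]
\le \mathbb{E}\big[F(\theta_T) - F_S(\theta_T)\big]
 + \mathbb{E}\big[F_S(\theta_T) - F_S(\hat\theta_S)\big].
$$

In this simplified picture, an on-average stability bound controls the first term (the $O(1/n)$ statistical term) and the second term shrinks as training proceeds. The zero-mean step relies on $F_S$ being an unbiased average; a saddle-point objective such as BRM, analyzed through primal and dual gaps, needs more care than this sketch shows.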
- It would be helpful to add illustrative examples and comparisons to aid understanding (see Q 1 and 2).
- Sections 2 and 3 include substantial repetition of well-known material, and the exposition feels overly long. For example, the standard SGDA routine could be moved to the appendix for brevity.
