Offline RL with Smooth OOD Generalization in Convex Hull and its Neighborhood
Qingmao Yao, Zhichao Lei, Tianyuan Chen, Ziyue Yuan, Xuefan Chen, Jianxiang Liu, Faguo Wu, Xiao Zhang

TL;DR
This paper introduces a novel method called SQOG that improves out-of-distribution Q-value estimation in offline reinforcement learning by smoothing Q-values within the convex hull and its neighborhood, leading to better generalization and performance.
Contribution
The paper proposes the Smooth Bellman Operator (SBO) and the SQOG algorithm, which enhance Q-value generalization in OOD regions within the convex hull, addressing over-constraint issues of prior methods.
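For reference, the convex hull that the contribution refers to can be written with the standard definition below. The dataset notation $D = \{(s_i, a_i)\}_{i=1}^{n}$ is an assumption made here for illustration, and the paper's exact definition of the neighborhood is not reproduced.

$$
\mathrm{Conv}(D) = \left\{ \sum_{i=1}^{n} \lambda_i \,(s_i, a_i) \;:\; \lambda_i \ge 0,\ \sum_{i=1}^{n} \lambda_i = 1 \right\}
$$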
Findings
SQOG achieves near-accurate Q-value estimation.
SQOG outperforms existing methods on D4RL benchmarks.
The paper provides theoretical guarantees that SBO approximates true Q-values.
Abstract
Offline Reinforcement Learning (RL) struggles with distributional shifts, leading to the Q-value overestimation for out-of-distribution (OOD) actions. Existing methods address this issue by imposing constraints; however, they often become overly conservative when evaluating OOD regions, which constrains the Q-function generalization. This over-constraint issue results in poor Q-value estimation and hinders policy improvement. In this paper, we introduce a novel approach to achieve better Q-value estimation by enhancing Q-function generalization in OOD regions within Convex Hull and its Neighborhood (CHN). Under the safety generalization guarantees of the CHN, we propose the Smooth Bellman Operator (SBO), which updates OOD Q-values by smoothing them with neighboring in-sample Q-values. We theoretically show that SBO approximates true Q-values for both in-sample and OOD…
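The smoothing idea in the abstract can be illustrated with a toy sketch. This is not the authors' SBO or SQOG implementation: the nearest-neighbor lookup, the mixing weight `alpha`, and every function and variable name below are assumptions chosen only to show what "smoothing an OOD Q-value with a neighboring in-sample Q-value" could look like on a handful of points.

```python
import numpy as np

def nearest_in_sample_q(ood_points, data_points, data_q):
    """Return the Q-value of the nearest in-sample point for each OOD point.

    Hypothetical helper: uses squared Euclidean distance on concatenated
    (state, action) vectors.
    """
    # Pairwise squared distances, shape (n_ood, n_data).
    diff = ood_points[:, None, :] - data_points[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)
    return data_q[nearest]

def smooth_ood_q(ood_q, neighbor_q, alpha=0.5):
    """Pull each OOD Q-value toward its in-sample neighbor (assumed rule)."""
    return (1.0 - alpha) * ood_q + alpha * neighbor_q

# Two in-sample (state, action) points with known Q-values.
data_points = np.array([[0.0, 0.0], [1.0, 1.0]])
data_q = np.array([1.0, 3.0])

# Two nearby OOD points whose current estimates are badly off.
ood_points = np.array([[0.1, -0.1], [0.9, 1.2]])
ood_q = np.array([10.0, -5.0])

neighbor_q = nearest_in_sample_q(ood_points, data_points, data_q)
smoothed = smooth_ood_q(ood_q, neighbor_q, alpha=0.5)
print(smoothed)
```

With `alpha=0.5`, each OOD estimate moves halfway toward the Q-value of its closest in-sample point, so both the overestimate and the underestimate are pulled back toward in-sample values.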
Peer Reviews
Decision: ICLR 2025 Poster
The paper is well written and the structure is clear. The idea is supported by both theoretical justification and relatively sufficient empirical study. ---- The authors made a great effort in the rebuttal process and addressed all my concerns.
Using convex hulls and nearest neighbors in reinforcement learning is not a new technique, yet the current paper does not discuss those methods. Giving credit to the existing work will not hurt the novelty of this paper but will help readers have a clearer understanding of the advancements in the field. [1] Sun, Hao, et al. "Accountability in offline reinforcement learning: Explaining decisions with a corpus of examples." Advances in Neural Information Processing Systems 37 (2023). [2] Lyu, Ji
- First, I find that the problem tackled is important and very well motivated. I found in particular that the introduction did a great job of exposing to the reader the problem of over-constraining the policy. The simple but logical explanation is rooted in the design of existing policies that completely avoid generalizing to OOD regions. Having described the problem, the authors are then clear in their ambition: leveraging the part of the state space where neural networks actually are able to generalize
Unfortunately, I found that the paper lacked clarity and mathematical rigor at several crucial moments, which severely hurt my understanding of the contribution. - Most importantly, I believe there is a mistake with the definition of perhaps the most central object of the paper, CHN. As it is defined, by construction, $N(Conv(D)) \subset (S,A)_D \subset Conv(D)$, therefore $CHN(D) = Conv(D)$. It is not clear how to change the existing definition to get to the one that seems intended by the autho
1. The paper is well-written and easy to follow. 2. SQOG achieves strong performance on the D4RL benchmark, remaining relatively simple, straightforward, and computationally efficient. 3. The sanity check on the Inverted Double Pendulum task empirically shows SBO’s effectiveness in alleviating the over-constraint issue, which is a valuable addition.
1. My key concern is the lack of discussion on the relationship between SBO and behavior cloning loss. While the paper claims that SBO alleviates the over-constraint issue in existing offline methods, it’s unclear to me whether SBO’s effectiveness is limited to use with TD3+BC. Further discussion on this point would be beneficial. 2. The Inverted Double Pendulum task is relatively simple and differs from locomotion tasks, raising questions about whether SQOG can maintain accurate Q-value es
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)
