Implicit Constraint-Aware Off-Policy Correction for Offline Reinforcement Learning
Ali Baheri

TL;DR
This paper presents a novel off-policy correction method for offline reinforcement learning that incorporates structural priors directly into Bellman updates, improving policy accuracy and respecting domain constraints.
Contribution
It introduces a framework that embeds structural priors into Bellman updates via a proximal projection, ensuring contraction, uniqueness, and exact enforcement of constraints.
Findings
Eliminates monotonicity violations in synthetic auction data.
Outperforms conservative and implicit Q-learning in return and sample efficiency.
Maintains a $\gamma$-contraction and has a unique fixed point.
Abstract
Offline reinforcement learning promises policy improvement from logged interaction data alone, yet state-of-the-art algorithms remain vulnerable to value over-estimation and to violations of domain knowledge such as monotonicity or smoothness. We introduce implicit constraint-aware off-policy correction, a framework that embeds structural priors directly inside every Bellman update. The key idea is to compose the optimal Bellman operator with a proximal projection on a convex constraint set, which produces a new operator that (i) remains a $\gamma$-contraction, (ii) possesses a unique fixed point, and (iii) enforces the prescribed structure exactly. A differentiable optimization layer solves the projection; implicit differentiation supplies gradients for deep function approximators at a cost comparable to implicit Q-learning. On a synthetic Bid-Click auction -- where the true value is…
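The composition described in the abstract can be illustrated in a small tabular setting. The sketch below is not the paper's implementation: it assumes a hypothetical random MDP, takes the constraint set to be Q-functions that are non-decreasing along an ordered state axis, and swaps the differentiable optimization layer for an exact pool-adjacent-violators projection. All names (`project_monotone`, `bellman_optimal`, `constrained_bellman`, `P`, `R`) are illustrative.

```python
import numpy as np

def project_monotone(y):
    """Euclidean projection of a 1-D array onto non-decreasing sequences
    (pool-adjacent-violators algorithm)."""
    blocks = []  # each block stores [sum of values, number of values]
    for val in y:
        blocks.append([float(val), 1])
        # Merge the last two blocks while their means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return np.concatenate([np.full(c, s / c) for s, c in blocks])

def bellman_optimal(Q, P, R, gamma):
    """Optimal Bellman operator for a tabular MDP.
    P has shape (S, A, S), R and Q have shape (S, A)."""
    return R + gamma * (P @ Q.max(axis=1))

def constrained_bellman(Q, P, R, gamma):
    """Bellman backup followed by projection onto the constraint set.
    Assumption: the prior is monotonicity in the state index, per action."""
    TQ = bellman_optimal(Q, P, R, gamma)
    cols = [project_monotone(TQ[:, a]) for a in range(TQ.shape[1])]
    return np.stack(cols, axis=1)

# Hypothetical random MDP used only for illustration.
rng = np.random.default_rng(0)
S, A, gamma = 6, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)  # normalize to valid transition rows
R = rng.random((S, A))

# Fixed-point iteration with the composed operator.
Q = np.zeros((S, A))
for _ in range(500):
    Q = constrained_bellman(Q, P, R, gamma)

# Every iterate, and hence the limit, satisfies the constraint exactly.
print("monotone in state:", bool(np.all(np.diff(Q, axis=0) >= 0)))
print("fixed point:", bool(np.allclose(Q, constrained_bellman(Q, P, R, gamma))))
```

In this particular sketch the projection is non-expansive in the sup norm, so the composed operator inherits the $\gamma$-contraction of the Bellman backup and the iteration converges to a single fixed point. Other constraint sets or projection norms may need a separate argument.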
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Q-Learning
