Monotone and Conservative Policy Iteration Beyond the Tabular Case
S.R. Eshwar, Gugan Thoppe, Ananyabrata Barua, Aditya Gopalan, Gal Dalal

TL;DR
This paper introduces Reliable Policy Iteration (RPI) and Conservative RPI (CRPI), policy iteration variants that retain the guarantees of the tabular setting under function approximation.
Contribution
The paper develops RPI and CRPI, which carry the guarantees of tabular policy iteration over to function approximation: value estimates that improve monotonically and provably lower-bound the true return.
Findings
RPI restores monotonicity of value estimates.
CRPI provides per-step improvement bounds.
In initial simulations, RPI and CRPI outperform PI and its variants.
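For context, the textbook monotonicity named in the first finding is the property of exact tabular policy iteration that each new policy's value is at least the previous one in every state. The sketch below illustrates only that tabular baseline on a small random MDP; it is not the paper's RPI, and the function names, array shapes, and the random MDP are assumptions made for illustration.

```python
import numpy as np

def evaluate(P, R, policy, gamma):
    """Exact tabular policy evaluation: solve (I - gamma * P_pi) v = r_pi.

    P has shape (S, A, S), R has shape (S, A), policy has shape (S,).
    """
    n_states = P.shape[0]
    idx = np.arange(n_states)
    P_pi = P[idx, policy]   # transition matrix under the policy, shape (S, S)
    r_pi = R[idx, policy]   # reward vector under the policy, shape (S,)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def policy_iteration(P, R, gamma=0.9, max_iters=100):
    """Textbook policy iteration; returns the final policy and all value vectors."""
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)
    values = []
    for _ in range(max_iters):
        v = evaluate(P, R, policy, gamma)
        values.append(v)
        q = R + gamma * (P @ v)          # action values, shape (S, A)
        new_policy = q.argmax(axis=1)    # greedy improvement step
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, values

def is_monotone(values, tol=1e-9):
    """True if every successive value vector dominates the previous one."""
    return all(np.all(b >= a - tol) for a, b in zip(values, values[1:]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 6, 3
    P = rng.random((S, A, S))
    P /= P.sum(axis=2, keepdims=True)    # normalize rows into distributions
    R = rng.random((S, A))
    policy, values = policy_iteration(P, R)
    print("iterations:", len(values), "monotone:", is_monotone(values))
```

With exact evaluation this check passes by the policy improvement theorem. Replacing `evaluate` with a projected or otherwise approximate estimate is where the guarantee can fail, which is the gap the summary describes.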
Abstract
We introduce Reliable Policy Iteration (RPI) and Conservative RPI (CRPI), variants of Policy Iteration (PI) and Conservative PI (CPI), that retain tabular guarantees under function approximation. RPI uses a novel Bellman-constrained optimization for policy evaluation. We show that RPI restores the textbook monotonicity of value estimates and that these estimates provably lower-bound the true return; moreover, their limit partially satisfies the unprojected Bellman equation. CRPI shares RPI's evaluation, but updates policies conservatively by maximizing a new performance-difference lower bound that explicitly accounts for function-approximation-induced errors. CRPI inherits RPI's guarantees and, crucially, admits per-step improvement bounds. In initial simulations, RPI and CRPI outperform PI and its variants. Our work addresses a foundational gap in…
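As background for the conservative update and the performance-difference bound mentioned in the abstract, the classical tabular statements from the CPI literature can be written as follows. This is the standard identity and mixture update, not the new lower bound that the paper derives; the notation (start distribution $\mu$, normalized discounted state-visitation distribution $d^{\pi'}_{\mu}$, advantage $A^{\pi}$, step size $\alpha$) is an assumption chosen here for illustration.

```latex
% Performance difference identity for discount factor \gamma \in [0, 1):
V^{\pi'}(\mu) - V^{\pi}(\mu)
  = \frac{1}{1-\gamma}\,
    \mathbb{E}_{s \sim d^{\pi'}_{\mu}}\,
    \mathbb{E}_{a \sim \pi'(\cdot \mid s)}
    \bigl[ A^{\pi}(s, a) \bigr],
\qquad
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s).

% Conservative (mixture) policy update with step size \alpha \in [0, 1]:
\pi_{\text{new}}(a \mid s)
  = (1 - \alpha)\, \pi(a \mid s) + \alpha\, \pi'(a \mid s).
```

In CPI, $\alpha$ is chosen to maximize a lower bound on the left-hand side. By the abstract's description, CRPI does the analogous thing with a bound that additionally accounts for function-approximation-induced errors.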
