On the Convergence Rate of Off-Policy Policy Optimization Methods with Density-Ratio Correction
Jiawei Huang, Nan Jiang

TL;DR
This paper analyzes the convergence rates of off-policy policy optimization algorithms with density-ratio correction, proposing two strategies with finite-time guarantees and optimal or near-optimal convergence rates.
Contribution
It introduces two new algorithms, P-SREDA and O-SPIM, with proven convergence rates for off-policy policy improvement under function approximation.
Findings
P-SREDA has an optimal convergence rate of $O(\epsilon^{-3})$.
O-SPIM converges to a stationary point with rate $O(\epsilon^{-4})$.
The methods provide finite-time convergence guarantees for off-policy algorithms.
Abstract
In this paper, we study the convergence properties of off-policy policy improvement algorithms with state-action density ratio correction in the function approximation setting, where the objective function is formulated as a max-max-min optimization problem. We characterize the bias of the learning objective and present two strategies with finite-time convergence guarantees. In our first strategy, we present algorithm P-SREDA with convergence rate $O(\epsilon^{-3})$, whose dependency on $\epsilon$ is optimal. In our second strategy, we propose a new off-policy actor-critic style algorithm named O-SPIM. We prove that O-SPIM converges to a stationary point with total complexity $O(\epsilon^{-4})$, which matches the convergence rate of some recent actor-critic algorithms in the on-policy setting.
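To make "state-action density ratio correction" concrete, the sketch below checks the underlying identity on a toy tabular MDP: reweighting rewards drawn from the data distribution $d^D$ by $w(s,a) = d^\pi(s,a) / d^D(s,a)$ recovers the return of the target policy $\pi$. It is not P-SREDA or O-SPIM, and it is not taken from the paper. The MDP numbers, the helper names `occupancy` and `policy_value`, and the assumption that the data come from a uniform behavior policy `mu` are all hypothetical. The ratio is also computed exactly here, whereas the paper's setting learns it with function approximation.

```python
import numpy as np

def occupancy(P, pi, nu0, gamma):
    """Normalized discounted state-action occupancy d^pi(s, a) of a tabular MDP.

    P: (S, A, S) transition probabilities, pi: (S, A) policy,
    nu0: (S,) initial state distribution.
    """
    S = pi.shape[0]
    # State-to-state kernel under pi: P_pi[s, t] = sum_a pi(a|s) P(t|s, a).
    P_pi = np.einsum("sa,sat->st", pi, P)
    # The state occupancy solves d = (1 - gamma) * nu0 + gamma * P_pi^T d.
    d_s = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * nu0)
    return d_s[:, None] * pi

def policy_value(P, R, pi, nu0, gamma):
    """Expected discounted return J(pi), from the Bellman equation for V^pi."""
    S = pi.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return nu0 @ V

# Hypothetical toy MDP with 2 states and 2 actions (illustrative numbers only).
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
nu0 = np.array([1.0, 0.0])

mu = np.full((2, 2), 0.5)                 # assumed behavior policy generating the data
pi = np.array([[0.9, 0.1], [0.2, 0.8]])   # target policy being evaluated

d_D = occupancy(P, mu, nu0, gamma)        # data distribution
d_pi = occupancy(P, pi, nu0, gamma)       # target occupancy
w = d_pi / d_D                            # exact state-action density ratio

# Density-ratio-corrected return: E_{d^D}[w * r] / (1 - gamma).
corrected = (d_D * w * R).sum() / (1 - gamma)
exact = policy_value(P, R, pi, nu0, gamma)
print(np.isclose(corrected, exact))
```

In practice $w$ is unknown and must be estimated from off-policy data, which is where the inner optimization over the ratio and value function classes in the abstract's max-max-min objective comes in.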
Taxonomy
Topics: Reinforcement Learning in Robotics · Optimization and Search Problems · Machine Learning and Algorithms
