Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

Tianhao Wu; Yunchang Yang; Han Zhong; Liwei Wang; Simon S. Du; Jiantao Jiao

arXiv:2112.10935 · cs.LG · December 6, 2022

Open Access

TL;DR

This paper introduces RPO-SAT, a policy optimization algorithm for tabular reinforcement learning that achieves a nearly minimax optimal regret bound and carries a "Stable at Any Time" guarantee.

Contribution

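Proposes RPO-SAT, the first computationally efficient, nearly minimax optimal policy-based algorithm for tabular RL, with a "Stable at Any Time" guarantee. When S > H, it closes the gap between the previous best policy-based regret bound and the information-theoretic lower bound, up to logarithmic factors.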

Findings

01

Achieves regret of $\tilde{O}(\sqrt{SAH^{3}K} + \sqrt{AH^{4}K})$

02

Minimax optimal when S > H, ignoring logarithmic factors

03

First efficient policy-based algorithm with near-optimal regret in tabular RL
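
Finding 02 can be checked with a one-line comparison of the two terms in the regret bound. The derivation below is a sketch that ignores constants and logarithmic factors; it uses only the upper bound $\tilde{O}(\sqrt{SAH^{3}K} + \sqrt{AH^{4}K})$ and the lower bound $\tilde{\Omega}(\sqrt{SAH^{3}K})$ quoted on this page.

```latex
\begin{align*}
S > H
  &\;\Longrightarrow\; AH^{4}K = H \cdot AH^{3}K < S \cdot AH^{3}K = SAH^{3}K \\
  &\;\Longrightarrow\; \sqrt{SAH^{3}K} + \sqrt{AH^{4}K} < 2\sqrt{SAH^{3}K}.
\end{align*}
```

So when $S > H$ the second term is dominated by the first, and the upper bound matches the lower bound up to a constant factor of 2 and logarithmic terms.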

Abstract

Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. However, theoretical understanding of these methods remains insufficient. Even in the episodic (time-inhomogeneous) tabular setting, the state-of-the-art regret bound for policy-based methods, due to Shani et al. (2020), is only $\tilde{O}(\sqrt{S^{2} A H^{4} K})$, where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon, and $K$ is the number of episodes. This leaves a $\sqrt{SH}$ gap compared with the information-theoretic lower bound $\tilde{\Omega}(\sqrt{S A H^{3} K})$. To bridge this gap, we propose a novel algorithm, Reference-based Policy Optimization with Stable at Any Time guarantee (RPO-SAT), which features the property "Stable at Any Time". We prove that our algorithm achieves $\tilde{O}(\sqrt{S A H^{3} K} + \sqrt{A H^{4} K})$ regret. When $S > H$,…
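
To make the size of the gap concrete, the following sketch evaluates the three rates from the abstract with constants and logarithmic factors dropped. The function names and the values of S, A, H, and K are illustrative assumptions, not taken from the paper; the numbers only show how the rates scale, not actual regret.

```python
import math

# Rates from the abstract, with constants and log factors dropped.
# Function names and the sample values below are illustrative assumptions.

def prior_policy_rate(S, A, H, K):
    """sqrt(S^2 A H^4 K): earlier policy-based bound (Shani et al., 2020)."""
    return math.sqrt(S**2 * A * H**4 * K)

def rpo_sat_rate(S, A, H, K):
    """sqrt(S A H^3 K) + sqrt(A H^4 K): bound stated for RPO-SAT."""
    return math.sqrt(S * A * H**3 * K) + math.sqrt(A * H**4 * K)

def lower_bound_rate(S, A, H, K):
    """sqrt(S A H^3 K): information-theoretic lower bound."""
    return math.sqrt(S * A * H**3 * K)

# Hypothetical problem size with S > H.
S, A, H, K = 100, 5, 10, 10_000

lower = lower_bound_rate(S, A, H, K)
gap_prior = prior_policy_rate(S, A, H, K) / lower  # equals sqrt(S * H)
gap_new = rpo_sat_rate(S, A, H, K) / lower         # equals 1 + sqrt(H / S)

print(f"prior bound / lower bound:   {gap_prior:.2f}")
print(f"RPO-SAT bound / lower bound: {gap_new:.2f}")
```

The first ratio grows as $\sqrt{SH}$, which is the gap the abstract describes, while the second ratio is $1 + \sqrt{H/S}$ and stays below 2 whenever $S > H$.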

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adversarial Robustness in Machine Learning