Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs

Lior Shani; Yonathan Efroni; Shie Mannor

arXiv:1909.02769 · cs.LG · December 13, 2019

TL;DR

This paper provides a theoretical analysis of Trust Region Policy Optimization (TRPO), showing its connection to convex optimization methods, establishing convergence rates in planning and sample-based settings, and demonstrating faster rates with regularization in Reinforcement Learning.

Contribution

The paper shows that TRPO's adaptive scaling mechanism corresponds to trust-region methods from convex analysis, proves convergence rates for TRPO in both the planning and sample-based RL settings, and establishes faster rates for regularized MDPs.

Findings

01

TRPO's adaptive scaling aligns with convex trust-region methods.

02

Sample-based TRPO converges at a rate of $\frac{1}{\sqrt{N}}$.

03

Regularized MDPs enable $\frac{1}{N}$ rates, a first in RL.

Abstract

Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL) in which a surrogate problem, one that restricts consecutive policies to be 'close' to one another, is iteratively solved. Nevertheless, TRPO has been considered a heuristic algorithm inspired by Conservative Policy Iteration (CPI). We show that the adaptive scaling mechanism used in TRPO is in fact the natural "RL version" of traditional trust-region methods from convex analysis. We first analyze TRPO in the planning setting, in which we have access to the model and the entire state space. Then, we consider sample-based TRPO and establish an $\tilde{O}(1/\sqrt{N})$ convergence rate to the global optimum. Importantly, the adaptive scaling mechanism allows us to analyze TRPO in regularized MDPs, for which we prove fast rates of $\tilde{O}(1/N)$, much like results…
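
To make the "trust-region" view in the abstract concrete, the following is a minimal sketch of a KL-proximal policy update in the planning setting, where the model is known. It is an illustration under stated assumptions, not the authors' algorithm: it assumes a small tabular MDP, uses rewards to be maximized rather than costs, evaluates the policy by fixed-point iteration, and picks a step size proportional to $1/\sqrt{k+1}$ as a placeholder schedule. The names `evaluate_q`, `kl_proximal_step`, and `run` are hypothetical.

```python
import math

def evaluate_q(P, r, pi, gamma, sweeps=500):
    """Approximate Q^pi by iterating the Bellman expectation equation.

    P[s][a][t] is a transition probability, r[s][a] a reward,
    pi[s][a] the probability of action a in state s.
    """
    S, A = len(r), len(r[0])

    def backup(s, a, v):
        # One-step lookahead: reward plus discounted expected next value.
        return r[s][a] + gamma * sum(P[s][a][t] * v[t] for t in range(S))

    v = [0.0] * S
    for _ in range(sweeps):
        v = [sum(pi[s][a] * backup(s, a, v) for a in range(A)) for s in range(S)]
    return [[backup(s, a, v) for a in range(A)] for s in range(S)]

def kl_proximal_step(pi_s, q_s, step):
    """Closed-form maximizer over the simplex of <p, q_s> - KL(p || pi_s) / step.

    The solution is p(a) proportional to pi_s(a) * exp(step * q_s(a)).
    """
    logits = [math.log(p) + step * q for p, q in zip(pi_s, q_s)]
    m = max(logits)  # subtract the max for numerical stability
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    return [x / z for x in w]

def run(P, r, gamma=0.9, iters=200):
    """Repeat: evaluate the current policy, then take a KL-proximal step per state."""
    S, A = len(r), len(r[0])
    pi = [[1.0 / A] * A for _ in range(S)]  # start from the uniform policy
    for k in range(iters):
        q = evaluate_q(P, r, pi, gamma)
        step = (1.0 - gamma) / math.sqrt(k + 1)  # assumed schedule, not from the paper
        pi = [kl_proximal_step(pi[s], q[s], step) for s in range(S)]
    return pi
```

The KL term keeps each new policy close to the previous one, which is the sense in which consecutive policies are restricted to be "close" in the surrogate problem. Calling `run(P, r)` on a toy MDP where one action has a strictly higher reward in every state should shift probability mass toward that action as iterations proceed.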

Taxonomy

Methods: Trust Region Policy Optimization