Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs
Lior Shani, Yonathan Efroni, Shie Mannor

TL;DR
This paper provides a theoretical analysis of Trust Region Policy Optimization (TRPO), showing its connection to convex optimization methods, establishing convergence rates in planning and sample-based settings, and demonstrating faster rates with regularization in Reinforcement Learning.
Contribution
The paper shows that TRPO's adaptive scaling mechanism is the RL counterpart of trust-region methods from convex analysis, proves convergence rates for TRPO in both the planning and sample-based settings, and establishes faster rates for regularized MDPs, a first in RL.
Findings
TRPO's adaptive scaling aligns with convex trust-region methods.
Sample-based TRPO converges at a rate of $\frac{1}{\sqrt{N}}$.
Regularized MDPs enable $\frac{1}{N}$ rates, a first in RL.
Abstract
Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL) in which a surrogate problem, that restricts consecutive policies to be 'close' to one another, is iteratively solved. Nevertheless, TRPO has been considered a heuristic algorithm inspired by Conservative Policy Iteration (CPI). We show that the adaptive scaling mechanism used in TRPO is in fact the natural "RL version" of traditional trust-region methods from convex analysis. We first analyze TRPO in the planning setting, in which we have access to the model and the entire state space. Then, we consider sample-based TRPO and establish convergence rate to the global optimum. Importantly, the adaptive scaling mechanism allows us to analyze TRPO in regularized MDPs for which we prove fast rates of $\tilde{O}(1/N)$, much like results…
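To make the "surrogate problem that keeps consecutive policies close" concrete, here is a minimal sketch of a KL-proximal (mirror descent style) policy update in the planning setting. It is an illustration under stated assumptions, not the authors' implementation: it assumes a small tabular MDP with known transitions `P[s, a, s']` and costs `c[s, a]` (to be minimized), and a decaying step size proportional to `(1 - gamma) / sqrt(k + 1)`. The function names and the exact step-size constant are hypothetical.

```python
import numpy as np

def evaluate_q(P, c, pi, gamma):
    """Exact q^pi for a tabular MDP with transitions P[s, a, s'] and costs c[s, a]."""
    n_states = c.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transitions under pi
    c_pi = np.sum(pi * c, axis=1)           # expected one-step cost under pi
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)
    return c + gamma * (P @ v)              # shape (n_states, n_actions)

def kl_proximal_step(pi, q, step):
    """Closed-form solution of: min_p <q, p> + (1/step) * KL(p || pi), per state."""
    logits = np.log(np.maximum(pi, 1e-300)) - step * q
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

def run_mirror_descent(P, c, gamma=0.9, iters=200):
    """Iterate the KL-proximal update from the uniform policy (assumed schedule)."""
    n_states, n_actions = c.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    for k in range(iters):
        q = evaluate_q(P, c, pi, gamma)
        step = (1.0 - gamma) / np.sqrt(k + 1)     # assumed decaying step size
        pi = kl_proximal_step(pi, q, step)
    return pi
```

The KL term plays the role of the trust region: a small `step` keeps the new policy near the old one, and the update reduces to an exponentiated-gradient rule on the action probabilities. A sample-based variant would replace `evaluate_q` with an estimate built from trajectories.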
Taxonomy
Methods: Trust Region Policy Optimization
