Easy Monotonic Policy Iteration

Joshua Achiam

arXiv:1602.09118·cs.LG·March 1, 2016

TL;DR

This paper introduces Easy Monotonic Policy Iteration, a reinforcement learning algorithm that guarantees non-decreasing policy performance by optimizing a policy improvement bound built on an average policy divergence rather than a sup norm, which makes it easy to implement from samples.

Contribution

It derives a new policy improvement bound replacing the sup norm with an average divergence, enabling a simple, sample-based algorithm with guaranteed monotonic improvement.
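To make the difference between the two kinds of divergence term concrete, here is a minimal sketch for small tabular policies. It is only an illustration under stated assumptions, not the paper's algorithm: the choice of total variation distance as the per-state divergence, the toy policies, and the function names (`tv_divergence`, `sup_divergence`, `average_divergence`) are all hypothetical. The point it shows is that the sup norm needs the worst case over every state, whereas the average can be estimated from the states that actually appear in sampled trajectories.

```python
import numpy as np


def tv_divergence(p, q):
    """Per-state total variation distance between two policies.

    p and q are arrays of shape (n_states, n_actions) whose rows are
    action distributions. Using TV here is an assumption made for
    illustration only.
    """
    return 0.5 * np.abs(p - q).sum(axis=-1)


def sup_divergence(pi_old, pi_new):
    """Worst-case divergence over ALL states (needs the full state space)."""
    return tv_divergence(pi_old, pi_new).max()


def average_divergence(pi_old, pi_new, sampled_states):
    """Monte Carlo average of the divergence over sampled (visited) states."""
    return tv_divergence(pi_old[sampled_states], pi_new[sampled_states]).mean()


# Toy example: 3 states, 2 actions.
pi_old = np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
pi_new = np.array([[0.5, 0.5], [0.6, 0.4], [1.0, 0.0]])

# Hypothetical visited states; state 2 is never reached in the samples.
sampled_states = np.array([0, 0, 1, 0, 1, 0])

sup_d = sup_divergence(pi_old, pi_new)
avg_d = average_divergence(pi_old, pi_new, sampled_states)
print(f"sup divergence: {sup_d:.4f}, average divergence: {avg_d:.4f}")
```

In this toy setup the sup term is dominated by a state the sampled trajectories never visit, while the average is a plain mean over samples, which is the kind of quantity that is straightforward to estimate and differentiate in a sample-based setting.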

Findings

01 Guarantees non-decreasing returns in policy sequences

02 Simplifies implementation in sample-based reinforcement learning

03 Improves over prior bounds that are hard to optimize

Abstract

A key problem in reinforcement learning for control with general function approximators (such as deep neural networks and other nonlinear functions) is that, for many algorithms employed in practice, updates to the policy or $Q$-function may fail to improve performance, or worse, actually cause the policy performance to degrade. Prior work has addressed this for policy iteration by deriving tight policy improvement bounds; by optimizing the lower bound on policy improvement, a better policy is guaranteed. However, existing approaches suffer from bounds that are hard to optimize in practice because they include sup norm terms which cannot be efficiently estimated or differentiated. In this work, we derive a better policy improvement bound where the sup norm of the policy divergence has been replaced with an average divergence; this leads to an algorithm, Easy Monotonic Policy Iteration,…

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adaptive Dynamic Programming Control