
TL;DR
This paper introduces Easy Monotonic Policy Iteration, an algorithm that guarantees non-decreasing policy performance in reinforcement learning by using an average divergence measure, making it practical and easy to implement.
Contribution
It derives a new policy improvement bound replacing the sup norm with an average divergence, enabling a simple, sample-based algorithm with guaranteed monotonic improvement.
Findings
Guarantees non-decreasing returns in policy sequences
Simplifies implementation in sample-based reinforcement learning
Improves over prior bounds that are hard to optimize
Abstract
A key problem in reinforcement learning for control with general function approximators (such as deep neural networks and other nonlinear functions) is that, for many algorithms employed in practice, updates to the policy or Q-function may fail to improve performance, or worse, actually cause the policy performance to degrade. Prior work has addressed this for policy iteration by deriving tight policy improvement bounds; by optimizing the lower bound on policy improvement, a better policy is guaranteed. However, existing approaches suffer from bounds that are hard to optimize in practice because they include sup norm terms which cannot be efficiently estimated or differentiated. In this work, we derive a better policy improvement bound where the sup norm of the policy divergence has been replaced with an average divergence; this leads to an algorithm, Easy Monotonic Policy Iteration,…
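To make the contrast between a sup norm and an average divergence concrete, here is a minimal sketch, not the paper's algorithm. It assumes tabular policies and uses total variation distance as the per-state divergence; the abstract does not say which divergence the paper uses, so that choice, the function names, and the visitation distribution `state_dist` are all illustrative assumptions. The average can be estimated from sampled states alone, while the sup requires the divergence at every state.

```python
import numpy as np

def tv_divergence(pi_new, pi_old):
    """Per-state total variation distance between two tabular policies.

    Both arrays have shape (num_states, num_actions), rows summing to 1.
    Total variation is an assumed choice of divergence for this sketch.
    """
    return 0.5 * np.abs(pi_new - pi_old).sum(axis=1)

def average_divergence(pi_new, pi_old, sampled_states):
    """Monte Carlo estimate of the divergence averaged over visited states."""
    return tv_divergence(pi_new, pi_old)[sampled_states].mean()

def sup_divergence(pi_new, pi_old):
    """Worst-case divergence; needs every state, not only sampled ones."""
    return tv_divergence(pi_new, pi_old).max()

rng = np.random.default_rng(0)
num_states, num_actions = 5, 3

# Old policy is uniform; the new policy differs only in state 4.
pi_old = np.full((num_states, num_actions), 1.0 / num_actions)
pi_new = pi_old.copy()
pi_new[4] = [1.0, 0.0, 0.0]

# Hypothetical state-visitation distribution: state 4 is rarely visited.
state_dist = np.array([0.3, 0.3, 0.2, 0.19, 0.01])
sampled_states = rng.choice(num_states, size=1000, p=state_dist)

avg_div = average_divergence(pi_new, pi_old, sampled_states)
sup_div = sup_divergence(pi_new, pi_old)
print(f"average divergence (sampled): {avg_div:.4f}")
print(f"sup divergence (all states):  {sup_div:.4f}")
```

In this toy setup the two policies disagree only in a rarely visited state, so the sampled average comes out far smaller than the worst case. That gap suggests why a bound stated in terms of the average may be less conservative, as well as easier to estimate from samples.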
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adaptive Dynamic Programming Control
