Cautious Policy Programming: Exploiting KL Regularization in Monotonic Policy Improvement for Reinforcement Learning

Lingwei Zhu; Toshinori Kitamura; Takamitsu Matsubara

arXiv:2107.05798·cs.LG·January 19, 2022

TL;DR

This paper introduces Cautious Policy Programming (CPP), a reinforcement learning algorithm that ensures monotonic policy improvement by leveraging KL regularization and an entropy-aware lower bound, improving stability and scalability in complex tasks.

Contribution

The paper presents a novel entropy regularization-aware lower bound for policy improvement and an interpolation scheme that enhances CPP's scalability in high-dimensional control problems.

Findings

1. CPP guarantees monotonic policy improvement.

2. CPP balances performance and stability effectively.

3. CPP scales well to high-dimensional Atari games.

Abstract

In this paper, we propose cautious policy programming (CPP), a novel value-based reinforcement learning (RL) algorithm that can ensure monotonic policy improvement during learning. Based on the nature of entropy-regularized RL, we derive a new entropy regularization-aware lower bound of policy improvement that only requires estimating the expected policy advantage function. CPP leverages this lower bound as a criterion for adjusting the degree of a policy update to alleviate policy oscillation. Unlike similar algorithms, which are mostly theory-oriented, we also propose a novel interpolation scheme that makes CPP scale better in high-dimensional control problems. We demonstrate that the proposed algorithm can trade off performance and stability in both didactic classic control problems and challenging high-dimensional Atari games.
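The idea of adjusting the degree of a policy update can be illustrated with a small tabular sketch. This is not the paper's algorithm: the abstract does not give the lower bound or the interpolation rule, so the code below assumes a generic conservative mixture of an old and a new policy, computes the expected policy advantage in its unregularized form, and uses a hypothetical helper `zeta_from_advantage` with a made-up `scale` parameter as a stand-in for whatever coefficient the actual bound would produce. All function names are illustrative assumptions.

```python
import numpy as np


def boltzmann_policy(q, tau):
    """Softmax policy per state; q has shape (num_states, num_actions)."""
    z = (q - q.max(axis=1, keepdims=True)) / tau  # subtract max for stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)


def expected_policy_advantage(q, pi_old, pi_new, state_dist):
    """E_s[ sum_a (pi_new - pi_old)(a|s) Q(s, a) ] under state_dist.

    Assumption: unregularized form, i.e. V(s) = sum_a pi_old(a|s) Q(s, a).
    """
    per_state = ((pi_new - pi_old) * q).sum(axis=1)
    return float(state_dist @ per_state)


def zeta_from_advantage(advantage, scale):
    """Hypothetical rule (not from the paper): mix more when the estimated
    advantage is larger, and do not move at all when it is non-positive."""
    return float(np.clip(scale * advantage, 0.0, 1.0))


def cautious_update(pi_old, pi_new, zeta):
    """Interpolate between policies: zeta=0 keeps pi_old, zeta=1 takes pi_new."""
    zeta = float(np.clip(zeta, 0.0, 1.0))
    return zeta * pi_new + (1.0 - zeta) * pi_old


if __name__ == "__main__":
    q = np.array([[1.0, 0.0], [0.0, 2.0]])      # toy action values
    pi_old = np.full((2, 2), 0.5)               # uniform baseline policy
    pi_new = boltzmann_policy(q, tau=0.5)       # greedier candidate policy
    d = np.array([0.5, 0.5])                    # assumed state distribution

    adv = expected_policy_advantage(q, pi_old, pi_new, d)
    zeta = zeta_from_advantage(adv, scale=0.5)  # scale is an arbitrary choice
    pi_next = cautious_update(pi_old, pi_new, zeta)
    print("advantage:", adv, "zeta:", zeta)
    print(pi_next)
```

Because the mixture is linear in `zeta`, the expected advantage of the interpolated policy over `pi_old` is `zeta` times that of `pi_new`, which is why a smaller coefficient gives a smaller but safer step in this toy setting.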

Taxonomy

Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Advanced Bandit Algorithms Research