Dual Policy Iteration

Wen Sun; Geoffrey J. Gordon; Byron Boots; J. Andrew Bagnell

arXiv:1805.10755 · cs.LG · April 9, 2019 · 25 cites

Open Access

TL;DR

This paper studies Dual Policy Iteration (DPI), a framework that alternately optimizes a fast, reactive policy and a slow, non-reactive policy that plans multiple steps ahead. It provides a convergence analysis and unifies model-free and model-based reinforcement learning for continuous control tasks.

Contribution

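The paper casts dual policy iteration as an alternating optimization, gives a convergence analysis that extends existing Approximate Policy Iteration (API) theory, and develops a special instance that reduces the non-reactive policy update to model-based optimal control, unifying model-free and model-based RL under a common framework.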

Findings

01 Effective on various continuous control tasks

02 Convergence guarantees for the dual policy iteration

03 Unifies model-free and model-based RL approaches

Abstract

Recently, a novel class of Approximate Policy Iteration (API) algorithms has demonstrated impressive practical performance (e.g., ExIt from [2], AlphaGo-Zero from [27]). This new family of algorithms maintains, and alternately optimizes, two policies: a fast, reactive policy (e.g., a deep neural network) deployed at test time, and a slow, non-reactive policy (e.g., Tree Search) that can plan multiple steps ahead. The reactive policy is updated under supervision from the non-reactive policy, while the non-reactive policy is improved with guidance from the reactive policy. In this work we study this Dual Policy Iteration (DPI) strategy in an alternating optimization framework and provide a convergence analysis that extends existing API theory. We also develop a special instance of this framework which reduces the update of non-reactive policies to model-based optimal control using…
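
The alternating scheme described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' method: a hypothetical six-state chain MDP, a tabular reactive policy, a depth-limited exhaustive search standing in for the non-reactive policy (in place of tree search or optimal control), and a simple mixing step standing in for supervised imitation. The names `step`, `evaluate`, `lookahead`, and `dual_policy_iteration` are hypothetical.

```python
# Toy sketch of alternating between a reactive and a non-reactive policy.
# All names and the chain MDP are hypothetical; this is not the paper's algorithm.
N_STATES, N_ACTIONS, GAMMA = 6, 2, 0.9
GOAL = N_STATES - 1


def step(s, a):
    """Deterministic chain: action 0 moves left, action 1 moves right."""
    if s == GOAL:  # the goal is absorbing and yields no further reward
        return s, 0.0
    s2 = max(s - 1, 0) if a == 0 else s + 1
    return s2, (1.0 if s2 == GOAL else 0.0)


def evaluate(policy, sweeps=200):
    """Approximate state values of a stochastic tabular policy."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        v = [
            sum(policy[s][a] * (r + GAMMA * v[s2])
                for a in range(N_ACTIONS)
                for s2, r in [step(s, a)])
            for s in range(N_STATES)
        ]
    return v


def lookahead(s, v_leaf, depth):
    """Non-reactive policy: depth-limited search whose leaves are scored
    by the reactive policy's value estimates (the 'guidance')."""
    if depth == 0:
        return v_leaf[s], None
    best_q, best_a = float("-inf"), None
    for a in range(N_ACTIONS):
        s2, r = step(s, a)
        q = r + GAMMA * lookahead(s2, v_leaf, depth - 1)[0]
        if q > best_q:
            best_q, best_a = q, a
    return best_q, best_a


def dual_policy_iteration(iters=20, depth=2, lr=0.5):
    """Alternate: improve the planner using the reactive policy's values,
    then move the reactive policy toward the planner's chosen actions."""
    reactive = [[1.0 / N_ACTIONS] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(iters):
        v = evaluate(reactive)  # guidance from the reactive policy
        expert = [lookahead(s, v, depth)[1] for s in range(N_STATES)]
        # "Supervised" update: mix the reactive policy toward the expert action.
        reactive = [
            [(1 - lr) * reactive[s][a] + lr * (1.0 if a == expert[s] else 0.0)
             for a in range(N_ACTIONS)]
            for s in range(N_STATES)
        ]
    return reactive
```

Calling `dual_policy_iteration()` returns a table of action probabilities per state. In this toy problem the reactive policy comes to favor the move toward the goal in every non-goal state, even though the planner only looks two steps ahead, because the leaf values supplied by the reactive policy carry information about rewards beyond the search horizon.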

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Neural Network Applications · Age of Information Optimization