Dual Policy Iteration
Wen Sun, Geoffrey J. Gordon, Byron Boots, J. Andrew Bagnell

TL;DR
This paper introduces a dual policy iteration framework that alternates between reactive and non-reactive policies, providing convergence analysis and unifying model-free and model-based reinforcement learning for continuous control tasks.
Contribution
It develops a novel dual policy iteration method with convergence guarantees and unifies model-free and model-based RL approaches under a common framework.
Findings
Effective on various continuous control tasks
Convergence guarantees for the dual policy iteration
Unifies model-free and model-based RL approaches
Abstract
Recently, a novel class of Approximate Policy Iteration (API) algorithms has demonstrated impressive practical performance (e.g., ExIt from [2], AlphaGo-Zero from [27]). This new family of algorithms maintains, and alternately optimizes, two policies: a fast, reactive policy (e.g., a deep neural network) deployed at test time, and a slow, non-reactive policy (e.g., Tree Search) that can plan multiple steps ahead. The reactive policy is updated under supervision from the non-reactive policy, while the non-reactive policy is improved with guidance from the reactive policy. In this work we study this Dual Policy Iteration (DPI) strategy in an alternating optimization framework and provide a convergence analysis that extends existing API theory. We also develop a special instance of this framework which reduces the update of non-reactive policies to model-based optimal control using…
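The alternating update described in the abstract can be illustrated with a small toy sketch. This is not the authors' algorithm: the chain MDP, the tabular reactive policy, the exhaustive two-step lookahead standing in for tree search, and every name below (`step`, `non_reactive_action`, `dual_policy_iteration`, the step size `lr`) are assumptions made purely for illustration.

```python
# Toy deterministic chain MDP (an assumption for illustration only):
# states 0..N_STATES-1, actions 0 (left) and 1 (right), absorbing goal at the end.
N_STATES = 6
ACTIONS = (0, 1)
GAMMA = 0.9
GOAL = N_STATES - 1


def step(state, action):
    """Known model: return (next_state, reward). Entering the goal pays 1."""
    if state == GOAL:
        return state, 0.0  # the goal is absorbing
    nxt = max(0, state - 1) if action == 0 else state + 1
    return nxt, (1.0 if nxt == GOAL else 0.0)


def greedy(pi, state):
    """Reactive policy: pi[state] is the probability of moving right."""
    return 1 if pi[state] >= 0.5 else 0


def rollout_value(pi, state, horizon=20):
    """Discounted return of following the reactive policy greedily."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        state, reward = step(state, greedy(pi, state))
        total += discount * reward
        discount *= GAMMA
    return total


def q_value(pi, state, action, depth):
    """Lookahead value of an action; leaves are scored by reactive rollouts."""
    nxt, reward = step(state, action)
    if depth == 1:
        return reward + GAMMA * rollout_value(pi, nxt)
    return reward + GAMMA * max(q_value(pi, nxt, a, depth - 1) for a in ACTIONS)


def non_reactive_action(pi, state, depth=2):
    """Slow policy: plan `depth` steps ahead, guided by the reactive policy.

    Ties fall back to the reactive policy's own action.
    """
    best = greedy(pi, state)
    best_q = q_value(pi, state, best, depth)
    for a in ACTIONS:
        q = q_value(pi, state, a, depth)
        if q > best_q + 1e-12:
            best, best_q = a, q
    return best


def dual_policy_iteration(iterations=10, lr=0.5):
    """Alternate: plan with the slow policy, then imitate it with the fast one."""
    pi = [0.1] * N_STATES  # start with a reactive policy that mostly moves left
    for _ in range(iterations):
        # Non-reactive policy improved with guidance from the reactive policy.
        targets = [non_reactive_action(pi, s) for s in range(N_STATES)]
        # Reactive policy updated under supervision from the non-reactive one.
        pi = [p + lr * (t - p) for p, t in zip(pi, targets)]
    return pi


if __name__ == "__main__":
    learned = dual_policy_iteration()
    print([greedy(learned, s) for s in range(N_STATES)])
```

In this sketch the lookahead only sees two steps ahead, so states far from the goal improve only after the reactive policy has absorbed the planner's choices for nearer states, which is the mutual-improvement idea the abstract describes.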
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Neural Network Applications · Age of Information Optimization
