Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning
Chuming Li, Ruonan Jia, Jie Liu, Yinmin Zhang, Yazhe Niu, Yaodong Yang, Yu Liu, Wanli Ouyang

TL;DR
This paper introduces a theoretically grounded method for distilling model-based planning into RL policies, ensuring monotonic improvement and convergence, leading to enhanced sample efficiency and performance in continuous control tasks.
Contribution
It extends Soft Actor-Critic with a policy distillation approach from model-based planning, providing theoretical guarantees and a practical algorithm called MPDP.
Findings
MPDP achieves superior sample efficiency over existing methods.
MPDP demonstrates better asymptotic performance on MuJoCo benchmarks.
Theoretical analysis guarantees monotonic policy improvement and convergence.
Abstract
Model-based reinforcement learning (RL) has demonstrated remarkable successes on a range of continuous control tasks due to its high sample efficiency. To save the computation cost of conducting planning online, recent practices tend to distill optimized action sequences into an RL policy during the training phase. Although the distillation can incorporate both the foresight of planning and the exploration ability of RL policies, the theoretical understanding of these methods is yet unclear. In this paper, we extend the policy improvement step of Soft Actor-Critic (SAC) by developing an approach to distill from model-based planning to the policy. We then demonstrate that such an approach of policy improvement has a theoretical guarantee of monotonic improvement and convergence to the maximum value defined in SAC. We discuss effective design choices and implement our theory as a…
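To make the "policy improvement step of SAC" concrete, the following is a minimal sketch of the standard SAC improvement step in a toy setting: a single state with three discrete actions. In that setting the improved policy is the Boltzmann distribution over Q-values, and its entropy-regularized value is at least that of the old policy. This illustrates only the baseline step the paper extends, not MPDP's distillation from planning, which the truncated abstract does not spell out. The Q-values, the temperature `alpha`, and the function names are hypothetical and chosen purely for illustration.

```python
import math


def softmax_policy(q_values, alpha):
    """Boltzmann policy: pi(a) proportional to exp(Q(a) / alpha)."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / alpha) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def soft_value(policy, q_values, alpha):
    """Entropy-regularized value: E_pi[Q(a) - alpha * log pi(a)]."""
    return sum(
        p * (q - alpha * math.log(p))
        for p, q in zip(policy, q_values)
        if p > 0.0
    )


# Hypothetical Q-values for one state and a hypothetical temperature.
q_values = [1.0, 2.0, 0.5]
alpha = 0.2

old_policy = [1.0 / 3.0] * 3                   # uniform starting policy
new_policy = softmax_policy(q_values, alpha)   # SAC-style improvement step

old_value = soft_value(old_policy, q_values, alpha)
new_value = soft_value(new_policy, q_values, alpha)

print(f"old soft value: {old_value:.4f}")
print(f"new soft value: {new_value:.4f}")
```

In this toy case the new soft value equals `alpha * log(sum(exp(Q / alpha)))`, the maximum attainable for fixed Q-values, which is the single-state analogue of the monotonic improvement property the abstract refers to.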
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Average Pooling · Dilated Convolution · Convolution · 1x1 Convolution · Global Average Pooling · Switchable Atrous Convolution
