Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning

Chuming Li; Ruonan Jia; Jie Liu; Yinmin Zhang; Yazhe Niu; Yaodong Yang; Yu Liu; Wanli Ouyang

arXiv:2307.12933·cs.AI·July 25, 2023·1 cites

Open Access

TL;DR

This paper introduces a theoretically grounded method for distilling model-based planning into RL policies, ensuring monotonic improvement and convergence, leading to enhanced sample efficiency and performance in continuous control tasks.

Contribution

It extends Soft Actor-Critic with a policy distillation approach from model-based planning, providing theoretical guarantees and a practical algorithm called MPDP.

Findings

01. MPDP achieves superior sample efficiency over existing methods.

02. MPDP demonstrates better asymptotic performance on MuJoCo benchmarks.

03. Theoretical analysis guarantees monotonic policy improvement and convergence.

Abstract

Model-based reinforcement learning (RL) has demonstrated remarkable successes on a range of continuous control tasks due to its high sample efficiency. To save the computation cost of conducting planning online, recent practices tend to distill optimized action sequences into an RL policy during the training phase. Although the distillation can incorporate both the foresight of planning and the exploration ability of RL policies, the theoretical understanding of these methods is yet unclear. In this paper, we extend the policy improvement step of Soft Actor-Critic (SAC) by developing an approach to distill from model-based planning to the policy. We then demonstrate that such an approach of policy improvement has a theoretical guarantee of monotonic improvement and convergence to the maximum value defined in SAC. We discuss effective design choices and implement our theory as a…
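As a rough illustration of what "distilling optimized action sequences into a policy" means in general, the sketch below plans with a simple random-shooting search in a toy model and then fits a policy to the planner's first actions. This is not MPDP and does not implement the paper's extension of the SAC policy improvement step. The 1-D dynamics, the reward, the random-shooting planner, the linear policy, and every function name here are assumptions chosen only to keep the example small.

```python
import numpy as np


def model(states, actions):
    # Hypothetical "learned" dynamics (an assumption): s' = s + 0.5 * a.
    return states + 0.5 * actions


def reward_fn(states, actions):
    # Hypothetical reward (an assumption): stay near 0 with small actions.
    return -states**2 - 0.01 * actions**2


def plan_first_action(state, rng, horizon=5, n_candidates=64):
    """Random-shooting planner: sample action sequences, roll each one out
    in the model, and return the first action of the best sequence."""
    actions = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    states = np.full(n_candidates, float(state))
    returns = np.zeros(n_candidates)
    for t in range(horizon):
        returns += reward_fn(states, actions[:, t])
        states = model(states, actions[:, t])
    return actions[np.argmax(returns), 0]


def distill_linear_policy(states, planned_actions):
    """Fit a linear policy a = w * s + b to the planner's actions
    by ordinary least squares (plain regression, not a SAC update)."""
    X = np.stack([states, np.ones_like(states)], axis=1)
    (w, b), *_ = np.linalg.lstsq(X, planned_actions, rcond=None)
    return w, b


def fit_demo_policy(seed=0):
    """Plan from a grid of start states, then distill into (w, b)."""
    rng = np.random.default_rng(seed)
    states = np.linspace(-2.0, 2.0, 41)
    planned = np.array([plan_first_action(s, rng) for s in states])
    return distill_linear_policy(states, planned)
```

Calling `fit_demo_policy()` returns a slope and an intercept. In this toy setup the slope comes out negative, meaning the distilled policy pushes the state back toward zero without running the planner at decision time.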

Taxonomy

Topics: Reinforcement Learning in Robotics

Methods: Average Pooling · Dilated Convolution · Convolution · 1x1 Convolution · Global Average Pooling · Switchable Atrous Convolution