Conservative Dual Policy Optimization for Efficient Model-Based Reinforcement Learning

Shenao Zhang

arXiv:2209.07676 · cs.LG · September 19, 2022

TL;DR

This paper introduces Conservative Dual Policy Optimization (CDPO), a new approach in model-based reinforcement learning that enhances stability and exploration efficiency while maintaining theoretical guarantees of optimality.

Contribution

The paper proposes CDPO, which combines a policy update under a reference model with a conservative update to improve stability and exploration efficiency in model-based reinforcement learning (MBRL) while matching the regret of PSRL.

Findings

01 CDPO achieves monotonic policy improvement.

02 CDPO maintains the same regret as PSRL.

03 Empirical results show improved exploration efficiency.

Abstract

Provably efficient Model-Based Reinforcement Learning (MBRL) based on optimism or posterior sampling (PSRL) is guaranteed to attain global optimality asymptotically by introducing a complexity measure of the model. However, the complexity might grow exponentially even for the simplest nonlinear models, in which case global convergence is impossible within finite iterations. When the model suffers a large generalization error, which is quantitatively measured by the model complexity, the uncertainty can be large. The sampled model on which the current policy is greedily optimized will thus be unsettled, resulting in aggressive policy updates and over-exploration. In this work, we propose Conservative Dual Policy Optimization (CDPO), which involves a Referential Update and a Conservative Update. The policy is first optimized under a reference model, which imitates the mechanism of PSRL while offering…
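The posterior-sampling mechanism that the abstract refers to (sample one model from the posterior, then act greedily under it) can be illustrated with a minimal sketch. This is not CDPO and not the paper's code: the Bernoulli bandit setting (a one-state MDP), the Beta(1, 1) priors, and the function name `psrl_bandit` are all assumptions chosen to keep the example small.

```python
import random


def psrl_bandit(true_means, episodes, seed=0):
    """Posterior sampling on a Bernoulli bandit (hypothetical toy setting).

    Each episode: sample one model from the posterior, act greedily under
    that sampled model, observe a reward, and update the posterior.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    # Beta(1, 1) prior over each arm's mean reward (an assumption).
    alpha = [1.0] * n_arms
    beta = [1.0] * n_arms
    pulls = [0] * n_arms

    for _ in range(episodes):
        # Sample a model: one plausible mean reward per arm.
        sampled = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        # Greedy policy under the sampled model.
        arm = max(range(n_arms), key=lambda a: sampled[a])
        # Interact with the true environment.
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate Beta-Bernoulli posterior update.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1

    return pulls


if __name__ == "__main__":
    print(psrl_bandit([0.1, 0.9], episodes=500))
```

Early on, when the posterior is wide, consecutive samples can differ sharply, so the greedy choice changes from episode to episode. That is the toy analogue of the unsettled sampled model and aggressive policy updates described in the abstract.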

Videos

Conservative Dual Policy Optimization for Efficient Model-Based Reinforcement Learning · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Machine Learning and ELM · Advanced Bandit Algorithms Research