Conservative Dual Policy Optimization for Efficient Model-Based Reinforcement Learning
Shenao Zhang

TL;DR
This paper introduces Conservative Dual Policy Optimization (CDPO), a new approach in model-based reinforcement learning that enhances stability and exploration efficiency while maintaining theoretical guarantees of optimality.
Contribution
The paper proposes CDPO, combining a reference model and conservative updates to improve stability and exploration in MBRL without increasing regret.
Findings
CDPO achieves monotonic policy improvement.
CDPO maintains the same regret as PSRL.
Empirical results show improved exploration efficiency.
Abstract
Provably efficient Model-Based Reinforcement Learning (MBRL) based on optimism or posterior sampling (PSRL) is guaranteed to attain global optimality asymptotically by introducing a complexity measure of the model. However, the complexity might grow exponentially even for the simplest nonlinear models, in which case global convergence is impossible within finite iterations. When the model suffers a large generalization error, which is quantitatively measured by the model complexity, the uncertainty can be large. The sampled model upon which the current policy is greedily optimized will thus be unsettled, resulting in aggressive policy updates and over-exploration. In this work, we propose Conservative Dual Policy Optimization (CDPO), which involves a Referential Update and a Conservative Update. The policy is first optimized under a reference model, which imitates the mechanism of PSRL while offering…
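As a rough illustration of the two-stage idea named in the abstract, the sketch below uses a toy Bernoulli bandit rather than a full MDP. Everything in it is an assumption made for illustration: the function names, the Beta posterior, the softmax reference policy built from the posterior mean, and the mixture step that keeps the final policy within total variation `eps` of the reference policy. The abstract text here stops before describing the Conservative Update, so the second step should be read as one plausible conservative scheme, not as the paper's method.

```python
import math
import random


def posterior_mean(successes, failures):
    # Mean of a Beta(1 + s, 1 + f) posterior per arm (assumed model class).
    return [(1 + s) / (2 + s + f) for s, f in zip(successes, failures)]


def sample_model(successes, failures, rng):
    # One posterior sample per arm, as PSRL-style methods would draw.
    return [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]


def softmax(values, temperature):
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def referential_update(successes, failures, temperature=0.1):
    # Hypothetical referential step: optimize against a single, stable
    # reference model (here the posterior mean) instead of a random sample.
    return softmax(posterior_mean(successes, failures), temperature)


def conservative_update(reference_policy, successes, failures, rng,
                        eps=0.2, n_samples=64):
    # Hypothetical conservative step: estimate each arm's value averaged
    # over posterior samples, then shift at most `eps` probability mass
    # from the reference policy toward the best arm under that average.
    n_arms = len(reference_policy)
    avg = [0.0] * n_arms
    for _ in range(n_samples):
        model = sample_model(successes, failures, rng)
        for a in range(n_arms):
            avg[a] += model[a] / n_samples
    best = max(range(n_arms), key=avg.__getitem__)
    return [(1 - eps) * p + (eps if a == best else 0.0)
            for a, p in enumerate(reference_policy)]


def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


rng = random.Random(0)
successes, failures = [8, 2, 1], [2, 8, 1]
q = referential_update(successes, failures)
pi = conservative_update(q, successes, failures, rng, eps=0.2)
print("reference policy:", q)
print("conservative policy:", pi)
print("TV distance:", total_variation(pi, q))
```

The mixture form is chosen only because it makes the trust-region property easy to verify: the distance between the two policies is `eps * (1 - q[best])`, which never exceeds `eps`.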
Taxonomy
Topics: Reinforcement Learning in Robotics · Machine Learning and ELM · Advanced Bandit Algorithms Research
