Deterministic Value-Policy Gradients
Qingpeng Cai, Ling Pan, Pingzhong Tang

TL;DR
This paper introduces the deterministic value-policy gradient (DVPG) algorithm, which enhances sample efficiency in continuous control tasks by combining model-based and model-free methods, supported by theoretical guarantees and extensive experiments.
Contribution
It provides the first theoretical guarantee for infinite horizon deterministic value gradients and proposes the DVPG algorithm that outperforms existing methods.
Findings
DVPG significantly outperforms baselines on continuous control benchmarks.
The paper proves that deterministic value gradients exist in the infinite-horizon setting.
The number of rollout steps taken through the learned model trades off the variance of the value gradients against model bias.
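The role of the rollout length can be illustrated with a toy sketch. It is not the paper's DVG or DVPG implementation: the scalar linear model (a_m, b_m), the quadratic reward, the quadratic critic with weight w, and the linear policy a = theta * s are all hypothetical choices made only so the gradient can be written by hand. The function rolls the learned model forward k steps, bootstraps with the critic, and propagates the derivative with respect to theta through every step by the chain rule.

```python
def k_step_value_and_grad(theta, s0, k, gamma=0.99, a_m=0.9, b_m=0.5, w=1.0):
    """Return the k-step value estimate and its analytic gradient w.r.t. theta.

    Hypothetical setup (not from the paper):
      policy        a = theta * s
      learned model s' = a_m * s + b_m * a
      reward        r = -(s^2 + a^2)
      critic        Q(s, a) = -w * (s^2 + a^2)
    """
    s, ds = s0, 0.0                      # state and d(state)/d(theta)
    value, grad, discount = 0.0, 0.0, 1.0
    for _ in range(k):
        a, da = theta * s, s + theta * ds            # action and its derivative
        value += discount * -(s * s + a * a)         # discounted model reward
        grad += discount * -2.0 * (s * ds + a * da)  # chain rule through reward
        s, ds = a_m * s + b_m * a, a_m * ds + b_m * da  # step the learned model
        discount *= gamma
    # Bootstrap the tail of the return with the critic at the final state.
    a, da = theta * s, s + theta * ds
    value += discount * -w * (s * s + a * a)
    grad += discount * -2.0 * w * (s * ds + a * da)
    return value, grad


# Compare estimates for several rollout lengths from the same start state.
for k in (0, 1, 5):
    v, g = k_step_value_and_grad(theta=-0.4, s0=1.0, k=k)
    print(f"k={k}: value={v:.4f}, grad={g:.4f}")
```

With k = 0 the estimate depends only on the critic; as k grows, more of it comes from the learned model, so errors in that model weigh more heavily on the gradient. This is the trade-off the rollout length controls.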
Abstract
Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is too myopic compared with the infinite horizon. We first give a theoretical guarantee of the existence of the value gradients in this infinite setting. Based on this theoretical guarantee, we propose a class of deterministic value gradient (DVG) algorithms with infinite horizon, in which different numbers of rollout steps of the analytical gradients through the learned model trade off between the variance of the value gradients and the model bias. Furthermore, to…
Taxonomy
Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Fuel Cells and Related Materials
Methods: Experience Replay · Dense Connections · Weight Decay · Adam · Convolution · Batch Normalization · Deep Deterministic Policy Gradient
