Deterministic Policy Gradients With General State Transitions

Qingpeng Cai; Ling Pan; Pingzhong Tang

arXiv:1807.03708 · cs.LG · October 3, 2018 · 1 cites

TL;DR

This paper extends deterministic policy gradient methods to a new setting with mixed stochastic and deterministic state transitions, providing theoretical guarantees, a novel algorithm, and empirical evidence of improved performance.

Contribution

The paper introduces a generalized setting for deterministic policy gradients, proves that the gradient exists for a certain set of discount factors (and for all discount factors under two additional conditions), and proposes the GDPG algorithm, which combines model-based and model-free techniques.

Findings

01. GDPG outperforms DDPG and other baselines in convergence and rewards

02. Theoretical proof of policy gradient existence in generalized setting

03. Closed-form expression for the policy gradient

Abstract

We study a reinforcement learning setting in which the state transition function is a convex combination of a stochastic continuous function and a deterministic function. Such a setting generalizes the widely studied stochastic state transition setting, namely the setting of deterministic policy gradient (DPG). We first give a simple example to illustrate that the deterministic policy gradient may be infinite under deterministic state transitions, and introduce a theoretical technique to prove the existence of the policy gradient in this generalized setting. Using this technique, we prove that the deterministic policy gradient indeed exists for a certain set of discount factors, and further prove two conditions that guarantee its existence for all discount factors. We then derive a closed form of the policy gradient whenever it exists. Furthermore, to overcome the challenge of high…
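
The convex-combination transition described in the abstract can be illustrated with a small sampling sketch. This is not the paper's environment or code: the linear dynamics, the Gaussian noise, the constant mixing weight `weight`, and all function names are assumptions made only for illustration. With probability `weight` the next state is drawn from a stochastic (continuous) transition, and otherwise it is given by a deterministic function.

```python
import random


def deterministic_step(state, action):
    # Hypothetical deterministic dynamics: next state is a fixed function of (state, action).
    return 0.9 * state + action


def stochastic_step(state, action, rng):
    # Hypothetical stochastic dynamics: same mean, plus Gaussian noise (a continuous density).
    return 0.9 * state + action + rng.gauss(0.0, 0.1)


def mixed_step(state, action, weight, rng):
    """Sample from weight * stochastic + (1 - weight) * deterministic.

    The constant mixing weight is an illustrative assumption.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if rng.random() < weight:
        return stochastic_step(state, action, rng)
    return deterministic_step(state, action)


rng = random.Random(0)
samples = [mixed_step(1.0, 0.5, weight=0.3, rng=rng) for _ in range(1000)]

# Samples that land exactly on the deterministic successor came from the deterministic branch.
det_fraction = sum(s == deterministic_step(1.0, 0.5) for s in samples) / len(samples)
print(f"fraction of deterministic transitions: {det_fraction:.2f}")
```

Setting `weight=1.0` gives a purely stochastic transition, and `weight=0.0` gives a purely deterministic one, which is the regime where the abstract notes the policy gradient may be infinite.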

Taxonomy

Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Advanced Bandit Algorithms Research

Methods: Experience Replay · Deterministic Policy Gradient · Dense Connections · Weight Decay · Adam · Convolution · Batch Normalization · Deep Deterministic Policy Gradient