Zeroth-order Deterministic Policy Gradient
Harshat Kumar, Dionysios S. Kalogerias, George J. Pappas, and Alejandro Ribeiro

TL;DR
ZDPG introduces a critic-free, model-free deterministic policy gradient method using two-point stochastic evaluations, achieving improved stability and sample complexity in reinforcement learning tasks.
Contribution
It proposes ZDPG, a novel critic-free approach that approximates policy gradients with stochastic evaluations, enhancing stability and efficiency over existing methods.
Findings
ZDPG is effective in practical reinforcement learning scenarios.
It offers improved finite sample complexity bounds.
ZDPG outperforms traditional PG and baseline methods in experiments.
Abstract
Deterministic Policy Gradient (DPG) removes a level of randomness from standard randomized-action Policy Gradient (PG), and demonstrates substantial empirical success for tackling complex dynamic problems involving Markov decision processes. At the same time, though, DPG loses its ability to learn in a model-free (i.e., actor-only) fashion, frequently necessitating the use of critics in order to obtain consistent estimates of the associated policy-reward gradient. In this work, we introduce Zeroth-order Deterministic Policy Gradient (ZDPG), which approximates policy-reward gradients via two-point stochastic evaluations of the Q-function, constructed by properly designed low-dimensional action-space perturbations. Exploiting the idea of random horizon rollouts for obtaining unbiased estimates of the Q-function, ZDPG lifts the dependence on critics and restores true model-free policy…
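The two ingredients named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (sample_horizon, rollout_q, two_point_gradient), the Gaussian perturbation, the symmetric difference, and the step/policy interfaces are all assumptions made for illustration, and the paper's exact estimator may differ. The sketch relies on one standard fact: if the horizon T is drawn with P(T = t) = (1 - gamma) * gamma^t, the undiscounted reward sum over steps 0..T has the discounted Q-value as its expectation.

```python
import random


def sample_horizon(gamma, rng):
    """Draw T with P(T = t) = (1 - gamma) * gamma**t for t = 0, 1, 2, ..."""
    t = 0
    while rng.random() < gamma:
        t += 1
    return t


def rollout_q(step, policy, state, action, gamma, rng):
    """One-sample Q estimate: undiscounted reward sum over a random horizon.

    `step(state, action)` is a hypothetical environment function returning
    (next_state, reward); `policy(state)` returns a deterministic action.
    """
    horizon = sample_horizon(gamma, rng)
    total = 0.0
    for _ in range(horizon + 1):
        state, reward = step(state, action)
        total += reward
        action = policy(state)  # follow the policy after the first action
    return total


def two_point_gradient(q, action, mu, rng):
    """Symmetric two-point estimate of the gradient of q at `action`.

    `q` maps an action (list of floats) to a scalar; `mu` is the
    perturbation size. Assumes a Gaussian direction u (illustrative choice).
    """
    u = [rng.gauss(0.0, 1.0) for _ in action]
    plus = [a + mu * ui for a, ui in zip(action, u)]
    minus = [a - mu * ui for a, ui in zip(action, u)]
    scale = (q(plus) - q(minus)) / (2.0 * mu)
    return [scale * ui for ui in u]
```

In a critic-free scheme of this kind, q would itself be a rollout estimate such as rollout_q evaluated at the perturbed action, and the resulting action-gradient would be combined with the policy's Jacobian through the chain rule, as in the deterministic policy gradient theorem. Perturbing only the action keeps the random direction in the (typically low-dimensional) action space rather than the parameter space.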
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Stochastic Gradient Optimization Techniques
Methods: Deterministic Policy Gradient
