Truly Deterministic Policy Optimization
Ehsan Saleh, Saba Ghaffari, Timothy Bretl, Matthew West

TL;DR
This paper introduces a deterministic policy gradient method that avoids exploratory noise injection, which removes estimation variance in systems with deterministic dynamics. It regularizes policy updates with a Wasserstein-based model and reports stronger performance than existing methods on complex robotic control tasks.
Contribution
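The paper develops a deterministic policy gradient approach regularized by a Wasserstein-based quadratic model. It states conditions on the system model under which monotonic policy improvement can be guaranteed, and shows that advantage estimates can be computed exactly when both the state transition model and the policy are deterministic.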
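A small sketch may help show why a metric such as the Wasserstein distance suits deterministic policies where the KL divergence does not. It is an illustration under stated assumptions, not the paper's quadratic model: a deterministic policy is treated as a point mass on its action, and both function names are hypothetical. The Wasserstein-2 distance between two point masses is the Euclidean distance between the actions, while the KL divergence between two isotropic Gaussians with equal standard deviation grows without bound as that standard deviation shrinks toward the deterministic limit.

```python
import math


def w2_point_masses(a1, a2):
    """Wasserstein-2 distance between Dirac measures at actions a1 and a2.

    For point masses this is the Euclidean distance between the two actions.
    (Hypothetical helper for illustration only.)
    """
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a1, a2)))


def kl_isotropic_gaussians(mu1, mu2, std):
    """KL(N(mu1, std^2 I) || N(mu2, std^2 I)) = ||mu1 - mu2||^2 / (2 std^2).

    (Hypothetical helper for illustration only.)
    """
    sq_dist = sum((x - y) ** 2 for x, y in zip(mu1, mu2))
    return sq_dist / (2.0 * std ** 2)


if __name__ == "__main__":
    a1, a2 = [0.0, 0.0], [3.0, 4.0]
    print("W2 between point masses:", w2_point_masses(a1, a2))
    # The KL divergence blows up as the policies approach determinism.
    for std in (1.0, 1e-1, 1e-2, 1e-3):
        print("std =", std, "KL =", kl_isotropic_gaussians(a1, a2, std))
```

The Wasserstein value stays finite and proportional to the change in action, which is the property a trust-region style regularizer needs when no noise is injected.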
Findings
TDPO (Truly Deterministic Policy Optimization) outperforms PPO, TRPO, DDPG, and TD3 in complex robotic environments.
By avoiding noise injection, the method removes the estimation variance of the policy gradient in systems with deterministic dynamics, up to the initial state distribution.
Experimental results include environments with non-local rewards and long horizons.
Abstract
In this paper, we present a policy gradient method that avoids exploratory noise injection and performs policy search over the deterministic landscape. By avoiding noise injection, all sources of estimation variance can be eliminated in systems with deterministic dynamics (up to the initial state distribution). Since deterministic policy regularization is impossible using traditional non-metric measures such as the KL divergence, we derive a Wasserstein-based quadratic model for our purposes. We state conditions on the system model under which it is possible to establish a monotonic policy improvement guarantee, propose a surrogate function for policy gradient estimation, and show that it is possible to compute exact advantage estimates if both the state transition model and the policy are deterministic. Finally, we describe two novel robotic control environments -- one with non-local…
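The claim about exact advantage estimates can be illustrated with a minimal sketch. It assumes a finite horizon, a toy one-dimensional system, and hypothetical names (`step`, `policy`, `rollout_return`, `exact_advantage`), none of which come from the paper. With deterministic dynamics and a deterministic policy, a single rollout gives the return exactly, so the advantage A(s, a) = Q(s, a) - V(s) needs no sampling average.

```python
def rollout_return(step, policy, state, gamma, horizon):
    """Discounted return of a deterministic policy from `state`.

    One rollout is exact here because neither the dynamics nor the
    policy involve randomness. (Hypothetical helper, finite horizon.)
    """
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)
        state, reward = step(state, action)
        total += discount * reward
        discount *= gamma
    return total


def exact_advantage(step, policy, state, action, gamma, horizon):
    """A(s, a) = Q(s, a) - V(s), computed from two deterministic rollouts."""
    next_state, reward = step(state, action)
    q_value = reward + gamma * rollout_return(
        step, policy, next_state, gamma, horizon - 1
    )
    v_value = rollout_return(step, policy, state, gamma, horizon)
    return q_value - v_value


# Toy deterministic system (an assumption for illustration):
# the state moves by the action, and the reward penalizes distance from 0.
def step(state, action):
    next_state = state + action
    return next_state, -(next_state ** 2)


def policy(state):
    return -0.5 * state


if __name__ == "__main__":
    # Jumping straight to the origin beats the policy's half-step.
    print(exact_advantage(step, policy, 1.0, -1.0, gamma=0.9, horizon=20))
    # The policy's own action has zero advantage (up to rounding).
    print(exact_advantage(step, policy, 1.0, policy(1.0), gamma=0.9, horizon=20))
```

Repeating either call returns the same number, which is the variance-free behavior the abstract describes; the only remaining randomness in practice would come from the initial state distribution.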
Taxonomy
Topics: Model Reduction and Neural Networks · Nuclear reactor physics and engineering · Advanced Neural Network Applications
Methods: Batch Normalization · Weight Decay · Adam · Convolution · Dense Connections · Experience Replay · Trust Region Policy Optimization · Deep Deterministic Policy Gradient
