Truly Deterministic Policy Optimization

Ehsan Saleh; Saba Ghaffari; Timothy Bretl; Matthew West

arXiv:2205.15379·cs.AI·June 1, 2022

TL;DR

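This paper introduces TDPO, a policy gradient method that searches directly over deterministic policies instead of injecting exploratory noise. In systems with deterministic dynamics this removes estimation variance (up to the initial state distribution). Policy updates are regularized with a Wasserstein-based model, and the method is reported to outperform existing methods on complex robotic control tasks.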

Contribution

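The paper develops a deterministic policy gradient approach regularized with a Wasserstein-based quadratic model, states conditions on the system model under which a monotonic policy improvement guarantee holds, and shows that advantage estimates can be computed exactly when both the state transition model and the policy are deterministic.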
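To make the exact-advantage claim concrete, here is a minimal sketch on a toy problem. The dynamics, reward, policy, discount factor, and horizon below are all hypothetical choices made for illustration, not taken from the paper. Because nothing in the rollout is random, the advantage A(s, a) = r(s, a) + γ V(s') − V(s) can be evaluated from two noise-free rollouts rather than estimated from samples.

```python
GAMMA = 0.9    # hypothetical discount factor
HORIZON = 50   # hypothetical finite horizon


def step(s, a):
    """Hypothetical deterministic dynamics and reward (not from the paper)."""
    s_next = 0.9 * s + a
    reward = -(s ** 2) - 0.1 * (a ** 2)
    return s_next, reward


def policy(s):
    """Hypothetical deterministic linear policy."""
    return -0.5 * s


def value(s, horizon=HORIZON):
    """Discounted return of following `policy` from state s for `horizon` steps."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        s, r = step(s, policy(s))
        total += discount * r
        discount *= GAMMA
    return total


def advantage(s, a, horizon=HORIZON):
    """A(s, a) = r(s, a) + gamma * V(s') - V(s), computed without sampling."""
    s_next, r = step(s, a)
    return r + GAMMA * value(s_next, horizon - 1) - value(s, horizon)


# Taking the policy's own action yields zero advantage (up to rounding).
print(advantage(1.0, policy(1.0)))
# Any other action gives a value that is identical on every evaluation.
print(advantage(1.0, 0.3))
```

Repeated calls with the same arguments return the same number, which is the sense in which the estimate is exact for this toy setup.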

Findings

01. TDPO, the paper's method, outperforms PPO, TRPO, DDPG, and TD3 in complex robotic environments.

02. The method achieves significant variance reduction in policy gradient estimation.

03. The experiments include environments with non-local rewards and long horizons.
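The variance finding can be illustrated with a small sketch, assuming a hypothetical one-dimensional linear system and linear policy (neither comes from the paper). From a fixed start state, rollouts without action noise all produce the same return, while rollouts with Gaussian action noise produce a spread of returns that any sample-based estimator has to average out.

```python
import numpy as np


def rollout_return(noise_std, rng, horizon=200):
    """Undiscounted return from a fixed start state on a toy linear system."""
    s, total = 1.0, 0.0
    for _ in range(horizon):
        # Deterministic action plus optional exploratory noise.
        a = -0.5 * s + noise_std * rng.standard_normal()
        total += -(s ** 2)   # hypothetical reward: penalize distance from 0
        s = 0.9 * s + a      # hypothetical deterministic dynamics
    return total


rng = np.random.default_rng(0)
deterministic = [rollout_return(0.0, rng) for _ in range(100)]
noisy = [rollout_return(0.1, rng) for _ in range(100)]

det_spread = max(deterministic) - min(deterministic)
noisy_spread = max(noisy) - min(noisy)
print("spread without noise:", det_spread)
print("spread with noise:   ", noisy_spread)
```

The start state is fixed here; with a random initial state the noise-free returns would still vary, which matches the abstract's "up to the initial state distribution" caveat.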

Abstract

In this paper, we present a policy gradient method that avoids exploratory noise injection and performs policy search over the deterministic landscape. By avoiding noise injection, all sources of estimation variance can be eliminated in systems with deterministic dynamics (up to the initial state distribution). Since deterministic policy regularization is impossible using traditional non-metric measures such as the KL divergence, we derive a Wasserstein-based quadratic model for our purposes. We state conditions on the system model under which it is possible to establish a monotonic policy improvement guarantee, propose a surrogate function for policy gradient estimation, and show that it is possible to compute exact advantage estimates if both the state transition model and the policy are deterministic. Finally, we describe two novel robotic control environments -- one with non-local…
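One way to see why the KL divergence is unsuitable for deterministic policies is to shrink the noise of two Gaussian action distributions toward zero. The sketch below uses standard closed forms for isotropic Gaussians with a shared standard deviation; the two action vectors are made-up values, and this is not the Wasserstein-based quadratic model derived in the paper. The KL divergence grows without bound as the noise vanishes, while the 2-Wasserstein distance stays equal to the distance between the two actions.

```python
import numpy as np


def kl_isotropic(mu_p, mu_q, sigma):
    """KL(N(mu_p, sigma^2 I) || N(mu_q, sigma^2 I)) = ||mu_p - mu_q||^2 / (2 sigma^2)."""
    diff = np.asarray(mu_p, dtype=float) - np.asarray(mu_q, dtype=float)
    return float(diff @ diff) / (2.0 * sigma ** 2)


def w2_isotropic(mu_p, mu_q):
    """2-Wasserstein distance between Gaussians with identical covariance.

    The shared covariance cancels, leaving only the distance between means.
    """
    diff = np.asarray(mu_p, dtype=float) - np.asarray(mu_q, dtype=float)
    return float(np.linalg.norm(diff))


action_old = [0.0, 0.3]   # hypothetical action of the current policy
action_new = [0.4, 0.0]   # hypothetical action of the candidate policy

for sigma in (1.0, 0.1, 0.01, 0.001):
    kl = kl_isotropic(action_old, action_new, sigma)
    w2 = w2_isotropic(action_old, action_new)
    print(f"sigma={sigma:<6} KL={kl:<14.4g} W2={w2:.4g}")
```

In the deterministic limit the KL penalty carries no usable information about how far apart two policies are, whereas a metric such as the Wasserstein distance still does.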

Code & Models

Repositories

ehsansaleh/code_tdpo
PyTorch · Official

Videos

Truly Deterministic Policy Optimization · SlidesLive

Taxonomy

Topics: Model Reduction and Neural Networks · Nuclear reactor physics and engineering · Advanced Neural Network Applications

Methods: Batch Normalization · Weight Decay · Adam · Convolution · Dense Connections · Experience Replay · Trust Region Policy Optimization · Deep Deterministic Policy Gradient