Proximal Deterministic Policy Gradient

Marco Maggipinto; Gian Antonio Susto; Pratik Chaudhari

arXiv:2008.00759·cs.LG·August 4, 2020


TL;DR

This paper proposes two techniques for improving off-policy RL algorithms: formulating off-policy RL as a stochastic proximal point iteration, and using the two value functions common in state-of-the-art off-policy algorithms to bootstrap a better action-value estimate. The authors report significant performance gains on standard continuous-control benchmarks.

Contribution

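It formulates off-policy RL as a stochastic proximal point iteration, in which the target network is the variable of optimization and the value network computes the proximal operator. It also uses the two value functions already present in state-of-the-art off-policy algorithms to bootstrap an improved action-value estimate at a limited increase in computational cost.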

Findings

01

Significant performance improvements on continuous-control benchmarks

02

Bootstrapping with the two existing value functions gives better action-value estimates

03

Novel proximal point iteration formulation for off-policy RL

Abstract

This paper introduces two simple techniques to improve off-policy Reinforcement Learning (RL) algorithms. First, we formulate off-policy RL as a stochastic proximal point iteration. The target network plays the role of the variable of optimization and the value network computes the proximal operator. Second, we exploit the two value functions commonly employed in state-of-the-art off-policy algorithms to provide an improved action value estimate through bootstrapping, with a limited increase in computational resources. Further, we demonstrate significant performance improvement over state-of-the-art algorithms on standard continuous-control RL benchmarks.
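To make the first idea concrete, here is a minimal sketch of a generic proximal point iteration, x_{k+1} = argmin_x f(x) + (x - x_k)^2 / (2 * lam), on a toy scalar problem. It is not the paper's algorithm: the objective, the step sizes, and the names `proximal_point`, `target`, and `online` are assumptions chosen for illustration, and the loop is deterministic where the paper's formulation is stochastic. The slow variable stands in for the target network, and the inner loop that approximately solves the regularized problem stands in for the value network computing the proximal operator.

```python
def proximal_point(grad_f, x0, lam=1.0, outer_steps=50, inner_steps=100, lr=0.05):
    """Approximate proximal point iteration on a scalar variable.

    Each outer step approximately solves
        argmin_x f(x) + (x - target)^2 / (2 * lam)
    by plain gradient descent, then moves the target to that solution.
    """
    target = float(x0)  # slow variable (analogous to the target network)
    for _ in range(outer_steps):
        online = target  # fast variable (analogous to the value network)
        for _ in range(inner_steps):
            # Gradient of the proximally regularized objective.
            g = grad_f(online) + (online - target) / lam
            online -= lr * g
        target = online  # proximal step finished: update the slow variable
    return target


# Hypothetical toy objective: f(x) = 0.5 * (2x - 6)^2, minimized at x = 3.
def grad_f(x):
    return 2.0 * (2.0 * x - 6.0)


x_star = proximal_point(grad_f, x0=0.0)
print(x_star)
```

In this sketch, a larger `lam` weakens the pull toward the previous target, so each outer step moves further; a smaller `lam` keeps the fast variable close to the slow one, which is the stabilizing role the proximal term plays.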
