Proximal Deterministic Policy Gradient
Marco Maggipinto, Gian Antonio Susto, Pratik Chaudhari

TL;DR
This paper proposes two techniques to enhance off-policy RL algorithms by framing them as stochastic proximal point iterations and leveraging dual value functions for better action value estimates, leading to improved performance on benchmarks.
Contribution
It introduces a novel formulation of off-policy RL as a stochastic proximal point iteration and utilizes dual value functions for more accurate value estimation.
Findings
Significant performance improvements on continuous-control benchmarks
Effective use of dual value functions for better value estimates
Novel proximal point iteration formulation for off-policy RL
Abstract
This paper introduces two simple techniques to improve off-policy Reinforcement Learning (RL) algorithms. First, we formulate off-policy RL as a stochastic proximal point iteration. The target network plays the role of the variable of optimization and the value network computes the proximal operator. Second, we exploit the two value functions commonly employed in state-of-the-art off-policy algorithms to provide an improved action value estimate through bootstrapping, with only a limited increase in computational cost. Further, we demonstrate significant performance improvement over state-of-the-art algorithms on standard continuous-control RL benchmarks.
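To make the first idea concrete, here is a minimal sketch of a generic stochastic proximal point iteration on a toy least-squares problem. It is not the authors' implementation and uses no RL loss. The names `prox_step`, `stochastic_proximal_point`, `lam`, and the inner gradient-descent solver are all assumptions made for illustration. In the analogy suggested by the abstract, `x_target` stands in for the target network (the variable of optimization), and the inner loop stands in for the value network, which approximately computes the proximal operator before the target is updated.

```python
import numpy as np

def prox_step(grad_fn, x_target, lam, inner_steps=50, lr=0.05):
    """Approximate prox_{lam * f}(x_target) by gradient descent on
    f(y) + ||y - x_target||^2 / (2 * lam).  (Hypothetical helper.)"""
    y = x_target.copy()
    for _ in range(inner_steps):
        # Gradient of the sampled loss plus the proximal penalty term.
        y = y - lr * (grad_fn(y) + (y - x_target) / lam)
    return y

def stochastic_proximal_point(A, b, lam=1.0, outer_steps=200, batch=8, seed=0):
    """Toy stochastic proximal point iteration for min_x ||A x - b||^2.
    Each outer step samples a mini-batch and replaces the 'target'
    iterate with the approximate proximal point of the sampled loss."""
    rng = np.random.default_rng(seed)
    x_target = np.zeros(A.shape[1])
    for _ in range(outer_steps):
        idx = rng.choice(A.shape[0], size=batch, replace=False)
        A_b, b_b = A[idx], b[idx]
        # Gradient of the mini-batch loss ||A_b y - b_b||^2 / (2 * batch).
        grad_fn = lambda y: A_b.T @ (A_b @ y - b_b) / batch
        x_target = prox_step(grad_fn, x_target, lam)
    return x_target

# Noise-free synthetic data, so every mini-batch shares the same minimizer.
data_rng = np.random.default_rng(1)
A = data_rng.normal(size=(64, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true
x_hat = stochastic_proximal_point(A, b)
```

The penalty weight `lam` controls how far each update may move from the current target: a small `lam` keeps the new iterate close to `x_target`, while a large `lam` approaches plain minimization of the sampled loss.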
