Deterministic Policy Optimization by Combining Pathwise and Score Function Estimators for Discrete Action Spaces
Daniel Levy, Stefano Ermon

TL;DR
This paper introduces a hybrid policy gradient method that combines score function and pathwise estimators, enabling efficient policy optimization in discrete action spaces with significant sample complexity improvements.
Contribution
A novel hybrid policy gradient estimator for discrete actions, derived by approximating the dynamics as an expectation over the next state under the current policy, which improves sample efficiency in reinforcement learning.
Findings
Reduced sample complexity by a factor of 1.7 to 25.
Applicable to both discrete and continuous action spaces.
Demonstrated effectiveness on Cart Pole, Acrobot, Mountain Car, and Hand Mass environments.
Abstract
Policy optimization methods have shown great promise in solving complex reinforcement and imitation learning tasks. While model-free methods are broadly applicable, they often require many samples to optimize complex policies. Model-based methods greatly improve sample-efficiency but at the cost of poor generalization, requiring a carefully handcrafted model of the system dynamics for each task. Recently, hybrid methods have been successful in trading off applicability for improved sample-complexity. However, these have been limited to continuous action spaces. In this work, we present a new hybrid method based on an approximation of the dynamics as an expectation over the next state under the current policy. This relaxation allows us to derive a novel hybrid policy gradient estimator, combining score function and pathwise derivative estimators, that is applicable to discrete action…
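
To make the two ingredients named in the abstract concrete, here is a minimal sketch on a one-step toy problem with two discrete actions. It is not the authors' estimator: the dynamics `step`, the quadratic `cost`, the sigmoid policy, and all constants are hypothetical choices made only for illustration. It compares a sampled score function gradient with the pathwise gradient obtained after replacing the next state by its expectation under the policy. Note that the relaxed gradient differentiates the cost of the expected next state, so in general it differs from the exact gradient of the expected cost.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def step(s, a):
    # Hypothetical toy dynamics: action 1 moves right, action 0 moves left.
    return s + (1.0 if a == 1 else -1.0)

def cost(s):
    # Hypothetical quadratic cost around a target state of 2.0.
    return (s - 2.0) ** 2

def dcost(s):
    # Derivative of the cost with respect to the state.
    return 2.0 * (s - 2.0)

def exact_grad(theta, s):
    # d/dtheta of E_a[cost(step(s, a))] with pi(a=1) = sigmoid(theta).
    p = sigmoid(theta)
    return p * (1.0 - p) * (cost(step(s, 1)) - cost(step(s, 0)))

def score_function_grad(theta, s, n, rng):
    # Monte Carlo estimate: average of cost * d/dtheta log pi(a).
    p = sigmoid(theta)
    total = 0.0
    for _ in range(n):
        a = 1 if rng.random() < p else 0
        dlogp = (1.0 - p) if a == 1 else -p
        total += cost(step(s, a)) * dlogp
    return total / n

def relaxed_pathwise_grad(theta, s):
    # Relaxation: replace the sampled next state by its expectation under
    # the policy, then differentiate the cost through it (chain rule).
    p = sigmoid(theta)
    s_next = p * step(s, 1) + (1.0 - p) * step(s, 0)
    ds_dtheta = p * (1.0 - p) * (step(s, 1) - step(s, 0))
    return dcost(s_next) * ds_dtheta

rng = random.Random(0)
theta, s = 0.3, 0.0
g_exact = exact_grad(theta, s)
g_sf = score_function_grad(theta, s, 20000, rng)
g_pw = relaxed_pathwise_grad(theta, s)
print(g_exact, g_sf, g_pw)
```

In this toy setting the score function estimate is unbiased but noisy, so it needs many samples to approach `g_exact`, while the relaxed pathwise gradient is deterministic and needs no sampling, at the price of a bias introduced by the relaxation.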
