Softmax Deep Double Deterministic Policy Gradients
Ling Pan, Qingpeng Cai, Longbo Huang

TL;DR
This paper applies the Boltzmann softmax operator to value estimation in continuous-control reinforcement learning, targeting both the overestimation bias of DDPG and the underestimation bias of TD3.
Contribution
It proposes two algorithms built on the softmax operator, Softmax Deep Deterministic Policy Gradients (SD2) and Softmax Deep Double Deterministic Policy Gradients (SD3), as a way to reduce value-estimation bias in actor-critic methods for continuous control.
Findings
SD3 outperforms state-of-the-art methods in continuous control tasks.
The softmax operator smooths the optimization landscape, aiding learning.
A theoretical analysis of the softmax operator in continuous action spaces supports its use in actor-critic algorithms.
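To make the operator concrete, here is a minimal sketch of a Boltzmann softmax over a finite set of action values. It is an illustration under stated assumptions, not the paper's implementation: the function name `boltzmann_softmax`, the inverse temperature `beta`, and the sample values in `q_samples` are hypothetical, and a finite list of sampled actions stands in for the continuous action space.

```python
import numpy as np

def boltzmann_softmax(q_values, beta):
    """Softmax-weighted average of action values.

    beta is an inverse temperature: beta = 0 gives the plain mean,
    and large beta approaches the max.
    """
    q = np.asarray(q_values, dtype=float)
    # Subtract the max before exponentiating for numerical stability;
    # this leaves the normalized weights unchanged.
    weights = np.exp(beta * (q - q.max()))
    weights /= weights.sum()
    return float(np.dot(weights, q))

# Hypothetical critic outputs for three sampled actions in one state.
q_samples = [1.0, 2.0, 4.0]

mean_value = boltzmann_softmax(q_samples, beta=0.0)   # uniform weights
soft_value = boltzmann_softmax(q_samples, beta=1.0)   # between mean and max
near_max = boltzmann_softmax(q_samples, beta=50.0)    # close to the max
```

The result interpolates between the mean and the maximum of the sampled values as `beta` grows, which is why the operator can serve as a softer alternative to a hard max in a value target.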
Abstract
A widely-used actor-critic reinforcement learning algorithm for continuous control, Deep Deterministic Policy Gradients (DDPG), suffers from the overestimation problem, which can negatively affect the performance. Although the state-of-the-art Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm mitigates the overestimation issue, it can lead to a large underestimation bias. In this paper, we propose to use the Boltzmann softmax operator for value function estimation in continuous control. We first theoretically analyze the softmax operator in continuous action space. Then, we uncover an important property of the softmax operator in actor-critic algorithms, i.e., it helps to smooth the optimization landscape, which sheds new light on the benefits of the operator. We also design two new algorithms, Softmax Deep Deterministic Policy Gradients (SD2) and Softmax Deep Double…
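The two biases named in the abstract can be illustrated with a toy experiment. This is a statistical sketch only, not the paper's analysis: it assumes two independent, unbiased, unit-variance Gaussian estimates of a true value of zero, and all names (`estimates`, `max_bias`, `min_bias`) are hypothetical. Taking the maximum of noisy estimates is biased upward, while taking the minimum of two estimates, as in TD3's clipped double Q-learning, is biased downward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent noisy estimates of a true value of 0, repeated many times.
estimates = rng.normal(loc=0.0, scale=1.0, size=(100_000, 2))

# Max over noisy estimates: positive bias (overestimation).
max_bias = float(estimates.max(axis=1).mean())

# Min over two noisy estimates: negative bias (underestimation).
min_bias = float(estimates.min(axis=1).mean())

print(f"max bias: {max_bias:+.3f}, min bias: {min_bias:+.3f}")
```

In this toy setting neither the max nor the min recovers the true value of zero, which is the gap that a weighted average such as the softmax operator is meant to narrow.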
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Autonomous Vehicle Technology and Safety
Methods: Softmax
