Regularly Updated Deterministic Policy Gradient Algorithm
Shuai Han, Wenbo Zhou, Shuai Lü, Jiayu Yu

TL;DR
This paper introduces the RUD algorithm, which improves the efficiency and stability of reinforcement learning by making better use of replay data and reducing variance in Q-value estimation, outperforming previous methods in MuJoCo environments.
Contribution
The paper proposes the RUD algorithm that theoretically and empirically enhances data utilization and Q-value stability in deterministic policy gradient methods.
Findings
RUD outperforms DDPG in MuJoCo environments.
Theoretical proof of improved data usage with RUD.
Lower variance in Q-value estimation with RUD.
Abstract
The Deep Deterministic Policy Gradient (DDPG) algorithm is one of the most well-known reinforcement learning methods. However, it is inefficient and unstable in practical applications, and the bias and variance of the Q estimation in the target function are sometimes difficult to control. This paper proposes a Regularly Updated Deterministic (RUD) policy gradient algorithm to address these problems. It theoretically proves that the learning procedure with RUD can make better use of new data in the replay buffer than the traditional procedure. In addition, the low variance of the Q value in RUD makes it more suitable for the current Clipped Double Q-learning strategy. The paper presents a comparison experiment against previous methods, an ablation experiment with the original DDPG, and other analytical experiments in MuJoCo environments. The experimental results…
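The abstract refers to the Clipped Double Q-learning strategy without spelling it out. As a rough illustration, the sketch below computes the generic clipped target known from TD3-style methods: the bootstrap value is the smaller of two target-critic estimates. This is not the RUD update schedule, which the abstract does not describe; the function name, argument names, and sample numbers are hypothetical.

```python
def clipped_double_q_target(rewards, dones, q1_next, q2_next, gamma=0.99):
    """Return TD targets using the minimum of two next-state Q estimates.

    All arguments are equal-length sequences of floats (illustrative names,
    not taken from the paper). dones holds 1.0 for terminal transitions
    and 0.0 otherwise.
    """
    targets = []
    for r, d, q1, q2 in zip(rewards, dones, q1_next, q2_next):
        # Taking the smaller estimate counteracts Q-value overestimation.
        q_min = min(q1, q2)
        # No bootstrapping past a terminal state.
        targets.append(r + gamma * (1.0 - d) * q_min)
    return targets


# Hypothetical batch of two transitions; the second one is terminal.
targets = clipped_double_q_target(
    rewards=[1.0, 0.5],
    dones=[0.0, 1.0],
    q1_next=[10.0, 4.0],
    q2_next=[8.0, 6.0],
    gamma=0.9,
)
```

Because the target takes a minimum over two estimates, noisy critics tend to pull it downward, which is one plausible reading of why a lower-variance Q value would pair well with this strategy.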
Taxonomy
Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Machine Learning and ELM
Methods: Convolution · Double Q-learning · Experience Replay · Weight Decay · Adam · Batch Normalization · Dense Connections · Deep Deterministic Policy Gradient · Q-Learning
