Action Candidate Based Clipped Double Q-learning for Discrete and Continuous Action Tasks
Haobo Jiang, Jin Xie, Jian Yang

TL;DR
This paper introduces an action candidate based clipped double estimator to improve the accuracy of maximum expected action value estimation in Double Q-learning, reducing bias and enhancing performance in both discrete and continuous tasks.
Contribution
It proposes a novel action candidate based estimator that reduces underestimation bias and extends to continuous actions, improving over traditional clipped Double Q-learning.
Findings
More accurate maximum expected action value estimation in toy environments.
Better performance on benchmark problems.
Bias control via the number of action candidates.
Abstract
Double Q-learning is a popular reinforcement learning algorithm for Markov decision process (MDP) problems. Clipped Double Q-learning, an effective variant of Double Q-learning, employs the clipped double estimator to approximate the maximum expected action value. Because the clipped double estimator has an underestimation bias, the performance of clipped Double Q-learning may degrade in some stochastic environments. In this paper, to reduce the underestimation bias, we propose an action candidate based clipped double estimator for Double Q-learning. Specifically, we first select a set of elite action candidates with high action values from one set of estimators. Then, among these candidates, we choose the highest-valued action according to the other set of estimators. Finally, we use the maximum value in the second set of estimators to clip the action value of the chosen action…
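The three steps described in the abstract can be sketched for a single state with a discrete action set. This is a minimal illustration, not the authors' implementation: the function name, the argument names, and the tie-breaking behavior are hypothetical, and because the abstract is truncated, the sketch assumes that the value being clipped is the chosen action's value under the first set of estimators.

```python
from typing import Sequence


def action_candidate_clipped_estimate(
    q_first: Sequence[float],
    q_second: Sequence[float],
    num_candidates: int,
) -> float:
    """Estimate the maximum expected action value for one state.

    q_first and q_second hold one action-value estimate per action,
    taken from two separate sets of estimators (hypothetical names).
    """
    if len(q_first) != len(q_second):
        raise ValueError("both estimators must cover the same actions")
    if not 1 <= num_candidates <= len(q_first):
        raise ValueError("num_candidates must be between 1 and the number of actions")

    # Step 1: keep the actions with the highest values under the first estimators.
    ranked = sorted(range(len(q_first)), key=lambda a: q_first[a], reverse=True)
    candidates = ranked[:num_candidates]

    # Step 2: among the candidates, pick the action the second estimators rate highest.
    chosen = max(candidates, key=lambda a: q_second[a])

    # Step 3: clip the chosen action's value by the maximum of the second estimators.
    # Assumption: the clipped value is read from the first set of estimators.
    return min(q_first[chosen], max(q_second))


q_first = [1.0, 3.0, 2.0, 0.5]
q_second = [2.5, 1.0, 2.0, 4.0]
for k in (1, 2, 4):
    print(k, action_candidate_clipped_estimate(q_first, q_second, k))
```

In this sketch, changing `num_candidates` changes which action is chosen and therefore the resulting estimate, which is the knob the Findings section refers to as bias control via the number of action candidates.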
Taxonomy
Topics: Reinforcement Learning in Robotics · Machine Learning and Data Classification · Data Stream Mining Techniques
Methods: Clipped Double Q-learning · Q-Learning · Double Q-learning
