Adapting Double Q-Learning for Continuous Reinforcement Learning
Arsenii Kuznetsov

TL;DR
This paper introduces a bias correction method for continuous reinforcement learning. The policy is a two-component mixture whose components are maximized and assessed by separate networks, which targets the source of overestimation bias. The method achieves near state-of-the-art results on a small set of MuJoCo environments.
Contribution
It proposes a bias correction approach for continuous RL, inspired by Double Q-Learning, in which a mixture policy's components are maximized and assessed by separate networks. This addresses the origin of overestimation bias rather than its consequences.
Findings
Achieves near-SOTA results on a small set of MuJoCo environments
Effectively reduces overestimation bias in continuous RL
Demonstrates the viability of mixture policies for bias correction
Abstract
The majority of off-policy reinforcement learning algorithms use overestimation bias control techniques. Most of these techniques are rooted in heuristics and primarily address the consequences of overestimation rather than its fundamental origins. In this work we present a novel approach to bias correction, similar in spirit to Double Q-Learning. We propose using a policy in the form of a mixture with two components. Each policy component is maximized and assessed by separate networks, which removes any basis for the overestimation bias. Our approach shows promising near-SOTA results on a small set of MuJoCo environments.
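The idea the abstract borrows from Double Q-Learning can be illustrated with a small numerical sketch. This is not the paper's algorithm: it is a toy tabular setting, assumed here for illustration, in which every action has a true value of zero and two independent noisy estimators stand in for the "separate networks". The function name `estimate_bias` and all of its parameters are hypothetical.

```python
import numpy as np

def estimate_bias(n_actions=10, noise_std=1.0, n_trials=20000, seed=0):
    """Compare single-estimator and cross-evaluated value estimates.

    Every true action value is 0 (an assumption of this toy setup), so the
    true maximum is 0 and any nonzero mean of an estimate is pure bias.
    """
    rng = np.random.default_rng(seed)
    # Two independent noisy estimates of the same all-zero action values.
    q_a = rng.normal(0.0, noise_std, size=(n_trials, n_actions))
    q_b = rng.normal(0.0, noise_std, size=(n_trials, n_actions))

    # Single estimator: the same noisy values both select and score the action.
    single = q_a.max(axis=1)

    # Cross evaluation: q_a selects the action, the independent q_b scores it.
    best_a = q_a.argmax(axis=1)
    double = q_b[np.arange(n_trials), best_a]

    return single.mean(), double.mean()

single_bias, double_bias = estimate_bias()
print(f"single-estimator bias: {single_bias:.3f}")
print(f"cross-evaluated bias:  {double_bias:.3f}")
```

In this toy setting the single-estimator mean comes out clearly positive, because taking a maximum over noisy values favors whichever action happened to receive the largest noise. The cross-evaluated mean stays close to zero, because the noise that drove the selection is independent of the noise in the scoring estimate.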
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Q-Learning
