Adapting Double Q-Learning for Continuous Reinforcement Learning

Arsenii Kuznetsov

arXiv:2309.14471·cs.LG·September 27, 2023

TL;DR

This paper introduces a bias correction method for continuous reinforcement learning, similar in spirit to Double Q-Learning: the policy is a two-component mixture, and each component is maximized and assessed by separate networks. The approach targets the source of overestimation bias and reports near state-of-the-art results on a small set of MuJoCo environments.

Contribution

It proposes a bias correction approach for continuous RL inspired by Double Q-Learning: a two-component mixture policy whose components are maximized and assessed by separate networks, addressing the origin of overestimation bias rather than its consequences.
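For readers unfamiliar with the bias in question, the following is a minimal numerical sketch of why selecting and evaluating an action with the same noisy estimate overestimates, and why the Double Q-Learning idea of using a second, independent estimate for evaluation does not. It is a toy illustration, not the paper's experiment: the function name `estimator_bias`, the Gaussian noise model, and the choice of ten actions with true value zero are all assumptions made here.

```python
import numpy as np


def estimator_bias(n_actions=10, n_trials=10_000, seed=0):
    """Compare single and double estimators of max_a Q(a) under noise.

    Assumption: every action has true value 0, so the true maximum is 0
    and any nonzero average below is pure estimation bias.
    """
    rng = np.random.default_rng(seed)
    # Two independent noisy estimates of the same (all-zero) action values.
    q_a = rng.normal(size=(n_trials, n_actions))
    q_b = rng.normal(size=(n_trials, n_actions))

    # Single estimator: the same estimate selects and evaluates the action.
    single = q_a.max(axis=1).mean()

    # Double estimator: q_a selects the action, q_b evaluates it.
    best = q_a.argmax(axis=1)
    double = q_b[np.arange(n_trials), best].mean()
    return single, double


single, double = estimator_bias()
print(f"single estimator bias: {single:.3f}")
print(f"double estimator bias: {double:.3f}")
```

Under these assumptions the single estimator is biased well above zero, while the double estimator stays close to zero because the noise that drove the selection is independent of the noise in the evaluation.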

Findings

01. Achieves near-SOTA results on a small set of MuJoCo environments

02. Effectively reduces overestimation bias in continuous RL

03. Demonstrates the viability of mixture policies for bias correction

Abstract

The majority of off-policy reinforcement learning algorithms use overestimation bias control techniques. Most of these techniques are rooted in heuristics, primarily addressing the consequences of overestimation rather than its fundamental origins. In this work we present a novel approach to bias correction, similar in spirit to Double Q-Learning. We propose using a policy in the form of a mixture with two components. Each policy component is maximized and assessed by separate networks, which removes any basis for the overestimation bias. Our approach shows promising near-SOTA results on a small set of MuJoCo environments.
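One possible reading of "each policy component is maximized and assessed by separate networks" can be sketched as a cross-evaluated bootstrap target. This is an interpretation, not the paper's algorithm: the abstract does not specify the update rule, so the pairing (component i is trained against critic i and valued by the other critic), the deterministic components, and every name below (`cross_evaluated_target`, `actors`, `target_critics`, `component`) are hypothetical.

```python
def cross_evaluated_target(reward, next_state, done, actors, target_critics,
                           component, gamma=0.99):
    """Bootstrap target for one mixture component (hypothetical sketch).

    Assumption: actors[i] is the component trained to maximize critic i,
    so its action is valued by the *other* critic, mirroring how
    Double Q-Learning separates action selection from evaluation.
    """
    # Action proposed by the chosen mixture component.
    next_action = actors[component](next_state)
    # Value read from the critic this component was NOT optimized against.
    q_next = target_critics[1 - component](next_state, next_action)
    # Standard one-step target; no bootstrap on terminal transitions.
    return reward + gamma * (1.0 - done) * q_next


# Toy stand-ins for the networks (assumed, for illustration only).
actors = [lambda s: s + 1.0, lambda s: s - 1.0]
target_critics = [lambda s, a: 10.0 * a, lambda s, a: 100.0 * a]

target_0 = cross_evaluated_target(1.0, 0.0, 0.0, actors, target_critics,
                                  component=0, gamma=0.5)
target_1 = cross_evaluated_target(1.0, 0.0, 0.0, actors, target_critics,
                                  component=1, gamma=0.5)
```

In a full training loop the component index would be sampled from the mixture weights for each transition; it is passed explicitly here so the sketch stays deterministic.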

Taxonomy

Topics: Reinforcement Learning in Robotics

Methods: Q-Learning