Decoupled Exploration and Exploitation Policies for Sample-Efficient Reinforcement Learning

William F. Whitney; Michael Bloesch; Jost Tobias Springenberg; Abbas Abdolmaleki; Kyunghyun Cho; Martin Riedmiller

arXiv:2101.09458·cs.LG·July 2, 2021

Open Access

TL;DR

This paper introduces DEEP, a method that decouples the exploration policy from the task (exploitation) policy in reinforcement learning. It targets the bias and slow coverage that bonus-based exploration shows in the few-sample regime, and it improves data efficiency in sparse-reward environments.

Contribution

The paper proposes DEEP, which trains separate exploration and exploitation policies and can be combined with existing off-policy algorithms without modifying them.

Findings

01. DEEP improves data efficiency in sparse-reward tasks.

02. DEEP incurs no performance penalty in dense-reward environments.

03. Decoupling the task policy from the exploration policy makes directed exploration effective for sample-efficient continuous control.

Abstract

Despite the close connection between exploration and sample efficiency, most state-of-the-art reinforcement learning algorithms include no considerations for exploration beyond maximizing the entropy of the policy. In this work we address this seeming missed opportunity. We observe that the most common formulation of directed exploration in deep RL, known as bonus-based exploration (BBE), suffers from bias and slow coverage in the few-sample regime. This causes BBE to be actively detrimental to policy learning in many control tasks. We show that by decoupling the task policy from the exploration policy, directed exploration can be highly effective for sample-efficient continuous control. Our method, Decoupled Exploration and Exploitation Policies (DEEP), can be combined with any off-policy RL algorithm without modification. When used in conjunction with soft actor-critic, DEEP incurs no…
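The decoupling idea in the abstract can be illustrated with a small tabular sketch. This is not the authors' implementation: the chain environment, the count-based bonus, tabular Q-learning as the off-policy learner, and the choice to combine the two policies by summing their action values are all assumptions made for illustration, and every function name is hypothetical. The point it shows is that the task values are learned from the task reward alone, so the exploration bonus never biases them, while the behavior policy that collects data uses both.

```python
import math
import random


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of action values."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def behavior_probs(q_task_row, q_explore_row, temperature=1.0):
    """Behavior policy used only for data collection.

    Assumption: task and exploration action values are summed before the
    softmax. The source does not specify how the policies are combined.
    """
    combined = [t + e for t, e in zip(q_task_row, q_explore_row)]
    return softmax(combined, temperature)


def q_update(q, s, a, r, s_next, done, alpha, gamma):
    """One off-policy Q-learning update, applied in place."""
    target = r if done else r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


def greedy_task_action(q_task, s):
    """Task (exploitation) policy: greedy on task values only."""
    return max(range(len(q_task[s])), key=lambda a: q_task[s][a])


def train(n_states=8, episodes=200, horizon=30, alpha=0.5, gamma=0.95, seed=0):
    """Sparse-reward chain: action 0 moves left, action 1 moves right.

    The only task reward is 1.0 for reaching the last state.
    """
    rng = random.Random(seed)
    q_task = [[0.0, 0.0] for _ in range(n_states)]
    q_explore = [[0.0, 0.0] for _ in range(n_states)]
    visits = [0] * n_states
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            probs = behavior_probs(q_task[s], q_explore[s])
            a = 0 if rng.random() < probs[0] else 1
            s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
            done = s_next == n_states - 1
            task_reward = 1.0 if done else 0.0
            # Hypothetical count-based novelty bonus; it decays with visits.
            visits[s_next] += 1
            bonus = 1.0 / math.sqrt(visits[s_next])
            # The same transition trains both value tables, each with its
            # own reward, so the bonus never enters the task values.
            q_update(q_task, s, a, task_reward, s_next, done, alpha, gamma)
            q_update(q_explore, s, a, bonus, s_next, done, alpha, gamma)
            s = s_next
            if done:
                break
    return q_task, q_explore, visits
```

After `q_task, q_explore, visits = train()`, calling `greedy_task_action(q_task, s)` gives the task policy's action in state `s`, while `behavior_probs` is used only during data collection. In a bonus-based setup the bonus would instead be added to the task reward inside a single value table, which is the coupling the abstract argues against.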

Taxonomy

Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Advanced Bandit Algorithms Research