Decoupled Exploration and Exploitation Policies for Sample-Efficient Reinforcement Learning
William F. Whitney, Michael Bloesch, Jost Tobias Springenberg, Abbas Abdolmaleki, Kyunghyun Cho, Martin Riedmiller

TL;DR
This paper introduces DEEP, a method that decouples the exploration policy from the task (exploitation) policy in reinforcement learning. It improves sample efficiency, especially in sparse-reward environments, by avoiding the bias and slow coverage of traditional bonus-based exploration.
Contribution
The paper proposes DEEP, a novel approach that separates exploration and exploitation policies, enhancing sample efficiency without modifying existing off-policy algorithms.
Findings
DEEP improves data efficiency in sparse-reward tasks.
DEEP incurs no performance penalty in dense-reward environments.
Decoupling the exploration policy from the task policy makes directed exploration effective in continuous control.
Abstract
Despite the close connection between exploration and sample efficiency, most state of the art reinforcement learning algorithms include no considerations for exploration beyond maximizing the entropy of the policy. In this work we address this seeming missed opportunity. We observe that the most common formulation of directed exploration in deep RL, known as bonus-based exploration (BBE), suffers from bias and slow coverage in the few-sample regime. This causes BBE to be actively detrimental to policy learning in many control tasks. We show that by decoupling the task policy from the exploration policy, directed exploration can be highly effective for sample-efficient continuous control. Our method, Decoupled Exploration and Exploitation Policies (DEEP), can be combined with any off-policy RL algorithm without modification. When used in conjunction with soft actor-critic, DEEP incurs no…
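To make the decoupling idea concrete, here is a minimal tabular sketch. It is not the paper's implementation, which targets continuous control with soft actor-critic. Everything in it is an assumption chosen for illustration: the chain environment, the hypothetical function `run_decoupled_q`, the count-based bonus `1 / sqrt(n)`, and the choice to act greedily on the sum of the two value tables. The one point it is meant to show is that the task critic learns from the task reward only, the exploration critic learns from the novelty bonus only, and both are updated off-policy from the same transitions.

```python
import math
import random


def run_decoupled_q(n_states=8, horizon=20, episodes=200, alpha=0.5,
                    gamma=0.95, seed=0):
    """Toy chain MDP with separate task and exploration value tables.

    Illustrative assumptions only: tabular Q-learning, a count-based
    bonus, and a behaviour policy that is greedy on the summed values.
    """
    rng = random.Random(seed)
    goal = n_states - 1
    # One table per objective; action 0 moves left, action 1 moves right.
    q_task = [[0.0, 0.0] for _ in range(n_states)]     # task reward only
    q_explore = [[0.0, 0.0] for _ in range(n_states)]  # novelty bonus only
    counts = [[0, 0] for _ in range(n_states)]
    total_steps = 0

    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            # Behaviour policy: greedy on the sum of both tables, random ties.
            scores = [q_task[s][a] + q_explore[s][a] for a in (0, 1)]
            best = max(scores)
            a = rng.choice([i for i in (0, 1) if scores[i] == best])

            s_next = min(goal, s + 1) if a == 1 else max(0, s - 1)
            done = s_next == goal
            reward = 1.0 if done else 0.0  # sparse: paid only at the goal

            counts[s][a] += 1
            total_steps += 1
            bonus = 1.0 / math.sqrt(counts[s][a])  # assumed novelty bonus

            # Both tables get an off-policy Q-learning update from the same
            # transition; the bonus never enters the task table.
            boot_task = 0.0 if done else gamma * max(q_task[s_next])
            boot_expl = 0.0 if done else gamma * max(q_explore[s_next])
            q_task[s][a] += alpha * (reward + boot_task - q_task[s][a])
            q_explore[s][a] += alpha * (bonus + boot_expl - q_explore[s][a])

            if done:
                break
            s = s_next

    # Greedy task policy for every non-goal state, read from q_task alone.
    greedy_task = [max((0, 1), key=lambda a: q_task[s][a]) for s in range(goal)]
    return q_task, q_explore, counts, total_steps, greedy_task
```

Calling `run_decoupled_q()` returns both tables, the visit counts, the number of environment steps, and the greedy task policy. In this sketch the task values stay bounded by the task return because the bonus is confined to `q_explore`; a single-table bonus-based variant would instead fold the bonus into the same values used for the final policy.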
Taxonomy
Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Advanced Bandit Algorithms Research
