TL;DR
This paper introduces a model-free, off-policy reinforcement learning method that uses long-term visitation counts to improve exploration in sparse-reward environments. It outperforms existing exploration methods, especially when some rewards create suboptimal modes of the objective that would otherwise end exploration early.
Contribution
The paper proposes an exploration strategy based on long-term visitation values, decouples exploration from exploitation by learning them as separate functions, and introduces new benchmarks for evaluation.
Findings
Outperforms existing exploration methods in sparse reward environments
Scales well with environment size
Effective in environments with suboptimal reward modes
Abstract
Reinforcement learning with sparse rewards is still an open challenge. Classic methods rely on getting feedback via extrinsic rewards to train the agent, and in situations where this occurs very rarely the agent learns slowly or cannot learn at all. Similarly, if the agent also receives rewards that create suboptimal modes of the objective function, it will likely stop exploring prematurely. More recent methods add auxiliary intrinsic rewards to encourage exploration. However, auxiliary rewards lead to a non-stationary target for the Q-function. In this paper, we present a novel approach that (1) plans exploration actions far into the future by using a long-term visitation count, and (2) decouples exploration and exploitation by learning a separate function assessing the exploration value of the actions. Contrary to existing methods that use models of reward and dynamics, our approach…
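The two ideas in the abstract can be illustrated with a small tabular sketch. It is not the authors' implementation. It assumes a hypothetical chain environment with a single sparse reward at the right end, a count-based intrinsic reward of 1/sqrt(n(s,a)), an optimistic initial value for the exploration table, and a behavior policy that is greedy on Q + kappa * W. The names run_chain, W, kappa, and gamma_w are illustrative and do not come from the paper.

```python
import numpy as np


def run_chain(n_states=6, episodes=200, horizon=30, gamma=0.99,
              gamma_w=0.9, alpha=0.5, kappa=1.0, seed=0):
    """Tabular sketch: Q learns from extrinsic reward, W from visitation counts."""
    rng = np.random.default_rng(seed)
    n_actions = 2  # 0 = left, 1 = right
    goal = n_states - 1

    Q = np.zeros((n_states, n_actions))            # exploitation values
    # Assumption: optimistic start, the largest discounted sum of
    # intrinsic rewards (each at most 1), so unvisited pairs look best.
    W = np.full((n_states, n_actions), 1.0 / (1.0 - gamma_w))
    N = np.zeros((n_states, n_actions))            # visitation counts

    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            # Behavior policy: greedy on the combined score, random tie-break.
            scores = Q[s] + kappa * W[s]
            best = np.flatnonzero(scores == scores.max())
            a = int(rng.choice(best))

            s_next = min(s + 1, goal) if a == 1 else max(s - 1, 0)
            done = s_next == goal
            r = 1.0 if done else 0.0               # sparse extrinsic reward

            N[s, a] += 1
            r_w = 1.0 / np.sqrt(N[s, a])           # assumed intrinsic reward

            # Two separate off-policy (Q-learning style) updates. The
            # intrinsic reward never enters the target of Q.
            q_boot = 0.0 if done else Q[s_next].max()
            w_boot = 0.0 if done else W[s_next].max()
            Q[s, a] += alpha * (r + gamma * q_boot - Q[s, a])
            W[s, a] += alpha * (r_w + gamma_w * w_boot - W[s, a])

            s = s_next
            if done:
                break
    return Q, W, N


if __name__ == "__main__":
    Q, W, N = run_chain()
    print("Greedy action per state (Q only):", Q.argmax(axis=1))
    print("Visit counts per state:", N.sum(axis=1))
```

Because W is bootstrapped from the next state's W, a rarely visited region raises the exploration value of every action leading toward it, which is one way to read "long-term" visitation. Keeping the intrinsic signal in its own table leaves the Q target driven by extrinsic reward alone.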
