TL;DR
This paper introduces MEPOL, a model-free policy-gradient method that maximizes a non-parametric estimate of state entropy for effective, task-agnostic exploration in high-dimensional continuous environments.
Contribution
The paper proposes a novel, practical, model-free policy search algorithm, MEPOL, for maximizing state entropy using a non-parametric k-nearest neighbors estimate.
Findings
MEPOL effectively learns maximum-entropy exploration policies in high-dimensional domains.
The learned policies facilitate downstream reward-based task learning.
MEPOL does not require modeling transition dynamics or estimating state distributions.
Abstract
In a reward-free environment, what is a suitable intrinsic objective for an agent to pursue so that it can learn an optimal task-agnostic exploration policy? In this paper, we argue that the entropy of the state distribution induced by finite-horizon trajectories is a sensible target. In particular, we present a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), to learn a policy that maximizes a non-parametric, k-nearest neighbors estimate of the state distribution entropy. In contrast to known methods, MEPOL is completely model-free, as it requires neither estimating the state distribution of any policy nor modeling the transition dynamics. We then show empirically that MEPOL learns a maximum-entropy exploration policy in high-dimensional, continuous-control domains, and that this policy facilitates learning a variety of meaningful…
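To make the idea of a non-parametric, k-nearest neighbors entropy estimate concrete, the sketch below implements the classical Kozachenko-Leonenko estimator on a batch of sampled states. It is an illustration under stated assumptions, not the estimator MEPOL actually optimizes: the abstract does not give the exact form, and the function names, the choice of k = 4, and the toy 2-D data are all hypothetical. The sketch also assumes that all states are distinct, since a zero neighbor distance would make the logarithm undefined.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def log_unit_ball_volume(d):
    """Log volume of the d-dimensional Euclidean unit ball."""
    return (d / 2.0) * math.log(math.pi) - math.lgamma(d / 2.0 + 1.0)


def knn_entropy(states, k=4):
    """Kozachenko-Leonenko k-NN estimate of differential entropy, in nats.

    `states` is a list of equal-length tuples (hypothetical sampled states).
    Assumes all points are distinct. Brute-force O(N^2) neighbor search.
    """
    n = len(states)
    d = len(states[0])
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of states")
    log_r_sum = 0.0
    for i, s in enumerate(states):
        # Distances from state i to every other state, smallest first.
        dists = sorted(math.dist(s, t) for j, t in enumerate(states) if j != i)
        log_r_sum += math.log(dists[k - 1])  # distance to the k-th neighbor
    return (digamma_int(n) - digamma_int(k)
            + log_unit_ball_volume(d) + d * log_r_sum / n)


# Toy data: states spread over a unit square versus a square twice as wide.
random.seed(0)
narrow = [(random.random(), random.random()) for _ in range(300)]
wide = [(2.0 * x, 2.0 * y) for x, y in narrow]

h_narrow = knn_entropy(narrow)
h_wide = knn_entropy(wide)
print(f"narrow: {h_narrow:.3f} nats, wide: {h_wide:.3f} nats")
```

The estimate depends only on distances between sampled states, so no density model or transition model is fitted. Doubling the spread of the states in both dimensions raises the estimate by 2 log 2 nats, which is the sense in which a policy that covers more of the state space scores higher under this kind of objective.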
