Task-Agnostic Exploration via Policy Gradient of a Non-Parametric State Entropy Estimate

Mirco Mutti; Lorenzo Pratissoli; Marcello Restelli

arXiv:2007.04640·cs.LG·March 2, 2021

TL;DR

This paper introduces MEPOL, a model-free policy-gradient method that maximizes a non-parametric estimate of state entropy for effective, task-agnostic exploration in high-dimensional continuous environments.

Contribution

The paper proposes a novel, practical, model-free policy search algorithm, MEPOL, for maximizing state entropy using a non-parametric k-nearest neighbors estimate.

Findings

1. MEPOL effectively learns maximum-entropy exploration policies in high-dimensional domains.

2. The learned policies facilitate downstream reward-based task learning.

3. MEPOL does not require modeling transition dynamics or estimating state distributions.

Abstract

In a reward-free environment, what is a suitable intrinsic objective for an agent to pursue so that it can learn an optimal task-agnostic exploration policy? In this paper, we argue that the entropy of the state distribution induced by finite-horizon trajectories is a sensible target. In particular, we present a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), to learn a policy that maximizes a non-parametric, $k$-nearest neighbors estimate of the state distribution entropy. In contrast to known methods, MEPOL is completely model-free: it requires neither estimating the state distribution of any policy nor modeling the transition dynamics. We then show empirically that MEPOL can learn a maximum-entropy exploration policy in high-dimensional, continuous-control domains, and how this policy facilitates learning a variety of meaningful…
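
To make the idea of a non-parametric, $k$-nearest neighbors entropy estimate concrete, here is a minimal sketch of the classical Kozachenko-Leonenko estimator applied to a batch of sampled states. It is an illustration only and not the MEPOL objective itself: the abstract does not spell out the exact estimator or how its gradient is taken, and the names `knn_entropy`, `digamma_int`, `wide`, and `narrow` are hypothetical. The sketch assumes states are fixed-length tuples of floats with no exact duplicates, and it uses a brute-force neighbor search for clarity rather than speed.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def knn_entropy(states, k=4):
    """Kozachenko-Leonenko k-NN estimate of differential entropy, in nats.

    Assumption: `states` is a list of equal-length tuples with no exact
    duplicates (a zero neighbor distance would make the log undefined).
    """
    n = len(states)
    d = len(states[0])
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of states")
    # Log-volume of the d-dimensional unit ball: pi^(d/2) / Gamma(d/2 + 1).
    log_unit_ball = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)
    total_log_radius = 0.0
    for i, s in enumerate(states):
        # Brute-force distances from state i to every other state.
        dists = sorted(math.dist(s, o) for j, o in enumerate(states) if j != i)
        total_log_radius += math.log(dists[k - 1])  # k-th nearest neighbor
    return digamma_int(n) - digamma_int(k) + log_unit_ball + (d / n) * total_log_radius


# Toy comparison: the same 2-D points, once spread out and once shrunk by 10x.
random.seed(0)
wide = [(random.random(), random.random()) for _ in range(200)]
narrow = [(0.1 * x, 0.1 * y) for x, y in wide]
h_wide = knn_entropy(wide)
h_narrow = knn_entropy(narrow)
```

The estimate depends only on distances between sampled states, so no density model and no transition model is needed. Shrinking the visited region lowers the estimate, which is why a policy whose states cover more of the space scores higher.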

Code & Models

Repositories

muttimirco/mepol (PyTorch, official)

Videos

Task-Agnostic Exploration via Policy Gradient of a Non-Parametric State Entropy Estimate · Underline