A maximum-entropy approach to off-policy evaluation in average-reward MDPs

Nevena Lazic; Dong Yin; Mehrdad Farajtabar; Nir Levine; Dilan Gorur; Chris Harris; Dale Schuurmans

arXiv:2006.12620·cs.LG·June 24, 2020

TL;DR

This paper introduces a maximum-entropy method for off-policy evaluation in average-reward MDPs, providing finite-sample error bounds for ergodic linear cases and a new distribution estimation approach for more general settings.

Contribution

The paper presents the first finite-sample error bound for off-policy evaluation in ergodic linear average-reward MDPs, and proposes a maximum-entropy method for estimating stationary distributions when the feature dynamics are only approximately linear.

Findings

1. Finite-sample error bounds for ergodic linear MDPs.
2. Maximum-entropy stationary-distribution estimation for approximately linear feature dynamics and arbitrary rewards.
3. Successful empirical validation across multiple environments.

Abstract

This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs). For MDPs that are ergodic and linear (i.e., where rewards and dynamics are linear in some known features), we provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases. In a more general setting, when the feature dynamics are approximately linear and for arbitrary rewards, we propose a new approach for estimating stationary distributions with function approximation. We formulate this problem as finding the maximum-entropy distribution subject to matching feature expectations under empirical dynamics. We show that this results in an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning. We demonstrate the…
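The maximum-entropy formulation in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a hypothetical 4-state chain `P` (standing in for the dynamics under the target policy, where the paper would use an empirical model over state-action pairs), hypothetical 2-dimensional features `PHI`, and plain gradient descent on the dual. It maximizes entropy subject to the feature-matching constraint sum_x d(x) (phi(x) - E[phi(x') | x]) = 0, whose solution takes the exponential-family form d(x) ∝ exp(theta · g(x)).

```python
import math

# Hypothetical 4-state chain induced by the target policy (each row sums to 1).
P = [
    [0.50, 0.50, 0.00, 0.00],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.00, 0.00, 0.50, 0.50],
]
# Hypothetical 2-dimensional features phi(x), one row per state.
PHI = [[0.0, 1.0], [1 / 3, 0.0], [2 / 3, 0.0], [1.0, 0.0]]


def feature_residuals(P, PHI):
    """g(x) = phi(x) - E[phi(x') | x]; its mean is zero under a stationary d."""
    n, k = len(PHI), len(PHI[0])
    return [
        [PHI[x][j] - sum(P[x][y] * PHI[y][j] for y in range(n)) for j in range(k)]
        for x in range(n)
    ]


def gibbs(theta, G):
    """Exponential-family distribution d(x) proportional to exp(theta . g(x))."""
    scores = [sum(t * g for t, g in zip(theta, row)) for row in G]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]


def max_entropy_distribution(G, step_size=2.0, n_iters=5000):
    """Minimize the convex dual log sum_x exp(theta . g(x)) by gradient descent."""
    theta = [0.0] * len(G[0])
    for _ in range(n_iters):
        d = gibbs(theta, G)
        # The dual gradient is the expected residual E_d[g(x)].
        grad = [sum(d[x] * G[x][j] for x in range(len(G))) for j in range(len(theta))]
        theta = [t - step_size * gj for t, gj in zip(theta, grad)]
    return gibbs(theta, G), theta


G = feature_residuals(P, PHI)
d_hat, theta_hat = max_entropy_distribution(G)
residual = [sum(d_hat[x] * G[x][j] for x in range(len(G))) for j in range(len(G[0]))]
print("estimated distribution:", [round(p, 4) for p in d_hat])
print("constraint residual:", residual)
```

For this toy chain the constraints reduce to d(0) = d(3) and d(1) = 2 d(0), so the maximum-entropy solution is proportional to (1/√2, √2, 1, 1/√2). The chain's true stationary distribution, (1/6, 1/3, 1/3, 1/6), satisfies the same constraints but has lower entropy: with only two features, the estimate matches feature expectations rather than the full distribution.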

Videos

A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs (SlidesLive)

Taxonomy

Topics: Reinforcement Learning in Robotics