Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling

Tengyang Xie; Yifei Ma; Yu-Xiang Wang

arXiv:1906.03393 · cs.LG · April 2, 2020 · 28 cites

TL;DR

This paper introduces a marginalized importance sampling (MIS) estimator for off-policy evaluation in reinforcement learning. Its estimation error depends polynomially on the horizon $H$, rather than exponentially as with standard importance sampling, and the claim is backed by both theoretical guarantees and empirical validation.

Contribution

The paper proposes an MIS estimator that reduces variance in off-policy evaluation and gives what it describes as the first estimation error bound with polynomial dependence on the horizon $H$, supported by theoretical analysis and experiments.

Findings

01 MIS estimator achieves lower mean-squared error than existing methods.

02 Theoretical bound matches the Cramér-Rao lower bound up to a factor of $H$.

03 Empirical results demonstrate superiority in complex RL environments.

Abstract

Motivated by the many real-world applications of reinforcement learning (RL) that require safe-policy iterations, we consider the problem of off-policy evaluation (OPE) -- the problem of evaluating a new policy using the historical data obtained by different behavior policies -- under the model of nonstationary episodic Markov Decision Processes (MDP) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon $H$. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a mean-squared error of $$ \frac{1}{n} \sum\nolimits_{t=1}^H\mathbb{E}_{\mu}\left[\frac{d_t^\pi(s_t)^2}{d_t^\mu(s_t)^2}…
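The recursion described in the abstract can be illustrated with a minimal tabular sketch. This is not the authors' code. It assumes a small discrete state and action space, known behavior and target policies, and data stored as `(n, H)` integer arrays; the function name `mis_estimate` and all argument names are hypothetical. At each step it forms per-step importance weights, estimates the target policy's state-to-state transition and per-state reward from the logged data, and pushes the estimated state marginal forward one step.

```python
import numpy as np

def mis_estimate(states, actions, rewards, next_states, pi, mu, n_states):
    """Tabular marginalized importance sampling sketch (illustrative only).

    states, actions, next_states: int arrays of shape (n, H)
    rewards: float array of shape (n, H)
    pi, mu: arrays of shape (H, n_states, n_actions) holding the target and
            behavior action probabilities at each step (assumed known).
    """
    n, H = states.shape
    # Estimated state marginal under the target policy at the first step:
    # the empirical initial-state distribution (it does not depend on the policy).
    d = np.bincount(states[:, 0], minlength=n_states) / n
    value = 0.0
    for t in range(H):
        s, a, r, s2 = states[:, t], actions[:, t], rewards[:, t], next_states[:, t]
        # Single-step importance weight pi_t(a|s) / mu_t(a|s) for each episode.
        rho = pi[t, s, a] / mu[t, s, a]
        # Visit counts per state; unvisited states keep zero estimates.
        counts = np.bincount(s, minlength=n_states)
        safe = np.maximum(counts, 1)
        # Importance-weighted mean reward per state under the target policy.
        r_hat = np.bincount(s, weights=rho * r, minlength=n_states) / safe
        # Importance-weighted state-to-state transition estimate.
        P_hat = np.zeros((n_states, n_states))
        np.add.at(P_hat, (s, s2), rho)
        P_hat /= safe[:, None]
        # Accumulate expected reward at step t, then advance the marginal.
        value += d @ r_hat
        d = d @ P_hat
    return value
```

Because the weights enter one step at a time rather than as a product over the whole trajectory, the sketch avoids the cumulative weight product that drives the exponential-in-$H$ variance of standard IS. A continuous or very large state space would need a different estimator for the marginals, which this sketch does not attempt.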

Taxonomy

Topics: Advanced Causal Inference Techniques · Age of Information Optimization · Reinforcement Learning in Robotics