Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling
Tengyang Xie, Yifei Ma, Yu-Xiang Wang

TL;DR
This paper introduces a marginalized importance sampling estimator for off-policy evaluation in reinforcement learning, achieving lower variance and polynomial dependence on the horizon, with both theoretical guarantees and empirical validation.
Contribution
The paper proposes an MIS estimator that reduces variance in off-policy evaluation and provides what the authors describe as the first OPE error bound with polynomial dependence on the horizon H, supported by theoretical analysis and experiments.
Findings
The MIS estimator achieves lower mean-squared error than existing methods.
Its theoretical error bound matches the Cramér-Rao lower bound up to a multiplicative factor of H.
Empirical results show that MIS outperforms the compared methods in time-varying, partially observable, and long-horizon RL environments.
Abstract
Motivated by the many real-world applications of reinforcement learning (RL) that require safe-policy iterations, we consider the problem of off-policy evaluation (OPE) -- the problem of evaluating a new policy using the historical data obtained by different behavior policies -- under the model of nonstationary episodic Markov Decision Processes (MDP) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon $H$. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a mean-squared error of $$ \frac{1}{n} \sum\nolimits_{t=1}^H\mathbb{E}_{\mu}\left[\frac{d_t^\pi(s_t)^2}{d_t^\mu(s_t)^2}…
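The recursion described in the abstract can be illustrated with a minimal tabular sketch. It is not the authors' implementation: it assumes finite state and action spaces, known behavior-policy probabilities, and equal-length episodes. The function name `mis_estimate` and the array layout are hypothetical choices made for this example. At each step it reweights observed transitions and rewards by the single-step ratio π(a|s)/μ(a|s), then pushes the estimated state marginal of the target policy forward by one step.

```python
import numpy as np

def mis_estimate(states, actions, rewards, next_states, pi, mu, n_states):
    """Tabular marginalized importance sampling sketch (hypothetical API).

    states, actions, rewards, next_states: arrays of shape (n, H)
    pi, mu: target / behavior action probabilities, shape (H, S, A)
    """
    n, H = states.shape
    # Estimated marginal of the target policy at step 1: the empirical
    # initial-state distribution, which does not depend on the policy.
    d = np.bincount(states[:, 0], minlength=n_states) / n
    value = 0.0
    for t in range(H):
        s, a, r, s2 = states[:, t], actions[:, t], rewards[:, t], next_states[:, t]
        rho = pi[t, s, a] / mu[t, s, a]              # single-step importance ratio
        counts = np.bincount(s, minlength=n_states)
        safe = np.maximum(counts, 1)                 # avoid 0/0 for unvisited states
        # Importance-weighted mean reward per state under the target policy.
        r_hat = np.bincount(s, weights=rho * r, minlength=n_states) / safe
        # Importance-weighted state-to-state transition estimate.
        P_hat = np.zeros((n_states, n_states))
        np.add.at(P_hat, (s, s2), rho)
        P_hat /= safe[:, None]
        value += d @ r_hat                           # expected reward at step t
        d = d @ P_hat                                # propagate marginal to step t+1
    return value
```

Only one-step ratios appear, never a product of ratios over the whole trajectory, which is the source of the exponential variance of standard IS. When `pi` equals `mu`, every ratio is 1 and the sketch reduces to the empirical average return, which is a convenient sanity check.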
Taxonomy
Topics: Advanced Causal Inference Techniques · Age of Information Optimization · Reinforcement Learning in Robotics
