Demystifying the Paradox of Importance Sampling with an Estimated History-Dependent Behavior Policy in Off-Policy Evaluation

Hongyi Zhou; Josiah P. Hanna; Jin Zhu; Ying Yang; Chengchun Shi

arXiv:2505.22492 · cs.LG · May 29, 2025

TL;DR

This paper explains theoretically why estimating a history-dependent behavior policy can reduce the mean squared error of importance sampling in off-policy evaluation, even when the true behavior policy is Markovian: estimation lowers the asymptotic variance at the cost of a higher finite-sample bias.

Contribution

The paper derives a bias-variance decomposition of the MSE of ordinary importance sampling (IS) estimators, showing that history-dependent behavior policy estimation decreases their asymptotic variance while increasing their finite-sample bias.
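
For orientation, the quantities involved can be written out as follows. This is a sketch in generic notation that is assumed here rather than taken from the paper: n trajectories of horizon T, discount factor \gamma, target policy \pi_e, an estimated behavior policy \hat{\pi}_b that conditions on a history h_{i,t} (which reduces to the current state s_{i,t} in the Markov case), and true value v(\pi_e). The second line is the standard decomposition of MSE into squared bias plus variance.

```latex
\hat{v}_{\mathrm{IS}}
  = \frac{1}{n} \sum_{i=1}^{n}
    \left( \prod_{t=0}^{T-1}
      \frac{\pi_e(a_{i,t} \mid s_{i,t})}{\hat{\pi}_b(a_{i,t} \mid h_{i,t})} \right)
    \sum_{t=0}^{T-1} \gamma^{t} r_{i,t}

\mathrm{MSE}(\hat{v}_{\mathrm{IS}})
  = \bigl( \mathbb{E}[\hat{v}_{\mathrm{IS}}] - v(\pi_e) \bigr)^{2}
  + \mathrm{Var}(\hat{v}_{\mathrm{IS}})
```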

Findings

01

History-dependent behavior policy estimation decreases the asymptotic variance of IS estimators while increasing their finite-sample bias.

02

Conditioning the estimated behavior policy on a longer history consistently reduces variance.

03

Results extend to various OPE estimators and estimation methods.

Abstract

This paper studies off-policy evaluation (OPE) in reinforcement learning with a focus on behavior policy estimation for importance sampling. Prior work has shown empirically that estimating a history-dependent behavior policy can lead to lower mean squared error (MSE) even when the true behavior policy is Markovian. However, the question of why the use of history should lower MSE remains open. In this paper, we theoretically demystify this paradox by deriving a bias-variance decomposition of the MSE of ordinary importance sampling (IS) estimators, demonstrating that history-dependent behavior policy estimation decreases their asymptotic variances while increasing their finite-sample biases. Additionally, as the estimated behavior policy conditions on a longer history, we show a consistent decrease in variance. We extend these findings to a range of other OPE estimators, including the…
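
The effect described in the abstract can be explored with a small simulation. The sketch below is illustrative only and is not the paper's experimental setup: the two-step, two-state MDP, the reward function, the target policy `pi_e`, and the count-based (tabular maximum likelihood) behavior policy estimates are all assumptions made for this example. It compares ordinary IS using the true Markovian behavior policy, an estimated Markov policy, and an estimated policy that conditions on the full history.

```python
import random
from collections import defaultdict
from statistics import mean, pvariance

HORIZON = 2
TRUE_VALUE = 2.4  # each step contributes 0.8 * E[1 + s] = 0.8 * 1.5 under pi_e


def pi_e(a, s):
    """Hypothetical target policy: match the state with probability 0.8."""
    return 0.8 if a == s else 0.2


def sample_trajectory(rng):
    """Roll out the true behavior policy, which is Markovian and uniform."""
    traj, s = [], rng.randint(0, 1)
    for _ in range(HORIZON):
        a = rng.randint(0, 1)
        r = (1.0 + s if a == s else 0.0) + rng.gauss(0.0, 0.5)
        traj.append((s, a, r))
        s = a if rng.random() < 0.7 else 1 - a  # next state follows the action
    return traj


def steps_with_context(traj, use_history):
    """Yield (context, state, action); context is what the estimate conditions on."""
    hist = ()
    for s, a, _ in traj:
        yield ((hist, s) if use_history else (s,)), s, a
        hist += (s, a)


def fit_counts(data, use_history):
    """Tabular action counts per context (maximum likelihood behavior policy)."""
    counts = defaultdict(lambda: [0, 0])
    for traj in data:
        for ctx, _, a in steps_with_context(traj, use_history):
            counts[ctx][a] += 1
    return counts


def ois(data, mode):
    """Ordinary IS estimate; mode is 'true', 'markov', or 'history'."""
    use_history = mode == "history"
    counts = None if mode == "true" else fit_counts(data, use_history)
    total = 0.0
    for traj in data:
        weight = 1.0
        for ctx, s, a in steps_with_context(traj, use_history):
            if counts is None:
                p_b = 0.5  # true behavior probability
            else:
                # The count is at least 1: this step was used to fit the policy.
                p_b = counts[ctx][a] / sum(counts[ctx])
            weight *= pi_e(a, s) / p_b
        total += weight * sum(r for _, _, r in traj)
    return total / len(data)


def experiment(n=200, reps=300, seed=0):
    """Empirical bias, variance, and MSE of each estimator over repeated datasets."""
    rng = random.Random(seed)
    estimates = {"true": [], "markov": [], "history": []}
    for _ in range(reps):
        data = [sample_trajectory(rng) for _ in range(n)]
        for mode in estimates:
            estimates[mode].append(ois(data, mode))
    return {
        mode: {
            "bias": mean(vals) - TRUE_VALUE,
            "var": pvariance(vals),
            "mse": mean((v - TRUE_VALUE) ** 2 for v in vals),
        }
        for mode, vals in estimates.items()
    }


if __name__ == "__main__":
    for mode, stats in experiment().items():
        print(mode, {k: round(v, 4) for k, v in stats.items()})
```

In this toy setting the history-dependent estimate should show a markedly smaller empirical variance than IS with the true behavior policy. Shrinking `n` leaves some histories with only one observed action, which is one way the finite-sample bias mentioned in the abstract can show up.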

Videos

Demystifying the Paradox of Importance Sampling with an Estimated History-Dependent Behavior Policy in Off-Policy Evaluation (SlidesLive)

Taxonomy

Topics: Behavioral and Psychological Studies · Reinforcement Learning in Robotics · Advanced Causal Inference Techniques