Policy-Adaptive Estimator Selection for Off-Policy Evaluation
Takuma Udagawa, Haruka Kiyohara, Yusuke Narita, Yuta Saito, Kei Tateno

TL;DR
This paper introduces a data-driven method for selecting the most accurate off-policy evaluation (OPE) estimator for a given task. It adaptively subsamples the logged data and constructs pseudo policies, which significantly improves estimator selection accuracy.
Contribution
It presents the first approach for adaptive estimator selection in OPE, addressing the challenge of choosing the best estimator based solely on logged data.
Findings
Substantially improves estimator selection accuracy over non-adaptive methods.
Effective on both synthetic and real-world datasets.
Demonstrates significant gains in OPE performance.
Abstract
Off-policy evaluation (OPE) aims to accurately evaluate the performance of counterfactual policies using only offline logged data. Although many estimators have been developed, there is no single estimator that dominates the others, because the estimators' accuracy can vary greatly depending on a given OPE task such as the evaluation policy, number of actions, and noise level. Thus, the data-driven estimator selection problem is becoming increasingly important and can have a significant impact on the accuracy of OPE. However, identifying the most accurate estimator using only the logged data is quite challenging because the ground-truth estimation accuracy of estimators is generally unavailable. This paper studies this challenging problem of estimator selection for OPE for the first time. In particular, we enable an estimator selection that is adaptive to a given OPE task, by…
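To make the selection problem concrete, here is a minimal sketch in Python. It is not the paper's method. It assumes a hypothetical context-free synthetic bandit and two standard estimators, inverse propensity scoring (IPS) and a simple direct method (DM). All function names and parameter values are illustrative assumptions. The point is that computing each estimator's mean squared error requires the true policy value, which is known in a simulation but not in real logged data.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def make_logged_data(n, n_actions, rng, noise=1.0):
    """Hypothetical context-free bandit log (an assumption for illustration)."""
    q = np.linspace(0.0, 1.0, n_actions)   # true expected reward per action
    pi_b = softmax(-2.0 * q)               # behavior policy prefers low-reward actions
    a = rng.choice(n_actions, size=n, p=pi_b)
    r = q[a] + noise * rng.standard_normal(n)
    return q, pi_b, a, r

def ips(pi_e, pi_b, a, r):
    """Inverse propensity scoring: reweight logged rewards by pi_e / pi_b."""
    w = pi_e[a] / pi_b[a]
    return float(np.mean(w * r))

def dm(pi_e, a, r, n_actions):
    """Direct method with a per-action sample-mean reward model."""
    q_hat = np.zeros(n_actions)            # actions never logged default to 0
    for k in range(n_actions):
        mask = a == k
        if mask.any():
            q_hat[k] = r[mask].mean()
    return float(pi_e @ q_hat)

def estimator_mse(n_trials=200, n=500, n_actions=5, seed=0):
    """Ground-truth MSE of each estimator; needs the true value of pi_e."""
    rng = np.random.default_rng(seed)
    errs = {"IPS": [], "DM": []}
    for _ in range(n_trials):
        q, pi_b, a, r = make_logged_data(n, n_actions, rng)
        pi_e = softmax(3.0 * q)            # evaluation policy (assumed)
        true_value = float(pi_e @ q)       # only available in simulation
        errs["IPS"].append((ips(pi_e, pi_b, a, r) - true_value) ** 2)
        errs["DM"].append((dm(pi_e, a, r, n_actions) - true_value) ** 2)
    return {name: float(np.mean(v)) for name, v in errs.items()}

if __name__ == "__main__":
    print(estimator_mse())
```

Changing the evaluation policy, the number of actions, or the noise level in this sketch changes the MSE of each estimator. With real logged data the `true_value` line has no counterpart, so the ranking has to be estimated from the logged data alone.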
Taxonomy
Topics: Advanced Causal Inference Techniques · Machine Learning and Algorithms · Efficiency Analysis Using DEA
