Counterfactual Learning with General Data-generating Policies
Yusuke Narita, Kyohei Okumura, Akihiro Shimizu, Kohei Yata

TL;DR
This paper introduces an off-policy evaluation method that handles both full-support and deficient-support logging policies, including deterministic ones. The authors prove that its prediction converges in probability to the true policy performance, and they validate it in experiments and in a coupon-targeting application.
Contribution
It extends off-policy evaluation to a broader class of logging policies, including deterministic ones, with theoretical guarantees and a real-world application.
Findings
The method's prediction converges in probability to the true policy performance as the sample size increases
Validated in experiments on partly and entirely deterministic logging policies
Applied to coupon targeting on a major online platform, showing how to improve the existing policy
Abstract
Off-policy evaluation (OPE) attempts to predict the performance of counterfactual policies using log data from a different policy. We extend its applicability by developing an OPE method for a class of both full-support and deficient-support logging policies in contextual-bandit settings. This class includes deterministic bandit algorithms (such as Upper Confidence Bound) as well as deterministic decision-making based on supervised and unsupervised learning. We prove that our method's prediction converges in probability to the true performance of a counterfactual policy as the sample size increases. We validate our method with experiments on partly and entirely deterministic logging policies. Finally, we apply it to evaluate coupon targeting policies by a major online platform and show how to improve the existing policy.
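To make the full-support versus deficient-support distinction concrete, the sketch below uses the standard inverse-propensity-weighting (IPW) estimator on synthetic data. This is not the method proposed in the paper; it is a textbook baseline, and every name here (`has_full_support`, `ipw_value`, the reward model, the policies) is a hypothetical choice made for illustration. It shows that IPW is usable when the logging policy randomizes, and that the support condition fails once the logging policy is deterministic, which is the case the paper targets.

```python
import numpy as np

def has_full_support(logging_probs, target_probs):
    """True if every action the target policy may take has positive logging probability."""
    return bool(np.all((target_probs == 0) | (logging_probs > 0)))

def ipw_value(rewards, actions, logging_probs, target_probs):
    """Standard IPW estimate of the target policy's value from logged data."""
    rows = np.arange(len(actions))
    weights = target_probs[rows, actions] / logging_probs[rows, actions]
    return float(np.mean(weights * rewards))

rng = np.random.default_rng(0)
n = 20_000
x = rng.uniform(size=n)  # one-dimensional context (assumed for illustration)

# Logging policy: picks each of two actions with probability 0.5 (full support).
logging_probs = np.full((n, 2), 0.5)
actions = rng.integers(0, 2, size=n)

# Assumed noise-free reward model: action 1 pays x, action 0 pays 0.5.
rewards = np.where(actions == 1, x, 0.5)

# Target (counterfactual) policy: choose action 1 exactly when x > 0.5.
target_action = (x > 0.5).astype(int)
target_probs = np.eye(2)[target_action]

full_support = has_full_support(logging_probs, target_probs)
estimate = ipw_value(rewards, actions, logging_probs, target_probs)
true_value = 0.625  # E[max(x, 0.5)] for x ~ Uniform(0, 1)

# A deterministic logging policy that always plays action 0 never observes
# action 1, so the support condition required by IPW fails.
deterministic_logging = np.tile([1.0, 0.0], (n, 1))
deficient_support = not has_full_support(deterministic_logging, target_probs)
```

Under the randomized logging policy, `estimate` should land close to `true_value`. Under the deterministic one, the importance weights are undefined for the unobserved action, so an estimator of this kind cannot be applied as is.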
Taxonomy
Topics: Smart Grid Energy Management · Recommender Systems and Techniques · Mobile Crowdsensing and Crowdsourcing
