From Weighting to Modeling: A Nonparametric Estimator for Off-Policy Evaluation
Rong J.B. Zhu

TL;DR
This paper introduces a nonparametric weighting method for off-policy evaluation in contextual bandits, reducing variance compared to traditional inverse probability weighting and improving accuracy through reward modeling.
Contribution
It proposes a novel nonparametric weighting estimator and a model-assisted variant that outperform existing methods in variance reduction and accuracy.
Findings
The proposed methods typically achieve lower variance than IPW.
They maintain low bias, similar to IPW.
Empirical results show superior performance over existing techniques.
Abstract
We study off-policy evaluation in the setting of contextual bandits, where we aim to evaluate a new policy using historical data that consists of contexts, actions, and received rewards. This historical data typically does not faithfully represent the action distribution of the new policy. A common approach, inverse probability weighting (IPW), adjusts for these discrepancies in action distributions. However, this method often suffers from high variance because the probability appears in the denominator. The doubly robust (DR) estimator reduces variance by modeling the reward but does not directly address the variance from IPW. In this work, we address the limitation of IPW by proposing a Nonparametric Weighting (NW) approach that constructs weights using a nonparametric model. Our NW approach achieves low bias like IPW but typically exhibits significantly lower variance. To further…
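For readers unfamiliar with the two baselines named in the abstract, the following is a minimal sketch of the standard textbook IPW and DR estimators. It is not the paper's NW method, whose construction is not described in this summary. The function and argument names (ipw_estimate, dr_estimate, target_policy, reward_model) are hypothetical and chosen for illustration only.

```python
def ipw_estimate(rewards, target_probs, logging_probs):
    """Standard IPW estimate of the new policy's value.

    rewards[i]       : reward observed for the logged action in round i
    target_probs[i]  : probability the new policy assigns to that action
    logging_probs[i] : probability the logging policy assigned to it
    """
    total = 0.0
    for r, p_new, p_old in zip(rewards, target_probs, logging_probs):
        # The logging probability sits in the denominator, so a small
        # p_old produces a large weight and hence high variance.
        total += (p_new / p_old) * r
    return total / len(rewards)


def dr_estimate(rewards, actions, target_policy, logging_probs, reward_model):
    """Standard doubly robust estimate (hypothetical argument layout).

    target_policy[i][a] : new policy's probability of action a in round i
    reward_model[i][a]  : predicted reward of action a in round i
    """
    total = 0.0
    for r, a, pi_row, p_old, q_row in zip(
        rewards, actions, target_policy, logging_probs, reward_model
    ):
        # Model-based value of the new policy in this context.
        baseline = sum(p * q for p, q in zip(pi_row, q_row))
        # IPW correction applied only to the model's residual.
        correction = (pi_row[a] / p_old) * (r - q_row[a])
        total += baseline + correction
    return total / len(rewards)
```

In this sketch, a logged action with logging probability 0.01 that the new policy always chooses receives a weight of 100, which illustrates the variance problem the abstract describes. DR shrinks the quantity being weighted (the residual rather than the raw reward) but still uses the same inverse-probability weights.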
Taxonomy
Topics: Advanced Causal Inference Techniques · Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics
