Efficient Counterfactual Learning from Bandit Feedback

Yusuke Narita; Shota Yasui; Kohei Yata

arXiv:1809.03084·cs.LG·December 7, 2018

Open Access

TL;DR

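This paper introduces statistically efficient estimators for off-policy evaluation from logged contextual bandit data. The estimators have lower variance than standard estimators, so policy improvements can be identified from offline data with more statistical confidence. The authors demonstrate this by improving advertisement design at a major advertisement company.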

Contribution

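The paper proposes offline estimators of the expected reward from a counterfactual policy that attain the lowest variance within a wide class of estimators and, in an advertising application, support improvements over the existing bandit algorithm with more statistical confidence than a state-of-the-art benchmark.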

Findings

01. Achieves lower variance than standard estimators.

02. Enables more confident policy improvements.

03. Demonstrates effectiveness in real-world advertisement optimization.

Abstract

What is the most statistically efficient way to do off-policy evaluation and optimization with batch data from bandit feedback? For log data generated by contextual bandit algorithms, we consider offline estimators for the expected reward from a counterfactual policy. Our estimators are shown to have lowest variance in a wide class of estimators, achieving variance reduction relative to standard estimators. We then apply our estimators to improve advertisement design by a major advertisement company. Consistent with the theoretical result, our estimators allow us to improve on the existing bandit algorithm with more statistical confidence compared to a state-of-the-art benchmark.
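
For readers new to the setting, the sketch below illustrates what off-policy evaluation from logged bandit feedback involves. It implements inverse propensity weighting (IPW), one commonly used baseline estimator, and is not the estimator proposed in the paper. The simulated data, the uniform logging policy, the reward probabilities, and all function names are assumptions made up for illustration.

```python
import random


def ipw_value(logs, target_policy):
    """IPW estimate of the expected reward of a counterfactual policy.

    logs: list of (context, action, reward, propensity) tuples, where
        propensity is the probability the logging policy assigned to the
        logged action in that context.
    target_policy: function (context, action) -> probability of that action
        under the counterfactual policy.
    """
    total = 0.0
    for context, action, reward, propensity in logs:
        # Reweight each logged reward by how much more (or less) likely the
        # counterfactual policy is to take the logged action.
        total += target_policy(context, action) / propensity * reward
    return total / len(logs)


def simulate_logs(n, seed=0):
    """Hypothetical log data: binary context, two actions, uniform logging."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        context = rng.choice([0, 1])
        action = rng.choice([0, 1])  # logging policy picks uniformly
        propensity = 0.5
        # Assumed reward model: 0.8 success rate when action matches context.
        success_prob = 0.8 if action == context else 0.2
        reward = 1.0 if rng.random() < success_prob else 0.0
        logs.append((context, action, reward, propensity))
    return logs


def match_context_policy(context, action):
    """Counterfactual policy that always plays the action equal to the context."""
    return 1.0 if action == context else 0.0


logs = simulate_logs(20000)
estimate = ipw_value(logs, match_context_policy)
print(f"IPW estimate of the counterfactual policy's value: {estimate:.3f}")
```

Under the assumed reward model, the true value of `match_context_policy` is 0.8, so the printed estimate should land close to that number. The spread of such estimates across repeated samples is the variance that the abstract refers to when it compares estimators.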

Taxonomy

Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Data Stream Mining Techniques