Minimax Off-Policy Evaluation for Multi-Armed Bandits

Cong Ma; Banghua Zhu; Jiantao Jiao; Martin J. Wainwright

arXiv:2101.07781 · stat.ML · January 20, 2021

TL;DR

This paper studies off-policy evaluation in multi-armed bandits with bounded rewards and develops minimax rate-optimal estimators under three assumptions about the behavior policy: known, unknown, and partially known. The proposed methods come with optimality proofs and practical validation.

Contribution

It shows that the Switch estimator is minimax rate-optimal when the behavior policy is known, quantifies how much an unknown behavior policy inflates the estimation error, and proposes a Chebyshev polynomial-based estimator for the partial-knowledge setting.

Findings

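01. With a known behavior policy, the Switch estimator is minimax rate-optimal for all sample sizes.

02. With an unknown behavior policy, any estimator's mean-squared error exceeds that of the oracle estimator by a multiplicative factor proportional to the support size of the target policy.

03. With partial knowledge of the behavior policy, the Chebyshev polynomial estimator achieves the optimal estimation error.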

Abstract

We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards, and develop minimax rate-optimal procedures under three settings. First, when the behavior policy is known, we show that the Switch estimator, a method that alternates between the plug-in and importance sampling estimators, is minimax rate-optimal for all sample sizes. Second, when the behavior policy is unknown, we analyze performance in terms of the competitive ratio, thereby revealing a fundamental gap between the settings of known and unknown behavior policies. When the behavior policy is unknown, any estimator must have mean-squared error larger -- relative to the oracle estimator equipped with the knowledge of the behavior policy -- by a multiplicative factor proportional to the support size of the target policy. Moreover, we demonstrate that the plug-in approach achieves this…
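
To make the idea of alternating between the plug-in and importance sampling estimators concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's exact procedure: the function name `switch_estimate`, the `threshold` parameter, and the rule "use plug-in on arms whose likelihood ratio exceeds the threshold, importance sampling elsewhere" are assumptions that follow the common form of switch-type estimators. How the paper chooses the switching set is not stated in the abstract above.

```python
def switch_estimate(actions, rewards, pi_target, pi_behavior, threshold):
    """Hypothetical switch-type estimate of the target policy's value.

    actions     : list of arm indices drawn from the (known) behavior policy
    rewards     : list of bounded rewards, one per action
    pi_target   : list of target-policy probabilities, one per arm
    pi_behavior : list of behavior-policy probabilities, one per arm
    threshold   : assumed tuning parameter; arms with ratio above it use plug-in
    """
    n = len(actions)
    k = len(pi_target)

    # Likelihood ratio pi_target / pi_behavior for every arm.
    ratio = []
    for a in range(k):
        if pi_behavior[a] > 0:
            ratio.append(pi_target[a] / pi_behavior[a])
        else:
            ratio.append(float("inf") if pi_target[a] > 0 else 0.0)

    # Assumed switching rule: large ratios make importance weights unstable,
    # so those arms are handled by the plug-in (empirical mean) estimate.
    plug_in_arms = {a for a in range(k) if ratio[a] > threshold}

    # Per-arm reward counts and sums for the empirical means.
    counts = [0] * k
    sums = [0.0] * k
    for a, r in zip(actions, rewards):
        counts[a] += 1
        sums[a] += r

    # Plug-in part: target-weighted empirical means (0 for unobserved arms).
    plug_in_part = sum(
        pi_target[a] * sums[a] / counts[a]
        for a in plug_in_arms
        if counts[a] > 0
    )

    # Importance sampling part: reweighted rewards on the remaining arms.
    is_part = sum(
        ratio[a] * r
        for a, r in zip(actions, rewards)
        if a not in plug_in_arms
    ) / n

    return plug_in_part + is_part


# Small made-up dataset: two arms, four logged pulls.
example_value = switch_estimate(
    actions=[0, 0, 1, 1],
    rewards=[1.0, 0.0, 1.0, 1.0],
    pi_target=[0.2, 0.8],
    pi_behavior=[0.5, 0.5],
    threshold=1.0,
)
```

In this sketch, setting `threshold` to `float("inf")` recovers pure importance sampling, and setting it below zero recovers the pure plug-in estimate, so the two classical estimators are the endpoints of the family.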

Taxonomy

Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Age of Information Optimization