Statistical Inference for Online Decision-Making: In a Contextual Bandit Setting
Haoyu Chen, Wenbin Lu, Rui Song

TL;DR
This paper develops statistical inference methods for online decision-making in a contextual bandit setting, establishing asymptotic normality of estimators under correct and misspecified models, with applications to real data.
Contribution
It introduces asymptotic normality results for online estimators in contextual bandits, including under model misspecification, using martingale CLT techniques.
Findings
The online OLS estimator is asymptotically normal.
The online weighted least squares estimator remains asymptotically normal under model misspecification.
The in-sample inverse propensity weighted value estimator is asymptotically normal.
Abstract
Online decision-making problems require us to make a sequence of decisions based on incremental information. Common solutions often need to learn a reward model of different actions given the contextual information and then maximize the long-term reward. It is meaningful to know if the posited model is reasonable and how the model performs in the asymptotic sense. We study this problem under the setup of the contextual bandit framework with a linear reward model. The ε-greedy policy is adopted to address the classic exploration-and-exploitation dilemma. Using the martingale central limit theorem, we show that the online ordinary least squares estimator of model parameters is asymptotically normal. When the linear model is misspecified, we propose the online weighted least squares estimator using the inverse propensity score weighting and also establish its asymptotic…
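The setup described in the abstract (a linear reward model per action, an ε-greedy policy, and an online OLS estimate refreshed after every decision) can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the authors' implementation: the two-action setting, the Gaussian contexts and noise, the parameter values in `beta_true`, the tiny ridge term used to keep the Gram matrices invertible early on, and the function name `run_epsilon_greedy_ols` are all hypothetical choices made for illustration.

```python
import numpy as np

def run_epsilon_greedy_ols(T=10000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy linear contextual bandit with online OLS.

    Everything here (two actions, Gaussian contexts, unit noise, parameter
    values) is an illustrative assumption, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    # Hypothetical true reward parameters, one row per action.
    beta_true = np.array([[1.0, -0.5, 0.3],
                          [0.2, 0.8, -0.4]])
    n_actions, d = beta_true.shape

    # Per-action running sums: gram[a] = sum x x^T, xr[a] = sum x * reward.
    # The small ridge term is an assumption that keeps the solve well defined
    # before an action has been observed d times.
    gram = np.stack([1e-6 * np.eye(d) for _ in range(n_actions)])
    xr = np.zeros((n_actions, d))
    beta_hat = np.zeros((n_actions, d))

    for _ in range(T):
        # Context with an intercept followed by standard normal features.
        x = np.concatenate(([1.0], rng.normal(size=d - 1)))

        # Epsilon-greedy: explore uniformly with probability epsilon,
        # otherwise pick the action with the largest estimated reward.
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(beta_hat @ x))

        # Observe a noisy linear reward for the chosen action only.
        r = beta_true[a] @ x + rng.normal()

        # Online OLS update for the chosen action.
        gram[a] += np.outer(x, x)
        xr[a] += r * x
        beta_hat[a] = np.linalg.solve(gram[a], xr[a])

    return beta_hat, beta_true

beta_hat, beta_true = run_epsilon_greedy_ols()
print("max abs estimation error:", np.max(np.abs(beta_hat - beta_true)))
```

Rerunning the simulation over many seeds and inspecting the spread of `beta_hat` around `beta_true` is one informal way to see the kind of limiting behaviour the abstract refers to. The sketch does not compute the paper's variance estimator or the weighted least squares variant.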
