Online Bandits with (Biased) Offline Data: Adaptive Learning under Distribution Mismatch
Wang Chi Cheung, Lixing Lyu

TL;DR
This paper develops adaptive online algorithms that leverage biased offline data to improve learning in stochastic bandits, providing theoretical regret bounds and practical insights into when offline data is beneficial.
Contribution
It introduces MIN-UCB and MIN-COMB-UCB algorithms that adaptively utilize offline data with distribution mismatch, achieving tight regret bounds in stochastic bandit settings.
Findings
Offline data can improve online learning if distribution mismatch is bounded.
MIN-UCB outperforms classical UCB when offline data is informative.
The usefulness of offline data depends on both its bias and the size of the offline dataset.
Abstract
Traditional online learning models are typically initialized from scratch. By contrast, contemporary real-world applications often have access to historical datasets that can potentially enhance the online learning process. We study how offline data can be leveraged to facilitate online learning in stochastic multi-armed bandits and combinatorial bandits. In our study, the probability distributions that govern the offline data and the online rewards can be different. We first show that, without a non-trivial upper bound on their difference, no non-anticipatory policy can outperform the classical Upper Confidence Bound (UCB) policy, even with access to offline data. As a complement, we propose an online policy, MIN-UCB, for multi-armed bandits. MIN-UCB outperforms UCB when such an upper bound is available. MIN-UCB adaptively chooses to utilize the offline data when they are deemed…
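The abstract does not give the MIN-UCB index, so the following is only a minimal sketch of one way the idea could look, under stated assumptions: each arm's index is taken as the smaller of a plain UCB index (online samples only) and a pooled index (online plus offline samples) inflated by a known bound on the offline bias. The function names, the confidence radius constant, the Bernoulli rewards, and the parameter `bias_bound` are all hypothetical and chosen for illustration.

```python
import math
import random


def min_ucb_index(online_sum, online_n, offline_sum, offline_n, bias_bound, t):
    """Illustrative index: the smaller of two optimistic estimates for one arm.

    Assumed form (not taken from the abstract): a plain UCB index built from
    online samples only, and a pooled index that also uses offline samples
    and adds a penalty proportional to `bias_bound`.
    """
    log_t = math.log(max(t, 2))

    # Plain UCB index from online samples only (infinite if never pulled).
    if online_n == 0:
        plain = math.inf
    else:
        plain = online_sum / online_n + math.sqrt(2.0 * log_t / online_n)

    # Pooled index: more samples shrink the radius, but the offline share
    # of the pool may be biased, so a weighted bias penalty is added.
    total_n = online_n + offline_n
    if total_n == 0:
        pooled = math.inf
    else:
        pooled_mean = (online_sum + offline_sum) / total_n
        pooled = (pooled_mean
                  + math.sqrt(2.0 * log_t / total_n)
                  + (offline_n / total_n) * bias_bound)

    return min(plain, pooled)


def run_min_ucb(true_means, offline_means, offline_n, bias_bounds, horizon, seed=0):
    """Simulate the sketch on Bernoulli arms and return online pull counts."""
    rng = random.Random(seed)
    k = len(true_means)

    # Simulated offline data, drawn from possibly different (biased) means.
    offline_sums = [
        sum(rng.random() < offline_means[a] for _ in range(offline_n))
        for a in range(k)
    ]
    online_sums = [0.0] * k
    online_ns = [0] * k

    for t in range(1, horizon + 1):
        indices = [
            min_ucb_index(online_sums[a], online_ns[a],
                          offline_sums[a], offline_n, bias_bounds[a], t)
            for a in range(k)
        ]
        arm = max(range(k), key=lambda a: indices[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        online_sums[arm] += reward
        online_ns[arm] += 1

    return online_ns


if __name__ == "__main__":
    # Hypothetical instance: offline mean of arm 1 is off by 0.05, bound is 0.1.
    pulls = run_min_ucb([0.5, 0.7], [0.5, 0.65], 200, [0.1, 0.1], 2000)
    print(pulls)
```

In this sketch, a small `bias_bound` lets the pooled index dominate, so the offline data tighten the estimate; a large `bias_bound` makes the pooled index loose, and the minimum falls back to the plain UCB index, which mirrors the abstract's point that offline data help only when a non-trivial bound on the mismatch is available.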
Taxonomy
Topics: Advanced Bandit Algorithms Research · Forecasting Techniques and Applications
