Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards
Wenlong Ji, Yihan Pan, Ruihao Zhu, Lihua Lei

TL;DR
This paper introduces MLA-UCB, a multi-armed bandit algorithm that uses machine learning-generated surrogate rewards to reduce regret while handling the bias those surrogates may carry, with an extension to batched feedback.
Contribution
The paper proposes MLA-UCB, a versatile algorithm that uses surrogate rewards from ML models to enhance bandit decision-making, with theoretical guarantees and practical validation.
Findings
MLA-UCB improves cumulative regret in Gaussian reward settings.
The method extends to batched, non-Gaussian reward scenarios with regret guarantees.
Simulations and real-world studies show substantial regret reduction.
Abstract
Multi-armed bandit (MAB) is a widely adopted framework for sequential decision-making under uncertainty. Traditional bandit algorithms rely solely on online data, which tends to be scarce as it must be gathered during the online phase when the arms are actively pulled. However, in many practical settings, rich auxiliary data, such as covariates of past users, is available prior to deploying any arms. We introduce a new setting for MAB where pre-trained machine learning (ML) models are applied to convert side information and historical data into surrogate rewards. A prominent challenge of this setting is that the surrogate rewards may exhibit substantial bias, as true reward data is typically unavailable in the offline phase, forcing ML predictions to heavily rely on extrapolation. To address the issue, we propose the Machine Learning-Assisted Upper Confidence Bound (MLA-UCB)…
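The abstract is cut off before it describes how MLA-UCB works, so the sketch below should not be read as the paper's algorithm. It only illustrates, under stated assumptions, why a biased surrogate can still help: a generic control-variate estimator of one arm's mean reward uses the gap between the online and offline surrogate averages, so any constant bias in the surrogate cancels. The function names, the simulated reward model, and all constants are hypothetical choices made for illustration.

```python
import random
import statistics


def control_variate_mean(rewards, surrogates, surrogate_offline_mean):
    """Estimate an arm's mean reward from paired (reward, surrogate) samples.

    Generic control-variate estimator (an assumption, not the paper's method):
        r_bar - coef * (s_bar - surrogate_offline_mean)
    Only the difference between surrogate averages is used, so a constant
    bias in the surrogate does not shift the estimate.
    """
    n = len(rewards)
    r_bar = statistics.fmean(rewards)
    s_bar = statistics.fmean(surrogates)
    if n < 2:
        return r_bar  # too few samples to estimate a coefficient
    var_s = statistics.variance(surrogates)
    if var_s == 0:
        return r_bar  # a constant surrogate carries no information
    cov = sum((r - r_bar) * (s - s_bar)
              for r, s in zip(rewards, surrogates)) / (n - 1)
    coef = cov / var_s
    return r_bar - coef * (s_bar - surrogate_offline_mean)


def compare_mse(seed=0, trials=200, n_online=30, n_offline=2000):
    """Return (mse_plain, mse_cv) on a hypothetical simulated arm."""
    rng = random.Random(seed)
    true_mean, surrogate_center = 1.0, 5.0  # surrogate is biased by design

    def draw():
        shared = rng.gauss(0.0, 1.0)  # signal common to reward and surrogate
        reward = true_mean + shared + 0.3 * rng.gauss(0.0, 1.0)
        surrogate = surrogate_center + shared + 0.3 * rng.gauss(0.0, 1.0)
        return reward, surrogate

    se_plain = se_cv = 0.0
    for _ in range(trials):
        # Offline phase: surrogates only, no true rewards observed.
        offline_mean = statistics.fmean(draw()[1] for _ in range(n_offline))
        # Online phase: a few pulls, each with a reward and its surrogate.
        pairs = [draw() for _ in range(n_online)]
        rewards = [r for r, _ in pairs]
        surrogates = [s for _, s in pairs]
        se_plain += (statistics.fmean(rewards) - true_mean) ** 2
        estimate = control_variate_mean(rewards, surrogates, offline_mean)
        se_cv += (estimate - true_mean) ** 2
    return se_plain / trials, se_cv / trials


if __name__ == "__main__":
    mse_plain, mse_cv = compare_mse()
    print(f"plain sample mean MSE: {mse_plain:.4f}")
    print(f"control-variate MSE:   {mse_cv:.4f}")
```

A tighter mean estimate of this kind would allow a narrower confidence width in a UCB-style index, which is one plausible route to lower regret. How MLA-UCB actually constructs its index is not stated in the text shown here.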
