Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards

Wenlong Ji; Yihan Pan; Ruihao Zhu; Lihua Lei

arXiv:2506.16658 · math.ST · April 23, 2026

TL;DR

This paper introduces MLA-UCB, an upper-confidence-bound algorithm for multi-armed bandits that uses machine learning-generated surrogate rewards built from offline data, accounts for their potential bias, and reduces cumulative regret, including in batched settings.

Contribution

The paper proposes MLA-UCB, an algorithm that uses surrogate rewards from pre-trained ML models to improve bandit decision-making, supported by regret guarantees and validated in simulations and real-world studies.

Findings

01. MLA-UCB improves cumulative regret in Gaussian reward settings.

02. The method extends to batched, non-Gaussian reward scenarios with regret guarantees.

03. Simulations and real-world studies show substantial regret reduction.

Abstract

Multi-armed bandit (MAB) is a widely adopted framework for sequential decision-making under uncertainty. Traditional bandit algorithms rely solely on online data, which tends to be scarce as it must be gathered during the online phase when the arms are actively pulled. However, in many practical settings, rich auxiliary data, such as covariates of past users, is available prior to deploying any arms. We introduce a new setting for MAB where pre-trained machine learning (ML) models are applied to convert side information and historical data into surrogate rewards. A prominent challenge of this setting is that the surrogate rewards may exhibit substantial bias, as true reward data is typically unavailable in the offline phase, forcing ML predictions to heavily rely on extrapolation. To address the issue, we propose the Machine Learning-Assisted Upper Confidence Bound (MLA-UCB)…
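
For orientation, the "traditional" baseline that the abstract contrasts against can be sketched as classical UCB1, which learns from online rewards only. This is not MLA-UCB and does not use surrogate rewards; the function names, the Bernoulli test bandit, and the arm probabilities below are illustrative assumptions rather than anything taken from the paper.

```python
import math
import random


def ucb1(pull, n_arms, horizon):
    """Classical UCB1 using online rewards only (NOT the paper's MLA-UCB).

    `pull(arm)` is a hypothetical callback returning a reward in [0, 1].
    Returns (per-arm pull counts, per-arm empirical means).
    """
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once to initialise its estimate
        else:
            # Pick the arm with the largest empirical mean plus
            # exploration bonus sqrt(2 * ln(t) / n_a).
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean update
    return counts, means


def make_bernoulli_bandit(probs, seed=0):
    """Illustrative environment: arm a pays 1 with probability probs[a]."""
    rng = random.Random(seed)
    return lambda arm: 1.0 if rng.random() < probs[arm] else 0.0


# Assumed toy problem: two arms with success probabilities 0.2 and 0.8.
counts, means = ucb1(make_bernoulli_bandit([0.2, 0.8]), n_arms=2, horizon=2000)
```

Every estimate here is built from rewards observed during the online phase, which is the data scarcity the abstract describes; the surrogate-reward setting asks how offline ML predictions could supplement those estimates despite possible bias.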
