A UCB Bandit Algorithm for General ML-Based Estimators
Yajing Liu, Erkao Bao, Linqi Song

TL;DR
This paper introduces ML-UCB, a versatile bandit algorithm that incorporates machine learning models with unknown concentration properties, enabling effective exploration in sequential decision tasks.
Contribution
We develop a generalized UCB algorithm that models learning curves to incorporate arbitrary ML estimators into bandit frameworks without model-specific analysis.
Findings
ML-UCB achieves sublinear regret in experiments.
ML-UCB shows significant improvement over LinUCB in a recommendation system experiment.
The framework applies to any ML model whose learning curve can be characterized.
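One common way to characterize a learning curve empirically is to fit a power law to held-out error measured at several training-set sizes. The sketch below assumes the form MSE(n) ≈ c · n^(-alpha) and fits it by least squares in log-log space. The function name `fit_power_law`, the symbols `c` and `alpha`, and the synthetic data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def fit_power_law(sample_sizes, mse_values):
    """Fit MSE(n) ~ c * n**(-alpha) by linear least squares in log-log space.

    Returns (c, alpha). Hypothetical helper for illustration only.
    """
    log_n = np.log(np.asarray(sample_sizes, dtype=float))
    log_mse = np.log(np.asarray(mse_values, dtype=float))
    # log MSE = log c - alpha * log n, so the slope is -alpha.
    slope, intercept = np.polyfit(log_n, log_mse, deg=1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic, noise-free learning curve with c = 2.0 and alpha = 0.5
# (made-up values, used only to show that the fit recovers them).
ns = np.array([10.0, 100.0, 1000.0, 10000.0])
c_hat, alpha_hat = fit_power_law(ns, 2.0 * ns ** -0.5)
```

In practice the measured errors are noisy, so the fitted exponent is only an estimate; how the paper handles that estimation is not stated in this summary.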
Abstract
We present ML-UCB, a generalized upper confidence bound algorithm that integrates arbitrary machine learning models into multi-armed bandit frameworks. A fundamental challenge in deploying sophisticated ML models for sequential decision-making is the lack of tractable concentration inequalities required for principled exploration. We overcome this limitation by directly modeling the learning curve behavior of the underlying estimator. Specifically, assuming the Mean Squared Error decreases as a power law in the number of training samples, we derive a generalized concentration inequality and prove that ML-UCB achieves sublinear regret. This framework enables the principled integration of any ML model whose learning curve can be empirically characterized, eliminating the need for model-specific theoretical analysis. We validate our approach through experiments on a collaborative filtering…
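To make the idea concrete, here is a minimal sketch of a UCB rule whose exploration bonus comes from a power-law error model rather than a model-specific concentration bound. It is not the paper's algorithm: the bonus form sqrt(2 · c · n^(-alpha) · log t), the parameters `c` and `alpha`, the function names, and the use of a plain sample mean as the "estimator" on a Bernoulli bandit are all assumptions chosen for illustration. With alpha = 1 and c = 0.25, the bonus reduces to a familiar Hoeffding-style UCB bonus for rewards in [0, 1].

```python
import numpy as np

def power_law_bonus(n, t, c, alpha):
    # Assumed form: root of the modeled MSE c * n**(-alpha), inflated by log(t).
    return np.sqrt(2.0 * c * n ** (-alpha) * np.log(t))

def run_ucb(arm_means, horizon, c=0.25, alpha=1.0, seed=0):
    """Run a UCB loop on a Bernoulli bandit; return pull counts per arm."""
    rng = np.random.default_rng(seed)
    k = len(arm_means)
    counts = np.zeros(k)  # number of pulls per arm
    sums = np.zeros(k)    # cumulative reward per arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialize the estimates
        else:
            # Estimate plus learning-curve-based exploration bonus.
            index = sums / counts + power_law_bonus(counts, t, c, alpha)
            arm = int(np.argmax(index))
        reward = float(rng.random() < arm_means[arm])
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Hypothetical three-armed problem; the last arm has the highest mean reward.
counts = run_ucb([0.2, 0.5, 0.8], horizon=2000)
```

Swapping the sample mean for a richer model would mean replacing `sums / counts` with that model's prediction and using a `c` and `alpha` fitted from its observed learning curve.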
Taxonomy
Topics: Advanced Bandit Algorithms Research · Recommender Systems and Techniques · Gaussian Processes and Bayesian Inference
