Adapting multi-armed bandits policies to contextual bandits scenarios
David Cortes

TL;DR
This paper adapts multi-armed bandit policies to contextual bandit scenarios with binary rewards, using classification algorithms and randomness techniques to improve scalability and flexibility over existing methods.
Contribution
It introduces scalable adaptations of bandit policies for contextual scenarios, leveraging classification algorithms and randomness methods like bootstrapping.
Findings
Adaptive-Greedy outperforms UCB and Thompson sampling in many cases
The methods scale better than previous approaches and work with any type of classification algorithm
Adaptive-Greedy's performance gains come at the expense of more hyperparameters to tune
Abstract
This work explores adaptations of successful multi-armed bandits policies to the online contextual bandits scenario with binary rewards using binary classification algorithms such as logistic regression as black-box oracles. Some of these adaptations are achieved through bootstrapping or approximate bootstrapping, while others rely on other forms of randomness, resulting in more scalable approaches than previous works, and the ability to work with any type of classification algorithm. In particular, the Adaptive-Greedy algorithm shows a lot of promise, in many cases achieving better performance than upper confidence bound and Thompson sampling strategies, at the expense of more hyperparameters to tune.
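To make the idea of a classifier used as a black-box oracle concrete, the following is a minimal sketch of a bootstrapped Thompson-sampling-style policy, not the paper's reference implementation. It assumes one set of oracles per arm, each refit from scratch on a bootstrap resample of that arm's history, which is a simplification chosen for brevity. The names `TinyLogisticOracle` and `BootstrappedThompson`, and the parameters `n_resamples`, `lr`, and `epochs`, are hypothetical. The small logistic regression is only a stand-in for whatever classifier would be plugged in.

```python
import math
import random


class TinyLogisticOracle:
    """Hypothetical stand-in classifier: logistic regression trained by SGD."""

    def __init__(self, n_features, lr=0.5, epochs=10):
        self.w = [0.0] * (n_features + 1)  # last entry is the bias term
        self.lr = lr
        self.epochs = epochs

    def predict_proba(self, x):
        # zip stops at len(x), so the bias entry is excluded from the dot product.
        z = self.w[-1] + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, X, y):
        self.w = [0.0] * len(self.w)  # refit from scratch (a simplification)
        for _ in range(self.epochs):
            for x, r in zip(X, y):
                err = r - self.predict_proba(x)
                for j, xj in enumerate(x):
                    self.w[j] += self.lr * err * xj
                self.w[-1] += self.lr * err


class BootstrappedThompson:
    """Assumed policy sketch: per arm, keep several oracles fit on bootstrap resamples."""

    def __init__(self, n_arms, n_features, n_resamples=5, seed=0):
        self.rng = random.Random(seed)
        self.history = [([], []) for _ in range(n_arms)]  # (contexts, rewards) per arm
        self.oracles = [
            [TinyLogisticOracle(n_features) for _ in range(n_resamples)]
            for _ in range(n_arms)
        ]

    def choose(self, x):
        # Draw one resampled oracle per arm; its prediction plays the role of a
        # posterior sample of that arm's reward probability.
        scores = [self.rng.choice(arm_oracles).predict_proba(x)
                  for arm_oracles in self.oracles]
        best = max(scores)
        return self.rng.choice([a for a, s in enumerate(scores) if s == best])

    def update(self, arm, x, reward):
        X, y = self.history[arm]
        X.append(list(x))
        y.append(reward)
        n = len(X)
        for oracle in self.oracles[arm]:
            # Bootstrap: sample n observations with replacement, then refit.
            idx = [self.rng.randrange(n) for _ in range(n)]
            oracle.fit([X[i] for i in idx], [y[i] for i in idx])
```

In use, each round would call `choose(x)` on the observed context, play the returned arm, and pass the observed binary reward to `update(arm, x, reward)`. Only `fit` and `predict_proba` are required of the oracle, which is what lets any classification algorithm be substituted.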
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Smart Grid Energy Management
Methods: Logistic Regression
