Combinatorial Bandits for Maximum Value Reward Function under Max Value-Index Feedback
Yiliu Wang, Wei Chen, and Milan Vojnović

TL;DR
This paper introduces a combinatorial bandit problem with a new feedback structure (maximum value and index feedback), proposes an algorithm whose regret bounds are comparable to those known under more informative semi-bandit feedback, and demonstrates its effectiveness through experiments.
Contribution
It presents a new feedback structure for combinatorial bandits, along with an algorithm and regret analysis for maximum value reward functions.
Findings
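The algorithm achieves an $O((k/\Delta)\log(T))$ distribution-dependent regret bound, where $k$ is the number of arms selected in each round, $\Delta$ is a distribution-dependent reward gap, and $T$ is the time horizon.
The regret bound is comparable to the bound known under the more informative semi-bandit feedback.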
Experimental results confirm the algorithm's effectiveness.
Abstract
We consider a combinatorial multi-armed bandit problem for maximum value reward function under maximum value and index feedback. This is a new feedback structure that lies in between commonly studied semi-bandit and full-bandit feedback structures. We propose an algorithm and provide a regret bound for problem instances with stochastic arm outcomes according to arbitrary distributions with finite supports. The regret analysis rests on considering an extended set of arms, associated with values and probabilities of arm outcomes, and applying a smoothness condition. Our algorithm achieves a $O((k/\Delta)\log(T))$ distribution-dependent and a $\tilde{O}(\sqrt{T})$ distribution-independent regret, where $k$ is the number of arms selected in each round, $\Delta$ is a distribution-dependent reward gap, and $T$ is the horizon time. Perhaps surprisingly, the regret bound is comparable to…
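To make the feedback structure in the abstract concrete, here is a minimal Python sketch of a single round. It is an illustration, not the paper's algorithm. The names `pull` and `expected_max`, the representation of each arm as a list of `(value, probability)` pairs, independence across arms, and the tie-breaking rule (smallest index wins) are all assumptions made for this example.

```python
import random
from itertools import product


def pull(arms, selected, rng):
    """Simulate one round under max value-index feedback.

    arms[i] is a list of (value, probability) pairs (finite support).
    The learner sees only the maximum value and the index of an arm
    that attained it, not the outcomes of the other selected arms.
    """
    outcomes = {}
    for i in selected:
        values, probs = zip(*arms[i])
        # Draw one outcome for arm i from its finite-support distribution.
        outcomes[i] = rng.choices(values, weights=probs, k=1)[0]
    # Assumption: ties are broken in favour of the smallest arm index.
    winner = min(selected, key=lambda i: (-outcomes[i], i))
    return outcomes[winner], winner


def expected_max(arms, selected):
    """Exact expected maximum value of the selected arms.

    Assumes independent arms and enumerates all outcome combinations,
    so it is only practical for small supports and small selections.
    """
    total = 0.0
    for combo in product(*(arms[i] for i in selected)):
        prob = 1.0
        for _, p in combo:
            prob *= p
        total += prob * max(v for v, _ in combo)
    return total


# Hypothetical instance: three arms, two of which are selected per round.
arms = [
    [(0.0, 0.5), (1.0, 0.5)],
    [(0.0, 0.5), (2.0, 0.5)],
    [(0.5, 1.0)],
]
value, index = pull(arms, [0, 1], random.Random(0))
```

Under semi-bandit feedback the learner would observe every entry of `outcomes`; under full-bandit feedback it would observe only the maximum value. Returning the pair `(value, index)` sits between the two, which is the sense in which the abstract places this feedback structure in between them.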
Taxonomy
Topics: Advanced Bandit Algorithms Research · Decision-Making and Behavioral Economics
