Decentralized Heterogeneous Multi-Player Multi-Armed Bandits with Non-Zero Rewards on Collisions
Akshayaa Magesh, Venugopal V. Veeravalli

TL;DR
This paper introduces a decentralized algorithm for multi-player multi-armed bandits with heterogeneous rewards and non-zero collision rewards, achieving near-optimal regret without prior knowledge of the time horizon.
Contribution
It proposes a novel policy for decentralized multi-player bandits with heterogeneity and non-zero collision rewards, achieving near order-optimal regret.
Findings
Achieves expected regret of order O(log^{1+δ} T)
Handles more players than arms without communication
Supports non-zero rewards on collisions
Abstract
We consider a fully decentralized multi-player stochastic multi-armed bandit setting where the players cannot communicate with each other and can observe only their own actions and rewards. The environment may appear differently to different players, i.e., the reward distributions for a given arm are heterogeneous across players. In the case of a collision (when more than one player plays the same arm), we allow for the colliding players to receive non-zero rewards. The time-horizon for which the arms are played is not known to the players. Within this setup, where the number of players is allowed to be greater than the number of arms, we present a policy that achieves near order-optimal expected regret of order $O(\log^{1+\delta} T)$ for some $0 < \delta < 1$ over a time-horizon of duration $T$. This paper is accepted at IEEE Transactions on Information Theory.
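To make the setting in the abstract concrete, the following is a minimal Python sketch of a toy environment with heterogeneous per-player means and non-zero rewards on collisions. It is not the paper's policy or reward model: the class name `CollisionBanditEnv`, the Bernoulli rewards, and the `collision_factor` that scales a player's mean when an arm is shared are all assumptions made for illustration.

```python
import random


class CollisionBanditEnv:
    """Toy multi-player bandit with heterogeneous means and collision rewards."""

    def __init__(self, means, collision_factor=0.5, seed=0):
        # means[k][i]: player k's mean reward on arm i when playing it alone.
        self.means = means
        # Hypothetical collision model (not from the paper): a shared arm
        # pays each colliding player a scaled-down, still non-zero, mean.
        self.collision_factor = collision_factor
        self.rng = random.Random(seed)

    def step(self, actions):
        # actions[k] is the arm chosen by player k in this round.
        counts = {}
        for arm in actions:
            counts[arm] = counts.get(arm, 0) + 1

        rewards = []
        for player, arm in enumerate(actions):
            mean = self.means[player][arm]
            if counts[arm] > 1:
                mean *= self.collision_factor
            # Bernoulli draw; in a decentralized run, player k would be
            # shown only rewards[k] and never the other players' actions.
            rewards.append(1.0 if self.rng.random() < mean else 0.0)
        return rewards


# Three players sharing two arms, so at least one collision is unavoidable.
env = CollisionBanditEnv(
    means=[[0.9, 0.2], [0.3, 0.8], [0.6, 0.5]],
    collision_factor=0.5,
)
rewards = env.step([0, 1, 0])  # players 0 and 2 collide on arm 0
```

With more players than arms, every round contains a collision, which is why a non-zero collision reward changes which allocations of players to arms are worth learning.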
Taxonomy
Topics: Advanced Bandit Algorithms Research · Auction Theory and Applications · Smart Grid Energy Management
