Distributed Learning in Multi-Armed Bandit with Multiple Players
Keqin Liu, Qing Zhao

TL;DR
This paper studies a decentralized multi-armed bandit problem in which multiple players compete for arms, and proposes an order-optimal policy that achieves logarithmic regret growth without communication or pre-agreement among players.
Contribution
It introduces a decentralized policy based on Time Division Fair Sharing that attains the same regret order as centralized solutions, ensuring fairness and broad applicability.
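The sharing structure behind a time-division scheme can be sketched as follows. This is only an illustration of the round-robin idea, not the paper's learning policy: it assumes the ranking of arms is already known and that players already hold distinct offsets, whereas the paper's policy works without pre-agreement. The names `tdfs_target_rank` and `schedule` are hypothetical.

```python
def tdfs_target_rank(t, offset, num_players):
    """Rank (0 = best arm) that a player with the given offset targets in slot t."""
    return (t + offset) % num_players


def schedule(num_players, horizon):
    """Targeted ranks per slot, assuming player k holds offset k (an assumption)."""
    return [
        [tdfs_target_rank(t, k, num_players) for k in range(num_players)]
        for t in range(horizon)
    ]


if __name__ == "__main__":
    # With 3 players, each slot assigns the 3 best ranks to distinct players,
    # and each player visits every rank once per 3 slots.
    for t, row in enumerate(schedule(3, 3)):
        print(t, row)
```

In each slot the targeted ranks form a permutation of the M best arms, so no two players collide, and over any M consecutive slots every player spends one slot on each rank, which is the sense in which the sharing is fair.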
Findings
The decentralized policy achieves the same logarithmic order of regret growth as the centralized counterpart.
The proposed policy guarantees fairness among players.
A lower bound on regret growth rate for decentralized policies is established.
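For reference, system regret in this setting is commonly written as the reward lost relative to an ideal scenario in which the M best arms are always played without collision. The notation below is an assumption introduced for illustration: \mu_{(j)} denotes the j-th largest mean reward and Y(t) the total reward collected by all players at time t.

```latex
R_T \;=\; T \sum_{j=1}^{M} \mu_{(j)} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} Y(t)\right]
```

Under this reading, "logarithmic regret growth" means that R_T grows in proportion to \log T as T increases.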
Abstract
We formulate and study a decentralized multi-armed bandit (MAB) problem. There are M distributed players competing for N independent arms. Each arm, when played, offers i.i.d. reward according to a distribution with an unknown parameter. At each time, each player chooses one arm to play without exchanging observations or any information with other players. Players choosing the same arm collide, and, depending on the collision model, either no one receives reward or the colliding players share the reward in an arbitrary way. We show that the minimum system regret of the decentralized MAB grows with time at the same logarithmic order as in the centralized counterpart where players act collectively as a single entity by exchanging observations and making decisions jointly. A decentralized policy is constructed to achieve this optimal order while ensuring fairness among players and without…
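The collision model described in the abstract can be sketched as a single simulated time step. This is a minimal sketch under stated assumptions: rewards are taken to be Bernoulli with the given means (the abstract allows more general distributions), and the sharing rule shown, where one randomly chosen colliding player receives the whole reward, is just one instance of an "arbitrary" sharing rule. The function name `play_round` and its parameters are hypothetical.

```python
import random


def play_round(choices, means, rng, share_on_collision=False):
    """Simulate one time step.

    choices[k] is the arm index picked by player k; means[a] is the
    (assumed Bernoulli) mean reward of arm a. Returns per-player rewards.
    """
    rewards = [0.0] * len(choices)

    # Group players by the arm they chose.
    players_on_arm = {}
    for player, arm in enumerate(choices):
        players_on_arm.setdefault(arm, []).append(player)

    for arm, players in players_on_arm.items():
        draw = 1.0 if rng.random() < means[arm] else 0.0
        if len(players) == 1:
            # No collision: the lone player receives the reward.
            rewards[players[0]] = draw
        elif share_on_collision:
            # Assumed sharing rule: one colliding player gets everything.
            rewards[rng.choice(players)] = draw
        # Otherwise: collision with no sharing, so no one receives reward.
    return rewards


if __name__ == "__main__":
    rng = random.Random(0)
    print(play_round([0, 1, 1], [0.9, 0.8, 0.1], rng))
```

Under either collision model the total reward collected from a contested arm never exceeds one draw from that arm, which is why collisions contribute to system regret.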
