On Regret-Optimal Learning in Decentralized Multi-player Multi-armed Bandits
Naumaan Nayyar, Dileep Kalathil, Rahul Jain

TL;DR
This paper introduces decentralized policies for multi-armed bandit problems that achieve near-optimal regret growth, improving previous bounds and addressing the challenge of decentralized coordination without communication.
Contribution
Proposes two new decentralized algorithms, E^3 and E^3-TS, that attain near-logarithmic regret growth in multi-player multi-armed bandits, reducing the regret gap in decentralized learning.
Findings
Regret grows at most as O(log^{1+ε} T) with the new policies.
Improves on earlier decentralized policies, whose regret grew as O(log^2 T), bringing the bound to O(log^{1+ε} T), close to the O(log T) optimum.
Addresses the extra cost of learning without coordination: decentralization worsens the order of regret by at most an ε-factor relative to O(log T).
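To make the ε-factor concrete, the bounds above can be compared by order of growth alone. The following is a rough worked comparison, not a result from the paper: it ignores all constants hidden by the O(·) notation, and the natural logarithm and the values T = 10^6 and ε = 0.1 are assumptions chosen only for illustration.

```latex
\[
\frac{\log^{1+\varepsilon} T}{\log T} = \log^{\varepsilon} T,
\qquad
\frac{\log^{2} T}{\log T} = \log T .
\]
% Illustrative values (assumed): T = 10^6, \varepsilon = 0.1, natural log.
\[
\ln T \approx 13.82,
\qquad
\ln^{\varepsilon} T = (13.82)^{0.1} \approx 1.30 .
\]
```

Under these assumed values, the new bound exceeds log T by a factor of about 1.3, while a log^2 T bound exceeds it by a factor of about 13.8.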
Abstract
We consider the problem of learning in single-player and multiplayer multiarmed bandit models. Bandit problems are classes of online learning problems that capture exploration versus exploitation tradeoffs. In a multiarmed bandit model, players can pick among many arms, and each play of an arm generates an i.i.d. reward from an unknown distribution. The objective is to design a policy that maximizes the expected reward over a time horizon for a single player setting and the sum of expected rewards for the multiplayer setting. In the multiplayer setting, arms may give different rewards to different players. There is no separate channel for coordination among the players. Any attempt at communication is costly and adds to regret. We propose two decentralizable policies, E^3 and E^3-TS, that can be used in both single player and multiplayer settings.…
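The single-player model described in the abstract, and the notion of regret it refers to, can be illustrated with a small simulation. The sketch below is not E^3 or E^3-TS: it is a generic explore-then-commit policy, used only to show how regret accumulates when arms give i.i.d. rewards from unknown distributions. The function name `explore_then_commit`, the Bernoulli reward model, and all parameter values are assumptions made for this illustration. The multiplayer setting is not modeled.

```python
import random


def explore_then_commit(means, horizon, explore_rounds, seed=0):
    """Simulate a simple explore-then-commit policy on Bernoulli arms.

    `means` holds the true (hidden from the policy) success probability
    of each arm. Returns the cumulative pseudo-regret: the sum over time
    of (best arm mean - mean of the arm actually played).
    """
    if explore_rounds < 1:
        raise ValueError("explore_rounds must be at least 1")

    rng = random.Random(seed)
    n_arms = len(means)
    best_mean = max(means)
    totals = [0.0] * n_arms   # summed observed rewards per arm
    counts = [0] * n_arms     # number of plays per arm
    committed = None          # arm chosen after exploration ends
    regret = 0.0

    for t in range(horizon):
        if t < explore_rounds * n_arms:
            # Exploration phase: play the arms in round-robin order.
            arm = t % n_arms
        else:
            # Exploitation phase: commit once to the best empirical arm.
            if committed is None:
                committed = max(
                    range(n_arms), key=lambda a: totals[a] / counts[a]
                )
            arm = committed

        # Draw an i.i.d. Bernoulli reward from the chosen arm.
        reward = 1.0 if rng.random() < means[arm] else 0.0
        totals[arm] += reward
        counts[arm] += 1
        regret += best_mean - means[arm]

    return regret
```

A call such as `explore_then_commit([0.9, 0.1], horizon=1000, explore_rounds=50)` returns the regret paid during exploration plus any regret from committing to a suboptimal arm. A fixed exploration length like this one must be tuned to the horizon, which is one reason policies that interleave exploration and exploitation over time are of interest.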
