A Decentralized Policy with Logarithmic Regret for a Class of Multi-Agent Multi-Armed Bandit Problems with Option Unavailability Constraints and Stochastic Communication Protocols

Pathmanathan Pankayaraj, D. H. S. Maithripala, and J. M. Berg

arXiv:2003.12968 · cs.LG · April 1, 2020 · 1 citation

Open Access

TL;DR

This paper introduces a decentralized UCB-based policy for multi-agent multi-armed bandit problems in which agents face option unavailability and stochastic communication constraints, and shows that the policy achieves logarithmic regret.

Contribution

It proposes the first decentralized policy with logarithmic regret for multi-agent MABs under option unavailability and stochastic communication constraints.

Findings

01. Achieves logarithmic regret in decentralized multi-agent MAB settings.

02. Matches or exceeds prior results in fully connected, stationary communication scenarios.

03. First to address non-fully connected graphs with stochastic communication protocols.

Abstract

This paper considers a multi-armed bandit (MAB) problem in which multiple mobile agents receive rewards by sampling from a collection of spatially dispersed stochastic processes, called bandits. The goal is to formulate a decentralized policy for each agent that maximizes the total cumulative reward over all agents, subject to option availability and inter-agent communication constraints. The problem formulation is motivated by applications in which a team of autonomous mobile robots cooperates to accomplish an exploration and exploitation task in an uncertain environment. Bandit locations are represented by vertices of the spatial graph. At any time, an agent's options consist of sampling the bandit at its current location or traveling along an edge of the spatial graph to a new bandit location. Communication constraints are described by a directed, non-stationary, stochastic…
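The option availability constraint described in the abstract can be illustrated with a small sketch. This is not the paper's policy: it applies the standard single-agent UCB1 index, restricted to the options reachable from the agent's current vertex, and leaves out inter-agent communication entirely. The names `available_options` and `ucb1_choice`, the example graph, and the sample statistics are all hypothetical, and the sketch assumes that choosing a neighboring vertex simply makes that bandit the agent's next target.

```python
import math


def available_options(spatial_graph, location):
    """Options open to an agent: its current vertex plus adjacent vertices.

    spatial_graph is a hypothetical adjacency mapping {vertex: set_of_neighbors}.
    """
    return [location] + sorted(spatial_graph[location])


def ucb1_choice(counts, means, t, options):
    """Return the available option with the largest standard UCB1 index.

    counts[k] is how often bandit k has been sampled, means[k] its empirical
    mean reward, and t the current time step. Bandits outside `options` are
    ignored, which is how the availability constraint enters this sketch.
    """
    # Any available bandit that has never been sampled is tried first.
    for k in options:
        if counts[k] == 0:
            return k

    def index(k):
        # Empirical mean plus the UCB1 exploration bonus.
        return means[k] + math.sqrt(2.0 * math.log(t) / counts[k])

    return max(options, key=index)


# Hypothetical spatial graph: a path 0 - 1 - 2 - 3.
spatial_graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
counts = [5, 5, 1, 0]          # illustrative sample counts per bandit
means = [0.2, 0.5, 0.4, 0.0]   # illustrative empirical means per bandit

options = available_options(spatial_graph, 1)
choice = ucb1_choice(counts, means, t=11, options=options)
print(options, choice)
```

In this example the agent at vertex 1 cannot select bandit 3, even though it has never been sampled, because vertex 3 is not adjacent to vertex 1. The rarely sampled bandit 2 has the largest index among the reachable options, so the agent heads there.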

Taxonomy

Topics: Advanced Bandit Algorithms Research · Optimization and Search Problems · Reinforcement Learning in Robotics