Voting-Based Multi-Agent Reinforcement Learning for Intelligent IoT

Yue Xu; Zengde Deng; Mengdi Wang; Wenjun Xu; Anthony Man-Cho So; Shuguang Cui

arXiv:1907.01385·cs.LG·September 1, 2020


Open Access

TL;DR

This paper introduces a voting-based multi-agent reinforcement learning framework for IoT systems, utilizing a distributed primal-dual algorithm to achieve efficient, consensus-driven decision making with proven convergence.

Contribution

It formulates a novel voting-based MARL approach for IoT, proposing a distributed primal-dual algorithm that guarantees convergence and efficiency comparable to centralized methods.

Findings

01. The proposed algorithm converges sublinearly in simulations.

02. Distributed learning matches centralized convergence rates.

03. Case studies demonstrate practical effectiveness in IoT systems.

Abstract

The recent success of single-agent reinforcement learning (RL) in Internet of things (IoT) systems motivates the study of multi-agent reinforcement learning (MARL), which is more challenging but more useful in large-scale IoT. In this paper, we consider a voting-based MARL problem, in which the agents vote to make group decisions and the goal is to maximize the globally averaged returns. To this end, we formulate the MARL problem based on the linear programming form of the policy optimization problem and propose a distributed primal-dual algorithm to obtain the optimal solution. We also propose a voting mechanism through which the distributed learning achieves the same sublinear convergence rate as centralized learning. In other words, the distributed decision making does not slow down the process of achieving global consensus on optimality. Lastly, we verify the convergence of our…

Tables

Table I: Parameters

Parameter	Value
CBR ($C_{u}$)	128 kbps
Total 2D area	4 km²
Total bandwidth	20 MHz
Carrier frequency ($f_{c}$)	2 GHz
PRB bandwidth ($B$)	180 kHz
Max user velocity ($c_{max}$)	10 m/s
Ground BS max transmit power ($P_{m}$)	46 dBm
UAV-BS max transmit power ($P_{U}$)	20 dBm
Additional LoS path loss ($\eta_{\mathrm{LoS}}$)	1 dB
Noise power spectral density ($N_{0}$)	$-174$ dBm/Hz

Equations

$$(\mathcal{S},\ \mathcal{A},\ \mathcal{P},\ \{\mathcal{R}_{m}\}_{m=1}^{M}),$$

$$\max_{\pi^{g}}\bigg\{\bar{v}^{\pi^{g}}=\lim_{T\rightarrow\infty}\mathbb{E}^{\pi^{g}}\bigg[\frac{1}{T}\sum_{t=1}^{T}\sum_{m=1}^{M}r^{m}_{i_{t}i_{t+1}}(a_{t})\,\Big|\,i_{1}=i\bigg],\ i\in\mathcal{S}\bigg\},$$

$$\bar{v}^{*}+v^{*}(i)=\max_{a\in\mathcal{A}}\bigg\{\sum_{j\in\mathcal{S}}p_{ij}(a)v^{*}(j)+\sum_{j\in\mathcal{S}}p_{ij}(a)\sum_{m=1}^{M}r^{m}_{ij}(a)\bigg\},\quad\forall i\in\mathcal{S},$$

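$$\min_{\bar{v},\,\bm{v}}\ \bar{v}\quad\text{s.t.}\quad\bar{v}\cdot\bm{e}+(I-P_{a})\bm{v}-\sum_{m=1}^{M}\bar{\bm{r}}^{m}_{a}\geq\bm{0},\quad\forall a\in\mathcal{A},$$

$$\max_{\bm{\mu}}\ \sum_{a\in\mathcal{A}}\bm{\mu}_{a}^{\top}\sum_{m=1}^{M}\bar{\bm{r}}^{m}_{a}\quad\text{s.t.}\quad\sum_{a\in\mathcal{A}}(I-P_{a})^{\top}\bm{\mu}_{a}=\bm{0},\quad\sum_{i\in\mathcal{S}}\sum_{a\in\mathcal{A}}\mu_{i,a}=1,\quad\mu_{i,a}\geq 0,$$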

$$0=\sum_{a\in\mathcal{A}}(\bm{\mu}_{a}^{*})^{\top}\bigg(\bar{v}^{*}\cdot\bm{e}+(I-P_{a})\bm{v}^{*}-\sum_{m=1}^{M}\bar{\bm{r}}^{m}_{a}\bigg)=\bar{v}^{*}+\sum_{a\in\mathcal{A}}(\bm{\mu}_{a}^{*})^{\top}\bigg((I-P_{a})\bm{v}^{*}-\sum_{m=1}^{M}\bar{\bm{r}}^{m}_{a}\bigg).$$

$$\min_{\bm{v}\in\mathcal{V}}\max_{\bm{\mu}\in\mathcal{U}}\ \sum_{a\in\mathcal{A}}\bm{\mu}_{a}^{\top}\bigg((P_{a}-I)\bm{v}+\sum_{m=1}^{M}\bar{\bm{r}}^{m}_{a}\bigg).$$

$$\mathcal{V}=\mathbb{R}^{|\mathcal{S}|},\quad\mathcal{U}=\bigg\{\bm{\mu}\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}|}\ \Big|\ \sum_{i\in\mathcal{S}}\sum_{a\in\mathcal{A}}\mu_{i,a}=1,\ \bm{\mu}\geq\bm{0}\bigg\}$$

$$\mu^{g,t}_{i,a}\propto\prod_{m=1}^{M}\mu^{m,t}_{i,a}.$$

$$\mu^{m,t+1}_{i,a}=\begin{cases}\mu^{m,t}_{i,a}\exp\big\{\Delta^{m,t}_{i,a}\big\},&\text{if }i=i_{t},\ a=a_{t},\\ \mu^{m,t}_{i,a},&\text{otherwise},\end{cases}$$

$$\Delta^{m,t}_{i,a}=\beta\bigg(\frac{\frac{1}{\beta}\log x^{t}+v^{t}_{j}-v^{t}_{i}-C}{M}+r^{m}_{ij}(a)\bigg)$$

$$x^{t}=\frac{1}{\sum_{i\in\mathcal{S},a\in\mathcal{A}}\prod_{m=1}^{M}\mu^{m,t}_{i,a}}.$$

$$p^{\text{primal}}_{i_{t},a_{t}}=\frac{\prod_{m=1}^{M}\mu^{m,t}_{i_{t},a_{t}}}{\sum_{i\in\mathcal{S},a\in\mathcal{A}}\prod_{m=1}^{M}\mu^{m,t}_{i,a}}.$$

$$\bm{v}^{t+1}=\Pi_{\mathcal{V}}\big\{\bm{v}^{t}+\bm{d}^{t}\big\},$$

$$\bm{d}^{t}=\alpha(\bm{e}_{i}-\bm{e}_{j})$$

$$\mu^{g,t}_{i,a}=x^{t}\prod_{m=1}^{M}\mu^{m,t}_{i,a},$$

$$\mu^{g,t+1}_{i,a},\qquad\bm{v}^{t+1}$$

$$\Delta^{g,t}_{i,a}=\beta\bigg(v^{t}_{j}-v^{t}_{i}-C+\sum_{m=1}^{M}r^{m}_{ij}(a)\bigg)$$

$$\begin{aligned}\mu^{g,t+1}_{i,a}&=x^{t+1}\prod_{m=1}^{M}\Big(\mu^{m,t}_{i,a}\exp\big\{\Delta^{m,t}_{i,a}\big\}\Big)\\&=x^{t+1}\prod_{m=1}^{M}\mu^{m,t}_{i,a}\exp\bigg\{\sum_{m=1}^{M}\Delta^{m,t}_{i,a}\bigg\}\\&=x^{t+1}(x^{t})^{-1}\mu^{g,t}_{i,a}\exp\bigg\{\sum_{m=1}^{M}\Delta^{m,t}_{i,a}\bigg\}\\&=x^{t+1}\mu^{g,t}_{i,a}\exp\bigg\{\beta\bigg(v^{t}_{j}-v^{t}_{i}-C+\sum_{m=1}^{M}r^{m}_{ij}(a)\bigg)\bigg\}.\end{aligned}$$

$$\mathbb{E}\big[\Delta^{g,t}_{i,a}\mid\mathcal{F}_{t}\big]=\frac{\beta}{|\mathcal{S}|\cdot|\mathcal{A}|}\bigg((P_{a}-I)\bm{v}^{t}+\sum_{m=1}^{M}\bar{\bm{r}}^{m}_{a}-C\cdot\bm{e}\bigg)_{i},\quad\forall i\in\mathcal{S},\ a\in\mathcal{A}.$$

$$\mathbb{E}\big[\bm{d}^{t}\mid\mathcal{F}_{t}\big]=\alpha\sum_{a\in\mathcal{A}}(I-P_{a})^{\top}\bm{\mu}^{g,t}_{a}.$$

$$\begin{aligned}\frac{1}{\beta}\cdot\mathbb{E}\big[\Delta^{g,t}_{i,a}\mid\mathcal{F}_{t}\big]&=\frac{1}{|\mathcal{S}|\cdot|\mathcal{A}|}\bigg(\sum_{j\in\mathcal{S}}p_{ij}(a)v^{t}_{j}-v^{t}_{i}\bigg)+\frac{1}{|\mathcal{S}|\cdot|\mathcal{A}|}\bigg(\sum_{j\in\mathcal{S}}\sum_{m=1}^{M}p_{ij}(a)r^{m}_{ij}(a)-C\bigg)\\&=\frac{1}{|\mathcal{S}|\cdot|\mathcal{A}|}\bigg((P_{a}-I)\bm{v}^{t}+\sum_{m=1}^{M}\bar{\bm{r}}^{m}_{a}-C\cdot\bm{e}\bigg)_{i}.\end{aligned}$$

$$\begin{aligned}\mathbb{E}\big[\bm{d}^{t}\mid\mathcal{F}_{t}\big]&=\alpha\bigg(\sum_{i\in\mathcal{S}}\Pr(i_{t}=i\mid\mathcal{F}_{t})\,\bm{e}_{i}-\sum_{j\in\mathcal{S}}\Pr(j_{t}=j\mid\mathcal{F}_{t})\,\bm{e}_{j}\bigg)\\&=\alpha\bigg(\sum_{i\in\mathcal{S}}\sum_{a\in\mathcal{A}}\mu^{g,t}_{i,a}\bm{e}_{i}-\sum_{j\in\mathcal{S}}\sum_{i\in\mathcal{S}}\sum_{a\in\mathcal{A}}p_{ij}(a)\mu^{g,t}_{i,a}\bm{e}_{j}\bigg)\end{aligned}$$


Taxonomy

Topics: Distributed Control Multi-Agent Systems · Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control

Full text

Voting-Based Multi-Agent Reinforcement Learning for Intelligent IoT

Yue Xu$^{1,4}$, Zengde Deng$^{2}$, Mengdi Wang$^{3}$, Wenjun Xu$^{1}$, Anthony Man-Cho So$^{2}$, Shuguang Cui$^{4}$

$^{1}$Key Lab of Universal Wireless Communications, Ministry of Education, Beijing University of Posts and Telecommunications

$^{2}$Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong

$^{3}$Department of Operations Research and Financial Engineering, Princeton University

$^{4}$Shenzhen Research Institute of Big Data and The Chinese University of Hong Kong, Shenzhen

Abstract

The recent success of single-agent reinforcement learning (RL) in Internet of things (IoT) systems motivates the study of multi-agent reinforcement learning (MARL), which is more challenging but more useful in large-scale IoT. In this paper, we consider a voting-based MARL problem, in which the agents vote to make group decisions and the goal is to maximize the globally averaged returns. To this end, we formulate the MARL problem based on the linear programming form of the policy optimization problem and propose a primal-dual algorithm to obtain the optimal solution. We also propose a voting mechanism through which the distributed learning achieves the same sublinear convergence rate as centralized learning. In other words, the distributed decision making does not slow down the process of achieving global consensus on optimality. Lastly, we verify the convergence of our proposed algorithm with numerical simulations and conduct case studies in practical multi-agent IoT systems.

Index Terms:

Multi-agent reinforcement learning, voting mechanism, primal-dual algorithm

I Introduction

Reinforcement learning (RL) aims at maximizing a cumulative reward by selecting a sequence of optimal actions to interact with a stochastic unknown environment, whose dynamics are usually modeled as a Markov decision process (MDP) [1]. Recently, single-agent RL has been successfully applied to provide adaptive and autonomous intelligence in many Internet of things (IoT) applications, including smart cellular networks [2, 3, 4], smart vehicle networks [5, 6, 7], and smart unmanned aerial vehicle (UAV) networks [8, 9, 10]. Despite these successes, many recent studies envision that IoT entities, e.g., smartphones, sensors, and UAVs, will become more decentralized, ad hoc, and autonomous in nature [11, 12]. This encourages the extension from single-agent RL to multi-agent RL (MARL) to study the smart collaboration among local entities in order to deliver a superior collective intelligence, instead of simply treating them as independent learners. However, MARL is more challenging since each agent interacts with not only the environment but also the other agents.

Although a number of collaborative learning models based on MARL have been recently proposed [13, 14, 15, 16, 17, 18, 19, 20, 21], they usually impose a discount factor $\gamma\in(0,1)$ on the future rewards to render the problem more tractable, e.g., by bounding the cumulative reward [22, 23, 24]. However, many optimization tasks in IoT systems, e.g., resource allocation and admission control, are long-run or non-terminating tasks. Existing studies reveal that RL methods based on the discounted MDP may perform poorly on continuing tasks and become computationally challenging when the discount factor is close to one [1, 25, 26, 27]. This necessitates the development of MARL models based on the undiscounted average-reward MDP (AMDP) to tackle the continuing optimization tasks in IoT systems. Moreover, existing MARL models usually exhibit a performance degradation compared with their centralized versions [28, 21] and only provide asymptotic convergence to an optimal point [28, 21] or simply give empirical evaluations without theoretical guarantees [16, 17, 18, 19, 20]. In contrast, in this paper, we give a sublinear convergence rate and theoretically prove that our proposed MARL model achieves the same convergence rate as centralized learning, which makes it a suitable learning paradigm for distributed IoT systems.

Meanwhile, it is critical to specify a proper collaboration protocol in order to promote safe and efficient cooperation in MARL systems. Many existing MARL models are built upon the centralized-learning-with-decentralized-execution framework, where the agents perform iterative parameter consensus with a centralized server [18, 14, 29, 30]. Moreover, the centralized server is assumed to have access to the behavioral policy or value functions of all distributed agents for model training. However, in many IoT applications (e.g., location services), privacy-sensitive data (e.g., policy or value functions) should not be logged at a centralized server due to privacy and security concerns. On the other hand, recent works also propose a number of decentralized solutions which coordinate the agents through iterative parameter consensus among neighboring agents [21, 23, 22, 24]. However, this may give rise to massive communication overhead in large-scale IoT networks. Besides, their convergence depends on the connectivity properties of the networked agents, which can be a prohibitive requirement on the topology of a randomly deployed IoT network. The above issues motivate us to propose a new collaboration protocol for MARL which can coordinate the local entities in a safe and communication-efficient way.

In this paper, we consider a collaborative MARL setting where the agents vote to make group decisions and the aim is to maximize the globally averaged return of all agents in the environment. Our primary interest is to develop a sample-efficient model-free MARL algorithm built upon voting-based coordination in the context of infinite-horizon AMDP. In particular, the considered AMDP does not assume the future rewards to be discounted and only needs to satisfy a certain fast-mixing property. This significantly complicates our analysis when compared with the discounted cases. The main contributions are summarized as follows.

• We formulate the MARL problem in the context of AMDP based on the linear programming form of the policy optimization problem and propose a primal-dual algorithm to obtain the optimal solution.

• We provide the first sublinear convergence rate for solving the MARL problem for infinite-horizon AMDP. The proposed algorithm and theoretical analysis also cover the single-agent RL as a special case, which makes them more general.

• We propose a voting-based collaboration protocol for the proposed MARL algorithm, through which the distributed learning achieves the same sublinear convergence as centralized learning. In other words, the proposed distributed decision-making process does not slow down the process of achieving global optimality. Moreover, the proposed voting-based protocol offers better data privacy and communication efficiency than existing parameter-consensus-based protocols.

In addition, we also verify the convergence of our proposed algorithm through numerical simulations and conduct a case study in a multi-agent IoT system to justify the learning effectiveness.

The proposed model is promising for solving the long-run or non-terminating optimization tasks in multi-agent IoT systems, where distributed agents vote to determine a joint action, aiming at maximizing the globally averaged return of all agents. For example, the model can be employed to learn the optimal resource (e.g., communication bandwidth and channel) allocation policy for a group of IoT devices to improve the overall capacity; learn the optimal on/off policy for a group of base stations to improve the overall energy efficiency; learn the optimal trajectory planning policy for a group of UAVs to avoid collisions. Moreover, since the distributed agents only need to exchange their vote information for collaboration, without revealing their policy or value functions to each other, the proposed model would be preferable in privacy-sensitive applications, e.g., location services.

The remainder of this paper is organized as follows. Section II reviews the existing works on MARL. Section III introduces the problem formulations. Section IV presents the voting-based multi-agent reinforcement learning algorithm. Section V presents the convergence analysis of our proposed algorithm. Section VI discusses the simulation results. Finally, Section VII concludes the paper.

Notation: For a vector $\bm{x}\in\mathbb{R}^{n}$, we denote its $i$-th component as $x_{i}$, its transpose as $\bm{x}^{\top}$, and its Euclidean norm as $\|\bm{x}\|=\sqrt{\bm{x}^{\top}\bm{x}}$. For a positive number $x$, we write $\log x$ for its natural logarithm. We denote by $\bm{e}=(1,\ldots,1)^{\top}$ the all-ones vector and by $\bm{e}_{i}$ the vector with its $i$-th entry equaling $1$ and other entries equaling $0$. For two probability distributions $p,q$ over a finite set $X$, we denote their Kullback-Leibler (KL) divergence as $D_{KL}(p||q)=\sum_{x\in X}p(x)\log\frac{p(x)}{q(x)}$.

II Related Work

Many existing model-free MARL algorithms are based on the framework of Markov games [31, 32, 33, 34, 35] or temporal-difference RL [21, 16, 17, 18, 19, 20]. In the context of Markov games, MARL is usually modeled as a stochastic game, such as cooperative games [31], zero-sum stochastic games [32, 36, 37, 38], general-sum stochastic games [33], decentralized Q-Learning [35], and the recent mean-field MARL [34]. Alternatively, the study of MARL in the context of temporal-difference RL mainly originates from dynamic programming, which learns by following the Bellman equation, including the ones based on deep neural networks [16, 17, 18, 19, 20] and the ones based on linear function approximators [21]. However, first, the above MARL models can only provide asymptotic convergence [21] to an optimal point or simply provide empirical evaluations without theoretical guarantees [16, 17, 18, 19, 20]. Second, they are all based on the discounted MDP, instead of the undiscounted AMDP. On the other hand, though average-reward RL has received much attention in recent years, most studies focus on the single-agent case [25, 39, 40, 26, 27]. Research on average-reward MARL is still at an exploratory stage.

There are two lines of research in the existing literature that focus on the saddle-point formulation of RL. One line studies the saddle-point formulation resulting from the fixed-point problem of policy evaluation [22, 41, 42, 23, 24], i.e., learning the value function of a fixed policy. Among others, the works [24, 23] provided the sample complexity analysis of policy evaluation in the context of MARL, where the policies of all agents are fixed. The other line, which includes this paper, focuses on the saddle-point formulation resulting from the policy optimization problem [39, 40], where the policy is continuously updated towards the optimal one. This makes the analysis substantially more challenging than that for policy evaluation. In the single-agent setting, our work is closely related to [39]. However, to the best of our knowledge, our work is the first to consider solving a saddle-point policy optimization in the context of MARL, which takes the coordination among multiple agents into account. Moreover, we also provide numerical simulations and case studies to corroborate our theoretical results, while previous works mainly focus on theoretical analysis [39, 40].

Finally, most MARL models are based on parameter-consensus-based coordination, where the local agents reach consensus on their parameters with a centralized server [18, 14, 29, 30] or with their neighboring agents [21, 23, 22, 24]. Although many works adopted voting-based coordination in their proposed learning algorithms [43, 44, 45], they are not developed for MARL. A relevant work is [46], which proposed a dedicated majority voting rule to coordinate the MARL agents under discounted MDP. That rule, however, is a heuristic strategy without theoretical guarantees and may not perform well on non-terminating tasks.

III Problem Formulation

In this paper, we consider MARL in the presence of a generative model of the MDP [47, 48, 49]. The underlying MDP is unknown, but we have access to a sampling oracle, which takes an arbitrary state-action pair $(i,a)$ as input and generates the next state $j$ with probability $p_{ij}(a)$, along with an immediate reward for each individual agent. The goal is to find the optimal policy of the unknown AMDP by interacting with the sampling oracle. Such a simulator-defined MDP has been studied in the existing literature in the context of single-agent RL, including model-based RL [47, 48, 49] and model-free RL [50, 51]. In what follows, we first introduce the settings of the multi-agent AMDP and then formulate the multi-agent policy optimization problem as a primal-dual saddle-point optimization problem.

III-A Multi-Agent AMDP

We focus on the infinite-horizon AMDP, which aims at optimizing the average-per-time-step reward over an infinite decision sequence. Existing works on RL usually impose a discount factor $\gamma\in(0,1)$ on the future rewards to render the problem more tractable; e.g., by making the cumulative reward bounded. However, discounted RL may yield a poor performance over long-run (especially non-terminating) tasks and become computationally challenging when the discount factor is close to one [1, 25, 26, 27]. In this paper, we do not assume that the future rewards are discounted. Rather, we assume that the AMDP satisfies certain fast mixing property (given in Sec. V), which significantly complicates our analysis when compared with the discounted cases.

A multi-agent AMDP can be described by the tuple

$$(\mathcal{S},\ \mathcal{A},\ \mathcal{P},\ \{\mathcal{R}_{m}\}_{m=1}^{M}),$$

where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}=\left\{p_{ij}(a)\mid i,j\in\mathcal{S},a\in\mathcal{A}\right\}$ is the collection of state-to-state transition probabilities, and $\left\{\mathcal{R}_{m}\right\}^{M}_{m=1}$ is the collection of local reward functions with $\mathcal{R}_{m}=\left\{r_{ij}^{m}(a)\mid i,j\in\mathcal{S},a\in\mathcal{A}\right\}$ and $M$ being the number of agents. We consider the setting where the reward functions of the agents may differ from each other and are private to each corresponding agent. We assume that each reward $r_{ij}^{m}(a)$, where $i,j\in\mathcal{S}$, $a\in\mathcal{A}$, and $m=1,\ldots,M$, lies in $[0,1]$. This public-state, private-reward setting is widely considered in many recent works on collaborative MARL [23, 24, 21]. Moreover, we assume that the multi-agent AMDP is ergodic (i.e., aperiodic and recurrent), so that there is a unique stationary distribution under any stationary policy. The MARL system selects the action to take according to the votes from local agents. Each agent determines its vote individually without communicating with others. In particular, at each time step $t$, the MARL system works as follows:

all agents observe the state $i_{t}\in\mathcal{S}$ ;
each agent votes for the action $a_{t}$ to take under $i_{t}$ ;
the system executes $a_{t}$ according to the votes;
the system shifts to a new state $i_{t+1}\in\mathcal{S}$ with probability $p_{i_{t}i_{t+1}}(a_{t})$ and returns the rewards $\{r^{m}_{i_{t}i_{t+1}}(a_{t})\}^{M}_{m=1}$ to the agents.
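The four steps above can be sketched as a small simulation loop. This is only an illustration under stated assumptions: the toy MDP, the per-state vote distributions in `vote_policies`, and the helper names `product_vote` and `voting_step` are all hypothetical, and the normalized-product aggregation rule is assumed here for concreteness rather than taken from this subsection.

```python
import random

def product_vote(votes):
    """Combine per-agent action distributions by a normalized elementwise product (assumed rule)."""
    combined = [1.0] * len(votes[0])
    for vote in votes:
        combined = [c * p for c, p in zip(combined, vote)]
    total = sum(combined)
    return [c / total for c in combined]

def voting_step(state, vote_policies, transition, rewards, rng):
    """One round: all agents observe the state, vote, the system acts, then rewards are returned."""
    votes = [policy[state] for policy in vote_policies]            # each agent's vote
    action_probs = product_vote(votes)                             # votes -> joint action distribution
    action = rng.choices(range(len(action_probs)), weights=action_probs)[0]
    next_probs = transition[action][state]                         # p_ij(a) for the chosen action
    next_state = rng.choices(range(len(next_probs)), weights=next_probs)[0]
    local_rewards = [r[action][state][next_state] for r in rewards]  # private reward per agent
    return action, next_state, local_rewards

# Toy instance (all numbers are made up): 2 states, 2 actions, M = 2 agents.
transition = [[[0.9, 0.1], [0.2, 0.8]],     # transition[a][i][j] = p_ij(a)
              [[0.5, 0.5], [0.6, 0.4]]]
rewards = [                                  # rewards[m][a][i][j], each in [0, 1]
    [[[1.0, 0.0], [0.0, 0.5]], [[0.2, 0.2], [0.7, 0.1]]],
    [[[0.0, 1.0], [0.3, 0.3]], [[0.9, 0.0], [0.4, 0.6]]],
]
vote_policies = [                            # vote_policies[m][i][a]
    [[0.7, 0.3], [0.4, 0.6]],
    [[0.5, 0.5], [0.1, 0.9]],
]

rng = random.Random(0)
state = 0
trajectory = []
for _ in range(5):
    action, next_state, local_rewards = voting_step(state, vote_policies, transition, rewards, rng)
    trajectory.append((state, action, next_state, local_rewards))
    state = next_state
```

Note that the agents never read each other's vote tables in this loop; only the aggregation step sees all votes.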

III-B Multi-Agent Policy Optimization

We denote the global acting policy, which determines the joint action to take, as $\pi^{g}\in\Xi\subseteq\mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}|}$ , where $\Xi$ consists of non-negative matrices whose $(i,a)$ -th entry $\pi^{g}_{i,a}$ specifies the probability of taking action $a$ in state $i$ . The multi-agent policy optimization problem aims at improving the global acting policy by maximizing the sum of local average-rewards, i.e.,

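$$\max_{\pi^{g}}\bigg\{\bar{v}^{\pi^{g}}=\lim_{T\rightarrow\infty}\mathbb{E}^{\pi^{g}}\bigg[\frac{1}{T}\sum_{t=1}^{T}\sum_{m=1}^{M}r^{m}_{i_{t}i_{t+1}}(a_{t})\,\Big|\,i_{1}=i\bigg],\ i\in\mathcal{S}\bigg\},\tag{1}$$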

where $\mathbb{E}^{\pi^{g}}[\cdot]$ denotes the expectation over all the state-action trajectories generated by the MARL system when following the acting policy $\pi^{g}$ . According to the theory of dynamic programming [52, 53], the value $\bar{v}^{*}$ is the optimal average reward to problem (1) if and only if it satisfies the following Bellman equation:

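$$\bar{v}^{*}+v^{*}(i)=\max_{a\in\mathcal{A}}\bigg\{\sum_{j\in\mathcal{S}}p_{ij}(a)v^{*}(j)+\sum_{j\in\mathcal{S}}p_{ij}(a)\sum_{m=1}^{M}r^{m}_{ij}(a)\bigg\},\quad\forall i\in\mathcal{S},\tag{2}$$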

where $p_{ij}(a)$ is the transition probability from state $i$ to state $j$ after taking the action $a$ and $\bm{v}^{*}\in\mathbb{R}^{|\mathcal{S}|}$ is known as the difference-of-value vector that characterizes the transient effect of each initial state under the optimal policy [39]. Note that there exist infinitely many $\bm{v}^{*}$ that satisfy (2); e.g., by adding constant shifts. However, this does not affect our analysis. More detailed descriptions of $\bm{v}^{*}$ can be found in [39].
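As a numerical illustration of the Bellman equation (2) and of the constant-shift remark, the sketch below solves a tiny made-up MDP with known transition probabilities. It uses relative value iteration, a standard dynamic-programming method that is not the algorithm proposed in this paper. The names `backup`, `relative_value_iteration`, and `bellman_residual` are hypothetical, and `r_total[a][i]` is assumed to hold the expected reward already summed over the agents.

```python
def backup(P, r_total, v, i):
    """Right-hand side of the Bellman equation for state i: max over actions."""
    n_actions, n_states = len(P), len(P[0])
    return max(
        r_total[a][i] + sum(P[a][i][j] * v[j] for j in range(n_states))
        for a in range(n_actions)
    )

def relative_value_iteration(P, r_total, iters=500):
    """Iterate the Bellman backup while pinning v[0] = 0 (state 0 is the reference state)."""
    n_states = len(P[0])
    v = [0.0] * n_states
    gain = 0.0
    for _ in range(iters):
        backed_up = [backup(P, r_total, v, i) for i in range(n_states)]
        gain = backed_up[0]                    # estimate of the optimal average reward
        v = [b - gain for b in backed_up]      # estimate of the difference-of-value vector
    return gain, v

def bellman_residual(P, r_total, gain, v):
    """Largest violation of gain + v(i) = max_a {...} over all states."""
    return max(abs(gain + v[i] - backup(P, r_total, v, i)) for i in range(len(v)))

# Toy data (made up): P[a][i][j] = p_ij(a), r_total[a][i] = summed expected reward.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.6, 0.4]]]
r_total = [[1.0, 0.2], [0.4, 1.5]]

gain, v = relative_value_iteration(P, r_total)
shifted_v = [x + 3.0 for x in v]               # adding a constant keeps (2) satisfied
```

Both `v` and `shifted_v` satisfy the equation with the same `gain`, which is the non-uniqueness noted above.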

III-C Saddle-Point Formulation

The Bellman equation in (2) can be written as the following linear programming problem:

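$$\min_{\bar{v},\,\bm{v}}\ \bar{v}\quad\text{s.t.}\quad\bar{v}\cdot\bm{e}+(I-P_{a})\bm{v}-\sum_{m=1}^{M}\bar{\bm{r}}^{m}_{a}\geq\bm{0},\quad\forall a\in\mathcal{A},\tag{3}$$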

where $P_{a}\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{S}|}$ is the MDP transition matrix under action $a$ whose $(i,j)$ -th entry is $p_{ij}(a)$ and $\bar{\bm{r}}^{m}_{a}\in\mathbb{R}^{|\mathcal{S}|}$ is the expected state-transition reward under action $a$ with $\bar{r}^{m}_{i,a}=\sum_{j\in\mathcal{S}}p_{ij}(a)r^{m}_{ij}(a),\ \forall i\in\mathcal{S}$ . The dual of (3) can be written as

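$$\max_{\bm{\mu}}\ \sum_{a\in\mathcal{A}}\bm{\mu}_{a}^{\top}\sum_{m=1}^{M}\bar{\bm{r}}^{m}_{a}\quad\text{s.t.}\quad\sum_{a\in\mathcal{A}}(I-P_{a})^{\top}\bm{\mu}_{a}=\bm{0},\quad\sum_{i\in\mathcal{S}}\sum_{a\in\mathcal{A}}\mu_{i,a}=1,\quad\mu_{i,a}\geq 0,\tag{4}$$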

where $\bm{\mu}$ is the dual variable. By linear programming strong duality, if $(\bar{v}^{*},\bm{v}^{*})$ and $\bm{\mu}^{*}$ are optimal solutions to the primal and dual problems (3) and (4), respectively, then they satisfy the zero complementarity gap condition:

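$$0=\sum_{a\in\mathcal{A}}(\bm{\mu}_{a}^{*})^{\top}\bigg(\bar{v}^{*}\cdot\bm{e}+(I-P_{a})\bm{v}^{*}-\sum_{m=1}^{M}\bar{\bm{r}}^{m}_{a}\bigg)=\bar{v}^{*}+\sum_{a\in\mathcal{A}}(\bm{\mu}_{a}^{*})^{\top}\bigg((I-P_{a})\bm{v}^{*}-\sum_{m=1}^{M}\bar{\bm{r}}^{m}_{a}\bigg).\tag{5}$$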

Observe that problems (3) and (4) involve rather complicated constraints. Hence, it is common to consider their saddle-point formulation, whose constraints are simpler:

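$$\min_{\bm{v}\in\mathcal{V}}\max_{\bm{\mu}\in\mathcal{U}}\ \sum_{a\in\mathcal{A}}\bm{\mu}_{a}^{\top}\bigg((P_{a}-I)\bm{v}+\sum_{m=1}^{M}\bar{\bm{r}}^{m}_{a}\bigg).\tag{6}$$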

Here,

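$$\mathcal{V}=\mathbb{R}^{|\mathcal{S}|},\quad\mathcal{U}=\bigg\{\bm{\mu}\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}|}\ \Big|\ \sum_{i\in\mathcal{S}}\sum_{a\in\mathcal{A}}\mu_{i,a}=1,\ \bm{\mu}\geq\bm{0}\bigg\}$$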

are the primal and dual constraint sets, respectively. Later, we shall focus on multi-agent AMDPs that satisfy certain fast mixing property. This will allow us to use a smaller but still structured primal constraint set $\mathcal{V}$ ; see Sec. V.
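To make the minimax objective in (6) concrete, the following sketch evaluates $\sum_{a}\bm{\mu}_{a}^{\top}((P_{a}-I)\bm{v}+\bar{\bm{r}}_{a})$ on a made-up two-state MDP, with `r_total[a][i]` assumed to hold the expected reward summed over agents. It also illustrates a standard fact: when $\bm{\mu}$ is the stationary state-action distribution of some fixed policy, the term involving $\bm{v}$ vanishes, so the objective equals that policy's average reward for any $\bm{v}$. The helper names and all numbers are hypothetical.

```python
def saddle_objective(P, r_total, v, mu):
    """Evaluate sum_a mu_a^T ((P_a - I) v + r_a), with mu[i][a], P[a][i][j], r_total[a][i]."""
    n_actions, n_states = len(P), len(P[0])
    value = 0.0
    for a in range(n_actions):
        for i in range(n_states):
            expected_next_v = sum(P[a][i][j] * v[j] for j in range(n_states))
            value += mu[i][a] * (expected_next_v - v[i] + r_total[a][i])
    return value

def stationary_state_action(P, policy, iters=1000):
    """Power iteration for the stationary distribution of a fixed policy, lifted to (state, action)."""
    n_actions, n_states = len(P), len(P[0])
    xi = [1.0 / n_states] * n_states
    for _ in range(iters):
        xi = [
            sum(xi[i] * policy[i][a] * P[a][i][j]
                for i in range(n_states) for a in range(n_actions))
            for j in range(n_states)
        ]
    return [[xi[i] * policy[i][a] for a in range(n_actions)] for i in range(n_states)]

# Toy data (made up).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.6, 0.4]]]
r_total = [[1.0, 0.2], [0.4, 1.5]]
policy = [[0.3, 0.7], [0.6, 0.4]]            # policy[i][a]

mu = stationary_state_action(P, policy)
value_at_zero = saddle_objective(P, r_total, [0.0, 0.0], mu)
value_at_other = saddle_objective(P, r_total, [5.0, -3.0], mu)   # same value, different v
```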

It is known that there is a correspondence between randomized stationary policies and feasible solutions to the dual problem (4) [52]. In particular, given an optimal dual solution $\bm{\mu}^{*}\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}|}$, the optimal acting policy $\pi^{*}$ can be obtained via $\pi^{*}_{i,a}=\mu^{*}_{i,a}/\sum_{a'\in\mathcal{A}}\mu_{i,a'}^{*}$. Hence, our goal now is to obtain an optimal dual solution $\bm{\mu}^{*}$.
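The normalization that turns a dual solution into a policy can be sketched in a few lines. The function name `policy_from_dual` and the example numbers are hypothetical; the guard against zero-mass states is an added safety check, not part of the paper.

```python
def policy_from_dual(mu):
    """Row-normalize mu[i][a] so that each state's action probabilities sum to one."""
    policy = []
    for i, row in enumerate(mu):
        state_mass = sum(row)
        if state_mass <= 0.0:
            # Should not happen for an ergodic MDP, where every state carries positive mass.
            raise ValueError(f"state {i} has zero mass in the dual variable")
        policy.append([weight / state_mass for weight in row])
    return policy

# Made-up dual variable over 2 states and 2 actions (entries sum to one).
mu_example = [[0.10, 0.30],
              [0.45, 0.15]]
pi_example = policy_from_dual(mu_example)
```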

IV Voting-Based Learning Algorithm

In this section, we propose a voting mechanism that specifies how local votes determine the global action. Then, we prove that the voting mechanism yields an equivalence between the update on the global acting policy and that on the distributed voting policies. Consequently, problem (6) can be solved in a distributed manner, and we propose a primal-dual learning algorithm for it.

IV-A Voting Mechanism

We denote the pair of primal and dual variables corresponding to the global acting policy $\pi^{g}$ as $\bm{v}^{g}$ and $\bm{\mu}^{g}$, respectively. We also introduce a pair of local primal and dual variables corresponding to each local voting policy $\pi^{m}$ ($m=1,\ldots,M$) as $\bm{v}^{m}$ and $\bm{\mu}^{m}$, where $\pi^{m}\in\Xi\subseteq\mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}|}$ is a randomized stationary policy. Then, the voting mechanism takes the form

$$\mu^{g,t}_{i,a}\propto\prod_{m=1}^{M}\mu^{m,t}_{i,a}.$$

The voting mechanism indeed reveals the relationship between the global acting policy and the local voting policies.
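A minimal sketch of this product-form aggregation, assuming each agent's dual variable is stored as a nested list `mu[i][a]`. The function name `aggregate_votes` and the numbers are hypothetical. One consequence worth noticing is that any agent assigning zero weight to a state-action pair removes it from the global variable.

```python
def aggregate_votes(local_mus):
    """Return the normalized elementwise product of the agents' dual variables and the scale used."""
    n_states, n_actions = len(local_mus[0]), len(local_mus[0][0])
    product = [[1.0] * n_actions for _ in range(n_states)]
    for mu in local_mus:
        for i in range(n_states):
            for a in range(n_actions):
                product[i][a] *= mu[i][a]
    scale = 1.0 / sum(sum(row) for row in product)     # normalizing constant
    global_mu = [[scale * p for p in row] for row in product]
    return global_mu, scale

# Two agents, 2 states, 2 actions (made-up votes); agent 2 gives zero weight to (state 0, action 1).
local_mus = [
    [[0.40, 0.10], [0.30, 0.20]],
    [[0.25, 0.00], [0.25, 0.50]],
]
global_mu, scale = aggregate_votes(local_mus)
```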

IV-B Primal-Dual Learning Algorithm

We now develop a primal-dual learning algorithm to solve problem (6) in a distributed manner based on a double-sampling strategy. Recall that we consider MARL under a generative MDP, where the agents interact with a black-box sampling oracle to learn the optimal policy. The sampling oracle works in a similar way as the experience replay used in deep RL models [18, 14, 20, 4]. In practical applications, the sampling oracle or experience replay can be placed in a centralized node which can communicate with the local agents, as in many existing MARL frameworks [18, 14, 29, 30]. However, it only needs to collect the vote information $\mu^{m}_{i,a}$ from the agents in order to coordinate the sampling during the learning process, instead of performing iterative parameter consensus as in existing methods [18, 14, 29, 30]. The detailed procedure is provided in Algorithm 1. In what follows, we first introduce the local dual and primal updates in our algorithm. Then, we prove that the local updates are equivalent to the global updates if the voting mechanism is specified properly.

IV-B1 Local Dual Update

We update the local dual variables based on uniform sampling. Specifically, the first state-action pair $\left(i_{t},a_{t}\right)$ to update the local dual variables is sampled with uniform probability $p^{\text{dual}}_{i,a}=\frac{1}{|\mathcal{S}|\cdot|\mathcal{A}|}$ . The MARL system then shifts to the next state $j_{t}$ conditioned on $(i_{t},a_{t})$ and returns the local rewards $\{r^{m}_{i_{t}j_{t}}(a_{t})\}^{M}_{m=1}$ to the agents. The local dual variable $\bm{\mu}^{m,t}$ of agent $m$ is updated as

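$$\mu^{m,t+1}_{i,a}=\begin{cases}\mu^{m,t}_{i,a}\exp\big\{\Delta^{m,t}_{i,a}\big\},&\text{if }i=i_{t},\ a=a_{t},\\ \mu^{m,t}_{i,a},&\text{otherwise},\end{cases}\tag{7}$$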

where

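$$\Delta^{m,t}_{i,a}=\beta\bigg(\frac{\frac{1}{\beta}\log x^{t}+v^{t}_{j}-v^{t}_{i}-C}{M}+r^{m}_{ij}(a)\bigg)\tag{8}$$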

with $(i,a,j)=(i_{t},a_{t},j_{t})$, $\beta>0$ being the step-size, $C$ being a parameter to be specified, and

$$x^{t}=\frac{1}{\sum_{i\in\mathcal{S},a\in\mathcal{A}}\prod_{m=1}^{M}\mu^{m,t}_{i,a}}.\tag{9}$$

Here, $x^{t}$ can be viewed as the proportion between the locally recovered partial derivatives and the global true partial derivatives of the minimax objective in (6). It also defines the explicit form of the voting mechanism; see Lemma 1 below. However, it is important to note that we do not need to compute $x^{t}$ in our algorithm, as it does not influence the sampling in the subsequent primal update step and is used purely for analysis purposes. In other words, one can remove the term of $\log x^{t}$ from (8) without influencing the learning performance.
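The local dual update in (7) and (8) can be sketched for a single agent as follows. The oracle, the transition and reward tables, and the values chosen for `beta` and `C` are all hypothetical placeholders. The `log_x` argument defaults to zero, which corresponds to dropping the $\log x^{t}$ term as the text above allows.

```python
import math
import random

# Hypothetical generative model for one agent: P[a][i][j] and this agent's rewards R_m[a][i][j].
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.6, 0.4]]]
R_m = [[[1.0, 0.0], [0.0, 0.5]],
       [[0.2, 0.2], [0.7, 0.1]]]

def oracle(i, a, rng):
    """Sample the next state j given (i, a) and return it with this agent's reward."""
    j = rng.choices(range(len(P[a][i])), weights=P[a][i])[0]
    return j, R_m[a][i][j]

def local_dual_step(mu_m, i, a, j, v, reward_m, beta, C, M, log_x=0.0):
    """Multiply the sampled entry mu_m[i][a] by exp(Delta); leave all other entries unchanged."""
    delta = beta * ((log_x / beta + v[j] - v[i] - C) / M + reward_m)
    updated = [row[:] for row in mu_m]
    updated[i][a] *= math.exp(delta)
    return updated, delta

rng = random.Random(0)
mu_m = [[0.25, 0.25], [0.25, 0.25]]          # this agent's current dual variable
v = [0.1, -0.1]                              # shared primal variable
i, a = rng.randrange(2), rng.randrange(2)    # uniform sampling of the state-action pair
j, reward = oracle(i, a, rng)
new_mu, delta = local_dual_step(mu_m, i, a, j, v, reward, beta=0.1, C=1.0, M=2)
```

Note that the update does not renormalize the local variable; normalization only enters when the votes are combined.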

IV-B2 Local Primal Update

We update the local primal variables based on probability sampling, where the probability is specified by the dual variables. Specifically, the second state-action pair $\left(i_{t},a_{t}\right)$ to update the local primal variables is sampled with probability

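$$p^{\text{primal}}_{i_{t},a_{t}}=\frac{\prod_{m=1}^{M}\mu^{m,t}_{i_{t},a_{t}}}{\sum_{i\in\mathcal{S},a\in\mathcal{A}}\prod_{m=1}^{M}\mu^{m,t}_{i,a}}.\tag{10}$$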

The system then shifts to the next state $j_{t}$ conditioned on $(i_{t},a_{t})$ , and returns the local rewards to the agents. The local primal variable $\bm{v}^{t}$ is updated as

$$\bm{v}^{t+1}=\Pi_{\mathcal{V}}\big\{\bm{v}^{t}+\bm{d}^{t}\big\},\tag{11}$$

where

$$\bm{d}^{t}=\alpha(\bm{e}_{i}-\bm{e}_{j})\tag{12}$$

with $(i,j)=(i_{t},j_{t})$ ; $\alpha>0$ is the step-size; $\Pi_{\mathcal{V}}\left\{\cdot\right\}$ denotes the projector onto the search space $\mathcal{V}$ , which will be defined in Sec. V. Note that the local primal update is identical across the agents. Hence, we use the same notation $v^{t}_{i}$ in the primal update for all the agents in the sequel.
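The primal step can be sketched in the same style. Because the search space $\mathcal{V}$ is only defined later, the projection below is a stand-in: it clips every entry to a box $[-\text{bound},\text{bound}]$, which is purely an assumption made for illustration. The helper names and all numbers are hypothetical.

```python
import random

def primal_sampling_probs(local_mus):
    """Sampling distribution over (state, action): normalized product of the agents' dual variables."""
    n_states, n_actions = len(local_mus[0]), len(local_mus[0][0])
    product = [[1.0] * n_actions for _ in range(n_states)]
    for mu in local_mus:
        for i in range(n_states):
            for a in range(n_actions):
                product[i][a] *= mu[i][a]
    total = sum(sum(row) for row in product)
    return [[p / total for p in row] for row in product]

def sample_pair(probs, rng):
    """Draw one (state, action) pair according to probs[i][a]."""
    pairs = [(i, a) for i in range(len(probs)) for a in range(len(probs[0]))]
    weights = [probs[i][a] for i, a in pairs]
    return rng.choices(pairs, weights=weights)[0]

def primal_step(v, i, j, alpha, bound):
    """v <- clip(v + alpha * (e_i - e_j)); the box clip is an assumed placeholder projection."""
    d = [0.0] * len(v)
    d[i] += alpha
    d[j] -= alpha
    return [min(bound, max(-bound, x + dx)) for x, dx in zip(v, d)]

# Toy data (made up): two agents, 2 states, 2 actions.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.6, 0.4]]]
local_mus = [
    [[0.40, 0.10], [0.30, 0.20]],
    [[0.25, 0.15], [0.25, 0.35]],
]
rng = random.Random(0)
probs = primal_sampling_probs(local_mus)
i, a = sample_pair(probs, rng)
j = rng.choices(range(2), weights=P[a][i])[0]    # oracle transition given (i, a)
v_old = [0.0, 0.0]
v_new = primal_step(v_old, i, j, alpha=0.05, bound=1.0)
```

Since every agent applies the same step to the same sample, all agents hold identical copies of `v_new`.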

IV-B3 Communication

The centralized sampling oracle needs to collect the vote information $\mu^{m}_{i,a}$ to compute the probability $p^{\text{primal}}_{i_{t},a_{t}}$ according to (10) and returns the reward information to the agents. However, note that the vote information $\mu^{m}_{i,a}$ and the reward information $r^{m}_{ij}(a)$ of each agent are scalars, so that the communication overhead at each learning step of our method only scales as $\mathcal{O}(M)$. In contrast, most existing MARL methods are developed based on parameter consensus, where local agents need to reach consensus on their value or policy functions with a centralized server [18, 14, 29, 30] or their nearby agents [21, 23, 22, 24]. Since the value or policy function scales as $\mathcal{O}(|\mathcal{S}|\cdot|\mathcal{A}|)$, the communication overhead at each learning step of their models scales as $\mathcal{O}(M\cdot|\mathcal{S}|\cdot|\mathcal{A}|)$. Although this cost can be reduced if they adopt a linear or nonlinear function to approximate the value or policy function, it is still related to the size of the function approximators, which can be enormous if they are deep neural networks. Moreover, the exchanged information in our algorithm is the vote information, instead of the privacy-sensitive policy or value information, which can alleviate privacy and security concerns considerably.

IV-B4 Equivalent Global Update

We now prove that with a properly specified voting mechanism, the primal-dual updates on the local voting policies are equivalent to the centralized primal-dual updates on the global acting policy.

Lemma 1 (Equivalent Global Update)

By specifying the voting mechanism as

[TABLE]

where $x^{t}$ is given by (9), the local primal-dual updates (7) and (11) are equivalent to the following global primal-dual updates:

[TABLE]

Here,

[TABLE]

and $\bm{d}^{t}=\alpha(\bm{e}_{i}-\bm{e}_{j})$ , where $(i,a)=(i_{t},a_{t})$ is sampled with probability $\mu^{g,t}_{i_{t},a_{t}}$ and $j=j_{t}$ is the next state returned by the system conditioned on $(i_{t},a_{t})$ . $\blacksquare$

We remark that the global primal-dual updates (14) are conditionally unbiased partial derivatives of the minimax objective given in (6).

Proof. Recall that the local dual variable $\bm{\mu}^{m,t}$ of agent $m$ is updated by (7). We now prove a recursive relationship between $\mu^{g,t+1}_{i,a}$ and $\mu^{g,t}_{i,a}$ as follows. Given $(i,a)=(i_{t},a_{t})$ , starting from the voting mechanism defined in (13), we have

[TABLE]

Hence, using the definition of $\Delta^{g,t}_{i,a}$ in (15), the local dual update based on $\Delta^{m,t}_{i,a}$ can be equivalently expressed as the global dual update based on $\Delta^{g,t}_{i,a}$ , i.e., (14a) holds.

As for the local primal update, since the oracle generates the second sample with probability $p^{\text{primal}}_{i,a}$ given by (10), which is exactly the same as the global dual variable $\mu^{g,t}_{i,a}$ given in (13), the local and global primal updates are identical. $\blacksquare$

Lemma 2 (Unbiasedness)

Consider the voting mechanism in Lemma 1. Let $\mathcal{F}_{t}$ be the filtration at time $t$ , i.e., the information about all state-action pair samples and state transitions right before time $t$ . Then, the dual update weight $\Delta^{g,t}_{i,a}$ is, up to a constant shift, a multiple of the conditional partial derivative of the minimax objective in (6) with respect to $\mu_{i,a}$ :

[TABLE]

Moreover, the primal update weight $d_{i}^{t}$ is a multiple of the conditional partial derivative of the minimax objective in (6) with respect to $v_{i}$ :

[TABLE]

$\blacksquare$

Proof. For arbitrary $i\in\mathcal{S}$ and $a\in\mathcal{A}$ , we use (15) to compute

[TABLE]

On the other hand, using (12) and the fact that the state-action pair for updating the primal variables is generated with probability $\bm{\mu}^{g,t}$ , we compute, for an arbitrary $i\in\mathcal{S}$ ,

[TABLE]

This completes the proof. $\blacksquare$

V Theoretical Results

In this section, we present the convergence analysis of Algorithm 1. We start by making the following assumption on the considered multi-agent AMDP. A similar assumption has also been used in [40, 39] for single-agent RL.

Assumption 1

There exists a constant $t_{\text{mix}}^{*}>0$ such that for any stationary policy $\pi^{g}$ , we have

[TABLE]

where $\|\cdot\|_{TV}$ is the total variation and $P^{\pi^{g}}(i,j)=\sum_{a\in\mathcal{A}}\pi_{i,a}^{g}p_{ij}(a)$ . $\blacksquare$

The above assumption requires the multi-agent AMDP to be sufficiently rapidly mixing, with the parameter $t_{\text{mix}}^{*}$ characterizing how fast the multi-agent AMDP reaches its stationary distribution from any state under any acting policy [39]. In particular, $t_{\text{mix}}^{*}$ controls the distance between any stationary policy and the optimal policy under the considered multi-agent AMDP. It has been shown in [39] that under Assumption 1, there exists an optimal difference-of-value vector $\bm{v}^{*}$ satisfying $\left\lVert\bm{v}^{*}\right\rVert_{\infty}\leq 2t_{\text{mix}}^{*}$ .

Based on the above discussion, we can use the following smaller constraint set $\mathcal{V}$ for the global primal variable $\bm{v}$ :

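$$\mathcal{V}=\left\{\bm{v}\in\mathbb{R}^{|\mathcal{S}|}:\left\lVert\bm{v}\right\rVert_{\infty}\leq 2t_{\text{mix}}^{*}\right\}.$$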

Now, we are ready to establish the convergence of our proposed Algorithm 1.

Theorem 1 (Finite-Iteration Duality Gap)

Let $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},\left\{\mathcal{R}_{m}\right\}^{M}_{m=1})$ be an arbitrary multi-agent AMDP tuple satisfying Assumption 1. Then, the sequence of iterates generated by Algorithm 1 satisfies

[TABLE]

where $\tilde{O}(\cdot)$ hides polylogarithmic factors. $\blacksquare$

Recall from (5) that the complementarity gap of a pair of optimal solutions to the primal-dual problems (3) and (4) is zero. Hence, Theorem 1 suggests that the iterates $\{\bm{\mu}^{g,t}\}_{t\geq 0}$ converge to an optimal solution to the dual problem (4) at a sublinear rate. The result also covers single-agent RL [39] as the special case $M=1$ , so our model is strictly more general. We defer the proof of Theorem 1 to the appendix.

It is worth pointing out that in our proof, the scalar $M$ in Theorem 1, i.e., the number of agents, comes from the bound of the total reward of all agents $\sum_{m=1}^{M}r_{ij}^{m}(a)\in[0,M]$ , $\forall\ i,j\in\mathcal{S},\ a\in\mathcal{A}$ . As such, if we consider a normalized reward where $\sum_{m=1}^{M}r_{ij}^{m}(a)\in[0,1]$ , then the complexity in Theorem 1 will be independent of $M$ .

VI Numerical Results

In this section, we evaluate the proposed voting-based MARL algorithm through two case studies. In the first case study, we verify the convergence of our proposed algorithm on randomly generated MDP instances. In the second case study, we show how to apply our proposed algorithm to the placement optimization task in a UAV-assisted IoT network, where the ground base stations and the UAV-mounted base station are treated as the IoT devices. The UAV-mounted base station collects vote information from the ground base stations, which is then used to determine the placement of the UAV-mounted base station so as to maximize the overall system capacity. Our results show that distributed decision making does not slow down the process of achieving global consensus on optimality and that voting-based learning is more efficient than letting agents behave individually and selfishly.

VI-A Empirical Convergence

We generate instances of the multi-agent MDP using a setup similar to that in [54]. Specifically, given a state and an action, the multi-agent MDP shifts to a next state drawn from the entire state set without replacement. The transition probabilities are generated randomly from $[0,1]$ and then normalized so that they sum to one. The optimal policy is generated with purposeful behavior by letting the agent favor a single action in each state and assigning that action a higher expected reward in $[0,1]$ .
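The transition-generation step can be sketched as below. This is one possible reading of the setup, not the exact generator used in [54] or in the experiments: the `branching` parameter (how many next states are drawn without replacement for each state-action pair) and the function name are assumptions introduced for illustration, and reward generation is omitted.

```python
import random


def random_transitions(num_states, num_actions, branching, rng):
    """Return P[a][i][j], where each row P[a][i] sums to one."""
    P = [[[0.0] * num_states for _ in range(num_states)]
         for _ in range(num_actions)]
    for a in range(num_actions):
        for i in range(num_states):
            # Draw the reachable next states without replacement.
            next_states = rng.sample(range(num_states), branching)
            # Draw raw weights from [0, 1] and normalize them.
            weights = [rng.uniform(0.0, 1.0) for _ in next_states]
            total = sum(weights)
            for j, w in zip(next_states, weights):
                P[a][i][j] = w / total
    return P


rng = random.Random(0)
P = random_transitions(num_states=10, num_actions=3, branching=4, rng=rng)
```

Each row `P[a][i]` is then a valid probability distribution supported on at most `branching` next states.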

In Fig. 1, we show the empirical convergence results of

1) the duality gap, i.e., the one given in Theorem 1;
2) the distance between the optimal policy and the learned policy, i.e., $\left\lVert\bm{\pi}^{*}-\bm{\hat{\pi}}\right\rVert_{1}$ .

The convergence curves are averaged over $100$ instances. Generally, the empirical convergence rates corroborate the result given in Theorem 1. In addition, we present a) the performance change as the number of local agents varies from $M=5$ to $M=100$ and b) the performance of centralized learning, which directly uses the global primal-dual updates to learn the global policy. The results show that the empirical convergence rates of the centralized case and the distributed case are the same for different numbers of agents $M$ . This indicates that distributed decision making does not slow down the process of achieving global consensus on optimality.

VI-B Application in Multi-Agent IoT Systems

We now apply the proposed voting-based MARL algorithm to a multi-agent IoT system that contains ground base stations, smartphones, and UAVs. UAV-assisted wireless communication has recently attracted much attention [55, 9, 56, 57], because a UAV carrying a mobile base station (UAV-BS) can provide high-speed air-to-ground data access over line-of-sight (LoS) communication links. However, the performance of a UAV-BS-assisted wireless system depends heavily on the placement of the UAV-BS [55, 9, 56]. Here, we consider optimizing the placement of the UAV-BS continuously through our proposed voting-based MARL algorithm.

Existing works on the placement optimization of UAV-BS have two major drawbacks. First, many of them do not consider user movements [58, 59, 56, 60, 61], but the change of user distribution can largely influence the system performance. Second, many of them determine the optimal placement of UAV-BS by assuming that the performance gain of each ground BS is public information [9, 61], which may be impractical in real-world wireless systems that have mixed wireless operators, infrastructures, and protocols. To overcome these drawbacks, we model the UAV-BS placement optimization as a voting-based MARL problem, where multiple ground BS learn to place the UAV-BS optimally with adaptation to user movements and without the need to share their reward information. The aim is to maximize the global performance gain of all ground BS.

We consider the downlink of a wireless cellular network. As shown in Fig. 2, the 2 km $\times$ 2 km area of interest has $M=20$ regularly deployed ground BS, one UAV-BS flying at $200$ m to provide air-to-ground communications, and $200$ mobile users, each having a constant-bit-rate communication demand. The UAV-BS can move to any one of the aerial locations in a finite set $\mathcal{A}$ . The user mobility follows the random walk model in [62], where each user moves at an angle uniformly distributed in $[0,2\pi]$ and at a random speed in $[0,c_{\text{max}}]$ , with $c_{\text{max}}$ being the maximum moving speed. Table I summarizes the main parameters. The air-to-ground channel and ground-to-ground channel are modeled according to [63] (Sec. II). The load of each base station is defined as the ratio between the required number of PRBs and the total number of available PRBs according to [4] (Sec. II-B).
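One step of the random walk mobility model can be sketched as follows. The values `c_max=2.0` and `dt=1.0`, the uniform initial placement, and the clipping of positions at the area boundary are all assumptions made for illustration; the source only specifies the uniform angle in $[0,2\pi]$ and the speed range $[0,c_{\text{max}}]$.

```python
import math
import random


def random_walk_step(positions, c_max, dt, side, rng):
    """Move every user once: uniform heading, uniform speed in [0, c_max]."""
    new_positions = []
    for x, y in positions:
        angle = rng.uniform(0.0, 2.0 * math.pi)
        speed = rng.uniform(0.0, c_max)
        nx = x + speed * dt * math.cos(angle)
        ny = y + speed * dt * math.sin(angle)
        # Assumption: users are kept inside the square area by clipping.
        new_positions.append((min(max(nx, 0.0), side),
                              min(max(ny, 0.0), side)))
    return new_positions


rng = random.Random(0)
# 200 users placed uniformly in the 2 km x 2 km area (coordinates in meters).
users = [(rng.uniform(0.0, 2000.0), rng.uniform(0.0, 2000.0))
         for _ in range(200)]
moved = random_walk_step(users, c_max=2.0, dt=1.0, side=2000.0, rng=rng)
```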

The learning context is defined as follows. 1) States: We divide the area of interest into $3\times 3$ grids and use the load of each grid to characterize the wireless system status. The load of each grid is indicated by one of two states: a) overloaded, if the users’ demand within the grid is higher than the mean demand over all the grids; b) underloaded, otherwise. Since the grids cannot be all overloaded or all underloaded, there are only $|\mathcal{S}|=2^{9}-2=510$ states for the wireless system with $9$ grids. 2) Actions: The action set $\mathcal{A}$ is defined as the available aerial locations for the placement of the UAV-BS. At each time $t$ , the UAV-BS chooses an action $a_{t}\in\mathcal{A}$ for placement. 3) Rewards: The reward function is defined with the aim of maximizing user throughput. Specifically, we assume that users are always handed over to the BS with the best SINR, so that an increased load at the UAV-BS usually indicates an increased user throughput due to better user SINRs. Hence, we define the reward to be the increased load at the UAV-BS.
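The state count can be checked with a short sketch that enumerates the load patterns of the $9$ grids and maps a demand vector to its pattern. The names `load_pattern` and `state_index`, and the example demand vector, are hypothetical. The degenerate case where all grid demands are exactly equal (which would yield the excluded all-underloaded pattern) is not handled here.

```python
from itertools import product


def load_pattern(grid_demand):
    """Flag each grid: 1 if its demand exceeds the mean demand, else 0."""
    mean = sum(grid_demand) / len(grid_demand)
    return tuple(1 if d > mean else 0 for d in grid_demand)


# All 9-bit patterns except all-underloaded and all-overloaded.
patterns = [p for p in product((0, 1), repeat=9) if 0 < sum(p) < 9]
state_index = {p: k for k, p in enumerate(patterns)}
print(len(state_index))  # 510

# Hypothetical demand vector: only the first grid is above the mean.
demand = [5, 1, 1, 1, 1, 1, 1, 1, 1]
state = state_index[load_pattern(demand)]
```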

We compare the proposed voting-based MARL algorithm with four baselines:

1) the classic Q-learning algorithm [1], which uses centralized Q-learning to learn the optimal UAV placement policy;
2) the multi-agent actor-critic algorithm based on the centralized-learning-with-decentralized-execution framework [18], where distributed agents optimize the placement policy jointly by communicating with a central controller;
3) the multi-agent Q-learning algorithm proposed in [15], where each agent performs independent Q-learning and treats the other agents as part of the environment;
4) the optimal scheme, obtained by assuming that the underlying MDP is known.

We refer to them as centralized QL, multi-agent AC, multi-agent QL, and optimal for short, respectively. In addition, we adopt the majority voting rule proposed for multi-agent Q-learning in [46] to determine the joint action for both the multi-agent AC algorithm and the multi-agent QL algorithm.
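For the two multi-agent baselines, a joint action has to be selected from the agents' individual proposals. The sketch below shows a generic plurality rule; it is not claimed to reproduce the exact rule of [46], and the tie-breaking choice (smallest action index wins) is an assumption introduced here.

```python
from collections import Counter


def majority_vote(proposed_actions):
    """Return the action proposed by the most agents."""
    counts = Counter(proposed_actions)
    # Tie-break (assumption): among equally popular actions,
    # the smallest action index wins.
    return min(counts, key=lambda a: (-counts[a], a))


print(majority_vote([3, 3, 1]))        # 3
print(majority_vote([2, 0, 2, 1, 0]))  # 0 (tie between 0 and 2)
```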

In Fig. 3, we present the rewards averaged over $20$ runs. The discount factor for the discounted RL methods is set to $0.9$ . The results show that our proposed voting-based MARL algorithm outperforms all the compared algorithms and is close to the optimal scheme. The performance gap between our proposed method and the centralized QL indicates that undiscounted RL methods are likely to outperform discounted RL methods in continuing optimization tasks. The performance gap between centralized QL and multi-agent AC/QL indicates that existing MARL algorithms exhibit a performance degradation compared with their centralized versions. In contrast, our proposed MARL algorithm achieves a performance equivalent to that of its centralized version. In addition, the multi-agent QL algorithm performs the worst and has a large variance. This verifies that specifying a proper collaboration protocol among the distributed agents is critical for improving the learning performance in MARL.

We further compare our proposed voting-based scheme with two baselines: 1) the random-voting scheme, where the MARL system randomly chooses one agent to determine the global action in each iteration; 2) the greedy scheme, where the MARL system aims at maximizing the cumulative reward of a single agent. Fig. 4 presents the averaged reward of each agent over $20$ runs. The rewards of the greedy scheme indicate the maximum obtainable reward of each agent, while the rewards of the random-voting scheme indicate the learning effectiveness without the proposed voting mechanism. The performance of our proposed voting-based scheme lies between the two baselines, which indicates that the agents are learning to compromise in order to maximize the cumulative global reward.

VII Conclusions

In this paper, we considered a collaborative MARL problem, where the agents vote to make group decisions. Specifically, the agents are coordinated to follow the proposed voting mechanism without revealing their own rewards to each other. We gave a saddle-point formulation of the concerned MARL problem and proposed a primal-dual learning algorithm for solving it. We showed that our proposed algorithm achieves the same sublinear convergence rate as centralized learning. Finally, we provided empirical results to demonstrate the learning effectiveness. More interesting applications in the IoT system and the voting mechanism in the context of competitive MARL can be explored in the future.

Appendix: Proof of Theorem 1

Our proof is similar in spirit to that of Theorem 1 in [39]. However, the analysis in [39] does not readily extend to the case of multi-agent AMDP. As a result, we have to develop a separate convergence analysis here.

By virtue of Lemma 1, it suffices to study the progress made by the sequences of global dual variables $\{\bm{\mu}^{g,t}\}_{t\geq 0}$ and global primal variables $\{\bm{v}^{t}\}_{t\geq 0}$ in Algorithm 1. We begin with the following lemma, which gives an estimate of the progress of the dual variables in terms of KL-divergence.

Lemma 3 (Dual Improvement in KL-Divergence)

The iterates generated by Algorithm 1 will satisfy

[TABLE]

for all $t\geq 0$ . $\blacksquare$

Proof. By definition, we have

[TABLE]

According to (9), (13), and (14a), we have

[TABLE]

where $Z=\sum_{i\in\mathcal{S}}\sum_{a\in\mathcal{A}}\mu^{g,t}_{i,a}\exp\{\Delta^{g,t}_{i,a}\}$ . It follows that

[TABLE]

Now, for any $\bm{v}^{t}\in\mathcal{V}$ , we have $\|\bm{v}^{t}\|_{\infty}\leq 2t_{\text{mix}}^{*}$ . Moreover, we have $r_{ij}^{m}(a)\in[0,1]$ by assumption. Hence, we have

[TABLE]

This, together with the fact that $C=4t^{*}_{\text{mix}}+M$ , implies $\Delta^{g,t}_{i,a}\leq 0$ , $\forall\ i\in\mathcal{S},\ a\in\mathcal{A},\ t=0,1,\ldots$ . On the other hand,

[TABLE]

where (17a) uses the fact that $\exp\left\{x\right\}\leq 1+x+\frac{1}{2}x^{2}$ for $x\leq 0$ and (17b) uses the fact that $\log(1+x)\leq x$ for $x>-1$ . Therefore, by combining the above results and taking conditional expectation $\mathbb{E}\left[\cdot\mid\mathcal{F}_{t}\right]$ on both sides, we obtain (16), as desired. $\blacksquare$
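The role of the two elementary inequalities can be illustrated numerically. The sketch below applies the normalized multiplicative update $\mu^{g,t+1}_{i,a}=\mu^{g,t}_{i,a}\exp\{\Delta^{g,t}_{i,a}\}/Z$ to a flattened dual vector and checks that, for nonpositive weights, $\log Z\leq\sum\mu\Delta+\frac{1}{2}\sum\mu\Delta^{2}$, which is what chaining (17a) and (17b) yields. The function names and the example values of `delta` are hypothetical; they are arbitrary nonpositive numbers, not weights computed from (15).

```python
import math


def global_dual_update(mu, delta):
    """Multiplicative update of a (flattened) dual vector, with normalizer Z."""
    weighted = [m * math.exp(d) for m, d in zip(mu, delta)]
    Z = sum(weighted)
    return [w / Z for w in weighted], Z


def log_z_upper_bound(mu, delta):
    """Bound on log Z from exp(x) <= 1 + x + x^2/2 (x <= 0) and log(1+x) <= x."""
    first = sum(m * d for m, d in zip(mu, delta))
    second = 0.5 * sum(m * d * d for m, d in zip(mu, delta))
    return first + second


mu = [1.0 / 6.0] * 6                       # uniform over 6 state-action pairs
delta = [0.0, 0.0, -0.8, 0.0, -0.3, 0.0]   # hypothetical nonpositive weights
mu_next, Z = global_dual_update(mu, delta)
assert math.log(Z) <= log_z_upper_bound(mu, delta)
```

Entries with more negative weights lose probability mass after normalization, while the updated vector remains a probability distribution.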

Our strategy now is to bound the two terms on the right-hand side of (16) separately.

Lemma 4

The iterates generated by Algorithm 1 satisfy

[TABLE]

for all $t\geq 0$ . $\blacksquare$

Proof. For arbitrary $i\in\mathcal{S}$ and $a\in\mathcal{A}$ , we have

[TABLE]

where (18) follows from Lemma 2 and (19) comes from the fact that

[TABLE]

This completes the proof. $\blacksquare$

Lemma 5

The iterates generated by Algorithm 1 satisfy

[TABLE]

for all $t\geq 0$ . $\blacksquare$

Proof. Using (15), the assumptions that $r_{ij}^{m}(a)\in[0,1]$ and $\bm{v}^{t}\in\mathcal{V}$ , and the definition of $C$ , we compute

[TABLE]

Since $\sum_{i\in\mathcal{S}}\sum_{a\in\mathcal{A}}\mu_{i,a}^{g,t+1}=1$ , the result follows. $\blacksquare$

Next, we give an estimate on the distance of the primal iterate $\bm{v}^{t}$ to the optimal primal variable $\bm{v}^{*}$ .

Lemma 6 (Distance to Primal Optimality)

The iterates generated by Algorithm 1 satisfy

[TABLE]

for all $t\geq 0$ . $\blacksquare$

Proof. We compute

[TABLE]

where the inequality follows from the fact that $\bm{v}^{*}\in\mathcal{V}$ and the projector $\Pi_{\mathcal{V}}\{\cdot\}$ is non-expansive. By Lemma 2, we have

[TABLE]

Finally, using the definition of $\bm{d}^{t}$ in (12), we have $\mathbb{E}\left[\|\bm{d}^{t}\|^{2}\mid\mathcal{F}_{t}\right]\leq 2\alpha^{2}$ , since $\|\bm{d}^{t}\|^{2}$ equals $2\alpha^{2}$ when $i_{t}\neq j_{t}$ and $0$ otherwise. This completes the proof. $\blacksquare$
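For completeness, the norm of the primal update direction can be worked out directly from $\bm{d}^{t}=\alpha(\bm{e}_{i_{t}}-\bm{e}_{j_{t}})$ and the orthonormality of the standard basis vectors. The indicator notation $\mathbb{1}\{\cdot\}$ is introduced here for illustration only.

```latex
\begin{aligned}
\|\bm{d}^{t}\|^{2}
  &= \alpha^{2}\,\|\bm{e}_{i_{t}}-\bm{e}_{j_{t}}\|^{2} \\
  &= \alpha^{2}\left(\|\bm{e}_{i_{t}}\|^{2}
     -2\,\bm{e}_{i_{t}}^{\top}\bm{e}_{j_{t}}
     +\|\bm{e}_{j_{t}}\|^{2}\right) \\
  &= 2\alpha^{2}\left(1-\mathbb{1}\{i_{t}=j_{t}\}\right)
  \;\leq\; 2\alpha^{2}.
\end{aligned}
```

The bound holds for every realization of $(i_{t},j_{t})$, so it also holds after taking the conditional expectation given $\mathcal{F}_{t}$.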

We are now ready to establish the key recursion that will lead to our desired bound on the convergence rate of our proposed Algorithm 1.

Lemma 7

Define

[TABLE]

The iterates generated by Algorithm 1 satisfy

[TABLE]

for all $t\geq 0$ . $\blacksquare$

Proof. Using the results in Lemmas 3–6 and taking $\alpha=\frac{1}{|\mathcal{A}|}(4t^{*}_{\text{mix}}+M)^{2}\beta$ , we compute

[TABLE]

Now, observe that

[TABLE]

where (20a) and (20c) use the dual feasibility conditions $\sum_{a\in\mathcal{A}}(\bm{\mu}^{g,*}_{a})^{\top}(I-P_{a})=\bm{0}$ and $\sum_{i\in\mathcal{S}}\sum_{a\in\mathcal{A}}\mu^{g,*}_{i,a}=1$ in (4), respectively; (20b) uses the complementarity condition

[TABLE]

of the linear program (3). Combining the preceding relations, we obtain Lemma 7. $\blacksquare$

Proof of Theorem 1: We claim that

[TABLE]

To see this, we note that $\bm{\mu}^{g,1}$ is the uniform distribution and $\bm{v}^{0},\bm{v}^{*}\in\mathcal{V}$ . Therefore, we have $D_{KL}(\bm{\mu}^{g,*}\|\bm{\mu}^{g,1})\leq\log(|\mathcal{S}|\cdot|\mathcal{A}|)$ and $\|\bm{v}^{t}-\bm{v}^{*}\|^{2}\leq 4|\mathcal{S}|(t^{*}_{\text{mix}})^{2}$ for $t=0,1,\ldots$ . This yields

[TABLE]

Now, we rearrange the terms in Lemma 7 and obtain

[TABLE]

Summing over $t=1,\ldots,T$ and taking the expectation, we have

[TABLE]

By taking

[TABLE]

we obtain

[TABLE]

as desired. $\blacksquare$

References

[1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. Cambridge, MA, USA: MIT Press, 2018.
[2] M. Chu, H. Li, X. Liao, and S. Cui, “Reinforcement learning-based multiaccess control and battery prediction with energy harvesting in IoT systems,” IEEE Internet Things J., vol. 6, no. 2, pp. 2009–2020, April 2019.
[3] N. Jiang, Y. Deng, A. Nallanathan, and J. A. Chambers, “Reinforcement learning for real-time optimization in NB-IoT networks,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1424–1440, June 2019.
[4] Y. Xu, W. Xu, Z. Wang, J. Lin, and S. Cui, “Load balancing for ultra-dense networks: A deep reinforcement learning based approach,” IEEE Internet Things J., vol. 6, no. 6, pp. 9399–9412, December 2019.
[5] H. Ye, G. Y. Li, and B. F. Juang, “Deep reinforcement learning based resource allocation for V2V communications,” IEEE Trans. Veh. Technol., vol. 68, no. 4, pp. 3163–3173, April 2019.
[6] X. Zhang, M. Peng, S. Yan, and Y. Sun, “Deep reinforcement learning based mode selection and resource allocation for cellular V2X communications,” IEEE Internet Things J., December 2019, to appear.
[7] Y. Liu, H. Yu, S. Xie, and Y. Zhang, “Deep reinforcement learning for offloading and resource allocation in vehicle edge computing and networks,” IEEE Trans. Veh. Technol., vol. 68, no. 11, pp. 11158–11168, November 2019.
[8] C. H. Liu, Z. Chen, J. Tang, J. Xu, and C. Piao, “Energy-efficient UAV control for effective and fair communication coverage: A deep reinforcement learning approach,” IEEE J. Sel. Areas Commun., vol. 36, no. 9, pp. 2059–2070, Sep. 2018.
