Distributed Power Control for Large Energy Harvesting Networks: A Multi-Agent Deep Reinforcement Learning Approach

Mohit K. Sharma; Alessio Zappone; Mohamad Assaad; Mérouane Debbah; Spyridon Vassilaras

arXiv:1904.00601·cs.LG·October 23, 2019

TL;DR

This paper introduces a multi-agent deep reinforcement learning framework for online power control in large energy harvesting networks, modeling the problem as a mean-field game and ensuring distributed learning of optimal policies.

Contribution

It develops a novel distributed MARL approach based on mean-field game theory, with proven convergence to unique stationary solutions for energy harvesting networks.

Findings

1. Distributed policies perform close to centralized ones.

2. Proposed method converges to the unique mean-field equilibrium.

3. Centralized DNN policies outperform traditional methods in large networks.

Tables (5)

Table 1. TABLE I: An example of normalized values for energy consumption of memory access and computation [40]. Here, the arithmetic and logic unit (ALU) contains the register file (RF). The RF is smaller than the processing engine (PE), which, in turn, is smaller than the global buffer. The dynamic RAM (DRAM) is the largest of all and is external to a DNN [41].

Hierarchy of Memory Access	Normalized energy cost
MAA	1x
RF $\to$ ALU	1x
PE $\to$ ALU	2x
Buffer $\to$ ALU	6x
DRAM $\to$ ALU	200x

Table 2. TABLE II: Performance of the DNN based policy for an EH MAC with $K=5$ users and $v=3.5$. Performance of the offline policy corresponds to 100%.

Mean (m)	Offline Policy (RPS in nats)	DNN policy (RPS in nats)	DNN policy (Percentage)
4	3.4907	3.1498	90.23%
5	3.6564	3.3107	90.54%
6	3.7877	3.4410	90.84%
7	3.8922	3.5102	90.18%
8	3.9740	3.6146	90.95%
9	4.0407	3.5676	88.29%

Table 3. TABLE III: Performance of the DNN based online policy for a point-to-point link with $m=10$. The action space of the DQN based policy is $\mathcal{A}\triangleq\{0,0.1,\ldots,15\}$. On the other hand, the MDP based solution is obtained using the action space $\mathcal{A}\triangleq\{0,1,\ldots,15\}$.

Variance (v)	Offline Policy (RPS in nats)	DNN Policy (Percentage)	DQN Policy (Percentage)	MDP Policy (Percentage)
1	2.0434	98.41%	95.56%	83.32%
2	2.0375	98.56%	95.24%	83.60%
3	2.0372	98.38%	98.11%	83.32%
4	2.0347	95.85%	96.54%	83.37%
5	2.0310	97.72%	95.28%	83.29%
6	2.0284	98.22%	98.18%	83.21%

Table 4. TABLE IV: Summary of the inputs and outputs of the policies compared in this section. Note that, since both the MF-MARL and the cooperative Q-learning use deep Q-learning at each individual node, their input and output values are continuous and discrete, respectively.

Policy	Input	Output
Centralized DNN	Continuous	Continuous
Distributed DNN	Discrete	Continuous
Deep Q-learning	Continuous	Discrete
MDP	Discrete	Discrete

Table 5. TABLE V: Performance of the MF-MARL and cooperative multi-agent Q-learning approach for an EH MAC with $K=5$ users and $v=3.5$. Performance of the centralized policy corresponds to 100%.

Mean (m)	Centralized Policy (RPS in nats)	MF-MARL policy (RPS)	MF-MARL policy (%)	Cooperative Q-learning (RPS)	Cooperative Q-learning (%)	Distributed DNN (RPS)	Distributed DNN (%)
4	3.1498	2.9390	93.30%	2.9354	93.19%	2.6788	85.04%
5	3.3107	3.1311	94.57%	3.0046	90.75%	2.8918	87.34%
6	3.4410	3.1072	90.29%	3.1852	92.56%	3.0765	89.40%
7	3.5102	3.2960	93.89%	3.2417	92.35%	3.1388	89.41%
8	3.6146	3.3973	93.98%	3.3064	91.47%	3.2518	89.96%
9	3.6166	3.5179	95.68%	3.4528	93.90%	3.1922	88.26%

Equations

$B_{n+1}^{k}=\min\left\{\left[B_{n}^{k}+e_{n}^{k}-p_{n}^{k}\right]^{+},B_{\max}\right\},$

$\mathcal{T}(\mathcal{P})=\sum_{n=1}^{N}\log\left(1+\sum_{k\in\mathcal{K}}p_{n}^{k}g_{n}^{k}\right),$

$\max_{\mathcal{P}}\ \liminf_{N\to\infty}\frac{1}{N}\mathcal{T}(\mathcal{P}),$

$\text{s.t.}\ 0\leq p_{n}^{k}\leq\min\{B_{n}^{k},P_{\max}\},$

$\max_{\mathcal{P}}\ \frac{1}{N}\sum_{n=1}^{N}\log\left(1+\sum_{k\in\mathcal{K}}p_{n}^{k}g_{n}^{k}\right),$

$0\leq p_{n}^{k}\leq\min\{B_{n}^{k},P_{\max}\}$ for all $n$, and $1\leq k\leq K$.

$I_{j}(n)=F_{j,n}\left(\boldsymbol{W}_{j,n}^{T}I_{j-1}+b_{j,n}\right),$

$L_{\text{av}}(\boldsymbol{W},\boldsymbol{b})=\frac{1}{N_{\text{data}}}\sum_{\ell=1}^{N_{\text{data}}}L\left(\hat{\mathcal{P}}_{\ell}^{*},I_{h+2,\ell}(\boldsymbol{W},\boldsymbol{b})\right),$

$R_{k}(\pi_{n},p_{n}^{k})=\log\left(1+\sum_{i=1}^{d}K\pi_{n}^{i}p_{i}g_{i}\right),$

$\pi_{n+1}^{j}$

$V_{n}(\pi_{n},F)=R(\pi_{n},F)+V_{n+1}(\pi_{n+1},F),$

$V_{n}(\pi_{n},F)\leq V_{n}(\pi_{n},F^{*}),$ for all policies $F$.

$G_{\tilde{\pi}}(\tilde{V})=\tilde{V}$ and $K_{\tilde{V}}(\tilde{\pi})=\tilde{\pi}$.

$\sum_{i=1}^{d}(p_{i}^{1}-p_{i}^{2})\left(f_{i}(F^{1})-f_{i}(F^{2})\right)>0,$

$\sum_{i=1}^{d}(\pi_{i}^{2}-\pi_{i}^{1})\left(R_{i}(F,\pi^{2})-R_{i}(F,\pi^{1})\right)\geq 0,$

$F_{m}^{*}$

$\pi_{m+1}$ and $\bar{\pi}_{m+1}$

$\sum_{i=1}^{d}(\pi_{i}^{2}-\pi_{i}^{1})\left(R_{i}(P,\pi^{2})-R_{i}(P,\pi^{1})\right)\geq 0.$

Taxonomy

Topics: Energy Harvesting in Wireless Networks · Smart Grid Energy Management · Electric Vehicles and Infrastructure

Full text

Distributed Power Control for Large Energy Harvesting Networks: A Multi-Agent Deep Reinforcement Learning Approach

The work in this paper will appear in part at IEEE ICASSP 2019 [1] and IEEE WiOpt 2019 [2]. Mohit K. Sharma, Mohamad Assaad, and Mérouane Debbah are with CentraleSupelec, Université Paris-Saclay, 91192 Gif-sur-Yvette, France (e-mails: {mohitkumar.sharma, mohamad.assaad}@centralesupelec.fr). A. Zappone was with CentraleSupelec, Gif-sur-Yvette, France, and is now with the University of Cassino and Southern Lazio, Cassino, Italy (email: [email protected]). Mérouane Debbah and Spyridon Vassilaras are with the Mathematical and Algorithmic Sciences Lab, Huawei France R&D, Paris, France (e-mails: {merouane.debbah, spyros.vassilaras}@huawei.com). This research has been partly supported by the ERC-PoC 727682 CacheMire project. The work of A. Zappone was supported by the H2020 MSCA IF BESMART, Grant 749336.

Mohit K. Sharma, Alessio Zappone, Mohamad Assaad, Mérouane Debbah, and Spyridon Vassilaras

Abstract

In this paper, we develop a multi-agent reinforcement learning (MARL) framework to obtain online power control policies for a large energy harvesting (EH) multiple access channel, when only causal information about the EH process and wireless channel is available. In the proposed framework, we model the online power control problem as a discrete-time mean-field game (MFG), and analytically show that the MFG has a unique stationary solution. Next, we leverage the fictitious play property of mean-field games and deep reinforcement learning to learn the stationary solution of the game in a completely distributed fashion. We analytically show that the proposed procedure converges to the unique stationary solution of the MFG. This, in turn, ensures that the optimal policies can be learned in a completely distributed fashion. To benchmark the performance of the distributed policies, we also develop deep neural network (DNN) based centralized and distributed online power control schemes. Our simulation results show the efficacy of the proposed power control policies. In particular, the DNN based centralized power control policies perform very well for large EH networks, for which the design of optimal policies is intractable using conventional methods such as Markov decision processes. Further, the throughput of both distributed policies is close to that achieved by the centralized policies.

I Introduction

Internet-of-things (IoT) [3] networks connect a large number of low power sensors whose lifespan is typically limited by the energy that can be stored in their batteries. In this context, the advent of energy harvesting (EH) technology [4] promises to prolong the lifespan of IoT networks by enabling the nodes to operate by harvesting energy from environmental sources, e.g., the sun, the wind, etc. However, this requires the development of new energy management methods. This is because an EH node (EHN) operates under the energy neutrality constraint, which requires that the total energy consumed by the node up to any point in time cannot exceed the total amount of energy harvested by the node until that point. This constraint is particularly challenging due to the random nature of environmental energy sources. In particular, the evolution over time of the intensity of the sun or wind is a random process, and thus the amount of energy that can be harvested at any given instant cannot be deterministically known in advance. In addition, at a given instant, an EHN can only store an amount of energy equal to its battery capacity. Therefore, a major and challenging issue in EH-based IoT systems is to devise power control policies that maximize the communication performance under the aforementioned constraints.

Available approaches for power control in EH-based wireless networks can be divided into two main categories: offline and online approaches. Offline approaches consider a finite time-horizon over which the optimal power control policy has to be designed, and assume that perfect information about the energy arrivals and channel states is available over the entire time-horizon [5, 6], before the start of operation. Under these assumptions, the power control problem can be formulated as a static optimization problem aimed at optimizing a given performance metric (e.g. system sum-rate, communication latency), and can be tackled by traditional optimization techniques. However, in general, offline approaches are not practically implementable because they require non-causal knowledge about the energy arrivals and propagation channels. For this reason, offline solutions are mostly considered for benchmarking purposes only.

In contrast to offline policies, online approaches target the optimization of the system performance over a longer, possibly infinite, time-horizon, and assume that only previous and present energy arrivals and channel states are known [7, 8]. As a result, the power allocation problem becomes a stochastic control problem, which, upon discretizing the state space (battery state and channel gains), can be formulated as a Markov decision process (MDP) [9], for which the optimal long-term policy can be determined numerically. However, these techniques require perfect knowledge of the statistics of the EH process and of the propagation channels, which are difficult to know in practice. In order to address this drawback, the frameworks of reinforcement learning (RL) [10, 11, 12, 13, 14, 15, 16, 17, 18] and Lyapunov optimization [19, 20, 21] have been proposed to find approximate solutions. All of these previous works take a centralized approach, in which typically the whole network is modeled as a single MDP whose solution provides the optimal long-term power allocation policy for all network nodes. However, this approach is not suitable for large networks, as the presence of a large number of nodes causes inevitable feedback overheads, and more importantly the resulting MDP is often intractable. Indeed, numerical solution techniques for MDPs suffer from the so-called “curse of dimensionality”, which makes them computationally infeasible.

Therefore, in the absence of any a priori knowledge about the EH process and the channel, it is essential to develop new techniques which can aid in learning the online policies for large EH-based networks in a distributed fashion. A fully distributed approach to online power control obviates the need for any information exchange between the nodes. Learning distributed power control policies for EH networks has been considered only recently, in a handful of works [22, 23, 24, 25].

In [22], the authors use a distributed Q-learning algorithm where each node independently learns its individual Q-function. However, the proposed method is not guaranteed to converge, since each individual node experiences an inherently non-stationary environment [26]. In [23], a distributed solution is developed to minimize the communication delay in EH-based large networks, assuming that the statistics of the EH process and of the wireless channel are known. Interestingly, the interactions among the devices are modeled as a mean-field game (MFG), a framework specifically conceived to analyze the evolution of systems composed of a very large number of distributed decision-makers [27, 28, 29]. A multi-agent reinforcement learning (MARL) approach is considered in [24], where an online policy for sum-rate maximization is developed. However, the approach in [24] assumes that the global system state is available at each node, which renders it infeasible for large EH networks, due to the extensive signaling required to feed the global system state back to all network nodes. In [25], a two-hop network with EH relays is considered, and a MARL-based algorithm with guaranteed convergence is proposed to minimize the communication delay.

The objective of this work is to develop a mechanism to learn optimal online power control policies in a distributed fashion, for a fading-impaired multiple access channel (MAC) with a large number of EH transmitters. The authors in [6] derived a throughput-optimal offline power control policy for a fading EH MAC, which is designed in a centralized fashion. In [30, 31], centralized online policies are developed under the simplifying assumptions of binary transmit power levels, and batteries with infinite or unit-size capacity. Optimal online power control policies for a fading EH MAC are not available in the literature, even in a centralized setting. In order to design a centralized power control policy, we build upon recent advances in deep learning [32]. In particular, our main contributions are the following:

• First, to benchmark the performance of the distributed policies, we develop a deep neural network (DNN) based centralized online policy which uses a DNN to map a system state to transmit power.

• We model the problem of throughput maximization for a fading EH MAC as a discrete-time MFG and, exploiting the structure of the problem, we show that the MFG has a unique stationary solution.

• Next, we leverage the fictitious play property of MFGs and develop a deep reinforcement learning (DRL) based approach to learn the stationary solution of the MFG. Under the proposed scheme, each node applies DRL individually to learn the optimal power control in a completely distributed fashion, without any a priori knowledge about the statistics of the EH process and the channel.

• Furthermore, we adapt the DNN based centralized approach to design an energy efficient distributed online power control policy.

• Extensive numerical results are provided to analyze the performance of the proposed schemes. Our results illustrate that the throughput achieved by DNN based centralized policies is close to the throughput achieved by the offline policies. Moreover, the policies learned using the proposed mean-field MARL approach achieve throughput close to that of the centralized policies.

In contrast to earlier work [22, 23], our algorithm is provably convergent and does not require any knowledge about the statistics of the EH process and of the wireless channels. In order to learn the optimal power control policy, each node only needs to know the state of its own channel and battery. The performance of the resulting online policies is very close to that of offline policies which exploit non-causal information. We note that our work is the first in the literature that uses multi-agent deep reinforcement learning to obtain the optimal power control in large EH networks.

The rest of the paper is organized as follows. In the following section, we describe the system model and the problem formulation. In Sec. III, we design DNN based centralized online power control policies. Next, in Secs. IV and V, we model the throughput maximization problem as a discrete-time finite state MFG and present our mean-field MARL approach to learn the distributed power control policy, respectively. In Sec. VI, we analyze the energy cost incurred in implementing the proposed algorithms, and also propose an energy efficient distributed DNN based algorithm. Simulation results are presented in Sec. VII, and conclusions in Sec. VIII.

II System Model and Problem Formulation

We consider a time-slotted EH network, where a large number of identical EHNs transmit their data over block fading channels to an access point (AP) which is connected to the mains. The set of transmitters is denoted by $\mathcal{K}\triangleq\{1,2,\ldots,K\}$, where $K\gg 1$ denotes the number of EHNs. In the $n^{\text{th}}$ slot, the fading complex channel gain between the $k^{\text{th}}$ transmitter and the AP is denoted$^{1}$ by $g_{n}^{k}\in\mathbf{G}_{k}$. In each slot, the channel between any transmitter and the AP remains constant for the entire slot duration, and changes at the end of the slot, independently of the channel in the previous slot. We assume that the wireless channels between the nodes and the AP, $\mathbf{G}_{k}$, are identically distributed.

$^{1}$ For any symbol in the paper, the superscript and subscript represent the node index and the slot index, respectively, and if only the subscript is present then it denotes either the node index or the slot index, depending on the context.

In a slot, the $k^{\text{th}}$ node harvests energy according to a general stationary and ergodic harvesting process $f_{\mathcal{E}_{k}}(e_{k})$ , where the random variable $\mathcal{E}_{k}$ denotes the amount of energy harvested by the $k^{\text{th}}$ transmitter, and $e_{k}$ denotes a realization of $\mathcal{E}_{k}$ . We assume that the harvesting processes $\{\mathcal{E}_{k}\}_{k\in\mathcal{K}}$ are identically distributed across the individual nodes, but not necessarily independent of each other. At each node, the harvested energy is stored in a perfectly efficient, finite capacity battery of size $B_{\max}$ . Further, only causal and local information is available, i.e., each node knows only its own energy arrivals, battery states, and the channel states to the AP, in the current and all the previous time slots. In particular, no node has information about the battery and the channel state of the other nodes in the network. Also, at any node, no information is available about the distribution of the EH process and of the wireless channels.

Let $p_{n}^{k}\leq P_{\max}$ denote the transmit energy used by the $k^{\text{th}}$ transmitter in the $n^{\text{th}}$ slot, where $P_{\max}$ denotes the maximum transmit energy which is determined by the RF front end of the EHNs. Further, $\mathcal{P}_{n}\triangleq\{p_{n}^{k}\}_{k=1}^{K}$ denotes the vector of transmit energies used in the $n^{\text{th}}$ slot, by all the transmitters. The battery at the $k^{\text{th}}$ node evolves as

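$$B_{n+1}^{k}=\min\left\{\left[B_{n}^{k}+e_{n}^{k}-p_{n}^{k}\right]^{+},B_{\max}\right\},\tag{1}$$

where $1\leq k\leq K$, and $[x]^{+}\triangleq\max\{0,x\}$. In the above, $B_{n}^{k}$ and $e_{n}^{k}$ denote the battery level and the energy harvested by the $k^{\text{th}}$ node at the start of the $n^{\text{th}}$ slot, respectively. An upper bound on the successful transmission rate of the EH MAC over $N$ slots is given by [6]

$$\mathcal{T}(\mathcal{P})=\sum_{n=1}^{N}\log\left(1+\sum_{k\in\mathcal{K}}p_{n}^{k}g_{n}^{k}\right),\tag{2}$$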

where $\mathcal{P}\triangleq\{\mathcal{P}_{n}|1\leq n\leq N\}$. Note that the above upper bound can be achieved by transmitting independent and identically distributed (i.i.d.) Gaussian signals. In (2), for simplicity, and without loss of generality, we set the power spectral density of the AWGN at the receiver as unity.$^{2}$

$^{2}$ We note that, in a scenario when all the EHNs simultaneously transmit their data, the cumulative signal-to-noise ratio (SNR) term in (2), $\sum_{k\in\mathcal{K}}p_{n}^{k}g_{n}^{k}$, grows with the number of users in the network. In practice, this problem can be circumvented by ensuring that the transmit power of EHNs scales down in inverse proportion to the number of users, i.e., $O\left(\frac{1}{K}\right)$. This ensures that the total energy in the network stays finite.
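The battery recursion (1) and the throughput bound (2) can be illustrated with a minimal Python sketch. This is only a toy illustration under stated assumptions: the function names, the numerical values, and the "spend half of the stored energy" rule are hypothetical and are not taken from the paper, and $g_{n}^{k}$ is treated as a real-valued power gain so that $p_{n}^{k}g_{n}^{k}$ is a received SNR term.

```python
import math

def battery_step(b, e, p, b_max):
    """One step of the battery recursion (1): min{[b + e - p]^+, b_max}."""
    return min(max(b + e - p, 0.0), b_max)

def feasible_energy(p_requested, b, p_max):
    """Clip a requested transmit energy to 0 <= p <= min{b, p_max}."""
    return max(0.0, min(p_requested, b, p_max))

def sum_throughput(powers, gains):
    """Bound (2) in nats: sum over slots of log(1 + sum_k p_n^k * g_n^k)."""
    return sum(
        math.log(1.0 + sum(p * g for p, g in zip(p_n, g_n)))
        for p_n, g_n in zip(powers, gains)
    )

# Toy run with K = 2 nodes and N = 3 slots; all numbers are made up.
B_MAX, P_MAX = 5.0, 2.0
arrivals = [[1.0, 0.5], [0.0, 2.0], [3.0, 0.0]]  # e_n^k, one row per slot
gains = [[0.8, 1.2], [1.0, 0.5], [0.3, 0.9]]     # g_n^k, one row per slot
battery = [1.0, 1.0]                             # initial battery levels
powers = []
for e_n in arrivals:
    # Placeholder rule (an assumption): try to spend half of the stored energy.
    p_n = [feasible_energy(0.5 * b, b, P_MAX) for b in battery]
    powers.append(p_n)
    battery = [battery_step(b, e, p, B_MAX)
               for b, e, p in zip(battery, e_n, p_n)]

throughput = sum_throughput(powers, gains)
print("final battery levels:", battery)
print("sum throughput (nats):", throughput)
```

Any rule that respects the clipping in `feasible_energy` yields a feasible trajectory; the policies developed in the paper differ only in how the requested energy is chosen.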

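In the absence of information about the statistics of the EH process and the channel, our goal in this work is to learn an online energy management policy at each node to maximize the time-averaged sum throughput. The optimization problem can be expressed as follows

$$\max_{\mathcal{P}}\ \liminf_{N\to\infty}\frac{1}{N}\mathcal{T}(\mathcal{P}),\tag{3a}$$

$$\text{s.t.}\quad 0\leq p_{n}^{k}\leq\min\{B_{n}^{k},P_{\max}\},\tag{3b}$$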

for all $n$ and $1\leq k\leq K$ . Constraint (3b) captures the fact that the maximum energy a node can use in the $n^{\text{th}}$ slot is limited by the minimum between the amount of energy available in the battery, $B_{n}^{k}$ , and the maximum allowed transmit energy $P_{\max}$ . Note that, the information about the random energy arrivals and the channel is only causally available, and for each node the battery evolves in a Markovian fashion, according to (1). Hence, the optimization problem (3) is essentially a stochastic control problem which, upon discretization of the state space, could be formulated as a Markov decision process (MDP). However, solving such an MDP in the considered setting poses at least three major challenges:

• Infeasible complexity, since in the considered setup a large number of nodes, $K$, is present in the network.

• In each slot, the global information about the battery and channel states, and the value of the harvested energy of each network node would be needed for the operation of the policy. Therefore, the feedback overhead in each slot is $\mathcal{O}(K)$. For a network with a large number of nodes this would result in a significant control overhead.

• Finally, solving the MDP also requires statistical information about the EH process and the wireless channel, which is often difficult to obtain, and indeed is not assumed in this work.
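To make the first challenge concrete, the following Python sketch counts the joint states of a centralized MDP when each node has $d$ discretized local states. The value $d=100$ and the function name are illustrative assumptions and do not come from the paper.

```python
def joint_state_count(d, num_nodes):
    """Number of joint states when each of num_nodes nodes has d local states."""
    return d ** num_nodes

# d = 100 is an assumed discretization (battery x channel x harvested energy).
for k in (1, 2, 5, 10, 20):
    print(f"K = {k:2d}: {joint_state_count(100, k):.3e} joint states")
```

Under this assumption, already $K=5$ nodes give $10^{10}$ joint states, which is why tabular MDP solvers become impractical as $K$ grows.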

For these reasons, the goal of this work is to develop a framework to learn online power control policies in a distributed fashion, i.e., each node learns the optimal online power control policy without needing to know the battery states, channel states, and actions of the other nodes. In the following sections, we develop a provably convergent mean-field multi-agent reinforcement learning (MF-MARL) approach to distributively learn the throughput-optimal power control policies, leveraging the tools of DRL and MFGs. In the following section, we first present a DNN based centralized power control policy which is used for benchmarking our MF-MARL based distributed solution.

III DNN based Centralized Online Power Control Policy

To describe our DNN based centralized approach to solve the stochastic control problem in (3), we introduce some additional notation and formally define the online and offline policies in the context of our problem.

III-A Notations

For the $k^{\text{th}}$ node, let ${\boldsymbol{E}}_{m:n}^{k}\triangleq\{e_{m}^{k},e_{m+1}^{k},\ldots,e_{n}^{k}\}$, ${\boldsymbol{B}}_{m:n}^{k}\triangleq\{B_{m}^{k},B_{m+1}^{k},\ldots,B_{n}^{k}\}$, and ${\boldsymbol{G}}_{m:n}^{k}\triangleq\{g_{m}^{k},g_{m+1}^{k},\ldots,g_{n}^{k}\}$ denote the vectors containing the values of energy harvested, battery state, and the channel state, respectively, in the slots from $m$ to $n$. Further, the history up to the start of slot $n$ is denoted by a tuple $\boldsymbol{H}_{n}\triangleq\left\{({\boldsymbol{E}}_{1:n-1}^{k},{\boldsymbol{B}}_{1:n-1}^{k},{\boldsymbol{G}}_{1:n-1}^{k})\right\}_{k=1}^{K}$, with $\boldsymbol{H}_{n}\in\mathcal{H}_{n}$, where $\mathcal{H}_{n}$ is the set of all possible histories up to slot $n$. Also, in the $n^{\text{th}}$ slot the current state of the system is described by the tuple $\boldsymbol{s}_{n}\triangleq\{\boldsymbol{E}_{n},\boldsymbol{B}_{n},\boldsymbol{G}_{n}\}$, where $\boldsymbol{E}_{n}\triangleq(e_{n}^{1},e_{n}^{2},\ldots,e_{n}^{K})$, $\boldsymbol{B}_{n}\triangleq(B_{n}^{1},B_{n}^{2},\ldots,B_{n}^{K})$ and $\boldsymbol{G}_{n}\triangleq(g_{n}^{1},g_{n}^{2},\ldots,g_{n}^{K})$ are the vectors containing the values of energy harvested, battery state, and the channel state, respectively, for all the nodes in the $n^{\text{th}}$ slot. Further, $\boldsymbol{s}_{n}\in\mathcal{S}$, where $\mathcal{S}$ denotes the set of all the possible states.

III-B Online and Offline Policies

In the $n^{\text{th}}$ slot, an online decision rule $f_{n}:\mathcal{H}_{n}\times\mathcal{S}\to\hat{\mathcal{P}}$ maps the history, $\boldsymbol{H}_{n}$, and the current state of the system, $\boldsymbol{s}_{n}$, to a transmit energy vector $\hat{\mathcal{P}}\in\mathbb{R}_{+}^{K}$ which contains feasible transmit energies for all the nodes. Mathematically, an online policy $\mathcal{F}$ is the collection of decision rules, i.e., $\mathcal{F}\triangleq\{f_{1},f_{2},\ldots\}$. In contrast, for the offline policy design problem the time-horizon, $N$, is finite, and the information about the amount of energy harvested and the channel state in all the slots is available non-causally, i.e., before the start of the operation. Hence, the stochastic control problem in (3) reduces to a static optimization problem which is written as

$$\begin{aligned}\max_{\mathcal{P}}\ &\frac{1}{N}\sum_{n=1}^{N}\log\left(1+\sum_{k\in\mathcal{K}}p_{n}^{k}g_{n}^{k}\right),\\ \text{s.t.}\ &0\leq p_{n}^{k}\leq\min\{B_{n}^{k},P_{\max}\}\ \text{for all } n, \text{ and } 1\leq k\leq K.\end{aligned}\tag{4}$$

Note that, since $N$ is finite, and the realizations of the EH processes and the channel states are known non-causally, i.e., $\boldsymbol{E}_{1:N}^{k}$ and $\boldsymbol{G}_{1:N}^{k}$ are known at the start of the operation for all the nodes, the objective and constraints in (4) are deterministic convex functions of the optimization variables $p_{n}^{k}$. Hence, the offline policy design problem in (4) is a convex optimization problem which can be solved efficiently using the iterative algorithm presented in [6], with per-iteration complexity equal to $\mathcal{O}\left(KN^{2}\right)$. The following section presents our approach to obtain the DNN based online energy management policies which, in general, can also be used for solving a stochastic control problem using the solution of an offline optimization problem.

III-C DNN based Online Energy Management

To obtain an online energy management policy, we first note that, due to the finite state and action space of the problem, the optimal policy for problem (3) is a Markov deterministic policy [33, Thm. 8.4.7], i.e., $\mathcal{F}\triangleq\{f,f,\ldots\}$ where $f:\mathcal{S}\to\hat{\mathcal{P}}$. Hence, the optimal online energy management policy can be obtained by finding a decision rule which maps the current state of the system $\boldsymbol{s}_{n}$ to an optimal transmit energy vector for problem (3). Furthermore, for a finite horizon problem, an offline policy also represents a mapping from the current state to a feasible transmit energy vector, i.e., the optimal offline policy maps a $(\boldsymbol{E},\boldsymbol{B},\boldsymbol{G})$ tuple to $\hat{\mathcal{P}^{*}}$. Here, $\hat{\mathcal{P}^{*}}$ denotes the vector containing the optimal transmit power for each node. Since a DNN is a universal function approximator [34], provided it contains a sufficient number of neurons, we propose to use a DNN to learn the optimal decision rule by using the solution of the offline policy design problem to train the DNN. Under the proposed online scheme, in a given slot, the optimal transmit energy vector can be obtained by feeding the current state of the system as the input to the trained DNN. Our approach is illustrated in Fig. 2. In the following, we briefly describe the architecture of the DNN used and the procedure used for training the DNN.
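As a rough sketch of how an offline solution could be turned into supervised training data, the Python snippet below replays one realization and emits one (state, target) pair per slot. It assumes the offline transmit energies are already available from a solver for problem (4), which is not implemented here; the function name and all numerical values are hypothetical.

```python
def build_training_pairs(arrivals, gains, offline_powers, b_init, b_max):
    """Replay one realization and emit ((E_n, B_n, G_n), P_n) pairs per slot.

    arrivals, gains, offline_powers: lists over slots of per-node lists.
    offline_powers is assumed to come from an offline solver for problem (4).
    """
    battery = list(b_init)
    pairs = []
    for e_n, g_n, p_n in zip(arrivals, gains, offline_powers):
        state = list(e_n) + list(battery) + list(g_n)  # 3K-length DNN input
        pairs.append((state, list(p_n)))               # K-length target
        # Battery recursion (1) gives the levels seen in the next slot.
        battery = [min(max(b + e - p, 0.0), b_max)
                   for b, e, p in zip(battery, e_n, p_n)]
    return pairs

# Toy realization with K = 2 nodes and N = 2 slots (made-up numbers).
pairs = build_training_pairs(
    arrivals=[[1.0, 0.0], [0.5, 2.0]],
    gains=[[0.9, 1.1], [0.4, 0.7]],
    offline_powers=[[0.5, 1.0], [1.0, 0.0]],  # placeholder offline solution
    b_init=[1.0, 1.0],
    b_max=5.0,
)
for state, target in pairs:
    print(state, "->", target)
```

Repeating this over many independent realizations of the energy arrivals and channel gains would produce the data set on which the decision rule is fitted.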

III-D DNN Architecture

We adopt a feedforward neural network whose input layer contains $3K$ neurons, one corresponding to each input. A $3K$-length vector, containing the states of all the transmitters, is fed to the DNN as input, which is then processed by $h+1$ layers ($h$ hidden layers and the output layer) to compute a feasible $K$-length transmit power vector. The number of processing units, usually termed neurons, at the $j^{\text{th}}$ layer is denoted by $N_{j}$, where $1\leq j\leq h+2$. Note that $N_{1}=3K$ and $N_{h+2}=K$. The output of the $n^{\text{th}}$ neuron of the $j^{\text{th}}$ layer, denoted by $I_{j}(n)$, is computed as

$$I_{j}(n)=F_{j,n}\left(\boldsymbol{W}_{j,n}^{T}I_{j-1}+b_{j,n}\right),\tag{5}$$

where $I_{j-1}$ is the output of the $(j-1)^{\text{th}}$ layer, which is fed as input to the $j^{\text{th}}$ layer. Also, $\boldsymbol{W}_{j,n}\in\mathbb{R}^{N_{j-1}}$ , $b_{j,n}\in\mathbb{R}$ , and $F_{j,n}$ denote the weights, bias and the nonlinear activation function for the $n^{\text{th}}$ neuron of $j^{\text{th}}$ layer, respectively. For detailed exposition on the architecture of DNNs and activation functions we refer the readers to [34].
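The per-neuron rule in (5) can be sketched in plain Python as follows. The ReLU activation, the layer sizes, and the hand-picked weights are assumptions made only for illustration; the paper's actual architecture choices are not reproduced here.

```python
def relu(x):
    """Rectified linear unit, used here as an assumed activation F_{j,n}."""
    return max(0.0, x)

def layer_forward(inputs, weights, biases, activation):
    """Apply (5) to every neuron n: F(W_{j,n}^T I_{j-1} + b_{j,n})."""
    return [
        activation(sum(w * x for w, x in zip(w_n, inputs)) + b_n)
        for w_n, b_n in zip(weights, biases)
    ]

def dnn_forward(state, layers):
    """Propagate a 3K-length state through the hidden and output layers."""
    out = list(state)
    for weights, biases, activation in layers:
        out = layer_forward(out, weights, biases, activation)
    return out

# Toy network for K = 1: 3 inputs -> 2 hidden neurons -> 1 output.
layers = [
    ([[0.5, 0.5, 0.0], [-1.0, 0.0, 0.0]], [0.0, 0.0], relu),  # hidden layer
    ([[1.0, 1.0]], [0.5], relu),                              # output layer
]
state = [1.0, 2.0, 0.5]  # (e, B, g) for the single node, made-up values
print(dnn_forward(state, layers))
```

A non-negative output activation keeps the predicted transmit energy non-negative; enforcing the upper limit $\min\{B_{n}^{k},P_{\max}\}$ would need an additional clipping step, which is omitted in this sketch.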

III-E Training

The DNN can learn the optimal mapping between the system state and the feasible transmit power vector by appropriately adjusting the weights and biases of the neurons in the network. The weights $\boldsymbol{W}=\{\{\boldsymbol{W}_{j,n}\}_{n=1}^{N_{j}}\}_{j=1}^{h+2}$ and biases $\boldsymbol{b}=\{\{b_{j,n}\}_{n=1}^{N_{j}}\}_{j=1}^{h+2}$ of the neurons of a DNN can be tuned by minimizing a loss function over a training set, which is a set of data points for which the optimal mapping is already known. In particular, the training process minimizes the average loss over the entire training set, defined as follows

$$L_{\text{av}}(\boldsymbol{W},\boldsymbol{b})=\frac{1}{N_{\text{data}}}\sum_{\ell=1}^{N_{\text{data}}}L\left(\hat{\mathcal{P}}_{\ell}^{*},I_{h+2,\ell}(\boldsymbol{W},\boldsymbol{b})\right),\tag{6}$$

where $L(\cdot)$ denotes a loss function which is a metric of distance between the desired output and the output of the DNN, and $N_{\text{data}}$ denotes the number of data points in the training set. In (6), $\hat{\mathcal{P}}_{\ell}^{*}$ and $I_{h+2,\ell}(\cdot,\cdot)$ denote the desired output and the output of the DNN, respectively, for the $\ell^{\text{th}}$ data point. The training proceeds by iteratively minimizing the loss in (6), using gradient based methods over the training data set. During the training, the gradients are often estimated using small subsets of the training set which are called mini-batches. Note that, in order to train the DNN to learn the optimal online energy management decision rule, the training data is generated by solving several instantiations of the offline problem (4), each corresponding to a different realization of $\{\boldsymbol{E}_{1:N}^{k},\boldsymbol{G}_{1:N}^{k}\}_{k=1}^{K}$. The training data generated by solving the offline problem contains tuples of the form $\{\left(\boldsymbol{E},\boldsymbol{B},\boldsymbol{G}\right),\boldsymbol{P}\}$, where $\left(\boldsymbol{E},\boldsymbol{B},\boldsymbol{G}\right)$ and $\boldsymbol{P}$ represent the input to the DNN and the desired output, respectively. Further details related to the loss function, training method, and the batch size used in this work are presented in Sec. VII. A detailed discussion on the choice of the loss functions for the training, the training method for DNN, and the mini-batch size can be found in [34, Ch. 7 and 8].
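A minimal Python sketch of the average loss in (6) and of mini-batch splitting is given below. The squared-error loss, the helper names, and the toy data are assumptions for illustration only; the loss and batch size actually used in the paper are stated in its simulation section and are not reproduced here.

```python
import random

def squared_error(target, output):
    """Assumed choice of L: squared Euclidean distance between two vectors."""
    return sum((t - o) ** 2 for t, o in zip(target, output))

def average_loss(dataset, predict, loss=squared_error):
    """Average loss (6): mean over the data set of L(target, predict(input))."""
    return sum(loss(target, predict(x)) for x, target in dataset) / len(dataset)

def mini_batches(dataset, batch_size, rng):
    """Shuffle a copy of the data set and yield consecutive mini-batches."""
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

# Toy data: (input, desired output) pairs and a trivial all-zero predictor.
dataset = [([1.0], [1.0]), ([2.0], [3.0]), ([0.5], [0.0]),
           ([4.0], [2.0]), ([3.0], [1.0])]

def predict_zero(x):
    return [0.0]

print("average loss:", average_loss(dataset, predict_zero))
for batch in mini_batches(dataset, batch_size=2, rng=random.Random(0)):
    print("batch loss:", average_loss(batch, predict_zero))
```

In an actual training loop, the gradient of each batch loss with respect to the weights and biases would drive the parameter update; that step is left out because it depends on the chosen optimizer.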

Note that our approach to designing the centralized online energy management policy does not require knowledge of the statistics of the EH process and the channel. Interestingly, as observed through the simulations, the proposed DNN based approach performs marginally better than the state-of-the-art deep reinforcement learning approach. However, in contrast to the deep Q-learning method, the proposed DNN-based approach requires measurements of the EH values and channels for all the nodes, which are used for training the DNN before the start of the operation. Also, the proposed approach determines the transmit energy vector for all the nodes in a centralized fashion, using the battery state, channel state, and the amount of energy harvested in the current slot, for all the nodes. To implement this scheme, the nodes are required to feed back their state in every slot, and the transmit energies to be used in the next slot are then communicated to the nodes. The distributed solutions proposed in the following sections obviate the overhead involved in communicating the state information and the transmit energies.

IV Mean-field Game to Maximize the Sum Throughput

In this section, we first model the sum throughput maximization problem in (3) as a discrete-time, finite state MFG [35]. Next, we present preliminaries on discrete-time MFGs, and list the key results which are useful in showing the convergence of the proposed approach to the stationary solution of the MFG.

IV-A Throughput Maximization Game

The throughput maximization game $\mathcal{G}_{T}\triangleq\{\mathcal{K},\mathcal{S},\mathcal{F},\mathcal{R}\}$ consists of:

• The set of players $\mathcal{K}=\{1,2,\ldots,K\}$ , each one corresponding to a unique EH transmitter, where $K\gg 1$ ;

• The state space of all players $\mathcal{S}\triangleq\times_{k\in\mathcal{K}}\mathcal{S}^{k}$ , with $\mathcal{S}^{k}$ denoting the space of all the states $s^{k}$ for the $k^{\text{th}}$ transmitter, and $|\mathcal{S}^{k}|\triangleq d$ . Also, let $s_{n}^{k}\triangleq(B_{n}^{k},g_{n}^{k},e_{n}^{k})$ denote the state of the $k^{\text{th}}$ transmitter in the $n^{\text{th}}$ slot, where $B_{n}^{k}$ , $g_{n}^{k}$ , and $e_{n}^{k}$ are discrete-valued;

• The set of energy management policies of all the nodes $\mathcal{F}\triangleq\{\mathcal{F}^{k}\}_{k\in\mathcal{K}}$ , where $\mathcal{F}^{k}$ denotes the policy of the $k^{\text{th}}$ node;

• The set of reward functions of all the nodes $\mathcal{R}\triangleq\{\mathcal{R}_{k}\}_{k\in\mathcal{K}}$ , where $\mathcal{R}_{k}$ is the reward function of node $k$ .

Note that, since all the transmitters are identical, the state space of individual nodes, $\mathcal{S}^{k}$ , is the same set for all $k=1,\dots,K$ . In the $n^{\text{th}}$ time slot, the $k^{\text{th}}$ node uses $p_{n}^{k}$ amount of energy, prescribed by its policy $\mathcal{F}^{k}$ , and collects a reward according to its reward function $\mathcal{R}_{k}$ and evolves from one state to another.

Under the mean field hypothesis[35], the reward obtained by a given node depends on the other nodes only through the distribution of all the nodes across the states. Let $\boldsymbol{\pi}_{n}\triangleq(\pi_{n}^{1},\ldots,\pi_{n}^{d})$ denote the distribution of all the nodes across the states, in the $n^{\text{th}}$ slot, where $\pi_{n}^{i}$ denotes the fraction of nodes in the $i^{\text{th}}$ state. Since the goal is to maximize the sum-throughput of the network, each node receives a reward equal to the sum-throughput of the network. In the $n^{\text{th}}$ slot, the reward obtained by the $k^{\text{th}}$ node is equal to the total number of bits successfully received by the AP, from all the transmitters. Thus, the reward function can be mathematically expressed as

[TABLE]

where $g_{i}$ is the wireless channel gain between the nodes in the $i^{\text{th}}$ state and the AP, and $p_{i}\in\mathcal{A}_{p}\triangleq\{0,p_{\min},\ldots,P_{\max}\}$ denotes the energy level used for transmission by the nodes in the $i^{\text{th}}$ state. Here, $p_{\min}$ denotes the minimum energy required for transmission. In (7), $K\pi_{n}^{i}$ denotes the number of nodes in the $i^{\text{th}}$ state, in the $n^{\text{th}}$ slot, since $\pi_{n}^{i}$ is the fraction of the $K$ nodes in that state. Note that (7) is written using the fact that under the mean-field hypothesis all the nodes are identical, and hence use the same policy, which also implies that the reward function, $\mathcal{R}_{k}(\cdot,\cdot)$ , is identical for all the nodes. Hence, to simplify the notation, in the ensuing discussion we omit the node index $k$ . Also, (7) implicitly assumes that all nodes in state $i$ use the energy $p_{i}$ , which is motivated by the fact that for an MDP with finite state and action sets, the optimal policy is a Markov deterministic policy [33, Thm. 8.4.7], i.e., in a slot the optimal transmit energy for a node depends only on its current state.

In the $n^{\text{th}}$ slot, when a node in state $s_{n}\in\mathcal{S}$ transmits using energy $p_{s_{n}}$ , the system evolves as

[TABLE]

where $P_{ij}^{n}(\cdot)$ denotes the probability, in slot $n$ , that a node in state $i$ transits to state $j$ , and depends on $p_{i}$ , the energy used for transmission by the node in the $i^{\text{th}}$ state (footnote 3: in a general MFG, the transition probabilities $P_{ij}^{n}$ may also depend on the actions of the other players). In (8), the transition probabilities, $P_{ij}^{n}(\cdot)$ , are determined by the statistics of the EH process and the wireless channel, and the transmit power policy used by a node (footnote 4: thus, if a node follows a transmit power policy which evolves over time, the resulting transition probabilities are non-homogeneous over time). In a given slot, all the nodes obtain a reward, $\mathcal{R}\left(\boldsymbol{\pi}_{n},\mathcal{F}\right)$ , equal to the total number of bits successfully decoded in that slot by the AP.
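As an illustration of how the distribution of the nodes evolves, the following minimal sketch performs one slot of the update. It assumes that (8) takes the standard forward form $\pi_{n+1}^{j}=\sum_{i}\pi_{n}^{i}P_{ij}^{n}(p_{i})$ , which is an assumption made here and should be checked against (8); the function name `step_distribution` and the two-state transition matrix are hypothetical.

```python
def step_distribution(pi, transition):
    """One-slot update of the state distribution.

    pi[i] is the fraction of nodes in state i, and transition[i][j] is the
    probability of moving from state i to state j under the energy p_i that
    the policy prescribes in state i (assumed form of the evolution).
    """
    d = len(pi)
    if len(transition) != d or any(len(row) != d for row in transition):
        raise ValueError("transition must be a d x d matrix")
    for row in transition:
        if abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each row of transition must sum to one")
    return [sum(pi[i] * transition[i][j] for i in range(d)) for j in range(d)]


# Usage: a toy two-state example with made-up transition probabilities.
pi_n = [0.5, 0.5]
P_n = [[0.9, 0.1],
       [0.2, 0.8]]
print(step_distribution(pi_n, P_n))
```

Applying this step repeatedly with a fixed matrix corresponds to operating under a fixed policy, in which case the transition probabilities are homogeneous over time.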

For a given node, starting from the $n^{\text{th}}$ slot, the expected sum-throughput obtained by following a policy $\mathcal{F}$ can be expressed as

[TABLE]

where $V_{n+1}(\boldsymbol{\pi}_{n+1},\mathcal{F})$ denotes the expected throughput obtained by following a policy $\mathcal{F}$ starting from slot $n+1$ , when in the $(n+1)^{\text{th}}$ slot the distribution of the nodes across the states is given by $\boldsymbol{\pi}_{n+1}$ . In the rest of the paper, $V(\cdot,\cdot)$ is termed the value function. In the above, similar to an MDP [33], (9) is written using the fact that the expected sum-throughput obtained by following a policy $\mathcal{F}$ , starting from the time slot $n$ , is equal to the sum of the expected sum-throughput obtained in the slot $n$ and that obtained from the slot $n+1$ onward. Note that, under the mean-field hypothesis, the expected sum-throughput in (9) is identical for all the nodes, and due to the special structure of the reward function, the value function of each node, $V(\cdot,\cdot)$ , only depends on the distribution of the nodes across the states, $\boldsymbol{\pi}_{n}$ , and not on the state of the individual nodes. Hence, (9) does not include a superscript/subscript to denote the node index. In the following, we present preliminaries on discrete-time, finite state MFGs.

IV-B Preliminaries: discrete-time finite state MFGs

In the following, we define the notion of Nash equilibrium and stationary solution for the discrete-time MFGs, and briefly summarize the key results used to prove the convergence of the proposed MARL algorithm in Sec. V. For a detailed exposition on discrete-time finite state MFGs we refer the readers to [35].

Definition 1 (Nash maximizer).

For a fixed probability vector $\boldsymbol{\pi}_{n}$ , a policy $\mathcal{F}^{*}$ is said to be a Nash maximizer if and only if

[TABLE]

That is, for a fixed $\boldsymbol{\pi}_{n}$ , the Nash maximizer is a policy that maximizes the value function. Next, for a discrete-time finite state MFG, we define the notions of solution and stationary solution.

Definition 2 (Solution of a MFG).

Suppose that for each $\boldsymbol{\pi}_{n}$ there exists a Nash maximizer $\mathcal{F}^{*}$ . Then a sequence of tuples $\{(\boldsymbol{\pi}_{n},V_{n})\text{ for }n\in\mathbb{N}\}$ is a solution of the MFG if for each $n\in\mathbb{N}$ it satisfies (8) and (9) for some Nash maximizer of $V_{n}$ .

Definition 3 (Stationary solution).

Let $\mathcal{G}_{\boldsymbol{\pi}}$ and $\mathcal{K}_{V}$ be defined as $\mathcal{G}_{\boldsymbol{\pi}_{n}}(V_{n+1})=V_{n}(\boldsymbol{\pi}_{n},\mathcal{F}),$ and $\mathcal{K}_{V_{n}}(\boldsymbol{\pi}_{n})=\boldsymbol{\pi}_{n+1}$ . A pair $(\tilde{\boldsymbol{\pi}},\tilde{V})$ is said to be a stationary solution if and only if

[TABLE]

Note that the operators $\mathcal{G}_{\boldsymbol{\pi}_{n}}(\cdot)$ and $\mathcal{K}_{V_{n}}(\cdot)$ are backward and forward in time, respectively, since $\mathcal{G}_{\boldsymbol{\pi}_{n}}$ maps $V_{n+1}$ to $V_{n}$ while $\mathcal{K}_{V_{n}}$ maps $\boldsymbol{\pi}_{n}$ to $\boldsymbol{\pi}_{n+1}$ . Also, the operators in (10) and (11) are compact representations of (8) and (9), respectively. The stationary solution of an MFG, $(\tilde{\boldsymbol{\pi}},\tilde{V})$ , is a fixed point of the operators $\mathcal{G}_{\boldsymbol{\pi}}$ and $\mathcal{K}_{V}$ , which are essentially discrete-time counterparts of the Hamilton-Jacobi-Bellman and Fokker-Planck equations. Next, we list the results which identify the conditions under which a stationary solution exists. We omit the proofs for brevity. These results are later used for proving the convergence of our mean-field MARL (MF-MARL) algorithm to the stationary solution.

Theorem 1 (Uniqueness of Nash maximizer (Theorem 2 [35])).

Let $f_{i}(p_{i})\triangleq\frac{\partial V(\boldsymbol{\pi},\mathcal{F})}{\partial p_{i}}$ where $p_{i}\in\left[0,P_{\max}\right]$ for all $1\leq i\leq d$ . If the value function $V_{n}$ is convex and continuous with respect to $p_{i}$ , and $f_{i}$ is strictly diagonally convex, i.e., it satisfies

[TABLE]

then there exists a unique policy which is a Nash maximizer for the value function $V$ . Here, $p_{i}^{1}$ and $p_{i}^{2}$ denote the actions prescribed in the $i^{\text{th}}$ state by two arbitrary policies $\mathcal{F}^{1}$ and $\mathcal{F}^{2}$ , respectively.

The following result shows that if the reward function is monotonic with respect to both the variables, $\boldsymbol{\pi}$ and $p_{i}$ , then the MFG admits a unique solution.

Theorem 2 (Uniqueness of solution (Proposition 4.3.1, [36])).

Let the value function be a continuous function with respect to both of its arguments, and also assume that there exists a unique Nash maximizer $\mathcal{F}_{n}$ for all $n\in\{0,1,2,\cdots\}$ . Further, let the reward function be monotone with respect to the distribution $\boldsymbol{\pi}$ , i.e.,

[TABLE]

then there exists a unique solution for the MFG. In the above $\mathcal{R}_{i}(\cdot,\cdot)$ denotes the reward obtained by the nodes in the $i^{\text{th}}$ state.

In addition, the uniqueness of the Nash maximizer and the continuity of the value function in both of its arguments ensure that a stationary solution exists [35, Thm. 3]. Thus, Theorem 2 also implies that the stationary solution is unique. In the following, we establish that the MFG $\mathcal{G}_{T}$ admits a unique stationary solution.

IV-C Unique Stationary Solution for $\mathcal{G}_{T}$

Theorem 3.

The throughput maximization mean-field game $\mathcal{G}_{T}$ has a unique solution.

Proof.

The proof is relegated to Appendix A. ∎

The uniqueness of the solution of a discrete-time MFG implies that if an algorithm learning the solution of the game converges, then it converges to the unique stationary solution. In the next section, we present an algorithm to learn the stationary solution of the MFG $\mathcal{G}_{T}$ , as well as the corresponding Nash maximizer power control policy, and provide convergence guarantees for it. The proposed approach is termed the MF-MARL approach, since each individual node uses reinforcement learning to learn the stationary solution of the game.

V MF-MARL for Distributed Power Control

In this section, we present our mean-field MARL approach to learn the online power control policies to maximize the throughput of a fading EH MAC with a large number of users. We show that the proposed approach enables the distributed learning of the power control policies, which eventually converge to the stationary Nash equilibrium. The proposed MF-MARL algorithm exploits the fact that discrete-time finite state MFGs have the fictitious play property (FPP) [36]. The FPP for a discrete-time MFG is described in the following. Let $m$ denote the iteration index and $\boldsymbol{\bar{\pi}}_{1}$ denote an arbitrary probability vector representing the initial distribution of the nodes across the states. Let

[TABLE]

The procedure described by (14), (15) and (16) is called the fictitious play procedure. As described in (14), at the $m^{\text{th}}$ iteration, a node attempts to learn the Nash maximizer, $\mathcal{F}_{m}^{*}$ , given that its belief about the distribution of the nodes across the states is $\boldsymbol{\bar{\pi}}_{m}$ . Based on the Nash maximizer learned at the $m^{\text{th}}$ iteration, $\mathcal{F}_{m}^{*}$ , the belief about the distribution is updated to $\boldsymbol{\bar{\pi}}_{m+1}$ , using (15) and (16). Next, at the $(m+1)^{\text{th}}$ iteration, each node attempts to learn the Nash maximizer, $\mathcal{F}_{m+1}^{*}$ . A discrete-time MFG is said to have FPP if and only if the procedure described by (14), (15) and (16) converges. The following result provides the conditions under which the fictitious play procedure converges to the unique stationary solution of the discrete-time MFG.

Theorem 4 (Convergence of FPP to unique stationary solution (Theorem 4.3.2 [36])).

Let $(\boldsymbol{\pi}_{m},V_{m})$ denote the sequence generated through the FPP. If a MFG has a unique Nash maximizer at each stage of the game and the reward function is continuous and monotone with respect to probability vector $\boldsymbol{\pi}$ then the sequence $(\boldsymbol{\pi}_{m},V_{m})$ converges to $(\tilde{\boldsymbol{\pi}},\tilde{V})$ the unique stationary solution of the MFG.

For the throughput maximizing MFG $\mathcal{G}_{T}$ , convergence of the FPP to the stationary solution of the game directly follows from the above result and Theorem 3. As a consequence of this result, the stationary solution of the MFG $\mathcal{G}_{T}$ can be learned through the fictitious play procedure, provided the Nash maximizer can be found at each iteration of the fictitious play procedure, and the belief about the distribution is updated correspondingly. The MF-MARL approach uses reinforcement learning to learn the Nash maximizer at each iteration, i.e., for a given belief distribution $\bar{\boldsymbol{\pi}}$ each node individually uses a reinforcement learning algorithm to learn the Nash maximizer. The proposed MF-MARL approach is described in Algorithm 1.
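The structure of the fictitious play procedure can be sketched as a simple loop. The sketch below is not Algorithm 1: it assumes that the belief update takes the usual running-average form $\boldsymbol{\bar{\pi}}_{m+1}=\frac{m}{m+1}\boldsymbol{\bar{\pi}}_{m}+\frac{1}{m+1}\boldsymbol{\pi}_{m+1}$ , which should be checked against (16), and the callables `best_response` and `induced_distribution` are hypothetical stand-ins for the learning of the Nash maximizer and for the distribution generated by the resulting policy.

```python
def fictitious_play(belief, best_response, induced_distribution, iterations):
    """Generic fictitious play loop with a running-average belief update.

    best_response(belief) returns a policy for the current belief, and
    induced_distribution(policy, belief) returns the distribution of the
    nodes that results when every node follows that policy. Both are
    placeholders supplied by the caller.
    """
    policy = None
    for m in range(1, iterations + 1):
        policy = best_response(belief)
        pi_next = induced_distribution(policy, belief)
        # Assumed update: average the new distribution into the belief.
        belief = [(m * b + p) / (m + 1) for b, p in zip(belief, pi_next)]
    return belief, policy


# Toy usage: the policy always induces the same distribution, so the
# belief drifts from its arbitrary starting point toward that distribution.
target = [0.25, 0.75]
final_belief, _ = fictitious_play(
    belief=[1.0, 0.0],
    best_response=lambda b: "fixed-policy",
    induced_distribution=lambda policy, b: target,
    iterations=999,
)
print(final_belief)
```

In the MF-MARL setting, the role of `best_response` would be played by the reinforcement learning step at each node, and the belief would be estimated and broadcast by the AP.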

Note that, in Algorithm 1, ${\boldsymbol{\pi}}_{m_{n}}$ denotes the distribution of the nodes across the states, in the $n^{\text{th}}$ slot of the $m^{\text{th}}$ iteration. Further, ${\mathcal{F}_{m}^{k}}^{*}$ denotes the Nash maximizer policy of the $k^{\text{th}}$ node, at the $m^{\text{th}}$ iteration. The maximum duration of each iteration of Algorithm 1 is set to $T$ . However, in the $n^{\text{th}}$ slot of the $m^{\text{th}}$ iteration, where $n<T$ , the AP can terminate the current iteration by broadcasting the belief about the mean-field distribution $\boldsymbol{\pi}_{m_{n}}$ , depending on the update rule in Step 3 of Algorithm 1, i.e., when the previous belief of the nodes about the distribution, $\boldsymbol{\bar{\pi}_{m}}$ , is outdated. Note that, at the start of each new iteration, the Q-values are initialized with the Q-values at the end of the previous iteration.

In order to implement the Q-learning algorithm, a node needs to know the reward, i.e., the sum-throughput, obtained in each slot. Since the reward function is the same across the nodes, this could be accomplished by using an estimate of the distribution in (7). In particular, each node uses its own policy and an estimate of the distribution to build an estimate of the reward obtained in each slot. Alternatively, in each slot the AP can directly broadcast the total number of bits successfully decoded by it. The latter method obviates the need to estimate the distribution of the nodes, albeit at the cost of a higher feedback overhead. The latter method is essentially cooperative multi-agent Q-learning [37], where nodes attempt to maximize a common reward function. In our simulations, it is observed that the proposed MF-MARL based approach performs marginally better than the cooperative multi-agent Q-learning method. In steps $2$ and $3$ of Algorithm 1, the AP builds an estimate of ${\boldsymbol{\pi}}_{m_{n}}$ and periodically broadcasts it to the entire network (footnote 5: since the transmit power used by a node determines the state of the node, in each slot the AP can estimate the state of each node based on its transmit power). In the simulations, presented in Sec. VII, we use the empirical distribution as an estimate of ${\boldsymbol{\pi}}_{m_{n}}$ .
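The empirical distribution used as the estimate of ${\boldsymbol{\pi}}_{m_{n}}$ can be computed by counting how many nodes occupy each state. The following is a minimal sketch; the function name `empirical_distribution` and the toy state space are hypothetical, and states are represented as tuples $(B,g,e)$ as in the definition of $s_{n}^{k}$ .

```python
from collections import Counter


def empirical_distribution(node_states, state_space):
    """Fraction of nodes found in each state of a finite state space.

    node_states holds one (B, g, e) tuple per node, as estimated by the AP;
    state_space lists the d possible states in a fixed order.
    """
    if not node_states:
        raise ValueError("at least one node state is required")
    known = set(state_space)
    for s in node_states:
        if s not in known:
            raise ValueError(f"state {s!r} is not in the state space")
    counts = Counter(node_states)
    k = len(node_states)
    return [counts[s] / k for s in state_space]


# Usage: four nodes spread over a toy state space with three states.
states = [(0, 1, 0), (1, 1, 0), (1, 1, 1)]
observed = [(0, 1, 0), (1, 1, 1), (1, 1, 1), (0, 1, 0)]
print(empirical_distribution(observed, states))
```

Multiplying each entry by the number of nodes $K$ recovers the number of nodes in the corresponding state.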

V-A Implementation via Deep Reinforcement Learning

At each node, we implement the reinforcement learning algorithm using the deep Q-learning [38] method, where the Q-function is approximated using a deep neural network (DNN). In order to learn the Q-function, the DNN is successively trained using the problem data and a fixed target network which provides the reference Q-values. The target Q-network is periodically updated using the weights of the current Q-network. For further details on deep Q-learning with fixed target Q-networks, we refer the readers to [38]. This approach of using a DNN to learn the Q-function has the following advantages: (i) it obviates the need to discretize the state space, as the Q-function approximation learned using the DNN is continuous over the state space, whereas in the conventional approach it is learned for discrete state-action pairs, and (ii) it is inherently faster, compared to the conventional approach of implementing Q-learning. This is because, for a given state, the Q-function corresponding to all the actions is learned simultaneously. We also note that in the first and second steps of Algorithm 1, the use of Q-learning could be replaced by any other reinforcement learning scheme, e.g., an actor-critic algorithm.
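The role of the fixed target network can be illustrated with a minimal sketch of the two ingredients described above: the reference Q-values used as regression targets, and the periodic copy of the current weights into the target network. This follows the generic deep Q-learning recipe rather than the paper's exact implementation; the names `td_targets` and `maybe_sync_target`, and the representation of a network as a callable returning a list of Q-values, are assumptions made for illustration.

```python
def td_targets(batch, target_q, gamma):
    """Reference Q-values computed with the fixed target network.

    batch is a list of (state, action_index, reward, next_state) tuples and
    target_q(state) returns one Q-value per action. The task is treated as
    continuing, so no terminal-state handling is included.
    """
    return [reward + gamma * max(target_q(next_state))
            for (_, _, reward, next_state) in batch]


def maybe_sync_target(step, sync_every, online_weights, target_weights):
    """Copy the current weights into the target network every sync_every steps."""
    if step % sync_every == 0:
        return list(online_weights)
    return target_weights


# Usage with a stand-in target network that ignores its input.
toy_target_q = lambda state: [0.0, 1.0, 2.0]
toy_batch = [((5.0, 0.3, 1.0), 1, 1.0, (4.0, 0.2, 2.0))]
print(td_targets(toy_batch, toy_target_q, gamma=0.99))
```

The current Q-network would then be trained to bring its output for the taken action closer to these targets, while the target network stays fixed between synchronizations.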

In the following section, we compare the centralized and distributed approaches from the energy consumption perspective, discuss their feasibility for low power sensor nodes, and adapt the centralized DNN based approach developed in Sec. III to obtain an energy efficient distributed implementation.

VI Energy Efficient Distributed Power Control

First, we compare the energy required for the implementation of both the methods proposed in the previous sections. In order to do this, it is important to understand the energy consumption of a DNN [39]. As observed in the previous sections, the design of a DNN involves several hyper-parameters, e.g., the number of layers, the number of nodes in each layer, the length of the weight vectors for each node, etc., which are conventionally chosen to improve the accuracy of the DNN. These parameters also affect the energy consumption of a DNN, which also depends on the algorithm being implemented by the DNN [40]. Thus, the energy consumed by the DNN is a complicated function of these parameters, and it is not possible to compute it beforehand. Note that the energy consumed by a DNN is determined not only by the number of multiplication-and-accumulation (MAA) operations that need to be performed by the DNN, but also by the memory hierarchy and the data movement [40]. Indeed, as shown in Table I, the energy consumption of a DNN is overwhelmingly dominated by the energy consumed for data movement. For instance, the energy cost incurred by a single dynamic RAM access is 200 times more than that of an MAA operation. Even an access to local on-board memory costs more than an MAA operation. Thus, an algorithm which requires a high number of memory accesses will incur a larger energy cost, in comparison to an algorithm which does not require any memory access during runtime.

We note that, unlike deep Q-learning, in the centralized approach the DNN is trained only once, and the training data needs to be generated only once by solving multiple instantiations of the offline problem. Thus, for the centralized approach, the DNN can be trained in the cloud before deploying the trained DNN in an EHN. On the other hand, for deep Q-learning the DQN is trained successively during the operation of the algorithm. In addition, deep Q-learning also requires maintaining a memory buffer for experience replay, which is essential for the stability of the algorithm. This further adds to the energy cost of the deep Q-learning algorithm.

For the centralized approach, once the DNN is trained and deployed, the proposed online power control policy only needs to perform MAA operations to generate the transmit power vector. In particular, it requires $\sum_{j=1}^{h+2}N_{j}N_{j-1}$ multiplications. Thus, it requires no external memory access for its operation, which makes it more favorable for EHNs, compared to deep Q-learning. The number of MAA operations required to compute the output transmit power vector using the centralized approach can be further reduced by the use of model-compression methods [42, 43], which attempt to reduce the number of neurons and connections in the DNN without compromising its accuracy. However, we emphasize that the design of DNN-based centralized policies is only possible in scenarios where some a-priori knowledge about the EH process and the channel state is available. Thus, unlike the MF-MARL approach proposed in Sec. V, the DNN based policies cannot be used in scenarios where no knowledge about the EH process and the channel state is available. However, due to the aforementioned energy concerns, it would be desirable to have a distributed implementation of the DNN based centralized approach, developed in Sec. III, for the applications where offline information about the EH process and the channel state is available. The following subsection describes how the proposed DNN based centralized online power control scheme could be modified to develop an energy efficient distributed power control policy.
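The multiplication count stated above is the sum of the products of the widths of consecutive layers in a fully connected network. A minimal sketch follows; the function name `multiplication_count` and the example widths are hypothetical, and the list of widths is assumed to start with the input dimension and end with the output dimension.

```python
def multiplication_count(layer_widths):
    """Number of multiplications in one forward pass of a fully connected DNN.

    layer_widths lists the number of neurons per layer, from the input layer
    to the output layer; each pair of consecutive layers contributes the
    product of their widths.
    """
    if len(layer_widths) < 2:
        raise ValueError("at least an input and an output layer are required")
    return sum(a * b for a, b in zip(layer_widths, layer_widths[1:]))


# Usage: a toy network with 3 inputs, one hidden layer of 4 neurons, 2 outputs.
print(multiplication_count([3, 4, 2]))
```

Because this count depends only on the layer widths, it can be evaluated before deployment, unlike the memory-access cost discussed above.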

VI-A A Low Energy Cost Decentralized Policy for EHNs

We note that, for the centralized DNN-based online policy proposed in Sec. III, once the DNN is trained and deployed, the input vector to the trained DNN is constituted by the state of all the nodes in the network. A distributed implementation of this scheme could be facilitated by deploying the trained DNN, obtained after the centralized training, at the individual nodes. However, to locally determine the transmit powers at the nodes, each node would need to know the values of the energy harvested, battery and channel states of all the other nodes. Thus, the amount of overhead involved in the exchange of global state information across the network forbids the distributed implementation of the centralized approach. However, as observed for the MF-MARL algorithm, the optimal transmit power of a node only depends on the other nodes through the distribution of nodes across the states, $\boldsymbol{\pi}$ . The proposed distributed DNN based approach circumvents this problem by sampling the states of the other nodes from the distribution of the nodes across the states, denoted by $\boldsymbol{\pi}$ . Intuitively, given the trained DNN deployed at each EHN, the optimal performance can be obtained by constructing the input vector to the DNN by sampling the states of other nodes from the distribution $\boldsymbol{\pi}$ . In particular, in the $n^{\text{th}}$ slot, the input vector at the $k^{\text{th}}$ EHN can be generated as $\left(e_{s_{n}}^{1},B_{s_{n}}^{1},g_{s_{n}}^{1},\ldots,e_{n}^{k},B_{n}^{k},g_{n}^{k},\ldots,e_{s_{n}}^{K},B_{s_{n}}^{K},g_{s_{n}}^{K}\right)$ , where $(e_{s_{n}}^{1},B_{s_{n}}^{1},g_{s_{n}}^{1})$ denotes the state of the first node, sampled from the distribution $\boldsymbol{\pi}$ , and $(e_{n}^{k},B_{n}^{k},g_{n}^{k})$ denotes the actual state of the $k^{\text{th}}$ node in the $n^{\text{th}}$ slot.
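The construction of the input vector at a node can be sketched as follows. This is a minimal illustration under stated assumptions: the states of the other nodes are drawn independently from $\boldsymbol{\pi}$ , states are ordered as $(e,B,g)$ to match the input vector above, and the function name `build_input_vector`, the zero-based node index, and the toy state space are hypothetical.

```python
import random


def build_input_vector(k, own_state, num_nodes, state_space, pi, rng):
    """Input vector for the DNN deployed at node k (zero-based index).

    The states of all other nodes are sampled from the distribution pi over
    state_space, while node k inserts its own true (e, B, g) state.
    """
    if not 0 <= k < num_nodes:
        raise ValueError("k must index one of the num_nodes nodes")
    sampled = rng.choices(state_space, weights=pi, k=num_nodes)
    sampled[k] = own_state
    vector = []
    for e, b, g in sampled:
        vector.extend((e, b, g))
    return vector


# Usage: five nodes, a toy state space with two states, node index 2.
states = [(1.0, 4.0, 0.5), (2.0, 8.0, 1.5)]
vec = build_input_vector(2, (3.0, 6.0, 0.9), 5, states, [0.7, 0.3], random.Random(1))
print(len(vec), vec[6:9])
```

The resulting vector has the same length, $3K$ , as the input used during the centralized training, so the trained DNN can be applied without modification; the node would then use only its own entry of the output power vector.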

Note that, for this distributed DNN approach, unlike MF-MARL where the policy is updated at each iteration, the policy is fixed for the entire duration of the operation. However, similar to MF-MARL, the distribution $\boldsymbol{\pi}$ is estimated and updated according to (16), and is periodically broadcast by the AP. Thus, for this policy, given a fixed trained DNN at each node, only the distribution $\boldsymbol{\pi}$ evolves over time, and it is guaranteed to converge while operating under a fixed policy. This follows from the finite state space of the game which, under a fixed policy, results in a positive recurrent Markov chain, provided the EH process and the wireless channel are stationary and ergodic.

Also, we explicitly observe that, although distributed, this approach still requires the generation of a training set for the off-line training of the DNN, which in turn requires knowing several realizations of the channel and EH processes for all nodes beforehand. In contrast, this is not required by the proposed MF-MARL method. In the following, we present the numerical results.

VII Numerical Results

We consider an EH MAC with $K=5$ EH transmitters where each EHN harvests energy according to a non-negative truncated Gaussian distribution with mean $m$ and variance $v=3.5$ , independently of the other nodes. The capacity of the battery at each transmitter is $B_{\max}=20$ and the maximum amount of energy allowed to be used for transmission in a slot is $P_{\max}=15$ . Note that, the unit of energy is $10^{-2}$ J. In the following, we first describe the architecture of the DNN and the training setup used for learning the centralized and DQN policy.

VII-A DNN Architecture and Training

To learn the centralized policy, we use a DNN with an input and output layer containing $3K$ and $K$ neurons, respectively. It consists of $30$ hidden layers, with the first hidden layer containing $30K$ neurons. Each subsequent odd indexed hidden layer contains the same number of neurons as the previous even indexed layer, i.e., $N_{j}=N_{j-1}$ for odd $j\in\{3,\ldots,31\}$ . For each even indexed hidden layer the number of neurons is decreased by $2K$ , i.e., $N_{j}=N_{j-1}-2K$ for even $j\in\{4,\ldots,30\}$ . We note that the input layer has the index $1$ , and the indices of the first hidden layer and the output layer are $2$ and $32$ , respectively. The activation function used is the leaky rectified linear unit (leaky ReLU). To train the network, we use the mean-square error as the loss function. Training data is generated by solving $10^{4}$ instantiations of the offline problem with the horizon length $N=20$ . Thus, the training dataset contains $2\times 10^{5}$ data points, out of which $40000$ data points are used for validation. The performance is evaluated by computing the rate per slot (RPS) over $10^{6}$ slots. For these $10^{6}$ slots, instantiations of the EH process and the channel are generated independently of the training data.
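The layer widths implied by the above description can be generated programmatically, which also serves as a consistency check on the indexing. The sketch below follows the stated rules literally; the function name `centralized_layer_widths` and its parameter names are hypothetical.

```python
def centralized_layer_widths(K, num_hidden=30, first_hidden_factor=30, step_factor=2):
    """Widths of layers 1..(num_hidden + 2) of the centralized DNN.

    Layer 1 is the input layer (3K neurons), layer 2 the first hidden layer
    (30K neurons) and the last layer the output layer (K neurons). Odd
    indexed hidden layers copy the previous width; even indexed hidden
    layers (from layer 4 onward) shrink it by 2K.
    """
    widths = [3 * K, first_hidden_factor * K]  # layers 1 and 2
    for j in range(3, num_hidden + 2):         # hidden layers 3..31
        previous = widths[-1]
        widths.append(previous if j % 2 == 1 else previous - step_factor * K)
    widths.append(K)                           # output layer
    return widths


# Usage for the simulated network with K = 5 transmitters.
widths = centralized_layer_widths(5)
print(len(widths), widths[:5], widths[-3:])
```

With these rules, the last two hidden layers contain $2K$ neurons each, so the width shrinks from $30K$ to $2K$ over the $30$ hidden layers before the $K$ output neurons.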

At each node, both the deep Q network and the fixed target network consist of $10$ hidden layers and one input and one output layer. The input layer contains $3$ neurons, while the number of neurons in the output layer is equal to $|\mathcal{A}|=151$ , where $\mathcal{A}=\{0,0.1,0.2,\ldots,15\}$ . The first, third, fifth, seventh, and ninth hidden layers consist of $60,58,56,54,$ and $52$ neurons, respectively. As for the DNN architecture used in the centralized approach, the number of neurons in each even indexed hidden layer remains the same as in the previous odd indexed hidden layer. At each layer, except the output layer, the rectified linear unit (ReLU) is used as the activation function. The output layer uses a linear activation function, motivated by the fact that using an activation function that applies cut-off values could result in low training errors simply because the output power would be artificially constrained to lie in the interval $[0,P_{\max}]$ and not as a result of a proper configuration of the hidden layers. Instead, a linear output activation function allows the DNN to learn whether the adopted configuration of the hidden layers is truly leading to a small error or whether it still needs to be adjusted through further training.
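The quantized action set and the mapping from the DQN output to a transmit energy can be sketched as follows. The grid and the layer widths follow the values stated above; the greedy selection rule and the names `action_grid`, `greedy_power`, and `dqn_layer_widths` are illustrative assumptions, and feasibility with respect to the battery level is not modeled here.

```python
def action_grid(p_max=15.0, num_levels=151):
    """Uniformly quantized transmit energies 0, 0.1, ..., 15 (151 levels)."""
    return [i * p_max / (num_levels - 1) for i in range(num_levels)]


def greedy_power(q_values, actions):
    """Transmit energy whose output neuron has the largest Q-value."""
    if len(q_values) != len(actions):
        raise ValueError("one Q-value per action is required")
    best = max(range(len(q_values)), key=q_values.__getitem__)
    return actions[best]


def dqn_layer_widths():
    """Input layer, ten hidden layers in pairs of equal width, output layer."""
    hidden = [w for w in (60, 58, 56, 54, 52) for _ in range(2)]
    return [3] + hidden + [151]


# Usage: pick the energy for a made-up Q-value vector peaking at index 42.
actions = action_grid()
q = [0.0] * 151
q[42] = 1.0
print(len(actions), greedy_power(q, actions), dqn_layer_widths())
```

Since the output is restricted to this grid, the achievable transmit energies are multiples of $0.1$ , which is the source of the quantization loss of the DQN relative to a policy with continuous outputs.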

The deep Q-learning algorithm uses $\gamma=0.99$ , and uses the exploration probability $\epsilon_{\max}=1$ at the start which decays to $\epsilon_{\min}=0.01$ with a decay factor equal to 0.995. The replay memory of length $2000$ is used. For all the experiments, DQN is trained with a batch size equal to 32. In Algorithm 1, we use $\epsilon_{1}=0.01$ , $\tilde{\epsilon}=0.001$ , and update frequency $T=1000$ .
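The exploration schedule and the replay memory described above can be sketched as follows. The sketch assumes that the decay factor is applied multiplicatively once per learning step, which the text does not specify; the names `epsilon_schedule`, `steps_to_floor`, and `make_replay_memory` are hypothetical.

```python
from collections import deque
from itertools import islice


def epsilon_schedule(eps_max=1.0, eps_min=0.01, decay=0.995):
    """Yield the exploration probability, decayed geometrically down to eps_min."""
    eps = eps_max
    while True:
        yield eps
        eps = max(eps_min, eps * decay)


def steps_to_floor(eps_max=1.0, eps_min=0.01, decay=0.995):
    """Number of decay steps until the exploration probability reaches eps_min."""
    eps, steps = eps_max, 0
    while eps > eps_min:
        eps *= decay
        steps += 1
    return steps


def make_replay_memory(capacity=2000):
    """Fixed-length buffer that discards the oldest transition when full."""
    return deque(maxlen=capacity)


# Usage: inspect the first few exploration probabilities and the decay length.
print(list(islice(epsilon_schedule(), 3)))
print(steps_to_floor())
```

Under this per-step assumption, the exploration probability reaches its floor after a number of steps comparable to the update frequency $T=1000$ used in Algorithm 1.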

VII-B Performance of Centralized Policy

We first benchmark the performance of the proposed DNN based centralized online policy, against the performance of the optimal offline policy proposed in [6]. In the centralized scheme, the online policy is learned by training a deep neural network using the data obtained by jointly optimal offline policies [6].

Table II shows the performance of the proposed DNN based policy. The last column of the table presents the RPS as a percentage of the throughput achieved by the offline policy. It can be observed that the proposed policy achieves roughly $90\%$ of the throughput obtained by the offline policy. We note that, since an offline policy is designed using non-causal information, the proposed policy cannot achieve the throughput obtained by the optimal offline policy. Note that the MDP formulation of this problem is computationally intractable, due to a state space whose size is of the order of $10^{12}$ , even with the channel gains quantized to just $8$ levels.

Table III compares the performance of the proposed DNN based policy against deep Q-learning and the MDP, for point-to-point links, i.e., $K=1$ , with mean $m=10$ . The proposed DNN based policy achieves approximately $98$ % of the time-averaged throughput achieved by the offline policy. It is interesting to note that the throughput achieved by the proposed DNN based policy is marginally better than the throughput achieved by the online policy designed using deep Q-learning. Also, the proposed DNN based policy outperforms the MDP based policy, which achieves only approximately $84$ % of the throughput achieved by the offline policy. Theoretically, an online policy designed using the MDP achieves the optimal performance. However, the performance of the MDP policy degrades due to the quantization of the state and action spaces. We note that the computational complexity of solving an MDP increases in direct proportion to the number of quantization levels used for the state and action spaces. On the other hand, the proposed DNN based policy operates with continuous state and action spaces. In contrast, as shown in Table IV, while the DQN uses a continuous state at the input, the output of the DQN is quantized which, in turn, results in a performance loss compared to the DNN-based policy. Next, we compare the performance of the proposed MF-MARL and cooperative Q-learning approaches against the DNN based centralized and distributed policies.

VII-C Performance of Distributed Policies

As observed from the results in Table V, the policies obtained using the proposed MF-MARL based approach achieve a sum-throughput which is close to the throughput achieved by the centralized policies. However, in contrast to the proposed approaches, the centralized online policy requires information about the state of all the nodes in the network. Note that, as shown in Table IV, in order to implement MF-MARL (or deep Q-learning), the action space, $\mathcal{A}$ , has to be quantized, which leads to a loss in throughput compared to the centralized scheme, where the output transmit powers are continuous. We observe that the proposed MF-MARL based approach performs marginally better than the cooperative multi-agent Q-learning based scheme. Moreover, in contrast to the cooperative multi-agent Q-learning approach, the MF-MARL based procedure requires significantly less feedback. Also, it is interesting to note that the proposed MF-MARL algorithm achieves near-optimal throughput even for a network with a small number of nodes.

Furthermore, we note that the distributed DNN approach proposed in Sec. VI also achieves a throughput competitive with that of the MF-MARL approach. Both the MF-MARL and the distributed DNN approaches use the distribution vector $\boldsymbol{\pi}$ for their operation. The distribution vector $\boldsymbol{\pi}$ is estimated using the empirical distribution over the discretized (or quantized) state space. In the distributed DNN approach, each EHN constructs the input vector to the DNN by sampling the states of the other nodes from the distribution $\boldsymbol{\pi}$, so the input states for the other nodes are essentially sampled from the quantized state space (see Table IV). This is in contrast to the MF-MARL approach, where the input states are continuous variables but the output transmit power variables are quantized. Thus, similar to the MF-MARL algorithm, the distributed DNN approach also has a performance gap from the centralized policy due to quantization, and this gap reduces with a finer quantization. In the following, we study the impact of hyperparameters, such as the update frequency and the replay buffer size, on the performance of the MF-MARL approach. We also present results showing the speed of convergence of the MARL approaches.
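The two uses of $\boldsymbol{\pi}$ described above, estimating it as an empirical distribution over a quantized state space and sampling the other nodes' states from it, can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the scalar state, the quantization levels, and the function names are all hypothetical.

```python
import random
from collections import Counter

def quantize(value, levels):
    """Index of the quantization level nearest to a continuous value."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - value))

def empirical_distribution(observed_states, levels):
    """Fraction of the observed states that falls on each quantization level."""
    counts = Counter(quantize(s, levels) for s in observed_states)
    total = len(observed_states)
    return [counts[i] / total for i in range(len(levels))]

def sample_other_states(pi, levels, num_others, rng):
    """Draw quantized states for the other nodes from the distribution pi."""
    return rng.choices(levels, weights=pi, k=num_others)

# Hypothetical scalar states (e.g., battery levels) and quantization grid.
levels = [0.0, 0.5, 1.0, 1.5, 2.0]
observed = [0.1, 0.4, 0.6, 1.9, 2.0, 1.1, 0.45, 0.55]

pi = empirical_distribution(observed, levels)
others = sample_other_states(pi, levels, num_others=4, rng=random.Random(0))
```

Because the samples are drawn from `levels`, the sampled states of the other nodes always lie on the quantization grid, which is the source of the quantization gap noted above.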

Further, the result in Fig. 3 illustrates the performance of the proposed MF-MARL approach for a fading EH MAC with $K=20$ users. In this scenario, the performance of the network is constrained by the limited capacity of the battery attached to each node. Our simulations show that, even with $K=20$ nodes, the MF-MARL approach is able to learn the policies in a completely distributed fashion and converges to a stable throughput. This can be seen from the fact that, for $m=0.01$, the sum-throughput increases with the variance, i.e., with the energy availability.

VII-D Convergence and Effect of Hyperparameters

Further, the results in Fig. 4 show the throughput achieved by our MF-MARL algorithm as a function of the slot index. It is interesting to observe that the MF-MARL algorithm converges very fast: for $m=7$ and $m=8$, the throughput comes within $99\%$ of its final value within the first $1000$ slots. For $m=5$ and $m=9$, the MF-MARL learns at a relatively slower pace, yet within the first $5000$ slots it achieves a throughput close to $95\%$ of the final value. A similar trend is observed for cooperative Q-learning.
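The convergence figures quoted above correspond to the first slot at which the throughput curve reaches a given fraction of its final value. A minimal sketch of that measurement is given below; the function name and the toy curve are hypothetical and are not data from the paper.

```python
def slots_to_reach(throughput, fraction):
    """First slot (1-based) at which the throughput reaches `fraction`
    of its final value, or None if it never does."""
    target = fraction * throughput[-1]
    for slot, value in enumerate(throughput, start=1):
        if value >= target:
            return slot
    return None

# Toy throughput curve indexed by slot (illustrative values only).
curve = [0.2, 0.5, 0.8, 0.95, 0.99, 1.0, 1.0]

slot_95 = slots_to_reach(curve, 0.95)
slot_99 = slots_to_reach(curve, 0.99)
```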

The results in Fig. 5 illustrate the impact of the size of the replay buffer on the performance of the MF-MARL algorithm. The plot shows that the replay buffer size has a threshold effect on the throughput achieved by the MF-MARL algorithm: a small replay buffer prevents the algorithm from converging to the optimal throughput, whereas beyond a sufficient size the throughput does not improve further.

The result in Fig. 6 shows the impact of the parameter $T$ in Algorithm 1 on the sum-throughput achieved by the distributed policies. Recall that $T$ determines how often the AP broadcasts updates of the estimate of the distribution $\boldsymbol{\pi}$, and is hence termed the update duration. For the MF-MARL, the simulations show that a longer update duration results in an improved throughput. This is because estimates obtained by computing the empirical distribution over a larger number of slots are more accurate which, for the MF-MARL, leads to better estimates of the reward at each individual node. Consequently, it aids the learning of the Q-function and leads to a better DQN approximation. In contrast, the performance of the distributed DNN approach is relatively independent of the update duration. This is because the distributed DNN approach does not use the estimate of $\boldsymbol{\pi}$ for learning the policy, i.e., in the distributed DNN approach $\boldsymbol{\pi}$ is used only for sampling the states of the other nodes. Also, from (16), regardless of the update frequency, the iterative estimates of the distribution $\boldsymbol{\pi}$ converge over time, and the states of the other nodes are then sampled from the correct distribution. For the MF-MARL, on the other hand, the estimates of $\boldsymbol{\pi}$ are critical for learning, so estimation inaccuracies jeopardize the learning procedure and may adversely affect the sum-throughput.
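The replay buffer whose size is varied in Fig. 5 can be pictured as a fixed-capacity store of past transitions from which training batches are drawn at random. The sketch below is a generic experience-replay buffer, not the paper's code: the class name, the transition layout, and the uniform sampling rule are assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest entry when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Uniformly sample up to `batch_size` stored transitions."""
        size = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), size)

    def __len__(self):
        return len(self.buffer)

# With capacity 3, only the three most recent transitions are retained.
buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.add(state=t, action=0, reward=0.0, next_state=t + 1)
batch = buf.sample(2, rng=random.Random(0))
```

A very small capacity means each batch is drawn from only the most recent, highly correlated transitions, which is one plausible reading of the threshold effect reported above.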

The results in Fig. 7 illustrate the variation in the sum-throughput achieved by the MF-MARL as a function of the number of transmitters in the network. As expected, the sum-throughput of the network increases with both the number of transmitters in the network, $K$, and the mean of the harvesting process, $m$. This shows that the proposed MF-MARL approach can learn effectively even in a network where the number of transmitters is large.

VIII Conclusions

In this paper, we proposed a mean-field multi-agent reinforcement learning based framework to learn the optimal power control that maximizes the throughput of a large EH MAC. First, we modeled the throughput maximization problem as a discrete-time MFG and analytically established that the game has a unique stationary solution. We then proposed a reinforcement learning based procedure to learn the stationary solution of the game and established the convergence of the proposed procedure. Next, to benchmark the performance of the distributed power control policies obtained using the proposed MF-MARL framework, we also developed a DNN based centralized online power control scheme. The centralized power control approach learns the optimal online decision rule using data obtained from the solution of the offline policies. The numerical results demonstrated that both the centralized and the distributed power control schemes achieve a throughput close to the optimal.

Appendix A Proof of Theorem 3

Proof.

The proof follows directly from the result in Theorem 2, provided there exists a unique Nash maximizer and the reward function is monotone in the variable $\boldsymbol{\pi}$. The uniqueness of the Nash maximizer can be established using the result in Theorem 1, since it is easy to verify that the reward and value functions of the game $\mathcal{G}_{T}$ satisfy the strict diagonal concavity property. To complete the proof, we only need to show that the reward function is monotone in the parameter $\boldsymbol{\pi}$, i.e.,

[TABLE]

The proof follows by noting that the reward obtained by a node does not depend on the state of the node, i.e., $\mathcal{R}_{i}(\mathcal{P},\pi^{2})=\mathcal{R}(\mathcal{P},\pi^{2})$. Hence, the RHS of the above can be expressed as $(\mathcal{R}(\mathcal{P},\pi^{2})-\mathcal{R}(\mathcal{P},\pi^{1}))\left(\sum_{i=1}^{d}\pi_{i}^{1}-\sum_{i=1}^{d}\pi_{i}^{2}\right)=0$, since both $\pi^{1}$ and $\pi^{2}$ are probability vectors whose entries sum to one. ∎
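For readers who want the last step spelled out, the factorization can be written as below. The form of the sum on the left, including the order of the differences, is an assumption based on the usual monotonicity condition for finite-state MFGs; the conclusion that the sum vanishes does not depend on that ordering.

```latex
\begin{align}
\sum_{i=1}^{d}\left(\pi_{i}^{1}-\pi_{i}^{2}\right)
  \left(\mathcal{R}_{i}(\mathcal{P},\pi^{2})-\mathcal{R}_{i}(\mathcal{P},\pi^{1})\right)
&= \left(\mathcal{R}(\mathcal{P},\pi^{2})-\mathcal{R}(\mathcal{P},\pi^{1})\right)
   \sum_{i=1}^{d}\left(\pi_{i}^{1}-\pi_{i}^{2}\right) \\
&= \left(\mathcal{R}(\mathcal{P},\pi^{2})-\mathcal{R}(\mathcal{P},\pi^{1})\right)(1-1) \\
&= 0 .
\end{align}
```

The first equality uses $\mathcal{R}_{i}=\mathcal{R}$ for every state $i$, and the second uses $\sum_{i=1}^{d}\pi_{i}^{1}=\sum_{i=1}^{d}\pi_{i}^{2}=1$.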

Bibliography


[1] M. K. Sharma, A. Zappone, M. Debbah, and M. Assaad, “Deep learning based online power control for large energy harvesting networks,” in Proc. ICASSP, May 2019.
[2] M. K. Sharma, A. Zappone, M. Debbah, and M. Assaad, “Multi-agent deep reinforcement learning based power control for large energy harvesting networks,” in Proc. 17th Int. Symp. on Modeling and Optim. in Mobile, Ad Hoc, and Wireless Networks (WiOpt), 2019.
[3] M. Centenaro, L. Vangelista, A. Zanella, and M. Zorzi, “Long-range communications in unlicensed bands: the rising stars in the IoT and smart city scenarios,” IEEE Wireless Commun. Mag., vol. 23, no. 5, pp. 60–67, Oct. 2016.
[4] M. L. Ku, W. Li, Y. Chen, and K. J. R. Liu, “Advances in energy harvesting communications: Past, present, and future challenges,” IEEE Commun. Surveys Tuts., vol. 18, no. 2, pp. 1384–1412, Second Quarter 2016.
[5] K. Tutuncuoglu and A. Yener, “Optimum transmission policies for battery limited energy harvesting nodes,” IEEE Trans. Wireless Commun., vol. 11, no. 3, pp. 1180–1189, Mar. 2012.
[6] Z. Wang, V. Aggarwal, and X. Wang, “Iterative dynamic water-filling for fading multiple-access channels with energy harvesting,” IEEE J. Sel. Areas Commun., vol. 33, no. 3, pp. 382–395, Mar. 2015.
[7] M. K. Sharma and C. R. Murthy, “Distributed power control for multi-hop energy harvesting links with retransmission,” IEEE Trans. Wireless Commun., vol. 17, no. 6, pp. 4064–4078, Jun. 2018.
[8] A. Baknina and S. Ulukus, “Energy harvesting multiple access channels: Optimal and near-optimal online policies,” IEEE Trans. Commun., vol. 66, no. 7, pp. 2904–2917, Jul. 2018.
