On Learning Intrinsic Rewards for Faster Multi-Agent Reinforcement Learning based MAC Protocol Design in 6G Wireless Networks

Luciano Miuccio; Salvatore Riolo; Mehdi Bennis; and Daniela Panno

arXiv:2302.14765·cs.NI·March 1, 2023

TL;DR

This paper introduces a novel multi-agent reinforcement learning framework with intrinsic rewards for designing faster-converging MAC protocols in 6G wireless networks, significantly improving convergence speed and transmission performance.

Contribution

It proposes a new intrinsic reward learning method using LSTM networks for multi-agent MAC protocol design, enhancing convergence speed and performance.

Findings

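01 75% fewer training episodes to reach convergence than the best-performing baseline (Extrinsic-NPS) when $P=2$

02 Maximum delivery success rate of 81% in simulations, about 4% better than Extrinsic-NPS

03 Effective coordination among agents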

Tables

Table I: Training algorithm parameters

Parameter	Symbol	Value
Number of UEs	$N$	2
Discount factor	$\gamma$	0.99
Memory length	$M$	3
Duration of episode	$T_{\text{ep}}$	32
Number of episodes per lifetime	$N_{\text{ep}}$	250
Number of dPDUs to deliver	$P$	$\{1, 2\}$
Balancing parameter	$\lambda$	1
Act. function, intrinsic reward network		$\{i, t\}^{a}$
Learning rate, intrinsic reward network	$\beta$	$7\cdot 10^{-4}$
Neurons per layer, intrinsic reward network		128
Act. function per layer, policy network		$\{t, t, s\}^{a}$
Learning rate, policy network	$\alpha$	$3\cdot 10^{-4}$
Neurons per layer, policy network		64

Equations

$$G_{\text{ep},\text{ext}}=\sum_{t=0}^{T_{\text{ep}}-1}\gamma^{t}R_{\text{ext}}^{t+1},$$

$$J_{\text{ep},\text{ext}}(\bm{\pi})=\mathbb{E}_{\bm{o}^{0},\bm{a}^{0},\dots,\bm{o}^{T_{\text{ep}}-1},\bm{a}^{T_{\text{ep}}-1}}\left[G_{\text{ep},\text{ext}}\right],$$

$$R_{\text{ov},i}^{t+1}=R_{\text{ext}}^{t+1}+\lambda R_{\text{in},\eta_{i}}^{t+1},$$

$$G_{\text{ep},\text{ov},i}^{(k)}=\sum_{t=0}^{T_{\text{ep}}-1}\gamma^{t}R_{\text{ov},i}^{t+1}.$$

$$G_{\text{life},\text{ext}}=\sum_{t=0}^{T-1}\gamma^{t}R_{\text{ext}}^{t+1},$$

$$T_{\text{E},i}^{(k)}=\left\{\left(o_{i}^{t},a_{i}^{t},\pi_{\theta_{i}^{(k-1)}}\left(a_{i}^{t}\mid o_{i}^{t}\right),R_{\text{ext}}^{t+1},R_{\text{in},\eta_{i}}^{t+1}\right)\right\}_{t=0}^{T_{\text{ep}}-1},$$

$$T_{\text{L},i}=\left\{T_{\text{E},i}^{(k)}\right\}_{k=1}^{N_{\text{ep}}}.$$

$$J_{\text{ep},\text{ov},i}^{(k-1)}=\mathbb{E}_{o_{i}^{0},a_{i}^{0},\dots,o_{i}^{T_{\text{ep}}-1},a_{i}^{T_{\text{ep}}-1}}\left[G_{\text{ep},\text{ov},i}^{(k-1)}\right],$$

$$\theta_{i}^{(k)}=\theta_{i}^{(k-1)}+\alpha\nabla_{\theta_{i}^{(k-1)}}J_{\text{ep},\text{ov},i}^{(k-1)}.$$

$$\theta_{i}^{(k)}\approx\theta_{i}^{(k-1)}+\alpha G_{\text{ep},\text{ov},i}^{(k-1)}\nabla_{\theta_{i}^{(k-1)}}\log\pi_{\theta_{i}^{(k-1)}}\left(a_{i}^{t}\mid o_{i}^{t}\right).$$

$$J_{\text{life},\text{ext}}=\mathbb{E}_{o_{i}^{0},a_{i}^{0},\dots,o_{i}^{T-1},a_{i}^{T-1}}\left[G_{\text{life},\text{ext}}\right],$$

$$\eta_{i}^{\prime}=\eta_{i}+\beta\nabla_{\eta_{i}}J_{\text{life},\text{ext}}.$$

$$\nabla_{\eta_{i}}J_{\text{life},\text{ext}}=\nabla_{\theta_{i}^{(N_{\text{ep}})}}J_{\text{life},\text{ext}}\,\nabla_{\eta_{i}}\theta_{i}^{(N_{\text{ep}})}.$$

$$\nabla_{\theta_{i}^{(N_{\text{ep}})}}J_{\text{life},\text{ext}}\approx G_{\text{life},\text{ext}}\nabla_{\theta_{i}^{(N_{\text{ep}})}}\log\pi_{\theta_{i}^{(N_{\text{ep}})}}\left(a_{i}^{t}\mid o_{i}^{t}\right).$$

$$\nabla_{\theta_{i}^{(N_{\text{ep}})}}J_{\text{life},\text{ext}}\approx G_{\text{life},\text{ext}}\frac{\nabla_{\theta_{i}^{(N_{\text{ep}})}}\pi_{\theta_{i}^{(N_{\text{ep}})}}\left(a_{i}^{t}\mid o_{i}^{t}\right)}{\pi_{\theta_{i}^{(k)}}\left(a_{i}^{t}\mid o_{i}^{t}\right)}.$$


Full text

Luciano Miuccio$^{1}$, Salvatore Riolo$^{1}$, Mehdi Bennis$^{2}$, and Daniela Panno$^{1}$

emails: [email protected], {salvatore.riolo, daniela.panno}@unict.it, [email protected]

$^{1}$ Department of Electrical, Electronics and Computer Engineering, University of Catania, Italy

$^{2}$ Centre for Wireless Communications, University of Oulu, Finland

Abstract

In this paper, we propose a novel framework for designing a fast convergent multi-agent reinforcement learning (MARL)-based medium access control (MAC) protocol operating in a single cell scenario. The user equipments (UEs) are cast as learning agents that need to learn a proper signaling policy to coordinate the transmission of protocol data units (PDUs) to the base station (BS) over shared radio resources. In many MARL tasks, the conventional centralized training with decentralized execution (CTDE) is adopted, where each agent receives the same global extrinsic reward from the environment. However, this approach involves a long training time. To overcome this drawback, we adopt the concept of learning a per-agent intrinsic reward, in which each agent learns a different intrinsic reward signal based solely on its individual behavior. Moreover, in order to provide an intrinsic reward function that takes into account the long-term training history, we represent it as a long short-term memory (LSTM) network. As a result, each agent updates its policy network considering both the extrinsic reward, which characterizes the cooperative task, and the intrinsic reward, which reflects local dynamics. The proposed learning framework yields faster convergence and higher transmission performance compared to the baselines. Simulation results show that the proposed learning solution yields a 75% improvement in convergence speed compared to the best-performing baseline.

Index Terms:

6G, Intrinsic reward learning, MARL, Protocol learning.

I Introduction

The emergence of data-driven medium access control (MAC) protocols can provide a cost-effective, flexible approach to boost the performance of beyond 5G (B5G) and 6G networks. To design such protocols, multi-agent reinforcement learning (MARL) methods enable agents to learn an optimal policy by interacting in the same environment [1]. Recent works, such as [2, 3], have studied MAC protocol learning in a single cell scenario, where user equipments (UEs) need to deliver MAC protocol data units (PDUs) to the base station (BS) over the same shared radio channel. UEs are cast as reinforcement learning (RL) agents that are trained to learn a new MAC protocol from their partial observations of the global state.

However, despite the good performance shown at the end of the training procedure, learning efficient and robust MAC protocols with multiple agents acting and learning in the same shared environment requires a very long training time. This prevents the approach from being applied to dynamic wireless environments, which require retraining to adapt the MAC protocol to changing conditions. The main causes of slow training are the partial observability and non-stationarity of the MARL problem (i.e., transitions from one state to another depend on the actions of all agents) [4]. In addition, [2, 3] are based on the conventional centralized training and decentralized execution (CTDE) paradigm and on the parameter sharing technique [5], which further slow down convergence.

Specifically, on the one hand, CTDE allows agents to learn their local policies in a centralized way while retaining the ability of decentralized execution. During the training phase, the environment assigns the same global reward to all agents without distinguishing their individual contributions. As a consequence, even when only a subset of agents contributes to the reward, every agent receives it, so during training an agent may be punished even if it has acted optimally, or rewarded even if it has acted wrongly. Clearly, this approach induces slow and unstable policy learning.

On the other hand, the parameter sharing technique consists of simultaneously learning a single shared policy for multiple agents, which boosts scalability. However, UEs may compete at the same time to transmit their own packets over the same shared channel. In other words, even if two UEs have the same observation, the action taken by one UE should differ from the action taken by the other to avoid interference. Updating the same policy for every agent can therefore produce adverse actions that slow down convergence during learning.

In light of the above considerations, the main contribution of this paper is to propose a novel framework that provides faster convergence to the MAC protocol learning problem. Specifically, we consider the same communication scenario studied in [2, 3] and introduce our innovative learning framework with the following features.

First, we adopt an enhanced version of the CTDE paradigm which, in addition to the global reward signal, leverages a different local intrinsic reward signal for each agent, based on its individual behavior. This idea is inspired by the intrinsic reward learning introduced in [6, 7] for single-agent environments. Unlike the global reward given by the environment (termed the “extrinsic reward”), which is hand-designed, the intrinsic reward is automatically learned by each agent.

Second, the proposed solution avoids the use of the parameter sharing technique and, instead, considers that each agent has two independent modules, namely, a policy network and an intrinsic reward network. The policy network learns the optimal policy per agent, while the intrinsic reward network provides an additional reward signal to the policy network.

Simulation results show that, in complex scenarios and adopting the multi-agent proximal policy optimization (MAPPO) algorithm, the proposed learning framework yields a $75\%$ improvement in convergence speed and about a $4\%$ improvement in transmission performance compared to conventional CTDE without the parameter sharing technique, and even better results with respect to a baseline combining CTDE and parameter sharing.

The rest of the article is organized as follows. The system model and the formalization of the cooperative MARL problem are described in Section II. The proposed approach is detailed in Section III. Finally, the numerical simulation results and the conclusions are presented in Section IV and Section V, respectively.

II System Model and MARL Formulation

Consider a single BS serving a set $\mathcal{N}$ of $N$ homogeneous UEs needing to deliver $P$ MAC PDUs to the BS. The network nodes exchange control messages encapsulated inside signaling PDUs (sPDUs) through the downlink (DL) and uplink (UL) control channels, which are assumed to be dedicated and error free. As regards data transmission, UEs send data PDUs (dPDUs) using the same physical uplink shared channel (PUSCH) operating according to a time division multiple access (TDMA) scheme, which leads to possible collisions. Specifically, for each time step $t$ a UE can send one dPDU, and this dPDU is successfully received by the BS only if a single UE out of $N$ has transmitted it.

Control plane: let $\mathcal{M}_{\text{UE,s}}=\{0,1\}$ be the set of possible messages sent by the UEs, and $\mathcal{M}_{\text{BS}}=\{0,1,2\}$ be the set of DL messages. At each time step $t$, the BS can send to each UE $i\in\mathcal{N}$ only one message $m_{i}^{t}\in\mathcal{M}_{\text{BS}}$, and each UE $i$ can send one signaling message $a_{i,\text{s}}^{t}\in\mathcal{M}_{\text{UE,s}}$ to the BS. Specifically, $m_{i}^{t}=2$ represents an acknowledgement (ACK) message confirming that a dPDU sent from UE $i$ was correctly received at the BS in the previous time step $t-1$, $m_{i}^{t}=1$ is a scheduling grant message to UE $i$, and $m_{i}^{t}=0$ indicates that no access is granted to UE $i$. As for UEs, $a_{i,\text{s}}^{t}=1$ means sending an access request to reserve time step $t+1$ for transmission, while $a_{i,\text{s}}^{t}=0$ means that no signaling message is sent.

Data plane: each UE $i$ has a dPDU storage capability, modeled as a buffer with a first-in first-out (FIFO) policy, which contains at most $P$ dPDUs. We denote with $b_{i}^{t}\in\mathcal{B}=\{0,1,\dots,P\}$ the buffer status at time $t$, and we assume that the buffer starts full. At each time step $t$, UE $i$ can transmit a dPDU or delete it. Specifically, the data plane action is denoted as $a_{i,\text{u}}\in\mathcal{M}_{\text{UE,u}}=\{0,1,2\}$, where $a_{i,\text{u}}=1$ means that the UE transmits the first dPDU in its buffer (if any), $a_{i,\text{u}}=2$ means that it deletes the first dPDU in the buffer, and $a_{i,\text{u}}=0$ means that it does nothing.

We assume that the BS is a MAC expert agent, i.e., it adopts a MAC protocol that is not learned. In detail, at each time step $t$, if the BS receives multiple scheduling requests from the UEs, it chooses one of the requesting UEs at random and sends it a scheduling grant in response. If a UE has made a successful data transmission concurrently with its scheduling request, the BS ignores this scheduling request and sends only an ACK message to that UE.
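The channel and BS behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' simulator: the function name `channel_and_bs_step` is hypothetical, buffers are not modeled (every UE choosing action 1 is assumed to have a dPDU to send), and a single pending request is assumed to be granted just like the winner among several.

```python
import random

def channel_and_bs_step(data_actions, sig_actions, rng=random):
    """One time step of the shared uplink channel and the non-learned BS.

    data_actions[i] in {0, 1, 2}: 0 = idle, 1 = transmit, 2 = delete.
    sig_actions[i]  in {0, 1}:    1 = send a scheduling (access) request.
    Returns (receiver, messages): receiver is the index of the UE whose
    dPDU was received (or None), and messages[i] in {0, 1, 2} is the DL
    message the BS sends to UE i in response.
    """
    n = len(data_actions)
    transmitters = [i for i in range(n) if data_actions[i] == 1]
    # A dPDU is received only if exactly one UE transmitted in this step.
    receiver = transmitters[0] if len(transmitters) == 1 else None

    messages = [0] * n
    if receiver is not None:
        messages[receiver] = 2  # ACK for the successful transmission
    # A request sent together with a successful transmission is ignored.
    requesters = [i for i in range(n)
                  if sig_actions[i] == 1 and i != receiver]
    if requesters:
        messages[rng.choice(requesters)] = 1  # scheduling grant
    return receiver, messages
```

For example, with two UEs both transmitting, `channel_and_bs_step([1, 1], [0, 0])` models a collision: no dPDU is received and both UEs get message 0.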

II-A Multi-Agent Reinforcement Learning Formulation

The goal is to find the optimal MAC protocol adopted by UEs that maximizes the number of unique dPDUs successfully received by the BS, while minimizing the time spent to do so. To reach this goal, we propose to cast the UEs as MAC learning agents and the protocol learning problem as a cooperative multi-agent partially observable Markov decision process (MPOMDP). The system can be described as a tuple $\langle\mathcal{N},\mathcal{A},\mathcal{S},\mathcal{O},\pi_{i},R_{\text{ext}},\gamma\rangle$. Let $\mathcal{N}$ denote the set of $N$ homogeneous learning agents (i.e., UEs). Each agent $i\in\mathcal{N}$ at time step $t$ has a partial observation of the global state defined as $o_{i}^{t}=(b_{i}^{t},b_{i}^{t-1},a_{i}^{t-1},m_{i}^{t-1},\dots,b_{i}^{t-M},a_{i}^{t-M},m_{i}^{t-M})$, where $M$ is the memory length. Accordingly, let ${a}_{i}^{t}=(a_{i,\text{u}}^{t},a_{i,\text{s}}^{t})$ indicate the action taken by agent $i$, which involves both the data and the control plane. All agents share the same observation and action spaces, denoted as $\mathcal{O}$ and $\mathcal{A}$, respectively. Clearly, $\mathcal{A}=\mathcal{M}_{\text{UE,u}}\times\mathcal{M}_{\text{UE,s}}=\{A_{1},\dots,A_{|\mathcal{A}|}\}$. Let $\pi_{i}\left(a_{i}^{t}\mid o_{i}^{t}\right)\colon\mathcal{O}\times\mathcal{A}\rightarrow[0,1]$ be a stochastic policy for agent $i$, that is, the probability of choosing action $a_{i}$ given that agent $i$ is observing $o_{i}$. For the sake of clarity, we also introduce $\bm{o}^{t}=[o_{1}^{t},\dots,o_{N}^{t}]$, $\bm{a}^{t}=[a_{1}^{t},\dots,a_{N}^{t}]$, and $\bm{\pi}=[\pi_{1},\dots,\pi_{N}]$.
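To make the sizes concrete, the action space and the observation tuple defined above can be written out as follows. This is a sketch under assumptions: the helper `make_observation` is hypothetical, and the per-step histories are assumed to already contain at least $M$ past entries (how the first steps of an episode are padded is not specified in the text).

```python
from itertools import product

M_UE_U = (0, 1, 2)  # data-plane actions: idle, transmit, delete
M_UE_S = (0, 1)     # control-plane actions: silent, access request
ACTIONS = list(product(M_UE_U, M_UE_S))  # A = M_UE,u x M_UE,s

def make_observation(buffers, actions, messages, M):
    """Build o_i^t for one UE from its histories (most recent entry last).

    buffers holds b_i^{t-M}, ..., b_i^t (M + 1 entries); actions and
    messages hold the M most recent past actions and DL messages.
    """
    obs = [buffers[-1]]                # b_i^t
    for lag in range(1, M + 1):
        obs.append(buffers[-1 - lag])  # b_i^{t-lag}
        obs.append(actions[-lag])      # a_i^{t-lag} = (a_u, a_s)
        obs.append(messages[-lag])     # m_i^{t-lag}
    return tuple(obs)

# Example with memory length M = 3 (the value listed in Table I).
obs = make_observation(
    buffers=[2, 2, 1, 1],
    actions=[(0, 1), (1, 0), (0, 0)],
    messages=[0, 1, 2],
    M=3,
)
```

With these definitions $|\mathcal{A}|=6$, and the observation has $1+3M$ entries (10 for $M=3$) when each action pair is counted as one entry.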

At time step $t$ , each agent $i$ observes $o_{i}^{t}$ and selects an action $a_{i}^{t}$ according to its own policy $\pi_{i}$ . At time step $t+1$ , in conventional MPOMDP, each agent receives from the environment an extrinsic reward $R_{\text{ext}}^{t+1}$ , which is the same for all agents and quantifies the benefit of the joint actions performed by all the $N$ agents. This design decision reflects the objective of optimizing the performance of the whole network, rather than that of individual agents. We define an episode as a finite sequence of agent-environment interactions lasting $T_{\text{ep}}$ time steps. For each episode, we define the episodic cumulative extrinsic return as

$$G_{\text{ep},\text{ext}}=\sum_{t=0}^{T_{\text{ep}}-1}\gamma^{t}R_{\text{ext}}^{t+1},$$

where $\gamma$ is a discount factor. Since maximizing $G_{\text{ep},\text{ext}}$ is the goal of the reinforcement learning problem, the values of $R_{\text{ext}}^{t+1}$ at each time step $t$ should be properly designed. Here, we adopt the simple approach of [2], where $R_{\text{ext}}^{t+1}\in\{-1,0\}$, with $R_{\text{ext}}^{t+1}$ equal to 0 only if a dPDU has been received correctly at time step $t$ or if all dPDUs of all UEs have already been received correctly in previous time steps. This means that $G_{\text{ep},\text{ext}}$ reaches its maximum value (i.e., 0) when all packets have been received in the minimum time, and takes its minimum value when no packets have been received correctly. We emphasize that extrinsic reward functions are typically hand-designed. However, finding a good reward function is not straightforward and requires considerable expertise and domain knowledge from the designer. Moreover, the extrinsic reward is strongly goal- or task-specific, which limits its applicability to other use cases and goals. Let $J_{\text{ep},\text{ext}}(\bm{\pi})$ denote the expected episodic cumulative extrinsic return obtained when each agent $i$ follows its own policy $\pi_{i}\in\bm{\pi}$, i.e.,

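$$J_{\text{ep},\text{ext}}(\bm{\pi})=\mathbb{E}_{\bm{o}^{0},\bm{a}^{0},\dots,\bm{o}^{T_{\text{ep}}-1},\bm{a}^{T_{\text{ep}}-1}}\left[G_{\text{ep},\text{ext}}\right],$$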

where $a_{i}^{t}\sim\pi_{i}\left(a_{i}^{t}\mid o_{i}^{t}\right),\forall i\in\mathcal{N}$. The objective of the MARL problem is to find optimal policies $\bm{\pi}^{*}$ that maximize $J_{\text{ep},\text{ext}}$. To do this, we adopt the CTDE paradigm. Several MARL techniques can be used, ranging from simple value-based approaches (e.g., tabular Q-learning [8]) to on-policy algorithms (e.g., MAPPO [9]). In general, all approaches consider an independent policy parameterized by ${\theta_{i}}$ for each agent $i$ and denoted as $\pi_{\theta_{i}}$. Each agent independently updates its own parameter $\theta_{i}$ by maximizing the expected extrinsic return. In addition, thanks to the homogeneous nature of UEs and the use of the same expected cumulative extrinsic return $J_{\text{ep},\text{ext}}$, another possible approach is to learn a shared optimal policy $\bm{\pi}^{*}$ by leveraging the concept of parameter sharing [5].
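The extrinsic reward rule and the discounted episodic return defined in this subsection can be sketched as follows. The function names are hypothetical, and the two boolean inputs are assumed to be supplied by the environment; the defaults $\gamma=0.99$ and $T_{\text{ep}}=32$ are the values listed in Table I.

```python
def extrinsic_reward(delivered_now, all_delivered_before):
    """R_ext^{t+1}: 0 if a dPDU was received correctly at step t, or if
    every dPDU had already been received earlier; -1 otherwise."""
    return 0 if (delivered_now or all_delivered_before) else -1

def discounted_return(rewards, gamma=0.99):
    """sum_t gamma^t * rewards[t], accumulated backwards for stability."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

T_EP = 32
best_case = discounted_return([0] * T_EP)    # a success at every useful step
worst_case = discounted_return([-1] * T_EP)  # nothing is ever delivered
```

Because the reward is never positive, the return is bounded above by 0, and the worst case equals $-(1-\gamma^{T_{\text{ep}}})/(1-\gamma)$, the closed form of the geometric sum.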

III Proposed approach

In this section, we formally present our approach, which aims to speed up convergence by combining the concept of an extrinsic reward augmented by an intrinsic reward [6] with the concept of lifetime [7]. Specifically, in [6] the authors investigated, for single-agent environments, the advantages of using an intrinsic reward function parameterized by $\eta$ in addition to the conventional extrinsic reward. Unlike the hand-designed extrinsic reward, the intrinsic reward function is automatically learned by each agent to improve its learning dynamics. In this case, both the policy and the intrinsic reward parameters are learned within a single episode. Conversely, in [7], the authors propose to learn an intrinsic reward over a lifetime consisting of $N_{\text{ep}}$ episodes, instead of a single episode, to take into account the system dynamics. In detail, the policy parameter $\theta$ is still updated episode-by-episode by considering only the cumulative episodic intrinsic reward, while the intrinsic reward parameter $\eta$ is updated once per lifetime to maximize the cumulative extrinsic reward over an entire lifetime.

Therefore, inspired by these works, we propose a new approach that incorporates the multi-agent intrinsic reward function in our system model. For the sake of clarity, we first define some terminology.

• Intrinsic reward function for agent $i$. Defined as a function related to agent $i$ and parameterized by $\eta_{i}$. At the end of time step $t$, $R_{\text{in},\eta_{i}}^{t+1}$ is a scalar reward that takes into account the history of the entire lifetime of agent $i$ until time step $t$, including all its partial observations $([o_{i}^{0},\dots,o_{i}^{t}])$, its selected actions $([a_{i}^{0},\dots,a_{i}^{t}])$, and the extrinsic reward values $([R_{\text{ext}}^{1},\dots,R_{\text{ext}}^{t}])$.

• Overall reward function for agent $i$. Defined as a function related to agent $i$ made of two contributions. First, the extrinsic reward value $R_{\text{ext}}^{t+1}$ received from the environment, which is the same for all agents and quantifies the benefit of the joint actions performed by the $N$ agents. Second, the intrinsic reward value $R_{\text{in},\eta_{i}}^{t+1}$ that is learned independently by each agent $i$. For each agent $i$ and time step $t+1$, the overall reward is given as

$$R_{\text{ov},i}^{t+1}=R_{\text{ext}}^{t+1}+\lambda R_{\text{in},\eta_{i}}^{t+1},$$

where $\lambda\in[0,1]$ is a hyper-parameter that balances the weighted summation between the extrinsic reward and the intrinsic reward.

• Episodic overall return. For each episode $k$, we define the episodic overall return of agent $i$ as

$$G_{\text{ep},\text{ov},i}^{(k)}=\sum_{t=0}^{T_{\text{ep}}-1}\gamma^{t}R_{\text{ov},i}^{t+1}.$$

• Lifetime extrinsic return. At the end of a lifetime, we define the lifetime extrinsic return as

$$G_{\text{life},\text{ext}}=\sum_{t=0}^{T-1}\gamma^{t}R_{\text{ext}}^{t+1},$$

where $T$ is the number of steps per lifetime, i.e., $T=N_{\text{ep}}T_{\text{ep}}$ . Using the lifetime return $G_{\text{life},\text{ext}}$ as the objective instead of the conventional episodic return $G_{\text{ep},\text{ext}}$ allows exploration across multiple episodes.
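The three quantities defined in the list above differ only in which reward is summed and over how many steps. A minimal sketch, with hypothetical helper names and with $\lambda$, $\gamma$, $T_{\text{ep}}$, and $N_{\text{ep}}$ taken from Table I:

```python
def discounted_return(rewards, gamma=0.99):
    """sum_t gamma^t * rewards[t]."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def overall_rewards(r_ext, r_in, lam=1.0):
    """R_ov = R_ext + lambda * R_in, step by step, for one agent."""
    return [e + lam * i for e, i in zip(r_ext, r_in)]

def episodic_overall_return(r_ext, r_in, lam=1.0, gamma=0.99):
    """G_ep,ov for one episode of one agent."""
    return discounted_return(overall_rewards(r_ext, r_in, lam), gamma)

def lifetime_extrinsic_return(episodes_r_ext, gamma=0.99):
    """G_life,ext: one discounted sum over all steps of all episodes,
    with the time index t running across episode boundaries."""
    flat = [r for episode in episodes_r_ext for r in episode]
    return discounted_return(flat, gamma)

T_EP, N_EP = 32, 250
T_LIFE = N_EP * T_EP  # number of steps per lifetime, T = N_ep * T_ep
```

With the Table I values, one lifetime spans $T=250\cdot 32=8000$ time steps. The intrinsic reward enters only the episodic overall return; the lifetime return is built from the extrinsic reward alone.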

III-A Architecture of the multi-agent framework

Each agent $i\in\mathcal{N}$ is equipped with two neural networks, as depicted in Fig. 2. The first one represents the policy function $\pi_{i}$, and the second one represents the related intrinsic reward function. The policy network (see Fig. 2a) is a multi-layer perceptron (MLP) with weights $\theta_{i}$. The intrinsic reward function is represented by a neural network that outputs a scalar reward taking into account the long past history of agent $i$. For this reason, instead of adopting a conventional MLP, we exploit the characteristics of recurrent neural networks (RNNs). Unlike in an MLP, in RNNs the output from the previous step is fed as input to the current step, creating a feedback loop. As a consequence, the output provided at step $t$ takes into consideration not only the current input, but also what the network has learned from the previous inputs, which gives the network an internal memory. However, conventional RNNs are not able to memorize data for a long time and tend to forget their earlier inputs. To overcome this problem, we use an LSTM, which is a type of recurrent neural network that extends the memory capacity over long periods of time [10]. The proposed LSTM is parameterized by $\eta_{i}$ and represents the intrinsic reward function, as shown in Fig. 2b.
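The two per-agent modules can be illustrated with a small NumPy forward pass. This is a sketch, not the authors' architecture: only the layer widths (64 for the policy network, 128 for the intrinsic reward network) come from Table I. The tanh/tanh/softmax activations, the tanh-squashed scalar output, the input encodings (`OBS_DIM`, `LSTM_IN`), and the weight initialization are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_policy(obs_dim, n_actions, hidden=64, seed=0):
    """MLP weights theta_i: two hidden layers and one output layer."""
    rng = np.random.default_rng(seed)
    sizes = [obs_dim, hidden, hidden, n_actions]
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def policy_forward(theta, obs):
    """Return pi_theta(. | obs) as a probability vector over actions."""
    h = np.asarray(obs, dtype=float)
    for W, b in theta[:-1]:
        h = np.tanh(h @ W + b)             # assumed hidden activation
    W, b = theta[-1]
    logits = h @ W + b
    z = np.exp(logits - logits.max())      # numerically stable softmax
    return z / z.sum()

def init_reward_lstm(in_dim, hidden=128, seed=1):
    """LSTM weights eta_i plus a linear read-out to a scalar."""
    rng = np.random.default_rng(seed)
    return {"W": rng.normal(0.0, 0.1, (in_dim + hidden, 4 * hidden)),
            "b": np.zeros(4 * hidden),
            "w_out": rng.normal(0.0, 0.1, hidden),
            "b_out": 0.0}

def reward_lstm_step(eta, x, state):
    """One LSTM step: consume the input x, update (h, c), emit a scalar."""
    h, c = state
    gates = np.concatenate([x, h]) @ eta["W"] + eta["b"]
    i, f, o, g = np.split(gates, 4)        # input, forget, output, candidate
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    r_in = float(np.tanh(h @ eta["w_out"] + eta["b_out"]))  # assumed squashing
    return r_in, (h, c)

OBS_DIM, N_ACTIONS = 13, 6                 # hypothetical flat encoding, M = 3
LSTM_IN = OBS_DIM + N_ACTIONS + 1          # observation, one-hot action, R_ext
theta = init_policy(OBS_DIM, N_ACTIONS)
eta = init_reward_lstm(LSTM_IN)
probs = policy_forward(theta, np.zeros(OBS_DIM))
state = (np.zeros(128), np.zeros(128))
r_in, state = reward_lstm_step(eta, np.ones(LSTM_IN), state)
```

The point of the recurrent state `(h, c)` is that it is carried from step to step, so the scalar `r_in` can depend on the whole history seen so far rather than on the current input alone.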

III-B Algorithm overview

A high-level description of the proposed training algorithm for each agent $i$ is presented in Pseudo-code 1 and depicted in Fig. 3. As shown, the updates of the policy network and of the intrinsic reward network are carried out with different periodicities, corresponding to one episode and one lifetime, respectively. The lifetime periodicity makes it possible to update the intrinsic reward network taking into account the long-term system dynamics.

At each episode $k$ , each agent $i$ generates an experience interacting with the environment for $T_{\text{ep}}$ time steps using its policy and its intrinsic reward network. In detail, at each time step $t$ , the experience of agent $i$ is stored inside the episode rollout

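$$T_{\text{E},i}^{(k)}=\left\{\left(o_{i}^{t},a_{i}^{t},\pi_{\theta_{i}^{(k-1)}}\left(a_{i}^{t}\mid o_{i}^{t}\right),R_{\text{ext}}^{t+1},R_{\text{in},\eta_{i}}^{t+1}\right)\right\}_{t=0}^{T_{\text{ep}}-1},$$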

and the lifetime rollout

$$T_{\text{L},i}=\left\{T_{\text{E},i}^{(k)}\right\}_{k=1}^{N_{\text{ep}}}.$$

Episode-by-episode, each agent $i$ updates its policy parameter $\theta_{i}^{(k-1)}$ following the procedure described in Section III-C. At the end of a lifetime, each agent updates the intrinsic reward network parameter $\eta_{i}$ following the procedure described in Section III-D. The overall procedure is repeated until the intrinsic reward network reaches convergence.
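The two update periodicities described in this subsection can be summarized by the following loop skeleton for a single agent. It is only a structural sketch: `run_episode`, `update_policy`, and `update_reward` are hypothetical callbacks standing in for the environment interaction and for the procedures of Sections III-C and III-D.

```python
def train_lifetime(run_episode, update_policy, update_reward, n_ep=250):
    """Run one lifetime of n_ep episodes for a single agent.

    The policy is updated after every episode; the intrinsic reward
    network is updated once, at the end of the lifetime.
    """
    lifetime_rollout = []                      # T_L: list of episode rollouts
    for k in range(1, n_ep + 1):
        episode_rollout = run_episode(k)       # T_E^(k)
        lifetime_rollout.append(episode_rollout)
        update_policy(episode_rollout)         # theta update (per episode)
    update_reward(lifetime_rollout)            # eta update (per lifetime)
    return lifetime_rollout

# Usage with counting stubs in place of real networks and environment.
calls = {"policy": 0, "reward": 0}

def _count_policy(rollout):
    calls["policy"] += 1

def _count_reward(rollouts):
    calls["reward"] += 1

rollouts = train_lifetime(
    run_episode=lambda k: [("obs", "act", k)],
    update_policy=_count_policy,
    update_reward=_count_reward,
)
```

With $N_{\text{ep}}=250$ from Table I, each lifetime therefore contains 250 policy updates and a single intrinsic reward update.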

III-C Updating the Policy Parameter $\theta_{i}$

In this subsection, we describe how to update the policy parameter of each agent $i$ . Specifically, at the end of episode $k$ , the update of $\theta^{(k-1)}_{i}$ is performed so as to maximize the expected episodic cumulative overall return of episode $k-1$

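$$J_{\text{ep},\text{ov},i}^{(k-1)}=\mathbb{E}_{o_{i}^{0},a_{i}^{0},\dots,o_{i}^{T_{\text{ep}}-1},a_{i}^{T_{\text{ep}}-1}}\left[G_{\text{ep},\text{ov},i}^{(k-1)}\right],$$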

where $a_{i}^{t}\sim\pi_{\theta^{(k-1)}_{i}}\left(a_{i}^{t}\mid o_{i}^{t}\right)$ . This update can be done by using a simple policy gradient method, as follows

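$$\theta_{i}^{(k)}=\theta_{i}^{(k-1)}+\alpha\nabla_{\theta_{i}^{(k-1)}}J_{\text{ep},\text{ov},i}^{(k-1)}.$$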

The policy gradient theorem [11] shows that, given the episode rollout $T_{\text{E},i}^{(k)}$, the update can be computed as follows (other policy gradient methods, such as REINFORCE [12], PPO [13], or TRPO [14], can also be used):

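$$\theta_{i}^{(k)}\approx\theta_{i}^{(k-1)}+\alpha G_{\text{ep},\text{ov},i}^{(k-1)}\nabla_{\theta_{i}^{(k-1)}}\log\pi_{\theta_{i}^{(k-1)}}\left(a_{i}^{t}\mid o_{i}^{t}\right).$$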
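As an illustration of this score-function update, the sketch below applies it to a linear-softmax policy, for which $\nabla_{\theta}\log\pi_{\theta}(a\mid o)$ has a closed form. The linear policy and the summation of the score over all steps of the episode are assumptions made for the example; the paper's policy is an MLP, and the function name is hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_step(theta, observations, actions, g_ov, alpha=3e-4):
    """theta <- theta + alpha * G_ov * sum_t grad log pi_theta(a_t | o_t).

    theta has shape (n_actions, obs_dim) and the policy is
    pi_theta(. | o) = softmax(theta @ o).
    """
    grad = np.zeros_like(theta)
    for o, a in zip(observations, actions):
        probs = softmax(theta @ o)
        score = -probs
        score[a] += 1.0               # one_hot(a) - probs
        grad += np.outer(score, o)    # grad of log pi wrt theta
    return theta + alpha * g_ov * grad

# Tiny example: 2 actions, 1-dimensional observation, a single step.
theta0 = np.zeros((2, 1))
theta1 = policy_gradient_step(theta0, [np.array([1.0])], [0],
                              g_ov=1.0, alpha=0.5)
```

With a positive return the probability of the action that was taken increases; with a negative return (the extrinsic reward used here is never positive) it decreases, and a learned intrinsic term can shift that balance per agent.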

III-D Updating the Intrinsic Reward Parameter ( $\eta_{i}$ )

Given a lifetime and the updated policy parameters at the end of the lifetime $\left({\theta}_{i}^{\left(N_{\text{ep}}\right)}\right)$ , we update the intrinsic reward network parameter for each agent $i$ with the aim of maximizing the expected lifetime extrinsic return

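$$J_{\text{life},\text{ext}}=\mathbb{E}_{o_{i}^{0},a_{i}^{0},\dots,o_{i}^{T-1},a_{i}^{T-1}}\left[G_{\text{life},\text{ext}}\right],$$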

where $a_{i}^{t}\sim\pi_{\theta^{(N_{\text{ep}})}_{i}}\left(a_{i}^{t}\mid o_{i}^{t}\right)$. Similarly to the policy parameter update, this update can be done by using a simple policy gradient method, as follows

$$\eta_{i}^{\prime}=\eta_{i}+\beta\nabla_{\eta_{i}}J_{\text{life},\text{ext}}.$$

Intuitively, updating $\eta_{i}$ requires estimating the effect that such a change would have on the extrinsic value through the change in the policy parameters. To obtain this, we compute the meta-gradient $\nabla_{\eta_{i}}J_{\text{life},\text{ext}}$ by exploiting the chain rule as follows:

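$$\nabla_{\eta_{i}}J_{\text{life},\text{ext}}=\nabla_{\theta_{i}^{(N_{\text{ep}})}}J_{\text{life},\text{ext}}\,\nabla_{\eta_{i}}\theta_{i}^{(N_{\text{ep}})}.$$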

Moreover, the first gradient can be approximated by means of the policy gradient theorem [11] as

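$$\nabla_{\theta_{i}^{(N_{\text{ep}})}}J_{\text{life},\text{ext}}\approx G_{\text{life},\text{ext}}\nabla_{\theta_{i}^{(N_{\text{ep}})}}\log\pi_{\theta_{i}^{(N_{\text{ep}})}}\left(a_{i}^{t}\mid o_{i}^{t}\right).$$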

We note that computing this gradient would require generating a new lifetime with the updated policy parameters $\theta^{(N_{\text{ep}})}_{i}$. To avoid this, we reuse the lifetime generated by the original policy parameters $\theta^{(k)}_{i}$, with $k=1,\dots,N_{\text{ep}}$, by means of the importance sampling ratio [15]. Hence, we exploit the lifetime rollout $T_{\text{L},i}$ and rewrite the gradient computation as follows:

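$$\nabla_{\theta_{i}^{(N_{\text{ep}})}}J_{\text{life},\text{ext}}\approx G_{\text{life},\text{ext}}\frac{\nabla_{\theta_{i}^{(N_{\text{ep}})}}\pi_{\theta_{i}^{(N_{\text{ep}})}}\left(a_{i}^{t}\mid o_{i}^{t}\right)}{\pi_{\theta_{i}^{(k)}}\left(a_{i}^{t}\mid o_{i}^{t}\right)}.$$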
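The importance-sampled form can be checked numerically on a small example. The sketch below again assumes a linear-softmax policy (not the paper's MLP) and a hypothetical function name; `behavior_prob` plays the role of the action probability stored in the rollout when the sample was collected.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def is_weighted_gradient(theta_new, o, a, behavior_prob, g_life):
    """G_life * grad_theta pi_theta_new(a | o) / pi_behavior(a | o)
    for a linear-softmax policy pi_theta(. | o) = softmax(theta @ o)."""
    probs = softmax(theta_new @ o)
    score = -probs
    score[a] += 1.0                              # grad of log pi wrt logits
    grad_pi = probs[a] * np.outer(score, o)      # grad pi = pi * grad log pi
    return g_life * grad_pi / behavior_prob

theta = np.zeros((2, 1))
o = np.array([1.0])

# On-policy check: if the sample was drawn from the same policy, the ratio
# pi_new / pi_behavior is 1 and the estimate is G * grad log pi.
on_policy = is_weighted_gradient(theta, o, a=0, behavior_prob=0.5, g_life=-2.0)

# Off-policy: the action was less likely under the behavior policy, so the
# same sample is weighted more heavily.
off_policy = is_weighted_gradient(theta, o, a=0, behavior_prob=0.25, g_life=-2.0)
```

Since $\nabla\pi/\pi_{\text{behavior}}=(\pi/\pi_{\text{behavior}})\nabla\log\pi$, the expression is the usual score-function estimate reweighted by the importance ratio, which is why the action probabilities are stored in the episode rollout.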

IV Simulation Results and Analysis

In this section, we examine the convergence performance of the proposed learning framework in terms of percentage of successfully delivered dPDUs vs. the number of training episodes. The list of the simulation parameters is reported in Table I.

The results of the proposed method are compared against the following baselines.

• Extrinsic-NPS: An independent policy is trained for each agent $i$ to maximize the expected episodic cumulative extrinsic return ($J_{\text{ep},\text{ext}}$).

• Extrinsic-PS: A shared policy is trained among all the agents to maximize $J_{\text{ep},\text{ext}}$, as in [2].

• Random Uniform (RU): Regardless of the observation $o_{i}^{t}$, each agent $i$ selects its action uniformly at random from $\mathcal{A}$.

Fig. 4 plots the percentage of successfully delivered packets vs. training episodes with respect to baselines averaged over 10 independent training sessions. After assessing the training phase, we select the best trained instance for each solution in terms of average percentage of successfully delivered packets. Then, we test them in 1000 episodes and show the related statistics by using boxplots.

Specifically, Fig. 4a shows the simulation results in the case of $P=1$ packet to deliver. We observe that the proposed algorithm and Extrinsic-NPS require $5.1\cdot 10^{3}$ iterations to reach convergence. Conversely, Extrinsic-PS has not converged within $7\cdot 10^{3}$ episodes. This shows that the additional features (intrinsic reward and lifetime update) do not bring any significant improvement in the case of a simple transmission scenario. In Fig. 4b we show the performance when $P=2$ packets need to be delivered. Therein, the proposed method reaches convergence in almost $2\cdot 10^{4}$ training episodes, which is $75\%$ less than the number of episodes required by the Extrinsic-NPS method. This is because, in this more complex scenario, the proposed method provides additional information with the correct periodicity to the policy update process. Fig. 4b also shows that the proposed method achieves a maximum service success rate of $81\%$, that is, $4\%$ better than the maximum performance of Extrinsic-NPS. As regards the other baseline, Extrinsic-PS does not reach convergence within the considered training interval. As concerns the testing phase, our solution exhibits an interquartile range between $75\%$ and $100\%$ of packets successfully delivered, which is the best result, as shown in Fig. 4b.

In summary, the proposed method yields better convergence speed with better transmission performance in the case of a complex scenario in which the additional information is key for policy parameter tuning.
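As a back-of-the-envelope reading of the $P=2$ figures, the reported numbers can be combined as below. The derived values are not stated in the paper: they follow only from taking "almost $2\cdot 10^{4}$ episodes" and "$75\%$ less" at face value, together with $N_{\text{ep}}=250$ from Table I, and are therefore approximate.

```python
proposed_episodes = 2e4        # reported (approximate) episodes to converge
reduction = 0.75               # reported saving relative to Extrinsic-NPS
episodes_per_lifetime = 250    # N_ep from Table I

# If the proposed method needs 75% fewer episodes, the baseline needs
# proposed / (1 - 0.75) episodes.
implied_baseline_episodes = proposed_episodes / (1 - reduction)

# Number of lifetimes, i.e. intrinsic reward network updates, that fit in
# the proposed method's training run.
lifetimes_to_converge = proposed_episodes / episodes_per_lifetime
```

Under these assumptions, Extrinsic-NPS would need on the order of $8\cdot 10^{4}$ episodes, and the proposed method would converge after roughly 80 lifetimes, that is, about 80 updates of the intrinsic reward network.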

V Conclusions

We have proposed a novel multi-agent reinforcement learning framework for MAC protocol learning, which in addition to using the classical extrinsic team reward, learns an individual intrinsic reward for each agent based on its history. Each agent uses two modules, namely a policy network and an intrinsic reward network. These two modules are updated with a different periodicity to obtain better learning results in terms of convergence speed. Specifically, the policy network is trained within an episode, while the intrinsic reward network is trained over a fixed number of subsequent episodes, called lifetime. We formulated an optimization problem that seeks to maximize the number of successfully transmitted packets. Our results demonstrate that exploiting these two modules with two different learning periodicities induces a faster convergence speed compared to several baseline solutions.

Acknowledgments

This work was partially supported by the European Union under the Italian National Recovery and Resilience Plan (NRRP) of NextGenerationEU, partnership on “Telecommunications of the Future” (PE00000001 - program “RESTART”), by the Italian MUR PON 2014-2020 under Project “reCITY - Resilient City - Everyday Revolution” (cod. ARS01_00592, CUP B69C21000390005), and by the European Union’s Horizon Europe program through the project CENTRIC.

Bibliography

[1] A. Lazaridou and M. Baroni, “Emergent multi-agent communication in the deep learning era,” 2020. [Online]. Available: https://arxiv.org/abs/2006.02419
[2] A. Valcarce and J. Hoydis, “Toward joint learning of optimal MAC signaling and wireless channel access,” IEEE Transactions on Cognitive Communications and Networking, vol. 7, no. 4, pp. 1233–1243, 2021.
[3] L. Miuccio, S. Riolo, S. Samarakoon, D. Panno, and M. Bennis, “Learning generalized wireless MAC communication protocols via abstraction,” in GLOBECOM 2022 - 2022 IEEE Global Communications Conference, 2022, pp. 2322–2327.
[4] G. Papoudakis, F. Christianos, A. Rahman, and S. V. Albrecht, “Dealing with non-stationarity in multi-agent deep reinforcement learning,” 2019. [Online]. Available: https://arxiv.org/abs/1906.04737
[5] X. Chu and H. Ye, “Parameter sharing deep deterministic policy gradient for cooperative multi-agent reinforcement learning,” 2017. [Online]. Available: https://arxiv.org/abs/1710.00336
[6] Z. Zheng, J. Oh, and S. Singh, “On learning intrinsic rewards for policy gradient methods,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS’18, 2018, pp. 4644–4654.
[7] Z. Zheng, J. Oh, M. Hessel, Z. Xu, M. Kroiss, H. van Hasselt, D. Silver, and S. Singh, “What can learned intrinsic rewards capture?” 2019. [Online]. Available: https://arxiv.org/abs/1912.05500
[8] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press, 2018.
