Reinforcement Learning for Self-Organization and Power Control of Two-Tier Heterogeneous Networks

Roohollah Amiri; Mojtaba Ahmadi Almasi; Jeffrey G. Andrews; Hani Mehrpouyan

arXiv:1812.09778·cs.IT·March 19, 2019

TL;DR

This paper introduces a reinforcement learning-based framework for self-organizing power control in dense two-tier heterogeneous networks, enabling adaptive interference management and quality of service maintenance.

Contribution

It proposes a distributed multi-agent Markov decision process model and a Q-learning algorithm for autonomous power optimization in HetNets.

Findings

1. Q-DPA algorithm achieves near-optimal power control with high probability.

2. The framework maintains macrocell user quality of service at high femtocell densities.

3. Sample complexity bounds ensure efficient learning in dense deployments.

Abstract

Self-organizing networks (SONs) can help manage the severe interference in dense heterogeneous networks (HetNets). Given their need to automatically configure power and other settings, machine learning is a promising tool for data-driven decision making in SONs. In this paper, a HetNet is modeled as a dense two-tier network with conventional macrocells overlaid with denser small cells (e.g. femto or pico cells). First, a distributed framework based on a multi-agent Markov decision process is proposed that models the power optimization problem in the network. Second, we present a systematic approach for designing a reward function based on the optimization problem. Third, we introduce a Q-learning based distributed power allocation algorithm (Q-DPA) as a self-organizing mechanism that enables ongoing transmit power adaptation as new small cells are added to the network. Further, the sample…

Tables (3)

Table I: Urban dual strip pathloss model

Link	PL (dB)
MBS to MUE	$15.3 + 37.6 \log_{10} R$
MBS to FUE	$15.3 + 37.6 \log_{10} R + L_{ow}$
FBS to FUE (same apt strip)	$56.76 + 20 \log_{10} R + 0.7 d_{2D,indoor}$
FBS to FUE (different apt strip)	$\max(15.3 + 37.6 \log_{10} R,\ 38.46 + 20 \log_{10} R) + 18.3 + 0.7 d_{2D,indoor} + L_{ow}$

Table II: Simulation parameters

Default parameters	Value	State parameters	Value
Frame time	2 ms	$d_{1}^{\prime}, d_{2}^{\prime}, d_{3}^{\prime}$	50, 150, 400 m
UE thermal noise	-174 dBm/Hz	$d_{1}, d_{2}, d_{3}$	17.5, 22.5, 45 m
Traffic model	Full buffer
FBS parameters	Value	Q-DPA parameters	Value
$p_{min}$	5 dBm	Training period (iterations) $L$	$T \times |\mathcal{X}| \cdot |\mathcal{A}_{k}|$ frames
$p_{max}$	15 dBm	Learning parameter $\beta$	0.9
$\Delta p$	1 dBm	Exploratory probability ($e$)	10%

Table III: Performance of different learning configurations. 1 is the best, and 4 is the worst.

Learning configuration	$\sum p_{k}$	$\sum r_{k}$	$r_{0}$
IL+ $\mathcal{X}_{1}$	$4$	$1$	$4$
CL+ $\mathcal{X}_{1}$	$3$	$3$	$3$
IL+ $\mathcal{X}_{2}$	$2$	$2$	$2$
CL+ $\mathcal{X}_{2}$	$1$	$4$	$1$

Equations

$\gamma_{0}=\frac{p_{0}\lvert h_{0,0}\rvert^{2}}{\underbrace{\sum_{k\in\mathcal{K}}p_{k}\lvert h_{k,0}\rvert^{2}}_{\text{femtocells' interference}}+N_{0}},$

$\gamma_{k}=\frac{p_{k}\lvert h_{k,k}\rvert^{2}}{\underbrace{p_{0}\lvert h_{0,k}\rvert^{2}}_{\text{macrocell's interference}}+\underbrace{\sum_{j\in\mathcal{K}\setminus\{k\}}p_{j}\lvert h_{j,k}\rvert^{2}}_{\text{femtocells' interference}}+N_{k}},$

$V_{\pi}\left(\mathbf{x}^{\prime}\right)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\beta^{t}\mathbf{R}^{\left(t+1\right)}\,\middle|\,\mathbf{x}^{\left(0\right)}=\mathbf{x}^{\prime}\right],$

$\mathbf{Q}_{\pi}\left(\mathbf{x},\mathbf{a}\right)=\mathbf{R}\left(\mathbf{x},\mathbf{a}\right)+\beta\sum_{\mathbf{x}^{\prime}\in\mathcal{X}}\Pr\left(\mathbf{x}^{\prime}\mid\mathbf{x},\mathbf{a}\right)V_{\pi}\left(\mathbf{x}^{\prime}\right).$

$V^{*}\left(\mathbf{x}\right)=\max_{\mathbf{a}}\ \mathbf{Q}^{*}\left(\mathbf{x},\mathbf{a}\right),$

$\mathbf{R}\left(\mathbf{x},\mathbf{a}\right)=\sum_{k\in\mathcal{K}}R_{k}\left(\mathbf{x}_{k},a_{k}\right),$

$\Pr\left(\mathbf{x}_{k}^{\prime}\mid\mathbf{x},\mathbf{a}\right)=\Pr\left(\mathbf{x}_{k}^{\prime}\mid\mathbf{x}_{k},a_{k}\right),\ \left(\mathbf{x},\mathbf{a}\right)\in\mathcal{X}\times\mathcal{A},\ \left(\mathbf{x}_{k},a_{k}\right)\in\mathcal{X}_{k}\times\mathcal{A}_{k},\ \mathbf{x}_{k}^{\prime}\in\mathcal{X}_{k}.$

$V\left(\mathbf{x}\right)=\mathbb{E}\left[\sum_{t=0}^{\infty}\beta^{t}\mathbf{R}^{\left(t+1\right)}\left(\mathbf{x},\mathbf{a}\right)\right]=\mathbb{E}\left[\sum_{t=0}^{\infty}\beta^{t}\sum_{k\in\mathcal{K}}R_{k}^{\left(t+1\right)}\left(\mathbf{x}_{k},a_{k}\right)\right]=\sum_{k\in\mathcal{K}}V_{k}\left(\mathbf{x}_{k}\right),$

$Q_{k}\left(\mathbf{x}_{k},a_{k}\right)=R_{k}\left(\mathbf{x}_{k},a_{k}\right)+\beta\sum_{\mathbf{x}_{k}^{\prime}}\Pr\left(\mathbf{x}_{k}^{\prime}\mid\mathbf{x}_{k},a_{k}\right)V_{k}\left(\mathbf{x}_{k}^{\prime}\right),$

$\mathbf{Q}\left(\mathbf{x},\mathbf{a}\right)=\sum_{k\in\mathcal{K}}R_{k}\left(\mathbf{x}_{k},a_{k}\right)+\beta\sum_{\mathbf{x}^{\prime}\in\mathcal{X}}\Pr\left(\mathbf{x}^{\prime}\mid\mathbf{x},\mathbf{a}\right)\sum_{k\in\mathcal{K}}V_{k}\left(\mathbf{x}_{k}^{\prime}\right)$

$=\sum_{k\in\mathcal{K}}R_{k}\left(\mathbf{x}_{k},a_{k}\right)+\beta\sum_{k\in\mathcal{K}}\sum_{\mathbf{x}_{k}^{\prime}\in\mathcal{X}_{k}}\Pr\left(\mathbf{x}_{k}^{\prime}\mid\mathbf{x},\mathbf{a}\right)V_{k}\left(\mathbf{x}_{k}^{\prime}\right)$

$=\sum_{k\in\mathcal{K}}R_{k}\left(\mathbf{x}_{k},a_{k}\right)+\beta\sum_{k\in\mathcal{K}}\sum_{\mathbf{x}_{k}^{\prime}\in\mathcal{X}_{k}}\Pr\left(\mathbf{x}_{k}^{\prime}\mid\mathbf{x}_{k},a_{k}\right)V_{k}\left(\mathbf{x}_{k}^{\prime}\right)=\sum_{k\in\mathcal{K}}Q_{k}\left(\mathbf{x}_{k},a_{k}\right).$

$\mathbf{Q}\left(\mathbf{x}^{\left(t\right)},\mathbf{a}^{\left(t\right)}\right)\leftarrow\mathbf{Q}\left(\mathbf{x}^{\left(t\right)},\mathbf{a}^{\left(t\right)}\right)+\alpha^{\left(t\right)}\left(\mathbf{x},\mathbf{a}\right)\left(\mathbf{R}^{\left(t+1\right)}\left(\mathbf{x}^{\left(t\right)},\mathbf{a}^{\left(t\right)}\right)+\beta\underbrace{\max_{\mathbf{a}^{\prime}}\ \mathbf{Q}\left(\mathbf{x}^{\left(t+1\right)},\mathbf{a}^{\prime}\right)}_{\left(M\right)}-\mathbf{Q}\left(\mathbf{x}^{\left(t\right)},\mathbf{a}^{\left(t\right)}\right)\right),$

$\alpha^{\left(t\right)}\left(\mathbf{x},\mathbf{a}\right)=\frac{1}{\left[1+t\left(\mathbf{x},\mathbf{a}\right)\right]},$

$M=\max_{\mathbf{a}^{\prime}}\sum_{k\in\mathcal{K}}Q_{k}\left(\mathbf{x}_{k}^{\left(t+1\right)},a_{k}^{\prime}\right)\approx\sum_{k\in\mathcal{K}}\max_{a_{k}^{\prime}}\ Q_{k}\left(\mathbf{x}_{k}^{\left(t+1\right)},a_{k}^{\prime}\right).$

$M=\max_{\mathbf{a}^{\prime}}\sum_{k\in\mathcal{K}}Q_{k}\left(\mathbf{x}_{k}^{\left(t+1\right)},a_{k}^{\prime}\right)\approx\max_{a_{k}^{\prime}}\sum_{k\in\mathcal{K}^{\prime}}Q_{k}\left(\mathbf{x}_{k}^{\left(t+1\right)},a_{k}^{\prime}\right),$

$Q_{k}\left(\mathbf{x}_{k}^{\left(t\right)},a_{k}^{\left(t\right)}\right)\leftarrow Q_{k}\left(\mathbf{x}_{k}^{\left(t\right)},a_{k}^{\left(t\right)}\right)+\alpha^{\left(t\right)}\left(R^{\left(t+1\right)}\left(\mathbf{x}_{k}^{\left(t\right)},a_{k}^{\left(t\right)}\right)+\beta Q_{k}\left(\mathbf{x}_{k}^{\left(t+1\right)},a_{k}^{*}\right)-Q_{k}\left(\mathbf{x}_{k}^{\left(t\right)},a_{k}^{\left(t\right)}\right)\right),$

$\underset{a_{k}^{\prime}}{\arg\max}\ Q_{k}\left(\mathbf{x}_{k}^{\left(t+1\right)},a_{k}^{\prime}\right),$

$\underset{a_{k}^{\prime}}{\arg\max}\sum_{k\in\mathcal{K}^{\prime}}Q_{k}\left(\mathbf{x}_{k}^{\left(t+1\right)},a_{k}^{\prime}\right),$

$R_{k}\left(r_{0},r_{k},\Gamma_{0},\Gamma_{k}\right)=\left(r_{0}-\log_{2}\left(1+\Gamma_{0}\right)\right)^{k_{1}}+\left(r_{k}-\log_{2}\left(1+\Gamma_{k}\right)\right)^{k_{2}}+C,$

$V_{2}\left(x\right)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\beta^{t}\left(f^{\left(t+1\right)}\left(\cdot\right)+C\right)\right]=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\beta^{t}f^{\left(t+1\right)}\left(\cdot\right)\right]+C\sum_{t=0}^{\infty}\beta^{t}=V_{1}\left(x\right)+\frac{C}{1-\beta}.$

$Q\left(x^{\prime},a\right)\leftarrow Q\left(x^{\prime},a\right)+\alpha^{\left(t\right)}\left(x^{\prime},a\right)\left(R\left(x^{\prime},a\right)+\beta\max_{a^{\prime}}\ Q\left(x^{\prime\prime},a^{\prime}\right)-Q\left(x^{\prime},a\right)\right)\leftarrow\alpha^{\left(t\right)}\left(x^{\prime},a\right)\left(f\left(\cdot\right)+\beta\max_{a^{\prime}}\ Q\left(x^{\prime\prime},a^{\prime}\right)\right)+\underbrace{\alpha^{\left(t\right)}\left(x^{\prime},a\right)C}_{\left(A\right)}.$

$R_{k}\left(r_{0},r_{k},\Gamma_{0},\Gamma_{k}\right)=\left(r_{0}-\log_{2}\left(1+\Gamma_{0}\right)\right)^{k_{1}}+\left(r_{k}-\log_{2}\left(1+\Gamma_{k}\right)\right)^{k_{2}},$

$\frac{\partial R_{k}}{\partial r_{i}}\geq 0,\ i=0,k.$

$R_{k}=\left(r_{0}-\log_{2}\left(1+\Gamma_{0}\right)\right)^{2m-1}+\left(r_{k}-\log_{2}\left(1+\Gamma_{k}\right)\right)^{2m-1},$

$\frac{\partial R_{k}}{\partial r_{i}}\times\left(r_{i}-\log_{2}\left(1+\Gamma_{i}\right)\right)\leq 0,\ i=0,k.$

$\frac{\partial R_{k}}{\partial r_{0}}\times\left(r_{0}-\log_{2}\left(1+\Gamma_{0}\right)\right)\leq 0,$

$\frac{\partial R_{k}}{\partial r_{k}}\geq 0.$

$\Pr\left(\lVert Q^{*}-Q_{\pi}\rVert<\epsilon\right)\geq 1-\delta.$

$\lVert Q^{*}-Q^{\left(T\right)}\rVert\leq\frac{2R_{max}}{\left(1-\beta\right)}\left[\frac{\beta}{T\left(1-\beta\right)}+\sqrt{\frac{2}{T}\ln\frac{2\lvert\mathcal{X}\rvert\cdot\lvert\mathcal{A}\rvert}{\delta}}\right].$

$T=\Omega\left(\frac{8R_{max}^{2}}{\epsilon^{2}\left(1-\beta\right)^{2}}\ln\frac{2\lvert\mathcal{X}\rvert\cdot\lvert\mathcal{A}_{k}\rvert}{\delta}\right)$
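The Q-function update and the visit-count learning rate listed above can be illustrated with a minimal tabular sketch for a single agent. It is an illustration under stated assumptions, not the authors' implementation: the class name `QAgent`, its method names, and the reading of $t(\mathbf{x},\mathbf{a})$ as the number of earlier visits to a state-action pair are all hypothetical. The discount factor 0.9 and the 10% exploration probability are taken from Table II.

```python
import random
from collections import defaultdict


class QAgent:
    """Hypothetical single-agent tabular Q-learner (illustrative only)."""

    def __init__(self, actions, beta=0.9, epsilon=0.1, seed=0):
        self.actions = list(actions)      # e.g. transmit power levels in dBm
        self.beta = beta                  # discount factor (Table II: 0.9)
        self.epsilon = epsilon            # exploration probability (Table II: 10%)
        self.q = defaultdict(float)       # Q[(state, action)], initialised to 0
        self.visits = defaultdict(int)    # assumed meaning of t(x, a): past visits
        self.rng = random.Random(seed)

    def act(self, state):
        """Epsilon-greedy action selection."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """One temporal-difference update with alpha = 1 / (1 + t(x, a))."""
        alpha = 1.0 / (1.0 + self.visits[(state, action)])
        self.visits[(state, action)] += 1
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_error = reward + self.beta * best_next - self.q[(state, action)]
        self.q[(state, action)] += alpha * td_error
```

With this learning rate the first visit to a pair overwrites the initial value entirely, and later visits average in new information with decreasing weight.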

Full text

Reinforcement Learning for Self Organization and Power Control of Two-Tier Heterogeneous Networks

Roohollah Amiri, Mojtaba Ahmadi Almasi, Jeffrey G. Andrews, Hani Mehrpouyan

This work was presented in part at ICC 2018 [1]. R. Amiri, M. A. Almasi, and H. Mehrpouyan are with the Department of Electrical and Computer Engineering, Boise State University, Boise, ID, USA (e-mail: [email protected]; [email protected]; [email protected]). J. G. Andrews (e-mail: [email protected]) is with the University of Texas at Austin, USA. Last revised March 7, 2024.

Abstract

Self-organizing networks (SONs) can help manage the severe interference in dense heterogeneous networks (HetNets). Given their need to automatically configure power and other settings, machine learning is a promising tool for data-driven decision making in SONs. In this paper, a HetNet is modeled as a dense two-tier network with conventional macrocells overlaid with denser small cells (e.g. femto or pico cells). First, a distributed framework based on a multi-agent Markov decision process is proposed that models the power optimization problem in the network. Second, we present a systematic approach for designing a reward function based on the optimization problem. Third, we introduce a Q-learning based distributed power allocation algorithm (Q-DPA) as a self-organizing mechanism that enables ongoing transmit power adaptation as new small cells are added to the network. Further, the sample complexity of the Q-DPA algorithm to achieve $\epsilon$ -optimality with high probability is provided. We demonstrate that, at a density of several thousand femtocells per km$^{2}$, the required quality of service of a macrocell user can be maintained via the proper selection of independent or cooperative learning and appropriate Markov state models.

Index Terms:

Self-organizing networks, HetNets, Reinforcement learning, Markov decision process.

I Introduction

Self-organization is a key feature as cellular networks densify and become more heterogeneous through the addition of small cells such as pico and femtocells [2, 3, 4, 5, 6]. Self-organizing networks (SONs) can perform self-configuration, self-optimization and self-healing. These operations can cover basic tasks such as configuration of a newly installed base station (BS), resource management, and fault management in the network [7]. In other words, SONs attempt to minimize human intervention by using measurements from the network to reduce the cost of installation, configuration and maintenance of the network. In fact, SONs bring two main factors into play: intelligence and autonomous adaptability [2, 3]. Therefore, machine learning techniques can play a major role in processing underutilized sensory data to enhance the performance of SONs [8, 9].

One of the main responsibilities of SONs is to configure the transmit power at various small BSs to manage interference. In fact, a small BS needs to configure its transmit power before joining the network (as self-configuration). Subsequently, it needs to dynamically control its transmit power during its operation in the network (as self-optimization). To address these two issues, we consider a macrocell network overlaid with small cells and focus on autonomous distributed power control, which is a key element of self-organization since it improves network throughput [10, 11, 12, 13, 14] and minimizes energy usage [15, 16, 17]. We rely on local measurements, such as signal-to-interference-plus-noise ratio (SINR), and the use of machine learning to develop a SON framework that can continually improve the above performance metrics.

I-A Related Work

In wireless communications, dynamic power control with the use of machine learning has been implemented via reinforcement learning (RL). In this context, RL is an area of machine learning that attempts to optimize a BS’s transmit power to achieve a certain goal such as throughput maximization. One of the main advantages of RL over supervised learning methods is that its training phase does not need correct input/output data. Instead, RL operates by applying the experience that it has gained through interacting with the network [18]. RL methods have been applied in the field of wireless communications in areas such as resource management [19, 20, 21, 22, 23, 24], energy harvesting [25], and opportunistic spectrum access [26, 27]. A comprehensive review of RL applications in wireless communications can be found in [28].

Q-learning is a model-free RL method [29]. The model-free feature of Q-learning makes it a proper method for scenarios in which the statistics of the network continuously change. Further, Q-learning has low computational complexity and can be implemented by BSs in a distributed manner [1]. Therefore, Q-learning can bring scalability, robustness, and computational efficiency to large networks. However, designing a proper reward function which accelerates the learning process and avoids false learning or unlearning phenomena [30] is not trivial. Therefore, to solve an optimization problem, an appropriate reward function for Q-learning needs to be determined.

In this regard, the works in [19, 20, 21, 22, 23, 24] have proposed different reward functions to optimize power allocation between femtocell base stations (FBSs). The method in [19] uses independent Q-learning in a cognitive radio system to set the transmit power of secondary BSs in a digital television system. The solution in [19] ensures that the minimum quality of service (QoS) for the primary user is met by applying Q-learning and using the SINR as a metric. However, the approach in [19] does not take the QoS of the secondary users into consideration. The work in [20] uses cooperative Q-learning to maximize the sum transmission rate of the femtocell users while keeping the transmission rate of macrocell users near a certain threshold. Further, the authors in [21] have used the proximity of FBSs to a macrocell user as a factor in the reward function. This results in a fair power allocation scheme in the network. Their proposed reward function keeps the transmission rate of the macrocell user above a certain threshold while maximizing the sum transmission rate of FBSs. However, by not considering a minimum threshold for the FBSs’ rates, the approach in [21] fails to support some FBSs as the density of the network (and consequently interference) increases. The authors in [22] model the cross-tier interference management problem as a non-cooperative game between femtocells and the macrocell. In [22], femtocells use the average SINR measurement to enhance their individual performances while maintaining the QoS of the macrocell user. In [23], the authors attempt to improve the transmission rate of cell-edge users while keeping the fairness between the macrocell and the femtocell users by applying a round robin approach. The work in [24] minimizes power usage in a Long Term Evolution (LTE) enterprise femtocell network by applying an exponential reward function without the requirement to achieve fairness amongst the femtocells in the network.

In the above works, the reward functions do not apply to dense networks. That is to say, first, there is no minimum threshold for the achievable rate of the femtocells. Second, the reward functions are designed to limit the macrocell user rate to its required QoS and not more than that. This property encourages an FBS to use more power to increase its own rate by assuming that the caused interference just affects the macrocell user. However, the neighbor femtocells suffer from this decision and overall the sum rate of the network decreases. Further, they do not provide a generalized framework for modeling a HetNet as a multi-agent RL network or a procedure to design a reward function which meets the QoS requirements of the network. In this paper, we focus on dense networks and try to provide a general solution to the above challenges.

I-B Contributions

We propose a learning framework based on a multi-agent Markov decision process (MDP). By considering an FBS as an agent, the proposed framework enables FBSs to join and adapt to a dense network autonomously. Due to the unplanned and dense deployment of femtocells, providing the required QoS to all the users in the network becomes an important issue. Therefore, we design a reward function that trains the FBSs to achieve this goal. Furthermore, we introduce a Q-learning based distributed power allocation approach (Q-DPA) as an application of the proposed framework. Q-DPA uses the proposed reward function to maximize the transmission rate of femtocells while prioritizing the QoS of the macrocell user. More specifically, the contributions of the paper can be summarized as follows:

1. We propose a framework that is agnostic to the choice of learning method but also connects the required RL analogies to wireless communications. The proposed framework models a multi-agent network with a single MDP that contains the joint action of all the agents as its action set. Next, we introduce MDP factorization methods to provide a distributed and scalable architecture for the proposed framework. The proposed framework is used to benchmark the performance of different learning rates, Markov state models, or reward functions in two-tier wireless networks.

2. We present a systematic approach for designing a reward function based on the optimization problem and the nature of RL. In fact, due to the scarcity of resources in a dense network, we propose some properties for a reward function to maximize the sum transmission rate of the network while considering the minimum requirements of all users. The procedure is simple and general, and the designed reward function takes the form of low-complexity polynomials. Further, the designed reward function increases the achievable sum transmission rate of the network while consuming considerably less power than greedy algorithms.

3. We propose Q-DPA as an application of the proposed framework to perform distributed power allocation in a dense femtocell network. Q-DPA uses the factorization method to derive independent and cooperative learning from the optimal solution. Q-DPA uses local signal measurements at the femtocells to train the FBSs in order to: (i) maximize the transmission rate of femtocells, (ii) achieve the minimum required QoS for all femtocell users with a high probability, and (iii) maintain the QoS of the macrocell user in a densely deployed femtocell network. In addition, we determine the minimum number of samples that is required to achieve an $\epsilon$ -optimal policy in Q-DPA as its sample complexity.

4. We introduce four different learning configurations based on different combinations of independent/cooperative learning and Markov state models. We conduct extensive simulations to quantify the effect of different learning configurations on the performance of the network. Simulations show that the proposed Q-DPA algorithm can decrease power usage and as a result reduce the interference to the macrocell user.

The paper is organized as follows. In Section II, the system model is presented. Section III introduces the optimization problem and presents the existing challenges in solving this problem. Section IV presents the proposed learning framework which models a two-tier femtocell network with a multi-agent MDP. Section V-A presents the Q-DPA algorithm as an application of the proposed framework. Section VI presents the simulation results while Section VII concludes the paper.

Notation: Lower case, boldface lower case, and calligraphic symbols represent scalars, vectors, and sets, respectively. For a real-valued function $Q:\mathcal{Z}\rightarrow\mathbb{R}$ , $\lVert Q\rVert$ denotes the max norm, i.e., $\lVert Q\rVert=\max_{z\in\mathcal{Z}}\ \lvert Q\left(z\right)\rvert$ . $\mathbb{E}_{x}\left[\cdot\right]$ , $\mathbb{E}_{x}\left[\cdot|\cdot\right]$ , and $\frac{\partial f}{\partial x}$ denote the expectation, the conditional expectation, and the partial derivative with respect to $x$ , respectively. Further, $\Pr\left(\cdot|\cdot\right)$ and $|\cdot|$ denote the conditional probability and absolute value operators, respectively.

II Downlink System Model

Consider the downlink of a single cell of a HetNet operating over a set $\mathcal{S}=\left\{1,...,S\right\}$ of $S$ orthogonal subbands. A single macro base station (MBS) is deployed in the cell. The MBS serves one macrocell user equipment (MUE) over each subband while guaranteeing this user a minimum average SINR over each subband, which is denoted by $\Gamma_{0}$ . A set of FBSs is deployed in the coverage area of the macrocell. Each FBS selects a random subband and serves one femtocell user equipment (FUE). We assume that overall, on each subband $s\in\mathcal{S}$ , a set $\mathcal{K}=\left\{1,...,K\right\}$ of $K$ FBSs is operating. Each FBS guarantees a minimum average SINR denoted by $\Gamma_{k}$ to its related FUE. We consider a dense network in which the density results in both cross-tier and co-tier interference. Therefore, in order to control the interference level and provide the users with their required minimum SINR, we focus on power allocation in the downlink of the femtocell network. Uplink results can be obtained in a similar fashion but are not included for brevity. The overall network configuration is presented in Fig. 1. We focus on one subband, although the proposed solution can be extended to a case in which each FBS supports multiple users on different subbands.

We denote the MBS-MUE pair by the index $0$ and the FBS-FUE pairs by the index $k$ from the set $\mathcal{K}$ . In the downlink, the received signal at the MUE operating over subband $s$ includes interference from the femtocells and thermal noise. Hence, the SINR at the MUE operating over subband $s\in\mathcal{S}$ , $\gamma_{0}$ , is calculated as

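$\gamma_{0}=\frac{p_{0}\lvert h_{0,0}\rvert^{2}}{\underbrace{\sum_{k\in\mathcal{K}}p_{k}\lvert h_{k,0}\rvert^{2}}_{\text{femtocells' interference}}+N_{0}},$    (1)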

where $p_{0}$ denotes the power transmitted by the MBS and $h_{0,0}$ denotes the channel gain from the MBS to the MUE. Further, the power transmitted by the $k$ th FBS is denoted by $p_{k}$ and the channel gain from the $k$ th FBS to the MUE is denoted by $h_{k,0}$ . Finally, $N_{0}$ denotes the variance of the additive white Gaussian noise. Similarly, the SINR at the $k$ th FUE operating over subband $s\in\mathcal{S}$ , $\gamma_{k}$ , is obtained as

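$\gamma_{k}=\frac{p_{k}\lvert h_{k,k}\rvert^{2}}{\underbrace{p_{0}\lvert h_{0,k}\rvert^{2}}_{\text{macrocell's interference}}+\underbrace{\sum_{j\in\mathcal{K}\setminus\{k\}}p_{j}\lvert h_{j,k}\rvert^{2}}_{\text{femtocells' interference}}+N_{k}},$    (2)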

where $h_{k,k}$ denotes the channel gain between the $k$ th FBS and the $k$ th FUE, $h_{0,k}$ denotes the channel gain between the MBS and the $k$ th FUE, $p_{j}$ denotes the transmit power of the $j$ th FBS, $h_{j,k}$ is the channel gain between the $j$ th FBS and the $k$ th FUE, and $N_{k}$ is the variance of the additive white Gaussian noise. Finally, the transmission rates, normalized by the transmission bandwidth, at the MUE and the FUE operating over subband $s\in\mathcal{S}$ , i.e., $r_{0}$ and $r_{k}$ , respectively, are expressed as $r_{0}=\log_{2}\left(1+\gamma_{0}\right)$ and $r_{k}=\log_{2}\left(1+\gamma_{k}\right),\ k\in\mathcal{K}$ .
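As a rough illustration of the SINR and rate definitions in this section, the following sketch computes $\gamma_{0}$, $\gamma_{k}$, $r_{0}$ and $r_{k}$ for one subband. The function name, the argument layout (index 0 for the MBS-MUE pair, indices 1 to $K$ for the FBS-FUE pairs), and the convention that `gains[j][i]` holds $\lvert h_{j,i}\rvert^{2}$ in linear scale are assumptions made for this example only.

```python
import math


def sinr_and_rates(powers, gains, noise):
    """Per-link SINR and normalized rate on a single subband.

    powers[j]   : transmit power of transmitter j (0 = MBS, 1..K = FBSs), linear scale
    gains[j][i] : channel power gain |h_{j,i}|^2 from transmitter j to receiver i
    noise[i]    : noise variance at receiver i (0 = MUE, 1..K = FUEs)
    """
    n = len(powers)
    sinr = []
    for i in range(n):
        signal = powers[i] * gains[i][i]
        # Every other transmitter on the subband is an interferer: for the MUE
        # these are the FBSs; for FUE k, the MBS and the remaining FBSs.
        interference = sum(powers[j] * gains[j][i] for j in range(n) if j != i)
        sinr.append(signal / (interference + noise[i]))
    rates = [math.log2(1.0 + g) for g in sinr]
    return sinr, rates


# Hypothetical numbers for one MBS and two FBSs.
example_gains = [[1.0, 0.1, 0.1],
                 [0.2, 1.0, 0.1],
                 [0.2, 0.1, 1.0]]
example_sinr, example_rates = sinr_and_rates([1.0, 0.5, 0.5], example_gains, [0.1, 0.1, 0.1])
```

In practice an FBS would obtain these SINR values from local measurements rather than from a full gain matrix; the matrix is used here only to make the formulas concrete.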

III Problem Formulation

Each FBS has the objective of maximizing its transmission rate while ensuring that the SINR of the MUE is above the required threshold, i.e., $\Gamma_{0}$ . Denoting $\mathbf{p}=\left\{p_{1},...,p_{K}\right\}$ as the vector of the transmit powers of the $K$ FBSs operating over the subband $s\in\mathcal{S}$ , the power allocation problem is presented as follows

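$\underset{\mathbf{p}}{\text{maximize}}\ \sum_{k\in\mathcal{K}}\log_{2}\left(1+\gamma_{k}\right)$    (3)

$\text{subject to}\quad p_{k}\leq p_{max},\ k\in\mathcal{K},$    (3a)

$\gamma_{0}\geq\Gamma_{0},$    (3b)

$\gamma_{k}\geq\Gamma_{k},\ k\in\mathcal{K},$    (3c)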

where $p_{max}$ defines the maximum available transmit power at each FBS. The objective (3) is to maximize the sum transmission rate of the FUEs. Constraint (3a) refers to the power limitation of every FBS. Constraints (3b) and (3c) ensure that the minimum SINR requirement is satisfied for the MUE and the FUEs. The addition of constraint (3c) to the optimization problem is one of the differences between the proposed approach in this paper and that of [19, 20, 21, 22, 23, 24].
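To make the objective and the three constraints concrete, the sketch below evaluates a candidate power vector and, for a very small network, searches a discrete power grid exhaustively. All names (`fue_sum_rate_if_feasible`, `exhaustive_search`, `gamma_min`) and all numerical values are hypothetical, and exhaustive search is shown only as a reference point, not as the method proposed in the paper; its cost grows as the number of power levels raised to the power $K$.

```python
import itertools
import math


def fue_sum_rate_if_feasible(p_fbs, p0, gains, noise, gamma_min, p_max):
    """Return the sum FUE rate if (3a)-(3c) hold for p_fbs, otherwise None.

    gains[j][i] is |h_{j,i}|^2 with index 0 for the MBS/MUE pair (assumed layout);
    gamma_min[0] plays the role of Gamma_0 and gamma_min[k] of Gamma_k.
    """
    if any(p > p_max for p in p_fbs):                      # constraint (3a)
        return None
    powers = [p0] + list(p_fbs)
    n = len(powers)
    sinr = []
    for i in range(n):
        interference = sum(powers[j] * gains[j][i] for j in range(n) if j != i)
        sinr.append(powers[i] * gains[i][i] / (interference + noise[i]))
    if any(sinr[i] < gamma_min[i] for i in range(n)):      # constraints (3b), (3c)
        return None
    return sum(math.log2(1.0 + g) for g in sinr[1:])       # objective (3)


def exhaustive_search(levels, K, **problem):
    """Try every combination of power levels; only viable for very small K."""
    best = None
    for candidate in itertools.product(levels, repeat=K):
        value = fue_sum_rate_if_feasible(candidate, **problem)
        if value is not None and (best is None or value > best[1]):
            best = (candidate, value)
    return best


# Hypothetical two-FBS example in linear power units.
problem = dict(
    p0=1.0,
    gains=[[1.0, 0.1, 0.1], [0.2, 1.0, 0.1], [0.2, 0.1, 1.0]],
    noise=[0.1, 0.1, 0.1],
    gamma_min=[3.0, 0.5, 0.5],
    p_max=1.0,
)
best = exhaustive_search([0.25, 0.5, 1.0], K=2, **problem)
```

In this toy instance, transmitting at full power violates the MUE constraint, so the best feasible point uses an intermediate power level at both FBSs.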

Considering (2), it can be concluded that the optimization in (3) is a non-convex problem for dense networks. This follows from the SINR expression in (2) and the objective function (3). More specifically, the interference term due to the neighboring femtocells in the denominator of (2) ensures that the optimization problem in (3) is not convex [31]. This interference term may be ignored in low density networks but cannot be ignored in dense networks consisting of a large number of femtocells [32]. However, non-convexity is not the only challenge of the above problem. Many iterative algorithms have been developed that solve the above optimization problem with excellent performance, but they involve expensive computations in each iteration, such as matrix inversion, bisection, or singular value decomposition, which makes their real-time implementation challenging [33]. Besides, the $k$ th FBS is only aware of its own transmit power, $p_{k}$ , and does not know the transmit powers of the remaining FBSs. Therefore, the idea here is to treat the given problem as a black box and to learn the relation between the transmit power and the resulting transmission rate gradually, by interacting with the network and using simple computations.

To realize self-organization, each FBS should be able to operate autonomously. This means an FBS should be able to connect to the network at any time and to continuously adapt its transmit power to achieve its objectives. Therefore, our optimization problem requires a self-adaptive solution. The steps for achieving self-adaptation can be summarized as follows: (i) the FBS measures the interference level at its related FUEs, and (ii) it determines the maximum transmit power that supports its FUEs while not greatly degrading the performance of other users in the network. In the next section, the required framework to solve this problem will be presented.

IV The Proposed Learning Framework

Here, first we model a multi-agent network as an MDP. Then the required definitions, evaluation methods, and factorization of the MDP to develop a distributed learning framework are explained. Subsequently, the femtocell network is modeled as a multi-agent MDP and the proposed learning framework is developed.

IV-A Multi-Agent MDP and Policy Evaluation

A single-agent MDP comprises an agent, an environment, an action set, and a state set. The agent can transition between different states by choosing different actions. The trace of actions that is taken by the agent is called its policy. With each transition, the agent receives a reward from the environment as a consequence of its action, and accumulates the discounted sum of these rewards as a cumulative reward. The agent continues its behavior with the goal of maximizing the cumulative reward, and the value of the cumulative reward evaluates the chosen policy. Discounting gives more weight to near-term rewards and reduces the effect of later ones. If the number of transitions is limited, the non-discounted sum of rewards can be used as well.

A multi-agent MDP consists of a set, $\mathcal{K}$ , of $K$ agents. The agents select actions to move between different states of the model to maximize the cumulative reward received by all the agents. Here, we again formulate the network of agents as one MDP, i.e., we define the action set as the joint action set of all the agents. Therefore, the multi-agent MDP framework is defined by the tuple $\left(\mathcal{A},\mathcal{X},Pr,\mathbf{R}\right)$ with the following definitions.

• $\mathcal{A}$ is the joint set of all the agents’ actions. An agent $k$ selects its action $a_{k}$ from its action set $\mathcal{A}_{k}$ , i.e., $a_{k}\in\mathcal{A}_{k}$ . The joint action set is represented as $\mathcal{A}=\mathcal{A}_{1}\times\cdots\times\mathcal{A}_{K}$ , with $\mathbf{a}\in\mathcal{A}$ as a single joint action.

• The state of the system is defined with a set of random variables. Each random variable is represented by $X_{i}$ with $i=1,...,n$ , and the state set is represented as $\mathcal{X}=\left\{X_{1},X_{2},...,X_{n}\right\}$ , where $\mathbf{x}\in\mathcal{X}$ denotes a single state of the system. Each random variable reflects a specific feature of the network.

• The transition probability function, $\Pr\left(\mathbf{x},\mathbf{a},\mathbf{x}^{\prime}\right)$ , represents the probability of taking joint action $\mathbf{a}$ at state $\mathbf{x}$ and ending in state $\mathbf{x}^{\prime}$ . In other words, the transition probability function defines the environment with which the agents interact.

• $\mathbf{R}\left(\mathbf{x},\mathbf{a}\right)$ is the reward function, whose value is the reward received by the agents for taking joint action $\mathbf{a}$ at state $\mathbf{x}$ .

We define $\pi:\mathcal{X}\rightarrow\mathcal{A}$ as the policy function, where $\pi\left(\mathbf{x}\right)$ is the joint action that is taken at the state $\mathbf{x}$ . In order to evaluate the policy $\pi\left(\mathbf{x}\right)$ , a value function $V_{\pi}\left(\mathbf{x}\right)$ and an action-value function $\mathbf{Q}_{\pi}\left(\mathbf{x},\mathbf{a}\right)$ are defined. The value of the policy $\pi$ in state $\mathbf{x}^{\prime}\in\mathcal{X}$ is defined as [18]

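$V_{\pi}\left(\mathbf{x}^{\prime}\right)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\beta^{t}\mathbf{R}^{\left(t+1\right)}\,\middle|\,\mathbf{x}^{\left(0\right)}=\mathbf{x}^{\prime}\right],$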

in which $\beta\in\left(0,1\right]$ is a discount factor, $\mathbf{R}^{\left(t+1\right)}$ is the received reward at time step $t+1$ , and $\mathbf{x}^{\left(0\right)}$ is the initial state. The action-value function, $\mathbf{Q}_{\pi}\left(\mathbf{x},\mathbf{a}\right)$ , represents the value of the policy $\pi$ for taking joint action $\mathbf{a}$ at state $\mathbf{x}$ and then following policy $\pi$ for subsequent iterations. According to [18], the relation between the value function and the action-value function is given by

[TABLE]

For the ease of notation, we will use $V$ and $\mathbf{Q}$ for the value function and the action-value function of policy $\pi$ , respectively. Further, we use the term Q-function to refer to the action-value function. The optimal value of state $\mathbf{x}$ is the maximum value that can be reached by following any policy and starting at this state. An optimal value function $V^{*}$ , which gives an optimal policy $\pi^{*}$ , satisfies the Bellman optimality equation as [18]

[TABLE]

where $\mathbf{Q}^{*}\left(\mathbf{x},\mathbf{a}\right)$ is an optimal Q-function under policy $\pi^{*}$ . The general approach to solving (3f) is to start from an arbitrary policy and use the generalized policy iteration (GPI) method [18] to iteratively evaluate and improve the chosen policy until an optimal policy is reached. If the agents have a priori information about the environment, i.e., $\Pr\left(\mathbf{x},\mathbf{a},\mathbf{x}^{\prime}\right)$ is known to the agents, dynamic programming can be used to solve (3f). However, the environment is unknown in most practical applications. Hence, we rely on reinforcement learning (RL) to derive an optimal Q-function. RL uses temporal-difference learning to provide a real-time solution for the GPI method [18]. As a result, in Section V-A, we use Q-learning, as a specific method of RL, to solve (3f).
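To make the dynamic-programming case concrete, the following is a minimal sketch of value iteration on a toy MDP whose transition probabilities are known. The two-state, two-action MDP, the array layout `P[a, x, y]`, and the function name `value_iteration` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def value_iteration(P, R, beta=0.9, tol=1e-8, max_iter=10_000):
    """Iterate the Bellman optimality backup when the model is known.

    P[a, x, y] is the probability of moving from state x to y under action a.
    R[x, a] is the reward for taking action a in state x.
    """
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(max_iter):
        V = Q.max(axis=1)                                # V(x) = max_a Q(x, a)
        Q_new = R + beta * np.einsum("axy,y->xa", P, V)  # one Bellman backup
        converged = np.max(np.abs(Q_new - Q)) < tol
        Q = Q_new
        if converged:
            break
    return Q, Q.argmax(axis=1)

# Hypothetical toy MDP: action 0 always leads to state 0, action 1 to state 1.
P = np.array([
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
])
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])
Q_star, policy = value_iteration(P, R, beta=0.5)
```

In this toy case the greedy policy picks action 1 in both states. When `P` is unknown, this backup cannot be evaluated directly, which is the situation Q-learning addresses.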

IV-B Factored MDP

To this point, we defined the Q-function over the joint state-action space of all the agents, i.e., $\mathcal{X}\times\mathcal{A}$ . We refer to this Q-function as the global Q-function. According to [29], Q-learning finds the optimal solution to a single MDP with probability one. However, in large MDPs, due to the exponential increase in the size of the joint state-action space with respect to the number of agents, the solution to the problem becomes intractable. To resolve this issue, we use factored MDPs as a decomposition technique for large MDPs. The idea in factored MDPs is that many large MDPs are generated by systems with many parts that are weakly interconnected. Each part has its associated state variables, and the state space can be factored into subsets accordingly. The definition of the subsets affects the optimality of the solution [34], and investigating the optimal factorization method helps with understanding the optimality of multi-agent RL solutions [35]. In [36], power control of a multi-hop network is modeled as an MDP and the state set is factorized into multiple subsets, each referring to a single hop. The authors in [37] show that the subsets can be defined based on the local knowledge of the agents from the environment. Meanwhile, we aim to distribute the power control to the nodes of the network. Therefore, due to the definition of the problem in Section III and the fact that each FBS is only aware of its own power, we use the assumption in [37] and define the individual action sets of the agents, i.e., $\mathcal{A}_{k}$ , as the subsets of the joint action set. Consequently, the resultant Q-function for the $k$ th agent is defined as $Q_{k}\left(\mathbf{x}_{k},a_{k}\right)$ , in which $a_{k}\in\mathcal{A}_{k}$ , $\mathbf{x}_{k}\in\mathcal{X}_{k}$ is the state vector of the $k$ th agent, and $\mathcal{X}_{k},\ k\in\mathcal{K}$ , are the subsets of the global state set of the system, i.e., $\mathcal{X}$ .

In factored MDPs, we assume that the reward function is factored based on the subsets, i.e.,

[TABLE]

where $R_{k}\left(\mathbf{x}_{k},a_{k}\right)$ is the local reward function of the $k$ th agent. Moreover, we assume that the transition probabilities are also factored, i.e., for the $k$ th subsystem we have

[TABLE]

The value function for the global MDP is given by

[TABLE]

where $V_{k}\left(\mathbf{x}_{k}\right)$ is the value function of the $k$ th agent. Therefore, the value function of the derived policy is equal to the linear combination of the local value functions. Further, according to (3e), for each agent $k\in\mathcal{K}$

[TABLE]

and for the global Q-function

[TABLE]

Therefore, based on the assumptions in (3g) and (3h), the global Q-function can be approximated with the linear combination of local Q-functions. Further, (3k) results in a distributed and scalable architecture for the framework.
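The practical consequence of this additive structure can be sketched as follows: if the global Q-function is a sum of local Q-functions, each over its own state and action, then maximizing each local table separately also maximizes the sum. The table sizes, the random entries, and the helper names below are hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
K, n_states, n_actions = 3, 4, 5          # hypothetical sizes
local_Q = [rng.normal(size=(n_states, n_actions)) for _ in range(K)]

def global_q(local_Q, states, actions):
    """Global Q-value approximated as the sum of the local Q-values."""
    return sum(Q[x, a] for Q, x, a in zip(local_Q, states, actions))

def greedy_joint_action(local_Q, states):
    """Each agent maximizes its own table; no joint search is needed."""
    return [int(np.argmax(Q[x])) for Q, x in zip(local_Q, states)]

states = [0, 2, 3]
# Brute-force search over all joint actions, for comparison only.
brute_force = max(
    itertools.product(range(n_actions), repeat=K),
    key=lambda acts: global_q(local_Q, states, acts),
)
per_agent = greedy_joint_action(local_Q, states)

local_entries = K * n_states * n_actions            # storage with factorization
joint_entries = (n_states ** K) * (n_actions ** K)  # storage of a joint table
```

With these sizes the factored representation stores 60 entries, whereas a single table over the joint space would need 8000, and the gap grows exponentially with the number of agents.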

IV-C Femtocell Network as Multi-Agent MDP

In a wireless communication system, the resource management policy is equivalent to the policy function in an MDP. To integrate the femtocell network in a multi-agent MDP, we define the following according to Fig. 2.

•

Environment: From the view point of an FBS, the environment is comprised of the macrocell and all other femtocells.

•

Agent: Each FBS is an independent agent in the MDP. In this paper, the terms agent and FBS are used interchangeably. An agent has three objectives: (i) improving its sum transmission rate, (ii) guaranteeing the required SINR for its user (i.e., $\Gamma_{k}$ ), and (iii) meeting the required SINR for the MUE.

•

Action set ( $\mathcal{A}_{k}$ ): The transmit power level is the action of an FBS. The $k$ th FBS chooses its transmit power from the set $\mathcal{A}_{k}$ , which covers the range between $p_{\text{min}}$ and $p_{\text{max}}$ , the minimum and maximum transmit power of the FBS, respectively. In general, the FBS has no knowledge of the environment and chooses its actions with equal probability in the training mode. Therefore, equal step sizes of $\Delta p$ are chosen between $p_{\text{min}}$ and $p_{\text{max}}$ to construct the set $\mathcal{A}_{k}$ .

•

State set ( $\mathcal{X}_{k}$ ): The state set directly affects the performance of the MUE and the FUEs. To this end, we define four variables to represent the state of the network. The state set variables are defined based on the constraints of the optimization problem in (3). We define the variables $X_{1}$ and $X_{2}$ as indicators of the performance of the FUE and the MUE. In addition, the relative location of an FBS with respect to the MUE and the MBS is important because it affects the interference power at the MUE caused by the FBS and the interference power at the femtocell caused by the MBS. Therefore, we define $X_{3}$ as an indicator of the interference imposed on the MUE by the FBS, and $X_{4}$ as an indicator of the interference imposed on the femtocell by the MBS. The state variables are defined as

–

$X_{1}\in\left\{0,1\right\}$ : The value of $X_{1}$ indicates whether the FBS is supporting its FUE with the required minimum SINR or not. $X_{1}$ is defined as $X_{1}=\mathbbm{1}_{\left\{\gamma_{k}\geq\Gamma_{k}\right\}}$ .

–

$X_{2}\in\left\{0,1\right\}$ : The value of $X_{2}$ indicates whether the MUE is being supported with its required minimum SINR or not. $X_{2}$ is defined as $X_{2}=\mathbbm{1}_{\left\{\gamma_{0}\geq\Gamma_{0}\right\}}$ .

–

$X_{3}\in\left\{0,1,2,...,N_{1}\right\}$ : The value of $X_{3}$ defines the location of the FBS relative to $N_{1}$ concentric rings around the MUE. The radii of the rings are $d_{1}$ , $d_{2}$ , … , $d_{N_{1}}$ .

–

$X_{4}\in\left\{0,1,2,...,N_{2}\right\}$ : The value of $X_{4}$ defines the location of the FBS relative to $N_{2}$ concentric rings around the MBS. The radii of the rings are $d^{\prime}_{1}$ , $d^{\prime}_{2}$ , … , $d^{\prime}_{N_{2}}$ .

The $k$ th FBS calculates $\gamma_{k}$ based on the channel quality indicator (CQI) received from its related FUE to assess $X_{1}$ . The MBS is aware of the SINR of the MUE, i.e., $\gamma_{0}$ , and of the location of the FBS relative to itself and the MUE. Therefore, the FBS obtains the information required to assess the $X_{2}$ , $X_{3}$ , and $X_{4}$ variables via backhaul and feedback from the MBS.

Here, we defined the state variables as a function of each FBS’s SINR and location. Therefore, in the high-SINR regime, the states of the FBSs can be assumed to be independent of each other.

In Section VI, we will examine different possible state sets to investigate the effect of the above state variables on the performance of the network.
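The state definition above can be sketched in code as follows. The rule that maps a distance to a ring index (0 inside the innermost ring, $N$ outside the outermost one), the ring radii, and the function names are assumptions made for illustration; the text only specifies the value ranges of $X_{3}$ and $X_{4}$.

```python
from bisect import bisect_left

def ring_index(distance, radii):
    """Map a distance to an index in {0, ..., len(radii)}.

    Assumed convention: 0 means inside the innermost ring and
    len(radii) means outside the outermost ring.
    """
    return bisect_left(sorted(radii), distance)

def encode_state(sinr_fue, gamma_k, sinr_mue, gamma_0,
                 d_to_mue, d_to_mbs, rings_mue, rings_mbs):
    x1 = int(sinr_fue >= gamma_k)          # FUE meets its minimum SINR
    x2 = int(sinr_mue >= gamma_0)          # MUE meets its minimum SINR
    x3 = ring_index(d_to_mue, rings_mue)   # location relative to the MUE
    x4 = ring_index(d_to_mbs, rings_mbs)   # location relative to the MBS
    return (x1, x2, x3, x4)

# Hypothetical ring radii in meters with N1 = N2 = 3.
state = encode_state(
    sinr_fue=2.0, gamma_k=0.414, sinr_mue=10.0, gamma_0=15.0,
    d_to_mue=15.0, d_to_mbs=340.0,
    rings_mue=[10.0, 20.0, 30.0], rings_mbs=[50.0, 150.0, 400.0],
)
```

Since $X_{3}$ and $X_{4}$ depend only on location, they stay constant for a fixed FBS, so only $X_{1}$ and $X_{2}$ change from frame to frame.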

V Q-DPA, Reward Function, and Sample Complexity

In this section, we present Q-DPA, which is an application of the proposed framework. Q-DPA details the learning method, the learning rate, and the training procedure. Then, the proposed reward function is defined. Finally, the required sample complexity for the training is derived.

V-A Q-learning Based Distributed Power Allocation (Q-DPA)

To solve the Bellman equation in (3f), we use Q-learning. The reasoning for choosing the RL method and advantages of Q-learning are explained in Sections IV-A and I-A, respectively. The Q-learning update rule to evaluate a policy for the global Q-function can be represented as [29]

[TABLE]

where $\mathbf{a^{\prime}}\in\mathcal{A}$ , $\alpha^{\left(t\right)}\left(\mathbf{x},\mathbf{a}\right)$ denotes the learning rate at time step $t$ , and $\mathbf{x}^{\left(t+1\right)}$ is the new state of the network. The term $M$ is the maximum value of the global Q-function that is available at the new state $\mathbf{x}^{\left(t+1\right)}$ . After each iteration, the FBSs receive the delayed reward $\mathbf{R}^{\left(t+1\right)}\left(\mathbf{x}^{\left(t\right)},\mathbf{a}^{\left(t\right)}\right)$ and the global Q-function is then updated according to (3l).

In prior works [19, 20, 21, 23, 24], a constant learning rate was used for Q-learning to solve the required optimization problems. However, according to [38], for a finite number of iterations, the performance of Q-learning can be improved by applying a decaying learning rate. Therefore, we use the following learning rate

[TABLE]

in which $t\left(\mathbf{x},\mathbf{a}\right)$ refers to the number of times, until time step $t$ , that the state-action pair $\left(\mathbf{x},\mathbf{a}\right)$ is visited. It is worth mentioning that, by using the above learning rate, we need to keep track of the number of times each state-action pair has been visited during training, which requires more memory. Therefore, at the cost of more memory, a better performance can be achieved.
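The memory cost mentioned above comes from keeping a visit counter per state-action pair, i.e., a second table of the same shape as the Q-table. The sketch below illustrates this bookkeeping with a polynomially decaying rate; the specific decay form and the exponent `omega` are assumptions for illustration and are not claimed to match the expression used in the paper.

```python
import numpy as np

class VisitCountLearningRate:
    """Per-pair decaying learning rate driven by visit counts.

    Assumed form: alpha = 1 / n(x, a) ** omega, where n(x, a) is the
    number of visits to (x, a) so far, including the current one.
    """

    def __init__(self, n_states, n_actions, omega=0.7):
        # Extra memory: one integer per state-action pair.
        self.counts = np.zeros((n_states, n_actions), dtype=np.int64)
        self.omega = omega

    def __call__(self, x, a):
        self.counts[x, a] += 1
        return 1.0 / float(self.counts[x, a]) ** self.omega

lr = VisitCountLearningRate(n_states=2, n_actions=3)
first = lr(0, 1)    # first visit to (0, 1)
second = lr(0, 1)   # second visit, smaller step
other = lr(1, 2)    # a different pair decays independently
```

Rarely visited pairs keep a large step size while frequently visited pairs are updated more conservatively, which a single global decay schedule would not provide.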

There are two alternatives available for the training of new FBSs as they join the network: independent learning or cooperative learning. In independent learning, each FBS tries to maximize its own Q-function. In other words, using the factorization method in Section IV-B, the term $M$ in (3l) is approximated as

[TABLE]

In cooperative learning, the FBSs share their local Q-functions and will assume that the FBSs with the same state make the same decision. Hence, term $M$ is approximated as

[TABLE]

where $\mathcal{K}^{\prime}$ is the set of FBSs with the same state $\mathbf{x}_{k}^{\left(t+1\right)}$ . Cooperative Q-learning may result in a higher cumulative reward [39]. However, cooperation will result in the same policy for FBSs with the same state and additional overhead since the Q-functions between FBSs need to be shared over the backhaul network. The local update rule for the $k$ th FBS can be derived from (3l) as

[TABLE]

where, $R^{\left(t+1\right)}\left(\mathbf{x}_{k}^{\left(t\right)},a_{k}^{\left(t\right)}\right)$ is the reward of the $k$ th FBS, and $a_{k}^{*}$ is defined as

[TABLE]

and

[TABLE]

for independent and cooperative learning, respectively.
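The difference between the two learning modes can be sketched as follows: the local update is the same, and only the choice of $a_{k}^{*}$ changes. The code uses the standard tabular Q-learning form and assumes that every table passed in belongs to an FBS in $\mathcal{K}^{\prime}$ ; the function names and numerical values are hypothetical.

```python
import numpy as np

def select_a_star(q_tables, k, x_next, cooperative):
    """Pick the action used in the bootstrap term of the local update."""
    if cooperative:
        # Assumption: all tables belong to FBSs sharing the state x_next.
        row = sum(q[x_next] for q in q_tables)
    else:
        row = q_tables[k][x_next]
    return int(np.argmax(row))

def local_update(q_tables, k, x, a, reward, x_next, alpha, beta,
                 cooperative=False):
    """One tabular Q-learning step on the k-th agent's table (in place)."""
    a_star = select_a_star(q_tables, k, x_next, cooperative)
    target = reward + beta * q_tables[k][x_next, a_star]
    q_tables[k][x, a] = (1 - alpha) * q_tables[k][x, a] + alpha * target
    return q_tables[k][x, a]

# Two hypothetical agents, two states, three power levels.
q_tables = [np.zeros((2, 3)), np.zeros((2, 3))]
q_tables[0][1] = [1.0, 0.0, 0.0]
q_tables[1][1] = [0.0, 0.0, 3.0]

a_il = select_a_star(q_tables, k=0, x_next=1, cooperative=False)
a_cl = select_a_star(q_tables, k=0, x_next=1, cooperative=True)
new_q = local_update(q_tables, k=0, x=0, a=0, reward=1.0,
                     x_next=1, alpha=0.5, beta=0.9)
```

In this toy case agent 0 alone prefers action 0, while the summed tables favor action 2, showing how cooperation can pull agents in the same state toward a common decision.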

In this paper, a tabular representation is used for the Q-function in which the rows of the table refer to the states and the columns refer to the actions of an agent. Generally, for large state spaces, neural networks are more efficient Q-function representations. However, part of this work focuses on the effect of the state-space variables, so we avoid a large number of state variables. In addition, we provide an exhaustive search solution to investigate the optimality of our solution, which is not feasible for large state spaces.

The training for an FBS happens over $L$ frames. In the beginning of each frame, the FBS chooses an action, i.e., transmit power. Then, the FBS sends a frame to the intended FUE. The FUE feeds back the required measurements such as CQI so the FBS can estimate the SINR at the FUE, and calculate the reward based on (3x). Finally, the FBS updates its Q-table according to (3p).

Due to limited number of training frames, each FBS needs to select its actions in a way that covers most of the action space and improves the policy at the same time. Therefore, the FBS chooses the actions with a combination of exploration and exploitation, known as an $e$ -greedy exploration. In the $e$ -greedy method, the FBS acts greedily with probability $1-e$ (i.e., exploiting) and randomly with probability $e$ (i.e., exploring). In exploitation, the FBS selects an action that has the maximum value in the current state in its own Q-table (independent learning) or in the summation of Q-tables (cooperative learning). In exploring, the FBS selects an action randomly to cover action space and avoid biasing to a local maximum. In [18], it is shown that for a limited number of iterations the $e$ -greedy policy results in a closer final value to the optimal value compared to only exploiting or exploring.
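A minimal sketch of the $e$ -greedy selection rule described above is given next. The Q-row values and the function name are hypothetical; in cooperative learning, the row passed in would be the sum of the corresponding rows of the shared Q-tables.

```python
import numpy as np

def e_greedy(q_row, e, rng):
    """Return an action index: random with probability e, greedy otherwise."""
    if rng.random() < e:
        return int(rng.integers(len(q_row)))   # explore: uniform over actions
    return int(np.argmax(q_row))               # exploit: best known action

rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.5, 0.2])              # hypothetical Q-values of one state

greedy_choice = e_greedy(q_row, e=0.0, rng=rng)
explored = {e_greedy(q_row, e=1.0, rng=rng) for _ in range(200)}
```

Setting `e=0.0` recovers pure exploitation and `e=1.0` recovers pure exploration, the two extremes that the text compares against.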

It is worth mentioning that the overhead of sharing Q-tables depends on the definition of the state model $\mathcal{X}_{k}$ according to Section IV-C. For instance, assume the largest possible state model, $\mathcal{X}_{k}=\left\{X_{1},X_{2},X_{3},X_{4}\right\}$ . The variables $X_{3}$ and $X_{4}$ depend on the location of the FBS and are fixed during training. Therefore, a training FBS uses only four rows of its Q-table (one per combination of $X_{1}$ and $X_{2}$ ) and needs only the same rows from the other FBSs. Hence, if the number of active FBSs is $|\mathcal{K}|$ , the number of messages to the FBS in each training frame is $4\times\left(|\mathcal{K}|-1\right)$ , each of size $|\mathcal{A}_{k}|$ .
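As a quick worked example of this overhead count, the sketch below evaluates the expression for ten active FBSs and eleven power levels (the action-set size implied by the $5$ dBm to $15$ dBm range with $1$ dB steps in Section VI-A). The function name is hypothetical.

```python
def q_sharing_overhead(num_fbs, num_actions, rows_per_fbs=4):
    """Messages received by one training FBS per frame and total Q-values."""
    messages = rows_per_fbs * (num_fbs - 1)   # 4 x (|K| - 1)
    values = messages * num_actions           # each message has |A_k| entries
    return messages, values

messages, values = q_sharing_overhead(num_fbs=10, num_actions=11)
```

The count grows linearly in the number of active FBSs, since only the rows matching the fixed $X_{3}$ and $X_{4}$ values need to be exchanged.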

V-B Proposed Reward Function

The design of the reward function is essential because it directly impacts the objectives of the FBS. Generally, there is no established quantitative approach to designing the reward function. Here, we present a systematic approach for deriving the reward function based on the nature of the optimization problem under consideration. Then, we compare the behavior of the designed reward function to the ones in [19, 20, 21].

The reward function for the $k$ th FBS is represented as $R_{k}$ . According to Section IV-C, the $k$ th FBS has knowledge of the minimum required SINR for the MUE, i.e., $\Gamma_{0}$ , and the minimum required SINR for its related FUE, i.e., $\Gamma_{k}$ . Also, after taking an action in each step, the $k$ th FBS has access to the rate of the MUE, i.e., $r_{0}$ , and the rate of its related FUE, i.e., $r_{k}$ . Therefore, $R_{k}$ is considered as a function of the above four variables as $R_{k}:\left(r_{0},r_{k},\Gamma_{0},\Gamma_{k}\right)\rightarrow\mathbb{R}$ .

In order to design the appropriate reward function, we need to estimate the progress of the $k$ th FBS toward the goals of the optimization problem. Based on the input arguments to the reward function, we define two progress estimators, one for the MUE as $\left(r_{0}-\log_{2}\left(1+\Gamma_{0}\right)\right)$ and one for the $k$ th FUE as $\left(r_{k}-\log_{2}\left(1+\Gamma_{k}\right)\right)$ . To reduce computational complexity, we define the reward function as a polynomial function of the defined progress estimators as

[TABLE]

where, $k_{1}$ and $k_{2}$ are integers and $C\in\mathbb{R}$ is a constant referred to as the bias of the reward function.

The constant bias, $C$ , in the reward function has two effects on the learning algorithm: (i) on the final value of the states for a given policy $\pi$ , and (ii) on the behavior of the agent in the beginning of the learning process, as follows:

Effect of bias on the final value of the states: Assume the reward function, $R_{1}=f\left(\cdot\right)$ , and the reward function $R_{2}=f\left(\cdot\right)+C$ , $C\in\mathbb{R}$ . We define the value of state $\mathbf{x}$ for a given policy $\pi$ using $R_{1}$ as $V_{1}\left(\mathbf{x}\right)$ and the value of the state $\mathbf{x}$ for the same policy using $R_{2}$ as $V_{2}\left(\mathbf{x}\right)$ . According to (3d)

[TABLE]

Therefore, for $\beta<1$ , the bias of the reward function adds the constant value $\frac{C}{1-\beta}$ to the value of the states. However, all the states are affected equally after the convergence of the algorithm.

Effect of bias in the beginning of the learning process: This effect is studied using the action-value function of an agent, i.e., the Q-function. Assume that the Q-function of the agent is initialized with zero values and the reward function is defined as $R=f\left(\cdot\right)+C$ . Further, suppose that the first transition of the agent, from state $\mathbf{x}^{\prime}$ to state $\mathbf{x}^{\prime\prime}$ , happens by taking action $a$ at time step $t$ , i.e., $\mathbf{x}^{\left(t\right)}=\mathbf{x}^{\prime}$ and $\mathbf{x}^{\left(t+1\right)}=\mathbf{x}^{\prime\prime}$ . The update rule at time step $t$ is given by (3p) as

[TABLE]

According to the above, after the first transition from the state $\mathbf{x}^{\prime}$ to the state $\mathbf{x}^{\prime\prime}$ , the Q-value for the state $\mathbf{x}^{\prime}$ is biased by the term (A). If ( $A>0$ ), the value of the state $\mathbf{x}^{\prime}$ increases and if ( $A<0$ ), the value of the state $\mathbf{x}^{\prime}$ decreases. Therefore, the already visited states will be more or less attractive to the agent in the beginning of the learning process as long as the agent has not explored the state-space enough.
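The first effect, a uniform shift of $\frac{C}{1-\beta}$ in the state values, can be checked numerically with a short sketch. The reward sequence is random and purely illustrative, and the infinite discounted sum is truncated at a horizon long enough for the tail to be negligible.

```python
import numpy as np

beta, C, T = 0.9, 2.0, 2000                 # hypothetical discount, bias, horizon
rng = np.random.default_rng(1)
rewards = rng.uniform(-1.0, 1.0, size=T)    # stand-in for f(.) along a trajectory

discounts = beta ** np.arange(T)
V1 = np.sum(discounts * rewards)            # return under R1 = f(.)
V2 = np.sum(discounts * (rewards + C))      # return under R2 = f(.) + C
shift = V2 - V1                             # approximately C / (1 - beta) = 20
```

Because the shift does not depend on the reward sequence, it is the same for every state and therefore does not change which actions are preferred after convergence.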

The change in the behavior of the agent during the learning process can be used to bias the agent towards the desired actions or states. However, in basic Q-learning, the agent has no prior knowledge of the environment. Therefore, we select the bias equal to zero, $C=0$ , and define the reward function as

Definition 1.

The reward function for the $k$ th FBS, $R_{k}:\left(r_{0},r_{k},\Gamma_{0},\Gamma_{k}\right)\rightarrow\mathbb{R}$ , is a continuous and differentiable function on $\mathbb{R}^{2}$ defined as

[TABLE]

where $k_{1}$ and $k_{2}$ are integers.

The objective of the FBS is to maximize its transmission rate. On the other hand, high transmission rate for the MUE is a priority for the FBS. Therefore, $R_{k}$ should have the following property

[TABLE]

The above property implies that a higher transmission rate for the FBS or the MUE results in a higher reward. Hence, considering Definition 1, we design a reward function that motivates the FBSs to increase $r_{k}$ and $r_{0}$ as much as possible, even beyond the required rates, as follows

[TABLE]

where $m$ is an integer. The above reward function considers the minimum rate requirements of the FUE and the MUE, while encouraging the FBS to increase the transmission rate of both.

To further understand the proposed reward function, we discuss the reward functions that are used in [19, 20, 21]. We refer to the reward function designed in [19] as the quadratic, in [20] as the exponential, and in [21] as the proximity reward function. The quadratic reward function is designed based on a conservative approach. In fact, the FBS is forced to select actions that result in a transmission rate close to the minimum requirement. Therefore, a rate higher or lower than the minimum requirement by the same margin results in the same amount of reward. The behavior of the quadratic reward function can be explained as follows

[TABLE]

The above property implies that if the rate of the FBS or the MUE is higher than the minimum requirement, the actions that increase the rate will decrease the reward. Hence, this property works against increasing the sum transmission rate of the network. The exponential and proximity reward functions have the property in (3w) for the rate of the FBS, and the property in (3y) for the rate of the MUE. In other words, they satisfy the following properties

[TABLE]

As the density of the FBSs increases, the above properties result in increasing transmit power to achieve higher individual rate for a FUE while introducing higher interference for the MUE and other neighbor FUEs. In fact, as increasing the FUE rate is rewarded, taking actions that result in increasing the MUE rate decreases the reward. However, the FBS should have the option of decreasing its transmit power to increase the rate of the MUE. This behavior is important since it causes an FBS to produce less interference for its neighboring femtocells. Therefore, we give equal opportunity for increasing the rate of the MUE or the FUE.

The values of the reward functions differ between FBSs; however, they have the same behavior. Here, we plot the value of the four reward functions that are discussed above. The plots refer to the proposed (Fig. 3(a)), quadratic (Fig. 3(b)), exponential (Fig. 3(c)), and proximity (Fig. 3(d)) reward functions. The important information that can be obtained from these plots is the maximal points of the reward functions, their behavior around the minimum requirements, and their behavior as $r_{k}$ or $r_{0}$ increases. The proposed reward function in Fig. 3(a) pushes the FBS to select transmit power levels that increase both $r_{k}$ and $r_{0}$ , while the other reward functions have their maximum around the minimum rate requirements.
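The qualitative contrast between a reward that keeps growing with the rates and one that peaks at the minimum requirements can be sketched with two stand-in functions built from the progress estimators. Both forms below are assumptions for illustration: an odd-power polynomial, which is one choice consistent with the monotonicity property stated above, and a negative squared deviation, which mimics the conservative behavior attributed to the quadratic reward. Neither is claimed to reproduce the exact expressions of this paper or of [19].

```python
import math

def progress(rate, gamma):
    """Progress estimator: achieved rate minus the minimum required rate."""
    return rate - math.log2(1.0 + gamma)

def odd_power_reward(r0, rk, gamma0, gammak, m=2):
    # Assumed stand-in: odd exponents keep the reward increasing in both rates.
    p = 2 * m - 1
    return progress(r0, gamma0) ** p + progress(rk, gammak) ** p

def quadratic_like_reward(r0, rk, gamma0, gammak):
    # Assumed stand-in: peaks when both rates equal their minimum requirements.
    return -(progress(r0, gamma0) ** 2) - (progress(rk, gammak) ** 2)

gamma0 = 15.0                  # log2(1 + 15) = 4 b/s/Hz
gammak = 2 ** 0.5 - 1.0        # log2(1 + gammak) = 0.5 b/s/Hz

odd_low = odd_power_reward(5.0, 0.6, gamma0, gammak)
odd_high = odd_power_reward(5.0, 1.0, gamma0, gammak)
quad_low = quadratic_like_reward(4.0, 0.6, gamma0, gammak)
quad_high = quadratic_like_reward(4.0, 1.0, gamma0, gammak)
```

Raising the FUE rate from 0.6 to 1.0 b/s/Hz increases the odd-power stand-in but lowers the quadratic-like one, which is the behavioral difference discussed in the text.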

V-C Sample Complexity

In each training frame, Q-DPA collects one sample from the environment represented as the state-action pair in the Q-function. Sample complexity is defined as the minimum number of samples that is required to train the Q-function to achieve an $\epsilon$ -optimal policy. For $\epsilon>0$ and $\delta\in\left(0,1\right]$ , $\pi$ is an $\epsilon$ -optimal policy if [40]

[TABLE]

The sample complexity depends on the exploration policy that is generating the samples. In Q-DPA, $e$ -greedy policy is used as the exploration policy. However, $e$ -greedy policy depends on the Q-function of the agent which is being updated. In fact, the distribution of $e$ -greedy policy is unknown. Here, we provide a general bound on the sample complexity of Q-learning.

Proposition 1.

Assume $R_{max}$ is the maximum of the reward function for an agent and $Q^{\left(T\right)}$ is the action-value for state-action pair $\left(x,a\right)$ after $T$ iterations. Then, with probability at least $1-\delta$ , we have

[TABLE]

Proof.

See Appendix A. ∎

This proposition proves the stability of Q-learning and helps us to provide a minimum number of iterations to achieve an error of $\epsilon>0$ with respect to $Q^{*}$ with probability $1-\delta$ for each state-action pair. By setting the right-hand side of the above inequality equal to $\epsilon$ , the following corollary is obtained.

Corollary 1.

For any $\epsilon>0$ , after

[TABLE]

number of iterations, $Q^{\left(T\right)}$ reaches $\epsilon$ -optimality with probability at least $1-\delta$ .

VI Simulation Results

The objective of this section is to validate the performance of the Q-DPA algorithm with different learning configurations in a dense urban scenario. We first introduce the simulation setup and parameters. Then, we introduce four different learning configurations and we analyze the trade-offs between them. Finally, we investigate the performance of the Q-DPA with different reward functions introduced in Section V-B. For the sake of simplicity, we use the notation IL as independent learning and CL as cooperative learning.

VI-A Simulation Setup

We use a dense urban scenario as the setup of the simulation, as illustrated in Fig. 4. We consider one macrocell with radius $350$ m which supports multiple MUEs. The MBS assigns a subband to each MUE. Each MUE is located within a block of apartments, and each block contains two strips of apartments. Each strip has five apartments of size $10$ m $\times 10$ m. There is one FBS located in the middle of each apartment, which supports an FUE within a $5$ m distance. We assume that the FUEs are always inside the apartments. The FBSs are closed-access; therefore, the MUE is not able to connect to any FBS, but it receives interference from the FBSs working on the same subband as itself. Here, we assume that the MUE and all the FBSs work on the same sub-carriers to consider the worst-case (high-interference) scenario. The extension of the simulation to the multi-carrier scenario is straightforward and does not affect our investigations. We assume the block of apartments is located on the edge of the macrocell, i.e., at a $350$ m distance from the MBS, and the MUE is assumed to be in between the two strips of apartments.

In these simulations, in order to initialize the state variables $X_{3}$ and $X_{4}$ in Section IV-C, the number of rings around the MBS and the MUE is assumed to be three ( $N_{1}=N_{2}=3$ ). As the density increases, however, more rings with smaller diameters can be used to distinguish more clearly between the FBSs.

It is assumed that the FBSs and the MBS operate at $f=2.0$ GHz. The MBS allocates $33$ dBm as its transmit power, and the FBSs choose their transmit power from a range of $5$ dBm to $15$ dBm with power steps of $1$ dB. In order to model the pathloss, we use the urban dual strip model from 3GPP TR 36.814 [41]. The pathloss models of the different links are provided in Table I. In Table I, $R$ is the distance between a transmitter and a receiver in meters, $L_{ow}$ is the wall penetration loss, which is set to $20$ dB [41], and $d_{2D,indoor}$ is the 2-dimensional distance. We assume that the apartments are single floor; therefore, $d_{2D,indoor}\approx R$ . The fourth row of the pathloss models is used for the links between the FBSs and the MUE.
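For reference, the FBS action set implied by these settings can be enumerated and converted from dBm to milliwatts as in the sketch below. The variable names are hypothetical.

```python
import numpy as np

p_min_dbm, p_max_dbm, step_db = 5, 15, 1

# Candidate transmit powers: 5, 6, ..., 15 dBm.
levels_dbm = np.arange(p_min_dbm, p_max_dbm + step_db, step_db)
levels_mw = 10 ** (levels_dbm / 10)        # dBm -> mW

mbs_mw = 10 ** (33 / 10)                   # MBS transmit power in mW
num_actions = len(levels_dbm)
```

The steps are uniform in dB, so the corresponding milliwatt values (from about 3.16 mW to about 31.6 mW) are geometrically rather than linearly spaced.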

The minimum SINR requirements for the MUE and the FUEs are defined based on the rate needed to support their corresponding user. In our simulations, the minimum required transmission rate to meet the QoS of the MUE is assumed to be $4$ (b/s/Hz), i.e., $\log_{2}(1+\Gamma_{0})=4$ (b/s/Hz). Moreover, for the FUEs the minimum required rate is set to $0.5$ (b/s/Hz), i.e., $\log_{2}(1+\Gamma_{k})=0.5$ (b/s/Hz), $k\in\mathcal{K}$ . It is worth mentioning that by knowing the media access control (MAC) layer parameters, the values of the required rates can be calculated using [42, Eqs. (20) and (21)].
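The SINR thresholds implied by these rate requirements follow from inverting $\log_{2}(1+\Gamma)$ , as in this short worked example. The helper name is hypothetical.

```python
import math

def sinr_threshold(min_rate):
    """Invert rate = log2(1 + Gamma) to get the linear SINR threshold."""
    return 2.0 ** min_rate - 1.0

gamma_0 = sinr_threshold(4.0)    # MUE threshold, linear scale
gamma_k = sinr_threshold(0.5)    # FUE threshold, linear scale

gamma_0_db = 10.0 * math.log10(gamma_0)
gamma_k_db = 10.0 * math.log10(gamma_k)
```

This gives $\Gamma_{0}=15$ (about $11.76$ dB) for the MUE and $\Gamma_{k}\approx 0.414$ (about $-3.83$ dB) for the FUEs, so the MUE requirement is far more demanding than that of the FUEs.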

To perform Q-learning, the minimum number of required frames, i.e., $L$ , is calculated based on achieving $90\%$ optimality with probability of at least $0.9$ , i.e., $\delta=0.1$ . The simulation parameters are given in Table II. The values of the Q-learning parameters are selected according to our simulations and references [19, 20, 21, 22, 23, 24].

The simulation starts with one femtocell. The FBS starts running Q-DPA in Section V-A using IL. After convergence, the next FBS is added to the network. The new FBS runs Q-DPA, while the other FBS is already trained and simply acts greedily to choose its transmit power. After convergence of the second FBS, the next one is added to the network, and so on. We present all the results versus the number of active femtocells in the system, from one to ten. Considering the size of the apartment block, and the assumption that all femtocells operate on the same frequency range, the density of deployment varies approximately from $600\ \text{FBS}/\text{km}^{2}$ to $6000\ \text{FBS}/\text{km}^{2}$ .

VI-B Performance of Q-DPA

Here, we show the simulation results of distributed power allocation with Q-DPA. First, we define two different state sets. The sets are defined as $\mathcal{X}_{1}=\left\{X_{1},X_{3},X_{4}\right\}$ and $\mathcal{X}_{2}=\left\{X_{2},X_{3},X_{4}\right\}$ . In both sets, FBSs are aware of their relative location to the MUE and the MBS due to the presence of $X_{3}$ and $X_{4}$ , respectively. The state set $\mathcal{X}_{1}$ gives knowledge of the status of the FUE to the FBS, and the state set $\mathcal{X}_{2}$ provides knowledge of the status of the MUE to the FBS.

In order to understand the effect of independent and cooperative learning, and the effect of different state sets, we use four different learning configurations: independent learning with each of the two state sets, denoted IL+ $\mathcal{X}_{1}$ and IL+ $\mathcal{X}_{2}$ , and cooperative learning with each of the two state sets, denoted CL+ $\mathcal{X}_{1}$ and CL+ $\mathcal{X}_{2}$ . The results are compared with a greedy approach in which each FBS chooses the maximum transmit power. The simulation results are shown in three figures: the transmission rate of the MUE (Fig. 5(a)), the sum transmission rate of the FUEs (Fig. 5(b)), and the sum transmit power of the FBSs (Fig. 5(c)).

According to Fig. 5(c), in the greedy algorithm, each FBS uses the maximum available power for transmission. Therefore, the greedy method introduces maximum interference for the MUE and has the lowest MUE transmission rate in Fig. 5(a). On the other hand, despite using maximum power, the greedy algorithm does not achieve highest transmission rate for the FUEs either (Fig. 5(b)). This is again due to the high level of interference.

The state set $\mathcal{X}_{2}$ provides knowledge of the MUE’s QoS status to the learning FBSs. Therefore, as we see in Fig. 5(a), the MUE transmission rate of IL with $\mathcal{X}_{2}$ is higher than that of IL with $\mathcal{X}_{1}$ . This statement is true for CL too. We see the reverse in the FUEs’ sum transmission rate in Fig. 5(b): the performance of IL with $\mathcal{X}_{1}$ is higher than that of IL with $\mathcal{X}_{2}$ . This is because the FBSs are aware of the status of the FUE and therefore favor actions that result in the state variable $X_{1}=\mathbbm{1}_{\left\{\gamma_{k}\geq\Gamma_{k}\right\}}$ being $1$ . The same holds when comparing the state sets in CL. In conclusion, the state set $\mathcal{X}_{1}$ works in favor of the femtocells and the state set $\mathcal{X}_{2}$ benefits the MUE.

We conclude from the simulation results that IL and CL present different trade-offs. More specifically, IL supports a higher sum transmission rate for the FBSs and a lower transmission rate for the MUE, while CL can support a higher transmission rate for the MUE at the cost of an overall lower sum transmission rate for the FBSs. From a power consumption point of view, IL results in a higher power consumption when compared to that of CL. In general, IL trains an FBS to be selfish compared to CL. IL can be very useful when there is no means of communication between the agents. On the other hand, CL trains an FBS to be more considerate about other FBSs at the cost of communication overhead.

In Table III, we have compared the performance of the four learning configurations. In each column, number $1$ is used as a metric to refer to the highest performance achieved and number $4$ is used to refer to the lowest performance observed. The first column represents the summation of transmit powers of FBSs, the second column indicates the summation of transmission rates of the FUEs, and the third column denotes the transmission rate of the MUE.

VI-C Reward Function Performance

Here, we compare the performance of the four reward functions discussed in Section V-B. Since the objective is to maximize the sum transmission rate of the FUEs, according to Table III, we choose the combination IL+ $\mathcal{X}_{1}$ as the learning configuration. The performance of the reward functions is reported as the MUE transmission rate (Fig. 6(a)), the sum transmission rate of the FUEs (Fig. 6(b)), and the sum transmit power of the FBSs (Fig. 6(c)). In each figure, the solution of the optimization problem with exhaustive search and the performance of the greedy method are also provided. The exhaustive search provides us with the highest achievable sum transmission rate for the network. The quadratic, exponential, and proximity reward functions result in a fast decay of the MUE transmission rate, while the proposed reward function results in a much slower decrease of the rate for the MUE. The proposed reward function also achieves a higher sum transmission rate than the other three reward functions. Fig. 6(c) indicates that the proposed reward function reduces the sum transmit power of the FBSs, which in turn could result in lower levels of interference at the FUEs. Compared with the exhaustive search solution, taken as the optimal solution, there is a performance gap. For instance, according to Fig. 6(c), for eight FBSs, the proposed reward function uses an average of $50$ mW less sum transmit power than the optimal solution. However, as we see in Fig. 6(b) and Fig. 6(a), by using more power, the sum transmission rate can be improved and the transmission rate of the MUE can be decreased to the level of the exhaustive search solution without violating its minimum required rate. In future work, we wish to close this gap by using neural networks as the function approximator of the learning method.

VII Conclusion and Future Work

In this paper, we propose a learning framework for a two-tier femtocell network. The framework enables the addition of a new femtocell to the network, while the femtocell trains itself to adapt its transmit power to support the user it serves while protecting the macrocell user. Moreover, as a distributed approach, the proposed method can solve the power optimization problem in dense HetNets while significantly reducing power usage. The proposed framework is generic and motivates the design of machine learning based SONs for management schemes in femtocell networks. Besides, the framework can be used as a test bench for evaluating the performance of different learning configurations such as Markov state models, reward functions, and learning rates. Further, the proposed framework can also be applied to other interference-limited networks such as cognitive radio networks.

In future work, it would be interesting to consider mmWave-enabled femtocells in the present setup. The high pathloss and shadowing, along with the vulnerability of directional mmWave signals to blockages, impact the learning outcome [43], which in turn affects the subsequent power optimization problem. In addition, as discussed in detail in the simulation section, there is a performance gap between the proposed approach and the exhaustive search. Although the proposed approach has lower computational complexity, we wish to close this gap by utilizing neural networks as the function approximator of the learning method, since neural networks can handle large state-action spaces more efficiently. Another complementary direction to achieve a higher sum data rate and fill the performance gap would be to feed the interference model of the network to the factorization process. This way, a better factorization can be provided for the global Q-function.

Appendix A Proof of Proposition 1

Proof.

Consider an MDP represented as $\left(\mathcal{X},\mathcal{A},\Pr\left(y|x,a\right),r\left(x,a\right)\right)$, and a policy $\pi$ with value function $V_{\pi}:\mathcal{X}\rightarrow\mathbb{R}$ and Q-function $Q_{\pi}:\mathcal{Z}\rightarrow\mathbb{R}$, where $\mathcal{Z}=\mathcal{X}\times\mathcal{A}$. Here, $\mathcal{A}$ denotes the action space of a single agent and $k$ is the iteration index. According to (3d), the maximum of the value function can be found as $V_{max}=\frac{R_{max}}{1-\beta}$. The Bellman optimality operator is defined as $\left(\mathtt{T}{Q}\right)\left(x,a\right)\triangleq r\left(x,a\right)+\beta\sum_{y\in\mathcal{X}}\Pr\left(y|x,a\right)\underset{b\in\mathcal{A}}{\max}\,Q\left(y,b\right)$. The operator $\mathtt{T}$ is a contraction with factor $\beta$, i.e., $\lVert\mathtt{T}{Q}-\mathtt{T}{Q^{\prime}}\rVert\leq\beta\lVert Q-Q^{\prime}\rVert$, and $Q^{*}$ is its unique fixed point, i.e., $\left(\mathtt{T}{Q^{*}}\right)\left(x,a\right)=Q^{*}\left(x,a\right)$, $\forall\left(x,a\right)\in\mathcal{Z}$. Further, for ease of notation and readability, the time step notation is slightly changed: $Q_{k}$ refers to the action-value function after $k$ iterations.
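The contraction property and the bound $V_{max}$ can be checked numerically on a toy problem. The following is a minimal sketch, assuming a small randomly generated MDP with rewards in $[0,R_{max}]$ and the max norm over state-action pairs (the norm is not stated explicitly in this appendix). The helper names `make_random_mdp` and `bellman_optimality` are hypothetical and are not part of the paper.

```python
import numpy as np

def make_random_mdp(n_states, n_actions, r_max, seed=0):
    """Hypothetical toy MDP: random transition kernel and rewards in [0, r_max]."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)  # each P[x, a, :] sums to 1
    r = rng.uniform(0.0, r_max, size=(n_states, n_actions))
    return P, r

def bellman_optimality(Q, P, r, beta):
    """(TQ)(x,a) = r(x,a) + beta * sum_y Pr(y|x,a) * max_b Q(y,b)."""
    return r + beta * (P @ Q.max(axis=1))

beta, r_max = 0.9, 1.0
v_max = r_max / (1.0 - beta)
P, r = make_random_mdp(n_states=5, n_actions=3, r_max=r_max)

# Contraction check: ||TQ - TQ'|| <= beta * ||Q - Q'|| in the max norm.
rng = np.random.default_rng(1)
Q_a = rng.uniform(-v_max, v_max, size=r.shape)
Q_b = rng.uniform(-v_max, v_max, size=r.shape)
lhs = np.abs(bellman_optimality(Q_a, P, r, beta)
             - bellman_optimality(Q_b, P, r, beta)).max()
rhs = beta * np.abs(Q_a - Q_b).max()

# Fixed point: iterate T from Q = 0 until it stops changing.
Q_star = np.zeros_like(r)
for _ in range(500):
    Q_star = bellman_optimality(Q_star, P, r, beta)
residual = np.abs(bellman_optimality(Q_star, P, r, beta) - Q_star).max()

print("contraction holds:", lhs <= rhs + 1e-12)
print("fixed-point residual:", residual)
print("||Q*|| <= V_max:", np.abs(Q_star).max() <= v_max + 1e-9)
```

Repeated application of `bellman_optimality` is plain value iteration; it is used here only to approximate $Q^{*}$ for the check and is not the learning algorithm analyzed in the proof.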

Assume that the state-action pair $\left(x,a\right)$ has been visited $k$ times, and let $\mathcal{F}_{k}=\left\{y_{1},y_{2},\ldots,y_{k}\right\}$ denote the next states observed on those visits. At time step $k+1$, the update rule of Q-learning is

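$$Q_{k+1}\left(x,a\right)=\left(1-\alpha_{k}\right)Q_{k}\left(x,a\right)+\alpha_{k}\mathtt{T}_{k}{Q_{k}}\left(x,a\right),$$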

where $\mathtt{T}_{k}{Q_{k}}$ is the empirical Bellman operator, defined as $\mathtt{T}_{k}{Q_{k}}\left(x,a\right)\triangleq r\left(x,a\right)+\beta\underset{b\in\mathcal{A}}{\max}\,Q_{k}\left(y_{k},b\right)$. From this point on, for simplicity, we drop the dependency on $\left(x,a\right)$. It is easy to show that $\mathbb{E}\left[\mathtt{T}_{k}{Q_{k}}\right]=\mathtt{T}{Q}_{k}$; therefore, we define the estimation error of each iteration as $e_{k}=\mathtt{T}_{k}{Q_{k}}-\mathtt{T}{Q}_{k}$. Using $\alpha_{k}=\frac{1}{k+1}$, the update rule of Q-learning can be written as

$$Q_{k+1}=\frac{k}{k+1}Q_{k}+\frac{1}{k+1}\left(\mathtt{T}{Q}_{k}+e_{k}\right).\tag{3ax}$$

Now, in order to prove Proposition 1, we need to state the following lemmas.

Lemma 1.

For any $k\geq 1$

$$Q_{k}=\frac{1}{k}\sum_{i=0}^{k-1}\left(\mathtt{T}{Q}_{i}+e_{i}\right).\tag{3ay}$$

Proof.

We prove this lemma by induction. The lemma holds for $k=1$ as $Q_{1}=\mathtt{T}_{0}{Q}_{0}=\mathtt{T}{Q}_{0}+e_{0}$ . We now show that if the result holds for $k$ , then it also holds for $k+1$ . From (3ax) we have

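$$\begin{aligned}Q_{k+1}&=\frac{k}{k+1}Q_{k}+\frac{1}{k+1}\left(\mathtt{T}{Q}_{k}+e_{k}\right)\\&=\frac{k}{k+1}\cdot\frac{1}{k}\sum_{i=0}^{k-1}\left(\mathtt{T}{Q}_{i}+e_{i}\right)+\frac{1}{k+1}\left(\mathtt{T}{Q}_{k}+e_{k}\right)\\&=\frac{1}{k+1}\sum_{i=0}^{k}\left(\mathtt{T}{Q}_{i}+e_{i}\right).\end{aligned}$$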

Thus (3ay) holds for $k\geq 1$ by induction. ∎
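The induction can also be checked empirically. The sketch below assumes that Lemma 1 states that $Q_{k}$ equals the running average of the empirical Bellman targets $\mathtt{T}_{i}{Q_{i}}=\mathtt{T}{Q}_{i}+e_{i}$ for $i=0,\ldots,k-1$, which is what the base case $Q_{1}=\mathtt{T}_{0}{Q}_{0}$ and the step size $\alpha_{k}=\frac{1}{k+1}$ suggest. It further assumes a synchronous setting in which every state-action pair is updated once per iteration from a sampled next state. The function `run_sync_q_learning` and the toy MDP are hypothetical illustrations, not the paper's simulation setup.

```python
import numpy as np

def run_sync_q_learning(P, r, beta, n_iter, seed=0):
    """Q-learning with alpha_k = 1/(k+1), updating every (x, a) each iteration.

    Returns the final Q, the largest deviation between Q_k and the running
    average of the empirical Bellman targets, and the largest |Q_k| seen.
    """
    rng = np.random.default_rng(seed)
    n_x, n_a = r.shape
    Q = np.zeros((n_x, n_a))            # Q_0 = 0, so it is bounded by V_max
    target_sum = np.zeros((n_x, n_a))   # sum of T_i Q_i for i = 0..k
    max_gap, max_norm = 0.0, 0.0
    for k in range(n_iter):
        # Sample one next state y_k ~ Pr(.|x, a) for every state-action pair.
        y = np.array([[rng.choice(n_x, p=P[x, a]) for a in range(n_a)]
                      for x in range(n_x)])
        target = r + beta * Q.max(axis=1)[y]   # empirical operator T_k Q_k
        alpha = 1.0 / (k + 1)
        Q = (1.0 - alpha) * Q + alpha * target  # Q is now Q_{k+1}
        target_sum += target
        max_gap = max(max_gap, np.abs(Q - target_sum / (k + 1)).max())
        max_norm = max(max_norm, np.abs(Q).max())
    return Q, max_gap, max_norm

# Toy MDP (an assumption for illustration only).
rng = np.random.default_rng(42)
n_x, n_a, beta, r_max = 4, 2, 0.8, 1.0
P = rng.random((n_x, n_a, n_x))
P /= P.sum(axis=2, keepdims=True)
r = rng.uniform(0.0, r_max, size=(n_x, n_a))
v_max = r_max / (1.0 - beta)

Q_final, max_gap, max_norm = run_sync_q_learning(P, r, beta, n_iter=200)
print("max |Q_k - average of targets|:", max_gap)
print("max ||Q_k|| vs V_max:", max_norm, v_max)
```

The same run also tracks the largest entry of $Q_{k}$, which stays below $V_{max}$ in this setting because every target is at most $R_{max}+\beta V_{max}=V_{max}$.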

Lemma 2.

Assume that initial action-value function, $Q_{0}$ , is uniformly bounded by $V_{max}$ . Then, for all $k\geq 1$ we have $\lVert Q_{k}\rVert\leq V_{max}$ and $\lVert Q^{*}-Q_{k}\rVert\leq 2V_{max}$ .

Proof.

We first prove that $\lVert Q_{k}\rVert\leq V_{max}$ by induction. The inequality holds for $k=1$ as

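$$\lVert Q_{1}\rVert=\lVert\mathtt{T}_{0}{Q_{0}}\rVert=\lVert r+\beta\max Q_{0}\rVert\leq\lVert r\rVert+\beta\lVert Q_{0}\rVert\leq R_{max}+\beta V_{max}=V_{max}.$$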

Now, assume that $\lVert Q_{i}\rVert\leq V_{max}$ holds for $1\leq i\leq k$. First, $\lVert\mathtt{T}_{k}{Q_{k}}\rVert=\lVert r+\beta\max Q_{k}\rVert\leq\lVert r\rVert+\beta\lVert\max Q_{k}\rVert\leq R_{max}+\beta V_{max}=V_{max}$. Second, from Lemma 1 we have

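$$\lVert Q_{k+1}\rVert=\left\lVert\frac{1}{k+1}\sum_{i=0}^{k}\mathtt{T}_{i}{Q_{i}}\right\rVert\leq\frac{1}{k+1}\sum_{i=0}^{k}\lVert\mathtt{T}_{i}{Q_{i}}\rVert\leq V_{max}.$$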

Therefore, the inequality holds for $k\geq 1$ by induction. The bound on $\lVert Q^{*}-Q_{k}\rVert$ then follows from the triangle inequality: $\lVert Q^{*}-Q_{k}\rVert\leq\lVert Q^{*}\rVert+\lVert Q_{k}\rVert\leq 2V_{max}$. ∎

Lemma 3.

Assume that initial action-value function, $Q_{0}$ , is uniformly bounded by $V_{max}$ , then, for any $k\geq 1$

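$$\lVert Q^{*}-Q_{k}\rVert\leq\frac{2\beta V_{max}}{k\left(1-\beta\right)}+\frac{1}{k}\left\lVert\sum_{i=0}^{k-1}e_{i}\right\rVert.$$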

Proof.

From Lemma 1, we have

$$Q^{*}-Q_{k}=\frac{1}{k}\sum_{i=0}^{k-1}\left(\mathtt{T}{Q^{*}}-\mathtt{T}{Q}_{i}-e_{i}\right).$$

Therefore, we can write

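$$\lVert Q^{*}-Q_{k}\rVert\leq\frac{1}{k}\sum_{i=0}^{k-1}\lVert\mathtt{T}{Q^{*}}-\mathtt{T}{Q}_{i}\rVert+\frac{1}{k}\left\lVert\sum_{i=0}^{k-1}e_{i}\right\rVert\leq\frac{\beta}{k}\sum_{i=0}^{k-1}\lVert Q^{*}-Q_{i}\rVert+\frac{1}{k}\left\lVert\sum_{i=0}^{k-1}e_{i}\right\rVert,$$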

and according to [44], $\lVert Q^{*}-Q_{i}\rVert\leq\beta^{i}\lVert Q^{*}-Q_{0}\rVert$ . Hence, using Lemma 2, we can write

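$$\lVert Q^{*}-Q_{k}\rVert\leq\frac{\beta}{k}\sum_{i=0}^{k-1}\beta^{i}\lVert Q^{*}-Q_{0}\rVert+\frac{1}{k}\left\lVert\sum_{i=0}^{k-1}e_{i}\right\rVert\leq\frac{2\beta V_{max}}{k\left(1-\beta\right)}+\frac{1}{k}\left\lVert\sum_{i=0}^{k-1}e_{i}\right\rVert.$$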

∎

Now, we prove Proposition 1 by using the above result in Lemma 3. To this aim, we need to provide a bound on the norm of the summation of errors in the inequality of Lemma 3. First, we can write

[TABLE]

For the estimation error sequence $\left\{e_{0},e_{1},\cdots,e_{k}\right\}$, we have $\mathbb{E}\left[e_{k}|\mathcal{F}_{k-1}\right]=0$, which means that the error sequence is a martingale difference sequence with respect to $\mathcal{F}_{k}$. Therefore, by the Hoeffding-Azuma inequality [45] applied to the martingale difference sequence $\left\{e_{0},e_{1},\cdots,e_{k-1}\right\}$, whose terms are bounded by $2V_{max}$, for any $t>0$ we can write

[TABLE]

Therefore, by a union bound over the state-action space, we have

[TABLE]

and then,

[TABLE]

Hence, with probability at least $1-\delta$ we can say

[TABLE]

Consequently, the result in Proposition 1 is proved. ∎
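As a hedged numerical companion to the argument above, the sketch below evaluates a bound of the form $2V_{max}\left[\frac{\beta}{k\left(1-\beta\right)}+\sqrt{\frac{2}{k}\ln\frac{2|\mathcal{X}||\mathcal{A}|}{\delta}}\right]$. This expression is an assumption reconstructed from the text rather than copied from Proposition 1: the first term is the deterministic part of Lemma 3, and the second results from applying the Hoeffding-Azuma inequality to $k$ error terms bounded by $2V_{max}$, taking a union bound over $|\mathcal{X}||\mathcal{A}|$ state-action pairs, and setting the failure probability to $\delta$. The function names `q_error_bound` and `visits_needed` are hypothetical, and the constants may differ from those in the paper.

```python
import math

def q_error_bound(k, beta, r_max, n_states, n_actions, delta):
    """Assumed high-probability bound on ||Q* - Q_k|| after k updates.

    bias  : deterministic term beta / (k (1 - beta))
    noise : Hoeffding-Azuma plus union-bound term sqrt((2/k) ln(2|X||A|/delta))
    """
    v_max = r_max / (1.0 - beta)
    bias = beta / (k * (1.0 - beta))
    noise = math.sqrt(2.0 / k * math.log(2.0 * n_states * n_actions / delta))
    return 2.0 * v_max * (bias + noise)

def visits_needed(eps, beta, r_max, n_states, n_actions, delta, k_cap=10**12):
    """Smallest k for which the assumed bound drops to eps or below.

    The bound decreases monotonically in k, so we double k to find an upper
    bracket and then bisect.
    """
    args = (beta, r_max, n_states, n_actions, delta)
    hi = 1
    while q_error_bound(hi, *args) > eps:
        hi *= 2
        if hi > k_cap:
            raise ValueError("eps not reachable below k_cap")
    lo = max(1, hi // 2)
    while lo < hi:
        mid = (lo + hi) // 2
        if q_error_bound(mid, *args) <= eps:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Example values are illustrative assumptions, not taken from the simulations.
params = dict(beta=0.9, r_max=1.0, n_states=4, n_actions=31, delta=0.05)
for k in (10**2, 10**4, 10**6):
    print(k, q_error_bound(k, **params))
k_min = visits_needed(eps=1.0, **params)
print("visits needed for eps = 1.0:", k_min)
```

Because the noise term decays as $1/\sqrt{k}$ while the bias term decays as $1/k$, the number of visits required under this assumed bound is dominated by the noise term for small target errors.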

Bibliography


[1] R. Amiri, H. Mehrpouyan, L. Fridman, R. K. Mallik, A. Nallanathan, and D. Matolak, “A machine learning approach for power allocation in HetNets considering QoS,” in Proc. IEEE ICC, pp. 1–7, May 2018.
[2] O. G. Aliu, A. Imran, M. A. Imran, and B. Evans, “A survey of self organisation in future cellular networks,” IEEE Commun. Surv. Tutor., vol. 15, no. 1, pp. 336–361, First Quarter 2013.
[3] J. Moysen and L. Giupponi, “From 4G to 5G: Self-organized network management meets machine learning,” CoRR, vol. abs/1707.09300, 2017. [Online]. Available: http://arxiv.org/abs/1707.09300
[4] J. G. Andrews, S. Buzzi, W. Choi, S. V. Hanly, A. Lozano, A. C. K. Soong, and J. C. Zhang, “What will 5G be?” IEEE J. Select. Areas Commun., vol. 32, no. 6, pp. 1065–1082, June 2014.
[5] M. Peng, D. Liang, Y. Wei, J. Li, and H. Chen, “Self-configuration and self-optimization in LTE-advanced heterogeneous networks,” IEEE Commun. Mag., vol. 51, no. 5, pp. 36–45, May 2013.
[6] M. Agiwal, A. Roy, and N. Saxena, “Next generation 5G wireless networks: A comprehensive survey,” IEEE Commun. Surv. Tutor., vol. 18, no. 3, pp. 1617–1655, Third Quarter 2016.
[7] P. V. Klaine, M. A. Imran, O. Onireti, and R. D. Souza, “A survey of machine learning techniques applied to self-organizing cellular networks,” IEEE Commun. Surv. Tutor., vol. 19, no. 4, pp. 2392–2431, Fourth Quarter 2017.
[8] A. Imran, A. Zoha, and A. Abu-Dayya, “Challenges in 5G: how to empower SON with big data for enabling 5G,” IEEE Network, vol. 28, no. 6, pp. 27–33, Nov. 2014.
