A CMDP-based Approach for Energy Efficient Power Allocation in Massive MIMO Systems

Peng Li; Yanxiang Jiang; Wei Li; Fuchun Zheng; Xiaohu You

arXiv:1703.07051·cs.IT·March 22, 2017

TL;DR

This paper introduces a CMDP-based method for optimizing energy-efficient power allocation in massive MIMO uplink systems, balancing QoS requirements and achieving near-optimal performance through offline algorithms.

Contribution

It formulates the power allocation problem as a CMDP and proposes an offline solution using value iteration and Q-learning to find the global optimum policy.

Findings

1. Proposed policy closely matches ergodic optimal performance.
2. Effective in managing QoS requirements for multiple users.
3. Demonstrates the viability of CMDP in massive MIMO energy optimization.

Abstract

In this paper, energy efficient power allocation for the uplink of a multi-cell massive MIMO system is investigated. With the simplified power consumption model, the problem of power allocation is formulated as a constrained Markov decision process (CMDP) framework with infinite-horizon expected discounted total reward, which takes into account different quality of service (QoS) requirements for each user terminal (UT). We propose an offline solution containing the value iteration and Q-learning algorithms, which can obtain the global optimum power allocation policy. Simulation results show that our proposed policy performs very close to the ergodic optimal policy.

Equations

$$g_{limk}=h_{limk}\sqrt{\beta_{lik}},$$

$$\bm{y}_{l}=\sum_{i=1}^{L}\bm{G}_{li}\bm{P}_{i}^{1/2}\bm{x}_{i}+\bm{n}_{l},$$

$$z_{lk}=\bm{a}_{lk}^{H}\bm{y}_{l}=p_{lk}^{1/2}\bm{a}_{lk}^{H}\bm{g}_{llk}x_{lk}+\bm{a}_{lk}^{H}\sum_{\kappa=1,\kappa\neq k}^{K}p_{l\kappa}^{1/2}\bm{g}_{ll\kappa}x_{l\kappa}+\bm{a}_{lk}^{H}\sum_{i=1,i\neq l}^{L}\bm{G}_{li}\bm{P}_{i}^{1/2}\bm{x}_{i}+\bm{a}_{lk}^{H}\bm{n}_{l},$$

$$\gamma_{lk}=\frac{p_{lk}\left|\bm{a}_{lk}^{H}\bm{g}_{llk}\right|^{2}}{\sum_{\kappa=1,\kappa\neq k}^{K}p_{l\kappa}\left|\bm{a}_{lk}^{H}\bm{g}_{ll\kappa}\right|^{2}+\sum_{i=1,i\neq l}^{L}\sum_{\kappa=1}^{K}p_{i\kappa}\left|\bm{a}_{lk}^{H}\bm{g}_{li\kappa}\right|^{2}+\sigma_{n}^{2}\left\|\bm{a}_{lk}\right\|^{2}}.$$

$$EE=\sum_{l=1}^{L}\sum_{k=1}^{K}\frac{\log_{2}(1+\gamma_{lk})}{p_{lk}+p_{lc}}.$$

$$p_{b}=\int_{\Gamma_{b}}^{\Gamma_{b+1}}\frac{1}{\psi_{0}}e^{-\frac{\psi}{\psi_{0}}}\,d\psi=e^{-\frac{\Gamma_{b}}{\psi_{0}}}-e^{-\frac{\Gamma_{b+1}}{\psi_{0}}},$$

$$p\{\psi^{\prime}=b^{\prime}\mid\psi=b\}=\begin{cases}\dfrac{h(\Gamma_{b+1})}{p_{b}}, & b^{\prime}=b+1,\ b\in[0,Q_{S}-2],\\[4pt] \dfrac{h(\Gamma_{b})}{p_{b}}, & b^{\prime}=b-1,\ b\in[1,Q_{S}-1],\\[4pt] 1-\dfrac{h(\Gamma_{b+1})}{p_{b}}-\dfrac{h(\Gamma_{b})}{p_{b}}, & b^{\prime}=b,\ b\in[1,Q_{S}-2],\end{cases}$$

$$P\{s_{c}^{\prime}\mid s_{c},a_{c}\}=\prod_{l=1}^{L}\prod_{i=1}^{L}\prod_{k=1}^{K}\prod_{\kappa=1}^{K}p\{\psi_{lik\kappa}^{\prime}\mid\psi_{lik\kappa}\}.$$

$$\begin{aligned}\max_{a_{c}^{n}}\quad & v^{\pi}(s_{c}^{0})=\mathbb{E}_{s_{c}^{0}}^{\pi}\Big\{\sum_{n=1}^{\infty}\lambda^{n-1}R(s_{c}^{n},a_{c}^{n})\Big\}\\ \text{s.t.}\quad & c_{lk}^{\pi}(s_{c}^{0})=\mathbb{E}_{s_{c}^{0}}^{\pi}\Big\{\sum_{n=1}^{\infty}\lambda^{n-1}C_{lk}(s_{c}^{n},a_{c}^{n})\Big\}\geq r_{\min},\quad l=1,\dots,L,\ k=1,\dots,K.\end{aligned}$$

$$L(s_{c},a_{c};\bm{\rho})=R(s_{c},a_{c})+\sum_{l=1,k=1}^{L,K}\rho_{lk}C_{lk}(s_{c},a_{c}).$$

$$v_{\bm{\rho}}(s_{c})=\max_{a_{c}}\Big\{L(s_{c},a_{c};\bm{\rho})+\sum_{s_{c}^{\prime}}\lambda P\{s_{c}^{\prime}\mid s_{c},a_{c}\}v_{\bm{\rho}}(s_{c}^{\prime})\Big\}.$$

$$\rho_{lk,j^{\prime}+1}=\rho_{lk,j^{\prime}}+\frac{1}{j^{\prime}}\big(r_{\min}-c_{lk}^{\pi^{*}}(s_{c}^{0})\big),$$

$$\int_{0}^{\rho_{lk}}\big(r_{\min}-c_{lk}^{\pi^{*}}(s_{c}^{0})\big)\,d\rho_{lk},\quad l=1,\dots,L,\ k=1,\dots,K$$

$$v_{i^{\prime}+1}^{\pi}(s_{c})=\max_{a_{c}}\Big\{L(s_{c},a_{c};\bm{\rho}_{j^{\prime}})+\sum_{s_{c}^{\prime}}\lambda P\{s_{c}^{\prime}\mid s_{c},a_{c}\}v_{i^{\prime}}^{\pi}(s_{c}^{\prime})\Big\}.$$

Taxonomy

Topics: Advanced MIMO Systems Optimization · Advanced Wireless Network Optimization · Energy Harvesting in Wireless Networks

Full text

Peng Li†, Yanxiang Jiang†∗, Wei Li‡, Fuchun Zheng§, and Xiaohu You†

†National Mobile Communications Research Laboratory, Southeast University, Nanjing 210096, China.

‡Dept. of Information and Communication Engineering, Xi’an Jiaotong University, Xi’an 710049, China.

§School of Systems Engineering, University of Reading, Reading, RG6 6AY, UK.

∗E-mail: [email protected]

I Introduction

With the rapid development of wireless communication systems, there has been a new surge of interest in energy efficient systems, driven by the tension between the ever-increasing energy demand and societal and economic concerns. As one of the key technologies of 5G mobile communication systems, massive MIMO has been put forward to significantly improve the system capacity through extra degrees of freedom, which facilitate transmit diversity and spatial multiplexing gains [1].

Recently, there has been an increasing research interest in energy efficiency (EE) for massive MIMO systems. As discussed in [2], an accurate power consumption model is of primary importance for obtaining reliable guidelines for EE optimization. By using a refined power consumption model, a closed-form EE-optimal value of the transmit power was derived in [2] by means of some properties of the Lambert W function. However, the optimization problem there imposes no constraints on quality of service (QoS) and therefore fails to model realistic scenarios in communication systems. In the uplink of massive MIMO systems, the maximum transmit power and the minimum data rate for each user terminal (UT) should be included in the basic QoS requirements. In [3], the problem of maximizing the EE as a function of the number of UTs and the number of BS antennas was analyzed, for a given spectral efficiency and fixed transceiver power consumption parameters. Similarly, the impact of system parameters (the average channel gain to the UTs and the power consumption parameters) on the optimal EE was studied in [4] for maximizing the EE with a fixed sum spectral efficiency. Besides the theoretical analysis of the relationships between system parameters and the optimal EE, it is of great importance to develop optimization methods for maximizing EE in the multi-cell scenario.

More recently, the Markov decision process (MDP) method has been utilized to deal with the resource allocation problems for communication systems. In [5], by using the semi-MDP method, a resource allocation scheme was proposed to achieve the optimal power efficiency for QoS-guaranteed services in OFDMA multi-cell cooperation networks. However, the technicalities and complexities associated with semi-MDP seldom lead to practical algorithms [6]. On the other hand, in order to meet the QoS requirements, only a few works on the constrained Markov decision process (CMDP) method for resource allocation in MIMO systems [6] and OFDM systems [7] have been reported. The problem of power and rate allocation in MIMO systems was modeled as a CMDP in [6] with the goal of minimizing the transmit power subject to delay constraints, while the problem of power and subcarrier allocation for downlink OFDMA systems was formulated as a CMDP in [7] with the goal of maximizing the EE under average delay constraints. By introducing a middle state called “post-decision state”, an online solution was proposed in [7].

Motivated by the aforementioned results, we propose a novel offline power allocation scheme to achieve the global optimum EE under QoS constraints in the uplink of multi-cell massive MIMO systems, which exploits the powerful optimization tool, constrained Markov decision process (CMDP). The power allocation policy is determined via the use of value iteration and Q-learning algorithms. The appeal of the value iteration algorithm is attributed to its ease in implementation and simplicity in the convergence condition to the global optimum solution. More importantly, the value iteration algorithm can be used for further studies to analyze the structure of the optimal policy obtained in this paper. The global convergence of the Q-learning algorithm guarantees the proposed offline solution to obtain the global optimum power allocation policy. Specifically, the proposed offline solution can exploit the obtained decision rule to build an offline look-up table, which can avoid the frequent and continuous computations and provide flexibility by adjusting the corresponding parameters of the value iteration and Q-learning algorithms.

The rest of this paper is organized as follows. In Section II, the system model is briefly described. The problem formulation and solution algorithm are presented in Section III. Simulation results are shown in Section IV. Final conclusions are drawn in Section V.

II System Model

Consider a multi-cell massive MIMO system consisting of $L$ cells, where each BS is equipped with an array of $M$ antennas and $K$ single-antenna UTs are uniformly distributed in each cell, as illustrated in Fig. 1. Assume $M\gg K$ . The focus of this paper is on the uplink without any form of BS cooperation.

Let $g_{limk}$ denote the complex propagation coefficient between the $m$ -th BS antenna in the $l$ -th cell and the $k$ -th UT in the $i$ -th cell. Then, it can be expressed as,

$$g_{limk}=h_{limk}\sqrt{\beta_{lik}},\tag{1}$$

where the small-scale fading coefficient $h_{limk}$ is always assumed to be an i.i.d. random variable with distribution $\mathcal{CN}(0,1)$ , and the large-scale fading coefficient $\sqrt{{\beta_{lik}}}$ models the geometric attenuation and shadow fading, which is assumed to be independent over $m$ , constant over many coherence time intervals and known a priori [1]. The component $\beta_{lik}=\varphi{\zeta_{lik}}/d_{lik}^{\alpha}$ consists of path loss and shadow fading, where $\varphi$ is a constant related to carrier frequency and antenna gain, $d_{lik}$ is the distance between the BS in the $l$ -th cell and the $k$ -th UT in the $i$ -th cell, $\alpha$ is the path loss exponent, and $\zeta_{lik}$ represents the shadow fading with the distribution $10{\log_{10}}{\zeta_{lik}}\sim\mathcal{N}(0,\sigma_{sh}^{2})$ . Then, we have the propagation matrix ${\bm{G}}_{li}={{\bm{H}}_{li}}{{\bm{D}}_{li}^{1/2}}$ , where ${\bm{H}}_{li}$ denotes the $M\times K$ matrix of fast fading coefficients between the BS in the $l$ -th cell and the $K$ UTs in the $i$ -th cell, i.e., ${[{{\bm{H}}_{li}}]_{mk}}={h_{limk}}$ , and ${\bm{D}}_{li}$ is the $K\times K$ diagonal matrix with ${[{{\bm{D}}_{li}}]_{kk}}={\beta_{lik}}$ .
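As a rough illustration of this channel model, the following sketch draws one realization of ${\bm{G}}_{li}={{\bm{H}}_{li}}{{\bm{D}}_{li}^{1/2}}$ for a single pair of cells. The function name `propagation_matrix`, the example distances, and the reading of the shadowing parameter as a standard deviation in dB are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def propagation_matrix(M, K, distances, phi=1.0, alpha=3.7, sigma_sh_db=10.0, rng=None):
    """Draw one realization of G = H D^{1/2} for a single (l, i) cell pair.

    distances: length-K sequence of BS-to-UT distances (hypothetical inputs).
    sigma_sh_db: shadowing standard deviation in dB (an assumed interpretation).
    """
    rng = np.random.default_rng() if rng is None else rng
    distances = np.asarray(distances, dtype=float)
    # Small-scale fading: i.i.d. CN(0, 1) entries, shape (M, K).
    H = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2.0)
    # Log-normal shadowing: 10 * log10(zeta) ~ N(0, sigma_sh^2).
    zeta = 10.0 ** (sigma_sh_db * rng.standard_normal(K) / 10.0)
    # Large-scale coefficient beta = phi * zeta / d^alpha.
    beta = phi * zeta / distances ** alpha
    # Broadcasting scales column k of H by sqrt(beta_k), i.e. H @ diag(sqrt(beta)).
    G = H * np.sqrt(beta)
    return G, beta
```

Because $\beta_{lik}$ does not depend on the antenna index $m$ , every entry in column $k$ of the returned matrix shares the same average power $\beta_{lik}$ .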

In the uplink, let ${{\bm{y}}_{l}}$ denote the $M\times 1$ received signal vector of the BS in the $l$ -th cell. Then, it can be expressed as:

$$\bm{y}_{l}=\sum_{i=1}^{L}\bm{G}_{li}\bm{P}_{i}^{1/2}\bm{x}_{i}+\bm{n}_{l},\tag{2}$$

where ${{\bm{x}}_{i}}\in{\mathbb{C}^{K\times 1}}$ denotes the transmit symbol vector in the $i$ -th cell with $\bm{x}_{i}\sim\mathcal{CN}(\mathbb{0},\mathbb{I}_{K})$ , ${\bm{P}}_{i}^{1/2}=\text{diag}\{\sqrt{{p_{i1}}},\sqrt{{p_{i2}}},\cdots,\sqrt{{p_{iK}}}\}$ denotes the transmit power matrix allocated to the UTs in the $i$ -th cell, and ${{\bm{n}}_{l}}\in{\mathbb{C}^{M\times 1}}$ denotes the additive white Gaussian noise (AWGN) with $\bm{n}_{l}\sim\mathcal{CN}(\mathbb{0},\sigma_{n}^{2}{\mathbb{I}_{M})}$ .

We consider the case where the BSs have the perfect channel state information (CSI), i.e., they know $\bm{G}$ . Assume that the zero-forcing (ZF) receiver is utilized to detect the streams of the $K$ UTs. Let ${\bm{A}}_{l}$ be the receiver matrix, ${\bm{a}}_{lk}$ the $k$ -th column of ${\bm{A}}_{l}$ , ${\bm{g}}_{llk}$ the $k$ -th column of the propagation matrix ${\bm{G}}_{ll}$ , and ${p_{lk}}$ the transmit power allocated to the $k$ -th UT in the $l$ -th cell. Then, the detected signal of the $k$ -th UT in the $l$ -th cell can be expressed as:

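$$z_{lk}=\bm{a}_{lk}^{H}\bm{y}_{l}=p_{lk}^{1/2}\bm{a}_{lk}^{H}\bm{g}_{llk}x_{lk}+\bm{a}_{lk}^{H}\sum_{\kappa=1,\kappa\neq k}^{K}p_{l\kappa}^{1/2}\bm{g}_{ll\kappa}x_{l\kappa}+\bm{a}_{lk}^{H}\sum_{i=1,i\neq l}^{L}\bm{G}_{li}\bm{P}_{i}^{1/2}\bm{x}_{i}+\bm{a}_{lk}^{H}\bm{n}_{l},\tag{3}$$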

where only the first term is the desired information, while the other terms represent the intra-cell interference, inter-cell interference and noise, respectively. As a result, the uplink signal to interference plus noise ratio (SINR) of the $k$ -th UT in the $l$ -th cell can be expressed as follows:

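$$\gamma_{lk}=\frac{p_{lk}\left|\bm{a}_{lk}^{H}\bm{g}_{llk}\right|^{2}}{\sum_{\kappa=1,\kappa\neq k}^{K}p_{l\kappa}\left|\bm{a}_{lk}^{H}\bm{g}_{ll\kappa}\right|^{2}+\sum_{i=1,i\neq l}^{L}\sum_{\kappa=1}^{K}p_{i\kappa}\left|\bm{a}_{lk}^{H}\bm{g}_{li\kappa}\right|^{2}+\sigma_{n}^{2}\left\|\bm{a}_{lk}\right\|^{2}}.\tag{4}$$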

The EE of a communication system is measured in bits/Joule and defined as the total average number of bits successfully delivered from the UTs per Joule of consumed energy. As for the detailed power consumption model, for a specific UT, apart from the power consumed at the UT, which can be modeled as the sum of the transmit power and the circuit power consumed by inevitable electronic operations, the average circuit power consumption within the BS (e.g., receiver antenna units, decoding, multiuser detection and fixed power consumption) is also of great importance. When the receiver circuit power is not of interest, the average circuit power consumption in the BS can be assumed to be zero [8]. In this paper, our focus lies in the power allocation scheme, without consideration of other parameters such as $M$ or $K$ , so there is no need to formulate a power consumption model as detailed as that in [4]. Hence, we integrate all of the power consumed above, except the transmit power, into a single term, $p_{lc}$ , denoted as the average circuit power consumption for each UT in the $l$ -th cell, to simplify the power consumption model. Therefore, the uplink EE of the multi-cell massive MIMO system is given by

$$EE=\sum_{l=1}^{L}\sum_{k=1}^{K}\frac{\log_{2}(1+\gamma_{lk})}{p_{lk}+p_{lc}}.\tag{5}$$
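The SINR and EE definitions above can be sketched numerically as follows. This is a minimal illustration, assuming the ZF receiver is the conjugate transpose of the pseudo-inverse of ${\bm{G}}_{ll}$ (so that $\bm{a}_{lk}^{H}\bm{g}_{ll\kappa}$ is 1 for $\kappa=k$ and 0 otherwise); the function name `zf_sinr_and_ee` and the nested-list layout of the channels are hypothetical.

```python
import numpy as np


def zf_sinr_and_ee(G, P, p_c, noise_var):
    """Per-UT SINR and sum EE for given transmit powers.

    G[l][i]: (M, K) channel from the UTs in cell i to BS l (hypothetical layout).
    P: (L, K) transmit powers; p_c: length-L circuit powers; noise_var: sigma_n^2.
    """
    P = np.asarray(P, dtype=float)
    p_c = np.asarray(p_c, dtype=float)
    L, K = P.shape
    sinr = np.zeros((L, K))
    for l in range(L):
        # Assumed ZF receiver: A_l^H = pinv(G_ll), so A_l^H G_ll = I_K.
        A = np.linalg.pinv(G[l][l]).conj().T  # shape (M, K)
        for k in range(K):
            a = A[:, k]
            signal = 0.0
            interference = 0.0
            for i in range(L):
                # |a_lk^H g_li,kappa|^2 for every UT kappa in cell i.
                gains = np.abs(a.conj() @ G[l][i]) ** 2
                for kappa in range(K):
                    if i == l and kappa == k:
                        signal = P[l, k] * gains[k]
                    else:
                        interference += P[i, kappa] * gains[kappa]
            noise = noise_var * np.linalg.norm(a) ** 2
            sinr[l, k] = signal / (interference + noise)
    # Sum over cells and UTs of log2(1 + SINR) / (p_lk + p_lc).
    ee = float(np.sum(np.log2(1.0 + sinr) / (P + p_c[:, None])))
    return sinr, ee
```

Under this receiver the intra-cell interference term vanishes up to rounding error, so in a single-cell setting the SINR reduces to $p_{lk}/(\sigma_{n}^{2}\|\bm{a}_{lk}\|^{2})$ .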

III The Proposed CMDP-based Power Allocation Scheme

Using the system model presented in Section II, we formulate the power allocation optimization problem by applying the CMDP, and then propose an offline solution containing the value iteration and Q-learning algorithms to solve it.

III-A Formulation of Optimization Problem

We first extract the main characteristics from the above system model to build a CMDP-based model. A CMDP-based model can be characterized by five elements: decision epochs, states, actions, transition probabilities and rewards [9].

Decision Epochs: Before modeling, the time dimension is partitioned into decision slots represented by $\{1,2,\cdots,n,\cdots\}$ , where the time slot $n$ is defined as the time interval $\left[{nT_{c},(n+1)T_{c}}\right]$ , and $T_{c}$ denotes the channel coherence time. Then, the decision epochs can be indexed with $n$ . We assume that the wireless channel fluctuates slowly and the CSI remains quasi-static and i.i.d. between decision slots.

States: To model the fluctuation in the physical layer, a finite-state Markov channel (FSMC) model can be built to characterize the time-varying behavior of the channel [10]. In our model, the system state space ${S^{C}}={C^{S}}\times{C^{S}}\times\cdots\times{C^{S}}$ is the Cartesian product of cell state space ${C^{S}}$ accounting for the channel gains in each cell, whose component is also a composite state of link state ${\bm{g}}_{llk}^{H}{{\bm{g}}_{li\kappa}}$ , denoted by ${\psi_{lik\kappa}}$ , and each link state is quantized using a finite number of thresholds $\Gamma=\{0={\Gamma_{0}},{\Gamma_{1}},\cdots,{\Gamma_{{Q_{S}}}}=\infty\}$ , where ${\Gamma_{b}}<{\Gamma_{b^{\prime}}},\ \forall\ b<b^{\prime}$ . The composite system state of all the cells is denoted by ${s_{c}}=\{{s_{1}},{s_{2}},\cdots,{s_{L}}\}$ , where ${s_{l}}=\{{\psi_{lik\kappa}}|i=1,\cdots,L;\ k,\kappa=1,\cdots,K\}$ . Based on the above assumptions, the sequence of composite system states forms a Markov chain with transition probabilities $P\{{s_{c}^{{}^{\prime}}}|{s_{c}}\}$ that are independent of actions, which is similar to that in [5], and ${s_{c}}$ , ${s_{c}^{{}^{\prime}}}$ denote the system states in the current and next decision epochs, respectively.

Actions: Let ${A^{C}}={C^{A}}\times{C^{A}}\times\cdots\times{C^{A}}$ denote the system action space, and $C^{A}$ denote the action space of each cell, whose cardinality is ${Q_{A}}$ . Specifically, let $a_{p,l}$ denote the set of the transmit powers allocated to the UTs in the $l$ -th cell. Then, the composite system action can be denoted by ${a_{c}}=\{{a_{p,1}},{a_{p,2}},\cdots,{a_{p,L}}\}$ . Note that a proper choice of the action space can incorporate the QoS requirement on the maximum transmit power for each UT without additional operations.

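Transition Probabilities: Based on the modeling of FSMC, the link state transition occurs only from the current state to its neighboring states. Without loss of generality, we simplify each link state ${\psi_{lik\kappa}}$ as $\psi$ for convenience. Then, the steady-state probability for the $b$ -th link state can be expressed as

$$p_{b}=\int_{\Gamma_{b}}^{\Gamma_{b+1}}\frac{1}{\psi_{0}}e^{-\frac{\psi}{\psi_{0}}}\,d\psi=e^{-\frac{\Gamma_{b}}{\psi_{0}}}-e^{-\frac{\Gamma_{b+1}}{\psi_{0}}},\tag{6}$$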

where ${\psi_{0}}=\mathbb{E}\{\psi\}$ is the average link gain. According to [10], the level-crossing rate of the link gain is given by $h(\psi)=\sqrt{2\pi\psi/{\psi_{0}}}{f_{c}}{e^{-{\textstyle{\psi\over{{\psi_{0}}}}}}}$ , where ${f_{c}}$ is the maximum Doppler frequency normalized by the decision rate $1/T_{c}$ . The link state transition probabilities are determined by

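$$p\{\psi^{\prime}=b^{\prime}\mid\psi=b\}=\begin{cases}\dfrac{h(\Gamma_{b+1})}{p_{b}}, & b^{\prime}=b+1,\ b\in[0,Q_{S}-2],\\[4pt] \dfrac{h(\Gamma_{b})}{p_{b}}, & b^{\prime}=b-1,\ b\in[1,Q_{S}-1],\\[4pt] 1-\dfrac{h(\Gamma_{b+1})}{p_{b}}-\dfrac{h(\Gamma_{b})}{p_{b}}, & b^{\prime}=b,\ b\in[1,Q_{S}-2],\end{cases}\tag{7}$$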

where ${\psi}$ and ${\psi^{{}^{\prime}}}$ denote the link states in current and next decision epoch, respectively. The transition probabilities of $p\{{\psi^{{}^{\prime}}}={b^{\prime}}|{\psi}=b\}$ for the boundaries are given by $p\{{\psi^{{}^{\prime}}}=0|{\psi}=0\}=1-p\{{\psi^{{}^{\prime}}}=1|{\psi}=0\}$ and $p\{{\psi^{{}^{\prime}}}={Q_{S}}-1|{\psi}={Q_{S}}-1\}=1-p\{{\psi^{{}^{\prime}}}={Q_{S}}-2|{\psi}={Q_{S}}-1\}$ .
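The FSMC construction above can be sketched in a few lines. The sketch assumes an exponentially distributed link gain with mean $\psi_{0}$ (as implied by the steady-state probability integral) and equiprobable states, for which setting $p_{b}=1/Q_{S}$ gives $\Gamma_{b}=-\psi_{0}\ln(1-b/Q_{S})$ . The function names and the value `f_c=0.01` for the normalized Doppler frequency are illustrative assumptions, not values from the paper.

```python
import math


def equiprobable_thresholds(q_s, psi0=1.0):
    """Thresholds Gamma_0..Gamma_{Q_S} giving p_b = 1/Q_S for an exponential link gain."""
    return [-psi0 * math.log(1.0 - b / q_s) for b in range(q_s)] + [math.inf]


def fsmc_transition_matrix(thresholds, psi0=1.0, f_c=0.01):
    """Link-state transition matrix T[b][b'] and steady-state probabilities p[b].

    f_c is the maximum Doppler frequency normalized by the decision rate
    (0.01 is an assumed example value).
    """
    q_s = len(thresholds) - 1
    # Steady-state probability of each quantized state.
    p = [math.exp(-thresholds[b] / psi0) - math.exp(-thresholds[b + 1] / psi0)
         for b in range(q_s)]

    def crossing_rate(psi):
        # Level-crossing rate h(psi) of the link gain.
        return math.sqrt(2.0 * math.pi * psi / psi0) * f_c * math.exp(-psi / psi0)

    T = [[0.0] * q_s for _ in range(q_s)]
    for b in range(q_s):
        if b + 1 < q_s:  # move up to the neighboring state
            T[b][b + 1] = crossing_rate(thresholds[b + 1]) / p[b]
        if b >= 1:       # move down to the neighboring state
            T[b][b - 1] = crossing_rate(thresholds[b]) / p[b]
        # Remaining mass stays in the current state; this also covers both boundaries.
        T[b][b] = 1.0 - sum(T[b])
    return T, p
```

Since $p_{b}\,p\{b+1\mid b\}=h(\Gamma_{b+1})=p_{b+1}\,p\{b\mid b+1\}$ , the chain satisfies detailed balance and the steady-state probabilities $p_{b}$ form its stationary distribution. The entries are valid probabilities only when `f_c` is small enough that each row's off-diagonal mass stays below one.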

The composite system state transition probabilities can be computed by

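$$P\{s_{c}^{\prime}\mid s_{c},a_{c}\}=\prod_{l=1}^{L}\prod_{i=1}^{L}\prod_{k=1}^{K}\prod_{\kappa=1}^{K}p\{\psi_{lik\kappa}^{\prime}\mid\psi_{lik\kappa}\}.\tag{8}$$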
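Because the links are treated as evolving independently, the composite transition probability is simply a product of per-link probabilities. A minimal sketch, assuming all links share the same link-level transition matrix `link_T` (a simplification; the function name and the tuple encoding of composite states are hypothetical):

```python
import itertools
import math


def composite_transition(link_T, n_links):
    """Composite transition probabilities as a product over independent links.

    link_T: Q_S x Q_S link-state transition matrix (list of lists).
    n_links: number of links tracked in the composite state.
    Returns the list of composite states (tuples of link-state indices) and a
    nested dict P[s][s_next].
    """
    q_s = len(link_T)
    states = list(itertools.product(range(q_s), repeat=n_links))
    P = {
        s: {s2: math.prod(link_T[b][b2] for b, b2 in zip(s, s2)) for s2 in states}
        for s in states
    }
    return states, P
```

The composite state space grows as $Q_{S}$ raised to the number of links. If every $\psi_{lik\kappa}$ is tracked, there are $L^{2}K^{2}$ links, so $L=2$ , $K=1$ and $Q_{S}=4$ already yield $4^{4}=256$ composite states, which is one reason the quantization level matters for the table size.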

Rewards: We adopt the overall EE as system reward function, which is defined as $R({s_{c}},{a_{c}})$ for the action ${a_{c}}$ at the state ${s_{c}}$ . And the corresponding QoS requirement with respect to the minimum data rate for each UT can be expressed as a series of constraints: ${C_{lk}}({s_{c}},{a_{c}})\geq{r_{\min}}$ , for $l=1,\cdots,L$ and $k=1,\cdots,K$ , where $C_{lk}$ and $r_{\min}$ denote the instantaneous data rate and the required minimum data rate for each UT in the uplink, respectively.

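By exploiting the above CMDP framework, the transmit power can be adjusted according to a stationary policy $\pi=({\delta_{1}},{\delta_{2}},\cdots,{\delta_{n}},\cdots)$ , where each decision rule ${\delta_{n}}$ specifies a mapping function ${\delta_{n}}:{S^{C}}\to{A^{C}}$ to maximize the objective function. Let $\lambda$ denote the discount factor, ${v^{\pi}}(s^{0}_{c})$ denote the expected discounted total reward, and ${c_{lk}^{\pi}}(s^{0}_{c})$ denote the expected discounted total cost associated with the required data rate constraint, given that the policy $\pi$ is used with initial state $s^{0}_{c}$ . Then, we can formulate the CMDP-based optimization problem as follows

$$\begin{aligned}\max_{a_{c}^{n}}\quad & v^{\pi}(s_{c}^{0})=\mathbb{E}_{s_{c}^{0}}^{\pi}\Big\{\sum_{n=1}^{\infty}\lambda^{n-1}R(s_{c}^{n},a_{c}^{n})\Big\}\\ \text{s.t.}\quad & c_{lk}^{\pi}(s_{c}^{0})=\mathbb{E}_{s_{c}^{0}}^{\pi}\Big\{\sum_{n=1}^{\infty}\lambda^{n-1}C_{lk}(s_{c}^{n},a_{c}^{n})\Big\}\geq r_{\min},\quad l=1,\dots,L,\ k=1,\dots,K.\end{aligned}\tag{9}$$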

III-B Offline Solution

To solve the constrained optimization problem in (9), we first utilize the Lagrangian approach [9], [11] to transform the CMDP optimization problem into an equivalent unconstrained MDP optimization problem. For any non-negative vector of Lagrange multipliers (LM) ${\bm{\rho}}={[\,{\rho_{lk}}\,|\,l=1,\cdots,L,\,k=1,\cdots,K\,]^{T}}$ , we define the Lagrangian as

$$L(s_{c},a_{c};\bm{\rho})=R(s_{c},a_{c})+\sum_{l=1,k=1}^{L,K}\rho_{lk}C_{lk}(s_{c},a_{c}).\tag{10}$$

Then, the Bellman’s equations are given by

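$$v_{\bm{\rho}}(s_{c})=\max_{a_{c}}\Big\{L(s_{c},a_{c};\bm{\rho})+\sum_{s_{c}^{\prime}}\lambda P\{s_{c}^{\prime}\mid s_{c},a_{c}\}v_{\bm{\rho}}(s_{c}^{\prime})\Big\}.\tag{11}$$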

Now we propose an offline scheme to derive the optimal power allocation policy. The stationary optimal policy and the corresponding maximum expected discounted total reward function can be obtained by the well-known value iteration algorithm [9], for a fixed LM vector ${\bm{\rho}}$ . Then, we utilize the Q-learning algorithm [11] to determine the proper ${\bm{\rho}}$ for the feasible constraint ${r_{\min}}$ . Specifically, the iteration algorithm is described as follows

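$$\rho_{lk,j^{\prime}+1}=\rho_{lk,j^{\prime}}+\frac{1}{j^{\prime}}\big(r_{\min}-c_{lk}^{\pi^{*}}(s_{c}^{0})\big),\tag{12}$$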

where $j^{\prime}$ is the index of the iteration steps. The convergence to the global optimum ${\bm{\rho}}^{*}$ of the Q-learning algorithm can be ensured, because the functions

$$\int_{0}^{\rho_{lk}}\big(r_{\min}-c_{lk}^{\pi^{*}}(s_{c}^{0})\big)\,d\rho_{lk},\quad l=1,\dots,L,\ k=1,\dots,K\tag{13}$$

are piece-wise linear concave [11]. Taking into consideration the convergence to the global optimum policy ${\pi}^{*}$ ensured by the value iteration algorithm [9], the proposed offline algorithm can attain the global optimum power allocation scheme.

The offline iterative algorithm is summarized in Algorithm 1, where $\epsilon$ denotes an infinitesimal gap, and $i^{\prime}$ is the index of the iteration steps for the value iteration algorithm. We remark here that the inner iteration between step 2 and step 5 in Algorithm 1 performs the value iteration computation to solve the Bellman’s equations in (11) to obtain the stationary optimal policy for the given LM vector ${\bm{\rho}}_{j^{\prime}}$ , where we denote the $i^{\prime}$ -th approximation to ${v_{\bm{\rho}}}(\cdot)$ by ${v_{i^{\prime}}^{\pi}}(\cdot)$ . We also remark here that the outer iteration between step 2 and step 5 in Algorithm 1 performs the Q-learning computation to solve the equations in (12) to obtain the optimal LM vector ${\bm{\rho}}^{*}$ , where we replace $c_{lk}^{{\pi}^{*}}(s_{c}^{0})$ by $c_{lk}^{{\pi}^{*}({{\bm{\rho}}_{j^{\prime}}})}(s_{c}^{0})$ in the $j^{\prime}$ -th iteration. We point out here that the value of the components of the initial LM vector ${\bm{\rho}}_{0}$ should be set to be very large to converge to the optimal LM vector in consideration of the minimum data rate constraint.
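The two nested loops described above can be sketched on a toy problem. The sketch below is not a reproduction of Algorithm 1: it assumes a single rate constraint, action-independent transitions `P[s][s']`, small tabular rewards `R[s][a]` and rates `C[s][a]`, and it projects the multiplier onto $\rho\geq 0$ after each update (an assumption motivated by the non-negativity of the LM vector). All function names are hypothetical.

```python
def value_iteration(P, lagr, lam, eps=1e-9):
    """Inner loop: solve v(s) = max_a { lagr[s][a] + lam * sum_s' P[s][s'] v(s') }."""
    n_s = len(P)
    v = [0.0] * n_s
    while True:
        future = [lam * sum(P[s][s2] * v[s2] for s2 in range(n_s)) for s in range(n_s)]
        q = [[lagr[s][a] + future[s] for a in range(len(lagr[s]))] for s in range(n_s)]
        v_new = [max(row) for row in q]
        gap = max(abs(x - y) for x, y in zip(v_new, v))
        v = v_new
        if gap < eps:  # eps plays the role of the small convergence gap
            return v, [row.index(max(row)) for row in q]


def discounted_cost(P, C, policy, lam, eps=1e-9):
    """Expected discounted total rate of a fixed policy, for every start state."""
    n_s = len(P)
    c = [0.0] * n_s
    while True:
        c_new = [C[s][policy[s]] + lam * sum(P[s][s2] * c[s2] for s2 in range(n_s))
                 for s in range(n_s)]
        gap = max(abs(x - y) for x, y in zip(c_new, c))
        c = c_new
        if gap < eps:
            return c


def solve_cmdp(P, R, C, r_min, s0, lam=0.9, rho0=10.0, n_outer=50):
    """Outer loop: update the multiplier with step size 1/j, starting from a large rho0."""
    rho = rho0
    policy, cost = None, None
    for j in range(1, n_outer + 1):
        # Lagrangian reward for the current multiplier.
        lagr = [[R[s][a] + rho * C[s][a] for a in range(len(R[s]))] for s in range(len(R))]
        _, policy = value_iteration(P, lagr, lam)
        cost = discounted_cost(P, C, policy, lam)[s0]
        # Multiplier update, projected onto rho >= 0 (assumed projection).
        rho = max(0.0, rho + (r_min - cost) / j)
    # policy and cost belong to the multiplier used in the last inner solve;
    # rho is the value after the final update.
    return policy, cost, rho
```

With a loose rate target the multiplier shrinks toward zero and the policy becomes reward-greedy; with a tight target it stays large and the policy favors the high-rate actions. A fixed number of outer iterations is used here for simplicity in place of a stopping test on the multiplier.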

The obtained optimal decision policy ${\pi}^{*}({{\bm{\rho}}^{*}})$ in Algorithm 1 contains a series of optimal decision rules ${\delta^{*}}$ , each of which specifies a mapping function ${\delta^{*}}:{S^{C}}\to{A^{C}}$ to get the maximum reward. The mapping function ${\delta^{*}}$ can be exploited to construct an offline look-up table to avoid frequent and continuous computations. By using the table, the corresponding transmit power can be allocated to the UTs to maximize the reward once the channel states are known. Note here that the tradeoff between the performance gain and the size of the offline table can be balanced by changing ${Q_{S}}$ and ${Q_{A}}$ .
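Using such a table at run time amounts to quantizing the measured link gains and performing a dictionary lookup. A minimal sketch, in which the function names, the tuple-keyed table, and the example values are assumptions for illustration:

```python
import bisect


def quantize(gain, thresholds):
    """Return the index b with Gamma_b <= gain < Gamma_{b+1}.

    thresholds must be sorted and start at 0 and end at infinity.
    """
    return bisect.bisect_right(thresholds, gain) - 1


def allocate_power(link_gains, thresholds, table):
    """Map measured link gains to transmit powers through the offline table.

    table: dict from composite state (tuple of link-state indices) to the
    stored transmit powers (hypothetical layout).
    """
    state = tuple(quantize(g, thresholds) for g in link_gains)
    return table[state]
```

For example, with thresholds `[0.0, 0.5, 1.0, float("inf")]`, the gains `[0.1, 7.0]` map to the composite state `(0, 2)`, and the powers stored under that key are returned without any online optimization.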

IV Simulation Results

In this section, we compare the performance of our proposed CMDP-based power allocation policy with that of the ergodic optimal policy. The ergodic optimal policy is obtained by exhaustive search: at each system state, the feasible action that maximizes the reward function is selected. Note that the comparison does not focus on whether the two policies choose the same action at each system state, but on their long-term performance in terms of the expected discounted total reward. In our simulations, unless otherwise stated, the system parameters are set as follows: the number of cells is ${L=2}$ , the number of UTs in each cell is ${K=1}$ , the path loss factor $\varphi$ is 1, the path loss exponent $\alpha$ is 3.7, the variance of log-normal shadow fading $\sigma_{sh}^{2}$ is 10 dB, the number of antennas at each BS is $M=128$ , the average circuit power consumption is ${p_{lc}}=10\,\text{mW}$ for each ${l}$ [4], the link states are equiprobable and quantized with ${Q_{S}}=4$ states, the action at each system state is chosen from $10^{-2}\,\text{mW}$ to $10^{2}\,\text{mW}$ with ${Q_{A}}=20$ intervals, and the discount factor $\lambda$ is ${0.9}$ .

Fig. 2 shows the performance of our proposed policy in comparison with the ergodic optimal policy. It is clear that the performance gap between them is negligible. In addition, the expected discounted total reward increases as ${\lambda}$ becomes larger. This is because a larger discount factor means that longer-term rewards are taken into consideration.
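One way to see the effect of the discount factor is to assume, purely for illustration, a constant per-slot reward $\bar{R}$ (a simplification not made in the paper). The discounted sum is then a geometric series:

$$\sum_{n=1}^{\infty}\lambda^{n-1}\bar{R}=\frac{\bar{R}}{1-\lambda},$$

so $\lambda=0.9$ scales the per-slot reward by a factor of 10, whereas $\lambda=0.5$ scales it by only 2.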

Fig. 3 and Fig. 4 show how the maximum transmit signal-to-noise ratio (SNR) allocated to the UTs affects the expected discounted total reward of the policies with $\sigma_{n}^{2}=-101\,\text{dBm}$ . It can be observed that the performance of the two policies stops increasing and tends to be constant when the maximum transmit SNR exceeds a certain threshold. The reason is that there is no longer any need to consume more power once the maximum expected total reward has been achieved. In Fig. 3, it can be observed that the performance of both policies improves as ${Q_{S}}$ gets larger, which is due to the finer quantization of the channel states. It can also be observed that there is a performance gap between the two policies when the maximum transmit SNR is large enough. This is because the iteration process of the value iteration algorithm makes our proposed policy sensitive to the channel state quantized by ${Q_{S}}$ , and this sensitivity is aggravated by a larger maximum transmit SNR. In Fig. 4, it can be observed that the performance of the two policies improves as ${Q_{A}}$ gets larger when the maximum transmit SNR is large enough. The reason is that a larger ${Q_{A}}$ allows the transmit power to be allocated to the UTs with higher precision. In addition, we can see that the performance gain from increasing ${Q_{A}}$ is no longer significant when ${Q_{A}}$ gets large enough. This reveals that the proposed policy can achieve good performance with a small ${Q_{A}}$ , i.e., a low-precision transmit power quantization is enough for our proposed policy.

In Fig. 5, the impact of the number of antennas at the BS on the performance of the policies is shown. We can see that the performance growth tends to slow down as $M$ increases. The performance of our proposed policy is very close to that of the ergodic optimal policy when $M$ is large enough, but not when $M$ is small. Recall that the value iteration algorithm is sensitive to the channel state based on ${\bm{g}}_{llk}^{H}{{\bm{g}}_{li\kappa}}$ , which depends on the parameter $M$ , and that the large-scale fading coefficient $\sqrt{{\beta_{lik}}}$ is independent of $M$ . Correspondingly, for a smaller $M$ , the impact of the small-scale fading coefficient of the channel increases sharply, which results in a less accurate model of the channel state. This leads to the performance gap between the two policies for smaller $M$ .

V Conclusions

In this paper, we have proposed a CMDP-based power allocation algorithm for the uplink of multi-cell massive MIMO systems to maximize EE under two QoS requirements. The performance of the policy obtained by our CMDP-based offline algorithm is very close to that of the ergodic optimal policy, and further analysis has been given to determine the impact of system parameters on the long-term EE.

Acknowledgments

This work was supported in part by the National Basic Research Program of China (973 Program 2012CB316004), the National 863 Project (2015AA01A709), and the Natural Science Foundation of China (61221002).

Bibliography

[1] T. L. Marzetta, “Noncooperative cellular wireless with unlimited numbers of base station antennas,” IEEE Trans. Wireless Commun., vol. 9, no. 11, pp. 3590–3600, Nov. 2010.
[2] E. Björnson, L. Sanguinetti, J. Hoydis, et al., “Optimal design of energy-efficient multi-user MIMO systems: is massive MIMO the answer?” IEEE Trans. Wireless Commun., vol. 14, no. 6, pp. 3059–3075, June 2015.
[3] S. Mukherjee and S. K. Mohammed, “On the energy-spectral efficiency trade-off of the MRC receiver in massive MIMO systems with transceiver power consumption,” arXiv:1404.3010v1 [cs.IT], Apr. 2014.
[4] S. K. Mohammed, “Impact of transceiver power consumption on the energy efficiency of zero-forcing detector in massive MIMO systems,” IEEE Trans. Commun., vol. 62, no. 11, pp. 3874–3890, Nov. 2014.
[5] P. Wang, X. Zhang, and M. Song, “Optimal stochastic subcarrier and power allocations for QoS-guaranteed services in OFDMA multicell cooperation networks,” in Proc. IEEE ICC, pp. 6449–6453, June 2013.
[6] D. V. Djonin and V. Krishnamurthy, “MIMO transmission control in fading channels – a constrained Markov decision process formulation with monotone randomized policies,” IEEE Trans. Signal Process., vol. 55, no. 10, pp. 5069–5083, Oct. 2007.
[7] K. Bi, Q. Yang, F. Fu, et al., “Energy-efficient power and subcarrier allocation for OFDMA systems with value function approximation approach,” in Proc. IEEE ICTC, pp. 530–535, Oct. 2012.
[8] G. Miao, “Energy-efficient uplink multi-user MIMO,” IEEE Trans. Wireless Commun., vol. 12, no. 5, pp. 2302–2313, May 2013.