Adaptive Predictive Power Management for Mobile LTE Devices

Peter Brand, Joachim Falk, Jonathan Ah Sue, Johannes Brendel, Ralph Hasholzner, and Jürgen Teich

arXiv:1907.02774·cs.NI·July 8, 2019


TL;DR

This paper introduces proactive machine learning algorithms to improve power management in LTE mobile devices, achieving up to 17% energy savings by predicting transmission inactivity periods.

Contribution

It presents and compares supervised and reinforcement learning approaches for predictive power management in LTE devices, a novel proactive strategy in this context.

Findings

01

Achieved up to 17% energy savings.

02

Compared effectiveness of supervised and reinforcement learning.

03

Demonstrated the feasibility of proactive power management.

Abstract

Reducing the energy consumption of mobile phones is a crucial design goal for cellular modem solutions for LTE and 5G standards. In addition to improving the power efficiency of components through structural and technological advances, optimizing the energy efficiency through improved dynamic power management is an integral part in contemporary hardware design. Most techniques targeting mobile devices proposed so far, however, are purely reactive in powering down and up system components. Promising approaches extend this, by predicting and using information from the environment and the communication protocol to take proactive decisions. In this paper, we propose and compare two proactive algorithmic approaches for light-weight machine learning to predict the control information needed to allow a mobile device to go to sleep states more often, e.g., in time slots of transmission…

Tables

Table 1. TABLE I: Mapping of a TTI information tuple $l$ to a scenario $z$ and corresponding DPM policy $p$. Each element in the policy vector corresponds to one physical modem component, the value indicating at which portion of TTI length the component is switched to the off state. The don't-care values $-$ for the transmitter components $r_{\mathrm{RF\_TX}}$ and $r_{\mathrm{PHY\_TX}}$ stem from the fact that our proposed policies only target the RX chain of the modem.

TTI	Scenario	Policy
$l$	$z$	$p$
$(𝐭, 𝐟)$	$z_{1}$	$p_{1} = (0.5, 0.5, -, -, 1, 1)$
$(𝐭, 𝐭), (𝐟, 𝐭)$	$z_{2}$	$p_{2} = (1, 1, -, -, 1, 1)$
$(𝐟, 𝐟)$	$z_{3}$	$p_{3} = (0, 0, -, -, 1, 1)$

Table 2. TABLE II: Neural Network Hyperparameters

Name	Properties
Layer 1	60 neurons, hyperbolic tangent
Layer 2	10 neurons, hyperbolic tangent
Output Layer	2 neurons, linear
Performance Function	mean squared error

Table 3. TABLE III: Deriving the reward $\mathbf{r}[n-1]$ for the last action $\mathbf{a}[n-1]\equiv\mathbf{l}_{p}[n]$ and the actually observed TTI information $\mathbf{l}[n]$. The reward values are in descending order: (i) correct predictions, (ii) false positives, and (iii) false negatives. A special case is $\mathbf{l}_{p}[n]=(\mathbf{f},\mathbf{f})$, as a direct assessment of prediction quality can only be performed during the dedicated learning phase. During the exploitation phase, the input $\mathbf{l}[n]$ may sometimes not be observable, $(\mathbf{u},\mathbf{u})$, i.e., our predictive DPM decided to completely turn off the modem. In this case, we assume the prediction to be correct and assign a positive reward. However, if turning the modem off turned out to be wrong, i.e., data had to be re-transmitted by the eNodeB, the misprediction will be penalized later through the discussed additional mechanisms.

Prediction	Reality	Reward	Description
$𝐥_{p} [n]$	$𝐥 [n]$	$𝐫 [n - 1]$
$(𝐭, 𝐟)$	$(𝐭, 𝐟)$	2	energy saved
	$(𝐭, 𝐭) (𝐟, 𝐭)$	-5	false negative
	$(𝐟, 𝐟)$	0	false positive
$(𝐭, 𝐭), (𝐟, 𝐭)$	$(𝐭, 𝐟)$	0	false positive
	$(𝐭, 𝐭) (𝐟, 𝐭)$	2	energy saved
	$(𝐟, 𝐟)$	0	false positive
$(𝐟, 𝐟)$	$(𝐭, 𝐟)$	-5	false negative
	$(𝐭, 𝐭) (𝐟, 𝐭)$	-5	false negative
	$(𝐟, 𝐟)$	0	energy saved
	$(𝐮, 𝐮)$	0	assumed correct
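
The reward assignment of Table III can be read as a plain lookup. The following is a minimal sketch, not the authors' implementation: the names `grant_class`, `REWARD`, and `reward` are hypothetical, a TTI tuple is assumed to be `(ULG present, DLG present)`, and `None` is used here to stand for the unobservable value $\mathbf{u}$. The numeric values are copied from the table as listed.

```python
# Sketch of the Table III reward lookup (names and encoding are assumptions).
T, F, U = True, False, None  # U stands for the unobservable value u


def grant_class(l):
    """Collapse a TTI tuple (ULG present, DLG present) into a Table III row label."""
    if l == (U, U):
        return "unobserved"
    ulg, dlg = l
    if dlg:
        return "dl"  # (t, t) or (f, t)
    return "ul_only" if ulg else "none"  # (t, f) or (f, f)


# (predicted class, observed class) -> reward, values as listed in Table III
REWARD = {
    ("ul_only", "ul_only"): 2,
    ("ul_only", "dl"): -5,
    ("ul_only", "none"): 0,
    ("dl", "ul_only"): 0,
    ("dl", "dl"): 2,
    ("dl", "none"): 0,
    ("none", "ul_only"): -5,
    ("none", "dl"): -5,
    ("none", "none"): 0,
    ("none", "unobserved"): 0,
}


def reward(predicted, observed):
    """Reward r[n-1] for prediction l_p[n] given the observed l[n]."""
    return REWARD[(grant_class(predicted), grant_class(observed))]
```

Combinations that the table does not list (for example, an unobservable TTI after a predicted grant) raise a `KeyError` in this sketch rather than guessing a value.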

Table 4. TABLE IV : Translation of arithmetic operations to FLOPs count.

Operation	Complexity $[FLOP]$
Addition	1
Subtraction	1
Comparison	1
Multiplication	2
Division	4
Exponential	8
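
As an illustration of how the weights in Table IV turn operation counts into a FLOP figure, here is a small sketch. The helper names `flops` and `dense_layer_ops` are hypothetical, and the assumption that each neuron costs one multiplication and one addition per input (the last addition being the bias) is ours; activation functions are deliberately left out because their operation breakdown is not given here.

```python
# FLOP weights transcribed from Table IV.
FLOP_COST = {"add": 1, "sub": 1, "cmp": 1, "mul": 2, "div": 4, "exp": 8}


def flops(op_counts):
    """Total FLOPs for a dict mapping operation name -> number of occurrences."""
    return sum(FLOP_COST[op] * count for op, count in op_counts.items())


def dense_layer_ops(n_in, n_out):
    """Operation counts of a fully connected layer without its activation.

    Assumption: per neuron, n_in multiplications and n_in additions
    (n_in - 1 to sum the products plus 1 for the bias).
    """
    return {"mul": n_in * n_out, "add": n_in * n_out}
```

With the layer sizes of Table II, the weighted sums of Layer 2 (60 inputs, 10 neurons) come to `flops(dense_layer_ops(60, 10))`, i.e. 600 multiplications and 600 additions, or 1800 FLOPs under these assumptions.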

Equations

DLG = (ndi, tbs, mcs) \in \mathbb{B} \times \mathbb{N}^{+} \times \mathbb{N}^{[0, 31]}

ULG = (ndi, tbs, mcs) \in \mathbb{B} \times \mathbb{N}^{+} \times \mathbb{N}^{[0, 31]}

\mathbf{l}[n] = (ULG, DLG) \in L

\mathbf{l} = \langle \mathbf{l}[1], \mathbf{l}[2], \ldots, \mathbf{l}[N] \rangle \in L^{N}

t_{r} \in \left\{\frac{0}{W}, \frac{1}{W}, \frac{2}{W}, \dots, \frac{W}{W}\right\} \cup \{-\}

FPR = \frac{\#\,\text{grant presences erroneously predicted}}{\#\,\text{grant absences}}

FNR = \frac{\#\,\text{grant absences erroneously predicted}}{\#\,\text{grant presences}}
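
The two error rates above can be computed directly from a pair of Boolean sequences. This is a minimal sketch with a hypothetical function name; returning 0.0 when a denominator is zero is our convention, not something stated in the paper.

```python
def error_rates(predicted, actual):
    """Return (FPR, FNR) for sequences of booleans (True = grant present)."""
    pairs = list(zip(predicted, actual))
    # Grant presences erroneously predicted: predicted present, actually absent.
    false_pos = sum(1 for p, a in pairs if p and not a)
    # Grant absences erroneously predicted: predicted absent, actually present.
    false_neg = sum(1 for p, a in pairs if a and not p)
    absences = sum(1 for _, a in pairs if not a)
    presences = sum(1 for _, a in pairs if a)
    fpr = false_pos / absences if absences else 0.0
    fnr = false_neg / presences if presences else 0.0
    return fpr, fnr
```

For instance, `error_rates([True, True, False, False], [True, False, True, False])` has one false positive among two absences and one false negative among two presences.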

l(\Sigma) = l(\theta_{0} + \theta_{1} x_{1} + \ldots + \theta_{n} x_{n})

R = \sum_{t, y \in \{0, 1\}} P(y, t)\, C(y, t)

R = P(t = 1) \cdot FNR \cdot C(0, 1) + P(t = 0) \cdot FPR \cdot C(1, 0)

\frac{FNR_{1} - FNR_{2}}{FPR_{2} - FPR_{1}} = \frac{C(1, 0)\, P(t = 0)}{C(0, 1)\, P(t = 1)} = m, \quad \text{with } m \in \mathbb{R}^{+*}

R = \begin{pmatrix} C(1, 1) & C(1, 0) \\ C(0, 1) & C(0, 0) \end{pmatrix}

R = \begin{pmatrix} 0 & 0.15 \\ 0.85 & 0 \end{pmatrix}
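
To make the risk expression concrete, the sketch below evaluates $R = P(t=1)\cdot FNR\cdot C(0,1) + P(t=0)\cdot FPR\cdot C(1,0)$ and the slope $m$. It assumes, from the order of the cost entries listed above, that $C(1,0)=0.15$ and $C(0,1)=0.85$, and it uses $P(t=0) = 1 - P(t=1)$. The function names are hypothetical.

```python
C_FP = 0.15  # assumed C(1, 0): cost of predicting a grant that is absent
C_FN = 0.85  # assumed C(0, 1): cost of missing a grant that is present


def expected_risk(p_grant, fnr, fpr, c_fn=C_FN, c_fp=C_FP):
    """R = P(t=1) * FNR * C(0,1) + P(t=0) * FPR * C(1,0)."""
    return p_grant * fnr * c_fn + (1.0 - p_grant) * fpr * c_fp


def iso_risk_slope(p_grant, c_fn=C_FN, c_fp=C_FP):
    """m = C(1,0) * P(t=0) / (C(0,1) * P(t=1)); requires 0 < p_grant <= 1."""
    return (c_fp * (1.0 - p_grant)) / (c_fn * p_grant)
```

Two operating points whose FNR/FPR trade-off has slope `iso_risk_slope(p_grant)` yield the same risk under these costs, which is one way to compare predictor configurations.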

a^{best} = \operatorname*{argmax}_{a} Q(s, a)

a[n] = \begin{cases} \text{random action} & \text{with probability } \epsilon \\ a^{best} & \text{with probability } 1 - \epsilon \end{cases}
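
The $\epsilon$-greedy rule above can be sketched in a few lines. The representation of one row of the Q-table as a dictionary from action to value, and the function name, are assumptions made for illustration only.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action from q_row (dict: action -> Q(s, action)).

    With probability epsilon a uniformly random action is returned,
    otherwise the action with the highest Q-value (a^best).
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_row))
    return max(q_row, key=q_row.get)
```

Setting `epsilon` to 0 gives pure exploitation, while 1 gives pure exploration.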

Q(\mathbf{s}[n-1],\mathbf{a}[n-1])

Q(\mathbf{s}[n-1],\mathbf{a}[n-1])

+\alpha\cdot(Q(\mathbf{s}[n],\mathbf{a}[n])\cdot\gamma+\mathbf{r}[n-1])

\forall s \in S : Q(s, (\mathbf{t}, \mathbf{t})) = Q(s, (\mathbf{f}, \mathbf{t})) \geq Q(s, (\mathbf{t}, \mathbf{f})) \geq Q(s, (\mathbf{f}, \mathbf{f}))

D_{G}^{K} = \frac{\#\,\text{of grants of type } G}{K}, \quad G \in \{DLG, ULG\}
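
A possible reading of the grant density is the share of TTIs in a window that carry a grant of type $G$. The sketch below assumes that $K$ is the window length in TTIs and that each TTI is encoded as a tuple `(ULG present, DLG present)`; both are assumptions for illustration, as is the function name.

```python
GRANT_INDEX = {"ULG": 0, "DLG": 1}  # position within the assumed TTI tuple


def grant_density(window, grant_type):
    """D_G^K for a window of K TTIs, each a tuple (ULG present, DLG present)."""
    k = len(window)
    if k == 0:
        raise ValueError("window must contain at least one TTI")
    idx = GRANT_INDEX[grant_type]
    return sum(1 for tti in window if tti[idx]) / k
```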

E_{Q} = \frac{c_{Q}}{f} \cdot P_{f} \cdot f


Full text

Adaptive Predictive Power Management for Mobile LTE Devices

Peter Brand, Joachim Falk, Jonathan Ah Sue, Johannes Brendel, Ralph Hasholzner, and Jürgen Teich

Peter Brand, Joachim Falk, and Jürgen Teich are with the Friedrich-Alexander-Universität Erlangen-Nürnberg ({firstname.lastname}@fau.de).

Jonathan Ah Sue, Johannes Brendel, and Ralph Hasholzner are with Intel Deutschland GmbH ({firstname.lastname}@intel.com).

Abstract

Reducing the energy consumption of mobile phones is a crucial design goal for cellular modem solutions for LTE and 5G standards. In addition to improving the power efficiency of components through structural and technological advances, optimizing the energy efficiency through improved dynamic power management is an integral part in contemporary hardware design. Most techniques targeting mobile devices proposed so far, however, are purely reactive in powering down and up system components. Promising approaches extend this, by predicting and using information from the environment and the communication protocol to take proactive decisions. In this paper, we propose and compare two proactive algorithmic approaches for light-weight machine learning to predict the control information needed to allow a mobile device to go to sleep states more often, e.g., in time slots of transmission inactivity in a cell. The first approach is based on supervised learning, the second one based on reinforcement learning. As the implementation of learning techniques also creates energy and resource costs, both approaches are carefully evaluated not only in terms of prediction accuracy, but also overall energy savings. Using the presented technique, we observe achievable energy savings of up to 17%.

1 Introduction

Regarding usability and marketability of battery-powered consumer products such as mobile phones, the minimization of energy consumption in operation is a must. This is especially important, as battery technology is not advancing fast enough to cope with increased energy consumption of, e.g., displays or application processors. Depending on various factors, the modem (as the focus of this paper) can be a major contributor to overall energy consumption being responsible for up to 65 % of the overall consumed energy [1]. Apart from structural and technological improvements aiming to lower the power consumption of integrated circuits regardless of their operational mode, Dynamic Power Management (DPM) [2] serves to reduce energy consumption by switching idle components from high power to less power-consuming modes. This is realized via so-called policies that represent schedules of power states for all components of a system.

In Long Term Evolution (LTE) [3] and 5G, base stations are responsible for scheduling the traffic in a cell to and from all mobile devices. While this scheduling is entirely known to the base station, being the communication master, the mobile devices have no knowledge when they will be granted time slots for transmission (via so-called grants in the control channel). In order to guarantee that each such grant is received, each device would need to continuously monitor the control channel. Due to factors like radio quality or other mobile devices in the cell, there can be a significant amount of time and energy spent by a modem to decode the control channel only to realize that there was no relevant grant transmitted.

Therefore, we propose proactive DPM techniques based on machine learning that predict whether a mobile device will receive a relevant grant at a certain time step. Contrary to reactive DPM techniques that take decisions only after perceiving a grant presence or absence, our proactive approach monitors the control channel only when a grant presence is predicted. Whenever a grant absence is predicted, distinct components of the modem can be put to low power states significantly sooner.

Because of the possibility of a high variability in the scheduling behavior of a base station, we argue that proactive DPM needs to be capable of being trained online without any prior experience in a cell. In this context, we propose and compare two algorithmic approaches for light-weight machine learning to predict the control information needed to allow a mobile device to save energy, e.g., in time slots of transmission inactivity in a cell.

The first approach is based on supervised learning, the second one based on reinforcement learning. As the implementation of such learning techniques also creates energy and resource costs, both approaches are carefully tuned and evaluated not only in terms of prediction accuracy, but also in terms of overall energy savings, based on both simulated data as well as data captured in real LTE cells. From the viewpoint of a mobile device in an LTE environment, a computationally light-weight adaptive predictive DPM system that is able to learn online to achieve net energy savings is therefore the preferred solution.

The rest of the paper is structured as follows: First, we discuss related work in Section 2. Next, in Section 3, we introduce the basics of the LTE communication protocol. Here, we introduce which control events are relevant for adaptive DPM. Subsequently, in Section 4, we discuss the architecture of our LTE modem as well as the different power saving policies that can be realized by this architecture. In Section 5, we propose two techniques for predictive power management based on machine learning. Here, a special focus is put on data formats, input and output, and the definition of policies. Next, in Section 6, we discuss the applicability and advantages of different machine learning approaches, including their capabilities for online learning and giving a definition of classification errors. Moreover, we formalize the prediction problem both as a reinforcement learning problem and formulate a supervised learning problem. Based on this, a quantitative study of computational complexity as well as resource and energy requirements for the implementation of the discussed algorithms is given in Section 7. Experiments are conducted to show that significant net energy savings are achievable in the steady state of scenarios such as video download. Finally, we conclude with an outlook on future research directions in Section 8.

2 Related Work

A comprehensive overview of DPM is provided in [2]. Here, DPM techniques are grouped into (i) adaptive predictive and (ii) stochastic approaches.

Stochastic control problems are characterized by non-deterministic power state transition times and by components that offer more than two power states (i.e., not only on/off). Most related and relevant to our work are adaptive predictive approaches: Predictive power management [4, 5, 6, 7] uses machine learning techniques to predict, e.g., the length of idle intervals. This information can be used to choose an optimal timeout policy or to define suitable power management policies of when to power down/up individual components. Workload-dependent DPMs for, e.g., multi-processor systems [8, 7], often act event-triggered. Here, the DPM system receives a signal indicating that a component is idle and only then has to perform a prediction of an optimal time-out length.

In contrast to this, in the LTE context and for mobile devices, suitable control signals have to be determined first to trigger dynamic power management actions. Second, in order to save considerable amounts of energy, such control signals themselves must be predicted. Finally, instead of event-triggering, we rather encounter a periodic, time-triggered control problem.

In this paper, we base our modeling of the LTE environment and LTE-compliant behavior on [3] and [9]. For LTE or wireless networks in general, there exist only very few approaches to predictive DPM that can be distinguished by the location where predictions are performed: (i) the core network or (ii) the individual mobile device.

In case of the core network, research work has focused on the question of how to save power by optimizing resource block allocation, see e.g. [10, 11, 12]. These works aim to reduce the average power consumption of all devices registered in a cell by optimizing scheduling decisions from the base station through machine learning. In contrast to this, our approach is employed on the side of the mobile device in such a network, unaware of the often dynamic scheduling policy of a base station and without any global knowledge on the number of devices in a cell and the number and types of current requests.

Two main challenges have to be coped with in this context: Scarcity of global knowledge available to the predictor in the device as well as scarcity of computational resources available in the modem. A further difference is the strict requirement of reduced computational complexity for our use-case, as a mobile device is both battery-powered and severely constrained in computational power.

Despite these huge challenges, both approaches might be applied hand-in-hand, as our approach described in the following does not assume any knowledge given on the scheduling techniques used in a base station.

Finally, there also exist a few approaches to adaptive power management on mobile devices that can be distinguished by their adaptiveness to new situations in a cell, e.g., the number of devices, and by the required additional interaction with the core network. For example, the work in [13] examines how mobile device receiver behavior can be adapted to increase energy efficiency. However, the proposed solution requires extending the protocol by additional signals that indicate superfluous computations in order to notify the system in advance of opportunities to switch components to low-power states. Although requiring no forecast, this technique would only be applicable if the full LTE communication protocol were extended by the proposed signaling mechanism. The authors of [14] use machine learning to predict when data transmissions are sparse enough to be delayed without the user noticing. Upon prediction of such an opportunity, the mobile device then signals to the base station to defer the communication, leading to burstier traffic and reducing the amount of time during which data has to be received. In contrast, our approach proposed in this paper works on top of any MAC layer communication protocol without any change and requires no additional control signals.

In [15], an approach for grant prediction in LTE is introduced and shown in theory to be beneficial for predictive DPM, but no actual prediction technique is proposed and only ideal energy savings are reported. In addition to [16], where a solution to the prediction problem is proposed, this paper presents an in-depth investigation of the impact of trace characteristics on theoretically as well as realistically achievable energy savings for a representative application model from [30]. Finally, [17] presents an approach to the problem of grant prediction based on supervised learning with impressive false negative prediction rates of only about $2\,\%$. However, all training there is performed offline on pre-recorded communication traces as presented in [18] and without any online validation or training, thus neglecting the dynamic behavior in a cell caused by changes in the number of active mobile devices and their requests as well as adaptive scheduling behavior of the corresponding base station. As a result, no margins of achievable net energy savings for real environments are known. To the best of our knowledge, this is the first paper to (i) present and compare different approaches for LTE grant prediction for DPM in mobile devices that (ii) may be applied in any LTE network without (iii) any prior training, and to (iv) analyze net energy savings in (v) real environments on the basis of sound power and energy models of modem and predictor components.

3 LTE Protocol

This section presents an overview of the key control signals of LTE. With an appropriate level of abstraction, we introduce (i) key layers in LTE communication, (ii) radio transmission and reception concepts, and finally (iii) both uplink and downlink control signals. Later sections will use these signals to motivate DPM strategies and design predictors to realize these strategies. For a more in-depth and complete LTE overview, we refer the reader to [3].

3.1 LTE Base Terminology

An LTE network consists of two major parts: (i) the access network and (ii) the core network. The access network is a cell-structured network of evolved Base Stations (eNodeBs) that communicate with mobile devices, i.e., User Equipments (UEs), via radio transmission. As the UE modem components responsible for radio transmission and reception are the focus of this work, our explanation of the LTE protocol layers concentrates on the layers that either realize or directly affect the communication between the UEs and eNodeBs. This abstract network view is shown in Fig. 1 with the relevant layers highlighted: the application layer, the Access Stratum (AS)/Non Access Stratum (NAS) layers, and the Radio Frequency (RF) layer.

(i) The application layer executes user applications, e.g., video streaming, that connect a UE with a service provider, like an internet server, through the core network. A functional model of such communication on application level is given in Section 7.2.

(ii) The AS/NAS layers generate and process LTE-conform control signals as outlined in Sections 3.3 and 3.4.

(iii) The RF layer realizes the radio signal transmission and reception. Analog radio signals are sent between UE and eNodeB aligned to channels divided in frequency into uplink and downlink regions as detailed in Section 3.2.

3.2 Communication Channels

RF communication in LTE between a UE and an eNodeB is aligned in time to so-called Transmission Time Intervals (TTIs), as shown in Fig. 2, each $1$ ms in length. Each TTI is in turn divided into $14$ Orthogonal Frequency-Division Multiplexing (OFDM) symbols of equal length $\frac{1}{14}$ ms. In the frequency domain, all communication is aligned to so-called Physical Resource Blocks (PRBs) of size $180$ kHz. Furthermore, downlink (sent from the eNodeB to the UE) and uplink (sent from the UE to the eNodeB) are separated in frequency.

To make sure transmissions are free of interference, an eNodeB schedules all uplink and downlink data in frequency and time and sends this schedule information to all UEs. As the target of the discussed predictive DPM (see Section 4.1) are the components of the modem radio frequency reception chain, an in-depth understanding of the downlink (Section 3.3) and uplink (Section 3.4) region is necessary.

3.3 Downlink

The downlink region of a TTI is divided (in time) into the control channel (first 3 OFDM symbols) and the data channel (remaining 11 OFDM symbols), see Fig. 3. As the names imply, the data channel is reserved for actual user data, while the control channel holds all relevant protocol information.

Section 4.1 will show how a prediction of the information contained here can be leveraged to obtain energy savings compared to a state-of-the-art DPM.

3.3.1 Grant Signaling

Grant signaling is used to notify the UEs of scheduling opportunities with (i) Downlink Grants and (ii) Uplink Grants. These grants are used for the purpose of notifying a UE of specific PRBs at specific times where (i) downlink transmissions are to be received or where (ii) uplink transmissions are allowed. Each type of grant information is signaled in the control channel of a TTI and addressed to a specific UE. Moreover, downlink grants always point to PRBs in the data channel of the very same TTI, whereas uplink grants indicate PRBs to be used by the UE exactly $4$ TTIs in the future. Finally, to serve as input for our work on grant prediction in Section 6, downlink and uplink grants can be formally described by tuples:

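$DLG = (ndi, tbs, mcs) \in \mathbb{B} \times \mathbb{N}^{+} \times \mathbb{N}^{[0,31]}$ (1)

$ULG = (ndi, tbs, mcs) \in \mathbb{B} \times \mathbb{N}^{+} \times \mathbb{N}^{[0,31]}$ (2)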

$ndi$ is a Boolean value that describes whether a new packet ( $ndi\equiv\mathbf{t}$ ) should be sent, or a previous packet should be retransmitted ( $ndi\equiv\mathbf{f}$ ). $tbs$ is a positive natural number that specifies the exact number of PRBs associated with the grant within the data channel. $mcs$ is a natural number between $0$ and $31$ that specifies the Modulation and Coding Scheme (MCS) for both uplink transmission and downlink reception and describes how the RF payload data is encoded.
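
The grant tuple described above maps naturally onto a small record type. The following is a minimal sketch: the class name `Grant`, the range checks, and the use of `None` to mark an absent grant in a TTI are assumptions for illustration; only the three fields and their value ranges come from the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Grant:
    """A DLG or ULG as the tuple (ndi, tbs, mcs)."""
    ndi: bool  # True: new packet, False: retransmission
    tbs: int   # positive natural number (PRBs associated with the grant)
    mcs: int   # Modulation and Coding Scheme index, 0..31

    def __post_init__(self):
        if self.tbs < 1:
            raise ValueError("tbs must be a positive natural number")
        if not 0 <= self.mcs <= 31:
            raise ValueError("mcs must lie in [0, 31]")


# Assumed encoding of the TTI information (ULG, DLG); None marks an absent grant.
TTIInfo = Tuple[Optional[Grant], Optional[Grant]]
```

A TTI carrying only a downlink grant would then be written as `(None, Grant(ndi=True, tbs=12, mcs=17))`, where the field values are made up for the example.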

3.4 Uplink

Not crucial for prediction, but nevertheless important for our evaluation of overall energy consumption in Section 7, are the uplink RF LTE mechanisms. In this work, we consider (i) Buffer Status Reports, (ii) uplink data transmissions, and (iii) ACK/NACK signaling.

(i) BSRs are sent by the UE to notify the eNodeB of data that is ready to be sent, thereby asking for future scheduling and, thus, an uplink grant.

(ii) Uplink data transmission is implicitly scheduled exactly 4 TTIs after the reception of an uplink grant.

(iii) ACK/NACK signaling in the uplink is realized to notify the eNodeB of successful (ACK) or unsuccessful (NACK) data reception by the UE, and is specified to happen exactly 4 TTIs after downlink transmissions (i.e., a received DLG).

Since the eNodeB knows when in time and where in frequency to expect data transmissions and ACK/NACK feedback, it will react to unexpected behavior (no, or erroneous data transmissions, missing ACK/NACK feedback). This is done by rescheduling and resending an uplink grant (missing transmission) or resending the downlink data (missing ACK/NACK), but both kinds of grants with the field $ndi=\mathbf{f}$ .

Similar to the downlink case, the uplink frequency spectrum is subdivided into channels for uplink control and for uplink data, as shown in Fig. 4. Here, the upper- and lowermost PRBs belong to the control channel and the frequencies in between carry the data channel.

Transmissions in a TTI are exclusive to one channel, which means that a UE can either send in the control or data channel, with the data channel taking precedence. Thus, if a UE transmits payload data, everything will be transmitted in bulk in the data channel. Otherwise, it will be sent in the control channel.
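
The timing and channel rules of this section can be summarized in a small helper. This is only a sketch under the rules stated above (uplink data and ACK/NACK both follow 4 TTIs after the triggering grant, and the data channel takes precedence); the function name and return format are hypothetical, and BSRs are ignored for brevity.

```python
UPLINK_DELAY_TTIS = 4  # uplink data and ACK/NACK both follow 4 TTIs after the grant


def uplink_plan(n, ulg_present, dlg_present):
    """Uplink activity caused by the grants received in TTI n.

    Returns (tti, channel, contents) or None if nothing has to be sent.
    """
    contents = []
    if ulg_present:
        contents.append("data")
    if dlg_present:
        contents.append("ack/nack")
    if not contents:
        return None
    # If payload data is sent, everything goes into the data channel;
    # otherwise the feedback is sent in the control channel.
    channel = "data" if ulg_present else "control"
    return (n + UPLINK_DELAY_TTIS, channel, contents)
```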

4 Architecture and Power Modeling

To model the power consumption of a modem and to formalize policies for DPM, we take a Power State Machine (PSM)-based approach according to [1], see Fig. 5. Here, each essential power-manageable hardware component is modeled by a PSM with its respective power states and possible state transitions corresponding to power management decisions. In each power state, a fixed nominal power consumption value is assumed. Fig. 5 distinguishes a Radio Frequency (RF) part and a Physical Layer (PHY) part. Each part is further divided into reception (RX), transmission (TX), and control (CTRL) components, resulting in the six different components shown in Fig. 5a). The RF components pass received information to the PHY components for decoding and obtain from them the encoded information to be sent.

In the following section, we motivate the potential of power and energy savings in an LTE modem through proactive power management. We propose DPM policies based on the prediction of grant signals in the reception path, thus affecting the RF_RX and PHY_RX components. Their PSM models are shown in Fig. 5b). As can be seen, both components can be in a low (off) or high power state (on, Control/Data). While decoding the control region (first three OFDM symbols of a TTI), the PHY_RX component is in the Control state; it transitions to the Data state when decoding the data channel.

4.1 Power Management Policies

In the context of an LTE modem, a power management policy denotes a schedule of the power states of all its DPM-controllable components over the duration of a TTI.

On the receiver (RX) path, an LTE-compliant reactive DPM could distinguish a very simple policy as shown in Fig. 6a): If the data channel contains no data for the UE, which is the case if no DL grant (DLG) is present in the control channel, then the RF_RX and PHY_RX components can be simply turned off for the remainder of the TTI, see the two scenarios $z_{1}$ and $z_{3}$ , respectively, on the left (only an ULG) and on the right (neither ULG nor DLG present) in Fig. 6. Otherwise, the components must remain in the high power state and no power savings are possible in this case (scenario $z_{2}$ shown in the middle).

However, the potential to achieve considerable power and energy savings through reactive DPM in scenarios $z_{1}$ and $z_{3}$ is not very high, as the information to power down RX components becomes available to the DPM only after the complete decoding of the control channel, as shown by the length of the red time intervals of the components RF_RX and PHY_RX in Fig. 6a).

In contrast, imagine the DPM could correctly predict that neither a DLG nor a ULG will be received in the next TTI. Both components could then stay in the off mode for the complete duration of a TTI (see scenario $z_{3}$ in Fig. 6b). Moreover, in scenario $z_{1}$ , this would allow at least the RF_RX component to stay off for a considerably longer time than in case of a reactive DPM.

Our predictive DPM approach proposed in this paper aims to exploit the full potential of energy savings of a UE by trying to maximize the intervals (green in Fig. 6b) in which components are operated in off mode. This is achieved by aggressively turning the modem components off before the decoding of the control channel. To this end, Section 6 will introduce and evaluate two machine learning algorithms for time-series prediction of whether the next TTI will contain a DLG or ULG for a UE. In contrast to a reactive approach, it will be shown that this has the potential for significant net energy savings (see Section 7). However, prediction also bears the danger of losing transmission capacity in uplink or downlink in case of misprediction of grant signals. Such mispredictions may affect not only performance but also future traffic of the whole cell due to retransmissions. Because the eNodeB will generally retransmit data if a UE does not react as expected, e.g., by missing grant signaling due to erroneously turning the modem off, a prolonged data transfer for the affected data will result. In consequence, instead of energy savings, increased energy consumption might be observed due to the required retransmissions. This and the overhead of energy consumed by an implementation of the prediction technique itself must therefore be carefully evaluated.

5 Predictive DPM

This section formalizes the problem of predictive DPM, introduces the notion of DPM policies based on the three scenarios distinguished previously, and provides formulas for energy estimation based on the notion of analyzed traces.

Figure 7 gives an overview of the complete predictive DPM system. Based on received LTE traces with $\mathbf{l}[n]$ denoting the control information received by the modem of a UE in the $n$ th TTI of a trace (see Section 5.1), a predictor as explained in Section 5.3 performs the prediction $\mathbf{l}_{p}[n+1]$ of the next TTI’s control information upon which the next scenario $\mathbf{z}[n+1]$ is determined. Section 5.2 thereby illustrates the notion of policies as schedules of power states of each power-controllable component and how they are determined for each of the characterized scenarios to save the highest amount of energy.

5.1 LTE Traffic Modeling

Let the TTI information $\mathbf{l}[n]$ represent all information that is observable by a single UE during the $n$ th TTI. This information is used for grant prediction. It is defined as a tuple containing the ULG and DLG information, as introduced in Eqs. 1 and 2:

$\mathbf{l}[n] = (ULG, DLG) \in L$

Based on this, a so-called trace of length $N$ can be modeled by a sequence

$\mathbf{l} = \langle \mathbf{l}[1], \mathbf{l}[2], \ldots, \mathbf{l}[N] \rangle \in L^{N}$

5.2 Policies and Mapping of Scenarios to Policies

A policy $p$ is defined as a schedule of the power states of each power-controllable component $r\in R=\{r_{\mathrm{RF\_RX}}, r_{\mathrm{PHY\_RX}}, r_{\mathrm{RF\_TX}}, r_{\mathrm{PHY\_TX}}, r_{\mathrm{RF\_CTRL}}, r_{\mathrm{PHY\_CTRL}}\}$ according to Fig. 5 within a TTI.

Assuming that the initial state of each component is a high power state and that each component can maximally be switched off once during a TTI, the schedule of each component $r$ may be described by a time index

$t_{r} \in \left\{\frac{0}{W}, \frac{1}{W}, \frac{2}{W}, \dots, \frac{W}{W}\right\} \cup \{-\}$

that indicates the fraction of a TTI (with $W=14$ ) at which the component is powered down to its off-state. A policy $p$ is then simply a tuple $p=(t_{1},\cdots,t_{|R|})$ of such time indices.

Hereby, we assume that an LTE-compliant DPM system will never switch a component on, except at the very beginning of a TTI. Following this, a time index of $0$ means that the respective component will not be switched on at all, while $1$ means it is kept on during the entire TTI.

As motivated in the previous section (see Fig. 6), we distinguish three relevant scenarios $Z=\{z_{1},z_{2},z_{3}\}$ for our LTE-modem DPM. In detail, scenario $z_{1}$ corresponds to information only in the control channel, scenario $z_{2}$ has information both in the control and data channel, and scenario $z_{3}$ indicates no information within the TTI.

The mapping of this input to power states and transitions is intuitive: For $z_{1}$ , only the control channel should be received, i.e., RX components can be turned off after the control signals have been processed. For $z_{2}$ , obviously all content of the TTI has to be received, and no RX component should be turned off. Last, for $z_{3}$ , the RX components can be turned off during the whole TTI without any loss of information. The time that the RX components need to process the data contained in the control channel and can forward the information to the DPM system is dependent on the modem itself. In the following energy analysis, we assume the RX components are operated at a 50 % duty cycle and turned off in the second half ( $0.5\cdot t_{\mathrm{TTI}}$ ) of the TTI if no DL grant is present in the control channel.

Three policies, one for each of the above scenarios, are therefore both necessary and sufficient to describe the DPM schedules for each modem component. Table I shows the relation between TTI tuple information $l$, the corresponding scenario $z$, and the resulting policy $p$ to be chosen for the purpose of energy reduction:

Note that the mapping of TTI information $l$ to a scenario $z$ , respectively policy $p$ , can also be used in the following predictive approach, thus, deriving $\mathbf{z}[n+1]$ and policy $\mathbf{p}[n+1]$ directly from a predicted tuple $\mathbf{l}_{p}[n+1]$ according to Table I. Next, we will explain the prediction mechanism itself, which is based on an observed time series of TTI control information.
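The mapping from a TTI tuple to a scenario and an RX switch-off time can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the tuple order (ULG, DLG), the names `Scenario`, `scenario` and `rx_off_time`, and the restriction to the RX components are assumptions made for this sketch. The switch-off fractions follow the description of the three scenarios above, with the 50 % duty cycle assumed for $z_{1}$.

```python
from enum import Enum

class Scenario(Enum):
    Z1 = "control channel only"
    Z2 = "control and data channel"
    Z3 = "no information"

def scenario(ulg: bool, dlg: bool) -> Scenario:
    """Map a TTI tuple (ULG present, DLG present) to a scenario."""
    if dlg:
        return Scenario.Z2   # DL data follows the control channel
    if ulg:
        return Scenario.Z1   # only the control channel carries information
    return Scenario.Z3       # nothing to receive in this TTI

# Fraction of the TTI at which the RX components are switched off
# (assumed values: 0.5 for z1, 1 = never off for z2, 0 = off all TTI for z3).
RX_OFF_TIME = {Scenario.Z1: 0.5, Scenario.Z2: 1.0, Scenario.Z3: 0.0}

def rx_off_time(ulg: bool, dlg: bool) -> float:
    """Return the RX switch-off time index for a (predicted) TTI tuple."""
    return RX_OFF_TIME[scenario(ulg, dlg)]
```

The same lookup applies unchanged to a predicted tuple $\mathbf{l}_{p}[n+1]$, which is what makes the table reusable in the predictive setting.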

5.3 The Predictive DPM Cycle

In contrast to a reactive DPM system, which may take as input only the TTI control information up to and including the current TTI $n$, a time-series based predictor takes a sliding window of length $K$ to predict the TTI control information $\mathbf{l}[n+1]$ of the next TTI, see Fig. 7. Such a predictor may be represented by a function $f_{\mathrm{PRED}}:L^{\mathrm{K}}\rightarrow L$ that uses the information of $\mathbf{l}[n-K+1]\ldots\mathbf{l}[n]$ to compute a prediction $\mathbf{l}_{p}[n+1]$ of the next TTI control information tuple. Because the predictor only has to decide whether a grant will be present in the next TTI, only the four discrete tuples shown in Table I can result from prediction; the predictor will never predict the TTI information value $\mathbf{u}$.

Based on the prediction $\mathbf{l}_{p}[n+1]$ , the scenario $\mathbf{z}[n+1]$ is determined by the mapping function $f_{\mathrm{Z}}:L\rightarrow Z$ according to Table I. The table also describes the final step of mapping the scenario $\mathbf{z}[n+1]$ to a desired policy. This mapping can therefore also be described by a function $f_{\mathrm{DPM}}:Z\rightarrow P$ according to Fig. 7.
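The cycle can be pictured as a composition of functions applied to a sliding window. In the following sketch, `repeat_last` is a deliberately trivial placeholder for $f_{\mathrm{PRED}}$ and is not one of the predictors proposed in this paper; the helper names and the (ULG, DLG) tuple order are likewise assumptions made for illustration.

```python
from collections import deque

def repeat_last(window):
    """Placeholder for f_PRED: predict that the last observed tuple repeats."""
    return window[-1]

def f_z(l_pred):
    """f_Z: map a predicted (ULG, DLG) tuple to a scenario label."""
    ulg, dlg = l_pred
    return "z2" if dlg else ("z1" if ulg else "z3")

def dpm_cycle(trace, k, f_pred=repeat_last):
    """Yield (n, predicted scenario for TTI n+1) once K tuples are available."""
    window = deque(maxlen=k)          # holds l[n-K+1] ... l[n]
    for n, l_n in enumerate(trace):
        window.append(l_n)
        if len(window) == k:
            yield n, f_z(f_pred(tuple(window)))
```

The final step $f_{\mathrm{DPM}}$ would then be a table lookup from the yielded scenario to a policy.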

5.4 Energy Estimation

Since the motivation for applying learning techniques is to reduce the modem's power and thus energy consumption, it is important to also consider the energy overhead of the prediction itself per TTI. In the following, this overhead is denoted by $E_{\mathrm{Q}}$. It will be carefully examined in our experiments in terms of clock cycles and energy consumption when implementing the predictor in software on a Digital Signal Processor (DSP) that is representative for usage in LTE modems (see Section 7). As our goal is to evaluate net energy savings, we take the energy consumption $E^{\mathrm{DPM}}$ of the reactive approach as a baseline for the comparison with the energy consumption $E^{\mathrm{COG}}$, which includes the prediction overhead $E_{\mathrm{Q}}$, obtained by the predictive DPM approach when analyzing a given trace $\mathbf{l}$.

The energy consumption $E^{\mathrm{COG}}$ for a given trace is dependent on the cognitive policy $\mathbf{p}^{\mathrm{COG}}[n]$ (with $\mathbf{p}^{\mathrm{COG}}[n].t_{r}$ carrying the respective time portion of TTI $n$ the resource $r$ will stay in its "on" power state) and the power consumption $P^{\mathrm{on}}_{r}$ ($P^{\mathrm{off}}_{r}$) of each resource $r$ in the high (low) power state. Thus, the energy consumption of a complete trace using predictive DPM can be calculated as follows:

[TABLE]

Since the proposed DPM considers only the power state transitions of the RX components, we assume the TX components ( $r_{2},r_{3}$ ) to be steered by a state-of-the-art reactive LTE DPM policy $p^{\mathrm{DPM}}$ .

Likewise, the energy consumption for the reactive approach, dependent on the reactive LTE DPM policy $p^{\mathrm{DPM}}$ , may be computed as:

[TABLE]

A net energy gain can be achieved if $E^{\mathrm{COG}}<E^{\mathrm{DPM}}$ holds for a trace.
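The comparison can be illustrated with a small sketch. It assumes that, per TTI, each component $r$ contributes $t_{r}\cdot P^{\mathrm{on}}_{r}+(1-t_{r})\cdot P^{\mathrm{off}}_{r}$ multiplied by the TTI length, and that the overhead $E_{\mathrm{Q}}$ is added once per TTI for the predictive approach. This form, the 1 ms TTI length, the function name `trace_energy`, and the dictionary-based data layout are assumptions of this sketch.

```python
T_TTI = 1e-3  # assumed TTI length in seconds (1 ms)

def trace_energy(policies, p_on, p_off, e_q=0.0):
    """Energy in joules for a trace.

    policies: one dict per TTI mapping component name -> on-fraction t_r
    p_on, p_off: dicts mapping component name -> power in watts
    e_q: prediction overhead per TTI in joules (0 for the reactive baseline)
    """
    total = 0.0
    for policy in policies:
        for r, t_r in policy.items():
            total += (t_r * p_on[r] + (1.0 - t_r) * p_off[r]) * T_TTI
        total += e_q
    return total
```

Evaluating the function once with the reactive policies and `e_q=0`, and once with the predictive policies and the measured overhead, yields the two quantities whose difference is the net energy gain.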

6 Machine Learning Techniques for LTE Grant Prediction

There exists a multitude of machine learning approaches. Which of the following approaches is suitable or not depends mainly on which information is available during learning. If during each learning step, there is direct feedback on the quality of a classification or prediction, supervised learning is appropriate. If this information is not readily available at every training step, but rather an approximation of quality must be deduced indirectly from certain events, reinforcement learning may be the appropriate choice. Finally, unsupervised learning may be applied to cases where neither of the above information is available. In our case of LTE grant prediction, a suitable machine learning approach shall correctly predict the next TTI control information based on a sequence of $K$ previous control tuples. Thus, since the correctness of each prediction may be asserted either directly or indirectly (listening and grant being sent or not listening when no grant is being sent), we do not consider unsupervised learning as a preferable technique for our problem.

Before delving into details, Section 6.1 discusses error classes and how mispredictions may affect a UE and even the whole cell. Subsequently, in Section 6.2, both a supervised and a reinforcement learning approach are presented and benefits and shortcomings of each outlined. Next, we present solutions for LTE grant prediction based on supervised learning (Section 6.3) and reinforcement learning (Section 6.4).

6.1 Error Classification

In the context of our stated LTE grant prediction formalization, the scenario identification is dependent on the predicted presence or absence of a DLG and an ULG. Hence, it can be defined as a binary classification problem for which two types of prediction errors exist: False Positive (FP) and False Negative (FN) errors [17].

An FP error means a grant was erroneously predicted to appear. This error is neutral to performance (e.g., data rate), as no information is lost. However, since a false positive error means that components are left in a high power state for longer than needed, energy savings are missed, which negatively affects the non-functional property of energy consumption. To measure the proportion of false positive errors, we define the False Positive Rate (FPR) as:

[TABLE]

The second kind of prediction error is an FN error, which means a grant absence was erroneously predicted. As explained in Section 3.1, this leads to scheduled PRBs being effectively unused because either the downlink data is not received by the UE or the UE will not transmit any uplink data. False negative errors diminish performance, i.e., they lead to a deterioration of data rate and effective bandwidth of the whole LTE cell. Analogous to the FPR, we define the False Negative Rate (FNR) as:

[TABLE]

In summary, minimization of the FPR corresponds to a minimization of energy consumption as defined by Section 5.4, while minimization of the FNR is necessary in order not to affect the quality of service. Hence, the two approaches presented in the following both aim at saving energy while minimizing the FNR.
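Both rates can be computed directly from a sequence of predictions and the corresponding observations. The sketch below assumes the standard definitions, i.e., the FPR as the share of grant-free TTIs in which a grant was predicted and the FNR as the share of TTIs with a grant in which none was predicted; the function name `error_rates` is hypothetical.

```python
def error_rates(predicted, actual):
    """Return (FPR, FNR) for two equal-length sequences of booleans,
    where True means 'grant present'."""
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    negatives = sum(not a for a in actual)   # TTIs without a grant
    positives = sum(bool(a) for a in actual) # TTIs with a grant
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr
```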

6.2 Design Considerations

Depending on the UE application, UE movement, and the behavior of other UEs in the cell, traffic patterns are diverse and may undergo continuous changes.

Thus, when designing a predictor, there are several conflicting design objectives. The first pair is complexity and prediction accuracy. As explained in Section 5.4, performing a prediction as well as training the predictor requires additional computations. For a given prediction accuracy, a less complex prediction algorithm will yield higher energy savings than a more complex one. Of course, a more complex algorithm may achieve a better prediction accuracy if the trace characteristics are hard to predict. For traces with especially simple characteristics, however, the additional complexity may be wasted.

The second important decision is the assumption of stationarity. Assume a trace that is stationary for a long time. In this case, a predictor that converges quickly to a sufficient solution when trained on a short initial part of the trace may be the preferred choice. However, if the observable traces are subject to continuous changes, a solution that is capable of on-line training to improve its prediction may be superior.

Therefore, we propose and compare two different approaches that carefully balance these considerations. The first presented approach (Section 6.3) is based on supervised learning. Here, we propose a predictor based on a neural network, aimed to be trained quickly until it is turned into exploitation mode. The second approach (Section 6.4) is based on reinforcement learning. There, we introduce a light-weight tabular prediction algorithm, which is trained continuously. Finally, in Section 7, both approaches are compared for a number of different traces in terms of prediction accuracy and the potential for energy savings.

6.3 Supervised Learning

A first benchmark of several supervised ML algorithms for LTE grant prediction is described in [17]. In essence, 3 different algorithms have been trained to perform grant prediction. Their output, a value between 0 and 1, is used as cost-sensitive classification input, to decide whether a grant, modeled as 1, or no grant, modeled as 0, should be predicted. This gives the ability to tune the false negative rate (FNR) separately from the accuracy result. Using this two-stage classification approach has 2 main advantages:

• Intrinsic algorithm accuracy comparison: Applying the same cost-sensitive classification technique with different first-stage ML algorithms makes it possible to compare the intrinsic predictive ability of these algorithms. In other words, it becomes possible to answer the following question: which algorithm can inherently model the grant traffic better?

• Tunable operating point: For a given prediction, different FNR targets may be required. For instance, if too many grants have been missed (high FNR) at the beginning of an LTE scenario, it is desirable to adjust only the cost-sensitive parameters in order to obtain a predictor that is less prone to false negatives than to false positives. Such dynamic and straightforward adjustments of the predictor output can be used to tune the FNR depending on the user needs. Therefore, a safe cost-sensitive setup, i.e., low FNR but higher FPR, would be used for scenarios with time constraints where packet losses could be very damaging. For non-critical scenarios, a more aggressive cost-sensitive setup, i.e., low FPR but higher FNR, would be applicable.

6.3.1 Feed-Forward Neural Network

From 3 popular supervised learning approaches, i.e., feed-forward neural networks (FFNN), support vector regression (SVR) and recurrent neural networks (RNN), we chose the least computationally expensive one, the FFNN approach. Indeed, SVR requires solving a high-dimensional optimization problem, often in the dual space [21]. RNNs, when unfolded through time, are also computationally very expensive [22].

In particular, we use a specific type of FFNN where all the neurons of one layer are connected to all neurons of the next layer, i.e., fully connected FFNN. In Fig. 8, we describe the model of one neuron which is the core building block of the entire neural network depicted in Fig. 9, which uses the hyperparameters given in Table II.

Formally, the output of one neuron can be expressed as

[TABLE]

with $(x_{1},...,x_{n})$ being the output of the $n$ previous neurons, $l$ the link function and $\boldsymbol{\theta}\in\mathbb{R}^{n+1}$ the weight vector modified by backpropagation in order to minimize the error between outputs and targets during the training phase.
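As an illustration, a single neuron can be evaluated as a weighted sum passed through the link function. The sketch assumes that the first entry of $\boldsymbol{\theta}$ acts as a bias term and that the link function is a logistic sigmoid; both are assumptions made here, and the name `neuron_output` is hypothetical.

```python
import math

def neuron_output(x, theta, link=lambda v: 1.0 / (1.0 + math.exp(-v))):
    """Output of one neuron: link(theta_0 + sum_i theta_i * x_i).

    x: outputs of the n previous neurons
    theta: n + 1 weights, theta[0] being the bias term (assumed convention)
    link: link function l; a logistic sigmoid is assumed by default
    """
    if len(theta) != len(x) + 1:
        raise ValueError("theta must have one more entry than x")
    return link(theta[0] + sum(t * xi for t, xi in zip(theta[1:], x)))
```

A fully connected layer applies this computation once per neuron, each with its own weight vector, to the same input vector.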

ML algorithms of this type are known to be universal function approximators and can therefore approximate any kind of nonlinear mapping between inputs and outputs [23]. Although training neural networks with backpropagation can be computationally expensive, several variants exist [24], allowing the accuracy-complexity trade-off to be chosen depending on the task. Since no general conclusions can be drawn on the influence of the number of hidden layers and neurons, we use the hyperparameters given in Table II.

The prediction window input length is chosen equal to $K=10$ , which is equivalent to taking the last 10 values of each LTE metric as neural network input.

In total, the input vector is thus a 60-dimensional vector containing the normalized LTE metric values, explicitly considering all information contained in a grant ($\mathrm{ndi}$, $\mathrm{tbs}$, and $\mathrm{mcs}$). These values may give some indication of the past grant history, but also of bandwidth occupancy, channel conditions, and past corrupted data, all of which is information used by the eNodeB to decide on future allocations.

The output is a real-valued estimation of the likelihood of a ULG and/or a DLG presence in the next TTI. This real-valued estimation is then transformed by the cost-sensitive classification decision stage to discern the actual predicted TTI (present or absent).

6.3.2 Cost-Sensitive Classification

A receiver operating characteristic (ROC) curve is often used to assess a classifier's performance. It depicts the trade-off between the FNR and the FPR. The performance of a classifier is assessed by computing the area under the ROC curve (AUC): the higher the AUC, the better the classifier. As depicted in Fig. 10, the FNR and FPR can be tuned to achieve any desired trade-off on the training data.

In the following, let $P(y/t)$ denote the conditional probability that $y$ is predicted given $t$ as target, $P(y,t)$ denotes the joint probability and $C(y,t)$ the cost of predicting $y$ with target $t$ . Therefore, a general cost function $R$ can be defined for the classifier,

[TABLE]

In this work, since correct classifications are not penalized, only cases with $C(1,1)=C(0,0)=0$ are considered and from Bayes’ rule, joint probabilities can be expressed with conditional probabilities. Under the naive Bayes assumption, $\textit{FPR}=P(1/0)$ and $\textit{FNR}=P(0/1)$ and therefore the formulation for Eq. 11 becomes

[TABLE]

Selecting the threshold which allows the best trade-off is done by drawing isocost lines as described in [25]. Using Eq. 12, it can be derived that two points $(1-\textit{FNR}_{1},\textit{FPR}_{1})$ and $(1-\textit{FNR}_{2},\textit{FPR}_{2})$ , in Fig. 10, have the same performance if

[TABLE]

In Fig. 10, the minimum isocost line $(d)$ is depicted for $m=\frac{2}{3}$ . Therefore, $m$ is tuned by the proportion of binary targets from training samples and by specific costs which can be presented in a cost matrix,

[TABLE]

Concretely, we set this cost matrix to

[TABLE]
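The effect of the cost matrix on the operating point can be illustrated by a direct search over candidate thresholds, used here in place of the graphical isocost construction of [25]. With $C(1,1)=C(0,0)=0$, the estimated cost reduces to the cost-weighted counts of false positives and false negatives divided by the number of samples. The function names and any cost values passed to them are hypothetical; they are not the values used in this work.

```python
def expected_cost(scores, targets, threshold, c_fp, c_fn):
    """Estimate R = c_fp * FPR * P(0) + c_fn * FNR * P(1) on the samples.

    scores: real-valued first-stage outputs in [0, 1]
    targets: 1 for grant, 0 for no grant
    """
    n = len(targets)
    fp = sum(s >= threshold and t == 0 for s, t in zip(scores, targets))
    fn = sum(s < threshold and t == 1 for s, t in zip(scores, targets))
    # FPR * P(0) equals fp / n and FNR * P(1) equals fn / n
    return (c_fp * fp + c_fn * fn) / n

def best_threshold(scores, targets, c_fp, c_fn):
    """Pick the candidate threshold with the lowest estimated cost."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates,
               key=lambda th: expected_cost(scores, targets, th, c_fp, c_fn))
```

Raising `c_fn` relative to `c_fp` moves the selected threshold down, which corresponds to the safe setup with a low FNR at the price of a higher FPR.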

6.4 Reinforcement Learning

To best deal with the outlined need for online learning, another feasible approach is reinforcement learning, which is tailored to learn without being given a desired output. This situation occurs whenever the DPM system decides not to decode the control channel of a TTI.

A reinforcement learning system [26], in general, consists of an agent that interacts with the environment. The agent performs an action $a$ , which will affect the environment leading to a new environment state $s$ . Subsequently, the agent will use this environment state $s$ in order to choose the next action. As additional feedback, the agent receives a reward $r$ , which is either calculated from the observed state $s$ or is explicitly given by the environment. The reward indicates how desirable it is for the environment to be in this state. The agent’s goal is to choose actions that maximize the long term reward.

The general approach works as follows: For each possible pair $(s,a)$ of an observable state $s$ of the environment and each action $a$ that may be taken in this state, the agent stores an expected long-term reward value $Q(s,a)$ for taking the action $a$ in state $s$ . Upon encountering a state $s$ , the agent determines an appropriate action according to an action selection algorithm, based on the stored $Q$ -values. After that, the agent calculates a reward $r$ (reward mapping) based on the observed response from the environment and updates the estimated value of $Q(s,a)$ of the last state and last chosen action. Because this update considers the $Q$ -value of the next state, even negative rewards, like retransmissions in our LTE case, will propagate back through previous actions to the erroneous prediction. In the following, we apply these notions and ideas of reinforcement learning to our time series-based task of online prediction of $\mathbf{l}[n+1]$ based on a known time series of $K$ previous TTI information tuples, see Fig. 11.

Here, the state $\mathbf{s}[n]$ during TTI $n$ is given by the presence and absence of ULG and DLG in the current TTI control tuple information $\mathbf{l}[n]$ and the last $K-1$ previous tuples. The action $\mathbf{a}[n]$ selected by the agent for the $n$ th TTI corresponds to the prediction $\mathbf{l}_{p}[n+1]$ of the forthcoming TTI’s control information according to the three scenarios distinguished in Section 4.1.

6.4.1 Action Selection

In our case of LTE grant prediction, the observed training data may itself be affected by the agent's previously chosen actions. Thus, a tension naturally arises between exploration and exploitation.

Exploration, on the one hand, refers to gathering information about the state and action space. This is generally achieved by choosing actions in states that have not yet, or only seldom, been visited. Because the $Q(s,a)$-values cannot be initialized perfectly without prior learning, exploration risks the consequences of taking presumably suboptimal actions. In return, it offers the potential to find better Q-values, or at the very least to broaden the agent's knowledge for the future. Exploitation, on the other hand, is the simple act of taking the currently estimated best action $\mathbf{a}^{\mathrm{best}}$, i.e., the action that maximizes the expected long-term reward:

[TABLE]

Obviously, a suitable balance between exploitation and exploration is key to success when applying reinforcement learning. A handicap in our context of LTE grant prediction is that important information for taking the right decision on DPM is not directly observable by a single UE, like radio condition, number of other UEs in a cell, and the scheduling strategy of the eNodeB. Indeed, in dynamic environments, these unknowns even undergo a constant flux. We therefore argue for a strategy for action selection that permanently has the capacity to explore and refine.

For action selection, an $\epsilon$ -Greedy strategy [27] is proposed, which is a function that chooses at the $n$ th TTI the action with the highest estimated long-term reward value $\mathbf{a}^{\mathrm{best}}$ with a chance of $1-\epsilon$ , and a random action with a chance $\epsilon$ .

[TABLE]

Thereby, the parameter $\epsilon$ can be set to a constant, e.g., $\epsilon=10\,\%$ in our experiments. Alternatively, the value could even be adapted online to explore different behavior (if the trace characteristics change) or to exploit more often (if the trace characteristics are stable).
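A minimal sketch of $\epsilon$-Greedy selection over a tabular $Q$ function is given below. The dictionary-based table, the default value of 0 for unseen state-action pairs, and the function name `epsilon_greedy` are assumptions of this sketch.

```python
import random

def epsilon_greedy(q, state, actions, epsilon=0.1, rng=random):
    """Return the greedy action with probability 1 - epsilon, else a random one.

    q: dict mapping (state, action) -> estimated long-term reward
    actions: list of actions available in this state
    """
    if rng.random() < epsilon:
        return rng.choice(actions)      # explore
    # exploit: unseen pairs are treated as having a Q-value of 0 (assumption)
    return max(actions, key=lambda a: q.get((state, a), 0.0))
```

Note that the random branch may also return the currently best action, so the probability of selecting $\mathbf{a}^{\mathrm{best}}$ is slightly higher than $1-\epsilon$ in this variant.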

6.4.2 Dedicated Learning Phase

As we will show in Section 7, the time an agent might need to achieve acceptable FNRs may be significant. Because the UE is operated in an LTE cell that potentially penalizes UEs that exhibit a disproportionately large rate of missed messages, we propose to start up with an initial dedicated learning phase. During this phase, the actions proposed by the agent are not forwarded to the DPM. Instead, existing reactive LTE DPM policies are used. However, the rewards are calculated as if the agent decision were used. For example, if scenario $z_{3}$ (No DLG $\land$ No ULG according to Table I) is predicted, we will not turn off the RF_RX and PHY_RX components immediately. Rather, if a grant should be received within the TTI, a negative reward (for theoretically missing the information) is issued.

During this phase, a moving average of the FNR is calculated. Only once this average has fallen to a minimal error threshold, in our case $\epsilon^{\mathrm{min\_err}}=40\,\%$, does the system start powering components down according to the actions suggested by the agent. Furthermore, we introduce a maximal error threshold $\epsilon^{\mathrm{err\_max}}=45\,\%$ that, upon being reached, triggers a new dedicated learning phase in order to account for changes and transients in trace characteristics. Of course, this introduces a certain amount of time during which no energy can be saved. This is offset by a guaranteed worst-case impact, because no grants will be missed during a dedicated learning phase.
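The two thresholds form a hysteresis that can be sketched as a small state machine. The class name `PhaseController`, the window length of the moving average, and the rule that no phase change occurs before the window is full are assumptions of this sketch. It also assumes, as a simplification, that the true grant presence is known for every TTI, which strictly holds only during the dedicated learning phase.

```python
from collections import deque

class PhaseController:
    """Switches between dedicated learning and exploitation using a
    moving-average FNR with two thresholds (hysteresis)."""

    def __init__(self, window=3000, err_min=0.40, err_max=0.45):
        self.outcomes = deque(maxlen=window)  # 1 = missed grant, 0 = caught
        self.err_min, self.err_max = err_min, err_max
        self.learning = True                  # start in the learning phase

    def update(self, grant_present, predicted_present):
        """Record one TTI; return True if agent actions may drive the DPM."""
        if grant_present:                     # FNR only counts actual grants
            self.outcomes.append(0 if predicted_present else 1)
        if len(self.outcomes) == self.outcomes.maxlen:
            fnr = sum(self.outcomes) / len(self.outcomes)
            if self.learning and fnr <= self.err_min:
                self.learning = False         # start exploiting
            elif not self.learning and fnr >= self.err_max:
                self.learning = True          # fall back to learning
        return not self.learning
```

The gap between the two thresholds prevents the system from oscillating between phases when the FNR hovers around a single value.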

6.4.3 LTE-Specific Reward Mapping Algorithm

After choosing an action, the agent maps the response of the environment to a reward $\mathbf{r}[n-1]$ . This is realized by a mapping function that checks reality $\mathbf{l}[n]$ against the last prediction $\mathbf{l}_{p}[n]$ for desirable and undesirable attributes.

Desirable are all actions that minimize energy consumption, i.e., that favor turning components off as early as possible. Undesirable are in descending order of severity: (i) turning components off too early (false negative error) and (ii) turning components off too late, or not at all (different degrees of false positive errors).

The proposed reward assignment is shown in Table III ordered by the prediction $\mathbf{l}_{p}[n]$ , i.e., the last action $\mathbf{a}[n-1]$ performed by the agent. For the cases $\mathbf{l}_{p}[n]\in\{(\mathbf{t},\mathbf{f}),(\mathbf{t},\mathbf{t}),(\mathbf{f},\mathbf{t})\}$ , the modem will at least receive and decode the control channel, which allows the evaluation of whether data has been lost or an opportunity to save energy was missed. The prediction $\mathbf{l}_{p}[n]=(\mathbf{f},\mathbf{f})$ is special, as it leads to the TTI information $\mathbf{l}[n]$ not being received, meaning no direct evaluation of prediction accuracy is possible.

However, to make sure a wrong prediction of $\mathbf{l}_{p}[n]=(\mathbf{f},\mathbf{f})$ is discouraged, we further introduce three additional mechanisms: (i) If the UE has received a DLG with $\mathrm{ndi}\equiv\mathbf{f}$, a negative reward $r_{\mathrm{ndi}}=-5$ is awarded. As explained in Section 3.1, this indicates that a prior DLG from the eNodeB was missed, which points to a wrong prediction of $(\mathbf{f},\mathbf{f})$ some time in the past. (ii) The agent may not predict $(\mathbf{f},\mathbf{f})$ more than $K=3$ times in a row. This ensures that the agent does not permanently turn off the modem (obtaining the reward $r_{\mathrm{off}}$) while evading the first mechanism (by making communication impossible). (iii) A negative reward $r_{\mathrm{bsr}}=-5$ is assigned if no ULG was received for a BSR (see Section 3.4) sent within the last $10$ TTIs. This situation may indicate that control information, namely the missing ULG, was lost some time in the past.

6.4.4 Learning

Based on the calculated reward $\mathbf{r}[n-1]$ , the $Q$ -values are updated.

For this, we propose to employ SARSA- $\lambda$ (see [28]).

[TABLE]

After having observed the immediate reward $\mathbf{r}[n-1]$ resulting from the last action $\mathbf{a}[n-1]$, SARSA- $\lambda$ updates the respective Q-values using this reward and the estimated long-term reward $Q(\mathbf{s}[n],\mathbf{a}[n])$ of the current state, to a degree determined by $\alpha$. We chose SARSA- $\lambda$ over other simple algorithms like Q-Learning, as SARSA- $\lambda$ generally penalizes actions leading to bad rewards more strongly.
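For illustration, a tabular SARSA($\lambda$) step in its common textbook form with accumulating eligibility traces can be sketched as follows. This is not necessarily the exact variant used in this work; the class name and the default values of $\alpha$, the discount factor $\gamma$, and $\lambda$ are hypothetical.

```python
from collections import defaultdict

class SarsaLambda:
    """Tabular SARSA(lambda) with accumulating eligibility traces."""

    def __init__(self, alpha=0.1, gamma=0.9, lam=0.8):
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.q = defaultdict(float)   # Q(s, a), initialized to 0 here
        self.e = defaultdict(float)   # eligibility traces e(s, a)

    def update(self, s, a, reward, s_next, a_next):
        """One step for the transition (s, a) -> (s_next, a_next)."""
        delta = reward + self.gamma * self.q[(s_next, a_next)] - self.q[(s, a)]
        self.e[(s, a)] += 1.0                     # mark the pair just visited
        for key in list(self.e):
            self.q[key] += self.alpha * delta * self.e[key]
            self.e[key] *= self.gamma * self.lam  # decay all traces
```

Because every recently visited state-action pair keeps a non-zero trace, a negative reward observed now also lowers the Q-values of the actions that led to it, which is the back-propagation of penalties described in Section 6.4.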

6.4.5 Initialization of Q-Values

The initialization of the $Q$ -values can quite significantly affect the initial prediction quality. As explained before, the reception of the control channel information allows for the most accurate reward assignment, because the predictor can deduce the perfect action. Considering that, a Q-value initialization favoring $\mathbf{l}_{p}[n+1]=(\mathbf{f},\mathbf{f})$ is highly discouraged. Additionally, as outlined in Section 6.1, false negative errors may have a significant impact on transmission quality and speed of the whole cell. Therefore, a setup minimizing the FNR seems appropriate. To realize this, we propose an initialization in the following order:

[TABLE]

The difference between these initial values must be small enough to allow for a fast adaptation to new experiences.

7 Evaluation

This section compares both presented DPM approaches regarding both functional and non-functional properties. First, in Section 7.1, we give a short introduction to our simulation-based evaluation framework. Next, Section 7.2 introduces the video streaming application used for the evaluation. For trace characterization, we introduce suitable metrics in Section 7.3. Subsequently, we perform a complexity analysis (Section 7.4) based on the number of floating point operations (FLOPs) used for training and prediction. Finally, in Section 7.5, an in-depth evaluation and comparison of both approaches in terms of accuracy, learning rates, and energy savings is presented based on the previously introduced concepts.

7.1 Simulation Framework

Designing a predictive DPM system poses certain constraints on the nature of the realization. Such prediction techniques must be of low computational complexity, and they must lead to net overall energy savings in order to be economically of interest. Additionally, in contrast to reactive power management systems, employing a prediction step inherently introduces a certain degree of uncertainty, as explained in Section 6.1.

For the evaluation of potential energy savings and scope of mispredictions of our proposed predictive DPM, we need an evaluation that reflects a UE in a real cell environment with a sophisticated energy modeling. For a realistic and parameterizable model of the LTE cell environment, we employ the ns-3 simulator [29]. In order to quantify the energy consumption (Section 5) of the hardware model (Section 4), a SystemC-based simulator [19] is used. Both simulators exchange relevant information on a subframe basis through a cosimulation interface.

7.2 Functional Application Model

As both predictive DPM techniques are designed to be trained in a live network cell, there are two main characteristics that need to be considered. The first characteristic is the length of stable trace behavior. If the grant patterns change too quickly, a predictor will always be stuck in the learning phase without ever reaching the exploitation phase, so that no energy can be saved. It will be shown that the required minimum length of this stable interval depends on the patterns in the trace themselves and the employed approach.

The second characteristic is the grant density, i.e., the proportion of TTIs that contain either an ULG, DLG, or both, which is explained in Section 7.3.

In the following experiments, we carefully investigate traces and prediction accuracy for highly different manifestations of each characteristic. Based on a parameterizable video streaming application, we create a variety of different scenarios with varying transmission length (seconds of short video playback or several minutes long videos) and varying resolutions corresponding to different data rates.

Our modeling realizes the proposed video stream algorithm from [30], which is given in Algorithm 1.

Here, a video transmission of a video of size $S$ bytes consists of a first burst phase (Initialization procedure), followed by smaller periodic transmissions during the Filling procedure. During the initialization procedure, in total $S_{B}=5\cdot R_{\text{encoding}}$ bytes, corresponding to approximately 40 s of encoded video length, are transmitted at maximum speed. During the filling procedure, a packet of size $P=64$ KB is transmitted every $t_{\text{steady}}=\frac{P}{R_{\text{sending}}}\cdot 10^{3}$ ms, with a sending rate of $R_{\text{sending}}=R_{\text{encoding}}\cdot 1.25$. For realistic values of the encoding rate, we use the values as reported in [30], with $R_{\text{encoding}}\in[200\frac{\text{KB}}{\text{s}},3320\frac{\text{KB}}{\text{s}}]$.
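To give a feeling for the resulting traffic periodicity, the packet interval of the filling procedure can be evaluated for the two ends of the encoding rate range. The function name `steady_interval_ms` is hypothetical; the formula is the one stated above.

```python
PACKET_KB = 64.0  # packet size P in KB

def steady_interval_ms(r_encoding_kb_s):
    """Interval between packets during the filling procedure, in ms."""
    r_sending = 1.25 * r_encoding_kb_s      # sending rate in KB/s
    return PACKET_KB / r_sending * 1e3
```

For $R_{\text{encoding}}=200\frac{\text{KB}}{\text{s}}$ this gives an interval of 256 ms, whereas for $3320\frac{\text{KB}}{\text{s}}$ a packet is sent roughly every 15.4 ms, so the lowest and highest rates differ by more than an order of magnitude in how often the filling procedure generates traffic.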

7.3 Trace Characterization

The evaluation of the prediction techniques is performed for traces of lengths up to $10^{6}$ TTIs in the following, varying in the grant density of both ULGs and DLGs. We define the grant densities $D_{\mathrm{DLG}}$ and $D_{\mathrm{ULG}}$ for a time series of length $K$ as the proportion of TTIs containing the respective kind of grant.

[TABLE]

These densities define upper bounds in terms of achievable energy savings of each of the predictive DPM approaches. To exemplify, consider a trace with a $D_{\mathrm{DLG}}=100\,\%$ , i.e., a DLG in every TTI. Obviously, not even an ideal, perfect predictor could reduce the energy consumption compared to a reactive DPM, because each channel of every TTI contains information and has to be decoded. For a trace with $D_{\mathrm{DLG}}=50\,\%$ , on the other hand, the proposed predictive DPM approaches could optimize the behavior for the remaining $50\,\%$ compared to a naive approach.
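Computing the densities from a trace is straightforward. The sketch below assumes a trace given as a list of boolean (ULG, DLG) tuples; this representation and the function name `grant_densities` are assumptions made for illustration.

```python
def grant_densities(trace):
    """Return (D_ULG, D_DLG) for a trace of (ULG, DLG) boolean tuples."""
    n = len(trace)
    if n == 0:
        raise ValueError("empty trace")
    d_ulg = sum(1 for ulg, _ in trace if ulg) / n
    d_dlg = sum(1 for _, dlg in trace if dlg) / n
    return d_ulg, d_dlg
```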

7.4 Complexity Analysis

As target platform for testing the presented prediction algorithms, we assume an LTE baseband DSP from literature, as discussed in [31, 32], that runs at a clock frequency of $f=300\,\text{MHz}$ with a power consumption of $P_{f}=1\,\text{mW/MHz}$ (including the power consumption of its memory). This DSP introduces another component (with a power state machine, see Fig. 12) to the architecture model introduced in Section 4.1. To estimate the inherent energy overhead $E_{Q}$ of the presented algorithms per TTI, we propose to employ a FLOP-based approach. Here, we translate one iteration of the presented algorithms (one iteration is performed per TTI in each case) first to the number of required FLOPs $c_{Q}$ (according to Table IV, from [33]) and second to energy consumption. We assume that both the Q-table and the operations and weights describing the neural net fit into the on-chip memory of the DSP.

Based on the assumption of one FLOP per DSP cycle, $E_{Q}$ per TTI can be calculated according to:

[TABLE]

If we assume that the predictor is run for the duration of the whole TTI, i.e., $1$ ms, this yields the power consumption $P_{Q}=\frac{E_{Q}}{1\,\textrm{ms}}$ as modeled by a power state machine for the predictor (see Fig. 12).

For the supervised predictor, we distinguish between two power states, corresponding to (i) the learning phase and (ii) the exploitation phase, as both phases differ in algorithm and thus number of computations. For the FFNN introduced in Section 6.3, we obtain $7,018$ FLOPs per TTI for learning, and $2,014$ FLOPs per TTI during the exploitation phase. Because we perform a prediction alongside each learning step to evaluate the prediction accuracy, one complete iteration during learning requires a total of $7,018+2,014=9,032$ FLOPs. With the introduced DSP model, this translates to an energy consumption per TTI of approximately $E_{Q}^{L}=9\,\mu\text{J}$ for learning and $E_{Q}^{P}=2\,\mu\text{J}$ for exploitation, respectively. Thus, the two power states in Fig. 12 correspond to a power consumption of $P_{Q}^{L}=9\,\text{mW}$ and $P_{Q}^{P}=2\,\text{mW}$, respectively.

The reinforcement predictor always updates its Q-values according to Eq. 17 and therefore always remains in the same power state. One step of Q-learning requires $19$ FLOPs, while one step of sarsa-$\lambda$ requires $25$ FLOPs. The reward mapping function adds $12$ comparisons, which we count as $12$ FLOPs. In total, sarsa-$\lambda$ requires $25+12=37$ FLOPs per TTI, which translates to an energy consumption of $E_{Q}^{S}=0.037\,\mu\text{J}$ and thus to a single power state with a power consumption of $P_{Q}^{S}=0.037\,\text{mW}$.
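The overhead figures above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming only the DSP model stated in the text (300 MHz, 1 mW/MHz, one FLOP per cycle, 1 ms per TTI); the function and constant names are hypothetical and not taken from the paper.

```python
# Assumed DSP model taken from the text: 300 MHz clock, 1 mW/MHz,
# one FLOP per cycle, and a TTI length of 1 ms.
F_HZ = 300e6
P_ACTIVE_W = 1e-3 * 300   # 1 mW/MHz * 300 MHz = 0.3 W
TTI_S = 1e-3


def energy_per_tti_uj(flops):
    """Energy (in microjoules) spent on `flops` operations within one TTI."""
    busy_time_s = flops / F_HZ           # one FLOP per DSP cycle
    return P_ACTIVE_W * busy_time_s * 1e6


def average_power_mw(flops):
    """Energy per TTI averaged over the whole 1 ms TTI, in milliwatts."""
    return energy_per_tti_uj(flops) * 1e-6 / TTI_S * 1e3


# FLOP counts per TTI as reported in the text.
FLOPS = {
    "supervised, learning": 7018 + 2014,
    "supervised, exploitation": 2014,
    "sarsa-lambda": 25 + 12,
}

for name, flops in FLOPS.items():
    print(f"{name}: {energy_per_tti_uj(flops):.3f} uJ, "
          f"{average_power_mw(flops):.3f} mW")
```

Under these assumptions each FLOP costs 1 nJ, so the energy in microjoules per TTI and the average power in milliwatts are numerically equal, which matches the rounded values reported for both predictors.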

7.5 Results

This section discusses experimental results that were obtained by both prediction approaches in the presented simulation environment. In Section 7.5.1, we investigate learning time and prediction accuracy in terms of FNR of both approaches for different scenarios. Section 7.5.2 then investigates how this affects the modem energy consumption.

We present the results for three representative traces $\mathbf{l}_{i}\in\{\mathbf{l}_{\text{min}},\mathbf{l}_{\text{avg}},\mathbf{l}_{\text{max}}\}$ defined by their respective encoding rate $R_{\text{encoding}}^{\mathbf{l}_{i}}$ , covering a large spectrum of different grant densities $D_{G}$ :

(i) $R_{\text{encoding}}^{\mathbf{l}_{\text{min}}}=200\,\frac{\text{KB}}{\text{s}}$, with $D_{\text{DLG}}=0.299507$

(ii) $R_{\text{encoding}}^{\mathbf{l}_{\text{avg}}}=3320\,\frac{\text{KB}}{\text{s}}$, with $D_{\text{DLG}}=0.652489$

(iii) $R_{\text{encoding}}^{\mathbf{l}_{\text{max}}}=\dfrac{3320+200}{2}\,\frac{\text{KB}}{\text{s}}=1760\,\frac{\text{KB}}{\text{s}}$, with $D_{\text{DLG}}=0.827384$

7.5.1 Prediction Accuracy

First, we evaluate the prediction accuracy in terms of FNR, giving an indication of the maximum negative impact on the whole cell.

Reflecting the time-series nature of our problem, we evaluate both the absolute FNR values and their shift over time. Fig. 13 shows the FNR calculated as a moving average over intervals of $3{,}000$ TTIs. To first showcase the prediction accuracy, this evaluation is performed while (a) disregarding changed trace characteristics due to missed transmissions and (b) rescheduling missed traffic.
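To illustrate the metric, the following minimal sketch computes an FNR over a sliding window of TTIs. It assumes the standard definition of the FNR (missed grants divided by all grants actually present in the window) and a sliding rather than block-wise window; both choices, as well as all names, are assumptions and are not taken from the paper.

```python
from collections import deque


def moving_fnr(predicted, actual, window=3000):
    """Return the FNR over the most recent `window` TTIs, one value per TTI.

    predicted[i] / actual[i] are truthy if a grant is predicted / present
    in TTI i. A false negative is a grant that was present but not predicted.
    """
    recent = deque()          # (is_false_negative, is_grant) per TTI in window
    false_negatives = 0
    grants = 0
    fnr = []
    for p, a in zip(predicted, actual):
        p, a = bool(p), bool(a)
        is_fn = a and not p
        recent.append((is_fn, a))
        false_negatives += is_fn
        grants += a
        if len(recent) > window:
            old_fn, old_grant = recent.popleft()
            false_negatives -= old_fn
            grants -= old_grant
        # With no grant in the window, no grant can be missed: report 0.
        fnr.append(false_negatives / grants if grants else 0.0)
    return fnr
```

Keeping running counts makes each update constant-time, so a window of 3,000 TTIs costs no more per TTI than a window of 10.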

For the first $5{,}000$ TTIs, the average FNR is high, which supports our argument for a dedicated learning phase. Afterwards, for stable traces (Fig. 13(a)), both prediction approaches quickly converge to desirably low FNRs. Here, the supervised approach, which is cognizant of all prediction errors, outperforms the reinforcement predictor, especially for $\mathbf{l}_{\text{min}}$ and $\mathbf{l}_{\text{max}}$, achieving FNRs below $1\%$. Because of the $\epsilon$-greedy strategy it uses and because its prediction is based only on grant presence, the reinforcement predictor exhibits a higher FNR. Still, both approaches achieve stable error rates below $15\%$.

For traces that are subject to changes due to missed traffic (Fig. 13(b)), the first characteristic we observe is the added difficulty of prediction: the FNRs of both predictors are significantly higher for all scenarios (but still lower than $\epsilon^{\text{min\_err}}$). While the initial convergence remains fast, re-transmissions occur upon exiting the dedicated learning phase, especially for $\mathbf{l}_{\text{max}}$, which has the highest density $D_{\text{DLG}}$. This peak in errors occurs for both approaches and all scenarios, but the supervised predictor oscillates far more severely because it has no mechanism to cope with the information contained in re-transmitted grants. The reinforcement predictor, in contrast, quickly incorporates the information from re-transmissions into subsequent predictions, as outlined in Section 6.4.3, showcasing its online learning capabilities.

Despite their differences, both learning approaches achieve the desired task of learning grant patterns. For all shown traces, the exploitation phase is reached, allowing energy to be saved by following the predictions, as shown in Section 7.5.2.

7.5.2 Energy Consumption

Finally, we evaluate the effectiveness of both approaches regarding the stated goal of achieving a reduction of the overall modem energy consumption. Fig. 14 shows the energy consumption for the (i) reinforcement and the (ii) supervised predictor, normalized to the (iii) state-of-the-art reactive DPM for the same traces as Fig. 13.

The most apparent difference, as explained in Section 7.3, is the impact of $D_{\text{DLG}}$ on the savable energy. Second, we observe the added energy consumption $E_{Q}$ from Section 5.4, which leads to higher energy consumption both during the learning phase and during traffic-intense intervals, where the modem cannot be turned off without missing a grant. In the worst case, a failure to learn the grant patterns, this overhead would not be mitigated; as shown in Section 7.5.1, this did not occur in the simulations we performed. After finishing the learning phase, both approaches quickly compensate for this energy overhead. Indeed, both predictors enter the exploitation phase for all scenarios and achieve energy savings compared to the naive approach within $6$ to $11$ s.
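The point at which a predictor has compensated for its overhead can be read as a break-even time on cumulative energy. The sketch below is a hypothetical illustration of that reading, not the evaluation procedure of the paper: it assumes per-TTI energy values for the predictive and the reactive DPM are available as two equally long sequences, and all names are invented for the example.

```python
from itertools import accumulate


def break_even_tti(predictive_uj, reactive_uj):
    """Return the index of the first TTI at which the cumulative energy of
    the predictive DPM drops below that of the reactive DPM, or None if
    this never happens within the trace.
    """
    cumulative_pairs = zip(accumulate(predictive_uj), accumulate(reactive_uj))
    for tti, (e_pred, e_react) in enumerate(cumulative_pairs):
        if e_pred < e_react:
            return tti
    return None


# Toy numbers (arbitrary units): two TTIs of learning with overhead,
# followed by TTIs in which following the predictions saves energy.
reactive = [10, 10, 10, 10, 10, 10]
predictive = [12, 12, 7, 7, 7, 7]
print(break_even_tti(predictive, reactive))
```

With a TTI length of 1 ms, a returned index $k$ corresponds to a break-even time of $(k+1)$ ms; a return value of `None` corresponds to the worst case described above, in which the overhead is never recovered.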

In terms of achievable energy savings, the reinforcement predictor is generally superior. Only for $\mathbf{l}_{\text{min}}$ does the supervised predictor achieve the same energy saving of 17%. This is notable given its higher inherent energy consumption $E_{Q}$ and indicates that, for this trace, it exploits the opportunities for energy saving more effectively. For the other two traces, the reinforcement predictor achieves 11% ($\mathbf{l}_{\text{avg}}$) and 7% ($\mathbf{l}_{\text{max}}$). In comparison, the supervised predictor exhibits ceilings of 6% ($\mathbf{l}_{\text{avg}}$) and 2% ($\mathbf{l}_{\text{max}}$).

Finally, one can observe the impact of trace characteristics, like downlink grant densities, for static traces. In the long run, the trace that is best in terms of achievable FNR ( $\mathbf{l}_{\text{max}}$ ), with $D_{\mathrm{DLG}}=0.82$ , turns out to be the worst in terms of achievable energy savings.

8 Conclusion

This paper presents an approach for predictive dynamic power management for mobile devices in LTE through grant prediction. Two different approaches based on supervised learning and reinforcement learning that are optimized for accurate offline and efficient online learning, respectively, are investigated. Moreover, we use a consistent complexity analysis to derive a comparable power model for both approaches. Thus, using an identical environmental stimulus derived from a relevant simulated application model we perform a fair evaluation of both approaches.

Our results show that both approaches need an interval of stable trace behavior to learn the grant patterns. We believe the supervised predictor is the preferable solution if either the traffic density is low enough or more training on false negatives can be performed at early, offline design stages, as lower FNRs are then achievable. If trace stability cannot be guaranteed and offline learning cannot be improved, we argue for the reinforcement predictor due to its lower energy overhead and its observed online learning capabilities. For longer stable scenarios, both approaches may achieve up to 17% energy savings, with the reinforcement predictor generally providing higher savings.

Common to both approaches, the prediction-based DPM is only activated once an acceptable error rate has been achieved in the learning phase. In general, the supervised predictor exhibits a lower prediction error if the training can be performed on representative data, leading to fewer missed transmissions at the cost of a higher inherent energy consumption. In contrast, the reinforcement predictor achieves higher energy savings at the cost of a slower learning speed. For real-world applications, one has to find a balance between energy savings (favoring the reinforcement approach) and the least impact on service quality (favoring supervised learning). This might best be achieved through a combination of both approaches, where the supervised predictor is trained extensively offline, while the reinforcement predictor is employed for previously unseen trace scenarios.

Bibliography

[1] B. Wang, Y. Xu, R. Hasholzner, R. Rosales, M. Glaß, and J. Teich, “End-to-end power estimation for heterogeneous cellular LTE SoCs in early design phases,” in Power and Timing Modeling, Optimization and Simulation (PATMOS), 2014 24th International Workshop on. IEEE, 2014, pp. 1–8.
[2] L. Benini, A. Bogliolo, and G. De Micheli, “A survey of design techniques for system-level dynamic power management,” IEEE Transactions on VLSI Systems, vol. 8, no. 3, pp. 299–316, 2000.
[3] S. Sesia, M. Baker, and I. Toufik, LTE - The UMTS Long Term Evolution: From Theory to Practice. John Wiley & Sons, 2011.
[4] M. B. Srivastava, A. P. Chandrakasan, and R. W. Brodersen, “Predictive system shutdown and other architectural techniques for energy efficient programmable computation,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 4, no. 1, pp. 42–55, 1996.
[5] E.-Y. Chung, L. Benini, and G. De Micheli, “Dynamic power management using adaptive learning tree,” in Proc. of the 1999 IEEE/ACM International Conference on Computer-aided Design. IEEE Press, 1999, pp. 274–279.
[6] S. Mannor, B. Kveton, S. Siddiqi, and C. Yu, “Machine learning for adaptive power management,” Autonomic Computing, vol. 10, no. 4, pp. 299–312, 2006.
[7] Y. Wang, Q. Xie, A. Ammari, and M. Pedram, “Deriving a near-optimal power management policy using model-free reinforcement learning and Bayesian classification,” in Proc. of the 48th Design Automation Conference, ser. DAC ’11. New York, NY, USA: ACM, 2011, pp. 41–46.
[8] G. Dhiman and T. S. Rosing, “Dynamic power management using machine learning,” in Proc. of the 2006 IEEE/ACM International Conference on Computer-aided Design, ser. ICCAD ’06. New York, NY, USA: ACM, 2006, pp. 747–754.