Reinforcement Learning to Minimize Age of Information with an Energy Harvesting Sensor with HARQ and Sensing Cost

Elif Tuğçe Ceran, Deniz Gündüz, and András György

arXiv:1902.09467·eess.SP·February 26, 2019

TL;DR

This paper investigates optimal scheduling policies for energy-harvesting sensors to minimize the age of information, using reinforcement learning in unknown environments and analyzing feedback mechanisms.

Contribution

It introduces a reinforcement learning approach for AoI minimization in energy-harvesting sensors with unknown parameters, extending prior work with real-time learning capabilities.

Findings

1. Reinforcement learning effectively minimizes AoI in unknown environments.

2. Optimal policies depend on feedback mechanisms and system parameters.

3. Numerical results validate the proposed methods' effectiveness.

Abstract

The time average expected age of information (AoI) is studied for status updates sent from an energy-harvesting transmitter with a finite-capacity battery. The optimal scheduling policy is first studied under different feedback mechanisms when the channel and energy harvesting statistics are known. For the case of unknown environments, an average-cost reinforcement learning algorithm is proposed that learns the system parameters and the status update policy in real time. The effectiveness of the proposed methods is verified through numerical results.

Equations

$$B_{t+1}=\min\big(B_{t}+E_{t}-(E^{s}+E^{tx})\mathbbm{1}[A_{t}=\mathrm{n}]-E^{tx}\mathbbm{1}[A_{t}=\mathrm{x}],\,B_{max}\big), \tag{1}$$

$$(E^{s}+E^{tx})\mathbbm{1}[A_{t}=\mathrm{n}]+E^{tx}\mathbbm{1}[A_{t}=\mathrm{x}]\leq B_{t}, \tag{2}$$

$$\Delta^{tx}_{t+1}=\begin{cases}1&\text{if }A_{t}=\mathrm{n};\\ \min(\Delta^{tx}_{t}+1,\Delta_{max})&\text{otherwise.}\end{cases}$$

$$\Delta^{rx}_{t+1}=\begin{cases}\min(\Delta^{rx}_{t}+1,\Delta_{max})&\text{if }A_{t}=\mathrm{i}\text{ or }K_{t}=0;\\ 1&\text{if }A_{t}=\mathrm{n}\text{ and }K_{t}=1;\\ \min(\Delta^{tx}_{t}+1,\Delta_{max})&\text{if }A_{t}=\mathrm{x}\text{ and }K_{t}=1.\end{cases}$$

$$R_{t+1}=\begin{cases}0&\text{if }K_{t}=1;\\ 1&\text{if }A_{t}=\mathrm{n}\text{ and }K_{t}=0;\\ R_{t}&\text{if }A_{t}=\mathrm{i};\\ \min(R_{t}+1,R_{max})&\text{if }A_{t}=\mathrm{x}\text{ and }K_{t}=0.\end{cases}$$

$$J^{*}\triangleq\min_{\pi}\lim_{T\to\infty}\frac{1}{T+1}\mathbb{E}\left[\sum_{t=0}^{T}\Delta^{rx}_{t}\right]\quad\text{subject to (1) and (2).} \tag{3}$$

$$h(s)+J^{*}=\min_{a\in\{\mathrm{i},\mathrm{n},\mathrm{x}\}}\Big(\delta^{rx}+\mathbb{E}\big[h(s^{\prime})\mid a\big]\Big), \tag{4}$$

$$h(s)=\mathbb{E}\left[\sum_{t=0}^{\infty}(\Delta^{rx}_{t}-J^{*})\,\Big|\,S_{0}=s\right]$$

$$Q((e,b,\delta^{rx},\delta^{tx},r),a)\triangleq\delta^{rx}+\mathbb{E}\big[h(e^{\prime},b^{\prime},{\delta^{rx}}^{\prime},{\delta^{tx}}^{\prime},r^{\prime})\mid a\big]. \tag{5}$$

$$\pi^{*}(e,b,\delta^{rx},\delta^{tx},r)\in\operatorname*{arg\,min}_{a\in\{\mathrm{i},\mathrm{n},\mathrm{x}\}}Q((e,b,\delta^{rx},\delta^{tx},r),a). \tag{6}$$

$$Q_{n+1}(s,a)\leftarrow\delta^{rx}+\mathbb{E}\big[h_{n}(s^{\prime})\mid a\big], \tag{7}$$

$$V_{n+1}(s)\leftarrow\min_{a}Q_{n+1}(s,a), \tag{8}$$

$$h_{n+1}(s)\leftarrow V_{n+1}(s)-V_{n+1}(s^{ref}), \tag{9}$$

$$Q_{n+1}(S_{n},A_{n})\leftarrow Q_{n}(S_{n},A_{n})+\alpha(m(S_{n},A_{n},n))\big[\Delta^{rx}_{n}-J_{n}+Q_{n}(S_{n+1},A_{n+1})-Q_{n}(S_{n},A_{n})\big], \tag{10}$$

$$J_{n+1}\leftarrow J_{n}+\beta(n)\left[\frac{nJ_{n}+\Delta^{rx}_{n}}{n+1}-J_{n}\right] \tag{11}$$

$$\pi(a\mid S_{n})=\frac{\exp(-Q(S_{n},a)/\tau_{n})}{\sum_{a^{\prime}\in\mathcal{A}}\exp(-Q(S_{n},a^{\prime})/\tau_{n})}. \tag{12}$$

$$A_{t}=\begin{cases}\mathrm{i}&\text{if }\Delta^{rx}_{t}<\mathcal{T}(e,b,\delta^{tx},r),\\ \mathrm{n}&\text{if }\Delta^{rx}_{t}\geq\mathcal{T}(e,b,\delta^{tx},r)\text{ and }r=0,\\ \mathrm{x}&\text{if }\Delta^{rx}_{t}\geq\mathcal{T}(e,b,\delta^{tx},r)\text{ and }r\neq 0,\end{cases} \tag{13}$$

$$\pi_{\theta}(e,b,\delta^{rx},\delta^{tx},r)\triangleq\frac{1}{1+e^{-\frac{\delta^{rx}-\theta(e,b,\delta^{tx},r)}{\tau}}}. \tag{14}$$

$$\overline{\theta}_{n+1}=\overline{\theta}_{n}-\gamma(n)\,\partial J/\partial\overline{\theta}_{n}, \tag{15}$$

$$\partial J/\partial\overline{\theta}_{n}\approx(D_{n}^{\intercal}D_{n})^{-1}D_{n}^{\intercal}\frac{(\widehat{J}^{+}-\widehat{J}^{-})}{2\sigma}. \tag{16}$$

Full text

Elif Tuğçe Ceran, Deniz Gündüz, and András György

Department of Electrical and Electronic Engineering, Imperial College London

Email: {e.ceran14, d.gunduz, a.gyorgy}@imperial.ac.uk

I Introduction

There has been a growing interest in minimizing the age of information (AoI) of energy harvesting (EH) communication systems [1, 2, 3, 4, 5, 6, 7, 8, 9]. The AoI quantifies the staleness of the information at the receiver, and is defined as the time elapsed since the generation time of the most recent status update packet successfully received at the receiver.

Prior works have investigated online [1, 3, 7] and offline [1, 5] methods for different scenarios in order to optimize the timeliness of information under the energy causality constraints in EH systems. It is shown in [3, 7, 9] that the optimal policy is of a threshold type for a finite-size battery when the cost of sensing (monitoring) the status of a process is not considered or is assumed to be zero. Until recently, the AoI literature assumed that this sensing cost is negligible compared to the cost of transmitting the status update. However, in most practical sensing systems, acquiring a new sample of the underlying process of interest also has a considerable energy cost. The sampling/sensing cost has been taken into account in [10], where a status update system with ARQ and an unlimited energy source is considered. Closed-form expressions are presented there for the energy consumption and AoI, assuming that a packet is retransmitted until either it is received or a prescribed maximum number of transmissions is reached.

In this paper, similarly to [10], we study a status update system considering both the sensing and transmission energy costs. We consider an EH transmitter, which uses the energy harvested from the environment to power the sensing and communication operations. Moreover, we consider a hybrid automatic repeat request (HARQ) protocol, where the partial information obtained from previous unsuccessful transmission attempts is combined to increase the decoding probability.

In our previous work, we studied status-update systems with HARQ under a transmission-rate constraint [11, 12, 13]. Here we consider the intermittent availability of energy and find the online status updating policy that minimizes the long-term average AoI at the receiver, subject to the energy causality constraints at the transmitter. However, in many practical scenarios, statistical information about the energy arrival process or the channel conditions is not available, or may change over time [14]. Previous work on EH communication systems without a-priori information on the random processes governing the system exploited reinforcement learning (RL) methods in order to maximize throughput or minimize delay [15, 16].

To adapt the status-update scheme to the unknown energy arrival process and channel statistics, we propose a learning-theoretic approach using RL algorithms. In particular, we consider a value-based RL algorithm, GR-learning [17], and a policy-based RL algorithm, finite-difference policy gradient [18], and compare their performance with that of the relative value iteration (RVI) algorithm, which assumes a-priori knowledge of the system characteristics. We propose a suboptimal threshold policy and demonstrate that a policy gradient algorithm exploiting the structural characteristics of a threshold policy outperforms the GR-learning algorithm. We investigate the effects of the EH process on the average AoI, and we show by simulations that temporal correlations in EH increase the average AoI significantly. We compare the average AoI with EH with the average AoI under an average transmission constraint [11], and demonstrate that the performance of the EH transmitter approaches that of a transmitter with an average transmission constraint when the battery has unlimited capacity and the sampling/sensing cost is zero.

II System Model

We consider a time-slotted status update system over an error-prone wireless communication link (see Figure 1). The transmitter (TX) can sense the underlying time-varying process and generate a status update at each time slot at a certain energy cost. Status updates are communicated to the receiver (RX) over a time-varying wireless channel. Each transmission attempt of a status update takes constant time, which is assumed to be equal to the duration of one time slot.

The AoI measures the timeliness of the status information at the receiver, and is defined at any time slot $t$ as the number of time slots elapsed since the generation time $U(t)$ of the most up-to-date packet successfully decoded at the receiver. Formally, the AoI at the receiver at time $t$ is defined as $\Delta^{rx}_{t}\triangleq\min(t-U(t),\Delta_{max})$ , where a maximum value $\Delta_{max}$ on the AoI is imposed to limit the impact of the AoI on the performance after some level of staleness is reached.

We assume that the channel changes randomly from one time slot to the next in an independent and identically distributed (i.i.d.) fashion, and the instantaneous channel state information is available only at the receiver. We further assume the availability of an error- and delay-free single-bit feedback from the receiver to the transmitter for each transmission attempt. Successful reception of the status update at the end of time slot $t$ is acknowledged by an ACK signal (denoted by $K_{t}=1$ ), while a NACK signal is sent in case of a failure (denoted by $K_{t}=0$ ).

There are three possible actions $A_{t}$ the transmitter can take at each time slot $t$ : it can either sample and transmit a new status update ( $A_{t}=\mathrm{n}$ ), remain idle ( $A_{t}=\mathrm{i}$ ) or retransmit the last transmitted status update ( $A_{t}=\mathrm{x}$ ). If an ACK is received at the transmitter, we can restrict the action space to $\{\mathrm{i},\mathrm{n}\}$ as retransmitting an already decoded status update is strictly suboptimal.

We consider the HARQ protocol: that is, the received signals from previous transmission attempts for the same packet are combined for decoding. The probability of error using $r$ retransmissions, denoted by $g(r)<1$ , depends on $r$ and the particular HARQ scheme used for combining multiple transmission attempts (an empirical method to estimate $g(r)$ is presented in [19]). As in any reasonable HARQ strategy, we assume that $g(r)$ is non-increasing in the number of retransmissions $r$ ; that is, $g(r_{1})\geq g(r_{2})$ for all $r_{1}\leq r_{2}$ . Standard HARQ methods only combine information from a finite maximum number of retransmissions [20]. Accordingly, we consider a truncated retransmission count of a status update, denoted by $R_{t}$ for the status update transmitted at time $t$ , where $R_{t}\in\{0,\ldots,R_{max}\}$ ; that is, the receiver can combine information from the last $R_{max}$ retransmissions at most. We also assume that $R_{0}=0$ so that there is no previously transmitted packet at the transmitter at time $t=0$ .

At the end of each time slot $t$ , a random amount of energy is harvested and stored in a rechargeable battery at the transmitter, denoted by $E_{t}\in\mathcal{E}\triangleq\{0,1,\ldots,E_{max}\}$ , following a first-order discrete-time Markov model, characterized by stationary probabilities $p_{E}(e_{1}|e_{2})$ , defined as $p_{E}(e_{1}|e_{2})\triangleq Pr(E_{t+1}=e_{2}|E_{t}=e_{1}),\ \forall t$ . It is also assumed that $p_{E}(0|e)>0$ , $\forall e\in\mathcal{E}$ . The battery has a limited capacity of $B_{max}$ energy units, and the energy harvested when the battery is full is lost. The energy consumption for status sensing is denoted by $E^{s}\in\mathbb{Z}^{+}$ , while the energy consumption for a transmission attempt is denoted by $E^{tx}\in\mathbb{Z}^{+}$ .

The battery state at time $t$ , denoted by $B_{t}$ , and the energy causality constraints can be written as follows:

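$$B_{t+1}=\min\big(B_{t}+E_{t}-(E^{s}+E^{tx})\mathbbm{1}[A_{t}=\mathrm{n}]-E^{tx}\mathbbm{1}[A_{t}=\mathrm{x}],\,B_{max}\big), \tag{1}$$

$$(E^{s}+E^{tx})\mathbbm{1}[A_{t}=\mathrm{n}]+E^{tx}\mathbbm{1}[A_{t}=\mathrm{x}]\leq B_{t}, \tag{2}$$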

where the indicator function $\mathbbm{1}[C]$ is equal to $1$ if event $C$ holds, and zero otherwise. Eqn. (1) implies that the battery overflows if energy is harvested when the battery is full, while Eqn. (2) imposes that the energy consumed by sensing or transmission operations at time slot $t$ is limited by the energy $B_{t}$ available in the battery at the beginning of that time slot.

The age $\Delta^{tx}_{t}$ of the most recently generated status update at the transmitter at the beginning of time slot $t$ resets to $1$ if a new status update is generated at time slot $t-1$ , and increases up to $\Delta_{max}$ otherwise, i.e.,

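$$\Delta^{tx}_{t+1}=\begin{cases}1&\text{if }A_{t}=\mathrm{n};\\ \min(\Delta^{tx}_{t}+1,\Delta_{max})&\text{otherwise.}\end{cases}$$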

The AoI of the most recent successfully decoded packet at the receiver at time $t$ , $\Delta^{rx}_{t}$ , evolves as follows:

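$$\Delta^{rx}_{t+1}=\begin{cases}\min(\Delta^{rx}_{t}+1,\Delta_{max})&\text{if }A_{t}=\mathrm{i}\text{ or }K_{t}=0;\\ 1&\text{if }A_{t}=\mathrm{n}\text{ and }K_{t}=1;\\ \min(\Delta^{tx}_{t}+1,\Delta_{max})&\text{if }A_{t}=\mathrm{x}\text{ and }K_{t}=1.\end{cases}$$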

We note that $\Delta^{tx}_{t}$ refers to the number of time slots elapsed since the generation of the most recently sensed status update at the transmitter side, while $\Delta^{rx}_{t}$ denotes the AoI of the most recently received status update at the receiver side. The system model also implies that whenever a new status update packet is generated, the previous packet at the transmitter is dropped and cannot be retransmitted. The number of retransmissions is zero for a newly sensed and generated status update and increases up to $R_{max}$ as we keep retransmitting the same packet:

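$$R_{t+1}=\begin{cases}0&\text{if }K_{t}=1;\\ 1&\text{if }A_{t}=\mathrm{n}\text{ and }K_{t}=0;\\ R_{t}&\text{if }A_{t}=\mathrm{i};\\ \min(R_{t}+1,R_{max})&\text{if }A_{t}=\mathrm{x}\text{ and }K_{t}=0.\end{cases}$$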
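The battery, age, and retransmission-count recursions of this section can be sketched as a single state-transition function. This is only an illustrative sketch: the function name `step`, the tuple layout, and the default `B_max=5` are assumptions made here, while `E_s`, `E_tx`, `D_max` and `R_max` default to the values used later in the simulations. An idle slot is treated as producing no ACK.

```python
def step(state, action, ack, harvested, E_s=1, E_tx=1, B_max=5, D_max=40, R_max=3):
    """Advance (b, d_rx, d_tx, r) by one time slot.

    action is "i" (idle), "n" (sense and transmit) or "x" (retransmit);
    ack is the feedback K_t (1 = ACK, 0 = NACK); harvested is E_t.
    B_max=5 is a hypothetical default, not a value taken from the paper.
    """
    b, d_rx, d_tx, r = state
    cost = {"i": 0, "n": E_s + E_tx, "x": E_tx}[action]
    if cost > b:
        raise ValueError("action violates the energy causality constraint")
    if action == "i":
        ack = 0  # nothing was sent, so nothing can be acknowledged

    # Battery: spend, harvest, and clip at capacity (overflow is lost).
    b_next = min(b + harvested - cost, B_max)

    # Age of the freshest sample held at the transmitter.
    d_tx_next = 1 if action == "n" else min(d_tx + 1, D_max)

    # Age of information at the receiver.
    if ack == 0:
        d_rx_next = min(d_rx + 1, D_max)
    elif action == "n":
        d_rx_next = 1
    else:  # successful retransmission delivers the older sample
        d_rx_next = min(d_tx + 1, D_max)

    # Truncated retransmission count.
    if ack == 1:
        r_next = 0
    elif action == "n":
        r_next = 1
    elif action == "i":
        r_next = r
    else:
        r_next = min(r + 1, R_max)

    return (b_next, d_rx_next, d_tx_next, r_next)
```

For example, a successful retransmission from state `(1, 5, 3, 1)` with no harvested energy leaves the receiver with age 4, since the delivered sample was already 3 slots old.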

The state of the system is formed by five components $S_{t}=(E_{t},B_{t},\Delta^{rx}_{t},\Delta^{tx}_{t},R_{t})$ . At each time slot, the transmitter knows the state of the system and the goal is to find a policy $\pi$ which minimizes the expected average AoI at the receiver over an infinite time horizon, which is given by:

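$$J^{*}\triangleq\min_{\pi}\lim_{T\to\infty}\frac{1}{T+1}\mathbb{E}\left[\sum_{t=0}^{T}\Delta^{rx}_{t}\right] \tag{3}$$

subject to (1) and (2).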

III Markov Decision Process (MDP) and RVI

An average-cost finite-state MDP provides the necessary framework for modeling and solving the AoI minimization problem in (3). An MDP is defined by the quadruple $(\mathcal{S},\mathcal{A},P,c)$ [21]: The finite set of states $(E_{t},B_{t},\Delta^{rx}_{t},\Delta^{tx}_{t},R_{t})$ is $\mathcal{S}=\mathcal{E}\times\{0,\ldots,B_{max}\}\times\{1,\ldots,\Delta_{max}\}^{2}\times\{0,\ldots,R_{max}\}$ , and the finite set of actions $\mathcal{A}=\{\mathrm{i},\mathrm{n},\mathrm{x}\}$ is already defined. $P$ refers to the transition probabilities, where $P(s^{\prime}|s,a)=\Pr(S_{t+1}=s^{\prime}\mid S_{t}=s,A_{t}=a)$ is the probability that action $a$ in state $s$ at time $t$ will lead to state $s^{\prime}$ at time $t+1$ , which is characterized by the EH statistics and channel error probabilities. The cost function $c:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{Z}$ is the AoI at the receiver, defined as $c(s,a)=\delta^{rx}$ for any $s=(e,b,\delta^{rx},\delta^{tx},r)\in\mathcal{S}$ and $a\in\mathcal{A}$ , independent of the action $a$ .

We note that there exists an optimal stationary deterministic policy, $\pi:\mathcal{S}\rightarrow\mathcal{A}$ , for this problem [21]. (Footnote 1: For the Markov chain corresponding to every stationary policy, there is only one recurrent class, since the state $(0,0,\Delta_{max},\Delta_{max},0)$ is reachable from all other states, e.g., when every transmission is successful but no energy is harvested for a period of $\max(\Delta_{max},B_{max})$ time slots; the existence of such a policy then follows from Theorem 8.4.3 of [21].) In particular, there exists a function $h(s)$ , called the differential cost function, for all $s=(e,b,\delta^{rx},\delta^{tx},r)\in\mathcal{S}$ , satisfying the following Bellman optimality equations for the average-cost finite-state finite-action MDP [21]:

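$$h(s)+J^{*}=\min_{a\in\{\mathrm{i},\mathrm{n},\mathrm{x}\}}\Big(\delta^{rx}+\mathbb{E}\big[h(s^{\prime})\mid a\big]\Big), \tag{4}$$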

where $s^{\prime}\triangleq(e^{\prime},b^{\prime},{\delta^{rx}}^{\prime},{\delta^{tx}}^{\prime},r^{\prime})$ is the next state obtained from $(e,b,\delta^{rx},\delta^{tx},r)$ after taking action $a$ , and $J^{*}$ represents the optimal achievable average AoI under policy $\pi^{*}$ . Note that the function $h$ satisfying (4) is unique up to an additive constant, and, by selecting this constant properly, it also satisfies

$$h(s)=\mathbb{E}\left[\sum_{t=0}^{\infty}(\Delta^{rx}_{t}-J^{*})\,\Big|\,S_{0}=s\right].$$

We also introduce the state-action cost function:

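$$Q((e,b,\delta^{rx},\delta^{tx},r),a)\triangleq\delta^{rx}+\mathbb{E}\big[h(e^{\prime},b^{\prime},{\delta^{rx}}^{\prime},{\delta^{tx}}^{\prime},r^{\prime})\mid a\big]. \tag{5}$$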

Then an optimal policy, for any $(e,b,\delta^{rx},\delta^{tx},r)\in\mathcal{S}$ , takes the action achieving the minimum in (5):

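$$\pi^{*}(e,b,\delta^{rx},\delta^{tx},r)\in\operatorname*{arg\,min}_{a\in\{\mathrm{i},\mathrm{n},\mathrm{x}\}}Q((e,b,\delta^{rx},\delta^{tx},r),a). \tag{6}$$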

An optimal policy solving (4), (5) and (6) defined above can be found by relative value iteration (RVI) for finite-state finite-action average-cost MDPs from Section 8.5.5 of [21]:

Starting with an arbitrary initialization of $h_{0}(s)$ , $\forall s\in\mathcal{S}$ , and setting an arbitrary but fixed reference state $s^{ref}\triangleq(e^{ref},b^{ref},{\delta^{rx}}^{ref},{\delta^{tx}}^{ref},r^{ref})$ , a single iteration of the RVI algorithm $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$ is given as follows:

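$$Q_{n+1}(s,a)\leftarrow\delta^{rx}+\mathbb{E}\big[h_{n}(s^{\prime})\mid a\big], \tag{7}$$

$$V_{n+1}(s)\leftarrow\min_{a}Q_{n+1}(s,a), \tag{8}$$

$$h_{n+1}(s)\leftarrow V_{n+1}(s)-V_{n+1}(s^{ref}), \tag{9}$$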

where $Q_{n}(s,a)$ , $V_{n}(s)$ and $h_{n}(s)$ denote the state-action value function, value function and differential value function for iteration $n$ , respectively. By Theorem 8.5.7 and Section 8.5.5 of [21], $h_{n}$ converges to $h$ , and $\pi_{n}^{*}(s)\triangleq\operatorname*{arg\,min}_{a}Q_{n}(s,a)$ converges to $\pi^{*}(s)$ .
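The RVI iteration described above can be sketched for a generic finite average-cost MDP as follows. The sketch assumes the transition probabilities are supplied as a NumPy array `P` of shape (actions, states, states) and the per-state cost as a vector `c`; the function name, the stopping rule, and the toy two-state example in the usage note are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def relative_value_iteration(P, c, s_ref=0, tol=1e-9, max_iter=10_000):
    """Relative value iteration for a finite average-cost MDP.

    P[a, s, s2] is the probability of moving from s to s2 under action a,
    c[s] is the per-slot cost in state s (here: the receiver AoI).
    Returns (average cost J, differential cost h, greedy policy).
    """
    n_actions, n_states, _ = P.shape
    h = np.zeros(n_states)
    for _ in range(max_iter):
        Q = c[None, :] + P @ h          # Q[a, s] = c(s) + E[h(s') | s, a]
        V = Q.min(axis=0)               # best achievable value per state
        h_new = V - V[s_ref]            # subtract the reference state's value
        converged = np.max(np.abs(h_new - h)) < tol
        h = h_new
        if converged:
            break
    Q = c[None, :] + P @ h
    V = Q.min(axis=0)
    # h[s_ref] is zero by construction, so V[s_ref] equals the average cost.
    return V[s_ref], h, Q.argmin(axis=0)
```

As a usage illustration, take two states with costs 1 and 3, where action 0 stays put and action 1 switches state: the sketch returns an average cost of 1, with the policy "stay in the cheap state, leave the expensive one".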

IV A Reinforcement Learning Approach

In most practical scenarios, channel error probabilities for retransmissions and the EH characteristics may not be known at the time of deployment, or may change over time. In this section, we assume that the transmitter does not know the system characteristics a-priori, and has to learn them. We employ two different online learning algorithms. First, we employ a value-based RL algorithm, namely GR-learning, which converges to an optimal policy; then, we consider a structured policy search algorithm, namely finite-difference policy gradient, which does not necessarily find the optimal policy but performs very well in practice, as demonstrated through simulations in Section V. We also note that GR-learning learns from a single trajectory generated during learning steps while policy gradient uses Monte-Carlo roll-outs for each policy update. Thus, GR-learning is more applicable to real-time systems.

IV-A GR-Learning with Softmax

The literature for average-cost RL is quite limited compared to discounted cost problems [22, 23]. For the average AoI minimization problem in (3), we employ a modified version of the GR-learning algorithm proposed in [17], as outlined in Algorithm 1, with Boltzmann (softmax) exploration. The resulting algorithm is called GR-learning with softmax.

Notice that, by only knowing $Q(s,a)$ , one can find the optimal policy $\pi^{*}$ using (6) without knowing the transition probabilities $P$ characterized by $g(r)$ and $p_{E}$ . Thus, GR-learning with softmax starts with an initial estimation of $Q_{0}(s,a)$ and finds the optimal policy by estimating state-action values in a recursive manner. In the $n^{th}$ iteration, after taking action $A_{n}$ , the transmitter observes the next state $S_{n+1}$ , and the instantaneous cost value $\Delta^{rx}_{n}$ . Based on this, the estimate of $Q_{n+1}(s,a)$ is updated by a weighted average of the previous estimate $Q_{n}(s,a)$ and the estimated expected value of the current policy in the next state $S_{n+1}$ . Moreover, we update the gain $J_{n}$ at every time slot based on the empirical average of AoI.

In each time slot, the learning algorithm

• observes the current state $S_{n}\in\mathcal{S}$ ,

• selects and performs an action $A_{n}\in\mathcal{A}$ ,

• observes the next state $S_{n+1}\in\mathcal{S}$ and the instantaneous cost $\Delta^{rx}_{n}$ ,

• updates its estimate of $Q(S_{n},A_{n})$ using the current estimate of $J_{n}$ by

$$Q_{n+1}(S_{n},A_{n})\leftarrow Q_{n}(S_{n},A_{n})+\alpha(m(S_{n},A_{n},n))\big[\Delta^{rx}_{n}-J_{n}+Q_{n}(S_{n+1},A_{n+1})-Q_{n}(S_{n},A_{n})\big], \tag{10}$$

where $\alpha(m(S_{n},A_{n},n))$ is the update parameter (learning rate) in the $n^{th}$ iteration, and depends on the function $m(S_{n},A_{n},n)$ , which is the number of times the state–action pair $(S_{n},A_{n})$ has been visited up to the $n^{th}$ iteration,

• updates its estimate of $J_{n}$ based on the empirical average as follows:

$$J_{n+1}\leftarrow J_{n}+\beta(n)\left[\frac{nJ_{n}+\Delta^{rx}_{n}}{n+1}-J_{n}\right], \tag{11}$$

where $\beta(n)$ is the update parameter in the $n^{th}$ iteration.

The transmitter action selection method should balance the exploration of new actions with the exploitation of actions known to perform well. In particular, the Boltzmann (softmax) action selection method, which chooses each action randomly relative to its expected cost, is used in this paper as follows:

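$$\pi(a\mid S_{n})=\frac{\exp(-Q(S_{n},a)/\tau_{n})}{\sum_{a^{\prime}\in\mathcal{A}}\exp(-Q(S_{n},a^{\prime})/\tau_{n})}. \tag{12}$$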

Parameter $\tau$ in (12) is called the temperature parameter and decays exponentially with decay parameter $\gamma$ . High $\tau$ corresponds to more uniform action selection (exploration), whereas low $\tau$ is biased toward the best action (exploitation). According to Theorem 2 of [17], if $\alpha$ and $\beta$ satisfy $\sum_{m=1}^{\infty}\alpha(m)=\sum_{m=1}^{\infty}\beta(m)=\infty$ , $\sum_{m=1}^{\infty}\alpha^{2}(m),\sum_{m=1}^{\infty}\beta^{2}(m)<\infty$ , and $\lim_{m\to\infty}\frac{\beta(m)}{\alpha(m)}=0$ , GR-learning converges to an optimal policy.
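One learning step of the procedure in this subsection can be sketched with two small helpers: Boltzmann action probabilities over costs, and the coupled updates of the state-action estimate and the gain. This is a minimal sketch, not the authors' implementation; the dictionary-based tables, the function names, and the zero initialization of unseen entries are assumptions made here.

```python
import math

def softmax_probs(q_values, tau):
    """Boltzmann probabilities over actions; lower cost gets higher probability.

    q_values maps each action to its current cost estimate Q(S_n, a).
    """
    q_min = min(q_values.values())  # shift by the minimum for numerical stability
    weights = {a: math.exp(-(q - q_min) / tau) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def gr_update(Q, J, visits, n, s, a, cost, s_next, a_next, alpha, beta):
    """Apply one Q update and one gain update; returns the new gain J.

    Q and visits are dicts keyed by (state, action); unseen entries count as 0.
    alpha and beta are callables giving the step sizes.
    """
    visits[(s, a)] = visits.get((s, a), 0) + 1
    m = visits[(s, a)]                      # visit count of the pair (s, a)
    q_sa = Q.get((s, a), 0.0)
    # Temporal-difference term: cost minus gain plus next-pair estimate.
    td = cost - J + Q.get((s_next, a_next), 0.0) - q_sa
    Q[(s, a)] = q_sa + alpha(m) * td
    # Gain moves toward the running empirical average of the cost.
    return J + beta(n) * ((n * J + cost) / (n + 1) - J)
```

Step sizes of the form $y/(m+1)^{z}$ can be passed in as, for instance, `alpha=lambda m: 1.0 / (m + 1) ** 0.6`; the specific constants here are placeholders.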

IV-B Finite-Difference Policy Gradient

GR-learning in Section IV-A is a value-based RL method, which learns the state-action value function for each state-action pair. In practice, $\Delta_{max}$ can be large, which might slow down the convergence of GR-learning due to a large state-space.

In this section, we simplify the problem and obtain a structured, possibly suboptimal policy, which can be learned via the policy gradient method [18]. We make two assumptions on the policy space in order to obtain a more efficient learning algorithm:

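• We assume that a packet is retransmitted until it is successfully decoded, provided that there is enough energy in the battery; that is, the transmitter is not allowed to preempt an undecoded packet and transmit a new one.

• The solution to the simplified problem is of threshold type, that is,

$$A_{t}=\begin{cases}\mathrm{i}&\text{if }\Delta^{rx}_{t}<\mathcal{T}(e,b,\delta^{tx},r),\\ \mathrm{n}&\text{if }\Delta^{rx}_{t}\geq\mathcal{T}(e,b,\delta^{tx},r)\text{ and }r=0,\\ \mathrm{x}&\text{if }\Delta^{rx}_{t}\geq\mathcal{T}(e,b,\delta^{tx},r)\text{ and }r\neq 0,\end{cases} \tag{13}$$

for some $\mathcal{T}(e,b,\delta^{tx},r)$ .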

Note that $A_{t}=\mathrm{i}$ if $b<E^{tx}$ ( $b<E^{tx}+E^{s}$ ) for $r>0$ ( $r=0$ ); that is, $\mathcal{T}(e,b,\delta^{tx},r)=\Delta_{max}+1$ in these cases. This ensures that the energy causality constraint in (2) holds. The other thresholds are determined using policy gradient.

In order to employ the policy gradient method, we approximate the policy by a parameterized smooth function with parameters $\theta(e,b,\delta^{tx},r)$ , and convert the discrete policy search problem into estimating the optimal values of some continuous parameters, which can be numerically solved by stochastic approximation algorithms [24].

In particular, with a slight abuse of notation, we let $\pi_{\theta}(e,b,\delta^{rx},\delta^{tx},r)$ denote the probability of taking action $A_{t}=\mathrm{n}$ ( $A_{t}=\mathrm{x}$ ) if $r=0$ ( $r\neq 0$ ), and consider the parameterized sigmoid function:

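$$\pi_{\theta}(e,b,\delta^{rx},\delta^{tx},r)\triangleq\frac{1}{1+e^{-\frac{\delta^{rx}-\theta(e,b,\delta^{tx},r)}{\tau}}}. \tag{14}$$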

We note that $\pi_{\theta}(e,b,\delta^{rx},\delta^{tx},r)\rightarrow\{0,1\}$ and $\theta(e,b,\delta^{tx},r)\rightarrow\mathcal{T}(e,b,\delta^{tx},r)$ as $\tau\rightarrow 0$ . Therefore, in order to converge to a deterministic policy $\pi$ , $\tau>0$ can be taken as a sufficiently small constant, or can be decreased gradually to zero. The total number of parameters to be estimated is $|\mathcal{E}|\times B_{max}\times\Delta_{max}\times R_{max}+1$ minus the parameters corresponding to $b<E^{tx}$ ( $b<E^{tx}+E^{s}$ ) for $r>0$ ( $r=0$ ) due to energy causality constraints as stated previously.
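The smoothed threshold policy can be illustrated with a few lines of Python. The sketch assumes the standard logistic form $1/(1+e^{-x})$ with $x=(\delta^{rx}-\theta)/\tau$, which is what makes the output a probability that sharpens to a hard threshold as $\tau\rightarrow 0$; the function name `transmit_probability` is hypothetical.

```python
import math

def transmit_probability(delta_rx, theta, tau):
    """Probability of transmitting (n if r == 0, x otherwise) at receiver age delta_rx.

    theta is the soft threshold for the current (e, b, delta_tx, r) and
    tau > 0 is the temperature controlling how sharp the threshold is.
    """
    x = (delta_rx - theta) / tau
    # Evaluate the logistic function in a numerically stable way.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)
```

At `delta_rx == theta` the probability is exactly one half, and with a small temperature such as `tau=0.01` an age one slot above or below the threshold already gives a probability very close to 1 or 0.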

With a slight abuse of notation, we map the parameters $\theta(e,b,\delta^{tx},r)$ to a vector $\overline{\theta}$ of size $d\triangleq|\mathcal{E}|\times B_{max}\times\Delta_{max}\times R_{max}+1$ . Starting with some initial estimates of $\overline{\theta}_{0}$ , the parameters can be updated in each iteration $n$ using the gradients as follows:

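$$\overline{\theta}_{n+1}=\overline{\theta}_{n}-\gamma(n)\,\partial J/\partial\overline{\theta}_{n}, \tag{15}$$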

where the step size parameter $\gamma(n)$ is a positive decreasing sequence and satisfies the first two convergence properties given at the end of Section IV-A.

Computing the gradient of the average AoI directly is not possible; however, several methods exist in the literature to estimate the gradient [24]. In particular, we employ the finite-difference policy gradient method [18]. In this method, the gradient is estimated by evaluating $J$ at slightly perturbed parameter values. First, a random perturbation vector $D_{n}$ of size $d$ is generated according to a predefined probability distribution, e.g., each component of $D_{n}$ is an independent Bernoulli random variable with parameter $q\in(0,1)$ . The thresholds are perturbed by a small amount $\sigma>0$ in the directions defined by $D_{n}$ to obtain $\overline{\theta}_{n}^{\pm}(e,b,\delta^{tx},r)\triangleq\overline{\theta}_{n}(e,b,\delta^{tx},r)\pm\sigma D_{n}$ . Then, empirical estimates $\widehat{J}^{\pm}$ of the average AoI corresponding to the perturbed parameters $\overline{\theta}_{n}^{\pm}$ , obtained from Monte-Carlo rollouts, are used to estimate the gradient:

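$$\partial J/\partial\overline{\theta}_{n}\approx(D_{n}^{\intercal}D_{n})^{-1}D_{n}^{\intercal}\frac{(\widehat{J}^{+}-\widehat{J}^{-})}{2\sigma}, \tag{16}$$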

where $D_{n}^{\intercal}$ denotes the transpose of vector $D_{n}$ .
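The finite-difference estimate described in this subsection can be sketched as below. Two assumptions are made here and are not taken from the paper: the Bernoulli draws are mapped to $\pm 1$ directions so that $D_{n}^{\intercal}D_{n}$ is never zero, and `estimate_J` stands in for the Monte-Carlo rollout that returns an empirical average AoI for a given parameter vector.

```python
import numpy as np

def fd_gradient(estimate_J, theta, sigma, rng, q=0.5):
    """Two-sided finite-difference gradient estimate along a random direction.

    estimate_J: callable mapping a parameter vector to an (empirical) average cost.
    theta: current parameter vector; sigma: perturbation size; rng: NumPy Generator.
    """
    d = theta.size
    # Bernoulli(q) draws mapped to +1 / -1 (an assumption of this sketch).
    D = np.where(rng.random(d) < q, 1.0, -1.0)
    J_plus = estimate_J(theta + sigma * D)
    J_minus = estimate_J(theta - sigma * D)
    # (D^T D)^{-1} D^T (J+ - J-) / (2 sigma), with D treated as a column vector.
    return D * (J_plus - J_minus) / (2.0 * sigma) / (D @ D)

def gradient_step(theta, grad, step_size):
    """One descent step on the parameter vector."""
    return theta - step_size * grad
```

With a one-dimensional quadratic cost $J(\theta)=\theta^{2}$ at $\theta=2$ , the estimate equals the true derivative 4 for either sign of the perturbation, which makes a convenient sanity check before plugging in noisy rollouts.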

V Simulation Results

In this section, we provide numerical results for all the proposed algorithms, and compare the achieved average AoI. Motivated by previous research on HARQ [25], [19], [20], we assume that the decoding error decreases exponentially with the number of retransmissions, that is, $g(r)\triangleq p_{0}\lambda^{r}$ for some $\lambda\in(0,1)$ , where $p_{0}$ denotes the error probability of the first transmission and $r$ is the retransmission count (set to $0$ for the first transmission). The exact value of the rate $\lambda$ depends on the particular HARQ protocol and the channel model. Following the IEEE 802.16 standard [20], the maximum number of retransmissions used for decoding is set to $R_{max}=3$ . In the following experiments, $\lambda$ and $p_{0}$ are set to $0.5$ . $E^{tx}$ and $E^{s}$ are both assumed to be constant and equal to 1 unit of energy unless otherwise stated. $\Delta_{max}$ is set to $40$ .
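For concreteness, the error model with the stated parameters can be tabulated in a couple of lines. The function name is hypothetical, and clamping the count at $R_{max}$ reflects the truncated retransmission count of Section II rather than anything stated in this paragraph.

```python
def error_probability(r, p0=0.5, lam=0.5, r_max=3):
    """Decoding error probability g(r) = p0 * lam**r after r retransmissions.

    The count is clamped at r_max because only the last r_max
    retransmissions are combined by the receiver.
    """
    r = min(r, r_max)
    return p0 * lam ** r

# Error probability for the first transmission and each retransmission.
g_table = [error_probability(r) for r in range(4)]
```

With $p_{0}=\lambda=0.5$ this gives 0.5, 0.25, 0.125 and 0.0625 for $r=0,1,2,3$ .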

We choose the exact step sizes for the learning algorithms by fine-tuning in order to balance the algorithm stability in the early time steps with non-negligible step sizes in the later time steps. In particular, we use step size parameters of the form $\alpha(m),\beta(m),\gamma(m)=y/(m+1)^{z}$ , where $0.5<z\leq 1$ and $y>0$ (which satisfy the convergence conditions), and choose $y$ and $z$ such that the oscillations are low and the convergence rate is high. We have observed that a given choice of parameters yields similar performance across the scenarios considered in the simulation results.

V-A Uncorrelated EH

We first investigate the average AoI with HARQ when the EH process, $E_{t}\in\mathcal{E}=\{0,1\}$ , is i.i.d. over time with probability distribution $Pr(E_{t}=1)=p_{e}$ , $\forall t$ . The RVI algorithm in Section III is employed, and the effects of the battery capacity $B_{max}$ , energy consumption of sensing $E^{s}$ , and $p_{e}$ on the average AoI are shown in Figure 2. As expected, the average AoI increases with decreasing $B_{max}$ , decreasing $p_{e}$ and increasing $E^{s}$ . We note that, when $E^{s}=0$ and $B_{max}=\infty$ , the problem defined in (3) corresponds to minimizing the average AoI under an average transmission rate constraint $p_{e}$ , studied in [11, 13]. The average AoI under average transmission rate constraint ( $B_{max}=\infty$ ) is also shown in Figure 2.

Figure 3 shows the evolution of the average AoI over time when the average-cost RL algorithms are employed. As a baseline, we have also included the performance of a greedy policy, which sends a new status update whenever there is sufficient energy for both sensing and transmission. It retransmits the last transmitted status update when the energy in the battery is sufficient only for transmission, and it remains idle otherwise; that is, $A_{t}=\mathrm{n}$ if $B_{t}\geq E^{tx}+E^{s}$ , $A_{t}=\mathrm{x}$ if $E^{tx}\leq B_{t}<E^{tx}+E^{s}$ and $A_{t}=\mathrm{i}$ if $B_{t}<E^{tx}$ . It can be observed that the average AoI achieved by the proposed RL algorithms converges to values close to the one obtained from the RVI algorithm, which has a priori knowledge of $g(r)$ and $p_{e}$ , while the AoI of the greedy algorithm is significantly higher. Although the policy gradient algorithm based on the threshold policy does not allow preemption of an undecoded status update, it performs better than GR-learning, since it learns a significantly smaller number of threshold values (i.e., $\Delta_{max}\times B_{max}\times R_{max}+1$ ) than GR-learning, which learns one value for each state-action pair (i.e., $\Delta_{max}^{2}\times B_{max}\times(R_{max}+1)\times|\mathcal{A}|$ ).
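The greedy baseline and the two parameter counts quoted above are simple enough to write down directly. In this sketch the function names are hypothetical, and the battery capacity used in the usage note ( $B_{max}=5$ ) is an illustrative value, since the paragraph does not fix one.

```python
def greedy_action(b, E_s=1, E_tx=1):
    """Greedy baseline: act as aggressively as the battery level b allows."""
    if b >= E_tx + E_s:
        return "n"   # enough energy to sense and transmit a fresh update
    if b >= E_tx:
        return "x"   # enough only for a retransmission
    return "i"       # not enough energy to transmit at all

def table_sizes(D_max, B_max, R_max, n_actions):
    """Number of learned values, using the two expressions given in the text.

    Returns (threshold parameters for policy gradient, Q-table entries for GR-learning).
    """
    thresholds = D_max * B_max * R_max + 1
    q_entries = D_max ** 2 * B_max * (R_max + 1) * n_actions
    return thresholds, q_entries
```

For example, `table_sizes(40, 5, 3, 3)` evaluates the two expressions to 601 threshold parameters versus 96000 state-action values, which illustrates the gap in what each method has to learn.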

V-B Temporally Correlated EH

Next, we investigate the performance when the EH process has temporal correlations. A symmetric two-state Markovian EH process is assumed, such that $\mathcal{E}=\{0,1\}$ and $Pr(E_{t+1}=1|E_{t}=0)=Pr(E_{t+1}=0|E_{t}=1)=0.3$ . That is, if the transmitter is in harvesting state, it is more likely to continue harvesting energy, and vice versa for the non-harvesting state.

Figure 4 illustrates the policy obtained by RVI. As can be seen from the figure, the resulting policy is less likely to transmit if the battery level or the AoI is low. Moreover, the policy tends to retransmit the previous update rather than sensing a new update when the battery level is low and the AoI is high. When the system is in the non-harvesting state (i.e., $E_{t}=0$ ), the transmitter is more conservative in transmitting status updates than in the case $E_{t}=1$ ; e.g., it might not transmit even if the battery is full, depending on the AoI level.

Figure 5 shows the evolution of the average AoI over time when the average-cost RL algorithms are employed. It can be observed again that the average AoI achieved by the learned threshold parameters in Section IV-B, denoted by policy gradient in the figure, is very close to the one obtained from the RVI algorithm, which has a priori knowledge of $g(r)$ and $p_{e}$ . GR-learning, on the other hand, outperforms the greedy policy but converges to the optimal policy much more slowly, and the gap between the two RL algorithms is even larger than in the i.i.d. case. Tabular methods in RL, like GR-learning, need to visit each state-action pair infinitely often to converge [22]. GR-learning does not perform as well with temporally correlated EH as in the i.i.d. case, since the state space becomes larger with the addition of the EH state.

Next, we investigate the impact of the burstiness of the EH process, measured by the correlation coefficient between $E_{t}$ and $E_{t+1}$ . Figure 6 illustrates the performance of the proposed RL algorithms for different correlation coefficients, which can be computed easily for the 2-state symmetric Markov chain; that is, $\rho\triangleq(2p_{E}(1,1)-1)$ . Note that $\rho=0$ corresponds to the i.i.d. EH with $p_{e}=1/2$ . We note that the average AoI is minimized by transmitting new packets successfully at regular intervals, which has been well investigated in previous works [1, 11, 2]. Intuitively, for highly correlated EH, there are either successive transmissions or successive idle time slots, which increases the average AoI. Hence, the AoI is higher for higher values of $\rho$ . Figure 6 also shows that both RL algorithms result in much lower average AoI than the greedy policy and policy gradient RL outperforms GR-learning since it benefits from the structural characteristics of a threshold policy.
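As a quick check of the correlation formula above, reading $p_{E}(1,1)$ as the probability of remaining in the harvesting state (an interpretation assumed here), the symmetric chain of this section with switching probability $0.3$ has $p_{E}(1,1)=1-0.3=0.7$ , while the i.i.d. case with $p_{e}=1/2$ has $p_{E}(1,1)=0.5$ :

$$\rho=2(0.7)-1=0.4,\qquad\rho_{\text{i.i.d.}}=2(0.5)-1=0.$$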

VI Conclusions

We have considered an EH system with a finite-size battery and investigated scheduling policies for transmitting time-sensitive data over a noisy channel, with the average AoI as the performance measure, which quantifies the timeliness of the data available at the receiver. In addition to identifying an RVI solution for the optimal policy when the system characteristics are known, efficient RL algorithms are also presented for practical applications in which the system characteristics may not be known in advance. The effects of battery size, EH characteristics and the HARQ structure on the average AoI are investigated through numerical simulations. The algorithms adopted in this paper are relevant to other systems concerning the timeliness of information or those powered by renewable energy sources.

Bibliography

[1] B. T. Bacinoglu, E. T. Ceran, and E. Uysal-Biyikoglu, “Age of information under energy replenishment constraints,” in Inf. Theory and Applications Workshop (ITA), Feb 2015, pp. 25–31.
[2] R. D. Yates, “Lazy is timely: Status updates by an energy harvesting source,” in IEEE Int’l Symposium on Information Theory (ISIT), 2015, pp. 3008–3012.
[3] B. T. Bacinoglu and E. Uysal-Biyikoglu, “Scheduling status updates to minimize age of information with an energy harvesting sensor,” CoRR, vol. abs/1701.08354, 2017.
[4] A. Arafa, J. Yang, and S. Ulukus, “Age-minimal online policies for energy harvesting sensors with random battery recharges,” CoRR, vol. abs/1802.01563, 2018.
[5] A. Arafa and S. Ulukus, “Age minimization in energy harvesting communications: Energy-controlled delays,” CoRR, vol. abs/1712.03945, 2017.
[6] X. Wu, J. Yang, and J. Wu, “Optimal status update for age of information minimization with an energy harvesting source,” IEEE Trans. on Green Comms. and Networking, vol. 2, no. 1, pp. 193–204, March 2018.
[7] B. T. Bacinoglu, Y. Sun, E. Uysal-Biyikoglu, and V. Mutlu, “Achieving the age-energy tradeoff with a finite-battery energy harvesting source,” CoRR, vol. abs/1802.04724, 2018.
[8] S. Feng and J. Yang, “Age of information minimization for an energy harvesting source with updating erasures: With and without feedback,” CoRR, 2018.