Decentralized Deep Reinforcement Learning for Delay-Power Tradeoff in Vehicular Communications

Xianfu Chen; Celimuge Wu; Honggang Zhang; Yan Zhang; Mehdi Bennis; Heli Vuojala

arXiv:1906.00625·eess.SP·June 4, 2019

TL;DR

This paper introduces a decentralized deep reinforcement learning approach to optimize delay-power tradeoff in vehicular communications by enabling VUE-pairs to make local decisions based on partial network observations.

Contribution

It proposes a novel online LSTM-based deep reinforcement learning algorithm that decomposes a complex MDP into manageable per-VUE-pair problems for decentralized control.

Findings

1. The algorithm effectively balances delay and power consumption in vehicular networks.

2. Decentralized decision-making achieves near-optimal performance compared to centralized solutions.

3. Numerical simulations confirm the algorithm's robustness and efficiency.

Abstract

This paper targets the problem of radio resource management for expected long-term delay-power tradeoff in vehicular communications. At each decision epoch, the road side unit observes the global network state, allocates channels and schedules data packets for all vehicle user equipment-pairs (VUE-pairs). The decision-making procedure is modelled as a discrete-time Markov decision process (MDP). The technical challenges in solving for an optimal control policy originate from the high spatial mobility of vehicles and temporal variations in data traffic. To simplify the decision-making process, we first decompose the MDP into a series of per-VUE-pair MDPs. We then propose an online long short-term memory based deep reinforcement learning algorithm to break the curse of high dimensionality in state space faced by each per-VUE-pair MDP. With the proposed algorithm, the optimal channel…

Tables

Table I: Parameter values in simulations.

Parameter	Value
Replay memory capacity $M$	$5000$
Mini-batch size $\widetilde{M}$	$200$
Observation pool size $N$	$20$
Path loss exponents $\rho$, $\xi$	$-68.5$ dB, $-54.5$ dB
Path loss coefficient $e$	$1.61$
Distance $\varphi_{0}$	$15$ m
Number of VUE-pair groups $I$	$10$
Clustering interval $T$	$10$ epochs
Frequency bandwidth $w$	$500$ kHz
Aggregate interference $\vartheta$	$2\cdot 10^{-9}$ W
Noise power spectral density $\sigma^{2}$	$7.95\cdot 10^{-21}$ W/Hz
Scheduling epoch duration $\delta$	$18$ ms
Weights $\phi$, $\eta$	$30$, $1$
Data packet size $\mu$	$9$ kb
Discount factor $\gamma$	$0.9$
Exploration probability $\epsilon$	$0.06$

Equations

$$H_{k}^{t}=\begin{cases}\rho\cdot\left(\sqrt{\left|x_{k}^{(1),t}-x_{k}^{(2),t}\right|^{2}+\left|y_{k}^{(1),t}-y_{k}^{(2),t}\right|^{2}}\right)^{-e}, & \text{if VUE-pair $k$ is in LOS}\\ \rho\cdot\left(\left|x_{k}^{(1),t}-x_{k}^{(2),t}\right|+\left|y_{k}^{(1),t}-y_{k}^{(2),t}\right|\right)^{-e}, & \text{if VUE-pair $k$ is in WLOS}\\ \xi\cdot\left(\left|x_{k}^{(1),t}-x_{k}^{(2),t}\right|\cdot\left|y_{k}^{(1),t}-y_{k}^{(2),t}\right|\right)^{-e}, & \text{if VUE-pair $k$ is in NLOS}\end{cases}$$

$$u_{k,j}^{t}=\begin{cases}1, & \text{if channel $j$ is allocated to VUE-pair $k$ during decision epoch $t$;}\\ 0, & \text{otherwise.}\end{cases}$$

$$\sum_{j\in\mathcal{J}}u_{k,j}^{t}\leq 1,\qquad\sum_{k\in\mathcal{K}_{i}}u_{k,j}^{t}\leq 1$$

$$q_{k}^{t+1}=\max\left\{q_{k}^{t}-r_{k}^{t}\cdot\mathds{1}_{\left\{\sum_{j\in\mathcal{J}}u_{k,j}^{t}=1\right\}},0\right\}+a_{k}^{t}$$

$$p_{k}^{t}=\frac{\vartheta+w\cdot\sigma^{2}}{g_{k,j}^{t}}\cdot\left(2^{\frac{\mu\cdot r_{k}^{t}}{w\cdot\delta}}-1\right)\cdot\mathds{1}_{\left\{u_{k,j}^{t}=1\right\}}$$

$$\mathbb{P}\!\left(\mathbf{s}^{t+1}\mid\mathbf{s}^{t},\bm{\pi}(\mathbf{s}^{t})\right)=\prod_{k\in\mathcal{K}}\mathbb{P}\!\left(\mathbf{g}_{k}^{t+1}\mid(\mathbf{x}_{k}^{t+1},\mathbf{y}_{k}^{t+1})\right)\cdot\mathbb{P}\!\left((\mathbf{x}_{k}^{t+1},\mathbf{y}_{k}^{t+1})\mid(\mathbf{x}_{k}^{t},\mathbf{y}_{k}^{t})\right)\cdot\mathbb{P}\!\left(q_{k}^{t+1}\mid q_{k}^{t},\mathbf{u}_{k}^{t},r_{k}^{t}\right)$$

$$f_{k}(\mathbf{s}^{t},\mathbf{u}_{k}^{t},r_{k}^{t})=\phi\cdot d(q_{k}^{t})+\eta\cdot p_{k}^{t}$$

$$V_{k}(\mathbf{s},\bm{\pi})=(1-\gamma)\cdot\textsf{E}_{\bm{\pi}}\!\left[\sum_{t=1}^{\infty}(\gamma)^{t-1}\cdot f_{k}(\mathbf{s}^{t},\mathbf{u}_{k}^{t},r_{k}^{t})\mid\mathbf{s}\right]$$

$$\min_{\bm{\pi}}\ V(\mathbf{s},\bm{\pi})=(1-\gamma)\cdot\textsf{E}_{\bm{\pi}}\!\left[\sum_{t=1}^{\infty}(\gamma)^{t-1}\cdot f(\mathbf{s}^{t},\bm{\pi}(\mathbf{s}^{t}))\mid\mathbf{s}\right]\quad\mathrm{s.t.}\ \text{the channel allocation constraints on } u_{k,j}^{t}$$

$$V(\mathbf{s})=\min_{\bm{\pi}(\mathbf{s})}\left\{(1-\gamma)\cdot f(\mathbf{s},\bm{\pi}(\mathbf{s}))+\gamma\cdot\sum_{\mathbf{s}^{\prime}\in\mathcal{S}^{K}}\mathbb{P}(\mathbf{s}^{\prime}\mid\mathbf{s},\bm{\pi}(\mathbf{s}))\cdot V(\mathbf{s}^{\prime})\right\}$$

$$Q(\mathbf{s},\mathbf{u},\mathbf{r})=(1-\gamma)\cdot f(\mathbf{s},\mathbf{u},\mathbf{r})+\gamma\cdot\sum_{\mathbf{s}^{\prime}\in\mathcal{S}^{K}}\mathbb{P}(\mathbf{s}^{\prime}\mid\mathbf{s},\mathbf{u},\mathbf{r})\cdot V(\mathbf{s}^{\prime})$$

$$V(\mathbf{s})=\min_{\mathbf{u},\mathbf{r}}Q(\mathbf{s},\mathbf{u},\mathbf{r})$$

$$Q(\mathbf{s},\mathbf{u},\mathbf{r})=(1-\gamma)\cdot f(\mathbf{s},\mathbf{u},\mathbf{r})+\gamma\cdot\sum_{\mathbf{s}^{\prime}\in\mathcal{S}^{K}}\mathbb{P}(\mathbf{s}^{\prime}\mid\mathbf{s},\mathbf{u},\mathbf{r})\cdot\min_{\mathbf{u}^{\prime},\mathbf{r}^{\prime}}Q(\mathbf{s}^{\prime},\mathbf{u}^{\prime},\mathbf{r}^{\prime})$$

$$Q^{t+1}(\mathbf{s},\mathbf{u},\mathbf{r})=Q^{t}(\mathbf{s},\mathbf{u},\mathbf{r})+\alpha^{t}\cdot\left((1-\gamma)\cdot f(\mathbf{s},\mathbf{u},\mathbf{r})+\gamma\cdot Q^{t}(\mathbf{s}^{\prime},\mathbf{u}^{\prime},\mathbf{r}^{\prime})-Q^{t}(\mathbf{s},\mathbf{u},\mathbf{r})\right)$$

$$Q(\mathbf{s},\mathbf{u},\mathbf{r})=\sum_{k\in\mathcal{K}}Q_{k}(\mathbf{s},\mathbf{u}_{k},r_{k})$$

$$Q_{k}(\mathbf{s},\mathbf{u}_{k},r_{k})=(1-\gamma)\cdot f_{k}(\mathbf{s},\mathbf{u}_{k},r_{k})+\gamma\cdot\sum_{\mathbf{s}^{\prime}\in\mathcal{S}^{K}}\mathbb{P}\!\left(\mathbf{s}^{\prime}\mid\mathbf{s},(\mathbf{u}_{k},\mathbf{u}_{-k}),(r_{k},\mathbf{r}_{-k})\right)\cdot Q_{k}(\mathbf{s}^{\prime},\mathbf{u}_{k}^{\prime},r_{k}^{\prime})$$

$$\bm{\pi}^{*}(\mathbf{s}^{\prime})=\underset{\mathbf{u}^{\prime},\mathbf{r}^{\prime}}{\arg\min}\sum_{k\in\mathcal{K}}Q_{k}(\mathbf{s}^{\prime},\mathbf{u}_{k}^{\prime},r_{k}^{\prime})$$

$$Q_{k}(\mathbf{s},\mathbf{u}_{k},r_{k})\approx Q_{k}(\mathbf{s}_{k},\mathbf{o}_{k},\mathbf{u}_{k},r_{k})$$

$$\mathbf{m}^{t-m+1}=\left(\left((\mathbf{s}_{k}^{t-m},\mathbf{o}_{k}^{t-m}),(\mathbf{u}_{k}^{t-m},r_{k}^{t-m}),f_{k}(\mathbf{s}^{t-m},\mathbf{u}_{k}^{t-m},r_{k}^{t-m}),(\mathbf{s}_{k}^{t-m+1},\mathbf{o}_{k}^{t-m+1}),(\mathbf{u}_{k}^{t-m+1},r_{k}^{t-m+1})\right):k\in\mathcal{K}\right)$$

$$\widetilde{\mathcal{M}}^{t_{m}}=\left\{\left(\mathcal{N}_{k}^{t_{m}},(\mathbf{u}_{k}^{t_{m}},r_{k}^{t_{m}}),f_{k}(\mathbf{s}^{t_{m}},\mathbf{u}_{k}^{t_{m}},r_{k}^{t_{m}}),\mathcal{N}_{k}^{t_{m}+1},(\mathbf{u}_{k}^{t_{m}+1},r_{k}^{t_{m}+1})\right):k\in\mathcal{K}\right\}$$

$$L(\bm{\theta}^{t})=\textsf{E}_{\left\{\left(\left(\mathcal{N}_{k},(\mathbf{u}_{k},r_{k}),f_{k}(\mathbf{s},\mathbf{u}_{k},r_{k}),\mathcal{N}_{k}^{\prime},(\mathbf{u}_{k}^{\prime},r_{k}^{\prime})\right):k\in\mathcal{K}\right)\in\widetilde{\mathcal{M}}^{t}\right\}}\!\left[\left(\sum_{k\in\mathcal{K}}\left((1-\gamma)\cdot f_{k}(\mathbf{s},\mathbf{u}_{k},r_{k})+\gamma\cdot Q_{k}\!\left(\mathcal{N}_{k}^{\prime},\mathbf{u}_{k}^{\prime},r_{k}^{\prime};\bm{\theta}_{-}^{t}\right)-Q_{k}\!\left(\mathcal{N}_{k},\mathbf{u}_{k},r_{k};\bm{\theta}^{t}\right)\right)\right)^{2}\right]$$

$$\nabla_{\bm{\theta}^{t}}L(\bm{\theta}^{t})=\textsf{E}_{\left\{\left(\left(\mathcal{N}_{k},(\mathbf{u}_{k},r_{k}),f_{k}(\mathbf{s},\mathbf{u}_{k},r_{k}),\mathcal{N}_{k}^{\prime},(\mathbf{u}_{k}^{\prime},r_{k}^{\prime})\right):k\in\mathcal{K}\right)\in\widetilde{\mathcal{M}}^{t}\right\}}\!\left[\sum_{k\in\mathcal{K}}\left((1-\gamma)\cdot f_{k}(\mathbf{s},\mathbf{u}_{k},r_{k})+\gamma\cdot Q_{k}\!\left(\mathcal{N}_{k}^{\prime},\mathbf{u}_{k}^{\prime},r_{k}^{\prime};\bm{\theta}_{-}^{t}\right)-Q_{k}\!\left(\mathcal{N}_{k},\mathbf{u}_{k},r_{k};\bm{\theta}^{t}\right)\right)\cdot\nabla_{\bm{\theta}^{t}}\!\left(\sum_{k\in\mathcal{K}}Q_{k}\!\left(\mathcal{N}_{k},\mathbf{u}_{k},r_{k};\bm{\theta}^{t}\right)\right)\right]$$

Taxonomy

Topics: Vehicular Ad Hoc Networks (VANETs) · Advanced MIMO Systems Optimization · Advanced Wireless Network Optimization

Full text

Decentralized Deep Reinforcement Learning for Delay-Power Tradeoff in Vehicular Communications

Xianfu Chen, Celimuge Wu, Honggang Zhang, Yan Zhang, Mehdi Bennis, and Heli Vuojala

X. Chen and H. Vuojala are with the VTT Technical Research Centre of Finland, Finland (email: {xianfu.chen, heli.vuojala}@vtt.fi). C. Wu is with the Graduate School of Informatics and Engineering, University of Electro-Communications, Japan (email: [email protected]). H. Zhang is with the College of Information Science and Electronic Engineering, Zhejiang University, China (email: [email protected]). Y. Zhang is with the Department of Informatics, University of Oslo, Norway (email: [email protected]). M. Bennis is with the Centre for Wireless Communications, University of Oulu, Finland (email: [email protected]).

Abstract

This paper targets the problem of radio resource management for expected long-term delay-power tradeoff in vehicular communications. At each decision epoch, the road side unit observes the global network state, allocates channels and schedules data packets for all vehicle user equipment-pairs (VUE-pairs). The decision-making procedure is modelled as a discrete-time Markov decision process (MDP). The technical challenges in solving for an optimal control policy originate from the high spatial mobility of vehicles and temporal variations in data traffic. To simplify the decision-making process, we first decompose the MDP into a series of per-VUE-pair MDPs. We then propose an online long short-term memory based deep reinforcement learning algorithm to break the curse of high dimensionality in state space faced by each per-VUE-pair MDP. With the proposed algorithm, the optimal channel allocation and packet scheduling decision at each epoch can be made in a decentralized way in accordance with the partial observations of the global network state at the VUE-pairs. Numerical simulations validate the theoretical analysis and show the effectiveness of the proposed online learning algorithm.

I Introduction

Vehicle-to-vehicle (V2V) communication technologies have been gaining popularity because they can enable emerging vehicle-related services [1, 2, 3]. However, this ad hoc type of vehicular communications requires intense coordination among the vehicles in close proximity [4]. Without the support of an infrastructure, the high vehicle mobility makes the design of efficient radio resource management (RRM) techniques extremely challenging [5]. There is a large body of literature on RRM in V2V communications. In [6], Sun et al. proposed a separate resource block and power allocation algorithm for the RRM in device-to-device based V2V communications. In [7], Yao et al. derived a loss differentiation rate adaptation scheme to meet the stringent delay and reliability requirements for V2V safety communications. In [8], Egea-Lopez et al. designed a fair adaptive beaconing rate algorithm for the problem of beaconing rate control in inter-vehicular communications. Most of these efforts have not taken into account the network dynamics, such as the temporal and spatial variations in transmission quality as well as data traffic, and hence fail to optimize the expected long-term RRM performance.

A Markov decision process (MDP) has been successfully applied to model RRM in vehicular communications with a time-varying nature. In [9], Liu and Bennis formulated a latency and reliability [10] constrained transmit power minimization problem, for which Lyapunov stochastic optimization was leveraged to handle the network dynamics. The problem with Lyapunov stochastic optimization is that only an approximately optimal solution can be constructed. In [11], Chen et al. studied the non-cooperative RRM in vehicular communications from an oblivious game-theoretic perspective and put forward an online algorithm based on reinforcement learning to approach the optimal solution. In a more practical scenario, where the channel qualities are affected by the vehicle mobility, the explosion in the state space makes the technique developed in our prior work [11] infeasible.

In this paper, we investigate a Manhattan grid V2V network, where the data traffic changes across the time horizon and the channel quality state depends on the locations of the vehicle user equipment (VUE)-transmitter (vTx) and the VUE-receiver (vRx) of a VUE-pair. The primary goal of this paper is to design an optimal RRM algorithm for each VUE-pair to strike a tradeoff between the queuing delay and the transmit power consumption over the long run. We formulate the RRM problem as an MDP and resort to a deep neural network based function approximator to deal with the curse of state space explosion [12]. In [13], Ye and Li devised a decentralized RRM mechanism based on deep reinforcement learning (DRL) for V2V communication systems. However, the mechanism does not account for the vehicle mobility, which helps facilitate frequency resource sharing among different groups of VUE-pairs. As the major contribution of this paper, we propose an online decentralized learning algorithm by exploring the recent advances in both long short-term memory (LSTM) [14] and DRL [15], with which each VUE-pair, having only partial local observations of the network state, is able to realize a significant performance improvement.

II System Model

As in Fig. 1, we consider a Manhattan grid V2V communication scenario. A set $\mathcal{K}=\{1,\cdots,K\}$ of VUE-pairs (for a well-defined road segment, the VUE density tends to be steady [16]) shares a set $\mathcal{J}=\{1,\cdots,J\}$ of orthogonal channels within the coverage $\mathcal{C}$ of a road side unit (RSU), where $\mathcal{C}$ represents a two-dimensional Euclidean space. The time horizon is discretized into decision epochs, each of which is of duration $\delta$ and is indexed by an integer $t\in\mathds{N}_{+}$ . Each vTx always follows the corresponding vRx with a fixed distance of $\varphi$ and the vRx moves in $\mathcal{C}$ according to a Manhattan mobility model [11]. Denote by $\mathbf{x}_{k}^{t}=(x_{k}^{(1),t},x_{k}^{(2),t})$ and $\mathbf{y}_{k}^{t}=(y_{k}^{(1),t},y_{k}^{(2),t})$ , respectively, the Euclidean coordinates of the vTx and the vRx of a VUE-pair $k\in\mathcal{K}$ during each epoch $t$ . Depending on whether the vTx and the vRx are in the same lane or in perpendicular lanes, the channel model during each decision epoch belongs to: 1) line-of-sight (LOS) – both the vTx and the vRx are in the same lane; 2) weak-line-of-sight (WLOS) – the vTx and the vRx are in perpendicular lanes and at least one of them is near the intersection within a distance of $\varphi_{0}$ ; and otherwise, 3) non-line-of-sight (NLOS). More specifically, the channel quality state $g_{k,j}^{t}=\nu_{k,j}^{t}\cdot H_{k}^{t}\in\mathcal{G}$ experienced by VUE-pair $k$ over channel $j\in\mathcal{J}$ during epoch $t$ includes a fast fading component $\nu_{k,j}^{t}$ of a Rayleigh distribution with a unit scale parameter and a path loss $H_{k}^{t}$ that applies the model in (4) for urban areas using 5.9 GHz carrier frequency [9],

$$H_{k}^{t}=\begin{cases}\rho\cdot\left(\sqrt{\left|x_{k}^{(1),t}-x_{k}^{(2),t}\right|^{2}+\left|y_{k}^{(1),t}-y_{k}^{(2),t}\right|^{2}}\right)^{-e}, & \text{if VUE-pair $k$ is in LOS}\\ \rho\cdot\left(\left|x_{k}^{(1),t}-x_{k}^{(2),t}\right|+\left|y_{k}^{(1),t}-y_{k}^{(2),t}\right|\right)^{-e}, & \text{if VUE-pair $k$ is in WLOS}\\ \xi\cdot\left(\left|x_{k}^{(1),t}-x_{k}^{(2),t}\right|\cdot\left|y_{k}^{(1),t}-y_{k}^{(2),t}\right|\right)^{-e}, & \text{if VUE-pair $k$ is in NLOS}\end{cases}\tag{4}$$

where $e$ is the path loss coefficient while $\rho$ and $\xi$ are the path loss exponents with $\xi<\rho\cdot(\varphi_{0}/2)^{e}$ .

In order to mitigate the interference during wireless transmissions and maximize the channel utilization, the RSU clusters the VUE-pairs into a set $\mathcal{I}=\{1,\cdots,I\}$ of disjoint groups based on their geographical locations, where $I>1$ . Considering the vehicle mobility, clustering is done every $T$ epochs [9]. The RSU allocates $J$ channels to the $I$ groups, while in each group, we assume that a VUE-pair can be assigned at most one channel and a channel can be assigned to at most one VUE-pair. Let $\mathbf{u}_{k}^{t}=(u_{k,j}^{t}:j\in\mathcal{J})$ denote the channel allocation for a VUE-pair $k\in\mathcal{K}_{i}$ during decision epoch $t$ , where $\mathcal{K}_{i}$ is the set of VUE-pairs in a group $i\in\mathcal{I}$ and

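$$u_{k,j}^{t}=\begin{cases}1, & \text{if channel $j$ is allocated to VUE-pair $k$ during decision epoch $t$;}\\ 0, & \text{otherwise.}\end{cases}$$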

Thus we have

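$$\sum_{j\in\mathcal{J}}u_{k,j}^{t}\leq 1,\ \forall k\in\mathcal{K},\qquad\sum_{k\in\mathcal{K}_{i}}u_{k,j}^{t}\leq 1,\ \forall j\in\mathcal{J},\ \forall i\in\mathcal{I}.$$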

At the vTx of each VUE-pair $k$ , a data queue is maintained to buffer the arriving packets. Let $a_{k}^{t}$ be the random new packet arrivals at epoch $t$ with average arrival rate $\textsf{E}[a_{k}^{t}]=\lambda$ . The queue evolution for VUE-pair $k$ can be expressed as

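$$q_{k}^{t+1}=\max\left\{q_{k}^{t}-r_{k}^{t}\cdot\mathds{1}_{\left\{\sum_{j\in\mathcal{J}}u_{k,j}^{t}=1\right\}},0\right\}+a_{k}^{t},$$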

where $q_{k}^{t}$ and $r_{k}^{t}$ are, respectively, the queue length and the number of packets to depart during decision epoch $t$ , while $\mathds{1}_{\{\Xi\}}$ is an indicator function that equals $1$ if the condition $\Xi$ is satisfied and $0$ otherwise. In this paper, we assume a large enough buffer size to neglect the probability of packet drops. The required transmit power for delivering $r_{k}^{t}\cdot\mathds{1}_{\left\{\sum_{j\in\mathcal{J}}u_{k,j}^{t}=1\right\}}$ packets can be computed as

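$$p_{k}^{t}=\frac{\vartheta+w\cdot\sigma^{2}}{g_{k,j}^{t}}\cdot\left(2^{\frac{\mu\cdot r_{k}^{t}}{w\cdot\delta}}-1\right)\cdot\mathds{1}_{\left\{u_{k,j}^{t}=1\right\}},$$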

where $\vartheta$ is the received interference due to inter-group channel reuse, $w$ is the frequency bandwidth of the channels, $\sigma^{2}$ is the power spectral density of additive background noise, and $\mu$ is the constant size of a data packet.
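As a rough illustration of the queue evolution and the transmit power expression above, the following Python sketch plugs in the Table I values. It assumes that "9 kb" means 9000 bits, in which case $\mu/(w\cdot\delta)=1$ and the exponent reduces to $r_{k}^{t}$. The function names and the channel gain used in the usage note are hypothetical and are not taken from the paper.

```python
# Table I values; units converted to SI. "9 kb" is read as 9000 bits (assumption).
VARTHETA = 2e-9      # aggregate interference (W)
W = 500e3            # channel bandwidth (Hz)
SIGMA2 = 7.95e-21    # noise power spectral density (W/Hz)
DELTA = 18e-3        # decision epoch duration (s)
MU = 9e3             # data packet size (bits)


def next_queue(q, r, a, has_channel):
    """Queue evolution: packets depart only if a channel was allocated."""
    served = r if has_channel else 0
    return max(q - served, 0) + a


def transmit_power(r, g, has_channel):
    """Power (W) needed to deliver r packets over a channel with gain g."""
    if not has_channel:
        return 0.0
    return (VARTHETA + W * SIGMA2) / g * (2 ** (MU * r / (W * DELTA)) - 1)
```

For example, with a hypothetical channel gain of `1e-8`, `transmit_power(1, 1e-8, True)` is about 0.2 W, and scheduling two packets instead of one triples the power, since $(2^{2}-1)/(2^{1}-1)=3$.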

III Problem Description

This section formulates the problem of RRM in the considered V2V network as a discrete-time MDP with a discounted criterion and discusses the general solution.

III-A MDP Formulation

During each decision epoch $t$ , the local state of a VUE-pair $k\in\mathcal{K}$ can be described by $\mathbf{s}_{k}^{t}=(\mathbf{g}_{k}^{t},(\mathbf{x}_{k}^{t},\mathbf{y}_{k}^{t}),q_{k}^{t})\in\mathcal{S}=\mathcal{G}^{J}\times\mathcal{C}\times\mathcal{Q}$ , which includes the information of channel quality $\mathbf{g}_{k}^{t}=(g_{k,j}^{t}:j\in\mathcal{J})$ , geographical location $(\mathbf{x}_{k}^{t},\mathbf{y}_{k}^{t})$ and queue state $q_{k}^{t}$ . We use $\mathbf{s}^{t}=(\mathbf{s}_{k}^{t},\mathbf{s}_{-k}^{t})\in\mathcal{S}^{K}$ to represent the global network state, where $-k$ denotes all the other VUE-pairs in $\mathcal{K}$ without the presence of VUE-pair $k$ . The RSU aims to design a stationary control policy $\bm{\pi}=(\pi_{(u)},\pi_{(r)})$ , where $\pi_{(u)}$ and $\pi_{(r)}$ are, respectively, the channel allocation policy and the packet scheduling policy. Specifically, the RSU observes $\mathbf{s}^{t}$ at the beginning of epoch $t$ and accordingly, makes channel allocation and packet scheduling decisions for the VUE-pairs, that is, $\bm{\pi}(\mathbf{s}^{t})=(\pi_{(u)}(\mathbf{s}^{t}),\pi_{(r)}(\mathbf{s}^{t}))=(\mathbf{u}^{t},\mathbf{r}^{t})$ , where $\mathbf{u}^{t}=(\mathbf{u}_{k}^{t}:k\in\mathcal{K})$ and $\mathbf{r}^{t}=(r_{k}^{t}:k\in\mathcal{K})$ . From the assumptions on the mobility of a VUE-pair, the packet arrivals and the queue evolution, the randomness lying in $\{\mathbf{s}^{t}:t\in\mathds{N}_{+}\}$ is Markovian with the following controlled state transition probability

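$$\mathbb{P}\!\left(\mathbf{s}^{t+1}\mid\mathbf{s}^{t},\bm{\pi}(\mathbf{s}^{t})\right)=\prod_{k\in\mathcal{K}}\mathbb{P}\!\left(\mathbf{g}_{k}^{t+1}\mid(\mathbf{x}_{k}^{t+1},\mathbf{y}_{k}^{t+1})\right)\cdot\mathbb{P}\!\left((\mathbf{x}_{k}^{t+1},\mathbf{y}_{k}^{t+1})\mid(\mathbf{x}_{k}^{t},\mathbf{y}_{k}^{t})\right)\cdot\mathbb{P}\!\left(q_{k}^{t+1}\mid q_{k}^{t},\mathbf{u}_{k}^{t},r_{k}^{t}\right),\tag{13}$$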

where $\mathbb{P}(\cdot)$ denotes the probability of an event.

We need a cost function to trade off the queuing delay against the consumed transmit power for each VUE-pair $k\in\mathcal{K}$ during each decision epoch $t$ , which can be chosen as

$$f_{k}(\mathbf{s}^{t},\mathbf{u}_{k}^{t},r_{k}^{t})=\phi\cdot d(q_{k}^{t})+\eta\cdot p_{k}^{t},\tag{14}$$

where $d(q_{k}^{t})=q_{k}^{t}/\lambda$ , while $\phi$ and $\eta$ are two positive weights. Given a control policy $\bm{\pi}$ and an initial global network state $\mathbf{s}^{1}=\mathbf{s}\in\mathcal{S}^{K}$ , we express the expected long-term cost function $V_{k}(\mathbf{s},\bm{\pi})$ for VUE-pair $k$ as

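$$V_{k}(\mathbf{s},\bm{\pi})=(1-\gamma)\cdot\textsf{E}_{\bm{\pi}}\!\left[\sum_{t=1}^{\infty}(\gamma)^{t-1}\cdot f_{k}(\mathbf{s}^{t},\mathbf{u}_{k}^{t},r_{k}^{t})\mid\mathbf{s}\right],\tag{15}$$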

where $\gamma\in[0,1)$ is the discount factor. As a result, the delay-power tradeoff problem, which the RSU aims to solve, can be formally formulated as an MDP, namely, $\forall\mathbf{s}\in\mathcal{S}^{K}$ ,

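$$\min_{\bm{\pi}}\ V(\mathbf{s},\bm{\pi})=(1-\gamma)\cdot\textsf{E}_{\bm{\pi}}\!\left[\sum_{t=1}^{\infty}(\gamma)^{t-1}\cdot f(\mathbf{s}^{t},\bm{\pi}(\mathbf{s}^{t}))\mid\mathbf{s}\right]\quad\mathrm{s.t.}\ \text{the channel allocation constraints on } u_{k,j}^{t},\tag{16}$$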

where $f(\mathbf{s}^{t},\bm{\pi}(\mathbf{s}^{t}))=\sum_{k\in\mathcal{K}}f_{k}(\mathbf{s}^{t},\mathbf{u}_{k}^{t},r_{k}^{t})$ is the immediate cost accumulated across all the VUE-pairs in the network at a decision epoch $t$ . $V(\mathbf{s},\bm{\pi})$ is also called the state value function in state $\mathbf{s}$ under a policy $\bm{\pi}$ .
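To make the cost structure concrete, here is a minimal Python sketch of the per-VUE-pair immediate cost and of the normalized discounted cost evaluated along a finite sample path. The weights and discount factor are the Table I values. Truncating the infinite sum to a finite list of costs and replacing the expectation by a single realization are simplifying assumptions of this sketch, and the function names are hypothetical.

```python
PHI, ETA = 30.0, 1.0   # weights on delay and power (Table I)
GAMMA = 0.9            # discount factor (Table I)


def immediate_cost(q, p, lam):
    """Weighted sum of the delay term d(q) = q / lambda and the transmit power p."""
    return PHI * (q / lam) + ETA * p


def discounted_cost(costs, gamma=GAMMA):
    """(1 - gamma) * sum over t >= 1 of gamma^(t-1) * cost_t, on a finite path."""
    return (1 - gamma) * sum(gamma ** t * f for t, f in enumerate(costs))
```

The factor $(1-\gamma)$ normalizes the discounted sum: a constant per-epoch cost $c$ gives a long-term cost of approximately $c$, so `discounted_cost([1.0] * 500)` is very close to 1.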

III-B Optimal Solution

The problem formulated in (16) is a typical infinite-horizon discrete-time MDP with a discounted criterion. Denote by $\bm{\pi}^{*}=(\pi_{(u)}^{*},\pi_{(r)}^{*})$ the optimal control policy, which can be obtained by solving Bellman's equation: $\forall\mathbf{s}\in\mathcal{S}^{K}$ ,

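$$V(\mathbf{s})=\min_{\bm{\pi}(\mathbf{s})}\left\{(1-\gamma)\cdot f(\mathbf{s},\bm{\pi}(\mathbf{s}))+\gamma\cdot\sum_{\mathbf{s}^{\prime}\in\mathcal{S}^{K}}\mathbb{P}(\mathbf{s}^{\prime}\mid\mathbf{s},\bm{\pi}(\mathbf{s}))\cdot V(\mathbf{s}^{\prime})\right\},\tag{17}$$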

where $V(\mathbf{s})=V(\mathbf{s},\bm{\pi}^{*})$ is the optimal state value function and $\mathbf{s}^{\prime}\in\mathcal{S}^{K}$ is the resulting global network state at a subsequent epoch. The conventional solutions to (17) based on value or policy iteration [17] require complete knowledge of the network dynamics (13), which is challenging to obtain in practice. Let us define the right-hand side of (17) by

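$$Q(\mathbf{s},\mathbf{u},\mathbf{r})=(1-\gamma)\cdot f(\mathbf{s},\mathbf{u},\mathbf{r})+\gamma\cdot\sum_{\mathbf{s}^{\prime}\in\mathcal{S}^{K}}\mathbb{P}(\mathbf{s}^{\prime}\mid\mathbf{s},\mathbf{u},\mathbf{r})\cdot V(\mathbf{s}^{\prime}),\tag{18}$$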

the $Q$ -function, where $\mathbf{u}=(\mathbf{u}_{k}:k\in\mathcal{K})$ and $\mathbf{r}=(r_{k}:k\in\mathcal{K})$ are the decision makings under $\mathbf{s}$ with $\mathbf{u}_{k}=(u_{k,j}:j\in\mathcal{J})$ . $V(\mathbf{s})$ can then be directly obtained from

$$V(\mathbf{s})=\min_{\mathbf{u},\mathbf{r}}Q(\mathbf{s},\mathbf{u},\mathbf{r}).\tag{19}$$

By substituting (19) back into (18), we have

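$$Q(\mathbf{s},\mathbf{u},\mathbf{r})=(1-\gamma)\cdot f(\mathbf{s},\mathbf{u},\mathbf{r})+\gamma\cdot\sum_{\mathbf{s}^{\prime}\in\mathcal{S}^{K}}\mathbb{P}(\mathbf{s}^{\prime}\mid\mathbf{s},\mathbf{u},\mathbf{r})\cdot\min_{\mathbf{u}^{\prime},\mathbf{r}^{\prime}}Q(\mathbf{s}^{\prime},\mathbf{u}^{\prime},\mathbf{r}^{\prime}),\tag{20}$$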

where $\mathbf{u}^{\prime}=(\mathbf{u}_{k}^{\prime}:k\in\mathcal{K})$ and $\mathbf{r}^{\prime}=(r_{k}^{\prime}:k\in\mathcal{K})$ denote the decision makings under $\mathbf{s}^{\prime}$ with $\mathbf{u}_{k}^{\prime}=(u_{k,j}^{\prime}:j\in\mathcal{J})$ .

Using a state-action-reward-state-action (SARSA) algorithm [18, 17], the RSU tries to learn $Q(\mathbf{s},\mathbf{u},\mathbf{r})$ in a recursive way with observations of the global network state $\mathbf{s}=\mathbf{s}^{t}$ , the decision making $(\mathbf{u},\mathbf{r})=(\mathbf{u}^{t},\mathbf{r}^{t})$ , the realized cost $f(\mathbf{s},\mathbf{u},\mathbf{r})$ at a current decision epoch $t$ and the resulting global network state $\mathbf{s}^{\prime}=\mathbf{s}^{t+1}$ , the decision making $(\mathbf{u}^{\prime},\mathbf{r}^{\prime})=(\mathbf{u}^{t+1},\mathbf{r}^{t+1})$ at the next epoch $t+1$ . The updating rule is given by

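$$Q^{t+1}(\mathbf{s},\mathbf{u},\mathbf{r})=Q^{t}(\mathbf{s},\mathbf{u},\mathbf{r})+\alpha^{t}\cdot\left((1-\gamma)\cdot f(\mathbf{s},\mathbf{u},\mathbf{r})+\gamma\cdot Q^{t}(\mathbf{s}^{\prime},\mathbf{u}^{\prime},\mathbf{r}^{\prime})-Q^{t}(\mathbf{s},\mathbf{u},\mathbf{r})\right),\tag{21}$$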

where $\alpha^{t}\in[0,1)$ is the learning rate. It has been proven that if 1) the network state transition probability under the optimal stationary control policy is stationary, 2) $\sum_{t=1}^{\infty}\alpha^{t}$ is infinite and $\sum_{t=1}^{\infty}(\alpha^{t})^{2}$ is finite, and 3) all state-action pairs are visited infinitely often (which can be satisfied by an $\epsilon$ -greedy strategy [17]), the SARSA learning process converges and finds $\bm{\pi}^{*}$ [19]. However, two challenges remain as follows:

1. from the channel model applied in this work, the global network state space $\mathcal{S}^{K}$ is semi-continuous; and

2. the number $((1+J)\cdot(1+A))^{K}$ of decision makings at the RSU grows exponentially as $K$ increases, where $A$ is the maximum number of packet departures at a vTx, i.e., $a_{k}^{t}\leq A$ , $\forall k\in\mathcal{K}$ and $\forall t\in\mathds{N}_{+}$ .
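For a small, discretized problem, the SARSA update rule in (21) together with an $\epsilon$ -greedy strategy can be sketched in Python as follows. This is a tabular illustration only: it assumes hashable (discretized) states and actions, which is exactly what the two challenges above rule out for the full problem. Since the objective is a cost, the greedy action minimizes the $Q$ -value. The function names are hypothetical.

```python
import random
from collections import defaultdict

GAMMA = 0.9      # discount factor (Table I)
EPSILON = 0.06   # exploration probability (Table I)


def epsilon_greedy(Q, state, actions, rng, epsilon=EPSILON):
    """Explore with probability epsilon; otherwise pick the cost-minimizing action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return min(actions, key=lambda a: Q[(state, a)])


def sarsa_update(Q, s, a, cost, s_next, a_next, alpha, gamma=GAMMA):
    """One SARSA step with the (1 - gamma)-normalized cost used in the paper."""
    td_target = (1 - gamma) * cost + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]


# Q-table with zero initial values; keys are (state, action) pairs.
Q = defaultdict(float)
```

An action here would be a joint pair $(\mathbf{u},\mathbf{r})$ encoded as a tuple, and `rng` is a `random.Random` instance so that exploration can be seeded.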

IV A Deep Reinforcement Learning Approach

We shall address in this section the technical challenges in solving an optimal control policy and derive a deep reinforcement learning algorithm.

IV-A Linear $Q$ -function Decomposition

The centralized decisions made by the RSU are performed by the VUE-pairs in a decentralized way. We hence propose to linearly decompose the $Q$ -function, that is,

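$$Q(\mathbf{s},\mathbf{u},\mathbf{r})=\sum_{k\in\mathcal{K}}Q_{k}(\mathbf{s},\mathbf{u}_{k},r_{k}),\tag{22}$$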

where $Q_{k}(\mathbf{s},\mathbf{u}_{k},r_{k})$ is the per-VUE-pair $Q$ -function for each VUE-pair $k\in\mathcal{K}$ that satisfies

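$$Q_{k}(\mathbf{s},\mathbf{u}_{k},r_{k})=(1-\gamma)\cdot f_{k}(\mathbf{s},\mathbf{u}_{k},r_{k})+\gamma\cdot\sum_{\mathbf{s}^{\prime}\in\mathcal{S}^{K}}\mathbb{P}\!\left(\mathbf{s}^{\prime}\mid\mathbf{s},(\mathbf{u}_{k},\mathbf{u}_{-k}),(r_{k},\mathbf{r}_{-k})\right)\cdot Q_{k}(\mathbf{s}^{\prime},\mathbf{u}_{k}^{\prime},r_{k}^{\prime}),\tag{23}$$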

where the optimal decision making from a VUE-pair $k$ across the time horizon should reflect the optimal control policy implemented by the RSU. In other words, $(\mathbf{u}_{k}^{\prime},r_{k}^{\prime})$ in (23) under the network state $\mathbf{s}^{\prime}$ follows $\bm{\pi}^{*}(\mathbf{s}^{\prime})$ , i.e.,

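$$\bm{\pi}^{*}(\mathbf{s}^{\prime})=\underset{\mathbf{u}^{\prime},\mathbf{r}^{\prime}}{\arg\min}\sum_{k\in\mathcal{K}}Q_{k}(\mathbf{s}^{\prime},\mathbf{u}_{k}^{\prime},r_{k}^{\prime}),\tag{24}$$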

which minimizes the sum of per-VUE-pair $Q$ -function values from all VUE-pairs in the network. Two key advantages of the decomposition approach in (22) are highlighted.

1. Simplified decision makings: The linear decomposition motivates the RSU to let the VUE-pairs submit the local per-VUE-pair $Q$ -functions of the channel allocation and packet scheduling decisions with the global network state observations, based on which the RSU allocates channels and the VUE-pairs then schedule packet transmissions. This reduces $((1+J)\cdot(1+A))^{K}$ centralized decision makings at the RSU to $K\cdot((1+J)\cdot(1+A))$ decentralized decisions for all VUE-pairs.

2. Near optimality: The approach in (22) provides a guarantee on the approximation error of the $Q$ -function [20].
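The reduction in the number of decisions can be checked with a few lines of Python. The values of $K$, $J$ and $A$ in the usage note are hypothetical and chosen only to show the gap between exponential and linear growth.

```python
def centralized_decisions(K, J, A):
    """Joint decisions at the RSU: ((1 + J) * (1 + A)) ** K."""
    return ((1 + J) * (1 + A)) ** K


def decentralized_decisions(K, J, A):
    """Per-VUE-pair decisions after the linear decomposition: K * (1 + J) * (1 + A)."""
    return K * (1 + J) * (1 + A)
```

For instance, with $K=3$, $J=2$ and $A=1$, the centralized count is $6^{3}=216$ while the decentralized count is $3\cdot 6=18$; each additional VUE-pair multiplies the former by 6 but only adds 6 to the latter.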

IV-B Learning the Optimal Control Policy

In spite of the advantages brought by the linear decomposition approach in (22), a new challenge arises. That is, each VUE-pair $k\in\mathcal{K}$ can only obtain a partial observation $(\mathbf{s}_{k}^{t},\mathbf{o}_{k}^{t})$ of the global network state $\mathbf{s}^{t}$ at each decision epoch $t$ . In this work, we assume that when VUE-pair $k$ was in a group $i_{k}^{t-1}\in\mathcal{I}$ (i.e., $k\in\mathcal{K}_{i_{k}^{t-1}}$ ) during the previous decision epoch $t-1$ , $\mathbf{o}_{k}^{t}=(i_{k}^{t-1},b_{i_{k}^{t-1}}^{t-1},\bm{\upsilon}_{i_{k}^{t-1}}^{t-1})\in\mathcal{O}$ includes the group index $i_{k}^{t-1}$ and the number $b_{i_{k}^{t-1}}^{t-1}$ of VUE-pairs as well as the channel utilization state $\bm{\upsilon}_{i_{k}^{t-1}}^{t-1}=(\upsilon_{i_{k}^{t-1},j}^{t-1}:j\in\mathcal{J})$ in group $i_{k}^{t-1}$ , where $\upsilon_{i_{k}^{t-1},j}^{t-1}$ equals $1$ if channel $j\in\mathcal{J}$ is utilized in group $i_{k}^{t-1}$ at epoch $t-1$ and $0$ otherwise. Note that $\mathbf{o}_{k}^{t}$ is restricted to local group information since the decision makings across different groups are independent.

With the local observation $(\mathbf{s}_{k},\mathbf{o}_{k})\in\mathcal{S}\times\mathcal{O}$ at a current decision epoch, we abstract the per-VUE-pair $Q$ -function (23) of each VUE-pair $k\in\mathcal{K}$ as [20]

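$$Q_{k}(\mathbf{s},\mathbf{u}_{k},r_{k})\approx Q_{k}(\mathbf{s}_{k},\mathbf{o}_{k},\mathbf{u}_{k},r_{k}).\tag{25}$$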

The semi-continuity in $\mathcal{S}$ and the high dimensionality in $\mathcal{O}$ make it infeasible for the conventional SARSA algorithm (21) to learn the per-VUE-pair $Q$ -function $Q_{k}(\mathbf{s}_{k},\mathbf{o}_{k},\mathbf{u}_{k},r_{k})$ , $\forall k\in\mathcal{K}$ . Moreover, from the assumptions made in this paper and the definition of the cost function (14), there exists homogeneity in the VUE-pair behaviours. Inspired by the success of modelling the $Q$ -function with a deep neural network (DNN) [12], we adopt a common double deep $Q$ -network (DQN) to approximate $Q_{k}(\mathbf{s}_{k},\mathbf{o}_{k},\mathbf{u}_{k},r_{k})$ [15, 21]. On the other hand, the accuracy of (25) from the observations can be, in general, arbitrarily bad. As in [22], we propose to add an LSTM layer [14] to the DQN and obtain a hybrid DNN to learn a better control policy in a partially observable V2V network. Specifically, let $Q_{k}(\mathbf{s},\mathbf{u}_{k},r_{k})$ $\approx Q_{k}(\mathcal{N}_{k},\mathbf{u}_{k},r_{k};\bm{\theta})$ , $\forall k\in\mathcal{K}$ , where $\mathcal{N}_{k}$ denotes a set of the most recent $N$ local observations up to the current decision epoch (which will be specified later in this subsection) and is taken as an input to the LSTM layer for a more accurate prediction of $\mathbf{s}$ , while $\bm{\theta}$ denotes a vector of parameters associated with the hybrid DNN. Our proposed LSTM based deep reinforcement learning (LSTM-DRL) algorithm for long-term delay-power tradeoff in the considered V2V network is illustrated in Fig. 2, in which, instead of finding the per-VUE-pair $Q$ -function, the parameters of the hybrid DNN can be trained centrally at the RSU.

For online training of the LSTM-DRL algorithm, at each decision epoch $t$ , the RSU updates the replay memory $\mathcal{M}$ with the most recent $M$ experiences $\{\mathbf{m}^{t-M+1},\cdots,\mathbf{m}^{t}\}$ with each experience $\mathbf{m}^{t-m+1}$ ( $\forall m\in\{1,\cdots,M\}$ ) being given by

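$$\mathbf{m}^{t-m+1}=\left(\left((\mathbf{s}_{k}^{t-m},\mathbf{o}_{k}^{t-m}),(\mathbf{u}_{k}^{t-m},r_{k}^{t-m}),f_{k}(\mathbf{s}^{t-m},\mathbf{u}_{k}^{t-m},r_{k}^{t-m}),(\mathbf{s}_{k}^{t-m+1},\mathbf{o}_{k}^{t-m+1}),(\mathbf{u}_{k}^{t-m+1},r_{k}^{t-m+1})\right):k\in\mathcal{K}\right).$$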

Meanwhile, an observation pool $\mathcal{N}^{t}=\cup_{k\in\mathcal{K}}\mathcal{N}_{k}^{t}=\{\mathbf{n}^{t-N+1},$ $\cdots,\mathbf{n}^{t}\}$ , the information of which is collected from all VUE-pairs, is kept to predict the global network state $\mathbf{s}^{t}$ at epoch $t$ for control policy evaluation, where $\mathbf{n}^{t}=\{\mathbf{n}_{k}^{t}=(\mathbf{s}_{k}^{t},\mathbf{o}_{k}^{t}):k\in\mathcal{K}\}$ . To train the hybrid DNN parameters, the RSU first randomly samples a mini-batch $\widetilde{\mathcal{M}}^{t}=\{\widetilde{\mathcal{M}}^{t_{1}},\cdots,\widetilde{\mathcal{M}}^{t_{\widetilde{M}}}\}$ of size $\widetilde{M}$ from $\mathcal{M}^{t}$ , where $\forall m\in\{1,\cdots,\widetilde{M}\}$ ,

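$$\widetilde{\mathcal{M}}^{t_{m}}=\left\{\left(\mathcal{N}_{k}^{t_{m}},(\mathbf{u}_{k}^{t_{m}},r_{k}^{t_{m}}),f_{k}(\mathbf{s}^{t_{m}},\mathbf{u}_{k}^{t_{m}},r_{k}^{t_{m}}),\mathcal{N}_{k}^{t_{m}+1},(\mathbf{u}_{k}^{t_{m}+1},r_{k}^{t_{m}+1})\right):k\in\mathcal{K}\right\},$$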

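with $\mathcal{N}_{k}^{t_{m}}=\{\mathbf{n}_{k}^{t_{m}-N+1},\cdots,\mathbf{n}_{k}^{t_{m}}\}$ . Then the set $\bm{\theta}^{t}$ of parameters at epoch $t$ is updated by minimizing the accumulative loss function, which is defined as

$$L(\bm{\theta}^{t})=\textsf{E}_{\left\{\left(\left(\mathcal{N}_{k},(\mathbf{u}_{k},r_{k}),f_{k}(\mathbf{s},\mathbf{u}_{k},r_{k}),\mathcal{N}_{k}^{\prime},(\mathbf{u}_{k}^{\prime},r_{k}^{\prime})\right):k\in\mathcal{K}\right)\in\widetilde{\mathcal{M}}^{t}\right\}}\!\left[\left(\sum_{k\in\mathcal{K}}\left((1-\gamma)\cdot f_{k}(\mathbf{s},\mathbf{u}_{k},r_{k})+\gamma\cdot Q_{k}\!\left(\mathcal{N}_{k}^{\prime},\mathbf{u}_{k}^{\prime},r_{k}^{\prime};\bm{\theta}_{-}^{t}\right)-Q_{k}\!\left(\mathcal{N}_{k},\mathbf{u}_{k},r_{k};\bm{\theta}^{t}\right)\right)\right)^{2}\right],$$

where $\bm{\theta}_{-}^{t}$ is the set of parameters of the target hybrid DNN at a certain previous decision epoch before epoch $t$ . The gradient is calculated as

$$\nabla_{\bm{\theta}^{t}}L(\bm{\theta}^{t})=\textsf{E}_{\left\{\left(\left(\mathcal{N}_{k},(\mathbf{u}_{k},r_{k}),f_{k}(\mathbf{s},\mathbf{u}_{k},r_{k}),\mathcal{N}_{k}^{\prime},(\mathbf{u}_{k}^{\prime},r_{k}^{\prime})\right):k\in\mathcal{K}\right)\in\widetilde{\mathcal{M}}^{t}\right\}}\!\left[\sum_{k\in\mathcal{K}}\left((1-\gamma)\cdot f_{k}(\mathbf{s},\mathbf{u}_{k},r_{k})+\gamma\cdot Q_{k}\!\left(\mathcal{N}_{k}^{\prime},\mathbf{u}_{k}^{\prime},r_{k}^{\prime};\bm{\theta}_{-}^{t}\right)-Q_{k}\!\left(\mathcal{N}_{k},\mathbf{u}_{k},r_{k};\bm{\theta}^{t}\right)\right)\cdot\nabla_{\bm{\theta}^{t}}\!\left(\sum_{k\in\mathcal{K}}Q_{k}\!\left(\mathcal{N}_{k},\mathbf{u}_{k},r_{k};\bm{\theta}^{t}\right)\right)\right].$$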
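The structure of the accumulative loss can be illustrated with a short Python sketch in which the online and target networks are replaced by plain callables. This is an assumption made for readability: `q_online` and `q_target` are hypothetical stand-ins for the hybrid DNN with parameters $\bm{\theta}^{t}$ and $\bm{\theta}_{-}^{t}$ , and the expectation over the mini-batch is approximated by a sample mean. Note that the temporal-difference errors are summed over all VUE-pairs before squaring.

```python
GAMMA = 0.9  # discount factor (Table I)


def summed_td_error(experience, q_online, q_target, gamma=GAMMA):
    """Sum over VUE-pairs of the TD error for one sampled experience.

    experience: one tuple per VUE-pair,
    (obs_history, action, cost, next_obs_history, next_action).
    """
    return sum(
        (1 - gamma) * cost
        + gamma * q_target(next_obs, next_action)
        - q_online(obs, action)
        for obs, action, cost, next_obs, next_action in experience
    )


def minibatch_loss(minibatch, q_online, q_target, gamma=GAMMA):
    """Sample-mean approximation of the squared summed TD error."""
    errors = [summed_td_error(e, q_online, q_target, gamma) for e in minibatch]
    return sum(err * err for err in errors) / len(errors)
```

In an actual implementation, a deep learning framework would differentiate this loss with respect to the online parameters only, while the target parameters are held fixed and refreshed periodically.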

We summarize in Algorithm 1 the online training of the proposed LSTM-DRL algorithm.
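Since Algorithm 1 itself is not reproduced in this text, the following Python sketch only illustrates the two data structures the training procedure relies on: a replay memory holding the most recent $M$ experiences and an observation pool holding the most recent $N$ observations. The class and function names are hypothetical; the default sizes are the Table I values.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity memory that keeps only the most recent experiences."""

    def __init__(self, capacity=5000, seed=None):
        self._buffer = deque(maxlen=capacity)  # oldest entries are dropped first
        self._rng = random.Random(seed)

    def push(self, experience):
        self._buffer.append(experience)

    def sample(self, batch_size=200):
        """Uniformly sample a mini-batch without replacement."""
        k = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), k)

    def __len__(self):
        return len(self._buffer)


def make_observation_pool(size=20):
    """Pool of the most recent local observations fed to the LSTM layer."""
    return deque(maxlen=size)
```

At each decision epoch, the RSU would push the newest experience, append the newest observations to the pool, sample a mini-batch and take one gradient step on the loss.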

V Simulation Results

This section evaluates the performance of our proposed algorithm through numerical simulations based on TensorFlow [23]. We simulate a $250\times 250$ m$^{2}$ Manhattan mobility model with nine intersections [9, 11]. In the model, a road consists of two lanes, each of which is in one direction and is of width $4$ m. The average vehicle speed is set to be $60$ km/h, and the vehicle grouping is performed by means of spectral clustering [11]. We list other parameter values used in simulations in Table I. For performance comparison, the following three baseline algorithms are simulated as well.

1. Channel-Aware: At each decision epoch, the RSU allocates the channels to VUE-pairs in each group based on the channel quality states.

2. Queue-Aware: Different from the Channel-Aware algorithm, the RSU allocates at each decision epoch the channels to VUE-pairs in each group according to the queue lengths.

3. Random: Across the decision epochs, the RSU randomly allocates the channels to a set of randomly picked VUE-pairs in each group.

When implementing these baselines, the RSU schedules packets to minimize the immediate cost for each VUE-pair.

V-A Convergence Property of the Proposed Algorithm

This simulation examines the convergence property of the online training of our LSTM-DRL algorithm. We select $K=36$ VUE-pairs with an average packet arrival rate $\lambda=1$, and the distance between the vTx and the vRx of each VUE-pair is fixed at $\varphi=20$. Fig. 3 plots the loss function defined by (IV-B) over the learning time horizon, which shows that convergence takes around $3\cdot 10^{4}$ decision epochs. Since the training is performed centrally at the RSU, each VUE-pair only needs to periodically replace its set $\bm{\theta}$ of parameters of the LSTM-DRL algorithm with a new one from the RSU.

V-B Performance under Various Simulation Settings

We further verify the average cost performance per VUE-pair across the time horizon under different simulation settings. First, we configure a networking environment with $\lambda=2$ and $\varphi=20$. In Fig. 4(a), we depict the realized average cost performance versus $K$, which shows that the average cost per VUE-pair of all four algorithms increases as the number of VUE-pairs increases. This is because a larger number of VUE-pairs leaves each VUE-pair with less chance of being allocated a channel. Next, we assume there are $K=52$ VUE-pairs in the network and $\varphi=35$. The average cost performance per VUE-pair for increasing values of $\lambda$ is shown in Fig. 4(b). With more packets arriving into the queues, more power is consumed for the packet transmissions in order to maintain queue stability. Hence all four algorithms exhibit worse performance. Finally, we illustrate in Fig. 4(c) the average cost performance per VUE-pair when the value of $\varphi$ varies. As the distance between the vTx and the vRx of a VUE-pair increases, the channel quality drops. More transmit power is then needed to transmit the same number of packets, which conforms to what we see from the curves in Fig. 4(c). Importantly, in all three simulations above, our proposed algorithm achieves the best performance, demonstrating the feasibility of a better delay-power tradeoff compared with the other three baselines.

VI Conclusions

In this paper, we investigate the RRM for an expected long-term delay-power tradeoff in a V2V communication network. The RSU allocates channels and schedules packet transmissions for all VUE-pairs according to the observations of global network states over the discrete time horizon. This kind of decision-making process falls naturally into the realm of an MDP. The technical challenges in solving an optimal control policy for the MDP motivate us to first decompose the MDP into a series of per-VUE-pair MDPs with much simplified decision making. To overcome the curse of high dimensionality in the state space of a per-VUE-pair MDP, we resort to the DQN technique and propose an online LSTM-DRL algorithm. The LSTM-DRL algorithm enables decentralized channel allocation and packet scheduling decisions with only partial local network state observations from the VUE-pairs and without a priori statistical knowledge of network dynamics. Numerical simulations show gains in average cost performance from the proposed learning algorithm over the three baselines.

Bibliography

[1] S. Kuutti et al., “A survey of the state-of-the-art localization techniques and their potentials for autonomous vehicle applications,” IEEE Internet Things J., vol. 5, no. 2, pp. 829–846, Mar. 2018.
[2] Y. Dai, D. Xu, S. Maharjan, G. Qiao, and Y. Zhang, “Artificial intelligence empowered edge computing and caching for internet of vehicles,” IEEE Wireless Commun. Mag., accepted, 2019.
[3] K. Zhang, S. Leng, X. Peng, P. Li, S. Maharjan, and Y. Zhang, “Artificial intelligence inspired transmission scheduling in cognitive vehicular communications and networks,” IEEE Internet Things J., Early Access Article, 2018.
[4] M. Amadeo, C. Campolo, and A. Molinaro, “Information-centric networking for connected vehicles: A survey and future perspectives,” IEEE Commun. Mag., vol. 54, no. 2, pp. 98–104, Feb. 2016.
[5] K. Zheng, Q. Zheng, P. Chatzimisios, W. Xiang, and Y. Zhou, “Heterogeneous vehicular networking: A survey on architecture, challenges, and solutions,” IEEE Commun. Surveys Tuts., vol. 17, no. 4, pp. 2377–2396, Q4 2015.
[6] W. Sun, E. G. Ström, F. Brännström, K. C. Sou, and Y. Sui, “Radio resource management for D2D-based V2V communication,” IEEE Trans. Veh. Technol., vol. 65, no. 8, pp. 6636–6650, Aug. 2016.
[7] Y. Yao, X. Chen, L. Rao, X. Liu, and X. Zhou, “LORA: Loss differentiation rate adaptation scheme for vehicle-to-vehicle safety communications,” IEEE Trans. Veh. Technol., vol. 66, no. 3, pp. 2499–2512, Mar. 2017.
[8] E. Egea-Lopez and P. Pavon-Mariño, “Distributed and fair beaconing rate adaptation for congestion control in vehicle network,” IEEE Trans. Mobile Comput., vol. 15, no. 12, pp. 3028–3041, Dec. 2016.

IV-A Linear $Q$ -function Decomposition