Deep Reinforcement Learning for UAV Navigation Through Massive MIMO Technique

Hongji Huang; Yuchun Yang; Hong Wang; Zhiguo Ding; Hikmet Sari; Fumiyuki Adachi

arXiv:1901.10832·eess.SP·November 28, 2019·IEEE Trans. Veh. Technol.

TL;DR

This paper introduces a deep reinforcement learning approach using a deep Q-network to improve UAV navigation in massive MIMO environments, enhancing real-time decision-making and link selection.

Contribution

It presents a novel deep Q-network based method for UAV navigation that effectively captures UAV motion and optimizes link selection in real time.

Findings

01 Enhanced coverage and convergence in simulations

02 Superior performance over existing schemes

03 Effective decision-making based on received signal strengths

Equations

$$\mathbf{h}_{k}(t)=\int_{\varphi_{\textup{min}}}^{\varphi_{\textup{max}}}\mathbf{a}^{t}(\varphi_{k}(t))\,g_{k}^{t}(\varphi_{k}(t))\,d\varphi_{k}(t),$$

$$\mathbf{a}^{t}(\varphi_{k}(t))=\frac{1}{N_{t}}\left[1,\,e^{-j\frac{2\pi d}{\lambda}\sin\varphi_{k}(t)},\,\cdots,\,e^{-j\frac{2\pi d}{\lambda}(N_{t}-1)\sin\varphi_{k}(t)}\right]^{T},$$

$$E\left\{g_{k}^{t}(\varphi_{k}(t))\,g_{k}^{*,t}(\varphi_{k}(t^{\prime}))\right\}=\gamma^{k}\,\upsilon_{k}^{t}(\varphi_{k}(t))\,\delta(\varphi_{k}(t)-\varphi_{k}(t^{\prime})),$$

$$\eta_{k}(t)=\frac{P_{\textup{tr}}\left|\mathbf{h}_{k}^{H}(t)\mathbf{w}_{k}\right|^{2}}{P_{\textup{tr}}\sum_{j\neq k,\,1\leq j\leq K}\left|\mathbf{h}_{k}^{H}(t)\mathbf{w}_{j}\right|^{2}+\sigma_{k}^{2}},$$

$$Q^{\pi}(s,a)=R(s,a)+\tau\sum_{s^{\prime}\in S}P_{ss^{\prime}}V^{\pi}(s^{\prime}),$$

$$R(s,a;t)=\sum_{t_{0}=0}^{T}\tau(t_{0})\,r(t-t_{0}),$$

$$r(t)=\begin{cases}\alpha\eta_{k}(t), & \eta_{k}(t)>\eta_{0},\\ -1, & \eta_{k}(t)\leq\eta_{0}.\end{cases}$$

$$Q^{\pi^{*}}(s,a)=E\left[r+\tau\max_{a^{\prime}}Q^{\pi^{*}}(s^{\prime},a^{\prime})\mid s,a\right],$$

$$V^{\pi^{*}}(s)=\max_{a\in A}\left[Q^{\pi^{*}}(s,a)\right].$$

$$Q_{t+1}(s,a)=Q_{t}(s,a)+\beta\left(r+\tau\left[\max_{a^{\prime}}Q_{t}(s^{\prime},a^{\prime})\right]-Q_{t}(s,a)\right),$$

$$y=r+\tau\max_{a^{\prime}}Q_{t}(s^{\prime},a^{\prime};\omega_{j}),$$

$$\textup{loss}(\omega)=E\left[(y-Q(s,a;\omega))^{2}\right].$$

$$r_{m}=\frac{F_{m}}{T},$$

Full text

Hongji Huang, Yuchun Yang, Hong Wang, Zhiguo Ding, Hikmet Sari, and Fumiyuki Adachi

H. Huang, H. Wang, and H. Sari are with Key Lab of Broadband Wireless Communication and Sensor Network Technology, Ministry of Education, Nanjing University of Posts and Telecommunications, Nanjing 210003, China. H. Sari is also with Sequans Communications, 92700 Colombes, France. Y. Yang is with Jilin University, Changchun 130012, China. Z. Ding is with School of Electrical and Electronic Engineering, University of Manchester, Manchester, M13 9PL, U.K. F. Adachi is with Wireless Signal Processing Research Group, Research Organization of Electrical Communication (ROEC), Tohoku University, Sendai 980-8577, Japan.

Abstract

Unmanned aerial vehicle (UAV) technology has been recognized as a promising solution for future wireless connectivity from the sky, and UAV navigation is one of its most significant open research problems, attracting wide interest in the research community. However, current UAV navigation schemes are unable to capture the UAV motion and select the best UAV-ground links in real time, and these weaknesses severely limit navigation performance. To tackle these fundamental limitations, in this paper we merge state-of-the-art deep reinforcement learning with UAV navigation through the massive multiple-input multiple-output (MIMO) technique. Specifically, we carefully design a deep Q-network (DQN) that optimizes UAV navigation by selecting the optimal policy, and we then propose a learning mechanism for training the DQN. The DQN is trained so that the agent can make navigation decisions for the UAVs based on the received signal strengths, with the aid of Q-learning. Simulation results corroborate the superiority of the proposed scheme over other schemes in terms of coverage and convergence.

Index Terms:

Massive multiple-input multiple-output (MIMO), deep reinforcement learning, UAV navigation

I Introduction

Future communication networks must cope not only with growing throughput demands and explosive traffic, but also with energy consumption, ultra-reliability, and highly diversified applications with heterogeneous quality-of-service (QoS) requirements [1]. Therefore, wireless connectivity techniques have drawn wide attention in both academia and industry, and several emerging techniques have been proposed, such as balloons [2] and unmanned aerial vehicles (UAVs) [3]. In particular, thanks to their wide applicability, high mobility, and superior line-of-sight (LoS) propagation, UAVs have great potential to serve as airborne nodes such as relays and terminals, and they are therefore considered an essential part of future networks.

In recent years, a large body of work has been devoted to enhancing the performance of UAV-enabled communications. In [4], the authors proposed a UAV-enabled data collection system for optimizing energy consumption, in which a UAV is assigned to collect data from a ground terminal at a fixed location. To deal with the endurance problem, a novel scheme that leverages proactive caching at the users was provided in [5], and numerical results demonstrated that proactive caching is a good candidate for resolving the endurance issue in UAV-based systems. Taking advantage of the massive multiple-input multiple-output (MIMO) technique, which can boost system capacity, a massive MIMO-based UAV cellular system was explored in [6], and this approach can increase the reception reliability.

Due to the high mobility of UAVs, UAV navigation is an important technique that has already been applied in public safety, emergency rescue, and search operations. In [7], an evolutionary scheme was proposed that integrates the classic genetic algorithm into a breeder genetic algorithm, but this method cannot realize reliable UAV navigation because of the randomness of the genetic algorithm and its complicated implementation. With the aid of sensors, the authors of [8] presented an autonomous UAV navigation approach (TF-UAV) to address simultaneous localization and mapping, but it requires robots, which hinders its flexibility. When the received signal strength indicator (RSSI) is used for UAV navigation, deep fades degrade the performance unless newer methods are introduced. Moreover, UAVs move over a wide range, and making them operate automatically is of great significance for improving their communication coverage. Conventional algorithms cannot meet the growing coverage demand, because UAV movements require much energy and slow navigation deteriorates the coverage performance. Recently, machine learning has been exploited in wireless communication, and deep learning-based methods, whose performance has been corroborated in non-orthogonal multiple access (NOMA) [9], massive MIMO [10, 11], traffic control [12, 13], routing techniques [14], software-defined networking (SDN) [15], UAVs [16, 17], and millimeter-wave (mmWave) communication [18], provide an alternative means of optimizing UAV navigation. In particular, [17] proposed a deep learning-based method for UAV navigation that does not require sensing data providing mapping information; however, this method cannot converge quickly, which makes it difficult to apply in real-time navigation scenarios. Reinforcement learning, a branch of machine learning that can address model-free problems by leveraging past observations and rewards, is a state-of-the-art method for producing control policies over an action space. It should be pointed out that the number of action spaces is determined by the complexity of the state. In 2015, DeepMind proposed a reinforcement learning framework called the deep Q-network (DQN) [19], which integrates deep learning into Q-learning. DQN is a promising tool for multi-agent optimization problems such as UAV navigation.

Inspired by the above considerations, in this paper, we incorporate the deep reinforcement learning technique into the UAV navigation through the massive MIMO. The main contributions of this paper are listed as follows.

First, we employ the deep reinforcement learning technique to achieve UAV navigation through massive MIMO. By constructing a DQN [19], we obtain the optimal location selection policy based on the received signal strengths. Unlike previous works, which mainly use the speed or geographic position for UAV navigation, the proposed method converges quickly and also achieves good coverage performance.

Second, based on the developed DQN, we propose an efficient deep learning-based scheme for optimizing UAV navigation performance. After training the DQN, an environment simulator is developed for UAV navigation with better coverage and faster convergence. Furthermore, extensive numerical results are provided to verify the superior performance of the proposed DQN navigation scheme.

II System Model

Consider a massive MIMO system that comprises one mobile base station (BS) with $N_{t}$ antennas and $K$ single-antenna UAVs. According to the well-known ray-tracing-based wireless channel model, the channel of the $k$-th UAV at the $t$-th time slot is formulated as

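$$\mathbf{h}_{k}(t)=\int_{\varphi_{\textup{min}}}^{\varphi_{\textup{max}}}\mathbf{a}^{t}(\varphi_{k}(t))\,g_{k}^{t}(\varphi_{k}(t))\,d\varphi_{k}(t),$$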

where $\mathbf{a}^{t}(\varphi_{k}(t))$ and $g_{k}^{t}(\varphi_{k}(t))$ represent the array response and the complex gain coefficient of the $k$-th UAV at time slot $t$, respectively. Specifically, $\varphi_{k}(t)$ denotes the incidence angle of the $k$-th UAV, and $\varphi_{\textup{min}}$ and $\varphi_{\textup{max}}$ are its minimum and maximum values, respectively. Here, $\mathbf{a}^{t}(\varphi_{k}(t))$ can be written as

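$$\mathbf{a}^{t}(\varphi_{k}(t))=\frac{1}{N_{t}}\left[1,\,e^{-j\frac{2\pi d}{\lambda}\sin\varphi_{k}(t)},\,\cdots,\,e^{-j\frac{2\pi d}{\lambda}(N_{t}-1)\sin\varphi_{k}(t)}\right]^{T},$$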

Here, $d$ is the antenna spacing, while $\lambda$ represents the carrier wavelength. The autocorrelation of the gain coefficient of the $k$-th UAV can be expressed as

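$$E\left\{g_{k}^{t}(\varphi_{k}(t))\,g_{k}^{*,t}(\varphi_{k}(t^{\prime}))\right\}=\gamma^{k}\,\upsilon_{k}^{t}(\varphi_{k}(t))\,\delta(\varphi_{k}(t)-\varphi_{k}(t^{\prime})),$$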

where $\gamma^{k}$ represents the received signal power at UAV $k$, and $\upsilon_{k}^{t}(\varphi_{k}(t))$ is the power azimuth spectrum at time slot $t$, which describes the power distribution of the channel in the angle domain. Note that $\gamma^{k}$ often experiences severe deep fades, leading to serious errors in UAV navigation. Therefore, we provide a deep reinforcement learning-based scheme that addresses this issue to boost the UAV navigation performance. Besides, $\delta(\cdot)$ denotes the Dirac delta function, which satisfies $\int_{-\infty}^{+\infty}\delta(\varphi)d\varphi=1$. Furthermore, the signal-to-interference-plus-noise ratio (SINR) $\eta_{k}$ at the $k$-th UAV is formulated as

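$$\eta_{k}(t)=\frac{P_{\textup{tr}}\left|\mathbf{h}_{k}^{H}(t)\mathbf{w}_{k}\right|^{2}}{P_{\textup{tr}}\sum_{j\neq k,\,1\leq j\leq K}\left|\mathbf{h}_{k}^{H}(t)\mathbf{w}_{j}\right|^{2}+\sigma_{k}^{2}},$$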

where $P_{\textup{tr}}$ is the transmitted power at the BS, and $\mathbf{w}_{k}$ is the unit-norm vector for the $k$-th UAV. We also assume that the system is corrupted by additive white Gaussian noise (AWGN) with zero mean and variance $\sigma_{k}^{2}$. The UAV navigation is based on the RSSI, and $\eta_{k}(t)$ is used to compute the immediate reward, as described in the next section. We assume that the maximum communication range of each UAV is $R$, and the coverage range $R_{c}$ is defined as the communication range for the ground users when the UAV is in flight, with $R_{c}\leq R$.
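To make the SINR expression concrete, the following minimal sketch evaluates it for a toy setup. It is an illustration under stated assumptions, not the paper's simulator: each channel is taken to be a single path equal to the array response, the vectors $\mathbf{w}_{k}$ are matched-filter beams, and all numeric values and function names are hypothetical.

```python
import cmath
import math


def array_response(phi, n_t, d_over_lambda=0.5):
    """Array response a(phi) of an n_t-element uniform linear array."""
    phase = -2.0 * math.pi * d_over_lambda * math.sin(phi)
    # The 1/N_t scaling follows the array response as printed in the paper.
    return [cmath.exp(1j * phase * n) / n_t for n in range(n_t)]


def inner(h, w):
    """Inner product h^H w of two equal-length complex vectors."""
    return sum(hi.conjugate() * wi for hi, wi in zip(h, w))


def unit_norm(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(abs(x) ** 2 for x in v))
    return [x / norm for x in v]


def sinr(k, channels, beams, p_tr, noise_var):
    """SINR of UAV k: desired power over interference plus noise."""
    h_k = channels[k]
    signal = p_tr * abs(inner(h_k, beams[k])) ** 2
    interference = p_tr * sum(
        abs(inner(h_k, beams[j])) ** 2 for j in range(len(beams)) if j != k
    )
    return signal / (interference + noise_var)


# Toy setup (hypothetical values): 8 antennas, 2 UAVs, single-path channels.
n_t = 8
angles = [0.3, -0.7]
channels = [array_response(phi, n_t) for phi in angles]
beams = [unit_norm(h) for h in channels]  # matched-filter beams (assumption)
eta = [sinr(k, channels, beams, p_tr=20.0, noise_var=1e-3) for k in range(2)]
```

With a single UAV the interference sum is empty, so the SINR reduces to the signal-to-noise ratio; adding the second UAV can only lower it.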

III DQN-based UAV Navigation Framework

In this section, we provide a deep reinforcement learning-based framework for UAV navigation through the massive MIMO. To be specific, we first develop a DQN framework, and then formulate a learning policy to train the developed network. Furthermore, we propose an efficient deep reinforcement learning-based strategy for UAV navigation.

III-A Deep Q-network

As a branch of machine learning, reinforcement learning has attracted great attention in academia and industry. To reach the best outcome, multiple agents interact with the environment and search for the optimal strategy with the maximum reward. Generally speaking, reinforcement learning can be regarded as a specific formulation of Markov decision processes (MDPs). It comprises four elements: a policy, a reward signal, an environment, and a utility function, and it is a good candidate for handling high-complexity situations and capturing realistic scenarios.

However, conventional reinforcement learning requires the agents to derive appropriate representations of the environment from high-dimensional input and to generalize past knowledge to new states. Meanwhile, its applicability only covers low-dimensional domains where the features can be fully exploited. To close these gaps, the DQN, which integrates deep neural networks into reinforcement learning, has been proposed, and deep reinforcement learning has become a remarkable tool for handling complex problems. Therefore, we introduce the DQN to optimize UAV navigation.

In the proposed DQN framework, since we assume there are 32 UAVs in the system, the input layer is a $32\times 32\times 4$ space. The first hidden layer is a convolutional (conv.) layer with 8 filters of size $4\times 4$ and stride 2, followed by a rectifier nonlinearity. The second hidden layer is a conv. layer with 16 filters of size $2\times 2$ and stride 2, which reduces the dimension to limit complexity without losing important information. The next layer is also a conv. layer, with 16 filters of size $3\times 3$ and stride 1. The remaining hidden layer is a fully-connected (FC) layer with 256 neurons. Finally, the output layer is an FC layer that provides the valid actions in the UAV navigation optimization.
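As a sanity check on the layer sizes described above, the short sketch below walks the feature-map dimensions through the three conv. layers. It assumes unpadded ("valid") convolutions, which the text does not state, so the resulting sizes are illustrative only.

```python
def conv_out(size, kernel, stride):
    """Output width of an unpadded ("valid") convolution."""
    return (size - kernel) // stride + 1


# (name, number of filters, kernel width, stride) as listed in the text.
conv_layers = [("conv1", 8, 4, 2), ("conv2", 16, 2, 2), ("conv3", 16, 3, 1)]

size, channels = 32, 4  # 32 x 32 x 4 input
shapes = []
for name, filters, kernel, stride in conv_layers:
    size = conv_out(size, kernel, stride)
    channels = filters
    shapes.append((name, size, size, channels))

# Number of features entering the 256-neuron FC layer after flattening.
flat_features = size * size * channels
```

Under the no-padding assumption the maps shrink from 32 to 15, 7, and 5 pixels per side, so the FC layer would see 400 input features; a padded variant would give different numbers.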

III-B Learning Policy

To enable UAV navigation, a novel learning policy is proposed based on the developed DQN. First, the state space $S$ represents the received signal strengths, and this set is formulated as $S=\{P_{R}^{k}<-120\,\textup{dBm},\,-120\,\textup{dBm}\leq P_{R}^{k}\leq-40\,\textup{dBm},\,P_{R}^{k}\geq-40\,\textup{dBm}\mid\forall k\}$. Given the state space $S$, let $R$, $P$, and $V$ denote the mean value of the immediate reward, the transition probability, and the utility function, respectively. The $Q$-function is then expressed by

$$Q^{\pi}(s,a)=R(s,a)+\tau\sum_{s^{\prime}\in S}P_{ss^{\prime}}V^{\pi}(s^{\prime}),$$

where $\pi$ denotes the policy, and our goal is to obtain the best policy $\pi^{\ast}$. Also, $s$ and $a$ represent the state and action, respectively. Concretely, the action $a$ is performed through the environment simulator, and it updates its state and its reward based on the information from the BS. Furthermore, $\tau$ is the discount factor with $0<\tau<1$, while $S$ represents the state space. The future reward obtained at time slot $t$, after learning the channel state over the last $T$ time slots, is expressed as

$$R(s,a;t)=\sum_{t_{0}=0}^{T}\tau(t_{0})\,r(t-t_{0}),$$

Here, $r(t)$ is the immediate reward function and $t$ is the time index. Note that $r(t)$ is determined by the SINR of the UAVs, which can be obtained from the signals received at the UAVs. The SINR varies as the UAV moves from one position to another across time slots, and $r(t)$ is updated as the SINR changes. Eq. (6) is the sum of the rewards over different time slots and is used to update the $Q$ state. Letting $\alpha$ be a positive constant and $\eta_{0}$ the power threshold, we formulate $r(t)$ as

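$$r(t)=\begin{cases}\alpha\eta_{k}(t), & \eta_{k}(t)>\eta_{0},\\ -1, & \eta_{k}(t)\leq\eta_{0}.\end{cases}$$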
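The thresholded reward and the accumulated future reward can be sketched in a few lines. This is a minimal illustration: the function names are hypothetical, and reading the weight $\tau(t_{0})$ as the discount factor raised to the power $t_{0}$ is an assumption about the notation, not something the text confirms.

```python
def immediate_reward(sinr, alpha, sinr_threshold):
    """r(t): alpha * SINR above the threshold, otherwise a penalty of -1."""
    return alpha * sinr if sinr > sinr_threshold else -1.0


def future_reward(rewards, tau):
    """Sum of tau**t0 * r(t - t0) over the stored slots, with rewards[-1] = r(t).

    Interpreting the weight tau(t0) as tau**t0 is an assumption.
    """
    newest_first = reversed(rewards)
    return sum((tau ** t0) * r for t0, r in enumerate(newest_first))


# Hypothetical values: the most recent reward is weighted most heavily.
r_now = immediate_reward(sinr=2.0, alpha=0.5, sinr_threshold=1.0)
total = future_reward([1.0, 2.0, 4.0], tau=0.5)
```

Because an SINR at or below the threshold always yields $-1$, the agent is penalized by the same amount for any link that falls below $\eta_{0}$, regardless of how far below it is.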

Then, we obtain the maximum $Q$ -function as

$$Q^{\pi^{*}}(s,a)=E\left[r+\tau\max_{a^{\prime}}Q^{\pi^{*}}(s^{\prime},a^{\prime})\mid s,a\right],$$

Afterwards, denoting by $A$ the action space, the discounted cumulative state function is formulated as

$$V^{\pi^{*}}(s)=\max_{a\in A}\left[Q^{\pi^{*}}(s,a)\right].$$

After obtaining the maximum $Q$ -function, we need to derive the optimal policy. Using the recursive mechanism, the $Q$ -function can be updated as

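$$Q_{t+1}(s,a)=Q_{t}(s,a)+\beta\left(r+\tau\left[\max_{a^{\prime}}Q_{t}(s^{\prime},a^{\prime})\right]-Q_{t}(s,a)\right),$$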

where $\beta$ is the learning rate. Since the received signal strength fluctuates as the UAVs' positions change, $\beta$ must vary with position. For example, to capture the received signal strengths when the UAV is close to the destination, the learning rate should be increased.
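The recursive update can be illustrated with a tabular version of the standard Q-learning step, in which the current estimate moves toward the bootstrapped target by a fraction $\beta$. The state labels, action names, and numeric values below are hypothetical and only serve the example.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, beta, tau):
    """One step: Q <- Q + beta * (r + tau * max_a' Q(s', a') - Q)."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_error = reward + tau * best_next - q[(state, action)]
    q[(state, action)] += beta * td_error
    return q[(state, action)]


actions = ["north", "south", "east", "west"]  # hypothetical action set
q = defaultdict(float)  # unseen (state, action) pairs start at 0
q[("mid", "east")] = 2.0

new_value = q_update(
    q, state="weak", action="east", reward=1.0, next_state="mid",
    actions=actions, beta=0.1, tau=0.9,
)
```

A position-dependent learning rate, as the text suggests, would amount to passing a different `beta` on each call.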

Thereafter, letting $\omega_{j}$ denote the weights at the $j$-th iteration of the DQN, the target value of the DQN is given by

$$y=r+\tau\max_{a^{\prime}}Q_{t}(s^{\prime},a^{\prime};\omega_{j}),$$

Afterwards, to find the optimum solution, the loss function of the DQN can be designed as

$$\textup{loss}(\omega)=E\left[(y-Q(s,a;\omega))^{2}\right].$$
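In practice the expectation in the loss is estimated over a mini-batch. The sketch below computes the target value and the squared-error loss for a tiny batch, under the assumption that the network outputs are already available as plain lists of $Q$ values; the function names and numbers are hypothetical.

```python
def td_target(reward, next_q_values, tau):
    """y = r + tau * max_a' Q(s', a'; w_j)."""
    return reward + tau * max(next_q_values)


def mse_loss(targets, predictions):
    """Mini-batch estimate of E[(y - Q(s, a; w))^2]."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)


# Hypothetical mini-batch: (reward, Q-values at s', predicted Q(s, a)).
batch = [(1.0, [0.5, 2.0], 2.0), (-1.0, [0.0, 1.0], 0.0)]
targets = [td_target(r, next_q, tau=0.5) for r, next_q, _ in batch]
loss = mse_loss(targets, [q for _, _, q in batch])
```

In a full DQN the `next_q_values` would come from the target network, so that the target $y$ stays fixed while the weights $\omega$ are being updated.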

After deriving the learning policy, the agents must select and execute actions, and we propose an $\varepsilon$-greedy policy for selecting the behavior distribution. Specifically, $\varepsilon$ denotes the exploration probability: with probability $1-\varepsilon$ the agent follows the greedy strategy and chooses the action with the largest $Q$ value, and with probability $\varepsilon$ it explores by choosing a random action. To explain the proposed scheme clearly, the DQN-based navigation framework is illustrated in Fig. 1, and the proposed DQN-based strategy is provided in Algorithm 1 and Algorithm 2.
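A minimal sketch of $\varepsilon$-greedy action selection follows. Actions are represented by their index in a list of $Q$ values, and the optional `rng` parameter is a hypothetical convenience for reproducible runs.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# With epsilon = 0 the choice is purely greedy.
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

A common design choice, not stated in the text, is to decay `epsilon` over training so that the agent explores early and exploits later.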

As described in Algorithm 1, we first (Lines 1-4) initialize the network parameters randomly. To enhance learning stability, we introduce a target DQN that has the same structure as the original network. Then the exploration process is conducted. The action is derived from the current DQN and is mixed with Gaussian noise to maintain exploration. The DQN employs the SINR to update the reward function, since the SINR serves as the received signal strength of the UAVs and reflects their current location. This is because the channel conditions change from one location to another, and the SINR is a common index of the channel conditions. By trying all the actions to obtain better reward estimates, the UAVs choose the action with the highest utility (i.e., the highest reward) and fly in that direction. Next (Lines 14-17), we use the mini-batch method to randomly sample examples from the replay memory, and we update the weights and biases of the network by training the DQN according to the loss function (11). Once the UAVs arrive at the final destination, the training process stops and the UAVs stop choosing actions.
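The replay memory and mini-batch sampling step can be sketched as follows. Algorithm 1 itself is not reproduced here, so the class name, the transition layout, and the fixed capacity are assumptions based on common DQN practice rather than the authors' implementation.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Draw a mini-batch uniformly at random, without replacement."""
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Hypothetical usage: store five transitions in a memory of capacity three.
memory = ReplayMemory(capacity=3)
for step in range(5):
    memory.push(state=step, action=0, reward=0.0, next_state=step + 1)
batch = memory.sample(2)
```

Sampling at random rather than in order breaks the correlation between consecutive transitions, which is the usual motivation for a replay memory.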

IV Simulation Results and Analysis

In this section, we present numerical results for the proposed DQN-based UAV navigation scheme through massive MIMO. In our experiment, we consider a massive MIMO system with $N_{t}=128$ transmit antennas and $K=32$ single-antenna UAVs. Each $\gamma^{k}$ is expressed as $\sqrt{\frac{\kappa}{\kappa+1}l^{-\beta}}$, where $l$ is the distance between the UAV and the BS, and $\beta=3.8$ and $\kappa=6$ dB are the path-loss exponent and the Rician factor, respectively (this $\beta$ is not to be confused with the learning rate $\beta$ in Section III-B). The LoS phases follow a uniform distribution over $[-\pi,\pi]$ radians. Based on Rician fading with a Rician factor of 6 dB, the UAV is simulated in a $500\,\textup{m}\times 500\,\textup{m}$ indoor space. Specifically, BSs are placed at opposite corners and UAVs are directly above each BS, which is the worst case. The UAVs can only fly along several particular routes without running into the walls, and they need to try other directions if a crash occurs. We set the total transmitted power to 20 W, the sampling period to 0.02 ms, and $d=\frac{\lambda}{2}$. Furthermore, the batch size is 100, the number of training examples is 250000, and the number of testing examples is 50000.
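As a quick numerical illustration of the expression for $\gamma^{k}$, the sketch below evaluates it at a few distances. Converting the 6 dB Rician factor to a linear ratio before inserting it into the formula is an assumption, as are the function names and the distances used.

```python
import math


def db_to_linear(value_db):
    """Convert a power ratio from dB to linear scale."""
    return 10.0 ** (value_db / 10.0)


def los_gain(distance_m, rician_factor_db=6.0, pathloss_exponent=3.8):
    """sqrt(kappa / (kappa + 1) * l**(-beta)) with kappa in linear scale."""
    kappa = db_to_linear(rician_factor_db)
    return math.sqrt(kappa / (kappa + 1.0) * distance_m ** (-pathloss_exponent))


# Hypothetical distances in metres within the 500 m x 500 m space.
gains = [los_gain(distance) for distance in (1.0, 10.0, 100.0)]
```

With a path-loss exponent of 3.8 the gain falls off steeply with distance, which is consistent with the text's remark that received power is prone to deep fades.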

To evaluate the coverage performance of the UAVs, we divide the whole area into $M$ zones, and each zone should be covered by at least one UAV in each time period. Letting $F_{m}$ denote the number of time slots in which the $m$-th zone is covered, the coverage score of zone $m$ is defined as

$$r_{m}=\frac{F_{m}}{T},$$
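The coverage score, the ratio $F_{m}/T$, can be computed directly from a log of which zones were covered in each time slot. In this sketch $T$ is taken to be the total number of logged time slots, which is an assumption, and the boolean log is hypothetical data.

```python
def coverage_scores(covered):
    """Per-zone score F_m / T, where covered[t][m] is True if zone m is
    covered in time slot t and T is the number of logged slots."""
    total_slots = len(covered)
    num_zones = len(covered[0])
    return [
        sum(1 for slot in covered if slot[m]) / total_slots
        for m in range(num_zones)
    ]


# Hypothetical log: 4 time slots, 2 zones.
log = [[True, False], [True, False], [True, True], [False, True]]
scores = coverage_scores(log)
```

A score of 1 means the zone was covered in every slot, so maximizing the scores pushes the UAVs toward continuous coverage of all $M$ zones.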

Our objective is to maximize the coverage score. In Fig. 2(a), we compare the coverage score versus coverage range of the proposed method with that of the TF-UAV scheme [8] and the DRL-JSAC method [21]. The proposed DQN method outperforms the other methods in coverage when the UAV coverage range is less than 2.4 or larger than 3.1. Although the DRL-JSAC method obtains the highest coverage score when the coverage range is between 2.4 and 3.1, its curve is not smooth, which indicates that the DRL-JSAC method exhibits randomness and is not robust. In contrast, the curve of the proposed DQN-based method is smoother than those of the other schemes, showing that the proposed method is more robust in UAV navigation. As the coverage range increases, the proposed scheme continues to perform well, since its curve is monotonically increasing. As can be observed in Fig. 2(b), the accumulated reward increases monotonically with the epoch, and the curve grows slowly once the epoch exceeds 300. This is because the zones are not well covered initially, so action selection brings an improving reward; once the zones are well covered, the reward increases smoothly and slowly. Hence, the proposed scheme achieves better coverage performance.

Fig. 2(c) shows the navigation performance of the DQN-based scheme. The UAV moves away from the origin of the Cartesian coordinate system as time increases, which indicates that the UAV can fly under accurate navigation. We also observe that parts of the navigation curve change sharply. These changes occur when the UAV encounters obstacles such as the walls of the indoor space during flight, which implies that the proposed DQN-based method is capable of extracting environment information and making the best decision.

Fig. 3(a) compares the convergence time of UAV navigation against the sampling duration for the DQN-based scheme, the TF-UAV scheme [8], the evolutionary-based method [7], and the Silhouette-based image approach [20]. Here, the speed of the UAV is set to 10 km/h. It is observed from Fig. 3(a) that the convergence time of each UAV navigation algorithm is reduced as the sampling duration increases, for the reason that the probability of making a wrong decision increases as the UAVs travel a longer distance. Meanwhile, when the sampling duration approaches a threshold, the speed of convergence of all the algorithms decreases, and in theory it would reduce to 0, which means that the algorithm has converged. In particular, the proposed DQN-based scheme converges when the sampling duration is 2 s, while the other methods still fluctuate sharply. Also, the proposed scheme requires less convergence time than the other schemes in most cases, although the Silhouette-based image approach requires less convergence time when the sampling duration increases from 0.5 s to 0.6 s.

Fig. 3(b) shows the convergence performance of the DQN-based method against the sampling duration, where the initial learning rate is set to 0.1, 0.01, 0.005, and 0.001, respectively. The initial speed of the UAV is 10 km/h. The learning rate is an essential parameter in a deep learning-based approach and is commonly used to evaluate the robustness and convergence performance of such a method. We observe from Fig. 3(b) that the DQN-based scheme converges quickly with a larger initial learning rate, because a larger initial learning rate accelerates convergence. However, a larger learning rate also leads to strong oscillation and degrades the convergence performance. As shown in Fig. 3(b), the curve is more stable and eventually becomes smooth when the learning rate is 0.001, which indicates that a smaller learning rate enhances the UAV navigation performance.

Fig. 3(c) shows the convergence performance of UAV navigation when the UAV flies at velocities of 10 km/h, 15 km/h, 20 km/h, 25 km/h, and 30 km/h. Here, the initial learning rate is set to 0.001. It is observed from Fig. 3(c) that a larger velocity requires a smaller convergence time than a smaller velocity, since a larger velocity reduces the responsiveness of the UAV system. However, it can also be seen from Fig. 3(c) that at larger UAV speeds the curves do not decrease in general and fluctuate frequently, which indicates that a larger UAV speed degrades the UAV navigation performance.

V Conclusions

In this paper, we have presented a deep reinforcement learning-based scheme for UAV navigation through massive MIMO. Specifically, we first designed an efficient DQN comprising conv. layers and FC layers to extract useful features of the massive MIMO system. In addition, a Q-learning-based learning policy was proposed to realize the UAV navigation. Here, we treat each UAV-ground link as an agent, and the optimal location of the UAVs is obtained based on the received signal strengths without requiring global information. Numerical results show the superior UAV navigation performance of the DQN-based strategy compared with several typical strategies in terms of convergence and coverage.

Bibliography

[1] A. Osseiran et al., “Scenarios for 5G mobile and wireless communications: The vision of the METIS project,” IEEE Commun. Mag., vol. 52, no. 5, pp. 26–35, May 2014.
[2] S. Chandrasekharan et al., “Designing and implementing future aerial communication networks,” IEEE Commun. Mag., vol. 54, no. 5, pp. 26–34, May 2016.
[3] Y. Zeng, R. Zhang, and T. J. Lim, “Wireless communications with unmanned aerial vehicles: Opportunities and challenges,” IEEE Commun. Mag., vol. 54, no. 5, pp. 36–42, May 2016.
[4] D. Yang, Q. Wu, Y. Zeng, and R. Zhang, “Energy tradeoff in ground-to-UAV communication via trajectory design,” IEEE Trans. Veh. Technol., vol. 67, no. 7, pp. 6721–6726, Jul. 2018.
[5] X. Xu, Y. Zeng, Y. L. Guan, and R. Zhang, “Overcoming endurance issue: UAV-enabled communications with proactive caching,” IEEE J. Sel. Areas Commun., vol. 36, no. 6, pp. 1231–1244, Jun. 2018.
[6] G. Geraci, A. Garcia-Rodriguez, L. G. Giordano, D. Lopez-Perez, and E. Bjoernson, “Supporting UAV cellular communications through massive MIMO,” in Proc. ICC Workshops, Kansas City, MO, 2018, pp. 1–6.
[7] I. K. Nikolos, K. P. Valavanis, N. C. Tsourveloudis, and A. N. Kostaras, “Evolutionary algorithm based offline/online path planner for UAV navigation,” IEEE Trans. Systems, Man, and Cybernetics, vol. 33, no. 6, pp. 898–912, Dec. 2003.
[8] T. Tomic et al., “Toward a fully autonomous UAV: Research platform for indoor and outdoor urban search and rescue,” IEEE Robotics & Automation Mag., vol. 19, no. 3, pp. 46–56, Sept. 2012.