Massive Autonomous UAV Path Planning: A Neural Network Based Mean-Field Game Theoretic Approach

Hamid Shiri; Jihong Park; Mehdi Bennis

arXiv:1905.04152·cs.SY·May 14, 2019


TL;DR

This paper presents a neural network-based mean-field game approach for autonomous UAV path planning that reduces communication and computation energy while ensuring collision avoidance in large UAV swarms.

Contribution

It introduces a novel ML-assisted MFG control method that efficiently solves PDEs for large UAV groups with minimal communication and computational costs.

Findings

1. Effective collision avoidance demonstrated in simulations.

2. Reduced communication energy by exchanging UAV states only once.

3. Low computational energy achieved through ML approximation of PDE solutions.

Abstract

This paper investigates the autonomous control of massive unmanned aerial vehicles (UAVs) for mission-critical applications (e.g., dispatching many UAVs from a source to a destination for firefighting). Achieving their fast travel and low motion energy without inter-UAV collision under wind perturbation is a daunting control task, which incurs huge communication energy for exchanging UAV states in real time. We tackle this problem by exploiting a mean-field game (MFG) theoretic control method that requires the UAV state exchanges only once at the initial source. Afterwards, each UAV can control its acceleration by locally solving two partial differential equations (PDEs), known as the Hamilton-Jacobi-Bellman (HJB) and Fokker-Planck-Kolmogorov (FPK) equations. This approach, however, brings about huge computation energy for solving the PDEs, particularly under multi-dimensional UAV…

Equations

$\displaystyle \text{d}v_{i}(t)$

$\displaystyle \text{d}r_{i}(t)$

$\displaystyle \psi_{i}(t)=\mathsf{E}\left[\int_{t}^{T}\Big(\phi_{L}\!\left(s_{i}(\tau)\right)+c_{4}\phi_{G}\!\left(s(\tau)\right)\Big)\text{d}\tau\right]$

$\displaystyle \phi_{L}\!\big(s_{i}(t)\big)=\overbrace{\frac{v_{i}(t)\cdot r_{i}(t)}{\left\|r_{i}(t)\right\|}+c_{1}\left\|r_{i}(t)\right\|^{2}}^{\text{1) travel time minimization}}+\overbrace{c_{2}\left\|v_{i}(t)\right\|^{2}+c_{3}\left\|a_{i}(t)\right\|^{2}}^{\text{2) motion energy minimization}},$

$\displaystyle \phi_{G}\!\big(s_{N_{i}}\!(t)\big)=\underbrace{\frac{1}{N_{i}(t)-1}\sum_{u_{j}\in\mathcal{N}_{i}(t)\backslash\{u_{i}\}}\frac{\left\|v_{j}(t)-v_{i}(t)\right\|^{2}}{\left(\varepsilon+\left\|r_{j}(t)-r_{i}(t)\right\|^{2}\right)^{\beta}}}_{\text{3) collision avoidance \& connectivity guarantee}},$

$\displaystyle \text{(For HJB)}\quad\mathsf{H}\big(\psi_{i}(t);s_{N_{i}}\!(t)\big)=\partial_{t}\psi_{i}(t)+\Big[As_{i}(t)-\frac{1}{4c_{3}}BB^{\intercal}\nabla\psi_{i}(t)+c_{0}v_{o}B\Big]^{\intercal}\nabla\psi_{i}(t)+\frac{1}{2}\text{tr}\big(GG^{\intercal}\nabla^{2}\psi_{i}(t)\big)+\phi_{L}\big(s_{i}(t)\big)+\phi_{G}\big(s_{N_{i}}\!(t)\big)$

$\displaystyle \text{(For MFG)}\quad\mathsf{H}\big(\psi_{i}(t);s_{i}(t),m(t)\big)$

$\displaystyle \mathsf{F}\big(m(t);s_{i}(t),\psi_{i}(t)\big)=\partial_{t}m(t)+\nabla\Big(\Big[As-\frac{1}{2c_{3}}BB^{\intercal}\nabla\psi_{i}(t)+c_{0}v_{o}B\Big]m(t)\Big)-\frac{1}{2}\text{tr}\big(GG^{\intercal}\nabla^{2}m(t)\big)$

$\displaystyle \psi_{i}^{*}(t)=\min_{a_{i}(t)}\psi_{i}(t)$

$\displaystyle \text{d}s_{i}(t)=\big(As_{i}(t)+B(a_{i}(t)+c_{0}v_{o})\big)\text{d}t+G\,\text{d}W_{i}(t),$

$\displaystyle \phi_{G}\big(s_{i}(t),m(t)\big)=\int_{s}m(t)\frac{\left\|v(t)-v_{i}(t)\right\|^{2}}{\left(\varepsilon^{2}+\left\|r(t)-r_{i}(t)\right\|^{2}\right)^{\beta}}\text{d}s.$

$\displaystyle \hat{\psi}_{i}(t)=w_{i,\textsf{H}}(t)^{\intercal}\sigma_{\textsf{H}}\big(s_{N_{i}}\!(t)\big).$

$\displaystyle L_{i,\textsf{H}}(t)=\underbrace{\frac{1}{2}\left|\hat{\mathsf{H}}\big(\hat{\psi}_{i}(t);s_{N_{i}}\!(t)\big)\right|^{2}}_{\ell_{i,\textsf{H}}(t)}+c_{\textsf{H}}\underbrace{\max\left\{0,s_{i}(t)^{\intercal}\frac{\text{d}s_{i}(t)}{\text{d}t}\right\}}_{R_{i}(t)},$

$\displaystyle w_{i,\textsf{H}}(t)=w_{i,\textsf{H}}(t-\Delta t)-\mu\,\text{sign}\big(\nabla_{w}\ell_{i,\textsf{H}}(t)\big)-c_{\textsf{H}}\nabla_{w}R_{i}(t).$

$\displaystyle \partial_{t}\hat{\psi}_{i}(t)=\left(\frac{\text{d}s_{i}(t)}{\text{d}t}\right)^{\intercal}\nabla\hat{\psi}_{i}(t).$

$\displaystyle \hat{m}(t)=w_{i,\textsf{F}}(t)^{\intercal}\sigma_{\textsf{F}}\big(s_{i}(t)\big).$

$\displaystyle L_{i,\textsf{F}}(t)=\frac{1}{2}\left|\mathsf{F}\big(\hat{m}(t);s_{i}(t),\hat{\psi}_{i}(t)\big)\right|^{2}.$

$\displaystyle w_{i,\textsf{H}}^{[k+1]}(t)=w_{i,\textsf{H}}^{[k]}(t)-\mu\,\text{sign}\big(\nabla_{w}\ell_{i,\textsf{H}}^{[k]}(t)\big)-c_{\textsf{H}}\nabla_{w}R_{i}^{[k]}(t).$

$\displaystyle w_{i,\textsf{F}}^{[k+1]}(t)=w_{i,\textsf{F}}^{[k]}(t)-\mu\,\text{sign}\big(\nabla_{w}L_{i,\textsf{F}}^{[k]}(t)\big).$


Full text

Massive Autonomous UAV Path Planning:

A Neural Network Based Mean-Field Game Theoretic Approach

Hamid Shiri, Jihong Park, and Mehdi Bennis

Centre for Wireless Communications, University of Oulu, Finland, Email: {hamid.shiri, jihong.park, mehdi.bennis}@oulu.fi

Abstract

This paper investigates the autonomous control of massive unmanned aerial vehicles (UAVs) for mission-critical applications (e.g., dispatching many UAVs from a source to a destination for firefighting). Achieving their fast travel and low motion energy without inter-UAV collision under wind perturbation is a daunting control task, which incurs huge communication energy for exchanging UAV states in real time. We tackle this problem by exploiting a mean-field game (MFG) theoretic control method that requires the UAV state exchanges only once at the initial source. Afterwards, each UAV can control its acceleration by locally solving two partial differential equations (PDEs), known as the Hamilton-Jacobi-Bellman (HJB) and Fokker-Planck-Kolmogorov (FPK) equations. This approach, however, brings about huge computation energy for solving the PDEs, particularly under multi-dimensional UAV states. We address this issue by utilizing a machine learning (ML) method where two separate ML models approximate the solutions of the HJB and FPK equations. These ML models are trained and exploited using an online gradient descent method with low computational complexity. Numerical evaluations validate that the proposed ML aided MFG theoretic algorithm, referred to as MFG learning control, is effective in collision avoidance with low communication energy and acceptable computation energy.

Index Terms:

Autonomous UAV, communication-efficient online path planning, mean-field game, machine learning.

I Introduction

Large groups of unmanned aerial vehicles (UAVs) are essential in mission-critical applications, such as covering wide disaster sites in emergency cellular networks [1] and delivering heavy payloads in rescue and firefighting missions [2, 3]. These applications are real-time and do not tolerate remote control delays from a central controller. They also require reliable control under uncertainty such as wind perturbations, which makes pre-programmed offline control algorithms ill-suited. In view of this, in this paper we focus on the problem of controlling a large number of UAVs in a distributed and online way, so as to achieve 1) the fastest travel from a source to a destination, while jointly minimizing 2) motion energy, and 3) inter-UAV collision, under wind dynamics.

This problem is challenging, as illustrated in Fig. 1, wherein each UAV must make control decisions with many degrees of freedom while taking into account energy saving and collision avoidance. For collision avoidance, multiple UAVs need to interact with each other, which requires inter-UAV communications whose delay and/or energy cost increases exponentially with the number of UAVs. Such communication and control overhead is persistent, as the control must be continual under wind perturbations.

To address the aforementioned issues, we leverage mean-field game (MFG) theory [4, 5], a mathematical framework that is effective in reducing the communication and control overhead of distributed control under agent interactions (e.g., collisions) through their states (e.g., locations) [1]. At its core, MFG considers a large number of agents, each of which approximately views the other agents’ states as the global state averaged across all agents. The global state is identically given for all agents at any given time, and one can thus focus only on controlling a single agent while incorporating its interactions via the global state distribution, referred to as the mean-field (MF) distribution.

The MFG theoretic control operates by locally solving two partial differential equations (PDEs) at each agent. Namely, a single agent computes the MF distribution by solving the Fokker-Planck-Kolmogorov (FPK) equation, so long as the initial global state is known by exchanging agents’ states only once. For the given MF distribution, the optimal control of the agent is determined by solving the other PDE induced by a continuous-time Markov decision problem (MDP), known as the Hamilton-Jacobi-Bellman (HJB) equation [5].

While effective, MFG theoretic approaches are computationally expensive due to solving both HJB and FPK equations, particularly with multi-dimensional states [6], limiting their adoption for real-time multi-dimensional control applications. To circumvent this problem, we propose an MFG learning control algorithm in which the HJB and FPK solutions are approximated using two separate machine learning (ML) models (e.g., neural networks), denoted as HJB model and FPK model, respectively. The HJB and FPK models stored at each UAV are simultaneously trained and exploited for control in an online manner. Numerical evaluations validate that the proposed MFG learning control more reliably guarantees collision avoidance with significant communication energy reduction, at the cost of a slight increase in computation and motion energy consumption.

Related works. The problem of UAV placement for supporting communication systems has been studied in [7]. Under wind perturbations, the real-time placement of massive UAVs without collision has been investigated in [1]. Path planning is a more challenging problem wherein UAVs are controlled to reach a destination. In offline control, a multiple-UAV scenario has been addressed in [7]. In online control, an evolutionary algorithm [8] and a partially observable Markov decision process based method [9] have been proposed. For other communication and control related issues for UAV systems, readers are encouraged to check [10] and [3], respectively.

II System model

We assume a set $\mathcal{N}$ of $N$ UAVs traveling from a common source to a destination in a two-dimensional plane, where the origin is set as the destination. At time $t\geq 0$ , the $i$ -th UAV $u_{i}\in\mathcal{N}$ controls its acceleration $a_{i}(t)\in\mathbb{R}^{2}$ , so as to minimize its: 1) travel time, 2) motion energy, and 3) inter-UAV collision, during the remaining travel to the destination.

The control $a_{i}(t)$ of $u_{i}$ is based not only on its local state $s_{i}(t)$ , but also on the states $s_{-i}(t)$ of a set $\mathcal{N}_{i}(t)\!\subset\!\mathcal{N}$ of the other $(N_{i}-1)$ UAVs within $u_{i}$ ’s communication range $d>0$ with $N_{i}(t)=|\mathcal{N}_{i}(t)|\leq N$ for collision avoidance, and $s_{-i}(t)\!=\!\{s_{j}(t)\!\mid\!\|r_{j}(t)\!-\!r_{i}(t)\|\!\leq\!d,\forall j\neq i\}$ . The communication range $d$ is determined by the minimum received signal-to-noise ratio (SNR) $\theta>0$ required for successful decoding under the standard path loss model, which is given as $d=[P/(\theta\sigma^{2})]^{1/\alpha}$ with an identical transmission power $P$ , noise power $\sigma^{2}$ , and path loss exponent $\alpha\geq 2$ .
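As a rough illustration of the range formula $d=[P/(\theta\sigma^{2})]^{1/\alpha}$, the following minimal sketch evaluates it numerically. The function names and the transmission power of 1 mW are hypothetical choices made for this example; the threshold, noise power, and path loss exponent are taken from the simulation settings in Sec. V.

```python
def db_to_linear(value_db: float) -> float:
    """Convert a decibel value to a linear power ratio."""
    return 10.0 ** (value_db / 10.0)


def communication_range(tx_power_mw: float, snr_threshold_db: float,
                        noise_power_mw: float, path_loss_exponent: float) -> float:
    """Return d = [P / (theta * sigma^2)]^(1/alpha) under the path loss model."""
    if path_loss_exponent < 2:
        raise ValueError("the text assumes a path loss exponent alpha >= 2")
    theta = db_to_linear(snr_threshold_db)  # SNR threshold as a linear ratio
    return (tx_power_mw / (theta * noise_power_mw)) ** (1.0 / path_loss_exponent)


# P = 1 mW is a hypothetical value; theta, sigma^2 and alpha follow Sec. V.
d = communication_range(tx_power_mw=1.0, snr_threshold_db=-10.0,
                        noise_power_mw=2.0 ** -2, path_loss_exponent=2.0)
print(f"communication range d = {d:.3f} (in the distance unit of the model)")
```

Raising $P$ or lowering $\theta$ enlarges $d$, and thereby the neighbor set $\mathcal{N}_{i}(t)$ whose states enter the control.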

The state $s_{i}(t)=[r_{i}(t)^{\intercal},v_{i}(t)^{\intercal}]^{\intercal}\in\mathbb{R}^{4}$ of UAV $u_{i}$ is comprised of its location $r_{i}(t)\in\mathbb{R}^{2}$ and velocity $v_{i}(t)\in\mathbb{R}^{2}$ that are dynamically updated by the control $a_{i}(t)$ under random wind dynamics. Following [11], the wind dynamics are assumed to follow an Ornstein-Uhlenbeck process with an average wind velocity $v_{o}$ . The temporal state dynamics are thereby given as:

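$\displaystyle \text{d}v_{i}(t)=\big(a_{i}(t)-c_{0}\left(v_{i}(t)-v_{o}\right)\big)\text{d}t+V_{o}\,\text{d}W_{i}(t),$ (1)

$\displaystyle \text{d}r_{i}(t)=v_{i}(t)\,\text{d}t,$ (2)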

where $c_{0}$ is a positive constant, $V_{o}\in\mathbb{R}^{2\times 2}$ is the covariance matrix of the wind velocity, and $W_{i}(t)\in\mathbb{R}^{2}$ is the standard Wiener process independently and identically distributed (i.i.d.) across UAVs.

To achieve the aforementioned goals 1), 2), and 3), UAV $u_{i}$ at time $t\!<\!T$ aims to minimize its average cost $\psi_{i}(t)$ , where the average is taken with respect to the measure induced by all possible controls for $\tau\!\in\![t,T]$ . The cost $\psi_{i}(t)$ consists of the term $\phi_{L}\!\left(s_{i}(t)\right)$ depending only on the local state $s_{i}(t)$ and the term $\phi_{G}\!\left(s_{N_{i}}\!(t)\right)$ relying on the global state $s_{N_{i}}\!(t)=[s_{i}(t)^{\intercal},s_{-i}(t)^{\intercal}]^{\intercal}\in\mathbb{R}^{4N_{i}}$ observed by $u_{i}$ , given as:

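$\displaystyle \psi_{i}(t)=\mathsf{E}\left[\int_{t}^{T}\Big(\phi_{L}\!\left(s_{i}(\tau)\right)+c_{4}\phi_{G}\!\left(s(\tau)\right)\Big)\text{d}\tau\right]$ (3)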

where

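$\displaystyle \phi_{L}\!\big(s_{i}(t)\big)=\overbrace{\frac{v_{i}(t)\cdot r_{i}(t)}{\left\|r_{i}(t)\right\|}+c_{1}\left\|r_{i}(t)\right\|^{2}}^{\text{1) travel time minimization}}+\overbrace{c_{2}\left\|v_{i}(t)\right\|^{2}+c_{3}\left\|a_{i}(t)\right\|^{2}}^{\text{2) motion energy minimization}},$

$\displaystyle \phi_{G}\!\big(s_{N_{i}}\!(t)\big)=\underbrace{\frac{1}{N_{i}(t)-1}\sum_{u_{j}\in\mathcal{N}_{i}(t)\backslash\{u_{i}\}}\frac{\left\|v_{j}(t)-v_{i}(t)\right\|^{2}}{\left(\varepsilon+\left\|r_{j}(t)-r_{i}(t)\right\|^{2}\right)^{\beta}}}_{\text{3) collision avoidance \& connectivity guarantee}},$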

and the terms $c_{1}$ , $c_{2}$ , $c_{3}$ , $\beta$ , and $\varepsilon$ are positive constants.

The local term $\phi_{L}\!\left(s_{i}(t)\right)$ in (3) focuses on the following two objectives. For 1) travel time minimization, it is intended to minimize the remaining travel distance $\|r_{i}(t)\|^{2}$ , while maximizing the velocity towards the destination, i.e., minimizing the projected velocity $v_{i}(t)\cdot r_{i}(t)/\|r_{i}(t)\|$ towards the opposite direction to the destination. For 2) motion energy minimization, it is planned to minimize the kinetic energy and the acceleration control energy that are proportional to $\|v_{i}(t)\|^{2}$ and $\|a_{i}(t)\|^{2}$ , respectively [12, 13].

The global term $\phi_{G}\!\left(s_{N_{i}}\!(t)\right)$ in (3) refers to 3) collision avoidance, and is intended to form a flock of UAVs moving together [14]. The flocking leads to small relative inter-UAV velocities for avoiding collision even when their controlled velocities are slightly perturbed by wind dynamics. Furthermore, the flocking yields closer inter-UAV distances without collision. This is beneficial for allowing more UAVs to exchange their states, i.e., larger $N_{i}(t)$ , thereby contributing also to collision avoidance. In view of this, we adopt the Cucker-Smale flocking [1, 14] that reduces the relative velocities for the UAVs. The relative velocity $\|v_{j}(t)-v_{i}(t)\|$ and the inter-UAV distance $\|r_{j}(t)-r_{i}(t)\|$ are thus incorporated in the numerator and denominator of $\phi_{G}\!\left(s_{N_{i}}\!(t)\right)$ , respectively.
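To make the two cost terms concrete, the sketch below evaluates the local term $\phi_{L}$ and the Cucker-Smale style global term $\phi_{G}$ for plain 2-D tuples. The function names are hypothetical, and $\beta=1$ in the usage lines is an illustrative value that the text does not specify; the other constants follow Sec. V. The local term is undefined at $r_{i}(t)=0$, which this sketch does not handle.

```python
import math


def norm(x):
    """Euclidean norm of a tuple of floats."""
    return math.sqrt(sum(c * c for c in x))


def phi_local(r, v, a, c1, c2, c3):
    """Local cost: travel time terms plus motion energy terms."""
    r_norm = norm(r)
    projected_velocity = sum(vi * ri for vi, ri in zip(v, r)) / r_norm
    travel = projected_velocity + c1 * r_norm ** 2
    energy = c2 * norm(v) ** 2 + c3 * norm(a) ** 2
    return travel + energy


def phi_global(r_i, v_i, neighbours, eps, beta):
    """Global cost averaged over the (N_i - 1) neighbours, given as (r_j, v_j) pairs."""
    if not neighbours:
        return 0.0
    total = 0.0
    for r_j, v_j in neighbours:
        dv2 = sum((x - y) ** 2 for x, y in zip(v_j, v_i))  # squared relative velocity
        dr2 = sum((x - y) ** 2 for x, y in zip(r_j, r_i))  # squared distance
        total += dv2 / (eps + dr2) ** beta
    return total / len(neighbours)


local = phi_local(r=(3.0, 4.0), v=(-3.0, -4.0), a=(0.0, 0.0), c1=100.0, c2=1.5, c3=1.5)
shared = phi_global((0.0, 0.0), (1.0, 0.0),
                    [((1.0, 0.0), (1.0, 0.0)), ((0.0, 2.0), (0.0, 0.0))],
                    eps=0.001, beta=1.0)  # beta = 1 is a hypothetical choice
print(local, shared)
```

A neighbor flying with the same velocity contributes nothing to $\phi_{G}$ regardless of its distance, which is how the term rewards flocking rather than mere separation.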

Incorporating the cost function (3) under its temporal dynamics (1) and (2), the control problem of UAV $u_{i}$ at time $t$ is formulated as:

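$\displaystyle \psi_{i}^{*}(t)=\min_{a_{i}(t)}\psi_{i}(t)$ (4)

$\displaystyle \text{s.t.}\quad\text{d}s_{i}(t)=\big(As_{i}(t)+B(a_{i}(t)+c_{0}v_{o})\big)\text{d}t+G\,\text{d}W_{i}(t),$ (5)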

where $A\!=\!\left(\begin{smallmatrix}0&I\\ 0&-c_{0}I\end{smallmatrix}\right)$ , $B\!=\!\left(\begin{smallmatrix}0\\ I\end{smallmatrix}\right)$ , $G\!=\!\left(\begin{smallmatrix}0\\ V_{o}\end{smallmatrix}\right)$ , and $I$ denotes the two-dimensional identity matrix. The minimum cost $\psi_{i}^{*}\!(t)$ is referred to as the value function of the optimal control, and is derived using two different control methods in the next section.
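For intuition, the state dynamics (5) can be simulated numerically. The sketch below uses an Euler-Maruyama discretization, which is a choice made here for illustration and is not prescribed by the text. It assumes $V_{o}$ is a scalar multiple of the identity (as in Sec. V), and the function name, the zero control input, and the initial state are hypothetical.

```python
import random


def euler_maruyama_step(r, v, a, v_wind, c0, wind_scale, dt, rng):
    """Advance (r, v) by one step of ds = (A s + B(a + c0 v_o)) dt + G dW.

    Assumes V_o = wind_scale * I, so each velocity component receives an
    independent Gaussian increment with standard deviation wind_scale * sqrt(dt).
    """
    sqrt_dt = dt ** 0.5
    # Position rows of A s: dr = v dt.
    new_r = tuple(ri + vi * dt for ri, vi in zip(r, v))
    # Velocity rows: dv = (a - c0 (v - v_o)) dt + V_o dW.
    new_v = tuple(
        vi + (ai - c0 * (vi - wi)) * dt + wind_scale * rng.gauss(0.0, 1.0) * sqrt_dt
        for vi, ai, wi in zip(v, a, v_wind)
    )
    return new_r, new_v


rng = random.Random(0)
r, v = (150.0, 100.0), (1.0, -1.0)  # hypothetical initial state
for _ in range(5):
    r, v = euler_maruyama_step(r, v, a=(0.0, 0.0), v_wind=(1.0, -1.0),
                               c0=0.1, wind_scale=0.1, dt=1.0, rng=rng)
print("position after 5 steps:", r)
print("velocity after 5 steps:", v)
```

With zero control, the velocity relaxes toward the average wind velocity $v_{o}$ at rate $c_{0}$, so any progress toward the destination has to come from the acceleration $a_{i}(t)$.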

III HJB Control and MFG Control

Deriving the UAV $u_{i}$ ’s value function $\psi_{i}^{*}(t)$ in (4) is intertwined with other UAVs, through the collision avoidance term $\phi_{G}\!\left(s_{N_{i}}\!(t)\right)$ in (3). Therefore, this is an $N_{i}$ -player non-cooperative game whose well-known solution is the Nash equilibrium (NE), i.e., the control decisions under which no UAV can unilaterally decrease its cost [5]. Its solution complexity exponentially increases with $N_{i}$ , which is a poor fit for real-time applications. To address this pressing concern, in this section we consider two different control methods: 1) HJB control, our baseline method in which each UAV’s control only takes into account the other UAVs’ states before taking their actions; and 2) MFG control, our proposed method that incorporates the intertwined controls via an approximated global state distribution, i.e., the MF distribution.

It is noted that HJB control does not always achieve the NE as it intentionally neglects the actual control interactions, i.e., the states when taking actions. On the other hand, MFG control relies on the MF approximation, and only achieves the NE asymptotically when $N\rightarrow\infty$ [5]. The operational details of both control schemes are elaborated in the following subsections, and their effectiveness under a large finite number of UAVs will be numerically examined in Sec. V.

III-A HJB Control

The UAV $u_{i}$’s value function $\psi_{i}^{*}(t)$ in (4) is equivalent to the solution of its corresponding HJB $\mathsf{H}\big{(}\psi_{i}^{*}\!(t);s_{N_{i}}\!(t)\big{)}\!=\!0$ formulated according to the Markov decision principle. The left-hand side $\mathsf{H}\big{(}\psi_{i}^{*}\!(t);s_{N_{i}}\!(t)\big{)}$ is given by putting $\psi_{i}^{*}(t)$ into $\mathsf{H}\big{(}\psi_{i}\!(t);s_{N_{i}}\!(t)\big{)}$ in (7), whose derivation is detailed in [5]:

$\displaystyle \mathsf{H}\big(\psi_{i}(t);s_{N_{i}}\!(t)\big)=\partial_{t}\psi_{i}(t)+\Big[As_{i}(t)-\frac{1}{4c_{3}}BB^{\intercal}\nabla\psi_{i}(t)+c_{0}v_{o}B\Big]^{\intercal}\nabla\psi_{i}(t)+\frac{1}{2}\text{tr}\big(GG^{\intercal}\nabla^{2}\psi_{i}(t)\big)+\phi_{L}\big(s_{i}(t)\big)+\phi_{G}\big(s_{N_{i}}\!(t)\big).$ (7)

Due to the global term $\phi_{G}(s_{N_{i}}\!(t))$ therein for collision avoidance, the HJB solution requires collecting the other UAVs’ states. Furthermore, achieving the NE of $N_{i}$-UAV controls necessitates solving $N_{i}$-coupled HJBs whose required number of state exchanges exponentially increases with $N_{i}$. For example, each HJB is first solved while the other $(N_{i}-1)$ UAVs’ states are fixed, and this should be iterated for $N_{i}$ UAVs in a recursive manner until all action changes stop, i.e., convergence to the NE [5]. The said $N_{i}$-coupled HJB solutions require $N_{i}\times(N_{i}-1)\times K$ state exchanges per time instant $t$, where $K$ denotes the number of iterations until convergence to the NE.

Such excessive communication overhead is prohibitive for real-time UAV control. Therefore, at the cost of compromising convergence to the NE, we instead consider as a baseline HJB control, in which the UAVs exchange a total of $\sum_{i=1}^{N}N_{i}(t)$ states before solving the HJBs, i.e., before taking actions, at each time instant $t$. Afterwards, each HJB is solved independently without recursion, as visualized in Fig. 2-a. At time $t$, $u_{i}$’s HJB control is summarized below.

Algorithm 1. HJB Control

1) Collect the states $s_{-i}(t)$ from $(N_{i}(t)-1)$ UAVs.

2) Calculate the value $\psi_{i}^{*}\!(t)$ by solving the HJB $\mathsf{H}\big{(}\psi_{i}^{*}\!(t);s_{N_{i}}\!(t)\big{)}=0$ (see (7)).

3) Take the optimal action $a_{i}^{*}(t)\!=\!\frac{1}{2c_{3}}B^{\intercal}\nabla\psi_{i}^{*}\!(t)$.

Here, (7) is derived by applying the optimal control $a_{i}^{*}(t)$ to (6), where $\nabla$ denotes the differential operator taken with respect to $s_{i}(t)$ . The optimal control $a_{i}^{*}(t)\!=\!\frac{1}{2c_{3}}B^{\intercal}\nabla\psi_{i}^{*}\!(t)$ is obtained according to the Karush-Kuhn-Tucker (KKT) conditions, since the HJB’s Hamiltonian, i.e., the terms inside the infimum in (6), is convex with respect to $a_{i}(t)$ . The existence of $a_{i}^{*}(t)$ is ensured by the fact that the HJB with (6) has a unique solution $\psi_{i}^{*}\!(t)$ according to [5], as long as the drift term $As_{i}(t)+B(a_{i}(t)+c_{0}v_{o})$ in (5) and the instantaneous cost $\phi_{L}\!\left(s_{i}(t)\right)+\phi_{G}\!\left(s_{N_{i}}\!(t)\right)$ are smooth, i.e., continuous first derivatives.

III-B MFG Control

Compared to HJB control with $\sum_{i=1}^{N}N_{i}(t)$ state exchanges per time instant $t$, MFG control requires $N\times(N-1)$ state exchanges only at the initial time $t=0$, while asymptotically guaranteeing the NE at any time as $N$ goes to infinity. This is made possible by locally calculating the MF distribution $m(t)$ that asymptotically converges to the (empirical) global state distribution when all actions are taken under the NE, i.e., $\lim_{N\to\infty}\!\frac{1}{N}\sum_{i=1}^{N}\mathds{1}_{s_{i}(t)}\!=\!m(t)$. With a finite number of UAVs, it yields an MF approximation that achieves the $\epsilon$-NE [5].
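The difference in communication overhead can be checked with simple counting. The sketch below tallies state exchanges for the baseline HJB control (the sum of $N_{i}(t)$ over all UAVs, repeated at every time instant), for the fully coupled HJB scheme ($N_{i}\times(N_{i}-1)\times K$ per instant), and for MFG control ($N\times(N-1)$ once). The function names and the scenario of 25 UAVs with 5 neighbors each over 100 instants are hypothetical.

```python
def hjb_exchanges(neighbour_counts_over_time):
    """Total exchanges for baseline HJB control: sum of N_i(t) over UAVs and instants."""
    return sum(sum(counts) for counts in neighbour_counts_over_time)


def coupled_hjb_exchanges_per_instant(n_i, iterations):
    """Exchanges per instant when N_i coupled HJBs are iterated K times to the NE."""
    return n_i * (n_i - 1) * iterations


def mfg_exchanges(n_uavs):
    """Exchanges for MFG control: every UAV learns every other state once at t = 0."""
    return n_uavs * (n_uavs - 1)


# Hypothetical scenario: 25 UAVs, each with N_i(t) = 5, over 100 control instants.
counts = [[5] * 25 for _ in range(100)]
print("baseline HJB total:", hjb_exchanges(counts))
print("coupled HJB per instant (K = 10):", coupled_hjb_exchanges_per_instant(5, 10))
print("MFG total:", mfg_exchanges(25))
```

The HJB total grows linearly with the travel duration, while the MFG total is fixed once the initial exchange is done.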

To this end, each UAV under MFG control locally solves a pair of the HJB $\mathsf{H}\big{(}\psi_{i}^{*}\!(t);s_{i}(t),m(t)\big{)}=0$ (see (8) with $\psi_{i}^{*}\!(t)$) and its coupled FPK $\mathsf{F}\!\big{(}m(t);s_{i}(t),\psi_{i}^{*}\!(t)\big{)}=0$ (see (10) with $\psi_{i}^{*}\!(t)$) that is derived from the state dynamics (5) with Itô’s lemma [5], where

$\displaystyle \mathsf{F}\big(m(t);s_{i}(t),\psi_{i}(t)\big)=\partial_{t}m(t)+\nabla\Big(\Big[As-\frac{1}{2c_{3}}BB^{\intercal}\nabla\psi_{i}(t)+c_{0}v_{o}B\Big]m(t)\Big)-\frac{1}{2}\text{tr}\big(GG^{\intercal}\nabla^{2}m(t)\big).$ (10)

As illustrated in Fig. 2-b, solving the HJB produces the value $\psi_{i}^{*}\!(t)$ (or its corresponding optimal action $a_{i}^{*}(t)$), which is fed to the FPK whose solution is the MF distribution $m(t)$. This operation is locally iterated $K$ times until it converges to the NE. At time $t$, $u_{i}$’s MFG control is described as follows.

Algorithm 2. MFG Control

For $k\in[1,K]$:

1) Calculate the value $\psi_{i}^{[k]}\!(t)$ by solving the HJB $\mathsf{H}\big{(}\psi_{i}^{[k]}\!(t);s_{i}(t),m^{[k-1]}\!(t)\big{)}=0$ (see (8)).

2) Calculate the MF distribution $m^{[k]}\!(t)$ by solving the FPK $\mathsf{F}\big{(}m^{[k]}\!(t);s_{i}(t),\psi_{i}^{[k]}\!(t)\big{)}=0$ (see (10)).

3) Iterate 1) and 2) until $k=K$.

4) Take the optimal action $a_{i}^{*}(t)\!=\!\frac{1}{2c_{3}}B^{\intercal}\nabla\psi_{i}^{[K]}\!(t)$.

Initial MF distribution $m^{[0]}\!(t)$ at $k=0$:

• If $t=0$, $m^{[0]}\!(0)=\frac{1}{N}\sum_{i=1}^{N}\mathds{1}_{s_{i}(0)}$, computed by collecting the states $s_{-i}(0)$ from the other $(N-1)$ UAVs.

• Otherwise, $m^{[0]}\!(t)=m^{[K-1]}\!(t-\Delta t)$, i.e., the converged MF distribution in the previous control, where $\Delta t$ denotes the control interval.

It is noted that the HJB’s global term $\phi_{G}(s_{i}(t),m(t))$ in (8) approximates $\phi_{G}\!\left(s_{N_{i}}\!(t)\right)$ in (7), where

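$\displaystyle \phi_{G}\big(s_{i}(t),m(t)\big)=\int_{s}m(t)\frac{\left\|v(t)-v_{i}(t)\right\|^{2}}{\left(\varepsilon^{2}+\left\|r(t)-r_{i}(t)\right\|^{2}\right)^{\beta}}\text{d}s.$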

This MF approximation is based on treating each of the UAVs’ states as $s=[r(t)^{\intercal},v(t)^{\intercal}]^{\intercal}$ induced by the MF distribution $m(t)$ . The approximation converges to the exact value as $N\rightarrow\infty$ , so long as $\phi_{G}\!\left(s_{N_{i}}\!(t)\right)$ is bounded and UAV indices are permutable, i.e., the exchangeability of actions for the same states (see the condition details in [5]).

IV ML Aided HJB and MFG Controls

Both HJB and MFG controls are facilitated by the HJB and FPK equations. These PDEs are solved by discretizing the domain in a way that the derivatives therein can be approximated using finite differences. Unfortunately, such a finite difference method requires finer discretization as the domain dimension increases, incurring higher computational complexity. For instance, in a two-dimensional $x$ - $y$ domain, the convergence of a numerical PDE solution with the temporal discretization step size $\Delta t$ is guaranteed by the Courant-Friedrichs-Lewy (CFL) condition $\Delta t\leq(\Delta x^{-1}+\Delta y^{-1})^{-1}$ whose feasible step size is smaller than the required step size in a one-dimensional domain, i.e., $\Delta t\leq\Delta x$ [6].
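The effect of the CFL condition on the admissible time step can be seen with a small calculation. The sketch below evaluates the bound $\Delta t\leq(\sum_{k}\Delta x_{k}^{-1})^{-1}$, which reduces to the one- and two-dimensional conditions quoted above. The function name and the grid spacing of 0.5 are hypothetical, and applying the same formula beyond two dimensions is an extrapolation assumed here rather than a statement from the text.

```python
def cfl_max_dt(*spacings):
    """Largest dt satisfying dt <= (sum_k 1/dx_k)^(-1) for the given grid spacings."""
    if not spacings or any(h <= 0 for h in spacings):
        raise ValueError("provide one positive spacing per spatial dimension")
    return 1.0 / sum(1.0 / h for h in spacings)


dx = 0.5  # hypothetical grid spacing, identical along every axis
dt_1d = cfl_max_dt(dx)      # one-dimensional domain: dt <= dx
dt_2d = cfl_max_dt(dx, dx)  # two-dimensional x-y domain
print("1-D bound:", dt_1d)
print("2-D bound:", dt_2d)
```

With equal spacings the two-dimensional bound is half the one-dimensional one, so a solver needs twice as many time steps on top of the larger spatial grid.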

To enable multi-dimensional control in real time with low computational complexity, we propose HJB learning control and MFG learning control that approximate HJB control and MFG control in Sec. III, respectively. Via these methods, ML models learn to solve the HJB and FPK in an online way, as elaborated in the following subsections.

IV-A HJB Learning Control

HJB learning control exploits ML to enable and represent the baseline method, HJB control in Sec. III-A. The key idea is to approximate the problem of solving the HJB equation $\mathsf{H}\big{(}\psi_{i}^{*}\!(t);s_{N_{i}}\!(t)\big{)}=0$ by minimizing $\mathsf{H}\big{(}\hat{\psi}_{i}\!(t);s_{N_{i}}\!(t)\big{)}$ via a data-driven regression method as proposed in [15]. To this end, a single hidden layer ML model, hereafter referred to as an HJB model, is constructed at the UAV $u_{i}$ . Its input $s_{N_{i}}\!(t)$ is fed to $M_{\textsf{H}}$ hidden nodes with a given activation function $\sigma_{\textsf{H}}(\cdot)$ , which are fully connected to the model output $\hat{\psi}_{i}(t)$ through a weight vector $w_{i,\textsf{H}}(t)$ , i.e.,

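$\displaystyle \hat{\psi}_{i}(t)=w_{i,\textsf{H}}(t)^{\intercal}\sigma_{\textsf{H}}\big(s_{N_{i}}\!(t)\big).$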

The model is trained by adjusting $w_{i,\textsf{H}}(t)$ per each observation $s_{N_{i}}\!(t)$ , so as to minimize its cost function $L_{i,\textsf{H}}(t)$ comprising a loss function $\ell_{i,\textsf{H}}(t)$ and a regularizer $R_{i}(t)$ :

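$\displaystyle L_{i,\textsf{H}}(t)=\underbrace{\frac{1}{2}\left|\hat{\mathsf{H}}\big(\hat{\psi}_{i}(t);s_{N_{i}}\!(t)\big)\right|^{2}}_{\ell_{i,\textsf{H}}(t)}+c_{\textsf{H}}\underbrace{\max\left\{0,s_{i}(t)^{\intercal}\frac{\text{d}s_{i}(t)}{\text{d}t}\right\}}_{R_{i}(t)},$ (13)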

where $c_{\hskip 0.5pt\textsf{H}}$ is a positive constant. The loss function is intended to minimize $\hat{\mathsf{H}}\big{(}\hat{\psi}_{i}\!(t);s_{N_{i}}\!(t)\big{)}$ in (7). The regularizer is meant to stop the movement when reaching the destination, i.e., $s_{i}(T)=[r_{i}(T)^{\intercal},v_{i}(T)^{\intercal}]^{\intercal}\!=\!0$ . At time $t$ , $u_{i}$ ’s HJB learning control is given as below.

Algorithm 3. HJB Learning Control

1) Collect the states $s_{-i}(t)$ from $(N_{i}(t)-1)$ UAVs.

2) Update the weight $w_{i,\textsf{H}}(t)$ as:

$\displaystyle w_{i,\textsf{H}}(t)\!=\!w_{i,\textsf{H}}(t\!-\!\Delta t)\!-\!\mu\,\text{sign}\left(\nabla_{\!w}\ell_{i,\textsf{H}}(t)\right)\!-\!c_{\hskip 0.5pt\textsf{H}}\nabla_{\!w}R_{i}(t).$

3) Calculate the value $\hat{\psi}_{i}(t)=w_{i,\textsf{H}}(t)^{\intercal}\sigma_{\textsf{H}}(s_{N_{i}}\!(t))$.

4) Take the optimal action $a_{i}^{*}(t)\!=\!\frac{1}{2c_{3}}B^{\intercal}\nabla\hat{\psi}_{i}(t)$.

The weight update rule in 2) of Algorithm 3 is derived by applying a normalized gradient descent algorithm (NGD), modified from the gradient descent algorithm (GD) in order to avoid saddle points under non-convex loss functions [16]. To be specific, the weight update rule under GD with the step size $\mu>0$ is $w_{i,\textsf{H}}(t)=w_{i,\textsf{H}}(t-\Delta t)-\mu\nabla_{\!w}L_{i,\textsf{H}}(t)$ , which is modified as $w_{i,\textsf{H}}(t)=w_{i,\textsf{H}}(t-\Delta t)-\mu\text{sign}\left(\nabla_{\!w}L_{i,\textsf{H}}(t)\right)$ under the original NGD [16] with $\text{sign}(x)=x/\|x\|$ . As opposed to this, the weight update rule in Algorithm 3 applies the sign operation only to the loss function $\ell_{i,\textsf{H}}(t)$ in $L_{i,\textsf{H}}(t)$ , in order not to disturb $R_{i}(t)$ activations as detailed next.
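The modified update can be written in a few lines. The sketch below applies the normalization $\text{sign}(x)=x/\|x\|$ only to the loss gradient and leaves the regularizer gradient unscaled, as described above. The function names are hypothetical, the gradients are passed in as plain lists rather than computed from the HJB residual, and returning a zero direction for a zero gradient is an assumption made here to avoid division by zero.

```python
import math


def normalize(gradient):
    """Return gradient / ||gradient||, or a zero vector if the gradient vanishes."""
    length = math.sqrt(sum(g * g for g in gradient))
    if length == 0.0:
        return [0.0] * len(gradient)  # assumed convention, not from the text
    return [g / length for g in gradient]


def hjb_weight_update(w_prev, grad_loss, grad_reg, mu, c_h):
    """w(t) = w(t - dt) - mu * sign(grad of loss) - c_H * grad of regularizer."""
    direction = normalize(grad_loss)
    return [w - mu * d - c_h * g for w, d, g in zip(w_prev, direction, grad_reg)]


# Step size and regularizer weight follow Sec. V; the gradients are made-up numbers.
w_new = hjb_weight_update(w_prev=[0.0, 0.0], grad_loss=[3.0, 4.0],
                          grad_reg=[1.0, 0.0], mu=0.01, c_h=0.5)
print(w_new)
```

Because the loss step always has length $\mu$, a large regularizer gradient can dominate the update whenever $R_{i}(t)$ is active, which is the behavior the separate treatment is meant to preserve.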

The regularizer $R_{i}(t)$ aims to ensure stably reaching the destination without further movement, i.e., the terminal zero-state convergence $s_{i}(T)\!=\!0$ . With this, $R_{i}(t)$ becomes activated, i.e., $R_{i}(t)\!>\!0$ , for penalizing the loss function $\ell_{i,\textsf{H}}(t)$ , when the state change direction (the sign of $\text{d}s_{i}(t)/\text{d}t$ ) under the current control is the same as the current state direction (the sign of $s_{i}(t)$ ), i.e., $s_{i}(t)^{\intercal}\text{d}s_{i}(t)/\text{d}t>0$ . Otherwise, the current control is capable of stabilizing the state, and the regularizer is thus inactivated, i.e., $R_{i}(t)\!=\!0$ . The regularizer activations during UAV travels will be discussed in Sec. V.

In the loss function $\ell_{i,\mathsf{H}}(t)$ , the expression $\hat{\mathsf{H}}\big{(}\hat{\psi}_{i}(t);s_{N_{i}}\!(t)\big{)}$ is derived by applying $\hat{\psi}_{i}(t)$ to (7) with the same procedure as described in Algorithm 1, except for the following detail. The cost function $L_{i,\textsf{H}}(t)$ includes ${\text{d}s_{i}(t)}/{\text{d}t}$ ; namely, within $R_{i}(t)$ in (13) as well as $\ell_{i,\textsf{H}}(t)$ that contains

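$\displaystyle \partial_{t}\hat{\psi}_{i}(t)=\left(\frac{\text{d}s_{i}(t)}{\text{d}t}\right)^{\intercal}\nabla\hat{\psi}_{i}(t).$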

According to (5), this term introduces $\text{d}W_{i}(t)/\text{d}t$ that is computationally intractable. Instead, following [15], we apply the nominal state dynamics without random wind perturbations $\text{d}s_{i}(t)\!=\!(\!As_{i}(t)\!+\!B(a_{i}(t)\!+\!c_{0}v_{o}))\text{d}t$ when calculating ${\text{d}s_{i}(t)}/{\text{d}t}$ .
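The regularizer $R_{i}(t)=\max\{0,s_{i}(t)^{\intercal}\text{d}s_{i}(t)/\text{d}t\}$ under the nominal dynamics can be sketched as follows. The function names and the example states are hypothetical; the state rate is the drift of (5) with the Wiener term dropped, written out component-wise for 2-D tuples.

```python
def nominal_state_rate(r, v, a, v_wind, c0):
    """ds/dt = A s + B(a + c0 v_o), i.e. the dynamics (5) without wind noise."""
    dr = tuple(v)  # position changes with the velocity
    dv = tuple(ai - c0 * (vi - wi) for vi, ai, wi in zip(v, a, v_wind))
    return dr + dv


def regularizer(r, v, a, v_wind, c0):
    """R = max(0, s^T ds/dt): positive only if the state norm is growing."""
    s = tuple(r) + tuple(v)
    ds = nominal_state_rate(r, v, a, v_wind, c0)
    return max(0.0, sum(si * di for si, di in zip(s, ds)))


# UAV at (10, 0) moving away from the origin: the regularizer is active.
away = regularizer(r=(10.0, 0.0), v=(1.0, 0.0), a=(0.0, 0.0),
                   v_wind=(0.0, 0.0), c0=0.1)
# Same position, moving towards the origin: the regularizer is inactive.
towards = regularizer(r=(10.0, 0.0), v=(-1.0, 0.0), a=(0.0, 0.0),
                      v_wind=(0.0, 0.0), c0=0.1)
print(away, towards)
```

Since $s^{\intercal}\,\text{d}s/\text{d}t$ is half the time derivative of $\|s\|^{2}$, the penalty is nonzero exactly when the current control lets the state drift away from the zero terminal state.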

IV-B MFG Learning Control

In a similar vein to Algorithm 3, MFG learning control exploits ML to approximate the solutions of the HJB $\mathsf{H}\big{(}\psi_{i}^{*}\!(t);s_{i}(t),m(t)\big{)}\!=\!0$ and the FPK $\mathsf{F}\!\big{(}m(t);s_{i}(t),\psi_{i}^{*}\!(t)\big{)}\!\!=\!0$ induced by MFG control in Sec. III-B as the minima of $\mathsf{H}\big{(}\hat{\psi}_{i}\!(t);s_{i}(t),\hat{m}(t)\big{)}$ and $\mathsf{F}\!\big{(}\hat{m}(t);s_{i}(t),\hat{\psi}_{i}\!(t)\big{)}$ . To this end, each UAV constructs two separate ML models: the HJB model used in Algorithm 3 and an FPK model, minimizing ${\mathsf{H}}\big{(}\hat{\psi}_{i}\!(t);s_{i}(t),\hat{m}(t)\big{)}$ (see (8) with $\hat{\psi}_{i}\!(t)$ and $\hat{m}(t)$ ) and ${\mathsf{F}}\!\big{(}\hat{m}(t);s_{i}(t),\hat{\psi}_{i}\!(t)\big{)}$ (see (10) with $\hat{\psi}_{i}\!(t)$ and $\hat{m}(t)$ ), respectively. The FPK model has the same structure with $M_{\textsf{F}}$ hidden nodes, and produces the approximated MF distribution $\hat{m}(t)$ by adjusting its weight vector $w_{i,\textsf{F}}(t)$ , i.e.,

$\displaystyle \hat{m}(t)=w_{i,\textsf{F}}(t)^{\intercal}\sigma_{\textsf{F}}\big(s_{i}(t)\big).$

Per each observation $s_{i}(t)$ , the FPK model is trained by adjusting $w_{i,\textsf{F}}(t)$ so as to minimize the cost function $L_{i,\textsf{F}}(t)$ :

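$\displaystyle L_{i,\textsf{F}}(t)=\frac{1}{2}\left|\mathsf{F}\big(\hat{m}(t);s_{i}(t),\hat{\psi}_{i}(t)\big)\right|^{2}.$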

The HJB model’s cost function is the same as (13), except for replacing its ${\mathsf{H}}\big{(}\hat{\psi}_{i}(t);s_{N_{i}}\!(t)\big{)}$ with ${\mathsf{H}}\big{(}\hat{\psi}_{i}\!(t);s_{i}(t),\hat{m}(t)\big{)}$. At time $t$, UAV $u_{i}$’s MFG learning control is described in Algorithm 4.

Algorithm 4. MFG Learning Control

For $k\in[1,K]$ :

1) Update the weight $w_{i,\textsf{H}}^{[k+1]}(t)$ as:

$\displaystyle w_{i,\textsf{H}}^{[k+1]}(t)\!=\!w_{i,\textsf{H}}^{[k]}(t)\!-\!\mu\text{sign}(\nabla_{\!w}\ell_{i,\textsf{H}}^{[k]}(t))-c_{\textsf{H}}\nabla_{\!w}R_{i}^{[k]}(t).$

2) Calculate the value $\hat{\psi}_{i}^{[k]}\!(t)=w_{i,\textsf{H}}^{[k]}(t)^{\intercal}\sigma_{\textsf{H}}(s_{i}(t))$ .

3) Update the weight $w_{i,\textsf{F}}^{[k+1]}(t)$ as:

$\displaystyle w_{i,\textsf{F}}^{[k+1]}(t)\!=\!w_{i,\textsf{F}}^{[k]}(t)\!-\!\mu\text{sign}\left(\nabla_{\!w}L_{i,\textsf{F}}^{[k]}(t)\right).$

4) Obtain the MF distribution $\hat{m}^{[k]}\!(t)\!=\!w_{i,\textsf{F}}^{[k]}(t)^{\intercal}\!\sigma_{\textsf{F}}(s_{i}(t))$ .

5) Iterate 1-4) until $k=K$ .

6) Take the optimal action $a_{i}^{*}(t)\!=\!\frac{1}{2c_{3}}B^{\intercal}\nabla\hat{\psi}_{i}^{[K]}\!(t)$ .

Initial MF distribution $\hat{m}^{[0]}\!(t)$ at $k=0$ :

• If $t=0$ , $\hat{m}^{[0]}\!(0)=\frac{1}{N}\sum_{i=1}^{N}\mathds{1}_{s_{i}(0)}$ , computed by collecting the states $\bar{s}_{i}(0)$ from $N$ UAVs.

• Otherwise, $\hat{m}^{[0]}\!(t)=\hat{m}^{[K-1]}\!(t-\Delta t)$ .
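The alternating structure of Algorithm 4 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `mfg_learning_iterations`, the callback arguments, and the toy quadratic gradients in the usage part are all hypothetical stand-ins, since the actual residuals in (8) and (10) and the regularizer $R_{i}(t)$ are defined elsewhere in the paper. The sketch also simplifies the iteration indexing by evaluating each model output right after its weight update, and it omits the final action step.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def sign(x: float) -> float:
    """Return -1.0, 0.0, or 1.0 according to the sign of x."""
    return float((x > 0) - (x < 0))


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mfg_learning_iterations(
    w_H: Vector,
    w_F: Vector,
    m_hat: float,
    feat_H: Vector,  # sigma_H(s_i(t)), hidden-node activations of the HJB model
    feat_F: Vector,  # sigma_F(s_i(t)), hidden-node activations of the FPK model
    grad_loss_H: Callable[[Vector, float], Vector],  # assumed: gradient of the HJB cost
    grad_reg: Callable[[Vector], Vector],            # assumed: gradient of the regularizer
    grad_loss_F: Callable[[Vector, float], Vector],  # assumed: gradient of the FPK cost
    mu: float = 0.01,
    c_H: float = 0.5,
    K: int = 10,
) -> Tuple[Vector, Vector, float, float]:
    """Alternate K sign-gradient updates between the HJB and FPK models."""
    psi_hat = dot(w_H, feat_H)
    for _ in range(K):
        # Step 1: sign-gradient update of the HJB weights plus regularizer term.
        g_H, g_R = grad_loss_H(w_H, m_hat), grad_reg(w_H)
        w_H = [w - mu * sign(g) - c_H * r for w, g, r in zip(w_H, g_H, g_R)]
        # Step 2: approximate value from the HJB model.
        psi_hat = dot(w_H, feat_H)
        # Step 3: sign-gradient update of the FPK weights.
        g_F = grad_loss_F(w_F, psi_hat)
        w_F = [w - mu * sign(g) for w, g in zip(w_F, g_F)]
        # Step 4: approximate MF distribution from the FPK model.
        m_hat = dot(w_F, feat_F)
    return w_H, w_F, psi_hat, m_hat


# Toy usage with placeholder quadratic costs (NOT the paper's residuals).
feat_H, feat_F = [1.0, 2.0], [1.0, 0.5]
toy_grad_H = lambda w, m: [(dot(w, feat_H) - 1.0) * f for f in feat_H]
toy_grad_R = lambda w: [0.0 for _ in w]
toy_grad_F = lambda w, psi: [(dot(w, feat_F) - 1.0) * f for f in feat_F]

w_H, w_F, psi_hat, m_hat = mfg_learning_iterations(
    [0.0, 0.0], [0.0, 0.0], 0.0, feat_H, feat_F,
    toy_grad_H, toy_grad_R, toy_grad_F, mu=0.01, c_H=0.5, K=10,
)
```

Because the update uses only the sign of the gradient, every weight with a nonzero gradient moves by exactly $\mu$ per iteration, which is why the toy weights above all end near $K\mu=0.1$.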

V Numerical Results

In this section, we numerically compare the performance of HJB and MFG learning controls in terms of travel time, energy consumption, and collision avoidance. For each travel, $N$ UAVs are dispatched to the origin from the source, which is a square centered at $(150,100)$ in meters. At the source, adjacent UAVs are separated by $\sqrt{2}$ m (see Fig. 3-a), and each UAV’s velocity is solely determined by the wind dynamics with $V_{o}=0.1I$ and $v_{o}=(1,-1)$ in m/s. Under MFG learning control, hereafter denoted as MFG, all UAVs are assumed to exchange their states at the source. Under HJB learning control, before every control action, each UAV exchanges its state with the UAVs within a communication range of $d$ meters, henceforth referred to as $\textsf{HJB}_{d}$ , without incurring interference thanks to frequency division multiple access (FDMA).

For an HJB or FPK model, following [15], a single-hidden-layer model is constructed, wherein each hidden node’s activation function corresponds to one non-scalar term in a polynomial expansion. The polynomial is heuristically chosen as: $(1+x_{i}(t)+v_{x,i}(t))^{6}+(1+y_{i}(t)+v_{y,i}(t))^{6}$ for $\sigma_{\textsf{H}}(s_{i}(t))$ and $(1+x_{i}(t)+v_{x,i}(t)+y_{i}(t)+v_{y,i}(t))^{4}$ for $\sigma_{\textsf{F}}(s_{i}(t))$ , where $r_{i}(t)=[x_{i}(t),y_{i}(t)]^{\intercal}$ and $v_{i}(t)=[v_{x,i}(t),v_{y,i}(t)]^{\intercal}$ . Compared to sigmoidal activations, polynomial activations enable smaller model sizes (i.e., $M_{\textsf{H}}\!=\!54$ , $M_{\textsf{F}}\!=\!69$ ), yet the resulting models are known to be less robust against unseen state observations. Optimizing the model architecture is an interesting topic for future research. Other simulation parameters are summarized as follows: $\Delta t=1\text{s}$ , $\alpha=2$ , $\sigma^{2}=2^{-2}$ mW, $\theta=-10$ dB, $c_{0}=0.1$ , $c_{1}=100$ , $c_{2}=c_{3}=1.5$ , $c_{4}=0.5$ , $c_{\hskip 0.5pt\textsf{H}}=0.5$ , $\varepsilon=0.001$ , $\mu=0.01$ , and $w_{i,\textsf{H}}(0)=w_{i,\textsf{F}}(0)=0$ .
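The stated model sizes can be cross-checked by counting the non-scalar monomials in each polynomial expansion. The short Python sketch below does this by brute-force enumeration; the helper name `count_nonconstant_monomials` is hypothetical, and the count assumes that each distinct monomial (ignoring its coefficient) corresponds to one hidden node.

```python
from itertools import product


def count_nonconstant_monomials(num_vars: int, degree: int) -> int:
    """Count distinct monomials of total degree 1..degree in num_vars variables.

    Expanding (1 + z_1 + ... + z_n)^d yields every monomial of total degree
    at most d; dropping the constant term leaves the non-scalar terms.
    """
    count = 0
    for exponents in product(range(degree + 1), repeat=num_vars):
        if 1 <= sum(exponents) <= degree:
            count += 1
    return count


# HJB features: two separate degree-6 expansions, in (x, v_x) and in (y, v_y).
# The two expansions share no variables, so their non-constant terms are distinct.
M_H = 2 * count_nonconstant_monomials(num_vars=2, degree=6)

# FPK features: one degree-4 expansion in (x, v_x, y, v_y).
M_F = count_nonconstant_monomials(num_vars=4, degree=4)

print(M_H, M_F)  # prints "54 69"
```

Equivalently, the counts are $2\big(\binom{8}{2}-1\big)=54$ and $\binom{8}{4}-1=69$, matching $M_{\textsf{H}}$ and $M_{\textsf{F}}$ above.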

Figure 3 visualizes the trajectories of $25$ UAVs under $\textsf{HJB}_{1}$ , $\textsf{HJB}_{\text{100}}$ , and MFG. During the entire travel, UAVs under $\textsf{HJB}_{1}$ hardly communicate with each other. This makes their trajectories almost identical, causing frequent collisions, where a collision is counted whenever an inter-UAV distance is less than $0.1$ m. Focusing on $\textsf{HJB}_{\text{100}}$ and MFG, at the beginning all UAVs tend to follow the average wind direction to save motion energy, and then turn towards the destination. At this north-eastern turning point, $\textsf{HJB}_{\text{100}}$ fails to avoid collisions due to its less trained HJB model. By contrast, MFG incurs no collision thanks to the locally iterated training operations between the HJB and FPK models (see $K$ iterations in Algorithm 4), which yield a better-trained HJB model (i.e., less variance in weight parameters), as observed in the rightmost subplot of Fig. 3-c. After the turning point, the UAV fleet makes a long-distance flight. MFG shows the highest flight velocity owing to its better flocking, which partly compensates for the longer travel distance needed to guarantee collision avoidance. Finally, in the last part of the travel, UAVs tend to hover around the destination in order to stop their movement while reaching the destination (i.e., $v_{i}(T)=r_{i}(T)=0$ ), which is detailed next.
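The collision criterion above can be made concrete with a small Python sketch. It counts, for a single time instant, the UAV pairs whose distance falls below the $0.1$ m threshold; the function name `count_collisions` and the sample positions are hypothetical, and whether the paper accumulates the count per pair, per UAV, or per time step is not specified here, so the per-pair counting is an assumption.

```python
from itertools import combinations
from math import dist
from typing import List, Tuple

Position = Tuple[float, float]


def count_collisions(positions: List[Position], threshold: float = 0.1) -> int:
    """Count unordered UAV pairs closer than `threshold` meters at one instant."""
    return sum(1 for p, q in combinations(positions, 2) if dist(p, q) < threshold)


# Hypothetical snapshot of four UAV positions in meters.
snapshot = [(0.0, 0.0), (0.05, 0.0), (1.0, 1.0), (1.0, 1.08)]
n_collisions = count_collisions(snapshot)
print(n_collisions)  # prints 2
```

Summing this quantity over all control instants would give one possible accumulated collision count for a travel.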

Fig. 4 illustrates the accumulated number of regularizer $R_{i}(t)$ activations (see the details in Sec. IV-A) in the HJB models of $\textsf{HJB}_{1}$ , $\textsf{HJB}_{\text{100}}$ , and MFG as time elapses. For all controls, $R_{i}(t)$ is more frequently activated near the destination (i.e., $t\geq 100$ s) so as to reduce the velocity, thereby avoiding excessive hovering around and/or passing by the destination. Note that a better flocking behavior (i.e., lower inter-UAV velocities without collision) enables a more stable control without the regularization. For this reason, MFG, which achieves the best flocking behavior, shows the fewest $R_{i}(t)$ activations. With more UAVs, MFG yields less frequent $R_{i}(t)$ activations. This is because the MF approximation (see Sec. III-B) becomes more accurate as the number of UAVs increases, yielding better flocking behavior earlier.

Lastly, Fig. 5 compares the communication, computation, and motion energy consumptions of $\textsf{HJB}_{\text{100}}$ and MFG during the entire travel. Each energy is averaged over UAVs, and is normalized by the energy of $\textsf{HJB}_{1}$ . We consider that communication, computation, and motion energy consumptions are proportional to the number of state exchanges, the number of gradient calculations, and $\|v_{i}(t)\|^{2}\!+\!\|a_{i}(t)\|^{2}$ , respectively. Focusing on communication energy, MFG exchanges UAV states only once at the source, whereas $\textsf{HJB}_{\text{100}}$ does so for every observation. Therefore, MFG consumes significantly less communication energy, irrespective of the number of UAVs, as opposed to $\textsf{HJB}_{\text{100}}$ whose energy increases with the number of communicating UAVs. Next, motion energy is proportional to the travel distance. As MFG yields a longer travel distance for avoiding collision, it consumes more motion energy. Computation energy is also proportional to the travel distance under online learning. Besides, in contrast to $\textsf{HJB}_{\text{100}}$ having only an HJB model, MFG performs gradient calculations for both HJB and FPK models, which makes MFG consume more computation energy.
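As a rough illustration of the motion energy proxy and its normalization, the Python sketch below sums $\|v_{i}(t)\|^{2}+\|a_{i}(t)\|^{2}$ over the control instants of one UAV and divides by a baseline value. The function names and the sample trajectories are hypothetical, the proportionality constant is taken as one, and the baseline value merely stands in for the $\textsf{HJB}_{1}$ energy.

```python
from typing import List, Tuple

Vec2 = Tuple[float, float]


def motion_energy_proxy(velocities: List[Vec2], accelerations: List[Vec2]) -> float:
    """Sum ||v(t)||^2 + ||a(t)||^2 over all control instants of one UAV."""
    return sum(
        vx * vx + vy * vy + ax * ax + ay * ay
        for (vx, vy), (ax, ay) in zip(velocities, accelerations)
    )


def normalize(energy: float, baseline: float) -> float:
    """Express an energy relative to a baseline scheme's energy."""
    return energy / baseline


# Hypothetical two-step trajectory of a single UAV.
velocities = [(1.0, 0.0), (0.0, 2.0)]
accelerations = [(0.0, 1.0), (0.0, 0.0)]
energy = motion_energy_proxy(velocities, accelerations)
relative = normalize(energy, baseline=3.0)  # baseline value is made up
print(energy, relative)  # prints "6.0 2.0"
```

The same normalization applies to the communication and computation proxies once the numbers of state exchanges and gradient calculations are counted.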

VI Conclusion

To control massive autonomous UAVs, in this work we proposed an MFG learning control algorithm that enables each UAV’s real-time acceleration control in a distributed manner, by training and exploiting HJB and FPK ML models in an online way. Our simulation validated that MFG learning control guarantees collision avoidance with low communication energy, at the cost of a slight increase in computation and motion energy, compared to a baseline scheme, HJB learning control. The effectiveness of MFG learning control hinges on the level of the HJB and FPK model training. Collaborative HJB and FPK model training across UAVs via federated learning frameworks [17] could thus be an interesting topic for future work.

Bibliography


[1] H. Kim, J. Park, M. Bennis, and S.-L. Kim, “Massive UAV-to-ground communication and its stable movement control: A mean-field approach,” in Proc. IEEE SPAWC, Kalamata, Greece, Jun. 2018.
[2] E. Ackerman and E. Strickland, “Medical delivery drones take flight in East Africa,” IEEE Spectrum, vol. 55, no. 1, pp. 34–35, Jan. 2018.
[3] J. Tisdale, Z. Kim, and J. K. Hedrick, “Autonomous UAV path planning and estimation,” IEEE Robot. Autom. Mag., vol. 16, no. 2, pp. 35–42, Jun. 2009.
[4] M. Huang, P. E. Caines, and R. P. Malhamé, “Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized $\varepsilon$-Nash equilibria,” IEEE Trans. Autom. Control, vol. 52, no. 9, pp. 1560–1571, Sep. 2007.
[5] J.-M. Lasry and P.-L. Lions, “Mean field games,” Japan. J. Math., vol. 2, no. 1, pp. 229–260, Mar. 2007.
[6] R. Courant, K. Friedrichs, and H. Lewy, “On the partial difference equations of mathematical physics,” IBM J. Res. Dev., vol. 11, no. 2, pp. 215–234, Mar. 1967.
[7] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, “Efficient deployment of multiple unmanned aerial vehicles for optimal wireless coverage,” IEEE Commun. Lett., vol. 20, no. 8, pp. 1647–1650, Aug. 2016.
[8] I. K. Nikolos, K. P. Valavanis, N. C. Tsourveloudis, and A. N. Kostaras, “Evolutionary algorithm based offline/online path planner for UAV navigation,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 33, no. 6, pp. 898–912, Dec. 2003.