Deep Model Predictive Control

Prabhat K. Mishra; Mateus V. Gasparino; Andres E. B. Velasquez; Girish Chowdhary

arXiv:2302.13558·eess.SY·February 28, 2023


TL;DR

This paper introduces a deep learning-based model predictive control method for nonlinear systems with unknown uncertainties, using neural networks for disturbance approximation and a tube-based controller for stability and constraint satisfaction.

Contribution

It proposes a novel integration of deep neural networks with tube-based MPC to handle unknown, state-dependent uncertainties in nonlinear systems.

Findings

01

Neural networks effectively approximate unknown disturbances.

02

The combined approach guarantees constraint satisfaction.

03

Closed-loop stability is maintained during learning.

Abstract

This paper presents a deep learning based model predictive control algorithm for control affine nonlinear discrete time systems with matched and bounded state-dependent uncertainties of unknown structure. Since the structure of uncertainties is not known, a deep neural network (DNN) is employed to approximate the disturbances. In order to avoid any unwanted behavior during the learning phase, a tube based model predictive controller is employed, which ensures satisfaction of constraints and input-to-state stability of the closed-loop states.

Equations

$$x_{t+1}=f(x_{t})+g(x_{t})\left(u_{t}+h(x_{t})\right)\tag{1}$$

$$u_{t}=u_{t}^{a}+u_{t}^{m}\tag{2}$$

$$x_{t+1}=f(x_{t})+g(x_{t})u_{t}^{m}\coloneqq\bar{f}(x_{t},u_{t}^{m})\tag{3}$$

$$x_{t+1}=\bar{f}(x_{t},u_{t}^{m})+g(x_{t})\left(u_{t}^{a}+h(x_{t})\right)\tag{4}$$

$$\mathds{W}^{\prime}\coloneqq\{v\in\mathds{R}^{d}\mid\left\|v\right\|\leqslant w_{\max}^{\prime}\},\quad\mathds{U}^{\prime}\coloneqq\{v\in\mathds{R}^{m}\mid\left\|v\right\|_{\infty}\leqslant u_{\max}-u_{\max}^{a}\}$$

$$(x_{t}^{r})_{t\in\mathds{N}_{0}}\tag{5}$$

$$\min_{(u_{i}^{r})_{i=0}^{N-1}}\ \ldots\quad\text{s.t.}\quad x_{i+1}^{r}=\bar{f}(x_{i}^{r},u_{i}^{r}),\quad x_{i}^{r}\in\mathcal{X}_{r}\subset\mathcal{X},\quad u_{i}^{r}\in\mathds{U}_{r}\subset\mathds{U}^{\prime};\quad i=0,\ldots,N-1$$

$$c_{\mathrm{s}}(x_{t+i\mid t},u_{t+i\mid t})\coloneqq\left\|x_{t+i\mid t}-x_{t+i}^{r}\right\|_{Q}^{2}+\left\|u_{t+i\mid t}-u_{t+i}^{r}\right\|_{R}^{2}$$

$$\mathcal{X}_{f}\coloneqq\{x\in\mathds{R}^{d}\mid c_{\mathrm{f}}(x)\leqslant\alpha;\ \alpha>0\}$$

$$c_{\mathrm{f}}(\bar{f}(x,u^{\prime}))-c_{\mathrm{f}}(x)\leqslant-c_{\mathrm{s}}(x,u^{\prime})\quad\text{for every }x\in\mathcal{X}_{f}$$

$$V_{m}\left(x_{t\mid t},(u_{t+i\mid t})_{i=0}^{N-1}\right)\coloneqq c_{\mathrm{f}}(x_{t+N\mid t})+\sum_{i=0}^{N-1}c_{\mathrm{s}}(x_{t+i\mid t},u_{t+i\mid t})\tag{9}$$

$$x_{t\mid t}=x_{t}\tag{10}$$

$$u_{t\mid t}+u_{t}^{a}\in\mathds{U}\tag{11}$$

$$x_{t+i+1\mid t}=\bar{f}(x_{t+i\mid t},u_{t+i\mid t})\quad\text{for }i=0,\ldots,N-1\tag{12}$$

$$u_{t+i\mid t}\in\mathds{U}^{\prime}\quad\text{for }i=1,\ldots,N-1\tag{13}$$

$$V_{m}(x_{t})\coloneqq\min_{(u_{t+i\mid t})_{i=0}^{N-1}}\left\{(9)\mid(10),(11),(12),(13)\right\}\tag{14}$$

$$x_{t+1}=Ax_{t}+B\left(u_{t}+h(x_{t})\right)$$

$$v_{t}=\begin{cases}4&\text{for }t\in[50\ell,50\ell+49],\ \ell=0,2,4,\ldots\\0&\text{otherwise}\end{cases}$$

$$V_{0}^{\top}=\begin{bmatrix}0.8&0.2314&0.6918&-0.6245&0.0095&0.0214\end{bmatrix}$$

$$\mathcal{X}=\left[-\frac{\pi}{6},\frac{\pi}{6}\right]\times\left[-\frac{\pi}{3},\frac{\pi}{3}\right],\quad\text{and}\quad\mathds{U}=\left[-\frac{\pi}{4},\frac{\pi}{4}\right]$$

$$h(x)=W_{L}^{\top}\psi_{L}\left[W_{L-1}^{\top}\psi_{L-1}\left[\cdots\left[\psi_{1}(x)\right]\right]\right]+\varepsilon^{\ast}(x)$$

$$h(x_{t})=W^{\ast\top}\phi^{\ast}(x_{t})+\varepsilon^{\ast}(x_{t})\tag{17}$$

$$h(x_{t})=W^{\ast\top}\phi_{j}(x_{t})+\varepsilon_{j}(x_{t})\tag{18}$$

$$u_{t}^{a}=-K_{t}^{\top}\phi_{j}(x_{t})\tag{19}$$

$$\bar{K}_{t+1}=K_{t}+\frac{\theta}{\left\|\phi_{j}(x_{t})\right\|^{2}}\phi_{j}(x_{t})\left(g(x_{t})^{\dagger}\left(x_{t+1}-\bar{f}(x_{t},u_{t}^{m})\right)\right)^{\top}\tag{20}$$

$$K_{t}^{(i)}=\operatorname{Proj}\bar{K}_{t}^{(i)}=\begin{cases}\bar{K}_{t}^{(i)}&\text{if }\left\|\bar{K}_{t}^{(i)}\right\|\leqslant\bar{W}_{i}\\\frac{\bar{W}_{i}}{\left\|\bar{K}_{t}^{(i)}\right\|}\bar{K}_{t}^{(i)}&\text{otherwise}\end{cases}\tag{21}$$

$$\left\|u_{t}^{a}\right\|=\left\|K_{t}^{\top}\phi_{j}(x_{t})\right\|\leqslant\left\|K_{t}\right\|\left\|\phi_{j}(x_{t})\right\|\leqslant\left\|K_{t}\right\|_{F}\sigma\leqslant\bar{W}\sigma\eqqcolon u_{\max}^{a}$$

$$\left\|g(x_{t})\left(u_{t}^{a}+h(x_{t})\right)\right\|\leqslant\left\|g(x_{t})u_{t}^{a}\right\|+\left\|g(x_{t})h(x_{t})\right\|\leqslant\delta_{g}\left\|u_{t}^{a}\right\|+w_{\max}\leqslant\delta_{g}u_{\max}^{a}+w_{\max}\eqqcolon w_{\max}^{\prime}$$

$$\ell\left((x_{t},u_{t}^{a}),W_{1:L-1}\right)\coloneqq\left\|u_{t}^{a}+K_{T_{k}}^{\top}\psi_{L}\left[W_{L-1}^{\top}\psi_{L-1}\left[\cdots\left[\psi_{1}(x_{t})\right]\right]\right]\right\|^{2}$$

$$\mathcal{L}(\mathcal{D}_{k},W_{1:L-1})=\frac{1}{p_{0}}\sum_{i=1}^{p_{0}}\ell\left(x^{i},W_{1:L-1}\right)$$


Taxonomy

Topics: Advanced Control Systems Optimization · Fault Detection and Control Systems · Fuzzy Logic and Control Systems

Full text

Prabhat K. Mishra, Mateus V. Gasparino, Andres E. B. Velasquez, Girish Chowdhary

UIUC, USA


Keywords: safety critical systems, deep learning, model predictive control, adaptive control

1 Introduction

Modeling errors and environmental uncertainties are unavoidable in practice. Therefore, purely model-based controllers tend to exhibit unexpected or unwanted behaviors in the real world. One key solution to this problem is to employ learning-based methods that utilize powerful learning elements such as deep neural networks (DNN). Such methods attempt to learn a good model of the underlying nonlinear dynamics while the system is in operation, in a manner that does not compromise safety and performance. We refer readers to [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] and references therein.

To address the above challenge, the available domain knowledge, in the form of an approximate model, is utilized in [12, 13] along with the learning elements. We refer readers to an excellent survey on safe reinforcement learning [14] and references therein. One key approach for safe learning is to augment the learning-based controller with model predictive control (MPC) and related methods to guarantee safety through constraint satisfaction and to improve the performance over time [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]. The proper pairing of learning and MPC can bring together the useful features of both methods while compensating for their drawbacks.

Our main goal in this article is to address these gaps by creating a learning-based MPC architecture with performance and safety guarantees. When uncertainties are structured, they can be simply represented in terms of (possibly) high dimensional feature basis functions and the learning mechanism acts on the disturbances [28, 29, 30, 31, 32, 33]. These disturbance rejecting actions taken by the learning mechanism are experienced by the MPC controller as additional disturbances. If the learning mechanism eventually rejects the disturbance, then MPC can ensure asymptotic convergence of closed-loop states while satisfying the underlying constraints [34]. In this article, we extend the results of [34] to unstructured uncertainties.

We present the problem setup in §2. The formulation of the Deep MPC controller is given in §3. We validate our theoretical results with the help of a numerical experiment in §4 and conclude in §5. The real-time implementable training mechanism of the DNN, the stability of the overall algorithm, and the proofs are given in the appendix.

We let $\mathds{R},\mathds{N}_{0},\mathds{Z}_{+}$ denote the sets of real numbers, non-negative integers and positive integers, respectively. For a given vector $v$ and positive (semi)-definite matrix $M\succeq\mathbf{0}$ , $\left\|v\right\|_{M}^{2}$ is used to denote $v^{\top}Mv$ . For a given matrix $A$ , the trace, the largest eigenvalue, pseudo-inverse and Frobenius norm are denoted by $\operatorname{tr}(A)$ , $\lambda_{\max}(A)$ , $A^{\dagger}$ and $\left\|A\right\|_{F}$ , respectively. By notation $\left\|A\right\|$ and $\left\|A\right\|_{\infty}$ , we mean the standard $2-$ norm and $\infty-$ norm, respectively, when $A$ is a vector, and induced $2-$ norm and $\infty-$ norm, respectively, when $A$ is a matrix. A vector or a matrix with all entries $0$ is represented by $\mathbf{0}$ and $I$ is an identity matrix of appropriate dimensions. We let $M^{(i)}$ denote the $i^{\text{th}}$ column of a given matrix $M$ .

2 Problem setup

Let us consider a discrete time dynamical system

$$x_{t+1}=f(x_{t})+g(x_{t})\left(u_{t}+h(x_{t})\right),\tag{1}$$

where

(1-a) $x_{t}\in\mathcal{X}\subset\mathds{R}^{d}$ , $u_{t}\in\mathds{U}\coloneqq\{v\in\mathds{R}^{m}\mid\left\|v\right\|_{\infty}\leqslant u_{\max}\}$ , and $\mathcal{X}\subset\mathds{R}^{d}$ is a compact set,

(1-b) the system function $f:\mathds{R}^{d}\rightarrow\mathds{R}^{d}$ and the control influence function $g:\mathds{R}^{d}\rightarrow\mathds{R}^{d\times m}$ are given Lipschitz continuous functions and represent domain knowledge or prior knowledge of the system dynamics,

(1-c) $h(x_{t})$ is the state dependent matched uncertainty at time $t$ such that $g(x_{t})h(x_{t})\in\mathds{W}\coloneqq\{v\in\mathds{R}^{d}\mid\left\|v\right\|\leqslant w_{\max}\}$ , $h$ is continuous, $\left\|g(x_{t})\right\|\leqslant\delta_{g}$ for some $\delta_{g}>0$ and $\operatorname{rank}(g(x_{t}))=m$ for every $x_{t}\in\mathds{R}^{d}$ .

The term $f(x_{t})+g(x_{t})u_{t}$ in the right hand side of (1) represents the prior knowledge of the dynamics and the remaining term $g(x_{t})h(x_{t})$ represents the unknown part of the dynamics or uncertainties. We refer readers to [35, 36, 37, 38, 39, 40] for a few related problem formulations.

3 Deep Model predictive controller

Our proposed solution is based on the constraint satisfaction and cost minimization capabilities of MPC, and on the universal approximation property of neural networks. We split the applied control $u_{t}$ as

$$u_{t}=u_{t}^{a}+u_{t}^{m},\tag{2}$$

where $u_{t}^{a}$ is the output of the DNN and $u_{t}^{m}$ is the MPC component at time $t$ . The relevant details about the DNN are given in Appendix §A. The MPC controller employs only the nominal dynamics of (1), which is given below for easy reference:

$$x_{t+1}=f(x_{t})+g(x_{t})u_{t}^{m}\coloneqq\bar{f}(x_{t},u_{t}^{m}).\tag{3}$$

Therefore, the dynamics (1) can be written as

$$x_{t+1}=\bar{f}(x_{t},u_{t}^{m})+g(x_{t})\left(u_{t}^{a}+h(x_{t})\right).\tag{4}$$

Notice that in (4), the term $g(x_{t})\left(u_{t}^{a}+h(x_{t})\right)$ is independent of the MPC control component $u_{t}^{m}$ . Therefore, MPC experiences it as a disturbance. In a broader sense, the MPC component $u_{t}^{m}$ is responsible for input-to-state stability (ISS) of closed-loop states in the presence of bounded disturbances, and the DNN component $u_{t}^{a}$ acts on $h(x_{t})$ . In particular, the job of $u_{t}^{a}$ is to approximate $-h(x_{t})$ and keep the approximation error uniformly bounded with a known bound so that MPC can always experience a bounded disturbance.
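As a sanity check on this decomposition, the following sketch simulates one step of a toy system and confirms that the mismatch between the true next state and the nominal prediction equals $g(x_{t})(u_{t}^{a}+h(x_{t}))$. The functions `f`, `g`, `h` and all numerical values are hypothetical choices made for illustration; they are not taken from the paper.

```python
import numpy as np

# Hypothetical toy model (d = 2, m = 1); f, g and h are illustrative only.
def f(x):
    return np.array([x[0] + 0.05 * x[1], x[1]])

def g(x):
    return np.array([[0.0], [0.05]])

def h(x):
    # "Unknown" matched uncertainty, used here only to simulate the plant.
    return np.array([0.3 * np.sin(x[0])])

def f_bar(x, u_m):
    # Nominal dynamics: the only model the MPC component uses.
    return f(x) + g(x) @ u_m

def step(x, u_m, u_a):
    # True dynamics with the split control u = u_a + u_m.
    return f(x) + g(x) @ (u_a + u_m + h(x))

x = np.array([0.1, -0.2])
u_m = np.array([0.4])    # MPC component
u_a = np.array([-0.02])  # DNN component (ideally close to -h(x))

x_next = step(x, u_m, u_a)
# Apparent disturbance seen by the MPC: g(x) (u_a + h(x)).
apparent = x_next - f_bar(x, u_m)
residual = g(x) @ (u_a + h(x))
```

If `u_a` were exactly `-h(x)`, the true step would coincide with the nominal prediction, which is the ideal the learning component works toward.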

Deep MPC builds on the celebrated tube-based MPC [41], with some differences that arise from the inclusion of the DNN component $u_{t}^{a}$ . Tube-based MPC ensures that the closed-loop states stay within a tube around a reference trajectory. For regulation problems, the trackable reference trajectory is obtained by solving a reference governor problem offline under the tightened constraints. Once a trackable reference trajectory is obtained by spending only a part of the available control authority, a reference tracking problem without state constraints is solved online, and this problem utilizes the full control authority.

Constraint tightening in the reference governor allows satisfaction of the actual constraints by the actual states and actual actions. Knowledge of the exact bound on the disturbance is therefore needed to tighten the constraints. Although the disturbance in the dynamical system (1) at time $t$ is $g(x_{t})h(x_{t})$ , the disturbance experienced by MPC is $g(x_{t})(u_{t}^{a}+h(x_{t}))$ , which can be shown to be uniformly bounded through a careful design of the DNN and its training mechanism. More details about getting the bounds $\left\|u_{t}^{a}\right\|\leqslant u_{\max}^{a}$ and $\left\|g(x_{t})(u_{t}^{a}+h(x_{t}))\right\|\leqslant w_{\max}^{\prime}$ are given in the Appendix §A.1. Therefore, we re-define the disturbance set and control set as follows:

$$\mathds{W}^{\prime}\coloneqq\{v\in\mathds{R}^{d}\mid\left\|v\right\|\leqslant w_{\max}^{\prime}\},\quad\mathds{U}^{\prime}\coloneqq\{v\in\mathds{R}^{m}\mid\left\|v\right\|_{\infty}\leqslant u_{\max}-u_{\max}^{a}\}.$$

These modifications in tube-based MPC are already pointed out in [34, 26, 42]. For some optimization horizon $N\in\mathds{Z}_{+}$ , an offline reference governor is utilized to generate a reference trajectory

$$(x_{t}^{r})_{t\in\mathds{N}_{0}}.\tag{5}$$

In particular, the reference trajectory (5) is obtained by solving the following optimal control problem with penalty matrices $Q,R\succ 0$ and tightened sets $\mathcal{X}_{r},\mathds{U}_{r}$ :

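$$\min_{(u_{i}^{r})_{i=0}^{N-1}}\ \ldots\quad\text{s.t.}\quad x_{i+1}^{r}=\bar{f}(x_{i}^{r},u_{i}^{r}),\quad x_{i}^{r}\in\mathcal{X}_{r}\subset\mathcal{X},\quad u_{i}^{r}\in\mathds{U}_{r}\subset\mathds{U}^{\prime};\quad i=0,\ldots,N-1,$$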

where $\bar{f}$ is defined in (3). The tightened constraint sets $\mathcal{X}_{r}$ and $\mathds{U}_{r}$ can be obtained by following the approach of [41, §7]. In order to design the online reference tracking MPC, we first choose an optimization horizon $N\in\mathds{Z}_{+}$ and positive definite matrices $Q,R\succ\mathbf{0}$ , which can be different from those chosen for the reference governor. Let

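$$c_{\mathrm{s}}(x_{t+i\mid t},u_{t+i\mid t})\coloneqq\left\|x_{t+i\mid t}-x_{t+i}^{r}\right\|_{Q}^{2}+\left\|u_{t+i\mid t}-u_{t+i}^{r}\right\|_{R}^{2}$$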

be the cost per stage at time $t+i$ predicted at time $t$ and let $c_{\mathrm{f}}(x)\coloneqq x^{\top}Q_{f}x$ be the terminal cost with $Q_{f}\succ 0$ . The terminal cost $c_{\mathrm{f}}$ is treated as a local control Lyapunov function within a terminal set

$$\mathcal{X}_{f}\coloneqq\{x\in\mathds{R}^{d}\mid c_{\mathrm{f}}(x)\leqslant\alpha;\ \alpha>0\}$$

as in [41] by making the following assumption:

Assumption 1.

There exists a control $u^{\prime}\in\mathds{U}^{\prime}$ such that the following holds

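$$c_{\mathrm{f}}(\bar{f}(x,u^{\prime}))-c_{\mathrm{f}}(x)\leqslant-c_{\mathrm{s}}(x,u^{\prime})\quad\text{for every }x\in\mathcal{X}_{f}.$$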

The above assumption is standard in the literature. Refer to [41, §4] for more details with a minor modification, which we made here for simplicity. Let us define

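$$V_{m}\left(x_{t\mid t},(u_{t+i\mid t})_{i=0}^{N-1}\right)\coloneqq c_{\mathrm{f}}(x_{t+N\mid t})+\sum_{i=0}^{N-1}c_{\mathrm{s}}(x_{t+i\mid t},u_{t+i\mid t}).\tag{9}$$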

The online reference tracking MPC minimizes (9) at each time instant $t$ under the following constraints:

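$$x_{t\mid t}=x_{t},\tag{10}$$

$$u_{t\mid t}+u_{t}^{a}\in\mathds{U},\tag{11}$$

$$x_{t+i+1\mid t}=\bar{f}(x_{t+i\mid t},u_{t+i\mid t})\quad\text{for }i=0,\ldots,N-1,\tag{12}$$

$$u_{t+i\mid t}\in\mathds{U}^{\prime}\quad\text{for }i=1,\ldots,N-1.\tag{13}$$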

Notice that the constraint (11) is different from the constraints present in the tube-based MPC formulation [41]. We define the underlying optimal control problem as follows:

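$$V_{m}(x_{t})\coloneqq\min_{(u_{t+i\mid t})_{i=0}^{N-1}}\left\{(9)\mid(10),(11),(12),(13)\right\}.\tag{14}$$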

Let the optimizer of the above problem be $(u_{t+i\mid t}^{\ast})_{i=0}^{N-1}$ . Then the optimal cost will be $V_{m}(x_{t})\coloneqq V_{m}(x_{t},(u_{t+i\mid t}^{\ast})_{i=0}^{N-1})$ . The first control $u_{t\mid t}^{\ast}$ is called the MPC component $u_{t}^{m}$ and is applied along with $u_{t}^{a}$ to the system at time $t$ .
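To make the online problem concrete, the sketch below evaluates the tracking cost for one candidate control sequence by rolling out the nominal dynamics, and checks the input constraints on the first and the remaining moves. It does not perform the minimization itself; a nonlinear programming solver would be wrapped around it. The helper names `mpc_cost` and `inputs_feasible`, the linear model, and the numerical values are assumptions for illustration.

```python
import numpy as np

def mpc_cost(x_t, u_seq, x_ref, u_ref, f_bar, Q, R, Q_f):
    """Evaluate the cost (9) by rolling out the nominal dynamics from x_{t|t} = x_t."""
    x = np.asarray(x_t, dtype=float)
    cost = 0.0
    for i, u in enumerate(u_seq):
        dx, du = x - x_ref[i], u - u_ref[i]
        cost += dx @ Q @ dx + du @ R @ du  # stage cost c_s
        x = f_bar(x, u)                    # predicted next state
    return float(cost + x @ Q_f @ x)       # add terminal cost c_f(x_{t+N|t})

def inputs_feasible(u_seq, u_a, u_max, u_a_max):
    """Check the first-move constraint (11) and the remaining-move constraint (13)."""
    first_ok = np.max(np.abs(u_seq[0] + u_a)) <= u_max
    rest_ok = all(np.max(np.abs(u)) <= u_max - u_a_max for u in u_seq[1:])
    return bool(first_ok and rest_ok)

# Hypothetical example: linear nominal model, horizon N = 2, zero reference.
A = np.array([[1.0, 0.05], [0.0, 1.0]])
B = np.array([[0.0], [0.05]])
f_bar = lambda x, u: A @ x + B @ u
N = 2
u_seq = [np.zeros(1) for _ in range(N)]
zeros_x = [np.zeros(2) for _ in range(N)]
zeros_u = [np.zeros(1) for _ in range(N)]
cost = mpc_cost([1.0, 0.0], u_seq, zeros_x, zeros_u, f_bar,
                np.eye(2), np.eye(1), np.eye(2))
```

Note how only the first move is checked against the full set $\mathds{U}$ after adding the current adaptive action, while later moves are restricted to the tightened set, since the future values of $u^{a}$ are not known at time $t$.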

4 Numerical experiment

We consider wing-rock dynamics to corroborate our result. Letting $\delta_{t}$ denote the roll angle in radians, and $p_{t}$ denote the roll rate in radians per second, the state of the wing-rock dynamics model is $x_{t}\coloneqq\begin{bmatrix}\delta_{t}&p_{t}\end{bmatrix}^{\top}$ at time $t$ . We consider the following discrete time dynamics:

$$x_{t+1}=Ax_{t}+B\left(u_{t}+h(x_{t})\right),$$

where $A=\begin{bmatrix}1&0.05\\ 0&1\end{bmatrix}$ , $B=\begin{bmatrix}0\\ 0.05\end{bmatrix}$ , and $h(\cdot)$ is bounded uncertainty. In order to generate $h$ for the purpose of simulation, we use $h(x_{t})=V_{t}^{\top}\varsigma(x_{t})+\omega_{t}$ , with $V_{t}=v_{t}V_{0}$ , where

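$$v_{t}=\begin{cases}4&\text{for }t\in[50\ell,50\ell+49],\ \ell=0,2,4,\ldots\\0&\text{otherwise,}\end{cases}\qquad V_{0}^{\top}=\begin{bmatrix}0.8&0.2314&0.6918&-0.6245&0.0095&0.0214\end{bmatrix},$$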

and $\omega_{t}\in[-\bar{\omega},\bar{\omega}]$ is a truncated normal random variable with $\bar{\omega}=0.1523$ . The function $\varsigma(\cdot)$ is saturated by a standard saturation function as $\varsigma(x)=\operatorname{sat}(\varsigma^{\prime}(x))$ , where $\varsigma^{\prime}(x)=\begin{bmatrix}1&\delta&p&\left|\delta\right|p&\left|p\right|p&\delta^{3}\end{bmatrix}^{\top}$ and $\operatorname{sat}(\cdot)$ is a standard saturation function with the threshold $\frac{\bar{\omega}}{5}$ . The controller is not aware of $\varsigma(\cdot)$ and $\omega$ . The admissible state and control sets are given below:

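$$\mathcal{X}=\left[-\frac{\pi}{6},\frac{\pi}{6}\right]\times\left[-\frac{\pi}{3},\frac{\pi}{3}\right],\quad\text{and}\quad\mathds{U}=\left[-\frac{\pi}{4},\frac{\pi}{4}\right].$$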
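The simulated plant described above can be sketched as follows. The matrices, $V_{0}$, $\bar{\omega}$ and the switching schedule of $v_{t}$ are taken from the text; reading the "standard saturation function" as element-wise clipping is an assumption, and the noise sample `omega` is left to the caller because the parameters of the truncated normal distribution are not stated. The function names are hypothetical.

```python
import numpy as np

OMEGA_BAR = 0.1523
V0 = np.array([0.8, 0.2314, 0.6918, -0.6245, 0.0095, 0.0214])
A = np.array([[1.0, 0.05], [0.0, 1.0]])
B = np.array([[0.0], [0.05]])

def v(t):
    # v_t = 4 on blocks [50l, 50l + 49] with l even, and 0 otherwise.
    return 4.0 if (t // 50) % 2 == 0 else 0.0

def varsigma(x, threshold=OMEGA_BAR / 5):
    delta, p = x
    raw = np.array([1.0, delta, p, abs(delta) * p, abs(p) * p, delta**3])
    # Assumption: "standard saturation" is read as element-wise clipping.
    return np.clip(raw, -threshold, threshold)

def h(x, t, omega=0.0):
    # omega stands in for the truncated-normal noise; the caller supplies it.
    return v(t) * (V0 @ varsigma(x)) + omega

def plant_step(x, u, t, omega=0.0):
    # Scalar input u; returns the next state [roll angle, roll rate].
    return A @ x + B[:, 0] * (u + h(x, t, omega))

x0 = np.array([np.pi / 30, np.pi / 12])
x1 = plant_step(x0, 0.0, 0)
```

The abrupt switches of `v` every 50 steps are what the controllers in the comparison have to cope with.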

Our control objective is to steer the states of the system from $x_{0}=\begin{bmatrix}\pi/30&\pi/12\end{bmatrix}^{\top}$ to the origin. We compare our proposed approach with two controllers, namely tube MPC [41] and shallow MPC. In order to design shallow MPC, we follow our approach but we consider only a single layer neural network with $3$ neurons. To design the deep MPC, we use a four layer network with sizes $[2,5,5,3]$ respectively, where the first hidden layer has 5 neurons and the outermost layer has $3$ neurons. The weights of the output layer are updated with our adaptive weight update law (§A.1), while the remaining three hidden layers are trained on a secondary machine (§A.2) using SGD with momentum constant 0.9 and learning rate 0.01. We use nonlinear activation functions after each of the inner layers, and these functions are respectively $[ReLU,ReLU,tanh]$ . We follow the approach of [43] for the experience selection (inclusion and removal of data pairs [44]). In particular, we construct a matrix $TT^{\top}$ , where $T$ consists of $p_{\max}$ labels, and compute its singular values. If the replacement of $i^{\text{th}}$ label by new label gives larger singular values than the old one, then the new data pair is added at the $i^{\text{th}}$ position of the replay buffer.
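The singular-value test for experience selection can be sketched as below. The text does not say which singular value is compared, so this sketch assumes the smallest singular value of $TT^{\top}$ is the score and that the best improving position is chosen greedily; the column-wise layout of $T$ and the names `min_singular_value` and `try_insert` are likewise assumptions.

```python
import numpy as np

def min_singular_value(T):
    # Smallest singular value of T T^T (T stacks one label per column).
    return np.linalg.svd(T @ T.T, compute_uv=False)[-1]

def try_insert(T, new_label):
    """Return (T_updated, index) if some replacement raises the score, else (T, None)."""
    best_score, best_idx = min_singular_value(T), None
    for i in range(T.shape[1]):
        candidate = T.copy()
        candidate[:, i] = new_label
        score = min_singular_value(candidate)
        if score > best_score:
            best_score, best_idx = score, i
    if best_idx is None:
        return T, None
    T = T.copy()
    T[:, best_idx] = new_label
    return T, best_idx

# Hypothetical buffer with two identical 2-dimensional entries.
T0 = np.array([[1.0, 1.0], [0.0, 0.0]])
T1, idx1 = try_insert(T0, np.array([0.0, 1.0]))  # adds a new direction
T2, idx2 = try_insert(T1, np.array([1.0, 0.0]))  # adds nothing new
```

A candidate that adds a new direction to the buffer is accepted, while one that would only duplicate existing information is rejected.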

Our experimental results are depicted in Fig. 1. Due to the sudden change in $v_{t}$ at time instants shown by vertical grid lines, tube MPC has oscillations in roll angle. The performance of shallow MPC is affected at each instant of abrupt change, which depicts its incapability of generalization. However, deep MPC demonstrates a good generalization with only three hidden layers.

5 Conclusion

A deep learning based algorithm is presented for safety critical systems by combining the approaches of adaptive control based label generation and tube MPC. A numerical experiment demonstrates that our approach with a single layer neural network (shallow MPC) outperforms tube MPC. The advantage of deep MPC is demonstrated in terms of further improvement in performance and convergence to a very close vicinity of origin. Future work may incorporate the results of [45, 46, 47, 48].

Acknowledgments

We gratefully acknowledge financial support from ONR MURI N00014-19-1-2373 and joint NSF CPS USDA grant 2018-67007-28379.

Appendix

Appendix A Deep Neural Network

Any continuous function $h$ on a compact set $\mathcal{X}$ can be approximated by a multi-layer network with number of layers $L\geqslant 2$ such that

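$$h(x)=W_{L}^{\top}\psi_{L}\left[W_{L-1}^{\top}\psi_{L-1}\left[\cdots\left[\psi_{1}(x)\right]\right]\right]+\varepsilon^{\ast}(x),$$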

where $x\in\mathcal{X}$ , $\psi_{i},W_{i}$ for $i=1,\ldots,L$ , are activation functions and ideal weights, respectively, in the $i^{\text{th}}$ layer. The reconstruction error function $\varepsilon^{\ast}$ is bounded by a known constant $\bar{\varepsilon}^{\ast}>0$ for each $x\in\mathcal{X}$ , i. e. $\left\|\varepsilon^{\ast}(x)\right\|\leqslant\bar{\varepsilon}^{\ast}$ . Therefore, we can represent $h(x_{t})$ with the help of a neural network with a desired accuracy. If the neural network is not minimal then the ideal weights may not be unique. However, for the neural-adaptive controller design only the existence of ideal weights is assumed, which is always guaranteed when $h$ is a continuous function on a compact set [37, §7.1]. Let us define $\phi^{\ast}(x)\coloneqq\psi_{L}\left[W_{L-1}^{\top}\psi_{L-1}\left[\cdots\left[\psi_{1}(x)\right]\right]\right]$ as the output of the last activation layer under the ideal weights of hidden layers and $W^{\ast}\coloneqq W_{L}\in\mathds{R}^{(n_{L}+1)\times m}$ be the ideal weights of the output layer, then

$$h(x_{t})=W^{\ast\top}\phi^{\ast}(x_{t})+\varepsilon^{\ast}(x_{t}).\tag{17}$$

There are $n_{L}$ number of neurons in the output layer. The first row of $W^{\ast}$ represents the bias term and the first element of $\phi^{\ast}\in\mathds{R}^{n_{L}+1}$ is $1$ . The ideal hidden layer weights defining $\phi^{\ast}(\cdot)$ are neither known nor unique.

We update the weights of the output layer on the main machine in real time at each time instant with the help of a weight update law while keeping the weights of hidden layers fixed. The hidden layers are trained on a parallel secondary machine by using the approach [32] in which the weights of the output layer are copied from the main machine at the start of the training and remain fixed during the training. Once the training of DNN on a secondary machine is complete, new weights of the hidden layers are updated on the main machine and remain fixed until new set of weights are again obtained from the secondary machine. The schematic of DNN in the loop with MPC is shown in Fig. 2.

Remark 1.

For the implementation of our controller, we can access the output of the last activation layer of DNN without knowing the functions $\phi^{\ast}$ and $\varepsilon_{\phi_{j}}$ .

Remark 2 (Necessity of second DNN).

In many practical applications, uncertainties appear in the dynamics through interaction with the environment, and neural networks trained on one autonomous vehicle do not perform well on another vehicle due to slight differences in hardware, such as the aperture of a camera. In such situations, deep learning based algorithms cannot be used for mass production without some provision for online training. The second DNN in Fig. 2 allows us to improve or re-adjust the features when the hardware or environment changes.

At time $t_{0}$ , the neural network is initialized with random weights on both machines, and for a given $x$ as input, $\phi_{0}(x)$ denotes the output of the last activation layer at $t_{0}$ . Let $(t_{j})_{j\in\mathds{Z}_{+}}$ denote the instants when the weights of hidden layers are updated on the main machine after the completion of the $j^{\text{th}}$ training. Let $\phi_{j}(x)$ be the output of the last activation layer after the $j^{\text{th}}$ training for a given $x$ as input. We can use bounded neurons in the last activation layer, which results in bounded $\phi^{\ast}$ and $\phi_{j}$ . Due to the universal approximation property of DNN, $\phi^{\ast}$ exists with bounded $\varepsilon^{\ast}$ . We can assume that there exists $\varepsilon_{\phi_{j}}:\mathds{R}^{d}\rightarrow\mathds{R}^{n_{L}+1}$ for each $\phi_{j}$ such that $\phi^{\ast}(x)=\phi_{j}(x)+\varepsilon_{\phi_{j}}(x)$ for each $x\in\mathcal{X}$ . The boundedness of both $\phi^{\ast}$ and $\phi_{j}$ ensures boundedness of $\varepsilon_{\phi_{j}}$ . We need not compute their bounds for the controller design. For $t\in\{t_{j},t_{j}+1,\ldots,t_{j+1}-1\}$ , (17) becomes

$$h(x_{t})=W^{\ast\top}\phi_{j}(x_{t})+\varepsilon_{j}(x_{t}),\tag{18}$$

where $\varepsilon_{j}(x_{t})=\varepsilon^{\ast}(x_{t})+W^{\ast\top}\varepsilon_{\phi_{j}}(x_{t})$ is the overall reconstruction error.
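The expression for the overall reconstruction error follows by substituting $\phi^{\ast}=\phi_{j}+\varepsilon_{\phi_{j}}$ into the ideal representation of $h$. A short derivation, using only the definitions given above:

```latex
\begin{aligned}
h(x_t) &= W^{\ast\top}\phi^{\ast}(x_t) + \varepsilon^{\ast}(x_t) \\
       &= W^{\ast\top}\bigl(\phi_j(x_t) + \varepsilon_{\phi_j}(x_t)\bigr) + \varepsilon^{\ast}(x_t) \\
       &= W^{\ast\top}\phi_j(x_t)
          + \underbrace{\varepsilon^{\ast}(x_t) + W^{\ast\top}\varepsilon_{\phi_j}(x_t)}_{\varepsilon_j(x_t)} .
\end{aligned}
```

The error therefore has two sources: the approximation limit of the ideal network and the mismatch between the current and the ideal features.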

Notice that even when the weights of hidden layers are randomly assigned as in ELM [49], the universal approximation property of the neural network allows us to make the overall reconstruction error $\varepsilon_{0}(\cdot)$ as small as desired by increasing the width of the network. However, a network with trained hidden layers can capture several useful features, which in turn results in performance improvement [32].

We employ

$$u_{t}^{a}=-K_{t}^{\top}\phi_{j}(x_{t})\tag{19}$$

as an adaptive (learning) control at time $t\in\{t_{j},t_{j}+1,\ldots,t_{j+1}-1\}$ , where $K_{t}$ is the weight of the output layer, which is trained according to the adaptive weight update law and $\phi_{j}$ is a feature basis function obtained from the last activation layer of DNN after $j^{\text{th}}$ training. In the next subsections we provide the relevant details of the training of DNN.

A.1 Adaptive learning of $W^{\ast}$ on the main machine

We make the following assumption:

Assumption 2.

There exist $\bar{W}_{i}>0$ for $i=1,\ldots,m$ , and $\sigma,\bar{\varepsilon}>0$ such that $\left\|W^{\ast(i)}\right\|\leqslant\bar{W}_{i}$ , for $i=1,\ldots,m$ , and $\left\|\phi_{j}(x)\right\|\leqslant\sigma,\left\|\varepsilon_{j}(x)\right\|\leqslant\bar{\varepsilon}$ for every $x\in\mathcal{X}$ and $j\in\mathds{N}_{0}$ .

The above assumption is standard in the literature [29, 50, 51]. A priori knowledge of the bounds on the ideal weights $W^{\ast}$ of the output layer is useful to avoid the parameter drift phenomenon. If the activation functions in the last hidden layer are bounded, e.g., sigmoid or tanh, then $\left\|\phi_{j}(x)\right\|$ will also be bounded for each $j$ and for all $x\in\mathds{R}^{d}$ .

We initialize $K_{0}$ such that $\left\|K_{0}^{(i)}\right\|\leqslant\bar{W}_{i}$ ; $i=1,\ldots,m$ . For a given learning rate $0<\theta<1$ and for $t\in\{t_{j},t_{j}+1,\ldots,t_{j+1}-1\}$ , we employ the following weight update law:

$$\bar{K}_{t+1}=K_{t}+\frac{\theta}{\left\|\phi_{j}(x_{t})\right\|^{2}}\phi_{j}(x_{t})\left(g(x_{t})^{\dagger}\left(x_{t+1}-\bar{f}(x_{t},u_{t}^{m})\right)\right)^{\top},\tag{20}$$

where $g(x_{t})^{\dagger}=\left(g(x_{t})^{\top}g(x_{t})\right)^{-1}g(x_{t})^{\top}$ represents the pseudo-inverse of the left invertible matrix $g(x_{t})$ . Notice that the first element of $\phi_{j}(\cdot)$ is one. Therefore, $\left\|\phi_{j}(x)\right\|^{2}\geqslant 1$ for all $x\in\mathds{R}^{d}$ and $j\in\mathds{N}_{0}$ , which avoids any possibility of division by zero.
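The quantity multiplied by $g(x_{t})^{\dagger}$ in the weight update law is computable from measured states, and it recovers the mismatch between the adaptive action and the uncertainty. A short derivation, using the dynamics $x_{t+1}=\bar{f}(x_{t},u_{t}^{m})+g(x_{t})(u_{t}^{a}+h(x_{t}))$ and the full column rank of $g(x_{t})$:

```latex
\begin{aligned}
g(x_t)^{\dagger}\bigl(x_{t+1}-\bar{f}(x_t,u_t^m)\bigr)
  &= g(x_t)^{\dagger}\, g(x_t)\bigl(u_t^a + h(x_t)\bigr) \\
  &= u_t^a + h(x_t),
  \qquad \text{since } g(x_t)^{\dagger} g(x_t) = I_m .
\end{aligned}
```

The update thus moves $K_{t}$ along the feature vector in proportion to the residual $u_{t}^{a}+h(x_{t})$, without requiring knowledge of $h$ itself.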

We employ the discrete projection method to ensure boundedness of $K_{t}^{(i)}$ for $i=1,\ldots,m$ , as follows:

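$$K_{t}^{(i)}=\operatorname{Proj}\bar{K}_{t}^{(i)}=\begin{cases}\bar{K}_{t}^{(i)}&\text{if }\left\|\bar{K}_{t}^{(i)}\right\|\leqslant\bar{W}_{i},\\\frac{\bar{W}_{i}}{\left\|\bar{K}_{t}^{(i)}\right\|}\bar{K}_{t}^{(i)}&\text{otherwise.}\end{cases}\tag{21}$$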
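A minimal sketch of one update of the output-layer weights, combining the normalized update law with the column-wise projection, is given below. The function name `update_output_weights`, the argument layout, and the small guard against a zero column norm are assumptions made for illustration.

```python
import numpy as np

def update_output_weights(K, phi, x_next, x_nominal_next, g, theta, W_bar):
    """One step of the update law followed by the column-wise projection.

    K: (n_L + 1, m) output weights; phi: (n_L + 1,) features with phi[0] == 1;
    g: (d, m) control influence matrix with full column rank;
    W_bar: (m,) bounds on the column norms of K.
    """
    # g^dagger (x_{t+1} - f_bar(x_t, u_t^m)) recovers u_t^a + h(x_t).
    u_tilde = np.linalg.pinv(g) @ (x_next - x_nominal_next)
    K_bar = K + (theta / (phi @ phi)) * np.outer(phi, u_tilde)
    # Projection: rescale any column whose norm exceeds its bound.
    norms = np.linalg.norm(K_bar, axis=0)
    scale = np.minimum(1.0, W_bar / np.maximum(norms, 1e-12))
    return K_bar * scale

# Hypothetical check: constant features and h(x) = W_star^T phi exactly.
g_mat = np.array([[0.0], [0.05]])
W_star = np.array([[0.2], [-0.5], [0.3]])
phi = np.array([1.0, 0.5, -0.5])
K = np.zeros((3, 1))
for _ in range(30):
    u_a = -K.T @ phi
    mismatch = g_mat @ (u_a + W_star.T @ phi)  # x_{t+1} minus nominal prediction
    K = update_output_weights(K, phi, mismatch, np.zeros(2), g_mat,
                              theta=0.5, W_bar=np.array([1.0]))
final_error = np.max(np.abs((W_star - K).T @ phi))
```

In this idealized setting the residual along the fixed feature vector shrinks by the factor $1-\theta$ at every step; with time-varying features and a nonzero reconstruction error only the boundedness results of the text apply.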

Let $\tilde{K}_{t}\coloneqq K_{t}-W^{\ast}$ and $\tilde{u}_{t}\coloneqq u_{t}^{a}+h(x_{t})=-\tilde{K}_{t}^{\top}\phi_{j}(x_{t})+\varepsilon_{j}(x_{t})$ . It is evident that $\left\|K_{t}\right\|_{F}^{2}=\operatorname{tr}(K_{t}^{\top}K_{t})=\sum_{i=1}^{m}\left\|K_{t}^{(i)}\right\|^{2}\leqslant\sum_{i=1}^{m}\bar{W}_{i}^{2}\eqqcolon\bar{W}$ for all $t$ due to the projection. Therefore, the neuro-adaptive control component $u_{t}^{a}$ is bounded, i. e.

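$$\left\|u_{t}^{a}\right\|=\left\|K_{t}^{\top}\phi_{j}(x_{t})\right\|\leqslant\left\|K_{t}\right\|\left\|\phi_{j}(x_{t})\right\|\leqslant\left\|K_{t}\right\|_{F}\sigma\leqslant\bar{W}\sigma\eqqcolon u_{\max}^{a},$$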

for $t\in\{t_{j},t_{j}+1,\ldots,t_{j+1}-1\}$ and for all $t_{j}$ . The apparent disturbance term $g(x_{t})\left(u_{t}^{a}+h(x_{t})\right)$ in (4) is also bounded, i. e.

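$$\left\|g(x_{t})\left(u_{t}^{a}+h(x_{t})\right)\right\|\leqslant\left\|g(x_{t})u_{t}^{a}\right\|+\left\|g(x_{t})h(x_{t})\right\|\leqslant\delta_{g}\left\|u_{t}^{a}\right\|+w_{\max}\leqslant\delta_{g}u_{\max}^{a}+w_{\max}\eqqcolon w_{\max}^{\prime}.$$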

A.2 Self-supervised learning of $\phi^{\ast}$ on a secondary machine

Let $(T_{k})_{k\in\mathds{Z}_{+}}$ represent the time instants at which we begin the $k^{\text{th}}$ training of the DNN. Suppose $p_{0}$ data samples are required for the training, which are stored in a buffer of size $p_{\max}>p_{0}$ . We do not have access to the labeled data pairs $(x,\phi^{\ast}(x))$ . Therefore, we follow an approach similar to that of [32] for the data collection and training.

We fix $T_{1}\geqslant p_{0}$ and for each $t\leqslant T_{1}$ , the labeled pairs $(x_{t},u_{t}^{a})$ are stored in the buffer. Recall that $u_{t}^{a}=-K_{t}^{\top}\phi_{0}(x_{t})$ for $t\leqslant T_{1}<t_{1}$ , where $\phi_{0}(\cdot)$ is obtained by the random initialization of the weights of hidden layers. At $t=T_{1}$ , we randomly sample $p_{0}$ data pairs for the training of the DNN. We fix the weights of the output layer to be $-K_{T_{1}}$ and train the network. Notice that the training of the DNN does not affect the operation of the system because the controlled system still employs $u_{t}^{a}=-K_{t}^{\top}\phi_{0}(x_{t})$ as the adaptive control, in which only $K_{t}$ is updated at each time instant by using the weight update law discussed in §A.1. At $t=t_{1}$ , we get our first trained network. For $t\in\{t_{1},\ldots,t_{2}-1\}$ , we employ $u_{t}^{a}=-K_{t}^{\top}\phi_{1}(x_{t})$ as an adaptive control. This process of training, exploiting and storing is repeated at each time $t$ . For each $k\in\mathds{Z}_{+}$ , $W_{L}$ is set to be $-K_{T_{k}}$ in the secondary DNN and remains fixed during the training. Therefore, we are interested in finding the weights $W_{1:L-1}\coloneqq W_{1},\ldots,W_{L-1}$ which minimize the following cost for a given input $x_{t}$ and corresponding label $u_{t}^{a}$ :

$$\ell\left((x_{t},u_{t}^{a}),W_{1:L-1}\right)\coloneqq\left\|u_{t}^{a}+K_{T_{k}}^{\top}\psi_{L}\left[W_{L-1}^{\top}\psi_{L-1}\left[\cdots\left[\psi_{1}(x_{t})\right]\right]\right]\right\|^{2}.$$

Let $\mathcal{D}_{k}\coloneqq\left(x^{i},u^{i}\right)_{i=1}^{p_{0}}$ be training data consisting of $p_{0}$ data points randomly sampled from the buffer for the $k^{\text{th}}$ training. The following loss function is considered for the training of DNN:

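$$\mathcal{L}(\mathcal{D}_{k},W_{1:L-1})=\frac{1}{p_{0}}\sum_{i=1}^{p_{0}}\ell\left(x^{i},W_{1:L-1}\right).$$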
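The training loss on the secondary machine can be sketched as follows: the output layer is frozen at $-K_{T_{k}}$, so the network output for an input $x$ is $-K_{T_{k}}^{\top}\phi(x)$, and the loss penalizes its squared distance to the stored label $u^{a}$. The way biases are folded in as a leading 1, and the names `features` and `batch_loss`, are assumptions for illustration; the gradient step (SGD with momentum in the experiment) is omitted.

```python
import numpy as np

def features(x, hidden_weights, activations):
    """Output of the last activation layer, with a leading 1 for the bias term."""
    z = np.asarray(x, dtype=float)
    for W, act in zip(hidden_weights, activations):
        z = act(W.T @ np.append(1.0, z))  # bias folded in as a leading 1
    return np.append(1.0, z)

def batch_loss(batch, hidden_weights, activations, K_Tk):
    """Mean of ||u_a + K_{T_k}^T phi(x)||^2 over the sampled pairs (x, u_a)."""
    losses = [
        np.sum((u_a + K_Tk.T @ features(x, hidden_weights, activations)) ** 2)
        for x, u_a in batch
    ]
    return float(np.mean(losses))

# Hypothetical data: labels generated by the current features and frozen K.
rng = np.random.default_rng(0)
hidden = [rng.normal(size=(3, 4)), rng.normal(size=(5, 3))]
acts = [np.tanh, np.tanh]
K_frozen = rng.normal(size=(4, 1))
xs = rng.normal(size=(6, 2))
batch = [(x, -K_frozen.T @ features(x, hidden, acts)) for x in xs]
loss_same = batch_loss(batch, hidden, acts, K_frozen)
loss_perturbed = batch_loss(batch, [hidden[0] + 0.5, hidden[1]], acts, K_frozen)
```

Labels produced by the current features give zero loss by construction; the loss becomes informative because the buffer holds labels generated with earlier values of $K_{t}$, which the hidden layers are then trained to reproduce under the latest frozen output weights.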

At $t=p_{\max}-1$ , the buffer becomes full. So new data can be added after the removal of some old data by using some suitable experience selection method [44]. The available approaches are based on retaining the most informative data based on some criterion [52] and ensuring sufficient diversity [53]. Our present approach is compatible with any existing method of experience selection. However, different methods may result in different performance for different problems and their choice may also depend on the availability of resources. We keep the method of experience selection open for the choice of users.

Appendix B Stability

We recall the following definition:

Definition 1 ([54], page 117).

The vector sequence $(s_{t})_{t\in\mathds{N}_{0}}$ is called $\mu$ small in mean square sense if it satisfies $\sum_{t=k}^{k+N-1}\left\|s_{t}\right\|^{2}\leqslant Nc_{0}\mu+c_{0}^{\prime}$ for all $k\in\mathds{Z}_{+}$ , a given constant $\mu\geqslant 0$ and some $N\in\mathds{Z}_{+}$ , where $c_{0},c_{0}^{\prime}\geqslant 0$ .
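Definition 1 can be checked numerically on a recorded signal by comparing every length-$N$ window sum against the bound. The sketch below does this for a finite record, so it can only falsify the property on the data at hand, not prove it for all $k$; the function name `is_mu_small` and the test signal are hypothetical.

```python
import numpy as np

def is_mu_small(seq, N, mu, c0, c0_prime):
    """Check sum_{t=k}^{k+N-1} ||s_t||^2 <= N*c0*mu + c0' on every window of a record.

    seq: array of shape (T, n), one vector s_t per row, with T >= N.
    """
    sq_norms = np.sum(np.asarray(seq, dtype=float) ** 2, axis=1)
    window_sums = np.convolve(sq_norms, np.ones(N), mode="valid")
    return bool(np.all(window_sums <= N * c0 * mu + c0_prime))

# Hypothetical record: constant signal with ||s_t||^2 = 0.04.
record = 0.2 * np.ones((20, 1))
ok = is_mu_small(record, N=5, mu=0.04, c0=1.0, c0_prime=0.01)
too_tight = is_mu_small(record, N=5, mu=0.01, c0=1.0, c0_prime=0.01)
```

The constant offset $c_{0}^{\prime}$ absorbs transients, while the term $Nc_{0}\mu$ bounds the average energy of the signal per step.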

Some straightforward arguments as in [54, §4.11.3] give us the following result:

Lemma 1.

Consider the dynamical system (1), weight update law (20) and the projection method (21). Let the Assumption 2 hold and define $V_{a}(K_{t})\coloneqq\frac{1}{\theta}\operatorname{tr}(\tilde{K}_{t}^{\top}\tilde{K}_{t})$ . Then for all $t$ ,

(i) $V_{a}(K_{t})\leqslant\frac{4}{\theta}\bar{W}$ ,

(ii) $V_{a}(K_{t+1})-V_{a}(K_{t})\leqslant-\frac{1-\theta}{\sigma^{2}}\left\|\tilde{u}_{t}\right\|^{2}+\left\|\varepsilon(x_{t})\right\|^{2}$ ,

(iii) $\tilde{u}_{t}$ is $\bar{\varepsilon}^{2}$ small in mean square sense with $c_{0}=\frac{\sigma^{2}}{1-\theta}$ and $c_{0}^{\prime}=\frac{4c_{0}}{\theta}\bar{W}$ as per Definition 1.

We provide a proof of Lemma 1 in Appendix C. Let $\mathcal{X}_{c}(x_{t}^{r})$ be the level set around $x_{t}^{r}$ of radius $c$ generated by $V_{m}(x_{t})$ and $\mathcal{X}_{c}$ be their union. In particular,

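$$\mathcal{X}_{c}(x_{t}^{r})\coloneqq\{x_{t}\mid V_{m}(x_{t})\leqslant c\},\qquad\mathcal{X}_{c}\coloneqq\bigcup_{t\in\mathds{N}_{0}}\mathcal{X}_{c}(x_{t}^{r}).\tag{23}$$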

Properties of the value function are summarized in the following Lemma. These results are standard in the literature [55]. We provide their proofs in the appendix for completeness.

Lemma 2.

(i)

If $\alpha\geqslant c$ then $x_{t+N\mid t}\in\mathcal{X}_{f}$ for every $x_{t}\in\mathcal{X}_{c}(x_{t}^{r})$.

(ii)

[34, Lemma 3] There exist $c_{2}>c_{1}>0$ such that

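$$c_{1}\left\|x_{t}-x_{t}^{r}\right\|^{2}\leqslant V_{m}(x_{t})\leqslant c_{2}\left\|x_{t}-x_{t}^{r}\right\|^{2}.$$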

Lemma 2-(i) ensures the satisfaction of terminal constraint on states just by construction. Refer to [56, Proposition 1] and [41, Proposition 1] for minor differences due to (1), (14) and Assumption 1. For the purpose of analysis, we define an intermediate optimization problem by replacing $x_{t}$ in (10) by $x_{t\mid t-1}$ . In particular,

[TABLE]

Notice that we keep $u_{t}^{a}=-K_{t}^{\top}\phi_{j}(x_{t})$ fixed in both problems (14) and (24), and therefore we can adopt the following convention:

[TABLE]

Remark 3.

Notice that the constraint on the first control (11) includes $u_{t}^{a}$ to make the MPC aware of the adaptive action. Since $u_{t}^{a}=-K_{t}^{\top}\phi_{j}(x_{t})$ depends nonlinearly on $x_{t}$ through the nonlinear function $\phi_{j}$, the set-valued control move map becomes state-dependent. Our analysis uses the value function of the MPC (14) as a candidate Lyapunov function. The state-dependent constraint (11) prevents us from proving robustness of the MPC by invoking [57, Propositions 7, 8 or 11]. For the same reason, the results of [41, Propositions 2 and 4] are not directly applicable here. We therefore defined the intermediate optimization problem (24) to circumvent this difficulty.

Important results related to tube MPC are summarized in the following Lemma. Refer to [41, Proposition 2, Proposition 4] for a detailed discussion. We provide their proofs in the appendix to highlight the adjustments and for completeness.

Lemma 3.

If Assumption 1 is satisfied, then for all $t$ and every $x_{t}\in\mathcal{X}_{c}(x_{t}^{r})$ the following hold:

(i)

$\hat{V}_{m}(x_{t+1\mid t})-V_{m}(x_{t})\leqslant-c_{\mathrm{s}}(x_{t},u_{t\mid t}^{\ast})$, and $x_{t+1\mid t}\in\mathcal{X}_{c}$.

(ii)

$x_{t+1}\in\mathcal{X}_{c}(x_{t+1}^{r})+\mathds{W}^{\prime}$.

(iii)

$V_{m}(x_{t+1})-\hat{V}_{m}(x_{t+1\mid t})\leqslant c_{3}\left\|g(x_{t})\tilde{u}_{t}\right\|$.

(iv)

There exists $\gamma<1$ such that

$$V_{m}(x_{t+1})\leqslant\gamma V_{m}(x_{t})+c_{3}\left\|g(x_{t})\tilde{u}_{t}\right\|.$$
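To see what a bound of the form $V_{m}(x_{t+1})\leqslant\gamma V_{m}(x_{t})+c_{3}\left\|g(x_{t})\tilde{u}_{t}\right\|$ (the form used in the proof of Proposition 1) implies over time, the sketch below unrolls the worst case of the recursion. It assumes a uniform bound `d` on the disturbance term $c_{3}\left\|g(x_{t})\tilde{u}_{t}\right\|$; the function names and the numerical values are illustrative only.

```python
def worst_case_trajectory(V0, gamma, d, T):
    """Iterate V_{t+1} = gamma * V_t + d, the worst case of the inequality."""
    V = [V0]
    for _ in range(T):
        V.append(gamma * V[-1] + d)
    return V


def iss_bound(V0, gamma, d, t):
    """Closed form: gamma^t * V0 + d * (1 - gamma^t) / (1 - gamma)."""
    return gamma ** t * V0 + d * (1.0 - gamma ** t) / (1.0 - gamma)


gamma, d, V0 = 0.8, 0.5, 10.0
traj = worst_case_trajectory(V0, gamma, d, T=50)
ultimate_bound = d / (1.0 - gamma)   # limit of the recursion as t grows
```

The initial condition is forgotten geometrically at rate $\gamma$, while the disturbance contributes at most $d/(1-\gamma)$, which is the usual ISS picture.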

Lemma 3-(iv) together with Lemma 2-(ii) ensures that the controlled system is input-to-state stable (ISS), because it admits $V_{m}$ as an ISS Lyapunov function [58, Lemma 3.5]. In the case of structured uncertainty, $\left\|\tilde{u}_{t}\right\|\rightarrow 0$ as $t\rightarrow\infty$, which implies $\left\|x_{t}\right\|\rightarrow 0$ [34, Theorem 1]. Such results are not available in the presence of unstructured uncertainty. However, invariant and attractive tubes exist when $w_{\max}$ and $u_{\max}^{a}$ are small. We have the following result:

Proposition 1.

Let us define $\bar{c}\coloneqq\frac{c_{2}c_{3}}{c_{1}}(\delta_{g}u_{\max}^{a}+w_{\max})$ . If $\delta_{g}u_{\max}^{a}+w_{\max}<\frac{c_{1}}{c_{2}c_{3}}c$ , then for all $t\geqslant N$ , the following hold:

(i)

for every $x_{t}\in\mathcal{X}_{c}(0)\setminus\mathcal{X}_{\bar{c}}(0)$, $V_{m}(x_{t+1})<V_{m}(x_{t})$,

(ii)

for every $x_{t}\in\mathcal{X}_{\bar{c}}(0)$, $x_{t+1}\in\mathcal{X}_{\bar{c}}(0)$.

(iii)

In addition, if $\mathcal{X}_{c}\subset\mathcal{X}$, then $x_{t}\in\mathcal{X}$ for all $t$.

Proposition 1 follows from arguments similar to those in [41, Proposition 4] and confirms the existence of an invariant tube $\mathcal{X}_{c}(0)$ and an attractive tube $\mathcal{X}_{\bar{c}}(0)\subset\mathcal{X}_{c}(0)$.
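The quantities in Proposition 1 are simple to evaluate once the constants are known. The sketch below computes $\bar{c}$, checks the smallness condition, and verifies the identity $\gamma\bar{c}+c_{3}(\delta_{g}u_{\max}^{a}+w_{\max})=\bar{c}$ with $\gamma=1-\frac{c_{1}}{c_{2}}$ that underlies the invariance of $\mathcal{X}_{\bar{c}}(0)$. The function names and the numerical values are made up for illustration and are not taken from the paper's experiment.

```python
def attractive_tube_radius(c1, c2, c3, delta_g, u_a_max, w_max):
    """c_bar = (c2 * c3 / c1) * (delta_g * u_a_max + w_max)."""
    return (c2 * c3 / c1) * (delta_g * u_a_max + w_max)


def tube_condition_holds(c, c1, c2, c3, delta_g, u_a_max, w_max):
    """Condition of Proposition 1: delta_g*u_a_max + w_max < c1*c / (c2*c3)."""
    return delta_g * u_a_max + w_max < c1 * c / (c2 * c3)


# Illustrative constants (assumptions, not values from the paper).
c, c1, c2, c3 = 1.0, 1.0, 4.0, 2.0
delta_g, u_a_max, w_max = 1.0, 0.05, 0.05

c_bar = attractive_tube_radius(c1, c2, c3, delta_g, u_a_max, w_max)
holds = tube_condition_holds(c, c1, c2, c3, delta_g, u_a_max, w_max)

# Worst-case one-step map evaluated on the boundary of the attractive tube.
gamma = 1.0 - c1 / c2
next_level = gamma * c_bar + c3 * (delta_g * u_a_max + w_max)
```

The smallness condition is equivalent to $\bar{c}<c$, so the attractive tube sits strictly inside the invariant one.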

Suppose there exist some $\hat{N}\geqslant N$ and $\hat{c}$ such that $x_{\hat{N}}\in\mathcal{X}_{\hat{c}}(0)\coloneqq\{x\mid V_{m}(x)\leqslant\hat{c}\leqslant c\}$. Since $c_{3}$ is a Lipschitz constant of $V_{m}$ on the compact set $\mathcal{X}_{c}+\mathds{W}^{\prime}\supset\mathcal{X}_{c}(0)\supset\mathcal{X}_{\hat{c}}(0)$, there exists $\hat{c}_{3}$ that satisfies Lemma 3-(iv). Similarly, let there exist $\hat{\delta}_{g}\leqslant\delta_{g}$ such that $\left\|g(x)\right\|\leqslant\hat{\delta}_{g}$ for every $x\in\mathcal{X}_{\hat{c}}(0)$. Since $\bar{c}$ depends on $c_{3}$ and $\delta_{g}$, reducing them shrinks the attractive tube $\mathcal{X}_{\bar{c}}(0)$. Moreover, since any level set within $\mathcal{X}_{\bar{c}}(0)$ is invariant due to Proposition 1-(i), a further shrinkage is possible. However, asymptotic convergence is still not guaranteed. If $\gamma^{2}<\frac{1}{2}$, then we can get a stronger result, provided that a certain condition in terms of $c_{3}$ and $\delta_{g}$ is satisfied and the reconstruction error $\varepsilon$ has a small-gain-type property within the invariant tube. We make the following assumption:

Assumption 3.

There exists $\beta>0$ such that $\left\|\varepsilon_{j}(x)\right\|\leqslant\beta\left\|x\right\|^{2}$ for all $x\in\mathcal{X}_{\hat{c}}(0)$ and $j\in\mathds{N}_{0}$ .

Generally, the norm bound on the reconstruction error is assumed to be linear in $\left\|x\right\|$ [29]. We assume it to be quadratic; otherwise, the above assumption is standard in the literature. We have the following result:

Theorem 1.

Consider the dynamical system (1) controlled by the Deep MPC, and let Assumptions 1, 2 and 3 hold. If $w_{\max}^{\prime}<\frac{c_{1}}{c_{2}c_{3}}c$, $\gamma^{2}<\frac{1}{2}$ and $\beta<\frac{c_{1}m}{\sqrt{2}\sigma\hat{c}_{3}\hat{\delta}_{g}}\sqrt{(1-2\gamma^{2})(1-\theta)}$, then $\left\|x_{t}\right\|\rightarrow 0$ as $t\rightarrow\infty$.

Notice that the main results of tube-based MPC (Proposition 1) are valid for small disturbances. Theorem 1 extends Proposition 1 by guaranteeing convergence of the states to the origin under the conditions on $\gamma$ and $\beta$. A smaller value of $\gamma$ corresponds to faster convergence of the value function of the nominal MPC. Generally, the reconstruction error is very small compared with the disturbance. Therefore, the conditions on $\gamma$ and $\beta$ are reasonable, and they can be verified both theoretically and empirically.
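As a rough aid for checking the conditions of Theorem 1 empirically, the sketch below evaluates the admissible upper bound on $\beta$ and the constant $\eta$ that appears in the proof of Theorem 1. The function names are hypothetical, the restriction $0<\theta<1$ is an assumption made here so that the square root and the division are well defined, and the numbers are illustrative.

```python
import math


def beta_max(c1, m, sigma, c3_hat, delta_g_hat, gamma, theta):
    """Upper bound on beta from Theorem 1 (requires gamma^2 < 1/2)."""
    if not (gamma ** 2 < 0.5 and 0.0 < theta < 1.0):
        raise ValueError("need gamma^2 < 1/2 and 0 < theta < 1")
    scale = c1 * m / (math.sqrt(2.0) * sigma * c3_hat * delta_g_hat)
    return scale * math.sqrt((1.0 - 2.0 * gamma ** 2) * (1.0 - theta))


def eta(c1, m, sigma, c3_hat, delta_g_hat, gamma, theta, beta):
    """eta = (1 - 2 gamma^2) c1^2 - 2/(1-theta) (sigma c3_hat delta_g_hat beta / m)^2."""
    gain = sigma * c3_hat * delta_g_hat * beta / m
    return (1.0 - 2.0 * gamma ** 2) * c1 ** 2 - 2.0 / (1.0 - theta) * gain ** 2


# Illustrative constants (assumptions, not values from the paper).
params = dict(c1=1.0, m=0.5, sigma=2.0, c3_hat=1.5, delta_g_hat=1.0,
              gamma=0.6, theta=0.5)
b = beta_max(**params)
eta_inside = eta(beta=0.9 * b, **params)    # beta below the bound
eta_outside = eta(beta=1.1 * b, **params)   # beta above the bound
```

Positivity of `eta` is equivalent to `beta < beta_max(...)`, so either quantity can serve as the check.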

Appendix C Proofs

Proof of Lemma 1.

(i)

We have $V_{a}(K_{t})=\frac{1}{\theta}\operatorname{tr}(\tilde{K}_{t}^{\top}\tilde{K}_{t})=\frac{1}{\theta}\sum_{i=1}^{m}\left\|K_{t}^{(i)}-W^{\ast(i)}\right\|^{2}\leqslant\frac{4}{\theta}\sum_{i=1}^{m}\bar{W}_{i}^{2}=\frac{4}{\theta}\bar{W}$.

(ii)

We first compute

[TABLE]

where

[TABLE]

One important property of the projection (21) is the following [54, (4.61)]:

[TABLE]

Since $(K_{t+1}^{(i)}-\bar{K}_{t+1}^{(i)})^{\top}(K_{t+1}^{(i)}-W^{\ast})\leqslant 0$ due to (26), we can ensure $\operatorname{tr}(\alpha_{t})\leqslant 0$ . Therefore,

[TABLE]

By substituting $\tilde{K}_{t}^{\top}\phi_{j}(x_{t})=-\tilde{u}_{t}+\varepsilon_{j}(x_{t})$ in the above inequality, we get

[TABLE]

where the last inequality is due to $m^{2}\leqslant\left\|\phi_{j}(x_{t})\right\|^{2}\leqslant\sigma^{2}$ . Therefore,

[TABLE]

(iii)

Consider Lemma 1-(ii) to get

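$$\frac{1-\theta}{\sigma^{2}}\left\|\tilde{u}_{t}\right\|^{2}\leqslant V_{a}(K_{t})-V_{a}(K_{t+1})+\frac{1}{m^{2}}\left\|\varepsilon_{j}(x_{t})\right\|^{2}.$$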

By summing from $t=k$ to $k+N-1$ in both sides, we get

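$$\sum_{t=k}^{k+N-1}\left\|\tilde{u}_{t}\right\|^{2}\leqslant\frac{\sigma^{2}}{1-\theta}\bigl(V_{a}(K_{k})-V_{a}(K_{k+N})\bigr)+\frac{\sigma^{2}}{(1-\theta)m^{2}}\sum_{t=k}^{k+N-1}\left\|\varepsilon_{j}(x_{t})\right\|^{2}\leqslant\frac{4\sigma^{2}}{(1-\theta)\theta}\bar{W}+N\frac{\sigma^{2}}{(1-\theta)m^{2}}\bar{\varepsilon}^{2}.$$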

Therefore, $\tilde{u}_{t}$ is $\bar{\varepsilon}^{2}$ small in mean square sense with $c_{0}=\frac{\sigma^{2}}{(1-\theta)m^{2}}$ and $c_{0}^{\prime}=\frac{4c_{0}}{\theta}m^{2}\bar{W}$ as per the Definition 1.

∎

Proof of Lemma 2.

(i)

We recall the definitions of $\mathcal{X}_{f}$ and $\mathcal{X}_{c}(x_{t}^{r})$ from (7) and (23), respectively. It is then immediate that $x_{t}\in\mathcal{X}_{c}(x_{t}^{r})\implies V_{m}(x_{t})\leqslant c\implies c_{\mathrm{f}}(x_{t+N\mid t})\leqslant V_{m}(x_{t})\leqslant c\leqslant\alpha\implies x_{t+N\mid t}\in\mathcal{X}_{f}$.

(ii)

Since $Q\succ 0$ and $f,g$ are Lipschitz continuous, by [34, Lemma 3] there exist $c_{1},c_{2}>0$ such that Lemma 2-(ii) holds. We mention the key steps here for completeness. Since $V_{m}(x_{t})\geqslant c_{\mathrm{s}}(x_{t},u_{t}^{m})\geqslant\left\|x_{t}-x_{t}^{r}\right\|^{2}_{Q}$, we can choose $c_{1}=\lambda_{\min}(Q)$.

Let $f,g$ be Lipschitz continuous with Lipschitz constants $L_{f}$ and $L_{g}$ , respectively. We can notice that (14) has no constraints on states and the constraints on control can be satisfied by $(u_{i}^{r})_{i=t}^{t+N-1}$ at time $t$ .

Let us recall the definition of the cost function (9), then due to the optimality of $V_{m}(x_{t})$ , we get

[TABLE]

The above inequality is due to the substitution $(u_{i})_{i=t}^{t+N-1}=(u_{i}^{r})_{i=t}^{t+N-1}$ . Further,

[TABLE]

where $\bar{L}=L_{f}+L_{g}u_{\max}^{r}$ . Since $x_{t+N}^{r}=\mathbf{0}$ for all $t$ , there exists $c_{2}=\bar{L}^{N}\lambda_{\max}(Q_{f})+\sum_{i=0}^{N-1}\bar{L}^{i}\lambda_{\max}(Q)>\lambda_{\min}(Q)=c_{1}$ .

∎
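The constants obtained in the proof of Lemma 2 can be evaluated directly from the weight matrices. The sketch below computes $c_{1}=\lambda_{\min}(Q)$, the expression for $c_{2}$ exactly as stated in the proof, and the contraction factor $\gamma=1-\frac{c_{1}}{c_{2}}$ used in the proof of Lemma 3. The function name and the example matrices are assumptions for illustration.

```python
import numpy as np


def value_function_constants(Q, Qf, L_bar, N):
    """Return (c1, c2, gamma) from the weights Q, Q_f, L_bar and horizon N."""
    eig_Q = np.linalg.eigvalsh(Q)      # eigenvalues in ascending order
    eig_Qf = np.linalg.eigvalsh(Qf)
    c1 = eig_Q[0]                      # lambda_min(Q)
    # c2 = L_bar^N * lambda_max(Q_f) + sum_{i=0}^{N-1} L_bar^i * lambda_max(Q)
    c2 = L_bar ** N * eig_Qf[-1] + sum(L_bar ** i * eig_Q[-1] for i in range(N))
    gamma = 1.0 - c1 / c2
    return c1, c2, gamma


# Illustrative weights (assumptions, not the paper's experiment).
Q = np.diag([1.0, 2.0])
Qf = np.diag([3.0, 3.0])
c1, c2, gamma = value_function_constants(Q, Qf, L_bar=1.0, N=3)
```

Because $c_{2}$ grows with the horizon $N$ and with $\bar{L}$, $\gamma$ approaches one for long horizons, which makes the requirement $\gamma^{2}<\frac{1}{2}$ of Theorem 1 harder to meet.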

Proof of Lemma 3.

(i)

Since $u_{t+1\mid t}^{\ast}\in\mathds{U}^{\prime}$ , we get $u_{t+1\mid t}^{\ast}+u_{t+1}^{a}\in\mathds{U}$ . Therefore, $u_{t+1\mid t+1}=u_{t+1\mid t}^{\ast}$ is feasible for (24) at time $t+1$ . Since $u_{t+i+1\mid t}^{\ast}\in\mathds{U}^{\prime}$ , for $i=1,\ldots,N-2$ the control sequence $u_{t+i+1\mid t+1}=u_{t+i+1\mid t}^{\ast}$ is also feasible at time $t+1$ for (24). Under the above control sequence $x_{t+N\mid t+1}=x_{t+N\mid t}\in\mathcal{X}_{f}$ because $x_{t+1\mid t+1}=x_{t+1\mid t}$ in (24). Therefore, $u_{t+N\mid t+1}=u^{\prime}$ is feasible for some $u^{\prime}\in\mathds{U}^{\prime}$ satisfying the Assumption 1. In this way, we have constructed a feasible control sequence $(u_{t+i+1\mid t+1})_{i=0}^{N-1}$ for (24) and due to the optimality of $\hat{V}_{m}(x_{t+1\mid t})$ , by substituting the feasible control sequence $(u_{t+i+1\mid t+1})_{i=0}^{N-1}=\{u^{\prime},(u_{t+i+1\mid t}^{\ast})_{i=0}^{N-2}\}$ in (24), we get

[TABLE]

due to Assumption 1. Therefore, $\hat{V}_{m}(x_{t+1\mid t})\leqslant V_{m}(x_{t})\leqslant c$, which implies $x_{t+1\mid t}\in\mathcal{X}_{c}(x_{t+1}^{r})\subset\mathcal{X}_{c}$ due to our convention (25).

(ii)

Since $x_{t+1\mid t}=\bar{f}(x_{t},u_{t}^{m})\in\mathcal{X}_{c}(x_{t+1}^{r})$, we get $x_{t+1}=x_{t+1\mid t}+g(x_{t})\tilde{u}_{t}\in\mathcal{X}_{c}(x_{t+1}^{r})+\mathds{W}^{\prime}\subset\mathcal{X}_{c}+\mathds{W}^{\prime}$ due to Lemma 3-(i).

(iii)

Notice that the optimization problems (14) and (24) do not have constraints on state. Let $(v_{t+i+1})_{i=0}^{N-1}$ be the minimizer of (24) at $t+1$ , which means $\hat{V}_{m}(x_{t+1\mid t})=V_{m}(x_{t+1\mid t},(v_{t+i+1})_{i=0}^{N-1})$ . Since $(v_{t+i+1})_{i=0}^{N-1}$ satisfies constraints on control (11) and (13), it is feasible for (14) at $t+1$ . Therefore, due to the optimality of $V_{m}(x_{t+1})$ , we get

$$V_{m}(x_{t+1})\leqslant V_{m}\bigl(x_{t+1},(v_{t+i+1})_{i=0}^{N-1}\bigr),$$

which in turn implies

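$$V_{m}(x_{t+1})-\hat{V}_{m}(x_{t+1\mid t})\leqslant V_{m}\bigl(x_{t+1},(v_{t+i+1})_{i=0}^{N-1}\bigr)-V_{m}\bigl(x_{t+1\mid t},(v_{t+i+1})_{i=0}^{N-1}\bigr).$$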

Now we notice that the cost function (9) is Lipschitz continuous in its first argument on the set $\mathcal{X}_{c}+\mathds{W}^{\prime}$ while keeping the second argument fixed and $x_{t+1},x_{t+1\mid t}\in\mathcal{X}_{c}+\mathds{W}^{\prime}$ . Since $v_{t+i+1}\in\mathds{U}$ for $i=0,\ldots,N-1$ , there exists some $c_{3}>0$ such that

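$$V_{m}\bigl(x_{t+1},(v_{t+i+1})_{i=0}^{N-1}\bigr)-V_{m}\bigl(x_{t+1\mid t},(v_{t+i+1})_{i=0}^{N-1}\bigr)\leqslant c_{3}\left\|x_{t+1}-x_{t+1\mid t}\right\|=c_{3}\left\|g(x_{t})\tilde{u}_{t}\right\|.$$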

Since $t$ was arbitrary, the above result holds for all $t$.

(iv)

We compute a bound on $V_{m}(x_{t+1})-V_{m}(x_{t})=V_{m}(x_{t+1})-\hat{V}_{m}(x_{t+1\mid t})+\hat{V}_{m}(x_{t+1\mid t})-V_{m}(x_{t})$ . Then by combining the results of Lemma 3-(i) and Lemma 3-(iii), we get $V_{m}(x_{t+1})-V_{m}(x_{t})\leqslant-c_{\mathrm{s}}(x_{t},u_{t\mid t}^{\ast})+c_{3}\left\|g(x_{t})\tilde{u}_{t}\right\|$ . Then due to Lemma 2-(ii), we have

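$$V_{m}(x_{t+1})-V_{m}(x_{t})\leqslant-c_{1}\left\|x_{t}-x_{t}^{r}\right\|^{2}+c_{3}\left\|g(x_{t})\tilde{u}_{t}\right\|\leqslant-\frac{c_{1}}{c_{2}}V_{m}(x_{t})+c_{3}\left\|g(x_{t})\tilde{u}_{t}\right\|,\quad\text{that is,}\quad V_{m}(x_{t+1})\leqslant\gamma V_{m}(x_{t})+c_{3}\left\|g(x_{t})\tilde{u}_{t}\right\|,$$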

where $\gamma=1-\frac{c_{1}}{c_{2}}<1$ .

∎

Proof of Proposition 1.

(i)

We can observe that $c_{3}\left\|g(x_{t})\tilde{u}_{t}\right\|\leqslant c_{3}w_{\max}^{\prime}=\frac{c_{1}}{c_{2}}\bar{c}$ . Therefore, due to Lemma 3-(iv), we get

$$V_{m}(x_{t+1})\leqslant\gamma V_{m}(x_{t})+\frac{c_{1}}{c_{2}}\bar{c}.$$

Since $c\geqslant V_{m}(x_{t})>\bar{c}$ for all $x_{t}\in\mathcal{X}_{c}(0)\setminus\mathcal{X}_{\bar{c}}(0)$, we have $V_{m}(x_{t+1})-V_{m}(x_{t})\leqslant\frac{c_{1}}{c_{2}}(\bar{c}-V_{m}(x_{t}))<0$.

(ii)

If $V_{m}(x_{t})\leqslant\bar{c}$ then $V_{m}(x_{t+1})\leqslant\gamma V_{m}(x_{t})+c_{3}\left\|g(x_{t})\tilde{u}_{t}\right\|\leqslant\gamma\bar{c}+\frac{c_{1}}{c_{2}}\bar{c}=\bar{c}\implies x_{t+1}\in\mathcal{X}_{\bar{c}}(0)$.

(iii)

For every $x_{t}\in\mathcal{X}_{c}(x_{t}^{r})\subset\mathcal{X}_{c}\subset\mathcal{X}$ , $x_{t+1}\in\mathcal{X}_{c}(x_{t+1}^{r})\subset\mathcal{X}_{c}\subset\mathcal{X}$ due to Proposition 1-(i).

∎

Proof of Theorem 1.

Let us consider $V(x_{t},K_{t})\coloneqq V_{m}^{2}(x_{t})+a_{0}V_{a}(K_{t})$ , where $a_{0}=\frac{2}{1-\theta}\left(\hat{c}_{3}\hat{\delta}_{g}\sigma\right)^{2}$ . Clearly, $V$ is continuous in $x_{t}$ and $K_{t}$ , and satisfies:

[TABLE]

for all $t\geqslant\hat{N}\geqslant N$ . From Lemma 3-(iv) we have

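$$V_{m}(x_{t+1})\leqslant\gamma V_{m}(x_{t})+\hat{c}_{3}\left\|g(x_{t})\tilde{u}_{t}\right\|\leqslant\gamma V_{m}(x_{t})+\hat{c}_{3}\hat{\delta}_{g}\left\|\tilde{u}_{t}\right\|.$$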

Therefore,

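$$V_{m}^{2}(x_{t+1})\leqslant 2\gamma^{2}V_{m}^{2}(x_{t})+2\left(\hat{c}_{3}\hat{\delta}_{g}\right)^{2}\left\|\tilde{u}_{t}\right\|^{2}.$$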

Now, we compute $V(x_{t+1},K_{t+1})-V(x_{t},K_{t})$ and substitute $V_{a}(K_{t+1})-V_{a}(K_{t})\leqslant-\frac{1-\theta}{\sigma^{2}}\left\|\tilde{u}_{t}\right\|^{2}+\frac{1}{m^{2}}\left\|\varepsilon_{j}(x_{t})\right\|^{2}$ from Lemma 1-(ii) to get

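$$\begin{aligned}V(x_{t+1},K_{t+1})-V(x_{t},K_{t})&\leqslant-(1-2\gamma^{2})V_{m}^{2}(x_{t})+2\left(\hat{c}_{3}\hat{\delta}_{g}\right)^{2}\left\|\tilde{u}_{t}\right\|^{2}-a_{0}\frac{1-\theta}{\sigma^{2}}\left\|\tilde{u}_{t}\right\|^{2}+\frac{a_{0}}{m^{2}}\left\|\varepsilon_{j}(x_{t})\right\|^{2}\\&=-(1-2\gamma^{2})V_{m}^{2}(x_{t})+\frac{a_{0}}{m^{2}}\left\|\varepsilon_{j}(x_{t})\right\|^{2}\\&\leqslant-(1-2\gamma^{2})c_{1}^{2}\left\|x_{t}\right\|^{4}+\frac{2}{1-\theta}\left(\sigma\hat{c}_{3}\hat{\delta}_{g}\frac{\beta}{m}\right)^{2}\left\|x_{t}\right\|^{4}=-\eta\left\|x_{t}\right\|^{4},\end{aligned}$$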

where $\eta=(1-2\gamma^{2})c_{1}^{2}-\frac{2}{1-\theta}\left(\sigma\hat{c}_{3}\hat{\delta}_{g}\frac{\beta}{m}\right)^{2}>0$ because $\beta<\frac{c_{1}m}{\sqrt{2}\sigma\hat{c}_{3}\hat{\delta}_{g}}\sqrt{(1-2\gamma^{2})(1-\theta)}$ . Therefore,

$$\eta\left\|x_{t}\right\|^{4}\leqslant V(x_{t},K_{t})-V(x_{t+1},K_{t+1}).$$

By summing from $t=\hat{N}$ to $k+\hat{N}$ on both sides, we get

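$$\eta\sum_{t=\hat{N}}^{k+\hat{N}}\left\|x_{t}\right\|^{4}\leqslant V(x_{\hat{N}},K_{\hat{N}})-V(x_{k+\hat{N}+1},K_{k+\hat{N}+1})\leqslant V(x_{\hat{N}},K_{\hat{N}})\leqslant\hat{c}^{2}+\frac{4a_{0}}{\theta}\bar{W},$$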

where the last inequality is due to Lemma 1-(i) and the fact that $x_{\hat{N}}\in\mathcal{X}_{\hat{c}}(0)$ . Since the right hand side of the above inequality is independent of $k$ , we have $\sum_{t=\hat{N}}^{\infty}\left\|x_{t}\right\|^{4}\leqslant\frac{1}{\eta}\left(\hat{c}^{2}+\frac{4a_{0}}{\theta}\bar{W}\right)$ , which implies $\left\|x_{t}\right\|\rightarrow 0$ as $t\rightarrow\infty$ . ∎

Bibliography


[1] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[3] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[4] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
[5] L. Hewing, K. P. Wabersich, M. Menner, and M. N. Zeilinger. Learning-based model predictive control: Toward safe learning in control. Annual Review of Control, Robotics, and Autonomous Systems, 3:269–296, 2020.
[6] Y. Li, N. Li, H. E. Tseng, A. Girard, D. Filev, and I. Kolmanovsky. Safe reinforcement learning using robust action governor. arXiv preprint arXiv:2102.10643, 2021.
[7] F. Berkenkamp, M. Turchetta, A. P. Schoellig, and A. Krause. Safe model-based reinforcement learning with stability guarantees. arXiv preprint arXiv:1705.08551, 2017.
[8] A. Liu, G. Shi, S. Chung, A. Anandkumar, and Y. Yue. Robust regression for safe exploration in control. In Learning for Dynamics and Control, pages 608–619. PMLR, 2020.
