Deep Model Predictive Control
Prabhat K. Mishra, Mateus V. Gasparino, Andres E. B. Velasquez, Girish Chowdhary

TL;DR
This paper introduces a deep learning-based model predictive control method for nonlinear systems with unknown uncertainties, using neural networks for disturbance approximation and a tube-based controller for stability and constraint satisfaction.
Contribution
It proposes a novel integration of deep neural networks with tube-based MPC to handle unknown, state-dependent uncertainties in nonlinear systems.
Findings
Neural networks effectively approximate unknown disturbances.
The combined approach guarantees constraint satisfaction.
Closed-loop stability is maintained during learning.
Abstract
This paper presents a deep learning based model predictive control algorithm for control affine nonlinear discrete time systems with matched and bounded state-dependent uncertainties of unknown structure. Since the structure of uncertainties is not known, a deep neural network (DNN) is employed to approximate the disturbances. In order to avoid any unwanted behavior during the learning phase, a tube based model predictive controller is employed, which ensures satisfaction of constraints and input-to-state stability of the closed-loop states.
Taxonomy
Topics: Advanced Control Systems Optimization · Fault Detection and Control Systems · Fuzzy Logic and Control Systems
Prabhat K. Mishra
UIUC, USA
Mateus V. Gasparino
UIUC, USA
Andres E. B. Velasquez
UIUC, USA
Girish Chowdhary
UIUC, USA
Keywords: safety critical systems, deep learning, model predictive control, adaptive control
1 Introduction
Modeling errors and environmental uncertainties are unavoidable in practice. Therefore, purely model based controllers tend to exhibit unexpected or unwanted behaviors in the real world. One key solution to this problem is to employ learning-based methods that utilize powerful learning elements such as deep neural networks (DNN). Such methods attempt to learn a good model of the underlying nonlinear dynamics while the system is in operation, in a manner that does not compromise safety and performance. We refer readers to [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] and references therein.
To address the above challenge, the available domain knowledge in the form of an approximate model is utilized in [12, 13], along with the learning elements. We refer readers to an excellent survey on safe reinforcement learning [14] and references therein. One key approach for safe learning is to augment the learning based controller with model predictive control (MPC) and related methods to guarantee safety through constraint satisfaction and improve the performance over time [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]. The proper pairing of learning and MPC can bring together the useful features of both methods while compensating for their drawbacks.
Our main goal in this article is to address these gaps by creating a learning based MPC architecture with performance and safety guarantees. When uncertainties are structured, they can be simply represented in terms of (possibly) high dimensional feature basis functions and the learning mechanism acts on the disturbances [28, 29, 30, 31, 32, 33]. These disturbance rejecting actions taken by the learning mechanism are experienced by the MPC controller as additional disturbances. If the learning mechanism eventually rejects the disturbance, then MPC can ensure asymptotic convergence of closed-loop states while satisfying the underlying constraints [34]. In this article, we extend the results of [34] to unstructured uncertainties.
We present the problem setup in §2. The formulation of the Deep MPC controller is given in §3. We validate our theoretical results with a numerical experiment in §4 and conclude in §5. The real time implementable training mechanism of the DNN, the stability of the overall algorithm and the proofs are given in the appendix.
We let denote the set of real numbers, non-negative integers and positive integers, respectively. For a given vector and positive (semi)-definite matrix , is used to denote . For a given matrix , the trace, the largest eigenvalue, pseudo-inverse and Frobenius norm are denoted by , , and , respectively. By notation and , we mean the standard norm and norm, respectively, when is a vector, and induced norm and norm, respectively, when is a matrix. A vector or a matrix with all entries [math] is represented by and is an identity matrix of appropriate dimensions. We let denote the column of a given matrix .
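The matrix and vector quantities named above correspond to standard NumPy operations. The sketch below is only illustrative: the helper name `weighted_sq_norm` and the sample values are hypothetical, and it assumes that the weighted quantity is the quadratic form v^T M v and that the two vector norms are the Euclidean and infinity norms.

```python
import numpy as np

def weighted_sq_norm(v, M):
    """Quadratic form v^T M v for a vector v and a symmetric PSD matrix M."""
    v = np.asarray(v, dtype=float)
    return float(v @ np.asarray(M, dtype=float) @ v)

# Hypothetical sample data.
A = np.array([[2.0, 0.0], [0.0, 3.0]])
v = np.array([1.0, 2.0])

trace_A = np.trace(A)                       # trace
lam_max_A = np.max(np.linalg.eigvalsh(A))   # largest eigenvalue (A symmetric)
pinv_A = np.linalg.pinv(A)                  # Moore-Penrose pseudo-inverse
fro_A = np.linalg.norm(A, "fro")            # Frobenius norm
norm_2 = np.linalg.norm(v, 2)               # Euclidean norm of a vector
norm_inf = np.linalg.norm(v, np.inf)        # infinity norm of a vector
```

For a matrix argument, `np.linalg.norm(A, 2)` and `np.linalg.norm(A, np.inf)` return the corresponding induced norms.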
2 Problem setup
Let us consider a discrete time dynamical system
[TABLE]
- (1-a) , , is a compact set,
- (1-b) system function , control influence function are given Lipschitz continuous functions and represent domain knowledge or prior knowledge of the system dynamics,
- (1-c) is the state dependent matched uncertainty at time such that , is continuous, for some and for every .
The term in the right hand side of (1) represents the prior knowledge of the dynamics and the remaining term represents the unknown part of the dynamics or uncertainties. We refer readers to [35, 36, 37, 38, 39, 40] for a few related problem formulations.
3 Deep Model predictive controller
Our proposed solution is based on constraint satisfaction and cost minimization capabilities of MPC, and universal approximation property of neural networks. We break the applied control such that
[TABLE]
where is the output of the DNN and is the MPC component, at time . The relevant details about the DNN are given in Appendix §A. The MPC controller employs only the nominal dynamics of (1), which are given below for easy reference
[TABLE]
Therefore, the dynamics (1) can be written as
[TABLE]
Notice that in (4), the term is independent of the MPC control component . Therefore, MPC experiences it as a disturbance. In a broader sense, the MPC component is responsible for input-to-state stability (ISS) of closed-loop states in the presence of bounded disturbances, and the DNN component acts on . In particular, the job of is to approximate and keep the approximation error uniformly bounded with a known bound so that MPC can always experience a bounded disturbance.
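To make the role of the two control components concrete, the following sketch uses a toy two-state, single-input system. Everything in it is an assumption made for illustration: the functions `f`, `g` and `h`, the matched form x+ = f(x) + g(x)(u + h(x)), and the sign convention u = u_m - u_a. It only shows that whatever the adaptive term fails to cancel reaches the nominal model as an additive disturbance.

```python
import numpy as np

def f(x):
    # Hypothetical nominal drift.
    return np.array([[1.0, 0.1], [0.0, 0.9]]) @ x

def g(x):
    # Hypothetical control influence matrix (state independent here).
    return np.array([[0.0], [0.1]])

def h(x):
    # Hypothetical bounded, state-dependent matched uncertainty.
    return np.array([np.tanh(x[0]) + 0.5 * np.sin(x[1])])

def step_true(x, u):
    # Assumed true dynamics: the uncertainty enters through the input channel.
    return f(x) + g(x) @ (u + h(x))

def step_nominal(x, u_m):
    # Nominal model seen by the MPC component.
    return f(x) + g(x) @ u_m

def apparent_disturbance(x, u_m, u_a):
    # Mismatch between the true step and the nominal step under u = u_m - u_a.
    return step_true(x, u_m - u_a) - step_nominal(x, u_m)

x = np.array([0.3, -0.2])
u_m = np.array([0.5])
w_perfect = apparent_disturbance(x, u_m, h(x))        # ideal cancellation
w_none = apparent_disturbance(x, u_m, np.zeros(1))    # no adaptive action
```

Under these assumptions the mismatch equals g(x)(h(x) - u_a), so it vanishes when the adaptive term matches the uncertainty exactly and stays bounded whenever both h and u_a are bounded.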
Deep MPC builds on the celebrated tube based MPC [41], with some differences that arise from the inclusion of the DNN component . Tube based MPC ensures that the closed-loop states stay within a tube around a reference trajectory. The trackable reference trajectory is obtained by solving a reference governor problem offline under the tightened constraints for regulation problems. Once a trackable reference trajectory is obtained by spending only a part of the available control authority, a reference tracking problem without state constraints is solved online that utilizes full control authority.
Constraint tightening in the reference governor allows satisfaction of the actual constraints by the actual states and actual actions. Knowledge of the exact bound on the disturbance is therefore needed to tighten the constraints. Although the disturbance in the dynamical system (1) at time is , the disturbance experienced by MPC is , which can be shown to be uniformly bounded through a carefully designed DNN and its training mechanism. More details about obtaining the bounds and are given in Appendix §A.1. Therefore, we re-define the disturbance set and control set as follows:
[TABLE]
These modifications in tube-based MPC are already pointed out in [34, 26, 42]. For some optimization horizon , an offline reference governor is utilized to generate a reference trajectory
[TABLE]
In particular, the reference trajectory (5) is obtained by solving the following optimal control problem with penalty matrices and tightened sets :
[TABLE]
where is defined in (3). The tightened constraint sets and can be obtained by following the approach of [41, §7]. In order to design the online reference tracking MPC, we first choose an optimization horizon and positive definite matrices , which can be different from those chosen for the reference governor. Let
[TABLE]
be the cost per stage at time predicted at time and let be the terminal cost with . The terminal cost is treated as a local control Lyapunov function within a terminal set
[TABLE]
as in [41] by making the following assumption:
Assumption 1.
There exists a control such that the following holds
[TABLE]
The above assumption is standard in the literature. Refer to [41, §4] for more details; we made a minor modification here for simplicity. Let us define
[TABLE]
The online reference tracking MPC minimizes (9) at each time instant under the following constraints:
[TABLE]
Notice that the constraint (11) is different from the constraints present in the tube-based MPC formulation [41]. We define the underlying optimal control problem as follows:
[TABLE]
Let the optimizer of the above problem be . Then the optimal cost will be . The first control is called the MPC component and is applied along with to the system at time .
4 Numerical experiment
We consider wing-rock dynamics to corroborate our result. Letting denote the roll angle in radians, and denote the roll rate in radians per second, the state of the wing-rock dynamics model is at time . We consider the following discrete time dynamics:
[TABLE]
where , , and is bounded uncertainty. In order to generate for the purpose of simulation, we use , with , where
[TABLE]
and is a truncated normal random variable with . The function is saturated by a standard saturation function as , where and is a standard saturation function with the threshold . The controller is not aware of and . The admissible state and control sets are given below:
[TABLE]
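The simulated uncertainty described above combines a saturation with truncated Gaussian noise. A minimal sketch of these two ingredients follows. The helper names, the raw uncertainty `h_raw`, and every numerical value (standard deviation, truncation bound, threshold) are hypothetical placeholders, not the values used in the experiment.

```python
import numpy as np

def sat(v, threshold):
    """Standard saturation: clip every entry of v to [-threshold, threshold]."""
    return np.clip(v, -threshold, threshold)

def truncated_normal(rng, std, bound, size):
    """Zero-mean normal samples restricted to [-bound, bound] by rejection."""
    samples = rng.normal(0.0, std, size=size)
    outside = np.abs(samples) > bound
    while outside.any():
        # Redraw only the samples that fell outside the truncation interval.
        samples[outside] = rng.normal(0.0, std, size=int(outside.sum()))
        outside = np.abs(samples) > bound
    return samples

def h_raw(x):
    # Hypothetical unsaturated uncertainty in roll angle x[0] and roll rate x[1].
    return 0.8 * x[0] + 0.3 * abs(x[0]) * x[1] + 0.5 * x[0] ** 3

rng = np.random.default_rng(0)
noise = truncated_normal(rng, std=0.1, bound=0.2, size=1000)
x = np.array([1.2, -0.4])
h_bounded = sat(h_raw(x) + noise[0], threshold=1.0)
```

Saturating the sum keeps the simulated uncertainty inside a known interval, which is the property the controller design relies on.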
Our control objective is to steer the states of the system from to the origin. We compare our proposed approach with two controllers, namely tube MPC [41] and shallow MPC. In order to design shallow MPC, we follow our approach but we consider only a single layer neural network with neurons. To design the deep MPC, we use a four layer network (three hidden layers and an output layer) with sizes respectively, where the first hidden layer has 5 neurons and the outermost layer has neurons. The weights of the output layer are updated with our adaptive weight update law (§A.1), while the three hidden layers are trained on a secondary machine (§A.2) using SGD with momentum constant 0.9 and learning rate 0.01. We use nonlinear activation functions after each of the inner layers, and these functions are respectively . We follow the approach of [43] for the experience selection (inclusion and removal of data pairs [44]). In particular, we construct a matrix , where consists of labels, and compute its singular values. If replacing the label by a new label yields larger singular values than before, then the new data pair is stored at the position of the replay buffer.
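The replacement rule for the replay buffer can be sketched as follows. This is a hedged illustration in the spirit of [43], not the exact procedure used in the experiment: the function names are hypothetical, the stored vectors are kept as columns of a matrix `Z`, and "larger singular values" is interpreted here as a larger minimum singular value, which is an assumption.

```python
import numpy as np

def min_singular_value(Z):
    return float(np.linalg.svd(Z, compute_uv=False).min())

def try_replace(Z, z_new):
    """Try z_new in every column of Z and keep the best strict improvement.

    Returns the (possibly updated) matrix and the replaced column index,
    or None if no replacement increases the minimum singular value.
    """
    best_idx, best_sv = None, min_singular_value(Z)
    for i in range(Z.shape[1]):
        candidate = Z.copy()
        candidate[:, i] = z_new
        sv = min_singular_value(candidate)
        if sv > best_sv:
            best_idx, best_sv = i, sv
    if best_idx is None:
        return Z, None
    Z_updated = Z.copy()
    Z_updated[:, best_idx] = z_new
    return Z_updated, best_idx

# Two nearly parallel stored columns: poorly conditioned buffer.
Z = np.array([[1.0, 1.0],
              [0.0, 0.01]])
Z_good, idx_good = try_replace(Z, np.array([0.0, 1.0]))    # informative sample
Z_bad, idx_bad = try_replace(Z, np.array([1.0, 0.005]))    # redundant sample
```

In this toy case the informative sample replaces one of the nearly parallel columns, while the redundant sample is rejected and the buffer is left unchanged.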
Our experimental results are depicted in Fig. 1. Due to the sudden change in at the time instants shown by vertical grid lines, tube MPC exhibits oscillations in roll angle. The performance of shallow MPC is affected at each instant of abrupt change, which shows its inability to generalize. In contrast, deep MPC generalizes well with only three hidden layers.
5 Conclusion
A deep learning based algorithm is presented for safety critical systems by combining the approaches of adaptive control based label generation and tube MPC. A numerical experiment demonstrates that our approach with a single layer neural network (shallow MPC) outperforms tube MPC. The advantage of deep MPC is demonstrated in terms of a further improvement in performance and convergence to a close vicinity of the origin. Future work may incorporate the results of [45, 46, 47, 48].
Acknowledgments
We gratefully acknowledge financial support from ONR MURI N00014-19-1-2373 and joint NSF CPS USDA grant 2018-67007-28379.
Appendix
Appendix A Deep Neural Network
Any continuous function on a compact set can be approximated by a multi-layer network with number of layers such that
[TABLE]
where , for , are activation functions and ideal weights, respectively, in the layer. The reconstruction error function is bounded by a known constant for each , i.e., . Therefore, we can represent with the help of a neural network with a desired accuracy. If the neural network is not minimal then the ideal weights may not be unique. However, for the neural-adaptive controller design only the existence of ideal weights is assumed, which is always guaranteed when is a continuous function on a compact set [37, §7.1]. Let us define as the output of the last activation layer under the ideal weights of hidden layers and be the ideal weights of the output layer, then
[TABLE]
There are number of neurons in the output layer. The first row of represents the bias term and the first element of is . The ideal hidden layer weights defining are neither known nor unique.
We update the weights of the output layer on the main machine in real time at each time instant with the help of a weight update law while keeping the weights of the hidden layers fixed. The hidden layers are trained on a parallel secondary machine using the approach of [32], in which the weights of the output layer are copied from the main machine at the start of the training and remain fixed during the training. Once the training of the DNN on the secondary machine is complete, the new weights of the hidden layers are loaded on the main machine and remain fixed until a new set of weights is obtained from the secondary machine. The schematic of the DNN in the loop with MPC is shown in Fig. 2.
Remark 1.
For the implementation of our controller, we can access the output of the last activation layer of DNN without knowing the functions and .
Remark 2 (Necessity of second DNN).
In many practical applications, uncertainties appear in the dynamics through interaction with the environment, and neural networks trained on one autonomous vehicle do not perform well on another vehicle due to slight differences in hardware, such as the aperture of a camera. In such situations, deep learning based algorithms cannot be used for mass production without some provision for online training. The second DNN in Fig. 2 makes it possible to improve or re-adjust features when the hardware or environment changes.
At time , the neural network is initialized with random weights on both machines, and for a given as input, denote the output of the last activation layer at . Let denote the instants when the weights of hidden layers are updated on the main machine after the completion of the training. Let be the output of the last activation layer after the training for a given as input. We can use bounded neurons in the last activation layer, which results in bounded and . Due to the universal approximation property of DNN, exists with bounded . We can assume that there exists for each such that for each . The boundedness of both and ensures boundedness of . We need not compute their bounds for the controller design. For , (17) becomes
[TABLE]
where is the overall reconstruction error.
Notice that even when the weights of hidden layers are randomly assigned as in ELM [49], the universal approximation property of the neural network allows us to make the overall reconstruction error as small as desired by increasing the width of the network. However, a network with trained hidden layers can capture several useful features, which in turn results in performance improvement [32].
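The effect of widening a network with randomly assigned hidden weights can be illustrated with a small regression sketch. It is not the network used in the paper: the target function, the weight distributions, the widths and the least-squares fit of the output layer are all assumptions chosen for illustration. The leading constant feature plays the role of the bias term.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_features(x, W_hidden, b_hidden):
    """Bounded tanh features from a fixed random hidden layer, with a leading 1."""
    hidden = np.tanh(x[:, None] * W_hidden[None, :] + b_hidden[None, :])
    return np.hstack([np.ones((x.shape[0], 1)), hidden])

def fit_output_layer(x, y, width):
    """Draw random hidden weights, fit only the output layer, return max error."""
    W_hidden = rng.normal(0.0, 2.0, size=width)
    b_hidden = rng.normal(0.0, 1.0, size=width)
    Phi = random_features(x, W_hidden, b_hidden)
    W_out, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return float(np.max(np.abs(Phi @ W_out - y)))

x = np.linspace(-1.0, 1.0, 200)
y = np.sin(3.0 * x)                 # hypothetical "unknown" function
err_narrow = fit_output_layer(x, y, width=2)
err_wide = fit_output_layer(x, y, width=50)
```

Only the output weights are fitted here, which mirrors the observation that random hidden layers already give a small reconstruction error once the network is wide enough. Trained hidden layers are expected to reach a comparable accuracy with fewer features.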
We employ
[TABLE]
as an adaptive (learning) control at time , where is the weight of the output layer, which is trained according to the adaptive weight update law and is a feature basis function obtained from the last activation layer of DNN after training. In the next subsections we provide the relevant details of the training of DNN.
A.1 Adaptive learning of on the main machine
We make the following assumption:
Assumption 2.
There exist for , and such that , for , and for every and .
The above assumption is standard in the literature [29, 50, 51]. A priori knowledge about the bounds on the ideal weights of the output layer is useful to avoid the parameter drift phenomenon. If the activation functions in the last hidden layer are bounded, e.g., sigmoid or tanh, then will also be bounded for each and for all .
We initialize such that ; . For a given learning rate and for , we employ the following weight update law:
[TABLE]
where represents the pseudo-inverse of the left invertible matrix . Notice that the first element of is one. Therefore, for all and , which avoids any possibility of division by zero.
We employ the discrete projection method to ensure boundedness of for , as follows:
[TABLE]
Let and . It is evident that for all due to the projection. Therefore, the neuro-adaptive control component is bounded, i.e.,
[TABLE]
for and for all . The apparent disturbance term in (4) is also bounded, i.e.,
[TABLE]
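The structure of such an update (a label recovered through the pseudo-inverse, a step normalized by the squared feature norm, and a projection that keeps the weights bounded) can be sketched as below. This is a generic normalized-gradient stand-in and not the update law (20) or the projection (21) themselves. The function names, the assumed dynamics x+ = f(x) + g(x)(u + h(x)), and the column-wise norm bounds are assumptions made for illustration.

```python
import numpy as np

def estimate_label(x, x_next, u, f, g):
    """Recover h(x) from one transition, assuming x_next = f(x) + g(x) (u + h(x))."""
    return np.linalg.pinv(g(x)) @ (x_next - f(x)) - u

def project_columns(W, bounds):
    """Scale each column of W back onto the ball of radius bounds[i] if needed."""
    norms = np.linalg.norm(W, axis=0)
    scale = np.minimum(1.0, bounds / np.maximum(norms, 1e-12))
    return W * scale

def update_output_weights(W, phi, label, lr, bounds):
    """One normalized-gradient step on the output weights, then projection.

    W: (n_features, m), phi: (n_features,) with phi[0] == 1, label: (m,).
    """
    err = W.T @ phi - label
    # phi @ phi >= 1 because the first feature is the constant 1.
    W_new = W - lr * np.outer(phi, err) / (phi @ phi)
    return project_columns(W_new, bounds)

phi = np.array([1.0, 0.5, -0.3])
label = np.array([0.7])
W0 = np.zeros((3, 1))
W_free = update_output_weights(W0, phi, label, lr=1.0, bounds=np.array([10.0]))
W_tight = update_output_weights(W0, phi, label, lr=1.0, bounds=np.array([0.1]))
```

With a unit learning rate and an inactive projection, one step reproduces the label exactly for the current feature vector. With a tight bound, the projection caps the column norm, which is what keeps the adaptive control component bounded.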
A.2 Self-supervised learning of on a secondary machine
Let represent the time instants when we begin the training of the DNN. Suppose that data samples are required for the training, which are stored in a buffer of size . We do not have access to the labeled data pairs . Therefore, we follow an approach similar to that of [32] for the data collection and training.
We fix and for each , the labeled pairs are stored in the buffer. Recall that for , where is obtained by the random initialization of the weights of hidden layers. At , we randomly sample data pairs for the training of DNN. We fix the weights of the output layer to be and train the network. Notice that the training of the DNN does not affect the operation of the system because the controlled system still employs as the adaptive control in which only is updated at each time instant by using the weight update law discussed in §A.1. At , we get our first trained network. For , we employ as an adaptive control. This process of training, exploiting and storing is repeated at each time . For each , is set to be in the secondary DNN and remains fixed during the training. Therefore, we are interested in finding the weights which minimize the following cost for a given input and corresponding label :
[TABLE]
Let be training data consisting of data points randomly sampled from the buffer for the training. The following loss function is considered for the training of DNN:
[TABLE]
At , the buffer becomes full, so new data can be added only after the removal of some old data by using a suitable experience selection method [44]. The available approaches are based on retaining the most informative data according to some criterion [52] and on ensuring sufficient diversity [53]. Our present approach is compatible with any existing method of experience selection. However, different methods may result in different performance for different problems, and their choice may also depend on the availability of resources. We leave the choice of the experience selection method to the user.
Appendix B Stability
We recall the following definition:
Definition 1 ([54], page 117).
The vector sequence is called small in mean square sense if it satisfies for all , a given constant and some , where .
Some straightforward arguments as in [54, §4.11.3] give us the following result:
Lemma 1.
Consider the dynamical system (1), weight update law (20) and the projection method (21). Let Assumption 2 hold and define . Then for all ,
- (i) ,
- (ii) ,
- (iii) is small in mean square sense with and as per Definition 1.
We provide a proof of Lemma 1 in Appendix C. Let be the level set around of radius generated by and be their union. In particular,
[TABLE]
Properties of the value function are summarized in the following Lemma. These results are standard in the literature [55]. We provide their proofs in Appendix C for completeness.
Lemma 2.
- (i) If then for every .
- (ii) [34, Lemma 3] There exist such that
[TABLE]
Lemma 2-(i) ensures the satisfaction of terminal constraint on states just by construction. Refer to [56, Proposition 1] and [41, Proposition 1] for minor differences due to (1), (14) and Assumption 1. For the purpose of analysis, we define an intermediate optimization problem by replacing in (10) by . In particular,
[TABLE]
Notice that we keep fixed in both problems (14) and (24), and therefore, we can adopt the following convention:
[TABLE]
Remark 3.
Notice that the constraint on the first control (11) includes to make MPC aware of the adaptive action. Since nonlinearly depends on due to the nonlinear function , the set-valued control move map becomes state-dependent. Our analysis is based on using the value function of MPC (14) as a candidate Lyapunov function. The presence of the state-dependent constraint (11) prevents us from proving robustness of MPC by invoking [57, propositions 7,8 or 11]. We defined an intermediate optimization problem (24) to circumvent this difficulty. For the same reason, the results of [41, propositions 2 and 4] are not directly applicable here.
Important results related to tube MPC are summarized in the following Lemma. Refer to [41, Proposition 2, Proposition 4] for a detailed discussion. We provide their proofs in Appendix C to highlight the adjustments and for completeness.
Lemma 3.
If Assumption 1 is satisfied, then for all for every the following hold:
- (i) , and .
- (ii) .
- (iii) .
- (iv) There exists such that
[TABLE]
Lemma 3-(iv) along with Lemma 2-(ii) ensures that the controlled system is input-to-state stable (ISS) because it admits as an ISS Lyapunov function [58, Lemma 3.5]. In the case of structured uncertainty as , which implies [34, Theorem 1]. Such results are not available in the presence of unstructured uncertainty. However, the existence of invariant and attractive tubes is possible when and are small. We have the following result:
Proposition 1.
Let us define . If , then for all , the following hold:
- (i) for every , ,
- (ii) for every , .
- (iii) In addition, if , then for all .
Proposition 1 follows from arguments similar to those in [41, Proposition 4] and confirms the existence of an invariant tube and an attractive tube .
Suppose there exists some and such that . Since is a Lipschitz constant of on a compact set , there exists , which satisfies Lemma 3-(iv). Similarly, let there exist such that for every . Since depends on and , their reduction will result in shrinkage of the attractive tube . Moreover, since any level set within is invariant due to Proposition 1-(i), a further shrinkage is possible. However, asymptotic convergence is still not guaranteed. If , then we can get a stronger result provided a certain condition in terms of and is satisfied, and the reconstruction error has small gain type property within the invariant tube. We make the following assumption:
Assumption 3.
There exists such that for all and .
Generally, the norm bound on the reconstruction error is assumed to be linear in [29]. We assume it to be quadratic; otherwise, the above assumption is standard in the literature. We have the following result:
Theorem 1.
Consider the dynamical system (1) controlled by the Deep MPC, and let Assumptions 1, 2 and 3 hold. If , and , then as .
Notice that the main results of tube-based MPC (Proposition 1) are valid for small disturbances. Theorem 1 extends Proposition 1 by guaranteeing convergence of the states to the origin under the conditions on and . A smaller value of corresponds to faster convergence of the value function of the nominal MPC. Generally, the reconstruction error is very small compared with the disturbance. Therefore, the conditions on and are reasonable, and they can be verified both theoretically and empirically.
Appendix C Proofs
Proof of Lemma 1.
- (i) Since .
- (ii) We first compute
[TABLE]
where
[TABLE]
One important property of the projection (21) is the following [54, (4.61)]:
[TABLE]
Since due to (26), we can ensure . Therefore,
[TABLE]
By substituting in the above inequality, we get
[TABLE]
where the last inequality is due to . Therefore,
[TABLE]
- (iii)
[TABLE]
By summing from to in both sides, we get
[TABLE]
Therefore, is small in mean square sense with and as per the Definition 1.
∎
Proof of Lemma 2.
- (i) We recall the definitions of and from (7) and (23), respectively. Now, it is immediate to notice that .
- (ii) Since and are Lipschitz continuous, by [34, Lemma 3] there exist such that Lemma 2-(ii) holds. We mention key steps here for completeness. Since , we can choose .
Let be Lipschitz continuous with Lipschitz constants and , respectively. We can notice that (14) has no constraints on states and the constraints on control can be satisfied by at time .
Let us recall the definition of the cost function (9), then due to the optimality of , we get
[TABLE]
The above inequality is due to the substitution . Further,
[TABLE]
where . Since for all , there exists .
∎
Proof of Lemma 3.
- (i) Since , we get . Therefore, is feasible for (24) at time . Since , for the control sequence is also feasible at time for (24). Under the above control sequence because in (24). Therefore, is feasible for some satisfying the Assumption 1. In this way, we have constructed a feasible control sequence for (24) and due to the optimality of , by substituting the feasible control sequence in (24), we get
[TABLE]
due to the Assumption 1. Therefore, , which implies due to our convention (25).
- (ii) Since , we get due to the Lemma 3-(i).
- (iii) Notice that the optimization problems (14) and (24) do not have constraints on state. Let be the minimizer of (24) at , which means . Since satisfies constraints on control (11) and (13), it is feasible for (14) at . Therefore, due to the optimality of , we get
[TABLE]
which in turn implies
[TABLE]
Now we notice that the cost function (9) is Lipschitz continuous in its first argument on the set while keeping the second argument fixed and . Since for , there exists some such that
[TABLE]
Since was arbitrary, the above result holds for all .
- (iv) We compute a bound on . Then by combining the results of Lemma 3-(i) and Lemma 3-(iii), we get . Then due to Lemma 2-(ii), we have
[TABLE]
where .
∎
Proof of Proposition 1.
- (i) We can observe that . Therefore, due to Lemma 3-(iv), we get
[TABLE]
Since for all , we have .
- (ii) If then .
- (iii) For every , due to Proposition 1-(i).
∎
Proof of Theorem 1.
Let us consider , where . Clearly, is continuous in and , and satisfies:
[TABLE]
for all . From Lemma 3-(iv) we have
[TABLE]
Therefore,
[TABLE]
Now, we compute and substitute from Lemma 1-(ii) to get
[TABLE]
where because . Therefore,
[TABLE]
By summing from to on both sides, we get
[TABLE]
where the last inequality is due to Lemma 1-(i) and the fact that . Since the right hand side of the above inequality is independent of , we have , which implies as . ∎
References
- [1] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
- [2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- [3] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
- [4] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
- [5] L. Hewing, K. P. Wabersich, M. Menner, and M. N. Zeilinger. Learning-based model predictive control: Toward safe learning in control. Annual Review of Control, Robotics, and Autonomous Systems, 3:269–296, 2020.
- [6] Y. Li, N. Li, H. E. Tseng, A. Girard, D. Filev, and I. Kolmanovsky. Safe reinforcement learning using robust action governor. arXiv preprint arXiv:2102.10643, 2021.
- [7] F. Berkenkamp, M. Turchetta, A. P. Schoellig, and A. Krause. Safe model-based reinforcement learning with stability guarantees. arXiv preprint arXiv:1705.08551, 2017.
- [8] A. Liu, G. Shi, S. Chung, A. Anandkumar, and Y. Yue. Robust regression for safe exploration in control. In Learning for Dynamics and Control, pages 608–619. PMLR, 2020.
