Online learning with stability guarantees: A memory-based real-time model predictive controller
Lukas Schwenkel, Meriem Gharbi, Sebastian Trimpe, Christian Ebenbauer

TL;DR
This paper introduces a data-driven real-time MPC scheme that learns the value function online, ensuring stability and improving control performance by utilizing historical data for better warm starts.
Contribution
It presents a novel online learning method for MPC that guarantees stability and convergence, enhancing real-time control with data storage and learning.
Findings
Warm starts become asymptotically exact
Suboptimality diminishes over time
Performance improves with data storage and learning
Abstract
We propose and analyze a real-time model predictive control (MPC) scheme that utilizes stored data to improve its performance by learning the value function online with stability guarantees. For linear and nonlinear systems, a learning method is presented that makes use of basic analytic properties of the cost function and is proven to learn the MPC control law and the value function on the limit set of the closed-loop state trajectory. The main idea is to generate a smart warm start based on historical data that improves future data points and thus future warm starts. We show that these warm starts are asymptotically exact and converge to the solution of the MPC optimization problem. Thereby, the suboptimality of the applied control input resulting from the real-time requirements vanishes over time. Simulative examples show that existing real-time MPC schemes can be improved by storing…
| name | description |
|---|---|
| list of previous data points in | |
| list of previous input sequences in | |
| list of previous cost function values in | |
| list that contains all facets of the lower half of as lists of the indices of their extreme points in | |
| list that contains all facets of as lists of the indices of their extreme points in | |
| center point of | |
| center point of | |
| list of deleted facets in that can be overwritten | |
| list of deleted facets in that can be overwritten | |
| list that contains for each point in a list of facets in that are attached to this point | |
| list that contains for each point in a list of facets in that are attached to this point |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
\setkomafont
section \setkomafontsubsection \setkomafontsubsubsection\normal
Online learning with stability guarantees:
A memory-based real-time model predictive controller111This article is an extended version of [35] including all proofs, an application example, and a detailed description of the used algorithm.
Lukas Schwenkel
Institute for Systems Theory and Automatic Control, University of Stuttgart, Germany
Meriem Gharbi
Institute for Systems Theory and Automatic Control, University of Stuttgart, Germany
Sebastian Trimpe
Institute for Data Science in Mechanical Engineering, RWTH Aachen University, Germany
Intelligent Control Systems Group, Max Planck Institute for Intelligent Systems, Stuttgart, Germany
Christian Ebenbauer
Institute for Systems Theory and Automatic Control, University of Stuttgart, Germany
Abstract
Abstract. We propose and analyze a real-time model predictive control (MPC) scheme that utilizes stored data to improve its performance by learning the value function online with stability guarantees. For linear and nonlinear systems, a learning method is presented that makes use of basic analytic properties of the cost function and is proven to learn the MPC control law and the value function on the limit set of the closed-loop state trajectory. The main idea is to generate a smart warm start based on historical data that improves future data points and thus future warm starts. We show that these warm starts are asymptotically exact and converge to the solution of the MPC optimization problem. Thereby, the suboptimality of the applied control input resulting from the real-time requirements vanishes over time. Numerical show that existing real-time MPC schemes can be improved by storing optimizatising the proposed ng scheme.
††footnotetext: Email addresses: [email protected] (Lukas Schwenkel), [email protected] (Meriem Gharbi), [email protected] (Sebastian Trimpe), [email protected] (Christian Ebenbauer). The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Lukas Schwenkel. This work was supported in part by the Cyber Valley Initiative and the Max Planck Society.
1 Introduction
Model predictive control (MPC) is a control strategy that solves at each sampling instant a finite horizon open-loop optimal control problem (see [33]). As a consequence, at every sampling instant an input sequence and its resulting state trajectory are computed for the whole prediction horizon, thus continuously generating large amounts of optimization data. However, only the first portion of the computed input sequence is applied to the system and typically the remaining part is not stored. This is wasteful and contradicts our human intuition to memorize our decisions when we are facing recurring problems or tasks in order to successively improve them. Motivated by this, the key question addressed in this article is how to leverage on optimization data by introducing memory and online learning in real-time MPC algorithms in order to improve their performance.
In this work, we will store the previously computed input sequences and utilize them to learn the solution of the MPC optimization problem online, see Fig. 1. The proposed approach specifically targets real-time MPC where, due to a lack of computation time, the MPC optimization problem cannot be solved exactly and thus, suboptimal solutions are applied to the system. Hence, there is a need to learn the optimal solution.
The proposed online learning is based on the idea to store optimization data and leverage it for subsequent optimizations: whenever a new input sequence has to be computed close to a point that was visited before, the optimization iteration is initialized with the previously computed input sequence. This way, we can use the available computation time to refine this suboptimal input sequence instead of starting the optimization from scratch. At recurrent points, we expect to approach the optimal solution since we improve a suboptimal one over and over again. By combining these data-based improvements with the theory of real-time MPC, we will be able to show that online learning is feasible for recurrent points with inherent stability guarantees.
**Contributions ** The main contributions of this work are as follows. We present a real-time MPC scheme based on [19] that allows for online learning to improve control performance and that comes with inherent stability guarantees independent of the chosen learning method. Further, we introduce novel learning methods tailored to the real-time MPC scheme for both linear and nonlinear systems. The learning methods exploit analytic properties of the cost function, namely convexity in the linear case and Lipschitz continuity in the nonlinear case, in order to upper bound the value function. This bound is shown to converge to the value function on the -limit set of the closed-loop state trajectory. Thereby, the input applied to the system converges to the optimal feedback policy. The result is an online method that learns the value function and the optimal MPC law asymptotically over the -limit set. Moreover, we present several examples, which illustrate and confirm the results and are used to discuss practical issues and how to circumvent them. In particular, we demonstrate how learning and applying the learned control law can be parallelized and executed on different time scales, whereby learning itself does not have to be executed in real time.
**Related work ** This work combines real-time MPC with data-based approaches, both of which being active areas of research. The proposed online learning approach builds on top of a real-time MPC scheme and especially its stability properties. In real-time MPC, there are several results on stability, which all take the suboptimality of the applied input sequence explicitly into account. The work [36] seeks for a feasible solution without executing any optimization steps, which makes this method unsuitable as basis for learning the optimal solution from its data. The works [2], [5], and [28] optimize the input sequence until a certain accuracy is achieved. Hence, to improve their performance by including learning, we would need to adjust this accuracy according to the learning progress, which is rather difficult to do. That is why we assume a method for the proposed learning approach that optimizes until a certain computation time is reached like those in [14], [19], and [41]. In particular, the real-time MPC scheme presented in [19] will be the basis for the online learning method we develop herein. In [19], the cost function is established as a Lyapunov function for the coupled system optimizer dynamic, which allows for a straightforward extension of the scheme to learning while preserving stability.
Learning approaches in MPC are becoming increasingly popular. Different ways for incorporating data in MPC have been suggested in literature and can be summed up in two main categories: (i) system input-output data to learn the model; and (ii) sampling data of the MPC policy to approximate it with an explicit control law. Model learning (i) is an important topic of current research since the accuracy of the model has significant impact on the control performance. The model can be learned offline, as for example in [10], [24], [26], [31], and [39], but also online, [3], [8], [13], [29], or first learned offline and then refined online [20], [27]. Nevertheless, only few of these works, namely [3], [10], [13], and [27], establish stability results for their methods. The learning of an explicit MPC control law (ii) is done offline in order to substitute the online optimization in MPC with an online evaluation of the explicit control law. The majority of works in this direction employ neural networks as approximate controllers [1], [12], [16], [21], [22], [30], [32] and [42], while support vector machines are used in [11], and the learning problem is formulated as quadratically constrained quadratic program in [15]. Out of these works, only [15] and [21] can guarantee stability. Since the evaluation of e.g. a neural network is typically much faster than solving the MPC optimization problem, these methods are suited for real-time applications. In contrast to our work, these methods learn the control law offline. This implies that they have a fixed approximation error and cannot improve online, while our method approaches optimality by learning online. By their design, offline methods cannot adapt to changes and need training data beforehand, while our method can be started without prior data.
There are few other works on learning in MPC that do not fit in the two categories. One of them is [34], in which past inputs and the past state trajectory are used for iterative tasks to learn a terminal set and terminal cost of a finite horizon MPC controller such that infinite horizon performance is optimized while stability is ensured. In [23], a neural network is trained offline to identify active constraints in order to warm start an active set algorithm in embedded MPC while guaranteeing online stability. Although this work also considers learning warm starts, it is conceptually different from ours, since it learns offline and does not improve performance by decreasing the cost.
This article proposes a novel way of leveraging data in MPC, which does not match any of the aforementioned categories and works. Here, we use internal data of the real-time MPC algorithm, specifically predicted state and input sequences. With this data, we improve the suboptimal real-time controller over time by providing better warm start solutions based on past optimization solutions. To the best of the authors’ knowledge, this is the first work on learning in real-time MPC that utilizes this internal data.
**Outline ** This article continues with introducing preliminary results, the problem formulation, and the core idea of this work in Section 2. Sections 3 and 4 then contain the main results of this article: the online learning method is developed for linear and nonlinear MPC, respectively, and their properties are analyzed in theory and illustrated through numerical examples. Section 5 underlines the relevance of the results by an application example. Our conclusions, stated in Section 6, complete the article.
**Notation ** Throughout the article, we will use , and to denote the sets of non-negative real, strictly positive real, and strictly positive natural numbers. The set of positive definite matrices of dimension is denoted with . For , we define . A continuous function with is a -function if it is strictly increasing and as . The interior of a set is denoted and the convex hull is . For a function the epigraph is
[TABLE]
2 Problem setting: Learning in real-time MPC
In this section, we set the stage for developing the main results of this article. We first introduce the real-time MPC framework [19], in which our learning method will be embedded. We then present the main idea of how to leverage data in this setting, make the learning problem precise, and provide preliminary stability results.
2.1 Real-time MPC framework
We consider the control of time-invariant discrete-time systems of the form
[TABLE]
where denotes the state vector, the control input vector, and some external signal vector, all at time instant . We will consider both linear and nonlinear system dynamics . In MPC, a typical control task is regulation of the system state to the origin while minimizing some cost function. This is handled by solving at each time instant an open-loop optimal control problem of the form
[TABLE]
where denotes the stacked control inputs over the finite prediction horizon , and and denote the stage and terminal costs, respectively. Since the external signal is unknown over the prediction horizon, we assume and use the nominal system to predict . We will call the value function and the cost function. We assume that state and input constraints are taken into account by a barrier or penalty term in the stage and terminal cost.
Usually in MPC, it is assumed that the system dynamics and the optimization algorithm evolve on different time scales such that the convergence time of the algorithm can be neglected and an instantaneous solution to (3) is available. In real-time iteration schemes, this often unrealistic assumption is dropped and the optimization algorithm is interpreted as a system with its own dynamics. These are coupled with (2) and are given by
[TABLE]
where represents the optimization algorithm and is a projection matrix selecting the input to be actually applied. We divide the optimization algorithm into a warm start operator and an optimization update operator , which is iterated times
[TABLE]
In this way, an input for the nominal next state is determined.
The warm start operator is typically generated by a time shift of the previous input sequence appended with some local control law [19]. Throughout the article, we will therefore refer to as temporal warm start operator. If the closed loop is stable for any number of optimization algorithm iterations , the real-time scheme is also called anytime MPC, [5], [19]. A generic result on nominal closed-loop stability of an anytime iteration scheme is proven in [19] and stated in the following theorem.
Theorem 1**.**
Consider the real-time MPC scheme introduced in (2)–(5) with and assume that the stage and terminal costs and are positive definite. Further assume the existence of such that
[TABLE]
and assume that and fulfill
[TABLE]
for all , where is a continuous function with . Then, for any sequence , the origin is globally asymptotically stable.
This theorem is stated in [19] for linear systems using a relaxed barrier function formalism to ensure the assumptions. For nonlinear systems, this generic result still applies, however, it is much more challenging to ensure (6) and (7) in nonlinear MPC. In [14], a weaker stability result for a slightly different nonlinear real-time MPC framework is presented, yet without satisfying the assumptions (6) and (7).
2.2 Problem formulation and main idea
When the system dynamics and the optimization algorithm operate on similar time scales, there is typically only time for a few optimization iterations until the system requires the next input, i.e. is small. Hence the input might be far from optimality resulting in unsatisfying controller performance. To solve this issue, we will leverage the data of all the previously computed input sequences, which are usually discarded (except the last input sequence generating a temporal warm start). We store all past input sequences, and if the system state arrives close to a point in the state space that has been visited before, we use this previously calculated input sequence to generate a warm start solution. The main challenges of this approach are to make ‘close to a point’ mathematically precise and to prove convergence. Hence, we have to choose which stored input sequence (or combinations of multiple ones) to take as warm start solution at a given location such that the learning scheme asymptotically recovers the optimal policy at recurrent points of the closed loop trajectory.
In more detail, we will denote the warm start generated from the optimization data by and call it spatial to distinguish it from the temporal one. The stored data includes: the input sequence for each time , the point in state space at which it was calculated , and the cost function value that was achieved ; that is
[TABLE]
and we denote an approximation of based on the collected data by .
In a nominal MPC stabilization problem, we cannot expect that the system state repeatedly arrives at the same points in the state space except at the origin, where the optimal input is trivial. In a real-world scenario, however, there are disturbances, periodic operation conditions, set point changes, or reference signals such that more points can be visited several times. Thus, learning the optimal control at these points is a meaningful task. We generically model such situations with the signal , which influences the shape of the -limit set of the sequence of points for which an input is computed,
[TABLE]
A point in is called limit or recurrent point and is reached infinitely often arbitrarily closely. Only at such points can we expect learning the optimal policy to be possible because the optimization iteration usually converges asymptotically to the optimum and thus needs an infinite number of iterations in general.
With this, we can now make the objective of this work precise. We aim to design the spatial warm start operator such that it will converge to the optimal policy for all as , i.e.
[TABLE]
Hence, we want to learn the value function on and the corresponding optimal control input.
While learning the optimal control, we do not want to jeopardize stability. Hence, we need to make sure that the warm start solution that is used to initialize the optimization iteration satisfies (7a). We can achieve stability by exploiting the temporal warm start (which is shown to be stabilizing) and by only applying the spatial warm start when it yields a lower cost. That is, we replace (5a) with the new warm start operator
[TABLE]
Through this intuitive approach, the resulting online learning scheme is inherently stable as stated in the following corollary.
Corollary 2**.**
Consider the real-time MPC scheme introduced in (2)–(5), (8), (11) with and assume (6), (7). Then the origin is globally asymptotically stable for all spatial warm start operators .
Remark 3**.**
The stability result considers the nominal system with since stability of the origin can not be achieved with a non-vanishing . The learning problem, however, is trivial for , thus, we require for some to render . Nevertheless, nominal stability is meaningful and essential for the learning task.
- •
As a first example, consider a scenario as typically studied in iterative learning control, where, after some finite time period, the system is reset to a new initial state. Let the set of initial states be . Then, if we model , , for a sequence of reset times and for , this would correspond to an iterative learning scenario for multiple initial states in , where the goal is to steer close to the origin. While executing an iteration it is , hence, global asymptotic stability of the nominal closed loop system is essential to guarantee that the controller does the right thing (not growing unbounded and approaching the origin). This iterative learning interpretation is also discussed in the unicycle example in Section 4 of the paper.
- •
As a second example, let be Gaussian noise. Then would render the set of recurrent points nonempty (presumably the whole state space). Nevertheless, the stability result still makes sense in terms of expectations. Since a complex stochastic analysis is beyond the scope of this work, we do not explicitly consider this case. Still, we can think of as a deterministic noise signal or a realization of a stochastic process. In general, feedback stabilization is essentially only meaningful in the presence of disturbances. Often, in nominal MPC, disturbances are associated to nonzero initial states.
- •
As a third example, consider a tracking of references or set points. These are often based on a nominal stabilizing controller, which is used to stabilize different set points. When the set point changes, this can be interpreted as a new initial condition and modeled by the signal as in the first example. This scenario is treated in an Application example in Section 5.
In general, we assume that is of such nature that the closed-loop trajectory stays bounded. This ensures that is nonempty due to the Bolzano-Weierstrass theorem and that is approached by as .
Overall, the proposed MPC iteration scheme measures at every time instant the current system state, generates the warm starts, chooses the better one, executes the optimization iteration, and stores its data before it finally applies the first portion of the input sequence to the system. A pseudo-code description of this procedure is given in Algorithm 1. The main focus of this article is line and of the algorithm. More specifically, we will design in Section 3 and 4 spatial warm start operators for linear and nonlinear MPC, respectively. Throughout the article, we assume that a temporal warm start satisfying the conditions in Theorem 1 is given.
3 Leveraging data in real-time linear MPC
In this section, we assume (2) to be linear
[TABLE]
with , the stage and terminal cost are quadratic, and the polytopic state and input constraints are incorporated in the costs via relaxed logarithmic barrier functions , (see [17] or [18] for definition). Under these assumptions, and in (3) become
[TABLE]
with the design parameters , and , as well as resulting from , , and the constraints (see [18] for details). In [19], a temporal warm start and an optimization update operator are defined such that the conditions (6) and (7) of Theorem 1 are satisfied. Moreover, it has been shown in [19] that constraint satisfaction can be guaranteed with a finite barrier parameter if the relaxation of the logarithmic barrier functions and the initialization of the scheme are chosen suitably. We will refer to this scheme as linear anytime MPC. For this setting, we are going to present a method to learn the optimal policy and the value function in the sense of (10), analyze its convergence properties, and demonstrate the method in an example.
3.1 A spatial warm start based on convexity
The main idea for the spatial warm start generation is to exploit convexity of the cost function and use convex combinations of past data points. More formally, we define the spatial warm start at by
[TABLE]
Hereby, we dropped the time dependency of the data from (8) for the sake of clarity. Furthermore, denotes the point at which (3) is to be solved and denotes a convex approximation of the value function based on the data . A graphical illustration of (15) is given in Fig. 2. Notice that (15) is only feasible for and thus the domain of is . For , we cannot compute a spatial warm start in this fashion and have to take the temporal one.
Geometrically, is a piecewise affine function and partitions into -simplices, on which it is affine. We will refer to this partition as triangulation of inspired by topology. For computing the spatial warm start at , one has to find the -simplex that contains , construct the convex combination of its extreme points that leads to , and combine the inputs corresponding to the extreme points in the same way.
In the remainder of this section we will show that this spatial warm start converges to the optimal policy at recurrent points (9) by showing that upper bounds if is convex (Lemma 4), that is convex (Lemma 5), and that converge to the value function (Theorem 6).
Lemma 4**.**
Let be convex with respect to . Further, let be any sequence, and let be the corresponding sequence of data sets defined in (8a). Moreover, let and be defined in (15) and let . Then (i) and (ii) is convex.
For a proof, see Appendix A. In order to apply Lemma 4, convexity of is needed, which is established for the linear anytime MPC by the following lemma.
Lemma 5**.**
The cost function of linear anytime MPC, i.e. (3) with (12), (13), (14) is convex in .
The proof is given in Appendix A. In view of Lemma 4 and Lemma 5, we can show the following convergence theorem, which establishes that we can indeed asymptotically learn the value function and the MPC law in .
Theorem 6**.**
Consider the system (12) controlled by (3)–(8), (11), (13)–(15) and from (9). Then for all with for some it holds
[TABLE]
Proof.
Step 1) converges: For , it holds and hence is defined on for all . Further, since and the minimum over a larger set can only be smaller, we can conclude that for all , . In order to apply Lemma 4, has to be convex, which is shown in Lemma 5. Due to (i) in Lemma 4 and the optimality of we have
[TABLE]
for all , . Hence, is nonincreasing and bounded from below on and thus converges pointwise on . Due to (ii) in Lemma 4, this is a sequence of convex functions , which has a convex limit . Every convex function is locally Lipschitz continuous on open subsets (see e.g. [7]), hence is locally Lipschitz continuous on .
Step 2) converges to : Let and let with be compact, then there exists a sequence such that \xi_{i}=f\big{(}x(k_{i}),\Pi_{0}U(k_{i})\big{)}\to x,\ \xi_{i}\in\mathcal{C}. Let be the Lipschitz constant of on and for , respectively, which are finite since is compact and the functions are locally Lipschitz. Hence, as is a real valued converging sequence and is therefore upper bounded by some . It follows
[TABLE]
Thus the following chain of inequalities
[TABLE]
is a chain of equalities in the limit, where the first inequality holds due to (8) and (15) and the other inequalities due to (7b), (7b), (11) and Lemma 4 (i) in this specific order. Therefore as the decrease of the optimizer update operator (7b) \gamma\left(\Phi^{0}\big{(}U(k_{i}),x(k_{i}),\mathcal{D}(k_{i})\big{)},\xi_{i}\right)\to 0, which is due to only possible if
[TABLE]
This is due to (18) equivalent to J_{N}^{\mathrm{a}}\big{(}\xi_{i},\mathcal{D}(k_{i})\big{)}-J_{N}^{*}(\xi_{i})\to 0. Since is convex, it is also locally Lipschitz continuous on and we assume that is a Lipschitz constant of on (if not choose large enough). Hence, it follows
[TABLE]
Thus, we showed that which implies due to (17) the desired result (16). ∎
Remark 7**.**
The assumption for some is rather technical. Still, it is possible that a recurrent point never lies in the interior of as . A straightforward solution to this problem is to initialize the data set with some points , . This solves the issue for all even for arbitrary . Hence, a good choice for the points are the extreme points of the polytopic feasible set. By adding the point to the prior knowledge about the optimal input at the origin can be included.
3.2 Algorithm and implementation
For online learning, it is crucial to solve (15) in an efficient way, since a warm start solution must be provided within one sampling period. The convex hull computation gets increasingly burdensome with more data points and higher state space dimensions. Therefore, we rely on an incremental convex hull algorithm and split the computation in two parts: the convex hull update and the spatial warm start generation. Detailed implementations of both parts can be found in the Appendix B. The update can be performed on a different time scale and does not need to be executed in real time, as is demonstrated in Section 5. Notice that the number of data points that is needed to get a good approximation of the MPC control law scales with the dimension of the -limit set , which can be significantly lower than the dimension of the state space. Moreover, the computational complexity of the learning algorithm is independent of the length of the prediction horizon. Finally, we emphasize that stability is always guaranteed. Hence, storage and processing restrictions only limit the performance improvement but do not jeopardize the stability of the controller.
3.3 Example: Double Integrator
We demonstrate the learning scheme presented in this section for a numerical example. We consider the discrete-time double integrator system from [19] of the form
[TABLE]
with sampling time . Input and state constraints are inherited from [19] and , and likewise the parameters of the cost function and of the backtracking line search optimization with gradient descent search direction. The external disturbance is quasiperiodic .
For two optimization iterations and a randomly chosen initial condition at , we have simulated the closed-loop behavior of the presented MPC scheme for time steps. In the left subplot of Fig. 3, the suboptimality of both warm starts, i.e. the difference of their costs to the value function, is depicted over time. As was to be expected, the spatial warm start performs first poorly since too few data is available, but then improves over time while more and more data is collected until it significantly outperforms the temporal warm start. Further, we can see that the learning rate is limited by the optimization update operator since two gradient descent iterations per time step result in a slow rate of convergence.
In order to highlight the improvements offered by the proposed learning scheme, we compare it to the original anytime MPC [19], i.e. to Algorithm 1 without lines 4 and 10. To this end, the state trajectories of both methods are depicted in the middle subplot of Fig. 3 in the phase portrait until time together with the limit cycle of the optimally controlled system, where the optimization problem (3) is solved with Matlab’s fminunc. As we can see, learning significantly improves the performance and leads to much better constraint satisfaction. First, both trajectories are identical since the spatial warm start is never used, but as more data is collected the trajectories deviate and the one from the proposed scheme approaches the optimal one. In particular, the trajectories deviate as soon as they enter the interior of , where for the proposed scheme the convergence result of Theorem 6 holds.
Thus far, we have not considered the fact that generating the spatial warm start consumes computation time that might be better invested in doing more optimization iterations with the temporal warm start. To investigate this, we run the simulation for different numbers of optimization iterations and for ten different initial conditions. For each , we average the computation times and the costs to obtain a point in the right subplot of of Fig. 3. We can see that the proposed learning scheme is way closer to the optimal performance than the original anytime MPC without learning.
4 Leveraging data in real-time nonlinear MPC
In the general case of the nonlinear MPC scheme (2)–(8), (11), we cannot expect that the cost function is convex. Therefore, we propose in this section a different learning method that does not depend on convexity, but instead exploits Lipschitz continuity.
4.1 A spatial warm start based on Lipschitz continuity
The idea for the spatial warm start generation for is to find an input sequence from the data such that is close to and is small. This directly leads to the following spatial warm start rule
[TABLE]
where is a Lipschitz parameter that might depend on , and is an approximation of the value function based on the data . An illustration of the spatial warm start (20) is given in Fig. 4. If we assume to be Lipschitz continuous with constant , then upper bounds the value function and the spatial warm start as shown in the following Lemma. Similar ideas to approximate an unknown function based on Lipschitz continuity from sampling data have been proposed in [4], [9], and [40].
Assumption 8**.**
Assume that is Lipschitz continuous with constant , i.e. for all ,
[TABLE]
Lemma 9**.**
Let Assumption 8 hold, then
[TABLE]
and is also Lipschitz continuous with Lipschitz constant .
For a proof, see Appendix A. With Lemma 9, we obtain a convergence result similar to Theorem 6.
Theorem 10**.**
Consider the nonlinear MPC scheme presented in (2)–(9), (11) with the spatial warm start operator (20), further let Assumption 8 hold, and let be upper bounded. Then for all
[TABLE]
Proof.
Step 1) converges: In view of Lemma 9, we have Thus by showing for we will prove the theorem. We also see from this inequality that is bounded from below by . Further is decreasing since the minimum over a larger set is smaller or equal and thus must converge to some value.
Step 2) converges to : This step is analogous to step 2) in the proof of Theorem 6 and thus moved to the Appendix A. ∎
Remark 11**.**
Even tough Assumption 8 might be restrictive, if we assume that does not grow unbounded, but stays in some compact region , then the Lipschitz constants do not need to apply globally, but only on in order to obtain the same result. If further stays in some compact set , then the assumption is also satisfied with if is continuous in .
Remark 12**.**
It can be quite challenging to satisfy (6) and (7) for general nonlinear MPC algorithms. As often done in practice, one can still implement the learning scheme with spatial warm start (20). The controller performance will improve as long as the optimization iteration does reduce the costs, even if (6) and (7) do not hold. However, a prior stability guarantee depends on the nonlinear iteration scheme.
Remark 13**.**
If in a specific setup the cost function is nevertheless known to be convex, then the learning scheme of Section 3 can be used and probably leads to a better upper bound ; for example, see Fig. 4 where the convex hull over the data points would result in a lower upper bound inside . On the other hand, if one faces a linear MPC problem with a nonconvex cost function , due to nonconvex stage or terminal costs, then the learning scheme from this section can be applied.
4.2 Example: Unicycle
In this section, we implement the nonlinear real-time MPC learning scheme for the unicycle model
[TABLE]
with state and input vectors , respectively, as well as the sampling time . The control task is to drive the unicycle to facing into positive direction, i.e. for some while keeping . Therefore, the stage cost in (3) is chosen as to ensure positive definiteness with respect to , , prevent from getting too steep, while ensuring differentiability at the origin, and let shoot up as . The terminal cost is set to . For this example, the learning scheme from Section 3 cannot be applied since the resulting cost function is not convex. However, the costs turn out to be Lipschitz continuous in with Lipschitz constant computed in Appendix C.
We choose as the temporal warm start and the same optimization operator (5b) as for the linear example in Section 3.3, which consists of the gradient descent search direction with backtracking line search. Further, we use a constant number of optimization iterations . We consider iterative learning in a repetitive and fast process, which is a common task in iterative learning control (see [38]). The system is repeatedly started from the same initial condition , where each run takes time steps. Hence, throughout a run, is zero and after every time steps it sets back to .
Although and might not satisfy (7), the closed loop performance is still satisfying as we can see in the simulation results depicted in Fig. 5. In the plot, we can see that the optimal behavior is learned within a few runs. After five runs the unicycle has learned to start driving backwards, and after runs it has learned to approach the destination driving forwards and is almost indistinguishable from the optimal trajectory.
5 Application: Servomechanism
In the two previous examples, we have not considered the actual time available for the computation. In fact, the learning update may not always be computable within one sampling period , especially when considering systems with higher dimensions and fast dynamics. To circumvent this issue, we will outsource learning onto a server that communicates with the controller. In particular we assume, that the convex hull update is executed in parallel to the controller and whenever it is completed, one of the latest data points since the last one added is added next. Further, is not predefined but the optimization iteration is stopped as soon as the computation time is exhausted. This does not affect our stability and convergence results as long as at least one optimization iteration is performed since Theorem 1 and Theorem 6 apply for any sequence .
In this section, we will implement the learning scheme for a reference tracking task on a simulation model of a servomechanism consisting of a DC-motor, a gear-box, an elastic shaft and a load. This system has already served as an example in [6] from where we have inherited the linear model
[TABLE]
and the parameters, which can be found in [6]. The state vector consists of the load angle and the motor angle as well as their time derivatives and the input corresponds to the DC voltage. The reference tracking task of this system is also included in the collection of benchmark MPC problems given in [25]. The model is discretized using zero-order hold on the input and the sampling time . Further the system has to satisfy the state and input constraints , where can be found in [6]. To make the problem more difficult and to obtain a polytopic feasible set, we added the constraints and to the original problem.
The goal is that the angle of the load follows the periodic step reference signal with period length . To achieve this we see that is a steady state of (25) that produces exactly the constant reference for arbitrary . Therefore we will define the cost function with respect to and except for the part that incorporates the constraints. That is, (13) and (14) become
[TABLE]
with , and . Further, the relaxed logarithmic barrier function and are chosen according to [18]. For the optimization update operator, we use the gradient descent backtracking line search as described in [19] with parameters , and . The data set will be initialized with the extreme points of the feasible set as described in Remark 7.
The simulation results of the closed loop over time steps ( periods of the reference signal) are shown in Fig. 6. In the left subplot, we can see that the reference tracking performance improves from period to period. This is also visible in the middle subplot, where the mean reference tracking error per period of the learning scheme decreases and converges to the one of the optimally controlled system. The original anytime MPC scheme without a spatial warm start and thus without learning does not improve over time and the mean reference tracking error stays constant. The scheme with learning is in the first period worse than without learning, because the time to compute the spatial warm start cannot be used for further optimization iterations. In average there is a difference of iterations in between the original anytime MPC ( iterations) and the scheme with learning (). However, already in the second period the scheme with learning achieves better tracking performance and improves much more over time. After the th period the controller has learned to exploit the full range of feasible control inputs as can be seen in the right subplot of Fig. 6.
There is a computational aspect that we have not discussed yet. The required memory capacity and the computation time for updating the convex hull and for the spatial warm start generation increase with the size of such that the number of points in the convex hull that can be handled is limited. Hence, to exploit the available capacities we add only data points to that lead to a significant improvement from to , here . In this example, out of data points were skipped because the improvement was too small and because it was not possible to update the convex hull within one sampling time, leading to a total of data points added to the convex hull.
In summary, the example clearly shows the benefit of the proposed memory-based MPC scheme in improving performance, while maintaining guarantees on stability at all times and being suitable for online implementation.
6 Conclusion and outlook
We presented and analyzed an online learning scheme for the value function and optimal policy in a real-time MPC framework for both linear and nonlinear systems. Even if only one optimization iteration is performed between two consecutive sampling instants, the MPC still approaches optimality in the long run through the learning scheme. Furthermore, stability of the real-time MPC scheme is retained independent of the type of learning approach. These findings were illustrated in different linear and nonlinear numerical examples. Moreover, we discussed how learning and applying the learned input can be decoupled and parallelized.
An interesting topic for future study is to further improve the presented learning methods by more efficient implementations of the data structure and the algorithms to generate the spatial warm start. In particular, while we have not provided a concrete algorithm for the nonlinear Lipschitz based learning scheme, the ideas of [4] might be helpful for arriving at an efficient implementation. The optimization iteration also leaves room for improvement by exploiting underlying properties or structure. The proposed method can be extended in several ways. For example the online learning scheme could probably be modified to handle time-varying systems or costs by including some mechanisms to forget expired data points. Since our stability result does not depend on the learning method, this opens the possibility to apply classical machine learning function approximators for online learning. A comparison to a function approximation based on offline data to generate a static spatial warm start would also be interesting to investigate.
Appendix A Proofs
Proof of Lemma 4.
(i) Since is convex and the spatial warm start defined in (15) is a convex combination of data points , we have
[TABLE]
and further
[TABLE]
(ii) We slightly reformulate (15)
[TABLE]
and show that the epigraph of is convex:
[TABLE]
where denotes the Minkowski sum. Since both and are convex, the epigraph is also. ∎
Proof of Lemma 5.
It is shown in [19] that the cost function can be written under these assumptions as
[TABLE]
where and can be computed from , , , and and account for the terms , and in and . Furthermore, the relaxed logarithmic barrier function is shown in [19] to be a weighted sum over convex functions whose argument depends affinely on . Therefore is convex in . The quadratic part of is also convex in since it is quadratic and positive definite as proven in [19]. ∎
Proof of Lemma 9.
In view of (21) we have
[TABLE]
and further
[TABLE]
To prove the second statement, we use (20a) with and (20b) with to see
[TABLE]
which implies
[TABLE]
Since and are interchangeable, it follows
[TABLE]
∎
Step 2) of the proof of Theorem 10.
Step 2) converges to : For , there exists a sequence such that . It follows
[TABLE]
Thus the following chain of inequalities
[TABLE]
is a chain of equalities in the limit, where the first inequality holds due to (8) and (20) and the other inequalities due to (7b), (7b), (11) and Lemma 9 in this specific order. Therefore as the decrease of the optimizer update operator (7b)
[TABLE]
which is only possible if
[TABLE]
This is due to (28) equivalent to
[TABLE]
is Lipschitz continuous and without loss of generality we can assume that is the Lipschitz constant of (if not choose large enough). Hence it follows
[TABLE]
which is with (17) what we wanted. ∎
Appendix B Convex hull algorithm
The purpose of the convex hull algorithm presented in this section is to solve (15). As discussed in Section 3, gives rise to a triangulation of and the spatial warm start at can be computed by seeking for the -simplex in this triangulation that contains . Therefore in a first step, this triangulation of must be computed by a convex hull algorithm that determines the facets of the lower boundary of the convex hull . The lower boundary of is the graph of , see Fig. 2. The extreme points of each of these facets are, after projection onto , the extreme points of an -simplex of the triangulation of . As second step, the spatial warm start can be computed from this convex hull by searching the -simplex that contains and combine the inputs corresponding to its extreme points to obtain a warm start solution. Hence, the procedure naturally decomposes into two steps:
- i)
Generate spatial warm start from the convex hull. 2. ii)
Update convex hull with a new data point.
While i) corresponds to line 4 of Algorithm 1, ii) corresponds to line 14 where collecting data is meant as fitting the new point into the data structure. Notice that this decomposition divides the algorithm into learning ii) and applying the learned spatial warm start i) – a property that will be exploited in Section 5 for parallelization.
It is efficient to use an incremental convex hull algorithm that updates the existing convex hull object when a new data point arrives instead of starting the calculation from scratch. Further, the fact that for a point that is added to the convex hull, a spatial warm start was already generated and it has therefore been located in the triangulation can be leveraged for efficiency. The problem of searching for the -simplex in the triangulation that contains is similar to the location problem in explicit MPC and hence these algorithms (see e.g. [37]) can be used. The implementation we used, however, is based on a directed search by tracking neighboring -simplices starting from an initial guess down to .
We present a way to implement the two main routines spatial warm start generation generateSW and convex hull update updateCH. First, we need to define an object that represents the collected data and all we need to know about the convex hull, we call this the convex hull object and denote it with . The properties of the convex hull object are listed in Table 1. References to a property of the convex hull object are denoted with a dot, e.g. . In addition to these listed properties we will use and to denote the combined lists of the data points, so the th element of is for example a triple containing the th elements of , and .
The way the data is generated allows for efficient tailored implementations of the two routines, if the neighbors of each facet are known.
- i)
The facet that contains can be found by starting from some initial facet and tracking neighboring facets towards , for example by following as increases from [math] to , where is some point in the facet. A good initial guess can be given by the facet in which the last point was located, if the system state does not change too fast from one time instant to the next. If this point was added to the convex hull, then this facet does not exist anymore. In this case any facet attached to this newly added point can be taken as initial guess. Another more advanced initial guess can be provided by saving for every point in the convex hull its subsequent point. If the subsequent point was not added, then an extreme point of the facet where the subsequent point was located is stored. This information provides for each point in the convex hull a good guess of its subsequent point and this can be used as basis for the initial guess. Thereby, the system dynamic gets also included implicitly in the prediction from the initial guess. 2. ii)
A data point that needs to be added to the convex hull is already located in the triangulation of by the previous warm start generation. Hence, by starting from this facet and checking the neighboring facets successively, all facets that need to be updated can be found.
The information about neighboring facets are stored in the graph , which needs to be updated whenever the convex hull is.
B.1 Generating the spatial warm start solution from the convex hull object
Let be the point where we want to generate the warm start solution and the convex hull object. The main part in the warm start generation is the search for the facet that contains . As discussed above, it is easy to provide a good guess for a point that is close to . Assume the index of in is given, then the routine takes the inputs , and . As outputs it makes sense to not only return the spatial warm start but also the index of the facet of that contains and the index of the next initial guess. If is not contained in any facet, i.e. if , then let be negative, in fact let it be where is the index of a facet of that excludes by means of lies outside the supporting halfspace to at this facet . Summing, the routine can be invoked by . The basic idea in finding the facet that contains is to start the search at and track the facets along the line , until , then we have reached the facet that contains . If we sometime step out of , then and we cannot provide a spatial warm start as described in (15) at , but we can at the point where we stepped out of and use this as spatial warm start solution for . The procedure can be described by the following steps, where the numbers indicate to which lines of pseudo-code in Algorithm 2 the step corresponds to
- a)
(1:6:) Get the initial point and a list of indices of the facets in that are attached to it. 2. b)
(7:16:) Find the facet in which the vector points starting from . 3. c)
(17:27:) If no such facet exists, then must point out of , must lie on the boundary and hence . So there must exist at least one facet of attached to that excludes . Find the index of this facet, return its negative as and return the input used at as spatial warm start. Also return the index of the extreme point of this facet that is closest to increased by one as next initial guess . 4. d)
(28:34:) Else track the facets along until or no further facet exists along the line because we stepped out of . 5. e)
(35:41:) If we stepped out before , then find the index of facet of where we stepped out and return its negative as , further return the index of the extreme point of this that is closest to increased by one as next initial guess and the convex combination of the inputs of this facet as spatial warm start . 6. f)
(42:46:) Else we have found index of the facet of that contains and return it. Also return the convex combination of the inputs at this facet as spatial warm start and the index of the extreme point of this that is closest to increased by one as next initial guess .
A detailed description of the steps in pseudo-code can be found in Algorithm 2, still there are needed some minor supporting routines that we will discuss now to complete the whole procedure of generating the warm start.
- •
checks if the vector starting from points inside the facet of with index , where has to be an extreme point of . The length of does not matter only the direction, i.e. need not lie inside for being true, it is enough if there exists such that lies inside the facet. Pseudo-code is given in Algorithm 3.
- •
is the most used subroutine and it calculates for a given set of points in a vector that is normal to the hyperplane going through all points in . This hyperplane cuts into two halfspaces and points into the halfspace that does not contain the given point , where must not lie on the this plane. A pseudo-code description of this subroutine is given in Algorithm 4 and uses the well-known QR-decomposition that decomposes an invertible matrix into an orthogonal matrix and an upper triangular matrix where all diagonal entries of are positive.
- •
gives out all facets of that share the extreme points specified in . can also be only a single facet or even empty, depending on how many facets exists that share the edge . Pseudo-code is given in Algorithm 5.
- •
takes an edge of such that lies on and it gives out such that lies on the edge of but with output input . In words it takes the edge and point where the line enters and computes the edge and point where it leaves . Pseudo-code is given in Algorithm 6.
- •
calculates the weights such that and . Therefore it must be verified that it is possible to convex combine the points in to . Pseudo-code is given in Algorithm 7.
B.2 Updating the convex hull
Second we consider updating the convex hull, therefore let be the convex hull object and the input sequence and point in state space we want to add. In addition we already know the index of the facet of that contains from the previous spatial warm start generation at exactly this point . This makes up the inputs of the routine and the output is of course the updated convex hull object , hence it can be invoked by . The basic idea of updating is that we start at the facet , which must be for sure updated since there is always an improvement in the optimization update operator as long as we have not reached exact optimality and even in that unlikely case the point we want to add lies directly on the facet and it makes no difference to add it and split the facet. Thus we always update the facet and the we start searching from for neighboring facets that need to be updated and if we found one, then we also search in its neighbors for facets to update. Since the convex hull was convex before and has to be convex afterwards the set of facets to update must be connected and thus we will find all facets to update by that procedure. A special case is if , i.e. , then we need to find all facets of that exclude and check that facets of attached to them if they must be updated. In this case we must also update which we do by the same procedure, starting from search for neighboring facets that must to be updated. In fact all facets of that must be updated are facets that exclude and the other way round, so we search them, update them and initialize with the facets of attached to them the search for updates in . In more detail the procedure can be described by the following steps, where the numbers indicate to which lines of pseudo-code in Algorithm 8 the step corresponds to
- a)
(1:5:) Append data point to data set . 2. b)
(6:24:) If then and we need to update first. We know that is one facet that needs to be updated so we remove it and check all neighboring facets if they must also be updated. If so, then we remove them and also add their edges to the list of edges whose neighboring facets need to be checked. If not then we take the edge between the facet that had been removed and the one that stays and add it together with the new point as new facet to . Whenever we remove a facet we also add it to the list of edges we need to check for updating . 3. c)
(25:28) If then we remove this facet from and initialize the list of edges we need to check with all edges of this facet. 4. d)
(29:42) Check edges as long as there are some whether their adjacent facets need to be updated or not, if so remove them and add their edges to the list, if not add the edge with the new point as new facet to . 5. e)
(43:45:) Update the center points of the convex hulls and return the updated convex hull object.
Whenever we talked about adding a facet to or removing it from we of course also have to update the graph . A detailed explanation on the subroutines adding and removing shall be given now.
- •
removes the facet from the and also removes its appearance in . When removing we still need to keep the slot in because if we would delete it, all facets coming after in the list would get their index decremented and we would need to change the whole . Hence it is easier to add to a list of open slots that can be overwritten, thereby keeping all other indices as they are. If it happens for that thereby an extreme point of is afterwards not attached to any facet at all, i.e. with adding the new data point ’moves’ from the boundary of the convex hull to its interior, then we store for the information that it has fallen out of the boundary when updating by setting . This is necessary for the initial guess for the warm start generation where it could happen that for some point the initial guess is , but if is not attached to any facet then we cannot start a search from there. Therefore we store that has been overwritten by and then we can start the search from a facet attached to instead.
- •
adds the facet with extreme points listed in to and updates the graph . If there is an open slot in then we can overwrite it, otherwise we append the new facet to the list .
Another point we have not discussed here is the initialization of the convex hull. The algorithms presented here only work if there already exists a convex hull object where there are enough points to build the convex hulls and they have to be correct obviously. A straight forward approach is to wait until the data points are available and initialize the convex hull with the first and only facet as well as the convex hull with all possible combinations of points out of these points.
Appendix C Computation of the Lipschitz constant from Section 4.2
The Lipschitz constant of the cost function can be computed as
[TABLE]
where denotes the Kronecker-delta that is if and [math] else, where , and are the Lipschitz constants of the stage cost , the terminal cost and the system dynamic , respectively and where is a projection matrix, that projects onto its st till th component. The Lipschitz constants can be upper estimated as
[TABLE]
by the following calculations
[TABLE]
and
[TABLE]
which has the eigenvalues and . Thus the maximum absolute eigenvalue is
[TABLE]
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] B. M. Åkesson and H. T. Toivonen. A neural network model predictive controller. Journal of Process Control , 16(9):937–946, 2006.
- 2[2] M. Alamir. From certification of algorithms to certified MPC: The missing links. In 5th IFAC Conference on Nonlinear Model Predictive Control , volume 48, pages 65–72, 2015.
- 3[3] A. Aswani, H. Gonzalez, S. S. Sastry, and C. Tomlin. Provably safe and robust learning-based model predictive control. Automatica , 49(5):1216–1226, 2013.
- 4[4] G. Beliakov. Interpolation of Lipschitz functions. Journal of Computational and Applied Mathematics , 196(1):20 – 44, 2006.
- 5[5] A. Bemporad, D. Bernardini, and P. Patrinos. A convex feasibility approach to anytime model predictive control. ar Xiv , 1502.07974, 2015.
- 6[6] A. Bemporad and E. Mosca. Fulfilling hard constraints in uncertain linear systems by reference managing. Automatica , 34(4):451–461, 1998.
- 7[7] J. Borwein and A. Lewis. Convex analysis and nonlinear optimization: Theory and examples . Springer, New York, 2006.
- 8[8] P. Bouffard, A. Aswani, and C. Tomlin. Learning-based model predictive control on a quadrotor: Onboard implementation and experimental results. IEEE Int. Conf. on Robotics and Automation , pages 279–284, 2012.
