Sample-Based Learning Model Predictive Control for Linear Uncertain Systems

Ugo Rosolia; Francesco Borrelli

arXiv:1904.06432·cs.SY·January 22, 2021


TL;DR

This paper introduces a sample-based Learning Model Predictive Control approach for uncertain linear systems, enabling safe exploration and performance improvement despite disturbances and constraints.

Contribution

It extends LMPC to uncertain systems by using noisy data to approximate safe sets and value functions, ensuring safety and robustness.

Findings

01. Successfully demonstrates safe state space exploration

02. Iterative performance improvement under uncertainty

03. Robust constraint satisfaction

Abstract

We present a sample-based Learning Model Predictive Controller (LMPC) for constrained uncertain linear systems subject to bounded additive disturbances. The proposed controller builds on earlier work on LMPC for deterministic systems. First, we introduce the design of the safe set and value function used to guarantee safety and performance improvement. Afterwards, we show how these quantities can be approximated using noisy historical data. The effectiveness of the proposed approach is demonstrated on a numerical example. We show that the proposed LMPC is able to safely explore the state space and to iteratively improve the worst-case closed-loop performance, while robustly satisfying state and input constraints.


Tables

Table I: Initial condition $x_{0}^{j}$ at each $j$th iteration.

$x_{0}^{1} = -\begin{bmatrix} 2.00 & 0 \end{bmatrix}^{\top}$	$x_{0}^{6} = -\begin{bmatrix} 9.90 & 0 \end{bmatrix}^{\top}$
$x_{0}^{2} = -\begin{bmatrix} 5.46 & 0 \end{bmatrix}^{\top}$	$x_{0}^{7} = -\begin{bmatrix} 9.90 & 0 \end{bmatrix}^{\top}$
$x_{0}^{3} = -\begin{bmatrix} 6.86 & 0 \end{bmatrix}^{\top}$	$x_{0}^{8} = -\begin{bmatrix} 9.90 & 0 \end{bmatrix}^{\top}$
$x_{0}^{4} = -\begin{bmatrix} 9.35 & 0 \end{bmatrix}^{\top}$	$x_{0}^{9} = -\begin{bmatrix} 9.90 & 0 \end{bmatrix}^{\top}$
$x_{0}^{5} = -\begin{bmatrix} 9.90 & 0 \end{bmatrix}^{\top}$	$x_{0}^{10} = -\begin{bmatrix} 9.90 & 0 \end{bmatrix}^{\top}$

Equations

x_{k+1}^{j} = A x_{k}^{j} + B u_{k}^{j} + w_{k}^{j}

x_{k} \in \mathcal{X} \text{ and } \pi^{j}(x_{k}^{j}) \in \mathcal{U}.

J_{\pi^{j}}^{j}(x_{0}^{j}) = \max_{w \in \mathcal{W}} \big[ h(x_{0}^{j}, \pi^{j}(x_{0}^{j})) + J_{\pi^{j}}^{j}(A x_{0}^{j} + B \pi^{j}(x_{0}^{j}) + w) \big].

J_{0 \to \infty}^{j,*}(x_{S}^{j}) = \min_{\pi^{j}(\cdot)}

x_{k+1}^{j} = A x_{k}^{j} + B \pi^{j}(x_{k}^{j}) + w_{k}^{j}

u_{k}^{j} = \pi^{j}(x_{k}^{j})

x_{k}^{j} \in \mathcal{X}, \ u_{k}^{j} \in \mathcal{U}

x_{0}^{j} = x_{S}^{j}

\forall w_{k}^{j} \in \mathcal{W}, \ k \in \{0, 1, \dots\}.

\pi^{j}(\cdot) : \mathcal{F}^{j} \subseteq \mathcal{X} \to \mathcal{U}

|x|_{\mathcal{O}} \triangleq \inf_{d \in \mathcal{O}} \|x - d\|_{1}.

\forall x_{k} \in \mathcal{O} \to (A + BK) x_{k} + w_{k} \in \mathcal{O}, \ \forall w_{k} \in \mathcal{W}

\alpha_{x}^{l}(|x|_{\mathcal{O}}) \leq h(x, 0)

\text{and } \alpha_{u}^{l}(|u|_{K\mathcal{O}}) \leq h(0, u) \leq \alpha_{x}^{u}(|u|_{K\mathcal{O}})

\mathcal{R}_{k+1}

\mathcal{SS}^{j} = \bigg\{ \bigcup_{k=0}^{\infty} \mathcal{R}_{k}(x_{0}^{j}) \bigg\} \bigcup \mathcal{O}.

\mathcal{CS}^{j} = \text{conv}\Bigg( \bigcup_{k=0}^{j} \mathcal{SS}^{k} \Bigg).

\forall x \in \mathcal{CS}^{j} \to A x + B \pi^{j}(x) + w \in \mathcal{CS}^{j}, \ \forall w \in \mathcal{W}

J_{\pi^{j}}^{j}(x_{0}^{j}) = \max_{w \in \mathcal{W}} \big[ h(x_{0}^{j}, \pi^{j}(x_{0}^{j})) + J_{\pi^{j}}^{j}(A x_{0}^{j} + B \pi^{j}(x_{0}^{j}) + w) \big],

L_{\pi^{j}}^{j}(x) = \begin{cases} \max_{w \in \mathcal{W}} \big[ h(x, \pi^{j}(x)) + L_{\pi^{j}}^{j}(x_{+}^{j}(w)) \big] & \text{if } x \in \mathcal{SS}^{j} \\ +\infty & \text{if } x \notin \mathcal{SS}^{j} \end{cases}

Q^{j}(x) = \min_{\mu} \{ \mu \mid (x, \mu) \in \text{conv}\big( \textstyle\bigcup_{k=0}^{j} \text{epi}(L_{\pi^{k}}^{k}) \big) \},

\min_{u \in \mathcal{U}} \max_{w \in \mathcal{W}} \big[ Q^{j}(Ax + Bu + w) + h(x, u) - Q^{j}(x) \big] \leq 0

Q^{j}(x) = \sum_{i=0}^{j} \sum_{k=0}^{K} \lambda_{k}^{i} L_{\pi^{i}}^{i}(x_{k}^{i}).

Q^{j}(x) \geq \max_{w \in \mathcal{W}} \Big[ h(x, u) + \sum_{i=0}^{j} \sum_{k=0}^{K} \lambda_{k}^{i} L_{\pi^{i}}^{i}(x_{k,+}^{i}(w)) \Big],

Q^{j}(x) \geq \max_{w \in \mathcal{W}} \Big[ h(x, u) + Q^{j}\Big( \sum_{i=0}^{j} \sum_{k=0}^{K} \lambda_{k}^{i} x_{k,+}^{i}(w) \Big) \Big] \geq \min_{u \in \mathcal{U}} \max_{w \in \mathcal{W}} \big[ h(x, u) + Q^{j}(Ax + Bu + w) \big].

J_{t \to t+N}^{\text{LMPC}, j}(x_{t}^{j}) = \min_{\bm{\pi}_{t}^{j}(\cdot)} \max_{\bar{\mathbf{w}}_{t}^{j}}

+ Q^{j-1}(x_{t+N|t}^{j}) \big]

x_{k+1|t}^{j} = A x_{k|t}^{j} + B u_{k|t}^{j} + \bar{w}_{k|t}^{j}

u_{k|t}^{j} = \pi_{k|t}^{j}(x_{k|t}^{j})

x_{k|t}^{j} \in \mathcal{X}, \ u_{k|t}^{j} \in \mathcal{U}

x_{t+N|t}^{j} \in \mathcal{CS}^{j-1}

x_{t|t}^{j} = x_{t}^{j}

\forall \bar{w}_{k|t}^{j} \in \mathcal{W}, \ k \in \{t, \dots, t+N\}

\bm{\pi}_{t}^{j,*}(\cdot) = [\pi_{t|t}^{j,*}(\cdot), \dots, \pi_{t+N|t}^{j,*}(\cdot)]

\pi^{j}(x_{t}^{j}) = \pi_{t|t}^{j,*}(x_{t}^{j}).


Full text

Sample-Based Learning Model Predictive Control for Linear Uncertain Systems

Ugo Rosolia and Francesco Borrelli

U. Rosolia and F. Borrelli are with the Department of Mechanical Engineering, University of California at Berkeley, Berkeley, CA 94701, USA. {ugo.rosolia, fborrelli}@berkeley.edu

Abstract

We present a sample-based Learning Model Predictive Controller (LMPC) for constrained uncertain linear systems subject to bounded additive disturbances. The proposed controller builds on earlier work on LMPC for deterministic systems. First, we introduce the design of the safe set and value function used to guarantee safety and performance improvement. Afterwards, we show how these quantities can be approximated using noisy historical data. The effectiveness of the proposed approach is demonstrated through a numerical example. We show that the LMPC is able to safely explore the state space and to iteratively improve the worst-case closed-loop performance, while robustly satisfying state and input constraints.

I Introduction

Exploiting historical data to iteratively improve the performance of Model Predictive Controllers (MPC) has been an active theme of research in the past few decades [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. The key idea is to use stored state-input pairs to compute at least one of the following three components used in the control design: i) a model which describes the evolution of the system, ii) a safe set of states (and an associated control policy $\pi(\cdot)$) from which the control task can be safely completed, and iii) a value function which represents the cumulative closed-loop cost from a given point of the safe set when the policy $\pi(\cdot)$ is used. In this work, we present a strategy to build safe sets and the associated value functions by exploiting historical noisy closed-loop trajectories.

Policy evaluation strategies used to estimate value functions from historical data are studied in Approximate Dynamic Programming (ADP) and Reinforcement Learning (RL) [12, 13, 14]. For instance, direct strategies compute the estimated value function which best fits the closed-loop cost data over the stored states. On the other hand, in indirect strategies the estimated value function is computed by iteratively minimizing the temporal difference [15, 16]. A survey of policy evaluation strategies goes beyond the scope of this work; we refer the reader to [13, 12] for a comprehensive review of this topic.

The integration of MPC with system identification strategies has been extensively studied in the literature [1, 5, 3, 4, 2, 7, 6]. In [5] the authors identified the system’s model using a deep neural network, which incorporates uncertainty using an ensemble of models. Another system identification strategy consists of fitting a Gaussian Process (GP) to experimental data [3, 4, 2]. A GP provides a nominal model and confidence bounds, which may be used to tighten the constraint set over the planning horizon. This strategy provides high-probability safety guarantees [3, 4]. The effectiveness of GP-based strategies on an experimental platform has been shown in [4], where an MPC is used to race a 1/43-scale vehicle. Regression strategies may also be used to identify the system’s model [7, 6]. For instance, the authors in [6] used linear regression to identify both the nominal model and the model uncertainty used for robust MPC design. In [7], we used local linear regression to identify the model used by the controller, which was able to drive a 1/10-scale race car at the limit of handling.

Data-based strategies to construct safe sets have been investigated in [17, 18, 19, 20, 21, 22]. The authors in [17] proposed a linear model predictive safety certification framework, where safe sets are computed exploiting closed-loop data generated by a robust controller. In [18, 19] the authors computed safe sets combining stored trajectories with polyhedral and ellipsoidal invariant sets. Another approach is proposed in [20], where the stored trajectories are mirrored to construct invariant sets. In [21, 22] we showed that data from a deterministic system can be trivially used to compute safe sets. However, these strategies cannot be used to compute safe sets for uncertain systems.

In this work we present a sample-based Learning Model Predictive Controller (LMPC) for linear systems subject to bounded additive uncertainty. We refer to a control task execution as an “iteration” and we iteratively update the LMPC policy. At iteration $j-1$, we show how to construct a robust safe set and value function, which are used to synthesize the LMPC policy at the next, $j$th, iteration. We show that the proposed strategy guarantees that: i) state and input constraints are robustly satisfied, ii) the closed-loop system converges asymptotically to a neighborhood of the origin, iii) the worst-case performance of the $j$th LMPC policy is non-increasing with the iteration index, and iv) the domain of the LMPC policy does not shrink with the iteration index. The proposed control strategy is computationally intensive. Therefore, we propose a practical algorithm that exploits simulations of the closed-loop system, which are associated with unknown sampled disturbance realizations. These closed-loop simulations, referred to as “roll-outs”, are used to approximate the safe set and the value function used in the LMPC design.

II Problem Definition

We consider the following linear time invariant system

$$x_{k+1}^{j}=Ax_{k}^{j}+Bu_{k}^{j}+w_{k}^{j} \tag{1}$$

where at time $k$ of the $j$th iteration the disturbance $w_{k}^{j}\in\mathcal{W}$, the state $x_{k}^{j}\in\mathbb{R}^{n}$ and the input $u_{k}^{j}\in\mathbb{R}^{d}$. Furthermore, the system is subject to the following convex polytopic state and input constraints, for all $k\geq 0$

$$x_{k}\in\mathcal{X}\ \text{and}\ \pi^{j}(x_{k}^{j})\in\mathcal{U}.$$

At each $j$ th iteration, we define the worst-case iteration cost associated with the control policy $\pi^{j}(\cdot)$ as the solution to the Bellman equation

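$$J_{\pi^{j}}^{j}(x_{0}^{j})=\max_{w\in\mathcal{W}}\big[h\big(x_{0}^{j},\pi^{j}(x_{0}^{j})\big)+J_{\pi^{j}}^{j}\big(Ax_{0}^{j}+B\pi^{j}(x_{0}^{j})+w\big)\big]. \tag{2}$$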

The goal of the control design is to solve the following infinite time robust optimal control problem,

[TABLE]

We present a strategy to iteratively design a feedback policy

$$\pi^{j}(\cdot):\mathcal{F}^{j}\subseteq\mathcal{X}\rightarrow\mathcal{U} \tag{4}$$

which is a feasible solution to Problem (3) for $x_{0}^{j}\in\mathcal{F}^{j}$. In particular, the proposed strategy guarantees: i) convergence of the closed-loop system (1) and (4) to a neighborhood of the origin $\mathcal{O}$; ii) safety: state and input constraints are robustly satisfied; iii) performance improvement: if the controller performs the same task repeatedly (i.e. $x_{0}^{j}=x_{0}^{j+1}$), then the worst-case iteration cost (2) is non-increasing (i.e. $J_{\pi^{j+1}}^{j+1}(x_{0}^{j+1})\leq J_{\pi^{j}}^{j}(x_{0}^{j})$); and iv) exploration: the domain of the policy (4) does not shrink with the iteration index (i.e. $\mathcal{F}^{i}\subseteq\mathcal{F}^{j},\forall j\geq i$).

Throughout this paper we use the standard notation for the function classes $\mathcal{K}$, $\mathcal{K}_{\infty}$ and $\mathcal{KL}$ (see [23]), and we define the distance from a point $x\in\mathbb{R}^{n}$ to a set $\mathcal{O}\subseteq\mathbb{R}^{n}$ as

$$|x|_{\mathcal{O}}\triangleq\inf_{d\in\mathcal{O}}\|x-d\|_{1}.$$
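As a small illustration of the point-to-set distance defined above, the sketch below evaluates it for the special case where the set is an axis-aligned box, using the 1-norm as it appears in the extracted equation list. The box shape and the function name `dist_to_box` are assumptions made for this example only; the set $\mathcal{O}$ used in the paper is a general robust positive invariant set, for which the infimum would typically require solving a small optimization problem.

```python
def dist_to_box(x, lower, upper):
    """1-norm distance from point x to the box {d : lower <= d <= upper}.

    Assumption for illustration: the target set is an axis-aligned box.
    """
    # For a box, the closest point is found by clipping each coordinate,
    # so the 1-norm distance is the sum of the per-coordinate violations.
    return sum(max(lo - xi, 0.0, xi - hi) for xi, lo, hi in zip(x, lower, upper))


# A point outside the box along the first axis, and a point inside the box.
print(dist_to_box([2.0, 0.0], [-1.0, -1.0], [1.0, 1.0]))
print(dist_to_box([0.5, -0.5], [-1.0, -1.0], [1.0, 1.0]))
```

The distance is zero exactly when the point lies in the box, which is the property the stage-cost bounds in Assumption 2 rely on.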

Furthermore, we make the following assumptions.

Assumption 1

The set $\mathcal{O}\subset\mathbb{R}^{n}$ is a robust positive invariant set for the autonomous system $x_{k+1}=(A+BK)x_{k}+w_{k}$ ,

$$\forall x_{k}\in\mathcal{O}\rightarrow(A+BK)x_{k}+w_{k}\in\mathcal{O},\ \forall w_{k}\in\mathcal{W}$$

and $\forall x_{k}\in\mathcal{O}$ we have that $Kx_{k}\in\mathcal{U}$ .
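The invariance condition in Assumption 1 can be checked numerically when both sets are polytopes, because the map $(x,w)\mapsto(A+BK)x+w$ is linear and it is then enough to test the vertices. The sketch below does this under the simplifying assumption that $\mathcal{O}$ and $\mathcal{W}$ are origin-centered boxes; the function names and the example matrix are hypothetical and are not taken from the paper.

```python
from itertools import product


def box_vertices(half_widths):
    """All vertices of the origin-centered box with the given half-widths."""
    return [list(v) for v in product(*[(-h, h) for h in half_widths])]


def is_robust_invariant_box(A_cl, o_half, w_half, tol=1e-12):
    """Check that A_cl x + w stays in the box O for all x in O and w in W.

    Assumption for illustration: O and W are origin-centered boxes.
    By linearity and convexity, testing every vertex pair is sufficient.
    """
    n = len(o_half)
    for x in box_vertices(o_half):
        for w in box_vertices(w_half):
            nxt = [sum(A_cl[r][c] * x[c] for c in range(n)) + w[r] for r in range(n)]
            if any(abs(nxt[r]) > o_half[r] + tol for r in range(n)):
                return False
    return True


# Hypothetical contracting closed-loop matrix A + BK.
A_cl = [[0.5, 0.0], [0.0, 0.5]]
print(is_robust_invariant_box(A_cl, [0.2, 0.2], [0.1, 0.1]))
print(is_robust_invariant_box(A_cl, [0.1, 0.1], [0.1, 0.1]))
```

The number of vertex pairs grows exponentially with the state dimension, so this check is only practical for low-dimensional examples such as the planar system used in the results.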

Assumption 2

The continuous stage cost $h(\cdot,\cdot)$ is jointly convex in its arguments. Furthermore, we assume that $\forall x\in\mathbb{R}^{n},\forall u\in\mathbb{R}^{d}$

[TABLE]

where $\alpha_{x}^{u},\alpha_{x}^{l},\alpha_{u}^{u}$ and $\alpha_{u}^{l}\in\mathcal{K}_{\infty}$ .

Notice that the above assumptions imply that the optimal policy from (3) robustly steers system (1) to the goal set $\mathcal{O}$ .

III Learning Model Predictive Control

In this section we illustrate the control design strategy. We show how to construct a safe set of states, from which the control policy $\pi^{j}(\cdot)$ can successfully complete the control task. Afterward, we define a value function which approximates the cost-to-go associated with the control policy $\pi^{j}(\cdot)$ . Finally, we exploit the safe set and the value function to synthesize the control policy $\pi^{j+1}(\cdot)$ at the next iteration $j+1$ .

III-A Safe Set

In this section we show how to iteratively construct a set of states from which the control task can be safely executed. First, we recall the definition of robust reachable set [24] for the closed-loop system (1) and (4),

[TABLE]

with $\mathcal{R}_{0}(x_{0}^{j})=x_{0}^{j}$. The above robust reachable set $\mathcal{R}_{N}(x_{0}^{j})$ collects the states which may be reached in $N$ steps by the closed-loop system (1) and (4).

Now, we define the safe set at the $j$ th iteration as

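$$\mathcal{SS}^{j}=\bigg\{\bigcup_{k=0}^{\infty}\mathcal{R}_{k}(x_{0}^{j})\bigg\}\bigcup\mathcal{O}. \tag{6}$$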

The above safe set $\mathcal{SS}^{j}$ contains the state evolution of the closed-loop system (1) and (4) at the $j$ th iteration.

Remark 1

In practical applications each iteration has a finite-time duration. It is common in the literature to adopt an infinite time formulation at each iteration for the sake of simplicity. We follow such an approach in this paper. Our choice does not affect the practicality of the proposed method. In Section IV-A, we show that if the $j$ th iteration is completed in finite time (i.e. $x_{T^{j}}^{j}\in\mathcal{O},T^{j}<\infty$ ), then the safe set $\mathcal{SS}^{j}$ can be approximated using historical data.

Finally, we define the convex safe set $\mathcal{CS}^{j}$ as the convex hull of the safe sets $\mathcal{SS}^{k}$ for iterations $k\in\{0,\ldots,j\}$ ,

$$\mathcal{CS}^{j}=\text{conv}\Bigg(\bigcup_{k=0}^{j}\mathcal{SS}^{k}\Bigg). \tag{7}$$

Notice that if the control policies $\pi^{k}(\cdot)$ for $k\in\{0,\ldots,j\}$ safely steer the system to the neighborhood of the origin $\mathcal{O}$, then $\mathcal{CS}^{j}$ is a robust control invariant set, as stated by the following proposition.

Proposition 1

For $j\geq 0$ , let $\pi^{j}(\cdot):\mathcal{F}^{j}\rightarrow\mathcal{U}$ be a control policy defined over $\mathcal{F}^{j}\subseteq\mathcal{X}$ . Consider system (1) in closed-loop with $\pi^{j}(\cdot)$ and assume that $\forall x_{0}^{j}\in\mathcal{F}^{j}$ we have $x_{k}^{j}\in\mathcal{X}$ and $\lim_{t\rightarrow\infty}x_{t}^{j}\in\mathcal{O},\forall w_{k}\in\mathcal{W},k\geq 0$ . Then, the convex safe set $\mathcal{CS}^{j}\subseteq\mathcal{X}$ is a robust control invariant set for system (1),

$$\forall x\in\mathcal{CS}^{j}\rightarrow Ax+B\pi^{j}(x)+w\in\mathcal{CS}^{j},\ \forall w\in\mathcal{W}$$

Proof:

By assumption, $\pi^{k}(\cdot)$ for $k\in\{0,\ldots,j\}$ in closed-loop with (1) robustly satisfies state and input constraints. By definition (6), $\mathcal{SS}^{k}$ is a robust control invariant set for $k\in\{0,\ldots,j\}$. Therefore, by linearity of system (1), $\mathcal{CS}^{j}\subseteq\mathcal{X}$ is a robust control invariant set. ∎

III-B Q-function

In this section we define the value function $Q^{j}(\cdot):\mathcal{CS}^{j}\rightarrow\mathbb{R}$, which approximates the cost-to-go from any state $x\in\mathcal{CS}^{j}$. Recall that the iteration cost (2) for the control policy $\pi^{j}(\cdot)$ is given by the solution to the following Bellman equation

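$$J_{\pi^{j}}^{j}(x_{0}^{j})=\max_{w\in\mathcal{W}}\big[h\big(x_{0}^{j},\pi^{j}(x_{0}^{j})\big)+J_{\pi^{j}}^{j}\big(Ax_{0}^{j}+B\pi^{j}(x_{0}^{j})+w\big)\big], \tag{8}$$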

and it represents the worst-case cost-to-go from any point in the state space. The solution to the above Bellman equation is hard to compute [12], and a closed-form solution exists only for a few problems [24]. For a survey of strategies to approximate the solution to the Bellman equation we refer to [12, 13].

Now, we define the worst-case cost-to-go over the safe set as

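$$L_{\pi^{j}}^{j}(x)=\begin{cases}\max_{w\in\mathcal{W}}\big[h(x,\pi^{j}(x))+L_{\pi^{j}}^{j}(x_{+}^{j}(w))\big] & \text{if } x\in\mathcal{SS}^{j}\\ +\infty & \text{if } x\notin\mathcal{SS}^{j}\end{cases} \tag{9}$$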

where $x^{j}_{+}(w)=Ax+B\pi^{j}(x)+w$. Notice that, for all $x\in\mathcal{SS}^{j}$, the above function coincides with the solution to the Bellman equation (8). The difference between $J_{\pi^{j}}^{j}(\cdot)$ and $L_{\pi^{j}}^{j}(\cdot)$ is that the domain of the latter is the safe set $\mathcal{SS}^{j}$ from (6). The solution to equation (9) is still hard to compute; however, it may be approximated using sampled closed-loop trajectories from $\mathcal{SS}^{j}$, as shown in Section IV-B.

Finally, for all $x\in\mathcal{CS}^{j}$ we define the function

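$$Q^{j}(x)=\min_{\mu}\big\{\mu \mid (x,\mu)\in\text{conv}\big(\textstyle\bigcup_{k=0}^{j}\text{epi}(L_{\pi^{k}}^{k})\big)\big\}, \tag{10}$$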

which interpolates the worst-case cost-to-go functions $L_{\pi^{k}}^{k}(\cdot)$ for $k\in\{0,\ldots,j\}$. Notice that the above $Q^{j}(\cdot)$ is simply a convexification of the cost-to-go functions (i.e. $\text{epi}(Q^{j})=\text{conv}\big(\bigcup_{k=0}^{j}\text{epi}(L_{\pi^{k}}^{k})\big)$). Furthermore, if the control policies $\pi^{k}(\cdot)$ for $k\in\{0,\ldots,j\}$ safely steer the system to the neighborhood of the origin $\mathcal{O}$, then the approximated value function $Q^{j}(\cdot)$ is a robust control Lyapunov function over the convex safe set $\mathcal{CS}^{j}$ for system (1), as shown by the following proposition.

Proposition 2

For $j\geq 0$, let $\pi^{j}(\cdot):\mathcal{F}^{j}\rightarrow\mathcal{U}$ be a control policy defined over $\mathcal{F}^{j}\subseteq\mathcal{X}$. Consider system (1) in closed-loop with $\pi^{j}(\cdot)$ and assume that $\forall x_{0}^{j}\in\mathcal{F}^{j}$ we have $x_{k}^{j}\in\mathcal{X}$ and $\lim_{t\rightarrow\infty}x_{t}^{j}\in\mathcal{O},\ \forall w_{k}\in\mathcal{W}$. Then, $Q^{j}(\cdot)$ is a robust control Lyapunov function, i.e.

$$\min_{u\in\mathcal{U}}\max_{w\in\mathcal{W}}\big[Q^{j}(Ax+Bu+w)+h(x,u)-Q^{j}(x)\big]\leq 0 \tag{11}$$

for all $x\in\mathcal{CS}^{j}$ .

Proof:

From definition (10), we have that $\forall x\in\mathcal{CS}^{j}$ there exist a set of multipliers $\{\lambda_{0}^{0},\ldots,\lambda_{k}^{i},\ldots,\lambda_{K}^{j}\}$ and a set of states $\{x_{0}^{0},\ldots,x_{k}^{i},\ldots,x_{K}^{j}\}$ such that for all $k\in\{0,\ldots,K\}$ and for all $i\in\{0,\ldots,j\}$ we have $x_{k}^{i}\in\mathcal{SS}^{i}$ , $\lambda_{k}^{i}\geq 0$ , $\sum_{i=0}^{j}\sum_{k=0}^{K}\lambda_{k}^{i}=1$ , $\sum_{i=0}^{j}\sum_{k=0}^{K}\lambda_{k}^{i}x_{k}^{i}=x$ , and

$$Q^{j}(x)=\sum_{i=0}^{j}\sum_{k=0}^{K}\lambda_{k}^{i}L_{\pi^{i}}^{i}(x_{k}^{i}).$$

Substituting in the above equation the definition of the worst-case cost-to-go (9) evaluated at $x_{k}^{i}\in\mathcal{SS}^{i}$ and leveraging the convexity of $h(\cdot,\cdot)$ , we have that

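$$Q^{j}(x)\geq\max_{w\in\mathcal{W}}\Big[h(x,u)+\sum_{i=0}^{j}\sum_{k=0}^{K}\lambda_{k}^{i}L_{\pi^{i}}^{i}\big(x_{k,+}^{i}(w)\big)\Big],$$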

where $x=\sum_{i=0}^{j}\sum_{k=0}^{K}\lambda_{k}^{i}x_{k}^{i}$ , $u=\sum_{i=0}^{j}\sum_{k=0}^{K}\lambda_{k}^{i}\pi^{i}(x_{k}^{i})\in\mathcal{U}$ and $x^{i}_{k,+}(w)=Ax_{k}^{i}+B\pi^{i}(x_{k}^{i})+w$ . Definition (10) implies that $Q^{j}(x)\leq L^{i}_{\pi^{i}}(x),\forall x\in\mathcal{CS}^{j}$ and $\forall i\in\{0,\ldots,j\}$ , therefore from the above equation and convexity of $Q^{j}(\cdot)$ we conclude that

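$$Q^{j}(x)\geq\max_{w\in\mathcal{W}}\Big[h(x,u)+Q^{j}\Big(\sum_{i=0}^{j}\sum_{k=0}^{K}\lambda_{k}^{i}x^{i}_{k,+}(w)\Big)\Big]\geq\min_{u\in\mathcal{U}}\max_{w\in\mathcal{W}}\big[h(x,u)+Q^{j}(Ax+Bu+w)\big].$$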

∎
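The multiplier representation used in the proof of Proposition 2 suggests how $Q^{j}(\cdot)$ could be evaluated in practice. As a sketch, assuming the cost-to-go $L_{\pi^{i}}^{i}(\cdot)$ is only known at finitely many stored states $x_{k}^{i}$ (an assumption made here for illustration), evaluating $Q^{j}$ at a query point $x$ reduces to a linear program in the multipliers:

```latex
Q^{j}(x) \;=\; \min_{\lambda_{k}^{i} \geq 0} \;
  \sum_{i=0}^{j}\sum_{k=0}^{K} \lambda_{k}^{i}\, L_{\pi^{i}}^{i}(x_{k}^{i})
\quad \text{s.t.} \quad
  \sum_{i=0}^{j}\sum_{k=0}^{K} \lambda_{k}^{i}\, x_{k}^{i} = x,
\qquad
  \sum_{i=0}^{j}\sum_{k=0}^{K} \lambda_{k}^{i} = 1 .
```

The two equality constraints express that $x$ is a convex combination of stored states, and the objective is the same convex combination of their costs, so the minimizer traces the lower boundary of the convex hull of the stored (state, cost) pairs.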

III-C Controller Design

In this section we illustrate the controller design which leverages the convex safe set (7) and the approximated value function (10). At each time $t$ of the $j$ th iteration, we solve the following finite time optimal control problem

[TABLE]

where the control policy ${\bm{\pi}_{t}^{j}}(\cdot)=[\pi_{t|t}^{j}(\cdot),\ldots,\pi_{t+N|t}^{j}(\cdot)]$ and the disturbance $\bar{\bf{w}}^{j}_{t}=[\bar{w}_{t|t}^{j},\ldots,\bar{w}_{t+N|t}^{j}]$ . The optimal feedback policy from the above finite time optimal control problem safely steers system (1) from $x_{t}^{j}$ to the convex safe set, while minimizing the worst-case cost. Let

$$\bm{\pi}_{t}^{j,*}(\cdot)=[\pi_{t|t}^{j,*}(\cdot),\ldots,\pi_{t+N|t}^{j,*}(\cdot)] \tag{13}$$

be the optimal feedback policy to Problem (12). Then we apply to system (1)

$$\pi^{j}(x_{t}^{j})=\pi_{t|t}^{j,*}(x_{t}^{j}). \tag{14}$$

The finite time optimal control problem (12) is solved at time $t+1$ , based on the new state $x_{t+1|t+1}^{j}=x^{j}_{t+1}$ , yielding a moving or receding horizon control strategy.

Furthermore, we define the domain of the LMPC policy (14), which is given by

[TABLE]

The set $\mathcal{F}^{j}$ , which collects the feasible initial conditions to Problem (12), is used to compute the initial state $x_{0}^{j}$ of the $j$ th iteration. In particular, the initial condition at the $j$ th iteration is computed solving the following convex optimization problem,

[TABLE]

where the user-defined row vector $a\in\mathbb{R}^{n}$ represents the direction in which the LMPC explores the state space, and $a^{\perp}\in\mathbb{R}^{n}$ is a row vector perpendicular to $a$ .

It is well known that the solution to Problem (12) can be computed by enumerating the vertices of the disturbance set over the prediction horizon [25]. Therefore, the computational complexity of Problem (12) grows exponentially with the horizon length $N$. For this reason, it is important to construct a terminal set and terminal cost which guarantee safety and performance improvement independently of the prediction horizon length. In the results section, we show that the proposed controller is able to safely explore the state space and to improve its performance, even with a short prediction horizon.
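To make the growth in complexity concrete, the sketch below enumerates the disturbance vertex sequences for a box-shaped disturbance set and counts them for a few horizon lengths. The function name `disturbance_vertex_sequences` and its parameters are hypothetical; the box bound 0.1 and the dimension 2 are borrowed from the numerical example in the results section, and the count shown is for sequences of exactly `horizon` disturbance samples.

```python
from itertools import product


def disturbance_vertex_sequences(w_max, dim, horizon):
    """Return an iterator over all sequences of `horizon` vertices of the
    box {w : ||w||_inf <= w_max} in dimension `dim`."""
    vertices = list(product((-w_max, w_max), repeat=dim))
    return product(vertices, repeat=horizon)


# The box in R^2 has 4 vertices, so there are 4**N sequences of length N.
for N in (1, 2, 3, 5):
    count = sum(1 for _ in disturbance_vertex_sequences(0.1, 2, N))
    print(N, count)
```

A min-max problem solved by vertex enumeration needs one predicted trajectory per sequence, which is why a short horizon combined with a good terminal set and terminal cost is attractive.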

III-D Properties

As discussed in Propositions 1-2, for every point in $\mathcal{CS}^{j}$ there exists a control policy which safely steers the system to the terminal goal set. The properties of $\mathcal{CS}^{j}$ and $Q^{j}(\cdot)$ allow us to guarantee that the proposed strategy meets the requirements from Section II. The following theorem shows that the LMPC (12) and (14) satisfies state and input constraints while steering the system to the neighborhood of the origin $\mathcal{O}$ .

Theorem 1

Consider system (1) in closed-loop with the LMPC (12) and (14). Let Assumptions 1-2 hold, initialize $\mathcal{CS}^{0}=\mathcal{O}$ and $Q^{0}(\cdot)=0$ . If $x_{0}^{j}\in\mathcal{F}^{j},\forall j\geq 1$ , then the LMPC (12) and (14) is feasible for all $t\geq 0$ and iteration $j\geq 1$ . Furthermore, the closed-loop system asymptotically converges to $\mathcal{O}$ , regardless of the disturbance realization.

Proof:

Assume that at the $j$ th iteration $Q^{j}(\cdot)$ is a robust control Lyapunov function defined on the robust control invariant set $\mathcal{CS}^{j}$ . Then, by standard MPC arguments and the assumption on $x_{0}^{j}\in\mathcal{F}^{j}$ , we have that at iteration $j+1$ the LMPC (12) and (14) recursively satisfies state and input constraints, and the closed-loop system (1) and (14) converges asymptotically to the terminal set $\mathcal{O}$ [24]. Consequently, the LMPC policy at iteration $j+1$ used to compute $Q^{j+1}(\cdot)$ and $\mathcal{CS}^{j+1}$ satisfies the assumptions in Propositions 1-2, and therefore $Q^{j+1}(\cdot)$ is a robust control Lyapunov function defined on the robust control invariant set $\mathcal{CS}^{j+1}$ .

The proof is completed by induction. We initialized $Q^{0}(\cdot)=0$, which is a robust control Lyapunov function defined on the robust control invariant set $\mathcal{CS}^{0}=\mathcal{O}$. Therefore it follows that $\forall j\geq 1$ the LMPC (12) and (14) recursively satisfies state and input constraints, and the closed-loop system (1) and (14) converges asymptotically to the terminal set $\mathcal{O}$. ∎

Next, we discuss the performance improvement properties. In particular, we show that if the initial condition of two subsequent iterations does not change (i.e. $x_{0}^{j}=x_{0}^{j+1}$), then the worst-case iteration cost is non-increasing.

Theorem 2

Consider system (1) in closed-loop with the LMPC (12) and (14). Let Assumptions 1-2 hold, initialize $\mathcal{CS}^{0}=\mathcal{O}$ and $Q^{0}(\cdot)=0$. If the initial conditions of two subsequent iterations are equal, $x_{0}^{j+1}=x_{0}^{j}\in\mathcal{F}^{j}$, then the worst-case iteration cost (2) is non-increasing with the iteration index, i.e. $J_{0\rightarrow T^{j+1}}^{j+1}(x_{0}^{j+1})\leq J_{0\rightarrow T^{j}}^{j}(x_{0}^{j}).$

Proof:

By Theorem 1, the LMPC (12) and (14) is feasible at time $t$ of the $j$th iteration. Let (13) be the optimal policy at time $t$ of the $j$th iteration; by Proposition 2 we have

[TABLE]

The above equation and the convergence of the closed-loop system (1) and (14) from Theorem 1 imply that

[TABLE]

The above derivation holds for all disturbance realizations, therefore we have that

[TABLE]

Finally we notice that the above inequality together with Equations (9)-(10) and the feasibility of the LMPC policy $\pi^{j}(\cdot)$ (14) at the next iteration $j+1$ imply that

[TABLE]

∎

Finally, we show that the domain of the LMPC (12) and (14) does not shrink at each iteration.

Theorem 3

Consider system (1) in closed-loop with the LMPC (12) and (14). Let Assumptions 1-2 hold, and initialize $\mathcal{CS}^{0}=\mathcal{O}$ and $Q^{0}(\cdot)=0$. If $x_{0}^{j}\in\mathcal{F}^{j},\forall j\geq 1$, then the domain of the LMPC defined in (15) does not shrink at each iteration, i.e. $\mathcal{F}^{i}\subseteq\mathcal{F}^{j},\forall j\geq i$.

Proof:

The proof follows from the definition of the convex safe set. Notice that by definition (7) we have that $\mathcal{CS}^{i}\subseteq\mathcal{CS}^{j},\forall j\geq i$ . Therefore, the terminal set in (15) is not shrinking at each iteration and $\mathcal{F}^{i}\subseteq\mathcal{F}^{j},\forall j\geq i$ . ∎

IV Practical Implementation

In this section we show how the closed-loop trajectories associated with unknown sampled disturbance sequences can be used to approximate the convex safe set $\mathcal{CS}^{j}$ and the value function $Q^{j}(\cdot)$. At each $j$th iteration we collect $R$ simulations of the closed-loop system, also referred to as “roll-outs”. Afterwards, we exploit these $R$ roll-outs to approximate the robust reachable sets (5) and the worst-case cost-to-go (9).

IV-A Sample-Based Convex Safe Set

In this section we show how the data from the closed-loop system (1) and (4) can be used to approximate the convex safe set $\mathcal{CS}^{j}$ . We define the $i$ th disturbance realization sequence ${\bf{w}}^{j}_{i}=[w_{0,i}^{j},\ldots,w_{T^{j},i}^{j}]$ , where $w_{k,i}^{j}$ is the realized disturbance at time $k$ of the $j$ th iteration. Furthermore, we denote the stored closed-loop trajectory associated with the $i$ th disturbance realization ${\bf{w}}^{j}_{i}$ as

[TABLE]

where $T^{j}$ is the time at which the terminal goal set $\mathcal{O}$ is reached. The above notation emphasizes that the realized state $x_{k}^{j}({\bf{w}}^{j}_{i})$ is a function of the realized disturbance sequence ${\bf{w}}^{j}_{i}$. Now, we notice that at each time $k$ of the $j$th iteration the state $x_{k}^{j}({\bf{w}}^{j}_{i})$ is contained in the $k$-step robust reachable set from $x_{0}^{j}$ (i.e. $x_{k}^{j}({\bf{w}}^{j}_{i})\in\mathcal{R}_{k}(x_{0}^{j})$). Therefore, we approximate the $k$-step robust reachable set $\mathcal{R}_{k}(x_{0}^{j})$ using $R$ roll-outs. In particular, for $i\in\{1,\ldots,R\}$ sampled disturbance sequences ${\bf{w}}^{j}_{i}$ we define the approximated $k$-step robust reachable set

[TABLE]

Finally, we define the approximated safe set

[TABLE]

which is used to construct the approximated convex safe set,

[TABLE]

It is important to underline that the above approximated convex safe set $\tilde{\mathcal{CS}}^{j}$ is not invariant, as the approximated reachable sets are an inner approximation of the exact reachable sets (Figure 1). Indeed, there may exist a disturbance realization which steers the closed-loop system (1) and (14) outside $\tilde{\mathcal{CS}}^{j}$. In particular, given $x\in\tilde{\mathcal{CS}}^{j}$ there is a probability $\epsilon>0$ that the closed-loop system evolves outside $\tilde{\mathcal{CS}}^{j}$,

[TABLE]

In the results section, we show that the above probability is a function of the number of roll-outs used to construct $\tilde{\mathcal{CS}}^{j}$. In particular, as more roll-outs are collected, $\tilde{\mathcal{CS}}^{j}$ from (19) better approximates the convex safe set ${\mathcal{CS}}^{j}$ from (7).
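The idea of a sample-based convex safe set and its empirical violation probability can be sketched for planar states as follows. This is a minimal illustration, not the authors' implementation: it builds the convex hull of stored states with Andrew's monotone chain algorithm (a technique swapped in here for the two-dimensional case) and reports the fraction of fresh samples that fall outside it. The function names and the uniformly sampled demo points are hypothetical stand-ins for stored and newly realized closed-loop states.

```python
import random


def cross(o, a, b):
    """z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def in_hull(hull, p, tol=1e-12):
    """True if p is inside or on a counter-clockwise convex polygon
    (assumes the hull has at least three vertices)."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= -tol for i in range(n))


def fraction_outside(hull, test_points):
    """Empirical estimate of the probability of leaving the sampled set."""
    outside = sum(1 for p in test_points if not in_hull(hull, p))
    return outside / len(test_points)


# Demo with hypothetical data: more stored samples give a larger hull,
# so fewer fresh samples fall outside it.
rng = random.Random(0)
stored = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(1000)]
fresh = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(1000)]
for R in (100, 1000):
    print(R, fraction_outside(convex_hull(stored[:R]), fresh))
```

Because the first 100 stored points are a subset of the 1000, the larger hull contains the smaller one, which mirrors the containment between approximated convex safe sets built from more roll-outs.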

IV-B Sample-Based Q-function

In this section we show how the closed-loop trajectories may be used to approximate the cost-to-go function $L^{j}_{\pi^{j}}(\cdot)$ in (9). First, we define the realized cost-to-go associated with the stored state $x_{k}^{j}({\bf{w}}^{i})\in\tilde{\mathcal{R}}_{k}(x_{0}^{j})\subseteq\tilde{\mathcal{SS}}^{j}$ ,

[TABLE]

where $u_{k}^{j}({\bf{w}}^{i})=\pi^{j}(x_{k}^{j}({\bf{w}}^{i}))$ .
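Equation (21) is not reproduced in this text, so the sketch below assumes the usual form of a realized cost-to-go: the sum of the stage costs from time $k$ until the goal set is reached at time $T^{j}$. Under that assumption, the cost-to-go at every stored state of one roll-out can be obtained with a single backward pass over the recorded stage costs. The function name `realized_cost_to_go` is hypothetical.

```python
def realized_cost_to_go(stage_costs):
    """Return J[k] = sum(stage_costs[k:]) for every k, using one backward pass.

    Assumption for illustration: stage_costs[k] holds h(x_k, u_k) recorded
    along a single closed-loop roll-out, ending when the goal set is reached.
    """
    cost_to_go = [0.0] * len(stage_costs)
    running = 0.0
    for k in range(len(stage_costs) - 1, -1, -1):
        running += stage_costs[k]
        cost_to_go[k] = running
    return cost_to_go


# Stage costs recorded along one hypothetical roll-out.
print(realized_cost_to_go([1.0, 2.0, 3.0]))
```

Repeating this for each of the $R$ roll-outs yields one (state, cost) pair per stored state, which is the data the upper-bounding step described next operates on.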

The realized cost (21), associated with the realized trajectory (17), is used to approximate the worst-case cost-to-go function $L^{j}_{\pi^{j}}(\cdot)$. We compute a hyperplane which upper-bounds the realized cost $\tilde{J}^{j}_{k\rightarrow T^{j}}(x_{k}^{j}({\bf{w}}^{i}))$ for all stored states $\big\{\bigcup_{i=1}^{R}x_{k}^{j}({\bf{w}}^{i})\big\}\in\tilde{\mathcal{R}}_{k}(x_{0}^{j})$. In particular, for time $k$ of the $j$th iteration we define the hyperplane $a^{j}_{k}x+b^{j}_{k}$, where

[TABLE]

At the $j$ th iteration, the hyperplanes $a^{j}_{k}x+b^{j}_{k}$ are used to approximate the worst-case cost-to-go $L^{j}_{\pi^{j}}(\cdot)$ from (9) as follows,

[TABLE]

The resulting approximated value function is defined as

[TABLE]

Finally, we underline that the above approximated value function is not a control Lyapunov function for system (1). Indeed, there is a probability $\gamma>0$ that Equation (11) does not hold and $\tilde{Q}^{j}(\cdot)$ is not decreasing along the closed-loop trajectory,

[TABLE]

In the results section, we show that the above probability is inversely proportional to the number $R$ of roll-outs used to construct $\tilde{L}_{\pi^{j}}^{j}(\cdot)$ in (23).

V Results

We test the proposed control strategy on the following double integrator system

[TABLE]

where the random disturbance $w_{k}$ is uniformly distributed on the set $\mathcal{W}=\{w\in\mathbb{R}^{2}:\|w\|_{\infty}\leq 0.1\}$. The system is subject to the following state and input constraints, $x_{k}\in\mathcal{X}=\{x\in\mathbb{R}^{2}:\|x\|_{\infty}\leq 10\}$ and $u_{k}\in\mathcal{U}=\{u\in\mathbb{R}^{2}:\|u\|_{\infty}\leq 1\}$, for all $k\geq 0$. Furthermore, we compute the minimal robust positive invariant set $\mathcal{O}$ for the autonomous system $x_{k+1}=(A+BK)x_{k}+w_{k}$, where $-K$ is the LQR gain for $Q=1$ and $R=1$. Finally, we define the stage cost $h(x,u)=|x|_{\mathcal{O}}+|u|_{K\mathcal{O}}$, which satisfies Assumption 2.
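A single roll-out of a disturbed double integrator can be simulated in a few lines. The sketch below is illustrative only: the system matrices are not reproduced in this text, so the standard discrete-time double integrator with a scalar input is assumed, and the gain `K` is a hypothetical stabilizing choice rather than the LQR gain used in the paper. Input and state constraints are not enforced here.

```python
import random

A = [[1.0, 1.0], [0.0, 1.0]]   # assumed double-integrator dynamics
B = [0.0, 1.0]                 # assumed scalar input channel
K = [-0.5, -1.0]               # hypothetical stabilizing gain, u = K x


def rollout(x0, steps, w_max, rng):
    """Simulate x+ = A x + B u + w with u = K x and w sampled uniformly
    from the box {w : ||w||_inf <= w_max}; returns the list of states."""
    x = list(x0)
    traj = [list(x)]
    for _ in range(steps):
        u = K[0] * x[0] + K[1] * x[1]
        w = [rng.uniform(-w_max, w_max) for _ in range(2)]
        x = [A[0][0] * x[0] + A[0][1] * x[1] + B[0] * u + w[0],
             A[1][0] * x[0] + A[1][1] * x[1] + B[1] * u + w[1]]
        traj.append(x)
    return traj


# One roll-out from the first initial condition in Table I, with the
# disturbance bound 0.1 from the definition of W.
trajectory = rollout([-2.0, 0.0], 20, 0.1, random.Random(0))
print(trajectory[-1])
```

Calling `rollout` with different random seeds produces the family of disturbed trajectories that the sample-based constructions of Section IV take as input.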

The convex safe set $\mathcal{CS}^{j}$ and value function $Q^{j}(\cdot)$ , used in the LMPC (12) and (14), are approximated as described in Section IV. In particular, at each iteration $j$ we use $R$ roll-outs to compute the approximated safe set $\tilde{\mathcal{CS}}^{j}$ and value function $\tilde{Q}^{j}(\cdot)$ . In order to initialize the LMPC we set $N=3$ , $\tilde{\mathcal{CS}}^{0}=\mathcal{O}$ and $\tilde{Q}^{0}(\cdot)=0$ . Finally, at each $j$ th iteration, the initial state $x_{0}^{j}$ is computed as the furthest point along the negative $x$ -axis which belongs to $\mathcal{F}^{j}$ . That is, we set $a=[-1,~0]$ in (16).
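The search for the furthest feasible point along a direction $a$ can be sketched with a bisection. This is only an illustration and not necessarily how (16) is solved: `is_feasible` is a hypothetical oracle that stands in for checking feasibility of Problem (12), and the sketch assumes that the origin is feasible and that the feasible points along the ray form an interval.

```python
def furthest_feasible_point(is_feasible, direction, t_max, tol=1e-6):
    """Largest step t in [0, t_max] such that t * direction is feasible.

    Assumptions: is_feasible(0) is True and feasibility along the ray is an
    interval, so bisection on t is valid.
    """
    def point(t):
        return [t * d for d in direction]

    if is_feasible(point(t_max)):
        return point(t_max)
    lo, hi = 0.0, t_max  # lo is always feasible, hi always infeasible
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_feasible(point(mid)):
            lo = mid
        else:
            hi = mid
    return point(lo)


def toy_oracle(x):
    """Hypothetical feasibility test used only to exercise the search."""
    return x[0] >= -5.46 and abs(x[1]) <= 1.0


# a = [-1, 0] selects the negative x-axis; 10 is the state bound from X.
x0 = furthest_feasible_point(toy_oracle, [-1.0, 0.0], t_max=10.0)
```

With the toy oracle the search stops within `tol` of the boundary at $-5.46$ on the first coordinate; in practice each oracle call would require solving the finite-time optimal control problem.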

V-A Convex Safe Set and Value Function Approximation

In this section, we construct $\tilde{\mathcal{CS}}^{1}$ and $\tilde{Q}^{1}(\cdot)$ using $R=100$ and $R=1000$ roll-outs. Furthermore, we perform $1000$ Monte-Carlo simulations for the closed-loop system (1) and (14), in order to estimate the properties of $\tilde{\mathcal{CS}}^{1}$ and $\tilde{Q}^{1}(\cdot)$ .

Figure 2 shows the terminal set $\mathcal{O}$ and the approximated robust reachable sets $\tilde{\mathcal{R}}_{k}(x_{0}^{1})$ , which are used to construct the approximated convex safe set $\tilde{\mathcal{CS}}^{1}$ with $R=100$ and $R=1000$ roll-outs. As expected, the approximated convex safe set $\tilde{\mathcal{CS}}^{1}$ constructed using $1000$ trajectories contains the one constructed using $100$ trajectories. As mentioned in Section IV-A (Eq. (20)), the approximated convex safe set is not invariant. Indeed, there is a probability $\epsilon>0$ that, given a state $x\in\tilde{\mathcal{CS}}^{1}$ , the closed-loop system evolves outside $\tilde{\mathcal{CS}}^{1}$ . In order to estimate the probability $\epsilon$ , we perform $1000$ Monte-Carlo simulations for the closed-loop system (1) and (14) and we compute the percentage of realized states which evolved outside $\tilde{\mathcal{CS}}^{1}$ . As expected, the probability $\epsilon$ decreases as more roll-outs are used to construct $\tilde{\mathcal{CS}}^{1}$ . In particular, we have that $\epsilon\sim 3.6\%$ and $\epsilon\sim 0.3\%$ for $R=100$ and $R=1000$ , respectively.
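The Monte-Carlo estimate of $\epsilon$ can be sketched as a membership count. As a simplifying assumption, the approximated safe set is represented here by the convex hull of a finite list of stored two-dimensional states; the helper names and the sample points are hypothetical, and the hull is assumed to be non-degenerate (at least three non-collinear points).

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def in_hull(hull, p, tol=1e-12):
    """True if p lies inside or on the boundary of a counter-clockwise convex polygon."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= -tol for i in range(n))


def exit_fraction(stored_states, realized_states):
    """Fraction of realized states that fall outside the hull of the stored states."""
    hull = convex_hull(stored_states)
    outside = sum(1 for p in realized_states if not in_hull(hull, p))
    return outside / len(realized_states)


# Hypothetical stored states (safe-set samples) and realized test states.
stored = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
realized = [(0.5, 0.5), (2.0, 2.0), (0.1, 0.9), (-1.0, 0.0)]
eps_hat = exit_fraction(stored, realized)
```

If the states stored with $R=1000$ roll-outs include those stored with $R=100$, the larger hull contains the smaller one, so the exit fraction computed on the same realized states cannot increase.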

Finally, we analyze how the number of roll-outs affects the approximated value function $\tilde{Q}^{1}(\cdot)$ . Figure 3 shows the approximated value function $\tilde{Q}^{1}(\cdot)$ constructed with $R=100$ and $R=1000$ roll-outs. First, we notice that the domain of the approximated value function $\tilde{Q}^{1}(\cdot)$ is enlarged as more realized trajectories are used to compute the approximation. Indeed, the domain of $\tilde{Q}^{1}(\cdot)$ is the approximated safe set $\tilde{\mathcal{CS}}^{1}$ from Figure 2. Second, we recall that $\tilde{Q}^{1}(\cdot)$ is constructed based on sampled disturbance sequences and it underestimates $Q^{1}(\cdot)$ , which considers the whole disturbance support. Therefore, we expect that as more sampled disturbance sequences are considered, $\tilde{Q}^{1}(\cdot)$ better approximates $Q^{1}(\cdot)$ . This intuition is confirmed by Figure 3: the function $\tilde{Q}^{1}(\cdot)$ constructed with $1000$ trajectories upper-bounds almost everywhere the function $\tilde{Q}^{1}(\cdot)$ constructed with $100$ trajectories, and therefore it better approximates $Q^{1}(\cdot)$ . Finally, we recall from Equation (25) that $\tilde{Q}^{1}(\cdot)$ is not a robust control Lyapunov function. Indeed, there is a probability $\gamma>0$ that $\tilde{Q}^{1}(\cdot)$ is not decreasing along the realized closed-loop trajectory. In order to estimate the probability $\gamma$ , we use $1000$ Monte Carlo simulations. As expected, the probability $\gamma$ decreases as more closed-loop trajectories are used to construct $\tilde{Q}^{1}(\cdot)$ . In particular, we have $\gamma\sim 10.1\%$ and $\gamma\sim 4.3\%$ for $R=100$ and $R=1000$ , respectively.
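An empirical estimate of $\gamma$ can be sketched by counting the closed-loop steps at which the approximated value function fails to decrease. The exact condition in (25) is not restated in this sketch; as an assumption, a violation is counted whenever $\tilde{Q}^{1}(x_{k+1})>\tilde{Q}^{1}(x_{k})$, and the count is taken per step rather than per trajectory. The function name and the sample values are hypothetical.

```python
def non_decrease_fraction(value_sequences, tol=0.0):
    """Fraction of closed-loop steps at which the value function increases.

    value_sequences: one list per simulated trajectory, holding the value
    function evaluated at x_0, x_1, ... along that trajectory.
    """
    violations = 0
    total = 0
    for values in value_sequences:
        for q_now, q_next in zip(values, values[1:]):
            total += 1
            if q_next > q_now + tol:
                violations += 1
    return violations / total if total else 0.0


# Hypothetical value-function evaluations along three simulated trajectories.
sequences = [
    [5.0, 4.0, 3.0, 1.0],  # decreasing at every step
    [5.0, 5.5, 2.0, 1.0],  # one increase (5.0 -> 5.5)
    [3.0, 2.0, 2.5],       # one increase (2.0 -> 2.5)
]
gamma_hat = non_decrease_fraction(sequences)
```

Here two of the eight steps are violations, so `gamma_hat` equals $0.25$. A stricter test that also subtracts the stage cost from the right-hand side could be substituted if that is the condition intended in (25).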

V-B Iterative Policy Update

In this section we run the LMPC for $10$ iterations. In particular, at each $j$ th iteration we collect $R=1000$ roll-outs which are used to compute the approximated convex safe set $\tilde{\mathcal{CS}}^{j}$ and the approximated value function $\tilde{Q}^{j}(\cdot)$ . We show that the LMPC is able to explore the state space while safely steering the system to the terminal set $\mathcal{O}$ .

As stated in Section V, at each $j$ th iteration we compute the initial condition $x_{0}^{j}$ as the furthest point along the negative $x$ -axis such that Problem (12) is feasible. Notice that by Theorem 3, the domain of the LMPC policy $\mathcal{F}^{j}$ is enlarged at each iteration (i.e., ${\mathcal{F}}^{k}\subseteq{\mathcal{F}}^{j}$ for all $k\in\{1,\ldots,j\}$ ). As a result, the region of the state space from which the controller is able to safely complete the control task grows at each iteration. This fact is highlighted in Table I, where we report the initial condition $x_{0}^{j}$ as a function of the iteration index. Furthermore, in Figure 4 we show $1000$ realized trajectories for the $2$ nd, $4$ th and $8$ th iterations. We notice that at each iteration the LMPC safely operates the system over progressively larger regions of the state space, until the closed-loop trajectory is close to saturating the state constraints.

Finally, in Figure 5 we report the approximated value function $\tilde{Q}^{j}(\cdot)$ for the $2$ nd, $4$ th and $8$ th iterations. We recall that the domain of $\tilde{Q}^{j}(\cdot)$ is the approximated convex safe set $\tilde{\mathcal{CS}}^{j}$ , which is enlarged at each iteration. Therefore, as more iterations of the control task are executed, $\tilde{Q}^{j}(\cdot)$ approximates the value function over larger regions of the state space, as shown in Figure 5.

V-C Performance Improvement

In this section we empirically validate Theorem 2. We design an LMPC which minimizes the stage cost $\bar{h}(x,u)=0.1|x|_{\mathcal{O}}+|u|_{KO}$ . Afterwards, we run the closed-loop system for $10$ iterations starting from the same initial condition, $x_{0}^{j}=-[0,~9.9]~\forall j\in\{0,\dots,9\}$ . In order to initialize the LMPC, we use a suboptimal controller which robustly steers system (26) to $\mathcal{O}$ and we exploit the closed-loop data to initialize the approximated convex safe set and value function.

Figure 6 shows the closed-loop cost $\tilde{J}^{j}_{0\rightarrow T^{j}}(x_{0}^{j}({\bf{w}}^{j}_{i}))$ from (21) and the worst-case realized cost

[TABLE]

for $10$ iterations. We notice that the LMPC is able to improve the worst-case realized cost associated with the suboptimal policy used at the $0$ th iteration. Furthermore, we underline that the controller performs exactly the same task at each iteration ( $x_{0}^{j}=x_{0}^{i},\forall j,i\geq 0$ ) and the worst-case realized cost (27) decreases at each iteration, until it converges within a tolerance of $0.7\%$ as stated in Theorem 2.
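The quantities discussed above can be computed from stored stage costs as in the following sketch. It assumes that the realized cost (21) is the sum of the stage costs from time $k$ to the end of the roll-out and that the worst-case realized cost (27) is the maximum of the realized cost from time $0$ over the roll-outs of one iteration. The function names and the numbers are hypothetical.

```python
def cost_to_go(stage_costs):
    """Backward cumulative sum: result[k] is the sum of stage_costs[k:]."""
    result, running = [], 0.0
    for h in reversed(stage_costs):
        running += h
        result.append(running)
    return result[::-1]


def worst_case_cost(rollout_stage_costs):
    """Largest realized cost from time 0 over all roll-outs of one iteration."""
    return max(cost_to_go(costs)[0] for costs in rollout_stage_costs)


def relative_change(previous, current):
    """Relative change between the worst-case costs of consecutive iterations."""
    return abs(previous - current) / abs(previous)


# Hypothetical stage costs for three roll-outs of one iteration.
rollouts = [[3.0, 2.0, 1.0], [4.0, 2.0, 0.5], [1.0, 1.0]]
worst = worst_case_cost(rollouts)

# Hypothetical worst-case costs at two consecutive iterations.
converged = relative_change(10.0, 9.95) < 0.007  # 0.7% tolerance from the text
```

In this toy data the second roll-out gives the worst-case cost of $6.5$, and a change from $10.0$ to $9.95$ is within the $0.7\%$ tolerance.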

VI Conclusions

In this paper we proposed a sample-based Learning Model Predictive Controller (LMPC) for linear systems subject to bounded additive uncertainty. First, we used the LMPC policy to construct a safe set and the associated value function. Afterwards, we showed that the proposed strategy guarantees safety and worst-case performance improvement. Finally, we exploited sampled closed-loop trajectories to approximate the safe set and associated value function. We demonstrated the effectiveness of the proposed approach on a numerical example. In particular, we showed that the proposed LMPC is able to safely explore the state space while estimating the value function associated with the control task. Future work will concentrate on finding probability bounds that would allow us to characterize the properties of the approximated safe set and approximated value function as a function of the sampled trajectories.

Bibliography

[1] A. Aswani, H. Gonzalez, S. S. Sastry, and C. Tomlin, “Provably safe and robust learning-based model predictive control,” Automatica, vol. 49, no. 5, pp. 1216–1226, 2013.
[2] J. Kocijan, R. Murray-Smith, C. E. Rasmussen, and A. Girard, “Gaussian process model based predictive control,” in Proceedings of the 2004 American Control Conference, vol. 3. IEEE, 2004, pp. 2214–2219.
[3] T. Koller, F. Berkenkamp, M. Turchetta, and A. Krause, “Learning-based model predictive control for safe exploration,” in 2018 IEEE Conference on Decision and Control (CDC). IEEE, 2018, pp. 6059–6066.
[4] L. Hewing, A. Liniger, and M. N. Zeilinger, “Cautious nmpc with gaussian process dynamics for autonomous miniature race cars,” in 2018 European Control Conference (ECC). IEEE, 2018, pp. 1341–1348.
[5] K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep reinforcement learning in a handful of trials using probabilistic dynamics models,” in Advances in Neural Information Processing Systems, 2018, pp. 4759–4770.
[6] E. Terzi, L. Fagiano, M. Farina, and R. Scattolini, “Learning-based predictive control for linear systems: a unitary approach,” arXiv preprint arXiv:1810.12584, 2018.
[7] U. Rosolia and F. Borrelli, “Learning how to autonomously race a car: a predictive control approach,” arXiv preprint arXiv:1901.08184, 2019.
[8] K. S. Lee and J. H. Lee, “Model predictive control for nonlinear batch processes with asymptotically perfect tracking,” Computers & Chemical Engineering, vol. 21, pp. S873–S879, 1997.