Output-feedback online optimal control for a class of nonlinear systems

Ryan Self; Michael Harlan; Rushikesh Kamalapurkar

arXiv:1903.02078·cs.SY·July 7, 2021


TL;DR

This paper introduces an output-feedback reinforcement learning method for controlling a specific class of nonlinear systems, combining model knowledge with dynamic state estimation to improve control performance.

Contribution

It presents a novel output-feedback MBRL approach that integrates a dynamic state estimator with exact model knowledge for second-order nonlinear systems.

Findings

01

Simulation results confirm the effectiveness of the proposed method.

02

The approach successfully stabilizes the nonlinear systems under study.

Abstract

In this paper an output-feedback model-based reinforcement learning (MBRL) method for a class of second-order nonlinear systems is developed. The control technique uses exact model knowledge and integrates a dynamic state estimator within the model-based reinforcement learning framework to achieve output-feedback MBRL. Simulation results demonstrate the efficacy of the developed method.

Equations

$\dot{p}$

$\dot{q}$

$y$

$J(x(\cdot),u(\cdot))=\int_{0}^{\infty}r(x(\tau),u(\tau))\,d\tau,$

$\dot{y}$

$\dot{q}$

$V_{y}(x)q+V_{q}(x)(f(x)+g(x)u^{*}(x))+r(y,u^{*}(x))=0,$

$u^{*}(x)=-\frac{1}{2}R^{-1}g^{T}(x)V_{q}(x),$

$V_{y}(x)q+V_{q}(x)(f(x)+g(x)u^{*}(x))\leq-Q(x),$

$\mathcal{V}_{y}(x)q+\mathcal{V}_{q}(x)(f(x)+g(x)u^{*}(x))\leq-W(x),$

$\dot{\hat{p}}$

$\dot{\hat{q}}$

$r=\dot{\tilde{p}}+\alpha\tilde{p}+\eta,$

$\dot{\eta}$

$\nu=\alpha^{2}\tilde{p}-(k+\alpha+\beta)\eta.$

$\eta(t)=-\int_{T_{0}}^{t}(\beta+k)\eta(\tau)\,d\tau-\int_{T_{0}}^{t}k\alpha\tilde{p}(\tau)\,d\tau-(k+\alpha)\tilde{p}(t).$

$V(x)\coloneqq\min_{u(\cdot)}\int_{t}^{\infty}r(\phi(\tau,x,u(\cdot)),u(\cdot))\,d\tau.$

$\hat{V}(x,W_{c})$

$\hat{u}(x,W_{a})$

$\delta(\hat{x},W_{c},W_{a})=\hat{V}_{q}(\hat{x},W_{c})(f(\hat{x})+g(\hat{x})\hat{u}(\hat{x},W_{a}))+\hat{V}_{y}(\hat{x},W_{c})\hat{q}+r(\hat{y},\hat{u}(\hat{x},W_{a})),$

$\dot{\hat{W}}_{c}$

$\dot{\Gamma}$

$\dot{W}_{a}=-k_{a1}(W_{a}-W_{c})-k_{a2}W_{a}+\sum_{i=1}^{N}\frac{k_{c}G_{\sigma i}^{T}W_{a}\omega_{i}^{T}}{4N\rho_{i}}W_{c},$

$\underline{c}_{1}I_{L}\leq\inf_{t\in\mathbb{R}_{\geq t_{0}}}\left(\frac{1}{N}\sum_{i=1}^{N}\frac{\omega_{i}(t)\omega_{i}^{T}(t)}{\rho_{i}^{2}(t)}\right),$

$\underline{c}_{2}I_{L}\leq\frac{1}{N}\int_{t}^{t+T}\left(\sum_{i=1}^{N}\frac{\omega_{i}(\tau)\omega_{i}^{T}(\tau)}{\rho_{i}^{2}(\tau)}\right)d\tau,\ \forall t\in\mathbb{R}_{\geq t_{0}},$

$\underline{\Gamma}I_{L}\leq\Gamma(t)\leq\overline{\Gamma}I_{L},$

$\delta_{ti}$

$\dot{\mathcal{V}}(x,t)=\mathcal{V}_{y}(x)q+\mathcal{V}_{q}(x)(f(x)+g(\hat{x})\hat{u}(\hat{x},W_{a}))$

$\dot{\mathcal{V}}(x,t)=\mathcal{V}_{y}(x)q+\mathcal{V}_{q}(x)(f(x)+g(x)u^{*}(x))+\mathcal{V}_{q}(x)(g(\hat{x})\hat{u}(\hat{x},W_{a})-g(x)u^{*}(x))$

$\dot{\mathcal{V}}(x,t)\leq-W(x)+\iota_{1}\overline{\epsilon}+\iota_{2}\|\tilde{x}\|\|\tilde{W}_{a}\|+\iota_{3}\|\tilde{W}_{a}\|+\iota_{4}\|\tilde{x}\|,$

$\dot{\Theta}(\tilde{W}_{c},\tilde{W}_{a},t)=-\tilde{W}_{c}^{T}\Gamma^{-1}\left(-\frac{k_{c}}{N}\Gamma\sum_{i=1}^{N}\frac{\omega_{i}}{\rho_{i}}\delta_{ti}\right)-\frac{1}{2}\tilde{W}_{c}^{T}\left(\beta\Gamma^{-1}-\frac{k_{c}}{N}\sum_{i=1}^{N}\frac{\omega_{i}\omega_{i}^{T}}{\rho_{i}^{2}}\right)\tilde{W}_{c}-\tilde{W}_{a}^{T}\left(-k_{a1}(W_{a}-W_{c})-k_{a2}W_{a}+\sum_{i=1}^{N}\frac{k_{c}G_{\sigma i}^{T}W_{a}\omega_{i}^{T}}{4N\rho_{i}}W_{c}\right)$

$\dot{\Theta}(\tilde{W}_{c},\tilde{W}_{a},t)\leq-k_{c}\underline{c}\|\tilde{W}_{c}\|^{2}-(k_{a1}+k_{a2})\|\tilde{W}_{a}\|^{2}+k_{c}\iota_{8}\overline{\epsilon}\|\tilde{W}_{c}\|+k_{c}\iota_{5}\|\tilde{W}_{a}\|^{2}+(k_{c}\iota_{6}+k_{a1})\|\tilde{W}_{c}\|\|\tilde{W}_{a}\|+(k_{c}\iota_{7}+k_{a2}\overline{W})\|\tilde{W}_{a}\|,$

$\dot{\Phi}(\tilde{p},r,\eta,t)=\alpha^{2}\tilde{p}^{T}(r-\alpha\tilde{p}-\eta)+\eta^{T}(-\beta\eta-kr-\alpha\tilde{q})+r^{T}(\tilde{f}(x,\hat{x})+\tilde{g}(x,\hat{x})\hat{u}(\hat{x},W_{a})-\alpha^{2}\tilde{p}-kr+k\eta+\alpha\eta),$

$\dot{\Phi}(\tilde{p},r,\eta,t)\leq-\alpha^{3}\|\tilde{p}\|^{2}-(k-\varpi_{1})\|r\|^{2}-(\beta-\alpha)\|\eta\|^{2}+\varpi_{1}(1+\alpha)\|r\|\|\tilde{p}\|+\varpi_{1}\|r\|\|\eta\|+\varpi_{3}\|r\|+\varpi_{2}\|r\|\|\tilde{W}_{a}\|$


Full text


Ryan Self, Michael Harlan, and Rushikesh Kamalapurkar

The authors are with the School of Mechanical and Aerospace Engineering, Oklahoma State University, Stillwater, OK, USA. Email: {rself, michael.c.harlan, rushikesh.kamalapurkar}@okstate.edu.


I Introduction

Over the past decade, online reinforcement learning algorithms that guarantee stability during the learning phase have been developed for deterministic systems [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]; however, stability and convergence are established under restrictive persistence of excitation (PE) conditions which are difficult, if not impossible, to verify. To soften the PE condition, data-driven methods that employ experience replay have been utilized in results such as [16, 17, 18, 19, 20, 21, 13]; however, since the data is collected along the system trajectory, added exploration signals are often required to achieve convergence. The need for PE and exploration signals is a result of sample inefficiency, and is a significant drawback of the existing model-free RL-based online optimal control methods.

Model-based reinforcement learning (MBRL) algorithms learn a model of the system from observations using supervised learning and employ the model to learn the policies. Several different MBRL approaches have been developed in the literature over the last few decades. Imaginary roll-outs, i.e., the use of a model as a proxy for the real world to evaluate temporal difference errors (referred to as Bellman errors (BEs) in this paper), are explored in results such as [22] and [23]. While the sample efficiency of the policy learning algorithms is improved, the performance of the method in [22] decays rapidly with model mismatch, and the method in [23] relies on fitting neural networks to dynamics, which is typically data-intensive, nullifying the sample efficiency gain in the policy learning algorithm.

Policy gradient methods that rely on backpropagation through the model to compute the gradient of the state or the action value function with respect to the policy parameters are developed in results such as [24, 25, 26, 27]; however, policy gradient methods are often iterative in nature and typically do not study stability during the learning phase, and as a result, are not suitable for real-time simultaneous learning and execution. MBRL methods with provable sample efficiency bounds have been developed in results such as [28, 29, 30, 31, 32, 33, 34]; however, the theoretical guarantees are obtained under discretization of the continuous state space into finitely many discrete states and a finite action space, and as such, are not directly applicable to systems with continuous state and action spaces.

The MBRL technique developed by the authors in [35, 36, 37, 38, 39, 40] for continuous time and continuous space systems softens the excitation requirements used in results such as [1, 2, 41, 42, 3, 4, 5, 43, 44, 6, 7, 8, 9, 10, 11, 12, 45, 13, 14, 15] by utilizing a model of the system to simulate exploration, where the stability and the performance of the closed-loop system critically depend on the accuracy of the estimated model. A significant drawback of the online optimal control methods mentioned so far is that they require full state measurements.

While model-based and model-free reinforcement learning can be achieved using output feedback instead of state feedback by making use of partially observable Markov decision processes (POMDPs), in general, POMDPs are undecidable if the objective is to find an optimal solution, and finding a near-optimal solution can also be NP-hard [46, 47]. In this paper, the problem is formulated as a state estimation based reinforcement learning problem, and for a specific class of systems, an online solution is obtained that guarantees stability during the learning phase.

A recent result in [48] presents an offline model-free algorithm for linear systems to achieve optimality using output feedback. The objective in this paper is to develop an output-feedback model-based reinforcement learning method for a class of nonlinear systems under exact model knowledge. While the developed results can be extended to systems with uncertain models using model-learning methods such as [49], such an extension is not a focus of this work.

The paper is organized as follows. A detailed description of the problem under consideration is provided in Section II. To facilitate the subsequent analysis of the developed technique, Section III examines the stability properties of optimal controllers under semidefinite cost functions for the class of systems under consideration. Section IV describes the state estimator used in the design. Section V describes the developed MBRL method. Section VI presents a Lyapunov-based stability analysis, Section VII presents simulation results, and Section VIII concludes the paper.

II Problem Description

Consider a second order nonlinear system of the form

$$\dot{p}=q,\qquad\dot{q}=f\left(x\right)+g\left(x\right)u,\qquad y=p, \tag{1}$$

where $p\in\mathbb{R}^{n}$ and $q\in\mathbb{R}^{n}$ denote the generalized position states and the generalized velocity states, respectively, $x\coloneqq\begin{bmatrix}p^{T}&q^{T}\end{bmatrix}^{T}$ is the system state, $f:\mathbb{R}^{2n}\to\mathbb{R}^{n}$ is locally Lipschitz continuous, $f\left(0\right)=0$ , and $y\in\mathbb{R}^{n}$ denotes the output. Consistent with the exact model knowledge assumed throughout, both the drift dynamics $f$ and the control effectiveness $g:\mathbb{R}^{2n}\to\mathbb{R}^{n\times m}$ are known, and $g$ is locally Lipschitz. Systems of the form (1) encompass second-order linear systems and Euler-Lagrange models with known inertia matrices, and hence, represent a wide class of physical plants, including, but not limited to, robotic manipulators and autonomous ground, aerial, and underwater vehicles.

The objective is to design an adaptive estimator to estimate the state $x$ , online, using input-output measurements and to simultaneously estimate and utilize the optimal feedback controller that minimizes the cost functional

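$$J\left(x\left(\cdot\right),u\left(\cdot\right)\right)=\int_{0}^{\infty}r\left(x\left(\tau\right),u\left(\tau\right)\right)d\tau, \tag{2}$$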

while maintaining system stability during the learning phase. The function $r:\mathbb{R}^{2n}\times\mathbb{R}^{m}\to\mathbb{R}$ is defined as $r\left(x,u\right)\coloneqq Q\left(x\right)+u^{T}Ru$ , where $Q:\mathbb{R}^{2n}\to\mathbb{R}$ is continuous, $R\in\mathbb{R}^{m\times m}$ is a constant positive definite matrix, and $\gamma\geq 0$ is the discount factor.

Assumption 1.

One of the following is true:

(a) $Q$ is positive definite.

(b) $Q$ is positive semidefinite and $p\mapsto Q\left(x\right)$ is positive definite for all nonzero $q\in\mathbb{R}^{n}$ .

(c) $Q$ is positive semidefinite, $q\mapsto Q\left(x\right)$ is positive definite for all nonzero $p\in\mathbb{R}^{n}$ , and $f\left(x\right)\neq 0$ whenever $p\neq 0$ .

To facilitate control design, the stability properties of the closed-loop system under optimal feedback are examined.

III Stability Under Optimal State Feedback

The following theorem establishes global asymptotic stability of the closed-loop system under optimal state feedback.

Theorem 1.

If the optimal state feedback controller $u^{*}:\mathbb{R}^{2n}\to\mathbb{R}^{m}$ that minimizes the cost functional in (2) exists and if the corresponding optimal value function $V:\mathbb{R}^{2n}\to\mathbb{R}$ is continuously differentiable and radially unbounded, then the origin of the closed-loop system

$$\dot{y}=q,\qquad\dot{q}=f\left(x\right)+g\left(x\right)u^{*}\left(x\right) \tag{3}$$

is globally asymptotically stable.

Proof.

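Under the hypothesis of Theorem 1, the optimal value function is the unique solution of the Hamilton-Jacobi-Bellman equation [50, p. 164]

$$V_{y}\left(x\right)q+V_{q}\left(x\right)\left(f\left(x\right)+g\left(x\right)u^{*}\left(x\right)\right)+r\left(y,u^{*}\left(x\right)\right)=0, \tag{4}$$

with

$$u^{*}\left(x\right)=-\frac{1}{2}R^{-1}g^{T}\left(x\right)V_{q}\left(x\right),$$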

where the notation $x_{y}$ denotes the partial derivative of $x$ with respect to $y$ . The function $V$ is positive semidefinite by definition. Since the solutions of (3) are continuous, if $V\left(\begin{bmatrix}y\\ q\end{bmatrix}\right)=0$ for some $x\neq 0$ , it can be concluded that $Q\left(\phi\left(t;x,u^{*}\left(\cdot\right)\right)\right)=0,\forall t\geq 0$ , and $u^{*}\left(\phi\left(t;x,u^{*}\left(\cdot\right)\right)\right)=0,\forall t\geq 0$ , where $\phi\left(t,x,u\left(\cdot\right)\right)$ denotes the trajectory of (1), evaluated at time $t$ , starting from the state $x$ and under the controller $u\left(\cdot\right)$ . If Assumption 1-(a) holds then $\phi\left(t;x,u^{*}\left(\cdot\right)\right)=0,\forall t\geq 0$ , which contradicts $x\neq 0$ . If Assumption 1-(b) holds, then $p\left(t;x,u^{*}\left(\cdot\right)\right)=0,\forall t\geq 0$ . As a result, $\phi\left(t;x,u^{*}\left(\cdot\right)\right)=0,\forall t\geq 0$ , which contradicts $x\neq 0$ . If Assumption 1-(c) holds, then $q\left(t;x,u^{*}\left(\cdot\right)\right)=0,\forall t\geq 0$ . As a result, $p\left(t;x,u^{*}\left(\cdot\right)\right)=c,\forall t\geq 0$ for some constant $c\in\mathbb{R}^{n}$ . Since $f\left(x\right)\neq 0$ if $p\neq 0$ , it can be concluded that $c=0$ , which contradicts $x\neq 0$ . Hence, $V\left(x\right)$ cannot be zero for a nonzero $x$ . Furthermore, since $f\left(0\right)=0$ , the zero controller is clearly the optimal controller starting from $x=0$ . That is, $V\left(0\right)=0$ , and as a result, $V:\mathbb{R}^{2n}\to\mathbb{R}$ is positive definite.

Using $V$ as a candidate Lyapunov function and using the HJB equation in (4), it can be concluded that

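$$V_{y}\left(x\right)q+V_{q}\left(x\right)\left(f\left(x\right)+g\left(x\right)u^{*}\left(x\right)\right)\leq-Q\left(x\right),$$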

$\forall x\in\mathbb{R}^{2n}.$ If Assumption 1-(a) holds, then the proof is complete using Lyapunov’s direct method. If Assumption 1-(b) holds, then using the fact that if the output is identically zero then so is the state, the invariance principle [51, Corollary 4.2] can be invoked to complete the proof. If Assumption 1-(c) holds, then finiteness of the value function everywhere implies that the origin is the only equilibrium point of the closed-loop system. As a result, the invariance principle can be invoked to complete the proof. ∎
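The role of the HJB equation in the proof can be illustrated numerically. The following is a minimal sketch on a hypothetical scalar-position system (the functions `f` and `g`, the choice $Q(x)=q^{2}$, $R=1$, and the candidate $V(x)=p^{2}+q^{2}$ are assumptions made for illustration, not quantities quoted from the text). It checks that the HJB residual vanishes on a grid and that the derivative of $V$ along the closed loop never exceeds $-Q(x)$.

```python
import math

# Hypothetical example system (an assumption for illustration):
# p' = q, q' = f(x) + g(x) u, with Q(x) = q^2 and R = 1.
def g(p, q):
    return math.cos(2.0 * p) + 2.0

def f(p, q):
    return -p - 0.5 * q * (1.0 - g(p, q) ** 2)

def value_gradient(p, q):
    # Candidate value function V(x) = p^2 + q^2, so V_y = 2p and V_q = 2q.
    return 2.0 * p, 2.0 * q

def u_star(p, q, R=1.0):
    # u* = -(1/2) R^{-1} g^T V_q, written for the scalar-input case.
    _, V_q = value_gradient(p, q)
    return -0.5 * g(p, q) * V_q / R

def hjb_residual(p, q, R=1.0):
    # Left-hand side of the HJB equation; zero if V is the optimal value function.
    V_y, V_q = value_gradient(p, q)
    u = u_star(p, q, R)
    return V_y * q + V_q * (f(p, q) + g(p, q) * u) + q ** 2 + R * u ** 2

def value_rate(p, q):
    # Time derivative of V along the closed-loop trajectories.
    V_y, V_q = value_gradient(p, q)
    return V_y * q + V_q * (f(p, q) + g(p, q) * u_star(p, q))

grid = [(-2.0 + i, -2.0 + j) for i in range(5) for j in range(5)]
max_residual = max(abs(hjb_residual(p, q)) for p, q in grid)
# V_dot + Q(x) should be nonpositive everywhere (it equals -u*^T R u*).
max_slack = max(value_rate(p, q) + q ** 2 for p, q in grid)
print(max_residual, max_slack)
```

Since $Q(x)=q^{2}$ is only positive semidefinite in this example, the decrease of $V$ alone does not give asymptotic stability, which is why the invariance principle is needed under Assumption 1-(b) and 1-(c).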

Using Theorem 1 and the converse Lyapunov theorem for asymptotic stability [51, Theorem 4.17], the existence of a radially unbounded positive definite function $\mathcal{V}:\mathbb{R}^{2n}\to\mathbb{R}$ and a positive definite function $W:\mathbb{R}^{2n}\to\mathbb{R}$ is guaranteed such that

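$$\mathcal{V}_{y}\left(x\right)q+\mathcal{V}_{q}\left(x\right)\left(f\left(x\right)+g\left(x\right)u^{*}\left(x\right)\right)\leq-W\left(x\right), \tag{6}$$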

$\forall x\in\mathbb{R}^{2n}$ . The functions $\mathcal{V}$ and $W$ are utilized to analyze the stability of the output feedback approximate optimal controller.

IV Velocity Estimator Design

To generate estimates of the generalized velocity, a velocity estimator inspired by [52] is developed. The estimator is given by

$$\dot{\hat{p}}=\hat{q},\qquad\dot{\hat{q}}=f\left(\hat{x}\right)+g\left(\hat{x}\right)u+\nu, \tag{7}$$

where $\hat{x}$ , $\hat{p}$ , and $\hat{q}$ are estimates of $x$ , $p$ , and $q$ , respectively, and $\nu$ is a feedback term designed in the following.

To facilitate the design of $\nu$ , let $\tilde{p}=p-\hat{p}$ , $\tilde{q}=q-\hat{q}$ , and let

$$r=\dot{\tilde{p}}+\alpha\tilde{p}+\eta, \tag{8}$$

where the signal $\eta$ is added to compensate for the fact that the generalized velocity state, $q$ , is not measurable. Based on the subsequent stability analysis, the signal $\eta$ is designed as the output of the dynamic filter

$$\dot{\eta}=-\beta\eta-kr-\alpha\tilde{q}, \tag{9}$$

where $\alpha,$ $k,$ and $\beta$ are positive constants and the feedback component $\nu$ is designed as

$$\nu=\alpha^{2}\tilde{p}-\left(k+\alpha+\beta\right)\eta. \tag{10}$$

The design of the signals $\eta$ and $\nu$ to estimate the state from output measurements is inspired by the $p-$ filter [53]. Using the fact that $\tilde{p}\left(0\right)=0$ , the signal $\eta$ can be implemented via the integral form

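$$\eta\left(t\right)=-\int_{T_{0}}^{t}\left(\beta+k\right)\eta\left(\tau\right)d\tau-\int_{T_{0}}^{t}k\alpha\tilde{p}\left(\tau\right)d\tau-\left(k+\alpha\right)\tilde{p}\left(t\right). \tag{11}$$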
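The integral form makes the estimator implementable from position measurements alone, since only $\tilde{p}$ and the running integrals are needed. The following is a minimal sketch under stated assumptions: the linear plant `f`, the constant input gain `G`, the gain values, and the forward-Euler discretization are all illustrative choices, and the estimator is taken to be $\dot{\hat{p}}=\hat{q}$, $\dot{\hat{q}}=f(\hat{x})+g(\hat{x})u+\nu$ with $\nu=\alpha^{2}\tilde{p}-(k+\alpha+\beta)\eta$.

```python
# Hypothetical linear plant used only for illustration: p' = q, q' = f(x) + G u.
def f(p, q):
    return -p - q

G = 1.0                          # constant control effectiveness (assumed)
ALPHA, K, BETA = 1.0, 5.0, 5.0   # estimator gains (illustrative values)

def simulate(t_final=10.0, dt=1e-3):
    p, q = 1.0, 1.0              # true state; q is not measured
    p_hat, q_hat = p, -1.0       # p_hat(0) = p(0), so that p_tilde(0) = 0
    z = 0.0                      # running value of the two integrals in the integral form
    u = 0.0                      # any known input works; zero keeps the sketch short
    q_err0 = abs(q - q_hat)
    for _ in range(int(round(t_final / dt))):
        p_tilde = p - p_hat                      # available because y = p is measured
        eta = z - (K + ALPHA) * p_tilde          # integral form of the filter output
        nu = ALPHA ** 2 * p_tilde - (K + ALPHA + BETA) * eta
        # Forward-Euler derivatives of plant, estimator, and filter integrals.
        dp, dq = q, f(p, q) + G * u
        dp_hat, dq_hat = q_hat, f(p_hat, q_hat) + G * u + nu
        dz = -(BETA + K) * eta - K * ALPHA * p_tilde
        p, q = p + dt * dp, q + dt * dq
        p_hat, q_hat = p_hat + dt * dp_hat, q_hat + dt * dq_hat
        z += dt * dz
    return q_err0, abs(q - q_hat), abs(p - p_hat)

q_err0, q_err, p_err = simulate()
print(q_err0, q_err, p_err)
```

The velocity estimate starts with an error of 2 and is driven toward the true velocity without ever using $q$ or $\tilde{q}$ in the update, which is the purpose of the filter signal $\eta$.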

V Model-based Reinforcement Learning

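To estimate the optimal state feedback policy, consider the optimal value function, defined as

$$V\left(x\right)\coloneqq\min_{u\left(\cdot\right)}\int_{t}^{\infty}r\left(\phi\left(\tau,x,u\left(\cdot\right)\right),u\left(\cdot\right)\right)d\tau. \tag{12}$$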

The optimal value function $V$ and the optimal policy $u^{*}$ are approximated using parametric approximators $\hat{V}:\mathbb{R}^{2n}\times\mathbb{R}^{L}\to\mathbb{R}$ and $\hat{u}:\mathbb{R}^{2n}\times\mathbb{R}^{L}\to\mathbb{R}^{m}$ defined as

[TABLE]

where $\sigma\coloneqq\left[\sigma_{1},\cdots,\sigma_{L}\right]^{T}$ is the vector of basis functions, with $\sigma_{i}:\mathbb{R}^{2n}\to\mathbb{R}$ for all $i$ , and $W_{c}\in\mathbb{R}^{L}$ and $W_{a}\in\mathbb{R}^{L}$ are estimates of the ideal parameters $W\in\mathbb{R}^{L}$ . The corresponding approximation error $\epsilon:\mathbb{R}^{2n}\to\mathbb{R}$ is defined as $\epsilon\left(x\right)\coloneqq V\left(x\right)-\hat{V}\left(x,W\right)$ . Provided the basis functions are selected from an appropriate class of functions, for any given compact ball $\overline{\operatorname{B}}\left(0,\chi\right)\subset\mathbb{R}^{2n}$ and any given $\overline{\epsilon}>0$ , there exist $L\in\mathbb{N}$ , a set of basis functions $\left\{\sigma_{1},\cdots,\sigma_{L}\right\}$ , and $W\in\mathbb{R}^{L}$ such that $\overline{\left\|\epsilon\right\|}_{\chi}<\overline{\epsilon}$ and $\overline{\left\|\epsilon_{x}\right\|}_{\chi}<\overline{\epsilon}$ , where $\overline{\left\|\epsilon\right\|}_{\chi}$ denotes $\sup_{x\in\overline{\operatorname{B}}\left(0,\chi\right)}\left\|\epsilon\left(x\right)\right\|$ (see [54, 55, 56]).

Substituting the estimates $\hat{V}$ , $\hat{u}$ , and $\hat{x}$ in (4), the Bellman error $\delta:\mathbb{R}^{2n}\times\mathbb{R}^{L}\times\mathbb{R}^{L}\to\mathbb{R}$ is obtained as

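$$\delta\left(\hat{x},W_{c},W_{a}\right)=\hat{V}_{q}\left(\hat{x},W_{c}\right)\left(f\left(\hat{x}\right)+g\left(\hat{x}\right)\hat{u}\left(\hat{x},W_{a}\right)\right)+\hat{V}_{y}\left(\hat{x},W_{c}\right)\hat{q}+r\left(\hat{y},\hat{u}\left(\hat{x},W_{a}\right)\right). \tag{14}$$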

Similar to [36], the technique developed in this result implements simulation of experience in a model-based RL scheme by using the system model to extrapolate the approximate BE to unexplored areas of the state space. In the following, the trajectories of the state and the weight estimates $W_{c}$ and $W_{a}$ , evaluated at time $t$ starting from appropriate initial conditions, are denoted by $x\left(t\right)$ , $W_{c}\left(t\right)$ , and $W_{a}\left(t\right)$ , respectively. For $a\in\mathbb{R}$ , the notation $\mathbb{R}_{\geq a}$ denotes the interval $\left[a,\infty\right)$ and the notation $\mathbb{R}_{>a}$ denotes the interval $\left(a,\infty\right)$ . The notation $\delta_{t}:\mathbb{R}_{\geq 0}\to\mathbb{R}$ denotes the BE in (14), evaluated along the trajectories of the state estimates and the weight estimates as $\delta_{t}\left(t\right)\coloneqq\delta\left(\hat{x}\left(t\right),\hat{W}_{c}\left(t\right),\hat{W}_{a}\left(t\right)\right)$ , and $\delta_{ti}:\mathbb{R}_{\geq 0}\to\mathbb{R}$ denotes the BE extrapolated along the trajectories of the weight estimates and a predefined set of trajectories $\left\{x_{i}:\mathbb{R}_{\geq 0}\to\mathbb{R}^{2n}\mid i=1,\cdots,N\right\}$ as $\delta_{ti}\left(t\right)\coloneqq\delta\left(x_{i}\left(t\right),\hat{W}_{c}\left(t\right),\hat{W}_{a}\left(t\right)\right)$ . A least-squares update law for the value function weights is designed based on the subsequent stability analysis as

$$\dot{\hat{W}}_{c}=-\frac{k_{c}}{N}\Gamma\sum_{i=1}^{N}\frac{\omega_{i}}{\rho_{i}}\delta_{ti}, \tag{15}$$

$$\dot{\Gamma}=\beta\Gamma-\frac{k_{c}}{N}\Gamma\sum_{i=1}^{N}\frac{\omega_{i}\omega_{i}^{T}}{\rho_{i}^{2}}\Gamma, \tag{16}$$

$\Gamma\left(t_{0}\right)=\Gamma_{0},$ where $\Gamma:\mathbb{R}_{\geq t_{0}}\to\mathbb{R}^{L\times L}$ is a time-varying least-squares gain matrix, $\omega_{i}\left(t\right)\coloneqq\sigma_{p}\left(x_{i}\left(t\right)\right)q_{i}\left(t\right)+\sigma_{q}\left(x_{i}\left(t\right)\right)\left(f\left(x_{i}\left(t\right)\right)+g\left(x_{i}\left(t\right)\right)\hat{u}\left(x_{i}\left(t\right),W_{a}\left(t\right)\right)\right),$ $\rho_{i}\left(t\right)\coloneqq 1+\gamma_{1}\omega_{i}^{T}\left(t\right)\omega_{i}\left(t\right)$ , $\gamma_{1}\in\mathbb{R}_{>0}$ is a constant normalization gain, $\beta\in\mathbb{R}_{>0}$ is a constant forgetting factor, and $k_{c}\in\mathbb{R}_{>0}$ is a constant adaptation gain.
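To make the extrapolated Bellman error and the critic update concrete, the sketch below evaluates both on a hypothetical scalar-position example. Everything specific is an assumption for illustration: the functions `f` and `g`, the basis $\sigma(x)=[p^{2},pq,q^{2}]^{T}$, the actor form $\hat{u}=-\frac{1}{2}R^{-1}g^{T}\sigma_{q}^{T}W_{a}$, the grid spacing, the gain values, and the critic rate $-\frac{k_{c}}{N}\Gamma\sum_{i}\frac{\omega_{i}}{\rho_{i}}\delta_{ti}$, which is inferred from the stability analysis rather than quoted.

```python
import math

# Hypothetical example (assumed): Q(x) = q^2, R = 1, basis sigma = [p^2, p*q, q^2].
# For this choice the weights W = [1, 0, 1] satisfy the HJB equation exactly.
R = 1.0

def g(p, q):
    return math.cos(2.0 * p) + 2.0

def f(p, q):
    return -p - 0.5 * q * (1.0 - g(p, q) ** 2)

def sigma_p(p, q):   # partial derivative of the basis with respect to p
    return [2.0 * p, q, 0.0]

def sigma_q(p, q):   # partial derivative of the basis with respect to q
    return [0.0, p, 2.0 * q]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def policy(p, q, W_a):
    # Assumed actor form: u_hat = -(1/2) R^{-1} g^T sigma_q^T W_a.
    return -0.5 * g(p, q) * dot(sigma_q(p, q), W_a) / R

def regressor(p, q, W_a):
    # omega = sigma_p q + sigma_q (f + g u_hat)
    drift = f(p, q) + g(p, q) * policy(p, q, W_a)
    return [sp * q + sq * drift for sp, sq in zip(sigma_p(p, q), sigma_q(p, q))]

def bellman_error(p, q, W_c, W_a):
    # delta = V_hat_q (f + g u_hat) + V_hat_y q + r = W_c^T omega + q^2 + R u_hat^2
    u = policy(p, q, W_a)
    return dot(W_c, regressor(p, q, W_a)) + q ** 2 + R * u ** 2

def critic_rate(points, W_c, W_a, Gamma, k_c=0.2, gamma_1=0.005):
    # Assumed critic law: W_c_dot = -(k_c / N) Gamma sum_i (omega_i / rho_i) delta_i.
    acc = [0.0, 0.0, 0.0]
    for p, q in points:
        w = regressor(p, q, W_a)
        rho = 1.0 + gamma_1 * dot(w, w)
        d = bellman_error(p, q, W_c, W_a)
        acc = [a + wi * d / rho for a, wi in zip(acc, w)]
    n = len(points)
    return [-(k_c / n) * dot(row, acc) for row in Gamma]

points = [(-2.0 + i, -2.0 + j) for i in range(5) for j in range(5)]  # 5 x 5 grid
Gamma = [[50.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
ideal = [1.0, 0.0, 1.0]
guess = [0.5, 0.5, 0.5]

delta_guess = bellman_error(1.0, 1.0, guess, guess)   # nonzero for wrong weights
rate_ideal = critic_rate(points, ideal, ideal, Gamma)  # zero at the ideal weights
print(delta_guess, rate_ideal)
```

Because the extrapolation points are chosen by the designer, the Bellman error can be driven to zero over the whole grid without the system ever visiting those states, which is the sense in which the model is used to simulate experience.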

The policy weights are updated to follow the value function weights as

$$\dot{W}_{a}=-k_{a1}\left(W_{a}-W_{c}\right)-k_{a2}W_{a}+\sum_{i=1}^{N}\frac{k_{c}G_{\sigma i}^{T}W_{a}\omega_{i}^{T}}{4N\rho_{i}}W_{c}, \tag{17}$$

where $k_{a1},\>k_{a2}\in\mathbb{R}$ are positive constant adaptation gains, $G_{\sigma i}\left(t\right)\coloneqq\sigma_{xi}\left(t\right)g_{i}\left(t\right)R^{-1}g_{i}^{T}\left(t\right)\sigma_{xi}^{T}\left(t\right)\in\mathbb{R}^{L\times L}$ , $g_{i}\left(t\right)=g\left(x_{i}\left(t\right)\right)$ and $\sigma_{xi}\left(t\right)=\sigma_{x}\left(x_{i}\left(t\right)\right)$ . The following rank condition facilitates the subsequent analysis.

Assumption 2.

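There exists a finite set of trajectories $\left\{x_{i}:\mathbb{R}_{\geq t_{0}}\to\mathbb{R}^{2n}\mid i=1,\cdots,N\right\}$ and a constant $T\in\mathbb{R}_{>0}$ such that

$$\underline{c}_{1}I_{L}\leq\inf_{t\in\mathbb{R}_{\geq t_{0}}}\left(\frac{1}{N}\sum_{i=1}^{N}\frac{\omega_{i}\left(t\right)\omega_{i}^{T}\left(t\right)}{\rho_{i}^{2}\left(t\right)}\right), \tag{18}$$

$$\underline{c}_{2}I_{L}\leq\frac{1}{N}\int_{t}^{t+T}\left(\sum_{i=1}^{N}\frac{\omega_{i}\left(\tau\right)\omega_{i}^{T}\left(\tau\right)}{\rho_{i}^{2}\left(\tau\right)}\right)d\tau,\quad\forall t\in\mathbb{R}_{\geq t_{0}}, \tag{19}$$

where at least one of the nonnegative constants $\underline{c}_{1}$ and $\underline{c}_{2}$ is strictly positive.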

The rank conditions in (18) and (19) depend on the estimates $W_{a}$ ; hence, in general, they are impossible to guarantee a priori. However, unlike traditional adaptive dynamic programming literature, which assumes that a regressor similar to $\omega_{i}$ evaluated along the system trajectories is PE, Assumption 2 only requires the extrapolated regressor $\omega_{i}$ to be persistently exciting. When the regressor is evaluated along the system state $x$ , excitation in the regressor vanishes as the system states converge. Hence, in general, it is unlikely that a regressor evaluated along the system trajectories will be PE. However, the regressor $\omega_{i}$ depends on $x_{i}$ , which can be designed independently of the system state $x$ . Hence, $\underline{c}_{2}$ can be made strictly positive if the signal $x_{i}$ contains enough frequencies, and $\underline{c}_{1}$ can be made strictly positive by selecting a sufficient number of extrapolation trajectories, i.e., $N\gg L$ . It is established in [38, Lemma 1] that under Assumption 2 and provided $\lambda_{\min}\left\{\Gamma_{0}^{-1}\right\}>0$ , the update law in (16) ensures that the least squares gain matrix satisfies

$$\underline{\Gamma}I_{L}\leq\Gamma\left(t\right)\leq\overline{\Gamma}I_{L}, \tag{20}$$

$\forall t\in\mathbb{R}_{\geq 0}$ and for some $\overline{\Gamma},\underline{\Gamma}>0$ .
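Whether a given set of extrapolation points satisfies the rank condition in (18) at a fixed time can be checked directly. The following is a minimal sketch on a hypothetical example (the functions `f` and `g`, the quadratic basis, the actor form, the grid spacing, and the normalization gain are assumptions for illustration). It forms the normalized sum of outer products of the regressors for a fixed $W_{a}$ and tests positive definiteness with Sylvester's criterion.

```python
import math

# Hypothetical example system and basis (assumed for illustration only).
def g(p, q):
    return math.cos(2.0 * p) + 2.0

def f(p, q):
    return -p - 0.5 * q * (1.0 - g(p, q) ** 2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def regressor(p, q, W_a):
    # Basis sigma = [p^2, p*q, q^2]; assumed actor u_hat = -(1/2) g sigma_q^T W_a (R = 1).
    sigma_p = [2.0 * p, q, 0.0]
    sigma_q = [0.0, p, 2.0 * q]
    u_hat = -0.5 * g(p, q) * dot(sigma_q, W_a)
    drift = f(p, q) + g(p, q) * u_hat
    return [sp * q + sq * drift for sp, sq in zip(sigma_p, sigma_q)]

def excitation_matrix(points, W_a, gamma_1=0.005):
    # (1/N) sum_i omega_i omega_i^T / rho_i^2 at one time instant.
    n = len(points)
    M = [[0.0] * 3 for _ in range(3)]
    for p, q in points:
        w = regressor(p, q, W_a)
        rho = 1.0 + gamma_1 * dot(w, w)
        for a in range(3):
            for b in range(3):
                M[a][b] += w[a] * w[b] / (rho ** 2 * n)
    return M

def leading_minors(M):
    # Sylvester's criterion: a symmetric matrix is positive definite
    # if and only if all leading principal minors are positive.
    m1 = M[0][0]
    m2 = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    m3 = (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
          - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
          + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))
    return m1, m2, m3

points = [(-2.0 + i, -2.0 + j) for i in range(5) for j in range(5)]  # N = 25 > L = 3
minors = leading_minors(excitation_matrix(points, [0.5, 0.5, 0.5]))
single = leading_minors(excitation_matrix([(1.0, 1.0)], [0.5, 0.5, 0.5]))
print(minors, single)
```

With a single extrapolation point the matrix has rank one, so the determinant is zero up to rounding, which illustrates why $N$ must be at least $L$ and is typically chosen much larger.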

VI Analysis

The approximate BE, evaluated along the selected trajectories $\left\{x_{i}\mid i=1,\cdots,N\right\}$ , can be expressed as

[TABLE]

where $\nabla\epsilon_{i}=\nabla\epsilon\left(x_{i}\right)$ , $f_{i}=f\left(x_{i}\right)$ , $G_{i}\coloneqq g_{i}R^{-1}g_{i}^{T}\in\mathbb{R}^{n\times n}$ , $\Delta_{i}\coloneqq\frac{1}{2}W^{T}\nabla\sigma_{i}G_{i}\nabla\epsilon_{i}^{T}+\frac{1}{4}G_{\epsilon i}-\nabla\epsilon_{i}f_{i}\in\mathbb{R}$ is a constant, $G_{\epsilon i}\coloneqq\nabla\epsilon_{i}G_{i}\nabla\epsilon_{i}^{T}\in\mathbb{R}$ , and $G_{\sigma i}$ was introduced in (17). Using (21), the time-derivative of the Lyapunov function introduced in (6) along the trajectories of (1) under the controller $u\left(t\right)=\hat{u}\left(\hat{x}\left(t\right),W_{a}\left(t\right)\right)$ is given by

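$$\dot{\mathcal{V}}\left(x,t\right)=\mathcal{V}_{y}\left(x\right)q+\mathcal{V}_{q}\left(x\right)\left(f\left(x\right)+g\left(\hat{x}\right)\hat{u}\left(\hat{x},W_{a}\right)\right).$$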

Adding and subtracting $\mathcal{V}_{q}\left(x\right)\left(g\left(x\right)u^{*}\left(x\right)\right)$ ,

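$$\dot{\mathcal{V}}\left(x,t\right)=\mathcal{V}_{y}\left(x\right)q+\mathcal{V}_{q}\left(x\right)\left(f\left(x\right)+g\left(x\right)u^{*}\left(x\right)\right)+\mathcal{V}_{q}\left(x\right)\left(g\left(\hat{x}\right)\hat{u}\left(\hat{x},W_{a}\right)-g\left(x\right)u^{*}\left(x\right)\right).$$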

Using (6), the fact that $g$ is bounded, the basis functions $\sigma$ are bounded, and that the value function approximation error $\epsilon$ and its derivative with respect to $x$ are bounded on compact sets, the time-derivative can be bounded as

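$$\dot{\mathcal{V}}\left(x,t\right)\leq-W\left(x\right)+\iota_{1}\overline{\epsilon}+\iota_{2}\left\|\tilde{x}\right\|\left\|\tilde{W}_{a}\right\|+\iota_{3}\left\|\tilde{W}_{a}\right\|+\iota_{4}\left\|\tilde{x}\right\|,$$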

for all $t\geq 0$ and for all $x\in\overline{\operatorname{B}}\left(0,\chi\right)$ and $\hat{x}\in\mathbb{R}^{2n}$ , where $\overline{\operatorname{B}}\left(0,\chi\right)\subset\mathbb{R}^{2n}$ is a compact ball of radius $\chi$ , $\iota_{1},\cdots,\iota_{4}$ are positive constants, and $\tilde{x}\coloneqq x-\hat{x}$ .

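Let $\Theta\left(\tilde{W}_{c},\tilde{W}_{a},t\right)\coloneqq\frac{1}{2}\tilde{W}_{c}^{T}\Gamma^{-1}\left(t\right)\tilde{W}_{c}+\frac{1}{2}\tilde{W}_{a}^{T}\tilde{W}_{a}$ . The time-derivative of $\Theta$ along the trajectories of (15)-(17) is given by

$$\dot{\Theta}\left(\tilde{W}_{c},\tilde{W}_{a},t\right)=-\tilde{W}_{c}^{T}\Gamma^{-1}\left(-\frac{k_{c}}{N}\Gamma\sum_{i=1}^{N}\frac{\omega_{i}}{\rho_{i}}\delta_{ti}\right)-\frac{1}{2}\tilde{W}_{c}^{T}\left(\beta\Gamma^{-1}-\frac{k_{c}}{N}\sum_{i=1}^{N}\frac{\omega_{i}\omega_{i}^{T}}{\rho_{i}^{2}}\right)\tilde{W}_{c}-\tilde{W}_{a}^{T}\left(-k_{a1}\left(W_{a}-W_{c}\right)-k_{a2}W_{a}+\sum_{i=1}^{N}\frac{k_{c}G_{\sigma i}^{T}W_{a}\omega_{i}^{T}}{4N\rho_{i}}W_{c}\right).$$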

Using (14),

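$$\dot{\Theta}\left(\tilde{W}_{c},\tilde{W}_{a},t\right)\leq-k_{c}\underline{c}\left\|\tilde{W}_{c}\right\|^{2}-\left(k_{a1}+k_{a2}\right)\left\|\tilde{W}_{a}\right\|^{2}+k_{c}\iota_{8}\overline{\epsilon}\left\|\tilde{W}_{c}\right\|+k_{c}\iota_{5}\left\|\tilde{W}_{a}\right\|^{2}+\left(k_{c}\iota_{6}+k_{a1}\right)\left\|\tilde{W}_{c}\right\|\left\|\tilde{W}_{a}\right\|+\left(k_{c}\iota_{7}+k_{a2}\overline{W}\right)\left\|\tilde{W}_{a}\right\|,$$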

for all $t\geq 0$ and for all $x\in\overline{\operatorname{B}}\left(0,\chi\right)$ , where $\iota_{5},\cdots,\iota_{8}$ are positive constants that are independent of the learning gains, $\overline{W}$ denotes an upper bound on the norm of the ideal weights $W$ , and $\underline{c}=\min_{t\geq 0}\lambda_{\min}\left\{\left(\frac{\beta}{2k_{c}}\Gamma^{-1}\left(t\right)+\frac{1}{2N}\sum_{i=1}^{N}\frac{\omega_{i}\omega_{i}^{T}}{\rho_{i}}\right)\right\}$ . Assumption 2 and (20) guarantee that $\underline{c}>0$ .

Let $\Phi\left(\tilde{p},r,\eta\right)\coloneqq\frac{\alpha^{2}}{2}\tilde{p}^{T}\tilde{p}+\frac{1}{2}r^{T}r+\frac{1}{2}\eta^{T}\eta$ . The time-derivative of $\Phi$ along the trajectories of (1) and (7)-(10) is given by

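$$\dot{\Phi}\left(\tilde{p},r,\eta,t\right)=\alpha^{2}\tilde{p}^{T}\left(r-\alpha\tilde{p}-\eta\right)+\eta^{T}\left(-\beta\eta-kr-\alpha\tilde{q}\right)+r^{T}\left(\tilde{f}\left(x,\hat{x}\right)+\tilde{g}\left(x,\hat{x}\right)\hat{u}\left(\hat{x},W_{a}\right)-\alpha^{2}\tilde{p}-kr+k\eta+\alpha\eta\right),$$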

where $\tilde{f}\left(x,\hat{x}\right)\coloneqq f\left(x\right)-f\left(\hat{x}\right)$ and $\tilde{g}\left(x,\hat{x}\right)\coloneqq g\left(x\right)-g\left(\hat{x}\right)$ . The time derivative of $\Phi$ can be bounded above as

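$$\dot{\Phi}\left(\tilde{p},r,\eta,t\right)\leq-\alpha^{3}\left\|\tilde{p}\right\|^{2}-\left(k-\varpi_{1}\right)\left\|r\right\|^{2}-\left(\beta-\alpha\right)\left\|\eta\right\|^{2}+\varpi_{1}\left(1+\alpha\right)\left\|r\right\|\left\|\tilde{p}\right\|+\varpi_{1}\left\|r\right\|\left\|\eta\right\|+\varpi_{3}\left\|r\right\|+\varpi_{2}\left\|r\right\|\left\|\tilde{W}_{a}\right\|$$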

for all $t\geq 0$ and for all $x,\tilde{x}\in\overline{\operatorname{B}}\left(0,\chi\right)$ , where $\varpi_{1},\cdots,\varpi_{3}$ are positive constants that are independent of the learning gains.
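The term multiplying $r^{T}$ in the time-derivative of $\Phi$ is the closed-loop derivative of $r$. A short derivation is sketched here, assuming the estimator dynamics $\dot{\hat{p}}=\hat{q}$, $\dot{\hat{q}}=f\left(\hat{x}\right)+g\left(\hat{x}\right)u+\nu$ and the filter $\dot{\eta}=-\beta\eta-kr-\alpha\tilde{q}$ (forms inferred from the surrounding expressions), with $u=\hat{u}\left(\hat{x},W_{a}\right)$ and $\nu=\alpha^{2}\tilde{p}-\left(k+\alpha+\beta\right)\eta$. Since $\dot{\tilde{p}}=\tilde{q}$ and $\dot{\tilde{q}}=\tilde{f}+\tilde{g}\hat{u}-\nu$,

$$\begin{aligned}\dot{r}&=\dot{\tilde{q}}+\alpha\tilde{q}+\dot{\eta}\\&=\tilde{f}+\tilde{g}\hat{u}-\nu+\alpha\tilde{q}-\beta\eta-kr-\alpha\tilde{q}\\&=\tilde{f}+\tilde{g}\hat{u}-\alpha^{2}\tilde{p}+\left(k+\alpha+\beta\right)\eta-\beta\eta-kr\\&=\tilde{f}+\tilde{g}\hat{u}-\alpha^{2}\tilde{p}-kr+k\eta+\alpha\eta.\end{aligned}$$

The unmeasurable term $\alpha\tilde{q}$ cancels between $\alpha\dot{\tilde{p}}$ and $\dot{\eta}$, which is the purpose of including $-\alpha\tilde{q}$ in the filter. Similarly, the first term of $\dot{\Phi}$ follows from $\dot{\tilde{p}}=r-\alpha\tilde{p}-\eta$.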

The candidate Lyapunov function for the overall system is then defined as $\mathscr{V}\left(Z,t\right)=\mathcal{V}\left(x\right)+\Theta\left(\tilde{W}_{c},\tilde{W}_{a},t\right)+\Phi\left(\tilde{p},r,\eta\right)$ , where $Z\coloneqq\begin{bmatrix}x^{T}&\tilde{p}^{T}&r^{T}&\eta^{T}&\tilde{W}_{c}^{T}&\tilde{W}_{a}^{T}\end{bmatrix}^{T}$ . The time derivative of the candidate Lyapunov function can be bounded as

[TABLE]

where $z\coloneqq\begin{bmatrix}\left\|\tilde{W}_{c}\right\|&\left\|\tilde{W}_{a}\right\|&\left\|\tilde{p}\right\|&\left\|r\right\|&\left\|\eta\right\|\end{bmatrix}^{T}$ , $P=$

[TABLE]

$M=$

[TABLE]

Provided the matrix $M+M^{T}$ is positive definite,

[TABLE]

where $\underline{M}\coloneqq\lambda_{\min}\left\{\frac{M+M^{T}}{2}\right\}$ . Letting $\underline{M}\eqqcolon\underline{M}_{1}+\underline{M}_{2}$ and letting $\mathcal{W}:\mathbb{R}^{5n+2L}\to\mathbb{R}$ be defined as $\mathcal{W}\left(Z\right)=-W\left(x\right)-\underline{M}_{1}\left\|z\right\|^{2}$ , the bound

[TABLE]

for all $t\geq 0$ .

Using the bound in (20) and the fact that the converse Lyapunov function is guaranteed to be time-independent, radially unbounded, and positive definite, [51, Lemma 4.3] can be invoked to conclude that

[TABLE]

for all $t\in\mathbb{R}_{\geq 0}$ and for all $Z\in\mathbb{R}^{5n+2L}$ , where $\underline{v},\overline{v}:\mathbb{R}_{\geq 0}\rightarrow\mathbb{R}_{\geq 0}$ are class $\mathcal{K}$ functions.

Provided the learning gains, the domain radius $\chi$ , and the basis functions for function approximation are selected such that $M+M^{T}$ is positive definite and $\mu<\overline{v}^{-1}\left(\underline{v}\left(\frac{\chi}{4\left(1+\alpha\right)}\right)\right)$ , [51, Theorem 4.18] can be invoked to conclude that $Z$ is uniformly ultimately bounded. Since the estimates $W_{a}$ approximate the ideal weights $W$ , the policy $\hat{u}$ approximates the optimal policy $u^{*}$ .

VII Simulation Results

The performance of the developed controller is demonstrated by simulating a nonlinear, control affine system with a two dimensional state $x=[x_{1},\>x_{2}]^{T}$ . The system dynamics are described by (1) where

[TABLE]

The origin is an unstable equilibrium point of the unforced system $\dot{x}=f\left(x\right)$ . The control objective is to minimize the cost in (2), where $Q\left(x\right)=q^{2}$ and $R=1$ . For comparison purposes, the optimal value function for this problem is computed using the converse method in [57] as $V^{*}\left(x\right)=x_{1}^{2}+x_{2}^{2}$ .
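A value function of this form can be sanity-checked by integrating the running cost along the closed loop and comparing the result with $V^{*}$ at the initial state. The sketch below does this for a hypothetical system chosen so that $V^{*}\left(x\right)=x_{1}^{2}+x_{2}^{2}$ solves the HJB equation with $Q\left(x\right)=x_{2}^{2}$ and $R=1$; the functions `f` and `g`, the step size, and the horizon are assumptions for illustration and are not claimed to be the dynamics used in the paper.

```python
import math

# Hypothetical dynamics (assumed): x1' = x2, x2' = f(x) + g(x) u.
def g(x1):
    return math.cos(2.0 * x1) + 2.0

def f(x1, x2):
    return -x1 - 0.5 * x2 * (1.0 - g(x1) ** 2)

def closed_loop(state):
    # state = (x1, x2, accumulated cost); u = -g x2 is the policy induced by V*.
    x1, x2, _ = state
    u = -g(x1) * x2
    return (x2, f(x1, x2) + g(x1) * u, x2 ** 2 + u ** 2)

def rk4_step(state, dt):
    # Classical fourth-order Runge-Kutta step on the augmented state.
    k1 = closed_loop(state)
    k2 = closed_loop(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = closed_loop(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = closed_loop(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt * (a + 2.0 * b + 2.0 * c + d) / 6.0
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

state = (1.0, 1.0, 0.0)          # x(0) = [1, 1], zero accumulated cost
for _ in range(4000):            # 40 s with a step of 0.01 s
    state = rk4_step(state, 0.01)
x1_T, x2_T, cost = state
v_initial = 1.0 ** 2 + 1.0 ** 2  # V*(x(0)) for the assumed value function
print(cost, v_initial, x1_T, x2_T)
```

The accumulated cost approaches $V^{*}\left(x\left(0\right)\right)=2$ as the state converges to the origin, which is the defining property of the optimal value function evaluated along the optimal trajectory.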

The basis function $\sigma:\mathbb{R}^{2}\to\mathbb{R}^{3}$ for value function approximation is selected as $\sigma=\left[x_{1}^{2},x_{1}x_{2},x_{2}^{2}\right]^{T}$ . Based on the analytical solution, the ideal weights are $W=\left[1,\>0,\>1\right]^{T}$ . The data points for the simulation of experience in the update law (15) are selected to be on a $5\times 5$ grid around the origin. The learning gains are selected as $k_{c}=0.2$ , $k_{a1}=100$ , $k_{a2}=0.1$ , $\beta_{\gamma}=3$ , and $\nu=0.005$ . The gains for the state estimator are selected as $k=5$ , $\alpha=0.2$ , and $\beta=5$ . The initial conditions are selected as $x\left(0\right)=[1,1]^{T}$ , $\hat{x}\left(0\right)=[-1,-1]^{T}$ , $W_{a}\left(0\right)=W_{c}\left(0\right)=[0.5,0.5,0.5]^{T}$ , and $\Gamma\left(0\right)=50\operatorname{I}_{3}$ .

Figs. 1-5 demonstrate that the system state is regulated to the origin, the generalized velocities are identified, and the actor and the critic weights converge to their true values. Furthermore, unlike previous results, a probing signal to ensure persistence of excitation is not required.

VIII Conclusion

An output-feedback MBRL method is developed for a class of second-order nonlinear systems. The control technique uses exact model knowledge and integrates a dynamic state estimator within the model-based reinforcement learning framework to achieve output-feedback MBRL. Simulation results demonstrate the efficacy of the developed method. Integration of simultaneous state and parameter estimation methods such as [49] with the MBRL method to achieve output-feedback MBRL using uncertain models is a topic for future research.

Bibliography

[1] Z. Chen and S. Jagannathan, “Generalized Hamilton-Jacobi-Bellman formulation-based neural network control of affine nonlinear discrete-time systems,” IEEE Trans. Neural Netw., vol. 19, no. 1, pp. 90–106, Jan. 2008.
[2] P. Mehta and S. Meyn, “Q-learning and Pontryagin’s minimum principle,” in Proc. IEEE Conf. Decis. Control, Dec. 2009, pp. 3598–3605.
[3] D. Vrabie and F. L. Lewis, “Integral reinforcement learning for online computation of feedback Nash strategies of nonzero-sum differential games,” in Proc. IEEE Conf. Decis. Control, 2010, pp. 3066–3071.
[4] K. G. Vamvoudakis and F. L. Lewis, “Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem,” Automatica, vol. 46, no. 5, pp. 878–888, 2010.
[5] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal control, 3rd ed. Hoboken, NJ: Wiley, 2012.
[6] J. Y. Lee, J. B. Park, and Y. H. Choi, “Integral Q-learning and explorized policy iteration for adaptive optimal control of continuous-time linear systems,” Automatica, vol. 48, no. 11, pp. 2850–2859, Nov. 2012.
[7] H. Modares, F. L. Lewis, and M.-B. Naghibi-Sistani, “Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 10, pp. 1513–1525, 2013.
[8] S. Bhasin, R. Kamalapurkar, M. Johnson, K. G. Vamvoudakis, F. L. Lewis, and W. E. Dixon, “A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems,” Automatica, vol. 49, no. 1, pp. 89–92, Jan. 2013. http://www.sciencedirect.com/science/article/pii/S0005109812004827
