Differential Dynamic Programming for time-delayed systems

David D. Fan; Evangelos A. Theodorou

arXiv:1701.01882·cs.SY·January 10, 2017

TL;DR

This paper extends Differential Dynamic Programming (DDP) to handle systems with multiple time-delays, enabling optimal control for more complex, delay-influenced dynamical systems, demonstrated on chemical reactors and neural network models.

Contribution

The paper introduces a novel extension of DDP to systems with multiple time-delays, broadening its applicability to richer models including neural networks.

Findings

01. Successfully applied to a two-tank reactor system.

02. Effective control of a recurrent neural network model of an inverted pendulum.

03. Demonstrates real-time feasible trajectory optimization for delayed systems.

Abstract

Trajectory optimization considers the problem of deciding how to control a dynamical system to move along a trajectory which minimizes some cost function. Differential Dynamic Programming (DDP) is an optimal control method which utilizes a second-order approximation of the problem to find the control. It is fast enough to allow real-time control and has been shown to work well for trajectory optimization in robotic systems. Here we extend classic DDP to systems with multiple time-delays in the state. Being able to find optimal trajectories for time-delayed systems with DDP opens up the possibility to use richer models for system identification and control, including recurrent neural networks with multiple timesteps in the state. We demonstrate the algorithm on a two-tank continuous stirred tank reactor. We also demonstrate the algorithm on a recurrent neural network trained to model an…

Equations

$$x_{i+1}=f(x_{i},x_{i-1},\dotsc,x_{i-k},u_{i}),\quad i=0,\dotsc,N-1;\qquad x_{-j}=x^{0}_{-j},\quad j=0,\dotsc,k,\quad 0<k\ll N$$

$$J^{0}(\mathbf{\bar{x}}_{0},\mathcal{U})=\sum_{j=0}^{N-1}L^{j}(\mathbf{\bar{x}}_{j},u_{j})+L^{N}(\mathbf{\bar{x}}_{N})$$

$$\mathcal{U}^{*}=\arg\min_{\mathcal{U}}J^{0}(\mathbf{\bar{x}}_{0},\mathcal{U})$$

$$J^{i}(\mathbf{\bar{x}}_{i},\mathcal{U}_{i})=\sum_{j=i}^{N-1}L^{j}(\mathbf{\bar{x}}_{j},u_{j})+L^{N}(\mathbf{\bar{x}}_{N})$$

$$V^{i}(\mathbf{\bar{x}}_{i})=\min_{\mathcal{U}_{i}}\sum_{j=i}^{N-1}L^{j}(\mathbf{\bar{x}}_{j},u_{j})+L^{N}(\mathbf{\bar{x}}_{N})$$

$$V^{i}(\mathbf{\bar{x}}_{i})=\min_{u_{i}}\left[L^{i}(\mathbf{\bar{x}}_{i},u_{i})+V^{i+1}(\mathbf{\bar{x}}_{i+1})\right]$$

$$\begin{aligned}Q(\mathbf{\bar{x}}_{i}+\delta\mathbf{\bar{x}}_{i},u_{i}+\delta u_{i})-Q(\mathbf{\bar{x}}_{i},u_{i})\approx{}&\sum_{j=0}^{k}Q_{x_{i-j}}\delta x_{i-j}+Q_{u_{i}}\delta u_{i}\\&+\frac{1}{2}\begin{bmatrix}\delta x_{i}\\ \vdots\\ \delta x_{i-k}\\ \delta u_{i}\end{bmatrix}^{\intercal}\begin{bmatrix}Q_{x_{i}x_{i}}&\dots&Q_{x_{i}x_{i-k}}&Q_{x_{i}u_{i}}\\ \vdots&\ddots&\vdots&\vdots\\ Q_{x_{i-k}x_{i}}&\dots&Q_{x_{i-k}x_{i-k}}&Q_{x_{i-k}u_{i}}\\ Q_{u_{i}x_{i}}&\dots&Q_{u_{i}x_{i-k}}&Q_{u_{i}u_{i}}\end{bmatrix}\begin{bmatrix}\delta x_{i}\\ \vdots\\ \delta x_{i-k}\\ \delta u_{i}\end{bmatrix}\end{aligned}$$

$$\begin{aligned}L^{i}(\mathbf{\bar{x}}_{i}+\delta\mathbf{\bar{x}}_{i},u_{i}+\delta u_{i})-L^{i}(\mathbf{\bar{x}}_{i},u_{i})\approx{}&L^{i}_{x_{i}}\delta x_{i}+L^{i}_{x_{i-1}}\delta x_{i-1}+\dots+L^{i}_{x_{i-k}}\delta x_{i-k}+L^{i}_{u_{i}}\delta u_{i}\\&+\frac{1}{2}\begin{bmatrix}\delta x_{i}\\ \vdots\\ \delta x_{i-k}\\ \delta u_{i}\end{bmatrix}^{\intercal}\begin{bmatrix}L^{i}_{x_{i}x_{i}}&\dots&L^{i}_{x_{i}x_{i-k}}&L^{i}_{x_{i}u_{i}}\\ \vdots&\ddots&\vdots&\vdots\\ L^{i}_{x_{i-k}x_{i}}&\dots&L^{i}_{x_{i-k}x_{i-k}}&L^{i}_{x_{i-k}u_{i}}\\ L^{i}_{u_{i}x_{i}}&\dots&L^{i}_{u_{i}x_{i-k}}&L^{i}_{u_{i}u_{i}}\end{bmatrix}\begin{bmatrix}\delta x_{i}\\ \vdots\\ \delta x_{i-k}\\ \delta u_{i}\end{bmatrix}\end{aligned}$$

$$\begin{aligned}V^{i+1}(\mathbf{\bar{x}}_{i+1}+\delta\mathbf{\bar{x}}_{i+1})-V^{i+1}(\mathbf{\bar{x}}_{i+1})\approx{}&V^{\prime}_{0}\delta x_{i+1}+V^{\prime}_{1}\delta x_{i}+\dots+V^{\prime}_{k}\delta x_{i-k+1}\\&+\frac{1}{2}\begin{bmatrix}\delta x_{i+1}\\ \vdots\\ \delta x_{i-k+1}\end{bmatrix}^{\intercal}\begin{bmatrix}V^{\prime}_{00}&\dots&V^{\prime}_{0k}\\ \vdots&\ddots&\vdots\\ V^{\prime}_{k0}&\dots&V^{\prime}_{kk}\end{bmatrix}\begin{bmatrix}\delta x_{i+1}\\ \vdots\\ \delta x_{i-k+1}\end{bmatrix}\end{aligned}$$

$$\begin{aligned}\delta x_{i+1}={}&f(\mathbf{\bar{x}}_{i}+\delta\mathbf{\bar{x}}_{i},u_{i}+\delta u_{i})-f(\mathbf{\bar{x}}_{i},u_{i})\\ \approx{}&f_{x_{i}}\delta x_{i}+f_{x_{i-1}}\delta x_{i-1}+\dots+f_{x_{i-k}}\delta x_{i-k}+f_{u_{i}}\delta u_{i}\\&+\frac{1}{2}\begin{bmatrix}\delta x_{i}\\ \vdots\\ \delta x_{i-k}\\ \delta u_{i}\end{bmatrix}^{\intercal}\begin{bmatrix}f_{x_{i}x_{i}}&\dots&f_{x_{i}x_{i-k}}&f_{x_{i}u_{i}}\\ \vdots&\ddots&\vdots&\vdots\\ f_{x_{i-k}x_{i}}&\dots&f_{x_{i-k}x_{i-k}}&f_{x_{i-k}u_{i}}\\ f_{u_{i}x_{i}}&\dots&f_{u_{i}x_{i-k}}&f_{u_{i}u_{i}}\end{bmatrix}\begin{bmatrix}\delta x_{i}\\ \vdots\\ \delta x_{i-k}\\ \delta u_{i}\end{bmatrix}\end{aligned}$$

$$Q_{x_{i-j}}=L^{i}_{x_{i-j}}+f_{x_{i-j}}^{\intercal}V^{\prime}_{0}+V^{\prime}_{j+1}$$

$$Q_{x_{i-k}}=L^{i}_{x_{i-k}}+f_{x_{i-k}}^{\intercal}V^{\prime}_{0}$$

$$Q_{u_{i}}=L^{i}_{u_{i}}+f_{u_{i}}^{\intercal}V^{\prime}_{0}$$

$$Q_{x^{j}x^{l}}=\dots+\mathbf{1}_{j\neq k}V^{\prime}_{j+1,0}f_{x^{l}}+\mathbf{1}_{l\neq k}f_{x^{j}}^{\intercal}V^{\prime}_{0,l+1}+\mathbf{1}_{j,l\neq k}V^{\prime}_{j+1,l+1}$$

$$V^{\prime}_{0}\cdot f_{x^{j}x^{l}}=\sum_{p=1}^{n}{V^{\prime}_{0}}^{(p)}f^{(p)}_{x^{j}x^{l}}$$

$$\delta u^{*}=-Q_{uu}^{-1}\Big(Q_{u}+\sum_{j=0}^{k}Q_{x^{j}u}^{\intercal}\delta x^{j}\Big)$$

$$\tilde{Q}_{uu}=Q_{uu}+\mu I$$

$$\frac{dx_{4}(t)}{dt}=x_{2}(t-\tau)-2x_{4}(t)-u_{2}(t)\,(x_{4}(t)+0.25)+R_{2}-0.25$$

Symbols listed without a recoverable right-hand side: $x_{i+1}$, $\mathbf{\bar{x}}_{0}$, $Q_{x^{j}}$, $Q_{u}$, $Q_{x^{j}u}$, $Q_{uu}$, $\Delta V$, $V_{j}$, $V_{j,l}$, $V^{N}_{j}$, $V^{N}_{j,l}$, $\hat{\mathbf{\bar{x}}}_{0}$, $\hat{u}_{i}$, $\hat{x}_{i+1}$, $\frac{dx_{1}(t)}{dt}$, $\frac{dx_{2}(t)}{dt}$, $\frac{dx_{3}(t)}{dt}$, $R_{1}$.

Full text

Differential Dynamic Programming for Time-Delayed Systems

David D. Fan (1) and Evangelos A. Theodorou (2)

(1) Institute for Robotics and Intelligent Machines, Georgia Institute of Technology, Atlanta, GA 30332-0150, USA.

(2) Guggenheim School of Aerospace Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0150, USA.

Abstract

Trajectory optimization considers the problem of deciding how to control a dynamical system to move along a trajectory which minimizes some cost function. Differential Dynamic Programming (DDP) is an optimal control method which utilizes a second-order approximation of the problem to find the control. It is fast enough to allow real-time control and has been shown to work well for trajectory optimization in robotic systems. Here we extend classic DDP to systems with multiple time-delays in the state. Being able to find optimal trajectories for time-delayed systems with DDP opens up the possibility to use richer models for system identification and control, including recurrent neural networks with multiple timesteps in the state. We demonstrate the algorithm on a two-tank continuous stirred tank reactor. We also demonstrate the algorithm on a recurrent neural network trained to model an inverted pendulum with position information only.

I INTRODUCTION

Trajectory optimization, known more broadly as optimal control, deals with the problem of finding a control input to a system which is optimal in some sense, i.e. with respect to a cost function. Optimal control algorithms are often derived for systems with dynamics described by a first-order recurrence equation, where the next state is a function of the current state and control. However, many real-world systems cannot be easily described this way, and may contain delays in controls or states. For example, delays can be caused by communication delays in a distributed system, measurement delays from instrumentation, or time-varying dynamics. Some real-world examples include chemical processes, pneumatic systems with long transmission lines, hydraulic systems, and soft robotics with elastic members. Furthermore, it may be easier to construct accurate dynamics models of real systems if some past history can be incorporated into the state, since the Markov assumption may often be too restrictive. Finding algorithms for trajectory optimization on time-delay systems is an important problem, particularly in robotics, where distributed, swarm, and non-rigid-body robotics, as well as the modeling of robots with recurrent neural networks, are all quickly growing fields.

Trajectory optimization is often performed on systems with initially unknown dynamics, i.e. the dynamics must be learned. Creating models to approximate a real system’s dynamics is a nontrivial problem. Often in optimal control, the dynamics are assumed to be known and fully described by a first-order differential equation. In practice, however, it may be difficult to create a reliable model of unknown dynamics using only a first-order differential model. Parametric models which predict the next state given a short history of states may be more flexible and powerful. Examples include NARX models, time-delay neural networks, and other customized neural network architectures. In Lenz et al.’s DeepMPC work, a deep neural network was used to model the dynamics of a robot cutting various foods. Their network used a time-block design, in which time-series data was partitioned into blocks and fed into the neural network [1]. In work by Wahlström et al., a dynamical model was learned to map a sequence of images to control torques, with the neural network taking several past images as input [2]. Finally, a time-delay neural network was used to model helicopter blade torsions and flapping in [3].

Broadly speaking, there are two well-studied approaches to trajectory optimization - one is that of dynamic programming, and the other relies on Pontryagin’s maximum principle. For systems with time delay, much work has been done using Pontryagin’s maximum principle, beginning with Kharatishvili in 1961 [4]. A maximum principle approach by Guinn first reduces the delayed problem to a non-delayed one, then proceeds normally [5]. More recently, Göllmann et al. provided necessary optimality conditions for delayed problems with control and state constraints, as well as giving a review of maximum principles for time-delay systems [6].

As for utilizing dynamic programming for time-delayed systems, some work has been done which focused on Iterative Dynamic Programming (IDP) [7][8]. IDP utilizes a grid-search in the controls and iteratively shrinks the grid size and space in an effort to find the globally optimal control. This approach is severely affected by the curse of dimensionality, and will not work for problems with a larger number of states. A dynamic programming method which avoids the need to discretize the state and control space is Differential Dynamic Programming (DDP), which iteratively solves a second-order local approximation of the problem [9]. It has the advantage of providing both feed-forward and feedback control gains, and is fast enough to support real-time control of a humanoid robot [10]. A simpler variant of DDP, called the iterative Linear Quadratic Gaussian (iLQG) method, considers only first-order dynamics, and is well studied [11]. Several extensions to DDP have been recently made, including control-limited DDP [12], stochastic DDP [13], combining optimal control and optimal estimation [14], and probabilistic DDP, which controls the belief space for systems with unknown dynamics [15]. DDP has also been demonstrated to work well for receding horizon control in robotic systems [16].

In this work we extend Differential Dynamic Programming to systems with multiple delays in the state. This opens up a range of possibilities in terms of performing trajectory optimization on time-delayed dynamical systems. Of special interest is the case where a system’s dynamics are unknown but can be approximated with a parametric model containing state delays, e.g. a neural network.

An overview of the paper is as follows: Section II presents the derivation of Delayed DDP. Section III describes the algorithm based on the derivation. Section IV discusses implementation of the algorithm on a two-stage continuously stirred tank reactor system. Section V discusses modeling dynamical systems with delayed recurrent neural networks, and demonstrates Delayed DDP on a learned neural network model for an inverted pendulum model using position information.

II DELAYED DIFFERENTIAL DYNAMIC PROGRAMMING

II-A Problem Formulation

Let the sequence $\{x_{i}\}$ be a state trajectory comprised of states $x_{i}\in\mathbb{R}^{n}$ for times $i=0,\dotsc,N$ . The trajectory is determined by the k-th order difference equation:

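$$x_{i+1}=f(x_{i},x_{i-1},\dotsc,x_{i-k},u_{i}),\quad i=0,\dotsc,N-1;\qquad x_{-j}=x^{0}_{-j},\quad j=0,\dotsc,k,\quad 0<k\ll N\tag{1}$$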

where $\{u_{i}\}$ is a control sequence with $u_{i}\in\mathbb{R}^{m}$ for times $i=0,\dotsc,N-1$, $f$ maps $\mathbb{R}^{n(k+1)}\times\mathbb{R}^{m}\to\mathbb{R}^{n}$ and is twice differentiable, and $x^{0}_{-j}$ are the initial delayed states. The dynamics depend on the past $k+1$ states as well as the control at the current time. Let $\mathbf{\bar{x}}_{i}$ denote the sequence of states $\{x_{i},x_{i-1},\dotsc,x_{i-k}\}$. The initial condition is given by $\mathbf{\bar{x}}_{0}^{0}$. Then we can write (1) as:

$$x_{i+1}=f(\mathbf{\bar{x}}_{i},u_{i}),\qquad \mathbf{\bar{x}}_{0}=\mathbf{\bar{x}}_{0}^{0}\tag{2}$$
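As an illustration of the recurrence, the sketch below rolls a delayed system forward by keeping a sliding window of the last $k+1$ states. The function names `rollout` and `f_example`, and the scalar dynamics with $k=1$, are hypothetical choices made only for this example.

```python
from collections import deque

def rollout(f, history, controls):
    """Simulate x_{i+1} = f(x_i, ..., x_{i-k}, u_i).

    history  -- initial window [x_0, x_{-1}, ..., x_{-k}] (newest first)
    controls -- sequence u_0, ..., u_{N-1}
    Returns the list [x_0, x_1, ..., x_N].
    """
    window = deque(history, maxlen=len(history))  # holds x_i, ..., x_{i-k}
    states = [window[0]]
    for u in controls:
        x_next = f(tuple(window), u)
        window.appendleft(x_next)  # the oldest state x_{i-k} drops out
        states.append(x_next)
    return states

# Hypothetical scalar dynamics with one delay (k = 1).
def f_example(xbar, u):
    x_i, x_im1 = xbar
    return 0.5 * x_i + 0.25 * x_im1 + u

trajectory = rollout(f_example, history=[1.0, 0.0], controls=[0.0, 1.0, 0.0])
```

The window plays the role of $\mathbf{\bar{x}}_{i}$: it is the only information the dynamics need, which is what allows the delayed problem to be treated like a non-delayed one.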

Define a cost function as

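$$J^{0}(\mathbf{\bar{x}}_{0},\mathcal{U})=\sum_{j=0}^{N-1}L^{j}(\mathbf{\bar{x}}_{j},u_{j})+L^{N}(\mathbf{\bar{x}}_{N})\tag{3}$$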

where $\mathcal{U}=\{u_{i}\},i=0,\dotsc,N-1$ , and $L^{j}$ are twice-differentiable nonnegative scalar functions. The problem of optimal control is to find a $\mathcal{U}$ that minimizes this cost function:

$$\mathcal{U}^{*}=\arg\min_{\mathcal{U}}J^{0}(\mathbf{\bar{x}}_{0},\mathcal{U})\tag{4}$$
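A minimal sketch of evaluating the total cost for a given trajectory follows. It assumes scalar states; the helper names `windows` and `total_cost` and the quadratic costs are hypothetical, chosen only to show how each running cost sees the whole window $\mathbf{\bar{x}}_{j}$.

```python
def windows(states, past, k):
    """Return xbar_i = (x_i, x_{i-1}, ..., x_{i-k}) for i = 0..N.

    states -- [x_0, ..., x_N]
    past   -- [x_{-1}, ..., x_{-k}], the initial delayed states
    """
    padded = list(reversed(past)) + list(states)  # x_{-k}, ..., x_N
    return [tuple(padded[i + k - j] for j in range(k + 1))
            for i in range(len(states))]

def total_cost(states, past, controls, running, terminal):
    """Sum of running costs L^j(xbar_j, u_j) plus the terminal cost L^N(xbar_N)."""
    xbars = windows(states, past, len(past))
    stage = sum(running(xb, u) for xb, u in zip(xbars[:-1], controls))
    return stage + terminal(xbars[-1])

# Hypothetical quadratic costs that penalise the whole delay window.
def running(xbar, u):
    return 0.5 * sum(x * x for x in xbar) + 0.05 * u * u

def terminal(xbar):
    return 0.5 * sum(x * x for x in xbar)

J = total_cost([1.0, 0.5, 1.5, 0.875], [0.0], [0.0, 1.0, 0.0], running, terminal)
```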

II-B Bellman Equation with Delays

The Bellman equation is a necessary condition of optimality for the dynamic programming problem. In classic DDP without delays, the Taylor expansion of the Bellman equation about the point $(x_{i},u_{i})$ is taken at each timestep. For the case with delays, the Taylor expansion must be taken around the segment of past history within the delay, i.e., about $(\mathbf{\bar{x}}_{i},u_{i})=(x_{i},x_{i-1},\dotsc,x_{i-k},u_{i})$. This follows the idea of Guinn whereby a delayed system is reduced to one without delays [5].

Define the cost-to-go function as

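$$J^{i}(\mathbf{\bar{x}}_{i},\mathcal{U}_{i})=\sum_{j=i}^{N-1}L^{j}(\mathbf{\bar{x}}_{j},u_{j})+L^{N}(\mathbf{\bar{x}}_{N})\tag{5}$$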

Minimizing (5) with respect to the current and all future controls $\mathcal{U}_{i}$ gives an expression for the value function:

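$$V^{i}(\mathbf{\bar{x}}_{i})=\min_{\mathcal{U}_{i}}\sum_{j=i}^{N-1}L^{j}(\mathbf{\bar{x}}_{j},u_{j})+L^{N}(\mathbf{\bar{x}}_{N})\tag{6}$$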

This expression can be written iteratively, yielding a Bellman equation for delayed systems:

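$$V^{i}(\mathbf{\bar{x}}_{i})=\min_{u_{i}}\left[L^{i}(\mathbf{\bar{x}}_{i},u_{i})+V^{i+1}(\mathbf{\bar{x}}_{i+1})\right]\tag{7}$$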

Note that $\mathbf{\bar{x}}_{i+1}$ is simply $\{f(\mathbf{\bar{x}}_{i},u_{i}),x_{i},x_{i-1},\dotsc,x_{i-k+1}\}$, so the rightmost term of (7) is still a function of $\mathbf{\bar{x}}_{i}$ and $u_{i}$ only. Now, as in the case of classic DDP without delays, we can approximate the argument of the minimum in (7) via a second-order Taylor expansion to find $u$.
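To make the recursion concrete, the sketch below checks on a tiny problem that the backward recursion over windows gives the same optimal cost as exhaustive search over control sequences. The finite control set, the scalar dynamics with $k=1$, and the costs are hypothetical; the example uses exact dynamic programming, not the quadratic approximation developed next.

```python
from functools import lru_cache
from itertools import product

N = 3
CONTROLS = (-1.0, 0.0, 1.0)   # hypothetical finite control set

def f(xbar, u):               # hypothetical delayed dynamics, k = 1
    return 0.5 * xbar[0] + 0.25 * xbar[1] + u

def running(xbar, u):
    return xbar[0] ** 2 + 0.1 * u ** 2

def terminal(xbar):
    return xbar[0] ** 2

def shift(xbar, u):
    """xbar_{i+1} = (f(xbar_i, u_i), x_i, ..., x_{i-k+1})."""
    return (f(xbar, u),) + xbar[:-1]

@lru_cache(maxsize=None)
def value(i, xbar):
    """Bellman recursion on the window xbar_i."""
    if i == N:
        return terminal(xbar)
    return min(running(xbar, u) + value(i + 1, shift(xbar, u)) for u in CONTROLS)

def brute_force(xbar0):
    """Direct minimisation of the total cost over every control sequence."""
    best = float("inf")
    for seq in product(CONTROLS, repeat=N):
        xbar, cost = xbar0, 0.0
        for u in seq:
            cost += running(xbar, u)
            xbar = shift(xbar, u)
        best = min(best, cost + terminal(xbar))
    return best

xbar0 = (1.0, 0.0)
```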

II-C Quadratic Approximation

Define $Q$ as the argument of the minimum of (7) (where dependence on time $i$ is implicit), and write it as a function of perturbations around $(x_{i},x_{i-1},\dotsc,x_{i-k},u_{i})$ . Expanding via the second order Taylor expansion gives

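$$\begin{aligned}Q(\mathbf{\bar{x}}_{i}+\delta\mathbf{\bar{x}}_{i},u_{i}+\delta u_{i})-Q(\mathbf{\bar{x}}_{i},u_{i})\approx{}&\sum_{j=0}^{k}Q_{x_{i-j}}\delta x_{i-j}+Q_{u_{i}}\delta u_{i}\\&+\frac{1}{2}\begin{bmatrix}\delta x_{i}\\ \vdots\\ \delta x_{i-k}\\ \delta u_{i}\end{bmatrix}^{\intercal}\begin{bmatrix}Q_{x_{i}x_{i}}&\dots&Q_{x_{i}x_{i-k}}&Q_{x_{i}u_{i}}\\ \vdots&\ddots&\vdots&\vdots\\ Q_{x_{i-k}x_{i}}&\dots&Q_{x_{i-k}x_{i-k}}&Q_{x_{i-k}u_{i}}\\ Q_{u_{i}x_{i}}&\dots&Q_{u_{i}x_{i-k}}&Q_{u_{i}u_{i}}\end{bmatrix}\begin{bmatrix}\delta x_{i}\\ \vdots\\ \delta x_{i-k}\\ \delta u_{i}\end{bmatrix}\end{aligned}\tag{8}$$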

where the subscripts on $Q$ indicate partial derivatives. To find these coefficients of the Taylor expansion we must expand both $L^{i}(\mathbf{\bar{x}}_{i},u_{i})$ and $V^{i+1}(\mathbf{\bar{x}}_{i+1})$ in (7). Expanding $L^{i}(\mathbf{\bar{x}}_{i},u_{i})$ , we have:

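$$\begin{aligned}L^{i}(\mathbf{\bar{x}}_{i}+\delta\mathbf{\bar{x}}_{i},u_{i}+\delta u_{i})-L^{i}(\mathbf{\bar{x}}_{i},u_{i})\approx{}&L^{i}_{x_{i}}\delta x_{i}+L^{i}_{x_{i-1}}\delta x_{i-1}+\dots+L^{i}_{x_{i-k}}\delta x_{i-k}+L^{i}_{u_{i}}\delta u_{i}\\&+\frac{1}{2}\begin{bmatrix}\delta x_{i}\\ \vdots\\ \delta x_{i-k}\\ \delta u_{i}\end{bmatrix}^{\intercal}\begin{bmatrix}L^{i}_{x_{i}x_{i}}&\dots&L^{i}_{x_{i}x_{i-k}}&L^{i}_{x_{i}u_{i}}\\ \vdots&\ddots&\vdots&\vdots\\ L^{i}_{x_{i-k}x_{i}}&\dots&L^{i}_{x_{i-k}x_{i-k}}&L^{i}_{x_{i-k}u_{i}}\\ L^{i}_{u_{i}x_{i}}&\dots&L^{i}_{u_{i}x_{i-k}}&L^{i}_{u_{i}u_{i}}\end{bmatrix}\begin{bmatrix}\delta x_{i}\\ \vdots\\ \delta x_{i-k}\\ \delta u_{i}\end{bmatrix}\end{aligned}\tag{9}$$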

The expansion of $V^{i+1}(\mathbf{\bar{x}}_{i+1})$ requires more careful consideration. We will use the following notation to allow us to drop the indices $i$ : The partial derivative $V^{i+1}_{x_{i+1}}$ is the derivative of the value function at time $i+1$ with respect to the first argument, $x_{i+1}$ . We can write this in short hand as $V^{\prime}_{0}$ , where the ′ denotes the value function at time $i+1$ . Similarly, write $V^{i+1}_{x_{i}}$ as $V^{\prime}_{1}$ , up to $V^{i+1}_{x_{i-k+1}}$ as $V^{\prime}_{k}$ . For second derivatives, write $V^{i+1}_{x_{i+1},x_{i+1}}$ as $V^{\prime}_{00}$ , etc. The expansion yields:

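$$\begin{aligned}V^{i+1}(\mathbf{\bar{x}}_{i+1}+\delta\mathbf{\bar{x}}_{i+1})-V^{i+1}(\mathbf{\bar{x}}_{i+1})\approx{}&V^{\prime}_{0}\delta x_{i+1}+V^{\prime}_{1}\delta x_{i}+\dots+V^{\prime}_{k}\delta x_{i-k+1}\\&+\frac{1}{2}\begin{bmatrix}\delta x_{i+1}\\ \vdots\\ \delta x_{i-k+1}\end{bmatrix}^{\intercal}\begin{bmatrix}V^{\prime}_{00}&\dots&V^{\prime}_{0k}\\ \vdots&\ddots&\vdots\\ V^{\prime}_{k0}&\dots&V^{\prime}_{kk}\end{bmatrix}\begin{bmatrix}\delta x_{i+1}\\ \vdots\\ \delta x_{i-k+1}\end{bmatrix}\end{aligned}\tag{10}$$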

We can find the expression for $\delta x_{i+1}$ by expanding the function $f$ to the second order as well:

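$$\begin{aligned}\delta x_{i+1}={}&f(\mathbf{\bar{x}}_{i}+\delta\mathbf{\bar{x}}_{i},u_{i}+\delta u_{i})-f(\mathbf{\bar{x}}_{i},u_{i})\\ \approx{}&f_{x_{i}}\delta x_{i}+f_{x_{i-1}}\delta x_{i-1}+\dots+f_{x_{i-k}}\delta x_{i-k}+f_{u_{i}}\delta u_{i}\\&+\frac{1}{2}\begin{bmatrix}\delta x_{i}\\ \vdots\\ \delta x_{i-k}\\ \delta u_{i}\end{bmatrix}^{\intercal}\begin{bmatrix}f_{x_{i}x_{i}}&\dots&f_{x_{i}x_{i-k}}&f_{x_{i}u_{i}}\\ \vdots&\ddots&\vdots&\vdots\\ f_{x_{i-k}x_{i}}&\dots&f_{x_{i-k}x_{i-k}}&f_{x_{i-k}u_{i}}\\ f_{u_{i}x_{i}}&\dots&f_{u_{i}x_{i-k}}&f_{u_{i}u_{i}}\end{bmatrix}\begin{bmatrix}\delta x_{i}\\ \vdots\\ \delta x_{i-k}\\ \delta u_{i}\end{bmatrix}\end{aligned}\tag{11}$$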

Plugging (11) into (10) gives a summation of terms which are from first to fourth order with respect to $\delta\mathbf{\bar{x}}_{i}=(\delta x_{i},\delta x_{i-1},\dotsc,\delta x_{i-k},\delta u_{i})$. Since we are only interested in a second-order approximation, we can drop the third and fourth order terms, leaving a summation of terms which are either first or second order. The expressions for the coefficients $Q$ in (8) are found after collecting the terms multiplied by the same $\delta x_{i-j}$, along with the terms in (9). Considering the first-order coefficients, for $j=0,\dotsc,k-1$, we have:

$$Q_{x_{i-j}}=L^{i}_{x_{i-j}}+f_{x_{i-j}}^{\intercal}V^{\prime}_{0}+V^{\prime}_{j+1}\tag{12}$$

When $j=k$ , since (10) has no $\delta x_{i-k}$ term, the $V^{\prime}_{j+1}$ disappears:

$$Q_{x_{i-k}}=L^{i}_{x_{i-k}}+f_{x_{i-k}}^{\intercal}V^{\prime}_{0}\tag{13}$$

Also, gathering up the terms corresponding to $\delta u_{i}$ gives us

$$Q_{u_{i}}=L^{i}_{u_{i}}+f_{u_{i}}^{\intercal}V^{\prime}_{0}\tag{14}$$

To simplify notation further, write $f_{x_{i-j}}$ as $f_{x^{j}}$ , and follow the same pattern for $L^{i}_{x^{j}}$ and $Q_{x^{j}}$ . The full expressions for the Taylor coefficients $Q$ for both first and second order are, for $j,l=0,\dotsc,k$ :

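$$\begin{aligned}Q_{x^{j}}&=L^{i}_{x^{j}}+f_{x^{j}}^{\intercal}V^{\prime}_{0}+\mathbf{1}_{j\neq k}V^{\prime}_{j+1}\\ Q_{u}&=L^{i}_{u}+f_{u}^{\intercal}V^{\prime}_{0}\\ Q_{x^{j}u}&=L^{i}_{x^{j}u}+f_{x^{j}}^{\intercal}V^{\prime}_{00}f_{u}+V^{\prime}_{0}\cdot f_{x^{j}u}+\mathbf{1}_{j\neq k}V^{\prime}_{j+1,0}f_{u}\\ Q_{uu}&=L^{i}_{uu}+f_{u}^{\intercal}V^{\prime}_{00}f_{u}+V^{\prime}_{0}\cdot f_{uu}\\ Q_{x^{j}x^{l}}&=L^{i}_{x^{j}x^{l}}+f_{x^{j}}^{\intercal}V^{\prime}_{00}f_{x^{l}}+V^{\prime}_{0}\cdot f_{x^{j}x^{l}}\\&\quad+\mathbf{1}_{j\neq k}V^{\prime}_{j+1,0}f_{x^{l}}+\mathbf{1}_{l\neq k}f_{x^{j}}^{\intercal}V^{\prime}_{0,l+1}+\mathbf{1}_{j,l\neq k}V^{\prime}_{j+1,l+1}\end{aligned}\tag{15}$$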

where the dot $\cdot$ denotes contraction with a tensor and $\mathbf{1}_{j\neq k}$ is the indicator function, taking a value of $1$ when $j\neq k$ and $0$ otherwise. The tensor contractions arise since $f$ is a vector-valued function, so its first derivative is a matrix, and its second derivative is a tensor of rank 3. We can write the tensor contraction explicitly as the sum of each element of the vector $V^{\prime}_{0}$ times the Hessian of the corresponding element of $f$:

$$V^{\prime}_{0}\cdot f_{x^{j}x^{l}}=\sum_{p=1}^{n}{V^{\prime}_{0}}^{(p)}f^{(p)}_{x^{j}x^{l}}\tag{16}$$
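The contraction can be sketched in a few lines: each component of $V^{\prime}_{0}$ weights the Hessian of the matching component of $f$. The function name `contract` and the two-dimensional example are hypothetical and serve only as an illustration.

```python
def contract(v, hessians):
    """Return sum_p v[p] * hessians[p], where hessians[p] is the Hessian
    (a nested list) of the p-th component of f."""
    rows, cols = len(hessians[0]), len(hessians[0][0])
    return [[sum(v[p] * hessians[p][r][c] for p in range(len(v)))
             for c in range(cols)] for r in range(rows)]

# Hypothetical n = 2 example: f^(1) = x1 * x2 and f^(2) = x1 ** 2.
f_xx = [
    [[0.0, 1.0], [1.0, 0.0]],   # Hessian of f^(1)
    [[2.0, 0.0], [0.0, 0.0]],   # Hessian of f^(2)
]
V0 = [3.0, -1.0]
result = contract(V0, f_xx)
```

The result has the shape of a single Hessian block, so it can be added directly to the other second-order terms of $Q$.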

II-D Backward and Forward Pass

Now we can find an expression for the locally optimal control. Minimizing (8) with respect to $\delta u$ gives

$$\delta u^{*}=-Q_{uu}^{-1}\Big(Q_{u}+\sum_{j=0}^{k}Q_{x^{j}u}^{\intercal}\delta x^{j}\Big)\tag{17}$$

This is true as long as $Q_{uu}$ is positive-definite. If $Q_{uu}$ is not positive definite, the standard regularization proposed in [9] is to add a diagonal term to the Hessian:

$$\tilde{Q}_{uu}=Q_{uu}+\mu I\tag{18}$$

So we have a linear control law whose feedback gains act on the current state and the past $k$ states.
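For a scalar control the law reduces to simple divisions, as in the sketch below. The names `control_gains` and `delta_u` and all numeric values are hypothetical; the regularisation is the diagonal shift by $\mu$ described above.

```python
def control_gains(Q_u, Q_xu, Q_uu, mu=0.0):
    """Scalar-control sketch of the locally optimal control law.

    Q_u  -- first derivative of Q with respect to u
    Q_xu -- list [Q_{x^0 u}, ..., Q_{x^k u}] of mixed derivatives
    Q_uu -- second derivative with respect to u
    mu   -- regularisation added to Q_uu
    """
    Q_uu_reg = Q_uu + mu
    if Q_uu_reg <= 0.0:
        raise ValueError("regularised Q_uu must be positive; increase mu")
    k_ff = -Q_u / Q_uu_reg                    # open-loop term
    K_fb = [-q / Q_uu_reg for q in Q_xu]      # one feedback gain per delay
    return k_ff, K_fb

def delta_u(k_ff, K_fb, dx_window):
    """delta u* = k + sum_j K_j * delta x^j."""
    return k_ff + sum(K * dx for K, dx in zip(K_fb, dx_window))

k_ff, K_fb = control_gains(Q_u=1.0, Q_xu=[0.5, 0.25], Q_uu=2.0)
du = delta_u(k_ff, K_fb, dx_window=[0.2, -0.4])
```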

Define the open-loop gain $\mathbf{k}=-\tilde{Q}_{uu}^{-1}Q_{u}$ and the feedback gains $\mathbf{K}_{j}=-\tilde{Q}_{uu}^{-1}Q_{x^{j}u}^{\intercal}$ for $j=0,\dotsc,k$ . Plugging the control policy (17) into (8) and (7) and gathering terms which are zeroth, first, and second order in $\delta x$ gives recursive expressions for the quadratic approximation of the value at each timestep. Doing so yields, for $j,l=0,\dotsc,k$ :

[TABLE]
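As a worked check, substituting $\delta u=\mathbf{k}+\sum_{l}\mathbf{K}_{l}\delta x^{l}$ into the quadratic model of $Q$ and collecting terms by order gives one consistent form of these recursions. This is a sketch that assumes $Q_{x^{j}u}$ is arranged as an $n\times m$ block (so that $Q_{x^{j}u}^{\intercal}\delta x^{j}$ matches the control dimension); it may be arranged differently from the authors' equation (19).

```latex
\begin{aligned}
\Delta V &= \mathbf{k}^{\intercal} Q_{u} + \tfrac{1}{2}\,\mathbf{k}^{\intercal} Q_{uu}\,\mathbf{k},\\
V_{j} &= Q_{x^{j}} + \mathbf{K}_{j}^{\intercal} Q_{uu}\,\mathbf{k} + \mathbf{K}_{j}^{\intercal} Q_{u} + Q_{x^{j}u}\,\mathbf{k},\\
V_{j,l} &= Q_{x^{j}x^{l}} + \mathbf{K}_{j}^{\intercal} Q_{uu}\,\mathbf{K}_{l} + \mathbf{K}_{j}^{\intercal} Q_{x^{l}u}^{\intercal} + Q_{x^{j}u}\,\mathbf{K}_{l}.
\end{aligned}
```

When no regularization is used, so that $\mathbf{k}=-Q_{uu}^{-1}Q_{u}$ and $\mathbf{K}_{l}=-Q_{uu}^{-1}Q_{x^{l}u}^{\intercal}$, these collapse to $V_{j}=Q_{x^{j}}+Q_{x^{j}u}\mathbf{k}$ and $V_{j,l}=Q_{x^{j}x^{l}}+Q_{x^{j}u}\mathbf{K}_{l}$.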

Now in the same manner as classic DDP, one may first compute a forward pass with a nominal control trajectory, then compute a backward pass of the values of $V$, $V_{j}$, and $V_{j,l}$ along with the control policy $\{\mathbf{k},\mathbf{K}_{j}\}$ at each timestep, starting at the last timestep and working backwards. When starting with the last timestep, the boundary conditions for $V$ are given by the partial derivatives of $L^{N}(\mathbf{\bar{x}}_{N})$:

$$V^{N}_{j}=L^{N}_{x^{j}},\qquad V^{N}_{j,l}=L^{N}_{x^{j}x^{l}},\qquad j,l=0,\dotsc,k\tag{20}$$

The forward pass is then calculated as:

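$$\begin{aligned}\hat{\mathbf{\bar{x}}}_{0}&=\mathbf{\bar{x}}_{0}\\ \hat{u}_{i}&=u_{i}+\alpha\mathbf{k}_{i}+\sum_{j=0}^{k}\mathbf{K}_{j,i}\,(\hat{x}_{i-j}-x_{i-j})\\ \hat{x}_{i+1}&=f(\hat{\mathbf{\bar{x}}}_{i},\hat{u}_{i})\end{aligned}\tag{21}$$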

where $0<\alpha\leq 1$ is used to keep the new trajectory close to the old one, since the quadratic approximation is only valid near the nominal trajectory.
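A minimal sketch of such a forward pass for scalar states is given below. It assumes the update $\hat{u}_{i}=u_{i}+\alpha\mathbf{k}_{i}+\sum_{j}\mathbf{K}_{j,i}(\hat{x}_{i-j}-x_{i-j})$, with the initial delayed states shared by the old and new trajectories; the function name, the dynamics, and the gains are hypothetical.

```python
def forward_pass(f, xs, us, past, k_ff, K_fb, alpha):
    """One line-search rollout for a scalar system with k delays.

    xs, us -- nominal states x_0..x_N and controls u_0..u_{N-1}
    past   -- initial delayed states [x_{-1}, ..., x_{-k}]
    k_ff   -- open-loop terms, one per timestep
    K_fb   -- K_fb[i][j] multiplies the deviation of x_{i-j}
    """
    k = len(past)
    new_xs, new_us = [xs[0]], []
    for i, u in enumerate(us):
        # Deviation from the nominal window; zero for indices before time 0.
        dx = [new_xs[i - j] - xs[i - j] if i >= j else 0.0 for j in range(k + 1)]
        u_new = u + alpha * k_ff[i] + sum(K * d for K, d in zip(K_fb[i], dx))
        window = [new_xs[i - j] if i >= j else past[j - i - 1] for j in range(k + 1)]
        new_xs.append(f(window, u_new))
        new_us.append(u_new)
    return new_xs, new_us

# Hypothetical dynamics (k = 1), nominal trajectory, and gains.
def f_example(window, u):
    return 0.5 * window[0] + 0.25 * window[1] + u

xs_nom, us_nom, past = [1.0, 0.5, 0.5], [0.0, 0.0], [0.0]
k_ff, K_fb = [-0.5, 0.0], [[0.0, 0.0], [-0.5, -0.25]]
xs_new, us_new = forward_pass(f_example, xs_nom, us_nom, past, k_ff, K_fb, alpha=1.0)
```

With $\alpha=0$ the open-loop term vanishes and the rollout reproduces the nominal trajectory, which is why shrinking $\alpha$ keeps the new trajectory close to the old one.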

III ALGORITHM

Implementation follows the classic DDP case which is covered by Tassa et al. [10] in detail. The line search parameter $0<\alpha\leq 1$ is used to scale the open-loop gain $\mathbf{k}$ . A quadratic schedule is used to change the regularization parameter $\mu$ to find a good value if needed. A single iteration consists of three steps:

1. Derivatives: Compute derivatives of $L$ and $f$ in (15).

2. Backward Pass: For $i=N-1,\dotsc,0$, iteratively calculate the new control policy from (15) and (19). If for any $i$ a non-positive-definite $\tilde{Q}_{uu}$ is found, increase $\mu$ and restart the backward pass. Otherwise, decrease $\mu$.

3. Forward Pass: Set $\alpha=1$, then iterate (21) forward for $i=0,\dotsc,N-1$. If the integration diverges or if the actual cost reduction is less than expected (see [10]), reduce $\alpha$ and restart the forward pass.
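The three steps can be sketched end to end on a toy problem. The code below is a simplified illustration, not the authors' implementation: it assumes a scalar linear system $x_{i+1}=\sum_{j}a_{j}x_{i-j}+bu_{i}$ with quadratic costs, omits the second-order dynamics terms and the regularization $\mu$, and uses a plain cost-decrease test in the line search. All names and parameter values are hypothetical.

```python
def simulate(x0, past, us, a, b):
    """Roll out x_{i+1} = sum_j a[j] * x_{i-j} + b * u_i; past = [x_{-1}, ..., x_{-k}]."""
    xs, hist = [x0], [x0] + list(past)
    for u in us:
        x_next = sum(aj * xj for aj, xj in zip(a, hist)) + b * u
        hist = [x_next] + hist[:-1]
        xs.append(x_next)
    return xs

def cost(xs, us, q, r, qf):
    stage = sum(0.5 * (q * x * x + r * u * u) for x, u in zip(xs, us))
    return stage + 0.5 * qf * xs[-1] ** 2

def backward_pass(xs, us, a, b, q, r, qf):
    k, N = len(a) - 1, len(us)
    # Boundary conditions: derivatives of the terminal cost 0.5 * qf * x_N^2.
    V1 = [qf * xs[N]] + [0.0] * k
    V2 = [[qf if j == l == 0 else 0.0 for l in range(k + 1)] for j in range(k + 1)]
    k_ff, K_fb = [0.0] * N, [None] * N
    for i in reversed(range(N)):
        # Terms with index j + 1 or l + 1 exist only when j, l != k.
        Qx = [(q * xs[i] if j == 0 else 0.0) + a[j] * V1[0]
              + (V1[j + 1] if j < k else 0.0) for j in range(k + 1)]
        Qu = r * us[i] + b * V1[0]
        Quu = r + b * V2[0][0] * b
        Qxu = [a[j] * V2[0][0] * b + (V2[j + 1][0] * b if j < k else 0.0)
               for j in range(k + 1)]
        Qxx = [[(q if j == l == 0 else 0.0) + a[j] * V2[0][0] * a[l]
                + (V2[j + 1][0] * a[l] if j < k else 0.0)
                + (a[j] * V2[0][l + 1] if l < k else 0.0)
                + (V2[j + 1][l + 1] if j < k and l < k else 0.0)
                for l in range(k + 1)] for j in range(k + 1)]
        kf = -Qu / Quu
        K = [-Qxu[j] / Quu for j in range(k + 1)]
        # Without regularisation the value update simplifies to these forms.
        V1 = [Qx[j] + Qxu[j] * kf for j in range(k + 1)]
        V2 = [[Qxx[j][l] + Qxu[j] * K[l] for l in range(k + 1)] for j in range(k + 1)]
        k_ff[i], K_fb[i] = kf, K
    return k_ff, K_fb

def iterate(xs, us, past, a, b, q, r, qf):
    """One iteration: backward pass, then a forward pass with step halving."""
    k_ff, K_fb = backward_pass(xs, us, a, b, q, r, qf)
    k, old, alpha = len(a) - 1, cost(xs, us, q, r, qf), 1.0
    while alpha > 1e-4:
        new_xs, new_us, hist = [xs[0]], [], [xs[0]] + list(past)
        for i in range(len(us)):
            dx = [new_xs[i - j] - xs[i - j] if i >= j else 0.0 for j in range(k + 1)]
            u = us[i] + alpha * k_ff[i] + sum(K * d for K, d in zip(K_fb[i], dx))
            x_next = sum(aj * xj for aj, xj in zip(a, hist)) + b * u
            hist = [x_next] + hist[:-1]
            new_xs.append(x_next)
            new_us.append(u)
        if cost(new_xs, new_us, q, r, qf) < old:
            return new_xs, new_us
        alpha *= 0.5
    return xs, us

a, b, q, r, qf = [0.9, 0.3], 1.0, 1.0, 0.1, 10.0
past, us0 = [1.0], [0.0] * 20
xs0 = simulate(1.0, past, us0, a, b)
J0 = cost(xs0, us0, q, r, qf)
xs1, us1 = iterate(xs0, us0, past, a, b, q, r, qf)
J1 = cost(xs1, us1, q, r, qf)
```

Because this toy problem is linear-quadratic, a single iteration with $\alpha=1$ should already reach the optimum, so a second backward pass returns open-loop terms that are numerically zero. Nonlinear problems generally need several iterations.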

The computational cost increase of Delayed DDP compared to classic DDP grows with the size of the delay. Assuming $n=m$, the computational cost of one iteration of classic DDP is $\mathcal{O}(Nn^{3})$. The cubic dependence on the dimension of the states and controls comes from multiplying matrices when back-propagating the value function, as well as the matrix inversion of $Q_{uu}$. With Delayed DDP, the computational cost is increased to $\mathcal{O}(k^{2}Nn^{3})$. Therefore, a short delay is preferable to a long one.

Classic DDP converges quadratically to a local minimum of the cost function [9]. Delayed DDP also converges quadratically; this can be seen because the delayed problem is collapsed to a non-delayed problem in (7), and the convergence analysis that follows is the same as in that of classic DDP. If fast computation speed is desired over quadratic convergence, the second-order dynamics terms containing tensor contractions can be dropped, resulting in an iLQG algorithm for time-delay systems. This may be advantageous for the receding horizon case, as in Model-Predictive Control (MPC), where less-than-optimal solutions are more acceptable and having low computation time takes a higher priority [16].

IV EXPERIMENTS

IV-A Delayed DDP Applied to Two-Stage Continuously Stirred Tank Reactor System

We simulate the Delayed DDP algorithm on a nonlinear two-stage continuously stirred tank reactor (CSTR) system [7]. The system has four states and is under-actuated with two control inputs. The 1st and 3rd states correspond to normalized concentrations, and the 2nd and 4th states correspond to normalized temperatures. The system is described by:

[TABLE]

with

[TABLE]

and an initial state

[TABLE]

where $\tau$ is the time delay and was set to $0.5$ seconds. Euler’s discretization was used with a step size of $0.05$ seconds to bring the system into discrete equations, and a horizon of $100$ timesteps, giving a $5$ second horizon, was used. We minimized a quadratic cost function in both states and controls:

[TABLE]

where $P$ and $R$ are diagonal, positive semidefinite matrices. We found that using a cost function which depends on the entire delay history rather than the current state alone improved the ease of finding a good solution. The control cost was $R=0.1\,I_{2\times 2}$ and the state cost was $P=I_{4\times 4}$. For simplicity, we used a fixed line search parameter $\alpha=0.4$ with no regularization, and omitted the second-order dynamics terms. The algorithm converged to an acceptable solution after about $5$ iterations; we show the results after $20$ iterations (Figures 1 and 2). Including the second-order dynamics terms results in the same solution if some regularization is used and more iterations are taken.
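The discretization settings quoted above ($\tau=0.5$ s, step size $0.05$ s, $100$ timesteps) imply a delay of $k=10$ timesteps. The sketch below works through that arithmetic and shows a single Euler step for a delay differential equation. The scalar right-hand side `rhs` is a hypothetical stand-in used only for illustration; it is not the CSTR model.

```python
tau, dt, horizon = 0.5, 0.05, 100
k = round(tau / dt)          # number of delayed timesteps
T = horizon * dt             # horizon length in seconds

def euler_step(x_hist, u, dt, k, rhs):
    """x_{i+1} = x_i + dt * rhs(x_i, x_{i-k}, u_i), with x_hist = [x_i, ..., x_{i-k}]."""
    return x_hist[0] + dt * rhs(x_hist[0], x_hist[k], u)

# Hypothetical scalar equation dx/dt = -x(t) + 0.5 * x(t - tau) + u(t).
def rhs(x, x_delayed, u):
    return -x + 0.5 * x_delayed + u

x_next = euler_step([1.0] * (k + 1), 0.0, dt, k, rhs)
```

A continuous delay therefore turns into a discrete system whose window holds $k+1=11$ states, even though only the newest and oldest entries enter the dynamics.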

To test the difference between Delayed DDP and classic DDP, we used classic DDP to find an optimal control for the two-stage CSTR system when $\tau=0$, i.e. for the system without delays. We then applied this optimal control to an otherwise identical two-stage CSTR system with a delay of $\tau=0.5$. The classic DDP optimal control sequence was unable to adequately control the system with delays (Figure 3). Therefore, using classic DDP on systems where delays play a role may be an inadequate approach.

A distinct advantage of DDP over other trajectory optimization techniques such as those relying on the maximum principle is that the DDP algorithm gives feedback gains $\mathbf{K}_{j}$ , which arise naturally from the back-propagation of the value function. These feedback gains can be used to steer the system in the presence of noise or disturbances. To demonstrate the value of having these feedback gains, we added some Wiener process noise to the two-stage CSTR system and ran it with and without feedback gains. Independent noise was added to each state at each timestep drawn from the normal distribution $\mathcal{N}(0,\sigma\sqrt{dt})$ . Without feedback gains, the noisy control quickly steers the state trajectories off course, and the unstable dynamical system causes the states to explode (Figure 4). However, using the feedback gains results in reliable control, keeping the states close to their intended trajectories.
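The effect of delayed feedback gains can be illustrated on a toy problem. The sketch below simulates the deviation from a nominal trajectory for a hypothetical unstable scalar system with one delay, once without feedback and once with hand-picked gains that cancel the linear dynamics. The system, the gains, and the reading of $\sigma\sqrt{dt}$ as a standard deviation are all assumptions made for this example; the gains are not the output of Delayed DDP.

```python
import random

def simulate_noisy(a, gains, noise):
    """Deviation dynamics dx_{i+1} = sum_j a[j] * dx_{i-j} + du_i + w_i,
    with du_i = sum_j gains[j] * dx_{i-j}; all deviations start at zero."""
    hist = [0.0] * len(a)
    out = [0.0]
    for w in noise:
        du = sum(K * d for K, d in zip(gains, hist))
        dx_next = sum(aj * d for aj, d in zip(a, hist)) + du + w
        hist = [dx_next] + hist[:-1]
        out.append(dx_next)
    return out

rng = random.Random(0)
sigma, dt, N = 0.1, 0.05, 100
noise = [rng.gauss(0.0, sigma * dt ** 0.5) for _ in range(N)]

a = [1.1, 0.2]                                        # hypothetical unstable dynamics
open_loop = simulate_noisy(a, [0.0, 0.0], noise)      # no feedback
closed_loop = simulate_noisy(a, [-1.1, -0.2], noise)  # hypothetical gains K_0, K_1
```

With these gains the closed-loop deviation at each step equals the most recent noise sample, while the open-loop deviation is amplified by the unstable dynamics at every step.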

V MODELING WITH DELAYED RECURRENT NEURAL NETWORKS

Recurrent neural networks have been shown to be useful for approximating dynamical systems [17][18][19]. Various architectures have been considered, including Temporal-Kernel Recurrent Neural Networks, Echo-State Neural Networks, Long Short-Term Memory networks, and more (for a review, see [20]). More recently, some work has been done in building deep recurrent neural network models to approximate the dynamics of various tasks such as cutting fruit with a robot arm [1], controlling a pendulum from images [2], or learning inverse dynamics of a musculoskeletal robot [21]. For each of these works, the authors found that in order to achieve good performance, it was necessary to train the neural networks on segments of delayed data, creating recurrent neural networks which are deep in time [22]. The necessity of making use of delays is especially evident for tasks such as controlling a system from images alone, since a single frame is insufficient for providing information about state velocities. However, control and trajectory optimization of such delayed recurrent neural networks has not been previously addressed. In the aforementioned works, the authors used Model Predictive Control along with policy gradient methods to find a control to accomplish some task. This approach is computationally expensive and demands constantly re-querying the system and updating the trajectory. Instead, our approach here is to use the Delayed DDP framework to efficiently plan an entire trajectory at once. Feedback gains are naturally obtained from the back-propagation of the value function, and can be used to compensate for both modeling errors and control noise. This approach should be both more computationally efficient and more powerful.

We use the following neural network architecture to approximate a dynamical system:

[TABLE]

where $\sigma$ is the activation function of choice (we use the hyperbolic tangent), and $\{W_{u},W_{j}\}$, $\{b_{u},b_{j}\}$ are weight matrices and bias vectors, respectively (Figure 5). The network is trained to produce the state vector at the next timestep given the past $k+1$ states and the current control input. However, directly training this feed-forward architecture to do one-step prediction is unlikely to create a system which closely approximates the real system’s dynamics. This is because we are interested not only in one-step prediction but in multiple-step sequence prediction. This problem of training a recurrent neural network to do multiple-step prediction has been previously addressed in various ways. One approach, called Scheduled Sampling, is to gradually shift from training the network on data alone to training the network on its own outputs [23]. Another more data-driven approach known as Data as Demonstrator augments the training data with the model’s own errors [24]. Here, we take a simpler approach. The step forward function in (24) is a feed-forward neural network, but since it takes inputs of past states, it can be considered a recurrent neural network. We therefore train the network with an entire sequence of data, back-propagating the error backwards through time, just as one would train any normal recurrent neural network. We can also augment the state vector with hidden states to increase the expressiveness of the network. Let $\hat{x}_{i}$ be the augmented state, with $\hat{x}_{i}=[x_{i},h_{i}]$, $x_{i}\in\mathbb{R}^{n}$, $h_{i}\in\mathbb{R}^{r}$. The augmented state obeys the same dynamics given by (24), the only difference being that when we train the network on data from the real system, we do not include the hidden states $h_{i}$ in the error function. The hidden states are therefore free to change however they wish, as long as they result in the visible states $x_{i}$ approximating the real dynamics.
After the model has been trained, one can perform Delayed DDP on this model. One interesting point which arises is that since Delayed DDP is being applied to the augmented states, feedback gains will be obtained which correspond to both the hidden and visible states. When applying these feedback gains to the real system, we can simply throw out the feedback gains which correspond to the hidden states.
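The sketch below illustrates the two ideas in this section: a delayed update that maps a window of augmented states and a control to the next augmented state, and an error function that ignores the hidden components. The single-layer form of `step`, the single bias vector, and the scalar control are assumptions made for illustration; the architecture in (24) may differ.

```python
import math

def step(window, u, W, W_u, b):
    """Hypothetical one-layer delayed update for a scalar control u:
    xhat_{i+1} = tanh(sum_j W[j] @ xhat_{i-j} + W_u * u + b)."""
    size = len(b)
    pre = [b[r] + W_u[r] * u
           + sum(W[j][r][c] * window[j][c]
                 for j in range(len(window)) for c in range(size))
           for r in range(size)]
    return [math.tanh(p) for p in pre]

def visible_loss(pred_seq, target_seq, n_visible):
    """Squared error on the first n_visible components only;
    the hidden components are left unconstrained."""
    return sum((p[c] - t[c]) ** 2
               for p, t in zip(pred_seq, target_seq) for c in range(n_visible))

# One visible and one hidden component, one delay (k = 1).
W = [[[0.5, 0.0], [0.0, 0.5]], [[0.1, 0.0], [0.0, 0.1]]]
x_next = step([[0.0, 0.0], [0.0, 0.0]], 0.0, W, W_u=[1.0, 0.0], b=[0.0, 0.0])
loss = visible_loss([[1.0, 5.0]], [[0.5]], n_visible=1)
```

In the loss example the hidden value `5.0` has no effect, which mirrors how the hidden states are free to take any value during training.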

We trained a neural network with delays to model the dynamics of a pendulum from position information only, then used Delayed DDP to find a control which swings the pendulum into an inverted position. The training dataset consisted of 12,500 one-second trajectories, simulated with a 20 ms resolution, with each trajectory starting from the stable hanging position and perturbed by a random torque input. This data consisted of the sequences of x and y coordinates of the pendulum bob scaled to the interval $[-1,1]$. We did not include velocity information in the data, necessitating the use of a delayed system to infer velocity information. The input was chosen to be a set of sinusoids with random frequency, phase, and amplitude, along with a set of uniformly distributed random control inputs. A delay of $k=3$ timesteps was used, so the input state consisted of 4 timesteps of data. We used a neural network size of 32 units, 2 of which were the visible states. Training was performed with the Adam method [25] using a batch size of 128 and a total of 1000 epochs.

Once the model was trained, Delayed DDP was applied to the neural network model. The loss function used was the error in the angle of the pendulum, found by taking the inverse tangent of the x and y coordinates given by the visible states. A small control cost was used to keep the control within the range of control inputs found in the training dataset. Convergence occurred after about 15 iterations. After finding the optimal control for the neural network model, this control was then applied to the real system, along with feedback gains. Since the neural network model contains both delays and hidden states, we obtain extra feedback gains which are not usable on the real system. We only use the feedback gains which correspond to the position at the current time. The feedback gains scale the error between the states of the real system and the neural network model’s states. Figure 6 shows the results comparing the control applied to the model, the real system, and the real system utilizing the feedback gains obtained from Delayed DDP. Delayed DDP gives a control which successfully controls the model. Comparing this control to the control obtained by classic DDP on the actual system shows that a similar solution has been found. Applying the Delayed DDP control to the real system gives a less optimal result, due to slight modeling error in the neural network. However, applying the feedback gains allows the successful control of the real system to the desired optimal trajectory, thereby compensating for the modeling error. It is important to note here that the feedback gains applied to the real system are for position alone, since the neural network encodes position information only and does not encode the velocity of the pendulum.
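
The angle-based cost and the use of the visible-state feedback gains described above might look like the following sketch. The coordinate convention ($x=\sin\theta$, $y=-\cos\theta$, with $\theta=0$ hanging), the function names, and the assumption that `K` is the gain on the current augmented state (with gains on delayed states already discarded) are illustrative choices, not details taken from the paper.

```python
import numpy as np

def pendulum_angle(xy):
    """Angle from the hanging position, ASSUMING x = sin(theta), y = -cos(theta)."""
    return np.arctan2(xy[0], -xy[1])

def angle_cost(xy, target=np.pi):
    """Squared angle error to the target, with the error wrapped to [-pi, pi)."""
    err = (pendulum_angle(xy) - target + np.pi) % (2.0 * np.pi) - np.pi
    return float(err ** 2)

def feedback_control(u_nom, K, x_real, x_model_aug):
    """Nominal control plus feedback on the current visible states only.

    u_nom:       (m,) control found on the neural network model.
    K:           (m, n + r) gain on the current augmented state [x, h]
                 (hypothetical layout: visible states come first).
    x_real:      (n,) measured position of the real system.
    x_model_aug: (n + r,) augmented state of the model at the same timestep.
    """
    n = x_real.shape[0]
    K_vis = K[:, :n]  # drop the gains that act on the hidden states
    return u_nom + K_vis @ (x_real - x_model_aug[:n])
```

For example, with `K = [[1, 2, 3, 4]]`, a nominal control of `0.5`, a real position of `[1, 1]`, and a model state of `[0, 0, 5, 5]`, only the first two gains are used and the hidden entries of the model state play no role.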

VI CONCLUSION

We derived a differential dynamic programming algorithm for systems with delays in their state. This allows us to leverage the power of DDP on a broad class of time-delayed systems, including neural network models which incorporate some past segment of history into their states. We demonstrated the algorithm on a two-tank CSTR system, as well as a neural network trained to model an inverted pendulum from position information only, and a neural network trained to model an inverted cart-pole system. We showed that leveraging the feedback gains that DDP gives allows us to create a control which is more tolerant to noise and model error.

Future research includes extending differential dynamic programming to nonlinear stochastic delayed systems. In addition, min-max and risk-sensitive control formulations will be considered. Finally, application to real systems is ongoing work.

Bibliography

[1] I. Lenz, R. Knepper, and A. Saxena, “DeepMPC: Learning Deep Latent Features for Model Predictive Control,” Robotics: Science and Systems, 2015.
[2] N. Wahlström, T. B. Schön, and M. P. Deisenroth, “From Pixels to Torques: Policy Learning with Deep Dynamical Models,” arXiv preprint arXiv:1502.02251, Feb 2015.
[3] F. D. Marques, L. D. F. R. D. Souza, D. C. Rebolho, A. S. Caporali, E. M. Belo, and R. L. Ortolan, “Application of time-delay neural and recurrent neural networks for the identification of a hingeless helicopter blade flapping and torsion motions,” Journal of the Brazilian Society of Mechanical Sciences and Engineering, vol. 27, no. 2, pp. 97–103, 2005.
[4] V. G. Boltyanskiy, “The Maximum Principle in the Theory of Optimal Processes,” Doklady Akademii Nauk, no. 136, pp. 1–5, 1961.
[5] T. Guinn, “Reduction of delayed optimal control problems to nondelayed problems,” Journal of Optimization Theory and Applications, vol. 18, no. 3, pp. 371–377, Mar 1976.
[6] B. Houska, H. J. Ferreau, and M. Diehl, “ACADO toolkit-An open-source framework for automatic control and dynamic optimization,” Optimal Control Applications and Methods, vol. 32, no. 3, pp. 298–312, Jul 2011.
[7] S. Dadebo and R. Luus, “Optimal control of time-delay systems by dynamic programming,” Optimal Control Applications and Methods, vol. 13, no. 1, pp. 29–41, Jan 1992.
[8] C. Hwang and J. Lin, “An improved computational scheme for solving dynamic optimization problems with iterative dynamic programming,” Journal of the Chinese Institute of Engineers, vol. 22, no. 4, pp. 409–421, Jun 1999.