Chance-Constrained Trajectory Optimization for Non-linear Systems with Unknown Stochastic Dynamics

Onur Celik; Hany Abdulsamad; Jan Peters

arXiv:1906.11003·eess.SY·August 1, 2019

TL;DR

This paper introduces a chance-constrained trajectory optimization method for non-linear systems with unknown stochastic dynamics, improving robustness and avoiding premature convergence in model-based reinforcement learning.

Contribution

It proposes a novel approach that incorporates probabilistic chance constraints into trajectory optimization, addressing physical limits and enhancing learning stability.

Findings

01. Significant improvement in learning robustness.

02. Better avoidance of unreachable state-action areas.

03. Enhanced performance over state-of-the-art algorithms.

Tables

Table 1. TABLE I: Mean total reward and standard deviation of the Furuta swing-up task, scaled by $1\mathrm{e}{-2}$.

Iteration	10	30	45
CCTO	$-6.8\,(\pm 0.32)$	$-1.3\,(\pm 0.11)$	$-0.65\,(\pm 0.6)$
iLQG	$-4.3\,(\pm 0.46)$	$-1.6\,(\pm 0.39)$	$-1.1\,(\pm 0.53)$

Table 2. TABLE III: Mean total reward and standard deviation of the Cart-Pole swing-up task, scaled by $1\mathrm{e}{-2}$.

Iteration	20	30	55
CCTO	$-2.3\,(\pm 0.32)$	$-1.2\,(\pm 0.32)$	$-0.31\,(\pm 0.06)$
iLQG	$-9.3\,(\pm 0.10)$	$-9.3\,(\pm 0.10)$	$-9.3\,(\pm 0.10)$

Equations

$\max_{\boldsymbol{A}}$

$\boldsymbol{s}_{t+1}\sim P(\boldsymbol{s}_{t+1}\mid\boldsymbol{s}_{t},\boldsymbol{a}_{t}),$

$\Pr(\boldsymbol{s}_{0:T}\in\mathcal{S})\geq 1-\theta,$

$\Pr(\boldsymbol{a}_{0:T-1}\in\mathcal{A})\geq 1-\vartheta,$

$J(\boldsymbol{s},\boldsymbol{A})=-\mathbb{E}\Big[\dots+(\boldsymbol{s}_{T}-\boldsymbol{s}_{g,T})^{\intercal}\boldsymbol{M}_{T}(\boldsymbol{s}_{T}-\boldsymbol{s}_{g,T})\Big],$

$\Pr(\boldsymbol{s}_{0:T}\in\mathcal{S})=\Pr\Big(\bigcap_{t=0}^{T}\boldsymbol{h}_{t}^{\intercal}\boldsymbol{s}_{t}\leq b_{t}\Big)\geq 1-\theta.$

$\Pr\Big(\bigcap_{t=0}^{T}\boldsymbol{h}_{t}^{\intercal}\boldsymbol{s}_{t}\leq b_{t}\Big)\geq 1-\sum_{t=0}^{T}\Big[1-\Pr(\boldsymbol{h}_{t}^{\intercal}\boldsymbol{s}_{t}\leq b_{t})\Big].$

$\Pr(\boldsymbol{h}_{t}^{\intercal}\boldsymbol{s}_{t}\leq b_{t})$

$\frac{1}{2}\left[1+\operatorname{erf}\left(\frac{b_{t}-\boldsymbol{h}_{t}^{\intercal}\boldsymbol{\mu}_{\boldsymbol{s}_{t}}}{\sqrt{2\boldsymbol{h}_{t}^{\intercal}\boldsymbol{\Sigma}_{\boldsymbol{s}_{t}}\boldsymbol{h}_{t}}}\right)\right]\geq 1-\theta_{t},$

$b_{t}-\boldsymbol{h}_{t}^{\intercal}\boldsymbol{\mu}_{\boldsymbol{s}_{t}}-\sqrt{2\boldsymbol{h}_{t}^{\intercal}\boldsymbol{\Sigma}_{\boldsymbol{s}_{t}}\boldsymbol{h}_{t}}\,\operatorname{erf}^{-1}(1-2\theta_{t})$

$J(\boldsymbol{s},\boldsymbol{A})=\mathbb{E}\left[\sum_{t=0}^{T-1}R_{t}(\boldsymbol{s}_{t},\boldsymbol{a}_{t})+R_{T}(\boldsymbol{s}_{T})\right].$

$V_{t}(\boldsymbol{s})=\max_{\boldsymbol{a}_{t}}\left[R_{t}(\boldsymbol{s}_{t},\boldsymbol{a}_{t})+\sum_{\boldsymbol{s}_{t+1}}V_{t+1}(\boldsymbol{s}_{t+1})P(\boldsymbol{s}_{t+1}\mid\boldsymbol{s}_{t},\boldsymbol{a}_{t})\right],$

$Q_{t}(\delta\boldsymbol{s},\delta\boldsymbol{a})\approx\frac{1}{2}\begin{bmatrix}1\\ \delta\boldsymbol{s}\\ \delta\boldsymbol{a}\end{bmatrix}^{\intercal}\begin{bmatrix}0&\boldsymbol{Q}_{s,t}^{\intercal}&\boldsymbol{Q}_{a,t}^{\intercal}\\ \boldsymbol{Q}_{s,t}&\boldsymbol{Q}_{ss,t}&\boldsymbol{Q}_{sa,t}\\ \boldsymbol{Q}_{a,t}&\boldsymbol{Q}_{as,t}&\boldsymbol{Q}_{aa,t}\end{bmatrix}\begin{bmatrix}1\\ \delta\boldsymbol{s}\\ \delta\boldsymbol{a}\end{bmatrix}.$

$\boldsymbol{Q}_{s,t},\quad\boldsymbol{Q}_{a,t},\quad\boldsymbol{Q}_{ss,t},\quad\boldsymbol{Q}_{aa,t},\quad\boldsymbol{Q}_{as,t}$

$\Delta V_{t},\quad\boldsymbol{V}_{s,t},\quad\boldsymbol{V}_{ss,t}$

$\boldsymbol{a}_{t},\quad\boldsymbol{s}_{t+1}$

$\boldsymbol{\tilde{s}},\quad\boldsymbol{\tilde{B}},\quad\boldsymbol{\tilde{d}}$

$\boldsymbol{\tilde{M}},\quad\boldsymbol{\tilde{M}}_{C}$

$\dots,\;\boldsymbol{M}_{T-1}+\boldsymbol{K}_{T-1}^{\intercal}\boldsymbol{D}_{T-1}\boldsymbol{K}_{T-1},\;\boldsymbol{M}_{T}),$

$\boldsymbol{\tilde{K}}$

$\boldsymbol{\tilde{s}}=\boldsymbol{\tilde{A}}\boldsymbol{s}_{0}+\boldsymbol{\tilde{B}}\boldsymbol{\tilde{k}}+\boldsymbol{\tilde{G}}\boldsymbol{\tilde{w}}+\boldsymbol{\tilde{G}}\boldsymbol{\tilde{d}},$

$\boldsymbol{\mu}_{\boldsymbol{\tilde{s}}},\quad\boldsymbol{\tilde{\Sigma}}_{\boldsymbol{\tilde{s}}}$

$\boldsymbol{\tilde{\Sigma}}_{\boldsymbol{\tilde{a}}}=\boldsymbol{\tilde{K}}\boldsymbol{\tilde{A}}\boldsymbol{\Sigma}_{\boldsymbol{s}_{0}}\boldsymbol{\tilde{A}}^{\intercal}\boldsymbol{\tilde{K}}^{\intercal}+\boldsymbol{\tilde{K}}\boldsymbol{\tilde{G}}\boldsymbol{\tilde{\Sigma}}_{\boldsymbol{\tilde{w}}}\boldsymbol{\tilde{G}}^{\intercal}\boldsymbol{\tilde{K}}^{\intercal}.$

$J(\boldsymbol{\tilde{s}},\boldsymbol{\tilde{a}})=-\mathbb{E}[\boldsymbol{\tilde{s}}^{\intercal}\boldsymbol{\tilde{M}}_{C}\boldsymbol{\tilde{s}}]+\mathbb{E}[2\boldsymbol{\tilde{s}}_{g}^{\intercal}\boldsymbol{\tilde{M}}\boldsymbol{\tilde{s}}]-\mathbb{E}[\boldsymbol{\tilde{s}}_{g}^{\intercal}\boldsymbol{\tilde{M}}\boldsymbol{\tilde{s}}_{g}]\dots$

$\dots+\mathbb{E}[2\boldsymbol{\tilde{s}}_{r}^{\intercal}\boldsymbol{\tilde{K}}^{\intercal}\boldsymbol{\tilde{D}}\boldsymbol{\tilde{K}}\boldsymbol{\tilde{s}}]-\mathbb{E}[2\boldsymbol{\tilde{a}}_{r}^{\intercal}\boldsymbol{\tilde{D}}\boldsymbol{\tilde{K}}\boldsymbol{\tilde{s}}]-\mathbb{E}[2\boldsymbol{\tilde{k}}^{\intercal}\boldsymbol{\tilde{D}}\boldsymbol{\tilde{K}}\boldsymbol{\tilde{s}}]\dots$

$\dots-\mathbb{E}[\boldsymbol{\tilde{s}}_{r}^{\intercal}\boldsymbol{\tilde{K}}^{\intercal}\boldsymbol{\tilde{D}}\boldsymbol{\tilde{K}}\boldsymbol{\tilde{s}}_{r}]+\mathbb{E}[2\boldsymbol{\tilde{a}}_{r}^{\intercal}\boldsymbol{\tilde{D}}\boldsymbol{\tilde{K}}\boldsymbol{\tilde{s}}_{r}]+\mathbb{E}[2\boldsymbol{\tilde{k}}^{\intercal}\boldsymbol{\tilde{D}}\boldsymbol{\tilde{K}}\boldsymbol{\tilde{s}}_{r}]\dots$

$\dots-\mathbb{E}[\boldsymbol{\tilde{a}}_{r}^{\intercal}\boldsymbol{\tilde{D}}\boldsymbol{\tilde{a}}_{r}]-\mathbb{E}[2\boldsymbol{\tilde{k}}^{\intercal}\boldsymbol{\tilde{D}}\boldsymbol{\tilde{a}}_{r}]-\mathbb{E}[\boldsymbol{\tilde{k}}^{\intercal}\boldsymbol{\tilde{D}}\boldsymbol{\tilde{k}}].$

Full text

Chance-Constrained Trajectory Optimization for

Non-linear Systems with Unknown Stochastic Dynamics

Onur Celik1, Hany Abdulsamad2 and Jan Peters2,3

This work has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement #640554.

1 Onur Celik is with the Department of Computer Science, Universität Tübingen. [email protected]

2 Hany Abdulsamad and Jan Peters are with the Department of Computer Science, Intelligent Autonomous Systems, Technische Universität Darmstadt. {abdulsamad, peters}@ias.tu-darmstadt.de

3 Jan Peters is with the Max Planck Institute for Intelligent Systems.

Abstract

Iterative trajectory optimization techniques for non-linear dynamical systems are among the most powerful and sample-efficient methods of model-based reinforcement learning and approximate optimal control. By leveraging time-variant local linear-quadratic approximations of system dynamics and reward, such methods can find both a target-optimal trajectory and time-variant optimal feedback controllers. However, the local linear-quadratic assumptions are a major source of optimization bias that leads to catastrophic greedy updates, raising the issue of proper regularization. Moreover, the approximate models’ disregard for any physical state-action limits of the system causes further aggravation of the problem, as the optimization moves towards unreachable areas of the state-action space. In this paper, we address the issue of constrained systems in the scenario of online-fitted stochastic linear dynamics. We propose modeling state and action physical limits as probabilistic chance constraints linear in both state and action and introduce a new trajectory optimization technique that integrates these probabilistic constraints by optimizing a relaxed quadratic program. Our empirical evaluations show a significant improvement in learning robustness, which enables our approach to perform more effective updates and avoid premature convergence observed in state-of-the-art algorithms.

I Introduction

Model-based reinforcement learning has played an important role in the latest surge of popular research interest in learning-control of autonomous systems [1]. More specifically, trajectory-centric optimization techniques of non-linear dynamics have proven to be extremely sample efficient in comparison to model-free policy search approaches [2, 3, 4].

With the notable exception of [5], model-based trajectory optimization techniques [6, 7] are closely related to differential dynamic programming methods (DDP), initially presented in [8] and further generalized in [9]. DDP is a powerful approach for generating optimal trajectories with optimal time-variant feedback controllers. By relying on linear-quadratic approximations of the dynamics and reward around a nominal trajectory, DDP-based methods can leverage the local approximations to iteratively optimize both the trajectory and tracking feedback controllers in closed-form via dynamic programming [10]. This view of control has a computational advantage over direct optimization techniques such as collocation methods, which solve large optimization problems directly in the trajectory space and generally result only in open-loop control sequences [11].

However, despite its success, DDP still suffers from multiple shortcomings. On the one hand, the greedy exploitation of poor local approximations of dynamics is a major problem that leads to premature convergence. This issue has been effectively addressed in recent research by proposing different schemes of regularization [2, 6, 7]. On the other hand, state and action constraints present a serious challenge, as they introduce hard non-linearities that cannot be straightforwardly incorporated into the dynamic programming framework. The effect of constraints becomes more severe in settings where a global model is not available for automatic differentiation, hence requiring the linear approximation of the dynamics to be fitted online from samples.

We view these issues of DDP as interlocked. The inability of time-variant local linear models to consider state and action constraints results in updates that exploit unreachable parts of the state-action space, leading to catastrophically poor linear-quadratic approximations in regions subject to hard non-linearities. Moreover, considering constraints becomes more challenging in scenarios with stochastic dynamics, in that the true state of the system is hidden and only available through sufficient statistics. Another crucial aspect in a stochastic setting is the infinite support of the noisy measurements, which results in the constraints being active over the whole state-action space.

To address these issues, we propose an augmented view of DDP that introduces the physical limits as probabilistic chance constraints linear in state and action. When considering time-variant linear-Gaussian approximations of the dynamics, we can relax the generally non-convex chance constraints by applying Boole’s inequality. This relaxation allows us to formulate an additional quadratic program that forces the optimized nominal trajectory to stay in a feasible state-action region with high probability, all while considering the feedback gains optimized by DDP.

Several approaches to trajectory optimization for non-linear systems address the problem of constrained dynamics on different levels. In the domain of deterministic environments, Tassa et al. considered action box-constraints in [12], while the authors in [13] introduce soft state-action limits via a Lagrange function augmentation. A more sophisticated integration of constraints is presented in [14], in which the authors formulate a quadratic program to determine the active set of constraints at every iteration. In a stochastic setting, the work by Van Den Berg et al. [15] introduces probabilistic constraints as direct penalty terms on the cost function.

Furthermore, probabilistic constraints are considered in the context of linear optimal control. In [16] the authors optimally handle probabilistic constraints by ellipsoidal relaxation for finite-horizon open-loop scenarios, while in [17] a similar problem is tackled by applying Boole’s inequality. In [18] Vitus et al. propose an algorithm to extend the work in [17] and [16] by considering closed-loop uncertainty and optimizing the risk allocation. Finally, in [19] the problem of infeasible initial solutions is addressed by progressively introducing the constraint into the objective.

We situate our contribution in the class of differential dynamic programming for stochastic non-linear systems subject to probabilistic constraints in state and action. We empirically show that our proposed approach can deal with highly non-linear constrained dynamic environments, leading to better overall performance and a robust learning process by virtue of improved online-fitted local approximations.

II Chance-Constrained Optimization

Chance constraints arise naturally in different fields of optimization when considering stochastic systems. For an overview, we refer to [20]. Dealing with such probabilistic constraints proves to be challenging, as they are often non-convex and hard to evaluate without resorting to computationally expensive sampling techniques. These difficulties have motivated further research into tractable forms of chance constraints, which led to several convex approximations [21]. This work will focus on using Boole’s inequality for constraint relaxation. A detailed description in the context of trajectories will follow in Section II-B.

II-A Problem Formulation

Consider the constrained optimal control problem with probabilistic state and action constraints and unknown stochastic time-discrete transition dynamics

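$\max_{\boldsymbol{A}}\; J(\boldsymbol{s},\boldsymbol{A})$

$\text{s.t.}\quad\boldsymbol{s}_{t+1}\sim P(\boldsymbol{s}_{t+1}\mid\boldsymbol{s}_{t},\boldsymbol{a}_{t}),$

$\Pr(\boldsymbol{s}_{0:T}\in\mathcal{S})\geq 1-\theta,$

$\Pr(\boldsymbol{a}_{0:T-1}\in\mathcal{A})\geq 1-\vartheta,$

where $\mathcal{S}$ and $\mathcal{A}$ are the feasible state and action spaces respectively. The probability levels $\theta,\vartheta$ are hyperparameters that influence the risk behavior in terms of violating the constraints. The goal of this constrained optimization is to maximize the objective by finding the optimal action sequence $\boldsymbol{A}$. In general, we consider the expected cumulative reward for a trajectory of length $T$ in the quadratic form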

[TABLE]

where $\boldsymbol{M}$ and $\boldsymbol{D}$ are positive-definite weight matrices of appropriate dimensions and $\boldsymbol{s}_{g}$ is the target state. Note that a quadratic objective is not necessarily required, and non-quadratic objectives can be locally approximated by quadratic forms.

II-B Relaxation of Chance Constraints

Chance constraints can be conservatively relaxed by applying Boole’s inequality [22, 23, 24]. For the purpose of brevity, only upper-bound state constraints are considered. However, the same relaxation procedure can be straightforwardly applied to obtain a lower-bound and to relax the action constraints. Generally, the state-linear joint chance constraint for a whole trajectory is formulated as

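$\Pr(\boldsymbol{s}_{0:T}\in\mathcal{S})=\Pr\Big(\bigcap_{t=0}^{T}\boldsymbol{h}_{t}^{\intercal}\boldsymbol{s}_{t}\leq b_{t}\Big)\geq 1-\theta,$

where $\boldsymbol{h}_{t}$ and $b_{t}$ parameterize the half-plane defined by the constraints. Consequently, the probability that a trajectory lies within the feasible set is constrained to be at least $1-\theta$. In the framework of stochastic programming, it is usually beneficial to reformulate Equation (II-B) into separate inequalities over individual constraints [20]. This is achieved by rewriting the intersection as the complement of a union of violation events and bounding the probability of that union by the sum of the individual violation probabilities (Boole’s inequality):

$\Pr\Big(\bigcap_{t=0}^{T}\boldsymbol{h}_{t}^{\intercal}\boldsymbol{s}_{t}\leq b_{t}\Big)\geq 1-\sum_{t=0}^{T}\Big[1-\Pr(\boldsymbol{h}_{t}^{\intercal}\boldsymbol{s}_{t}\leq b_{t})\Big].$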
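As a rough numerical illustration of the union bound used here, the following sketch compares the Boole lower bound with the joint probability in the special case of independent time-steps. The standardized margins are hypothetical values chosen for illustration, and independence is assumed only to obtain a closed-form joint probability for comparison; the bound itself holds for correlated states as well.

```python
from statistics import NormalDist


def boole_lower_bound(marginal_probs):
    """Lower bound on the joint satisfaction probability: 1 - sum_t (1 - p_t)."""
    return 1.0 - sum(1.0 - p for p in marginal_probs)


# Hypothetical standardized margins (b_t - h_t^T mu_t) / sqrt(h_t^T Sigma_t h_t).
standardized_margins = [3.0, 2.5, 2.8, 3.2]
std_normal = NormalDist()
marginals = [std_normal.cdf(z) for z in standardized_margins]

bound = boole_lower_bound(marginals)

# Comparison case only: with independent time-steps the joint probability
# is the product of the per-step probabilities.
joint_if_independent = 1.0
for p in marginals:
    joint_if_independent *= p

print(f"Boole lower bound:        {bound:.5f}")
print(f"joint (independent case): {joint_if_independent:.5f}")
```

The bound is slightly below the independent-case joint probability, which is the sense in which the relaxation is conservative.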

The sum in Inequality (II-B) can now be treated as a collection of single probabilities per time-step

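$\Pr(\boldsymbol{h}_{t}^{\intercal}\boldsymbol{s}_{t}\leq b_{t})\geq 1-\theta_{t},$

where $\sum_{t=0}^{T}\theta_{t}=\theta$. By assuming a Gaussian probability density, a common assumption in control applications, Equation (II-B) is rewritten using the cumulative distribution function

$\frac{1}{2}\left[1+\operatorname{erf}\left(\frac{b_{t}-\boldsymbol{h}_{t}^{\intercal}\boldsymbol{\mu}_{\boldsymbol{s}_{t}}}{\sqrt{2\boldsymbol{h}_{t}^{\intercal}\boldsymbol{\Sigma}_{\boldsymbol{s}_{t}}\boldsymbol{h}_{t}}}\right)\right]\geq 1-\theta_{t},$

$b_{t}-\boldsymbol{h}_{t}^{\intercal}\boldsymbol{\mu}_{\boldsymbol{s}_{t}}-\sqrt{2\boldsymbol{h}_{t}^{\intercal}\boldsymbol{\Sigma}_{\boldsymbol{s}_{t}}\boldsymbol{h}_{t}}\,\operatorname{erf}^{-1}(1-2\theta_{t})\geq 0,$

where $\boldsymbol{\mu}_{\boldsymbol{s}_{t}}$ and $\boldsymbol{\Sigma}_{\boldsymbol{s}_{t}}$ are the state mean and covariance respectively. Moreover, due to properties of the error function, the inequality $\sum_{t=0}^{T}\theta_{t}\leq\theta<0.5$ is conservatively enforced by setting $\theta_{t}=\theta/T$ and requiring $\theta<0.5$, as in [24].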
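The equivalence between the probabilistic per-step constraint and its deterministic counterpart can be checked numerically. The sketch below is an illustration under assumptions, not the authors' implementation: the half-plane, mean, and covariance values are hypothetical, and the inverse error function is obtained from the standard normal quantile in Python's standard library via $\operatorname{erf}^{-1}(y)=\Phi^{-1}((y+1)/2)/\sqrt{2}$.

```python
import math
from statistics import NormalDist


def erfinv(y):
    """Inverse error function for -1 < y < 1, via the standard normal quantile."""
    return NormalDist().inv_cdf((y + 1.0) / 2.0) / math.sqrt(2.0)


def projected_moments(h, mu, sigma):
    """Mean h^T mu and variance h^T Sigma h of the scalar h^T s."""
    n = len(h)
    mean = sum(h[i] * mu[i] for i in range(n))
    var = sum(h[i] * sigma[i][j] * h[j] for i in range(n) for j in range(n))
    return mean, var


def constraint_margin(h, b, mu, sigma, theta_t):
    """Left-hand side of the deterministic constraint; feasible when >= 0."""
    mean, var = projected_moments(h, mu, sigma)
    return b - mean - math.sqrt(2.0 * var) * erfinv(1.0 - 2.0 * theta_t)


def satisfaction_probability(h, b, mu, sigma):
    """Pr(h^T s <= b) for Gaussian s, written with the error function."""
    mean, var = projected_moments(h, mu, sigma)
    return 0.5 * (1.0 + math.erf((b - mean) / math.sqrt(2.0 * var)))


# theta and the horizon follow the values quoted in the text; the rest is hypothetical.
theta, horizon = 0.01, 100
theta_t = theta / horizon
h, b = [1.0, 0.0], 2.0
sigma = [[0.04, 0.0], [0.0, 0.01]]

for mu in ([1.0, 0.0], [1.5, 0.0]):
    margin = constraint_margin(h, b, mu, sigma, theta_t)
    prob = satisfaction_probability(h, b, mu, sigma)
    print(f"mean={mu[0]:.1f}  margin={margin:+.4f}  Pr={prob:.6f}  feasible={margin >= 0.0}")
```

The first mean satisfies the constraint with a positive margin, while the second sits too close to the bound for the required probability level, so its margin is negative.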

II-C Iterative Linear Quadratic Gaussian Control (iLQG)

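We base our trajectory optimization technique on DDP/iLQG methods. This section provides a short overview of the principles of DDP [8] and iLQG [2]. For any arbitrary time-indexed reward function $R_{t}$, the trajectory optimization objective is the expected cumulative reward

$J(\boldsymbol{s},\boldsymbol{A})=\mathbb{E}\left[\sum_{t=0}^{T-1}R_{t}(\boldsymbol{s}_{t},\boldsymbol{a}_{t})+R_{T}(\boldsymbol{s}_{T})\right].$

DDP and iLQG leverage the principle of dynamic programming to simplify the optimization over a complete sequence of actions $\boldsymbol{a}_{0:T-1}$ to an optimization over single actions $\boldsymbol{a}_{t}$ for each time-step. For this purpose the time-indexed state-value function is introduced

$V_{t}(\boldsymbol{s})=\max_{\boldsymbol{a}_{t}}\left[R_{t}(\boldsymbol{s}_{t},\boldsymbol{a}_{t})+\sum_{\boldsymbol{s}_{t+1}}V_{t+1}(\boldsymbol{s}_{t+1})P(\boldsymbol{s}_{t+1}\mid\boldsymbol{s}_{t},\boldsymbol{a}_{t})\right],$

over which the dynamic programming backward recursion is performed. By assuming linear transition dynamics and quadratic rewards along a nominal trajectory, optimal feedback controllers can be derived in closed-form. DDP and iLQG consider the perturbed state-action-value function $Q_{t}(\delta\boldsymbol{s},\delta\boldsymbol{a})=R_{t}(\boldsymbol{s}_{t}+\delta\boldsymbol{s},\boldsymbol{a}_{t}+\delta\boldsymbol{a})-R_{t}(\boldsymbol{s}_{t},\boldsymbol{a}_{t})+V_{t+1}\left(\mathcal{P}(\boldsymbol{s}_{t}+\delta\boldsymbol{s},\boldsymbol{a}_{t}+\delta\boldsymbol{a})\right)-V_{t+1}\left(\mathcal{P}(\boldsymbol{s}_{t},\boldsymbol{a}_{t})\right)$ and approximate it by a second-order Taylor expansion

$Q_{t}(\delta\boldsymbol{s},\delta\boldsymbol{a})\approx\frac{1}{2}\begin{bmatrix}1\\ \delta\boldsymbol{s}\\ \delta\boldsymbol{a}\end{bmatrix}^{\intercal}\begin{bmatrix}0&\boldsymbol{Q}_{s,t}^{\intercal}&\boldsymbol{Q}_{a,t}^{\intercal}\\ \boldsymbol{Q}_{s,t}&\boldsymbol{Q}_{ss,t}&\boldsymbol{Q}_{sa,t}\\ \boldsymbol{Q}_{a,t}&\boldsymbol{Q}_{as,t}&\boldsymbol{Q}_{aa,t}\end{bmatrix}\begin{bmatrix}1\\ \delta\boldsymbol{s}\\ \delta\boldsymbol{a}\end{bmatrix}.$

The subscripts $s$ and $a$ denote first- and second-order partial derivatives with respect to state and action. The entries of $Q_{t}(\delta\boldsymbol{s},\delta\boldsymbol{a})$ are given as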

[TABLE]

The main difference between iLQG and DDP is that iLQG neglects the second-order derivatives of the dynamics. Given these approximations, the optimal feedback controller is given as $\delta\boldsymbol{a}^{*}=-\boldsymbol{Q}_{aa,t}^{-1}(\boldsymbol{Q}_{a,t}+\boldsymbol{Q}_{as,t}\delta\boldsymbol{s})=\boldsymbol{K}_{t}\delta\boldsymbol{s}+\boldsymbol{k}_{t}$. Inserting $\delta\boldsymbol{a}^{*}$ into $Q_{t}(\delta\boldsymbol{s},\delta\boldsymbol{a})$ returns the update equations of the state-value function per time-step

[TABLE]

During the forward pass, new trajectories of the stochastic non-linear dynamics are sampled by propagating the actions through the real system

[TABLE]

where $\boldsymbol{s}_{r,t},\boldsymbol{a}_{r,t}$ denote the mean state and action at time $t$ from the last iteration and are also referred to as the nominal or reference trajectory, here denoted by the subscript $r$ .

Special care has to be taken during the backward pass of DDP and iLQG to ensure that $\boldsymbol{Q}_{aa,t}$ is negative-definite, which has inspired different regularization schemes. In DDP, this regularization is commonly applied to $\boldsymbol{Q}_{aa,t}$ as $\boldsymbol{\tilde{Q}}_{aa,t}=\boldsymbol{Q}_{aa,t}-\mu\boldsymbol{I}$ , with $\mu\geq 0$ . However, other regularizations directly affecting the value function have been shown to be more effective [2], and will be used throughout this work.
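To make the gain computation and the role of $\mu$ concrete, here is a minimal sketch for the special case of a scalar action, where $\boldsymbol{Q}_{aa,t}$ is a single number and "negative-definite" simply means negative. The function name and all numerical values are hypothetical, and the sketch uses the $\boldsymbol{Q}_{aa,t}-\mu\boldsymbol{I}$ regularization described above rather than the value-function regularization of [2] that the paper adopts.

```python
def feedback_gains(q_a, q_aa, q_as, mu=0.0):
    """Gains for a scalar action: K = -Q_as / Q_aa_reg, k = -Q_a / Q_aa_reg.

    q_a  : first-order term Q_a (float)
    q_aa : second-order term Q_aa (float), must be negative after regularization
    q_as : mixed term Q_as as a list with one entry per state dimension
    mu   : non-negative regularizer, applied as Q_aa - mu
    """
    q_aa_reg = q_aa - mu
    if q_aa_reg >= 0.0:
        raise ValueError("regularized Q_aa must be negative; increase mu")
    k_ff = -q_a / q_aa_reg
    k_fb = [-q / q_aa_reg for q in q_as]
    return k_fb, k_ff


# Hypothetical expansion terms at one time-step (2 states, 1 action).
K_plain, k_plain = feedback_gains(q_a=0.4, q_aa=-2.0, q_as=[1.0, -0.5])
K_reg, k_reg = feedback_gains(q_a=0.4, q_aa=-2.0, q_as=[1.0, -0.5], mu=1.0)

print("no regularization:", K_plain, k_plain)
print("mu = 1.0:         ", K_reg, k_reg)
```

A larger $\mu$ makes the regularized curvature more negative and therefore shrinks both the feedforward step and the feedback gains, which is the damping effect the regularization is meant to provide.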

II-D Augmented Linearized Closed-Loop System

To formulate the chance-constrained optimization problem, we first introduce the notation and system description of the online-fitted time-variant linear system. Following [19], our approach optimizes the feedforward terms of the control while satisfying the constraints for the linearized dynamics and maintaining the feedback gains computed during the backward pass of DDP/iLQG.

Given $N$ trajectories from the non-linear system as described in Equation (II-C), we fit linear-Gaussian models to the sampled data via regularized linear regression. Consequently we obtain the transition and control matrices $\boldsymbol{A}_{t},\boldsymbol{B}_{t}$ , as well as the bias vector $\boldsymbol{c}_{t}$ for each time-step. The resulting time-variant linear dynamics $\boldsymbol{s}_{t+1}=\boldsymbol{A}_{t}\boldsymbol{s}_{t}+\boldsymbol{B}_{t}\boldsymbol{a}_{t}+\boldsymbol{c}_{t}+\boldsymbol{w}_{t}$ , with $\boldsymbol{w}_{t}\sim\mathcal{N}(\boldsymbol{0},\Sigma_{t})$ , and the controller $\boldsymbol{a}_{t}=\boldsymbol{K}_{t}(\boldsymbol{s}_{t}-\boldsymbol{s}_{r,t})+\boldsymbol{k}_{t}+\boldsymbol{a}_{r,t}$ are used to formulate the closed-loop linear system $\boldsymbol{s}_{t+1}=\boldsymbol{\hat{A}}_{t}\boldsymbol{s}_{t}+\boldsymbol{B}_{t}\boldsymbol{k}_{t}+\boldsymbol{d}_{t}+\boldsymbol{w}_{t}$ , where $\boldsymbol{\hat{A}}_{t}=\boldsymbol{A}_{t}+\boldsymbol{B}_{t}\boldsymbol{K}_{t}$ and $\boldsymbol{d}_{t}=\boldsymbol{c}_{t}-\boldsymbol{B}_{t}\boldsymbol{K}_{t}\boldsymbol{s}_{r,t}+\boldsymbol{B}_{t}\boldsymbol{a}_{r,t}$ .
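The closed-loop form above can be sanity-checked by comparing one noise-free step computed two ways: first by evaluating the controller and the open-loop model, then by using $\boldsymbol{\hat{A}}_{t}$ and $\boldsymbol{d}_{t}$ directly. The sketch below uses plain nested lists and hypothetical numbers for a system with two states and one action; the helper names are illustrative only.

```python
def mat_vec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def mat_mul(a, b):
    """Matrix-matrix product for nested lists."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def vec_add(*vectors):
    """Element-wise sum of any number of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


def closed_loop(A, B, c, K, s_ref, a_ref):
    """Return A_hat = A + B K and d = c - B K s_ref + B a_ref."""
    BK = mat_mul(B, K)
    A_hat = [[A[i][j] + BK[i][j] for j in range(len(A[0]))] for i in range(len(A))]
    d = vec_add(c, [-x for x in mat_vec(BK, s_ref)], mat_vec(B, a_ref))
    return A_hat, d


# Hypothetical fitted model for one time-step (2 states, 1 action).
A = [[1.0, 0.1], [0.0, 1.0]]
B = [[0.0], [0.1]]
c = [0.0, 0.01]
K = [[-2.0, -1.0]]   # feedback gain
k = [0.5]            # feedforward term
s_ref, a_ref = [0.2, 0.0], [0.1]
s = [0.3, -0.1]      # current state

# Route 1: evaluate the controller, then the open-loop dynamics (noise omitted).
deviation = [si - ri for si, ri in zip(s, s_ref)]
a = vec_add(mat_vec(K, deviation), k, a_ref)
next_open = vec_add(mat_vec(A, s), mat_vec(B, a), c)

# Route 2: use the closed-loop form directly.
A_hat, d = closed_loop(A, B, c, K, s_ref, a_ref)
next_closed = vec_add(mat_vec(A_hat, s), mat_vec(B, k), d)

print(next_open, next_closed)
```

Both routes give the same next state up to floating-point rounding, which confirms that the feedback gain and the reference trajectory have been absorbed into $\boldsymbol{\hat{A}}_{t}$ and $\boldsymbol{d}_{t}$, leaving only $\boldsymbol{k}_{t}$ as a free input.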

To represent the closed-loop system over an entire trajectory we use the augmented notation

[TABLE]

The augmented weighting matrices for the quadratic objective take the form

[TABLE]

and the closed-loop linearized stochastic dynamics is written in terms of the augmented notation as

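$\boldsymbol{\tilde{s}}=\boldsymbol{\tilde{A}}\boldsymbol{s}_{0}+\boldsymbol{\tilde{B}}\boldsymbol{\tilde{k}}+\boldsymbol{\tilde{G}}\boldsymbol{\tilde{w}}+\boldsymbol{\tilde{G}}\boldsymbol{\tilde{d}},$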

which in turn can be decomposed to the mean and covariance of a Gaussian state density

[TABLE]

where $\boldsymbol{\tilde{\Sigma}}_{\boldsymbol{\tilde{w}}}$ are the stacked estimates of the covariance for each time-step, taken under the $N$ samples drawn during the last forward pass. Furthermore, given the feedback gains, we compute the action covariance along the trajectory

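$\boldsymbol{\tilde{\Sigma}}_{\boldsymbol{\tilde{a}}}=\boldsymbol{\tilde{K}}\boldsymbol{\tilde{A}}\boldsymbol{\Sigma}_{\boldsymbol{s}_{0}}\boldsymbol{\tilde{A}}^{\intercal}\boldsymbol{\tilde{K}}^{\intercal}+\boldsymbol{\tilde{K}}\boldsymbol{\tilde{G}}\boldsymbol{\tilde{\Sigma}}_{\boldsymbol{\tilde{w}}}\boldsymbol{\tilde{G}}^{\intercal}\boldsymbol{\tilde{K}}^{\intercal}.$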
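The stacked expressions in this section have a simple per-time-step counterpart. The sketch below propagates the marginal state mean and variance, and the induced action variance, through a scalar time-variant closed-loop system. It assumes a Gaussian initial state and independent Gaussian noise per step, computes only per-step marginals (no cross-time covariances), and uses hypothetical numbers throughout; it is a simplified stand-in for the stacked matrix form, not a reproduction of it.

```python
def propagate_moments(a_hat, b, d, k, K, noise_var, mu0, var0):
    """Propagate moments through s_{t+1} = a_hat_t s_t + b_t k_t + d_t + w_t.

    Returns the state means, the state variances (both of length T + 1),
    and the action variances K_t^2 * var_t (length T).
    """
    means, variances, action_variances = [mu0], [var0], []
    for t in range(len(a_hat)):
        # Feedback on an uncertain state makes the action uncertain as well.
        action_variances.append(K[t] ** 2 * variances[t])
        # The mean follows the noise-free closed-loop dynamics.
        means.append(a_hat[t] * means[t] + b[t] * k[t] + d[t])
        # The variance is scaled by the closed-loop gain and grows by the noise.
        variances.append(a_hat[t] ** 2 * variances[t] + noise_var[t])
    return means, variances, action_variances


# Hypothetical two-step scalar system.
means, variances, action_variances = propagate_moments(
    a_hat=[0.9, 0.8], b=[0.1, 0.1], d=[0.0, 0.05],
    k=[1.0, -1.0], K=[-0.5, -0.4], noise_var=[0.01, 0.02],
    mu0=1.0, var0=0.04,
)
print("state means:     ", means)
print("state variances: ", variances)
print("action variances:", action_variances)
```

Note that the feedforward terms enter only the mean, while the variances depend on the feedback gains and the noise. This is why the chance constraints can become linear in the feedforward terms once the gains are fixed.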

II-E Augmented Objective and Relaxed Chance Constraints

We simplify Objective (II-A) by using the stacked notation and the closed-loop matrices from Section II-D

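$J(\boldsymbol{\tilde{s}},\boldsymbol{\tilde{a}})=-\mathbb{E}[\boldsymbol{\tilde{s}}^{\intercal}\boldsymbol{\tilde{M}}_{C}\boldsymbol{\tilde{s}}]+\mathbb{E}[2\boldsymbol{\tilde{s}}_{g}^{\intercal}\boldsymbol{\tilde{M}}\boldsymbol{\tilde{s}}]-\mathbb{E}[\boldsymbol{\tilde{s}}_{g}^{\intercal}\boldsymbol{\tilde{M}}\boldsymbol{\tilde{s}}_{g}]\dots$

$\dots+\mathbb{E}[2\boldsymbol{\tilde{s}}_{r}^{\intercal}\boldsymbol{\tilde{K}}^{\intercal}\boldsymbol{\tilde{D}}\boldsymbol{\tilde{K}}\boldsymbol{\tilde{s}}]-\mathbb{E}[2\boldsymbol{\tilde{a}}_{r}^{\intercal}\boldsymbol{\tilde{D}}\boldsymbol{\tilde{K}}\boldsymbol{\tilde{s}}]-\mathbb{E}[2\boldsymbol{\tilde{k}}^{\intercal}\boldsymbol{\tilde{D}}\boldsymbol{\tilde{K}}\boldsymbol{\tilde{s}}]\dots$

$\dots-\mathbb{E}[\boldsymbol{\tilde{s}}_{r}^{\intercal}\boldsymbol{\tilde{K}}^{\intercal}\boldsymbol{\tilde{D}}\boldsymbol{\tilde{K}}\boldsymbol{\tilde{s}}_{r}]+\mathbb{E}[2\boldsymbol{\tilde{a}}_{r}^{\intercal}\boldsymbol{\tilde{D}}\boldsymbol{\tilde{K}}\boldsymbol{\tilde{s}}_{r}]+\mathbb{E}[2\boldsymbol{\tilde{k}}^{\intercal}\boldsymbol{\tilde{D}}\boldsymbol{\tilde{K}}\boldsymbol{\tilde{s}}_{r}]\dots$

$\dots-\mathbb{E}[\boldsymbol{\tilde{a}}_{r}^{\intercal}\boldsymbol{\tilde{D}}\boldsymbol{\tilde{a}}_{r}]-\mathbb{E}[2\boldsymbol{\tilde{k}}^{\intercal}\boldsymbol{\tilde{D}}\boldsymbol{\tilde{a}}_{r}]-\mathbb{E}[\boldsymbol{\tilde{k}}^{\intercal}\boldsymbol{\tilde{D}}\boldsymbol{\tilde{k}}].$

Given that the expectations are of linear-quadratic quantities under Gaussian densities, it is possible to evaluate this objective in closed-form. Once the feedback gains are fixed, this objective depends only on the feedforward terms $\boldsymbol{\tilde{k}}$ and can be reformulated as $\tilde{J}(\boldsymbol{\tilde{k}})$.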

Following the relaxation presented in Section II-B and using the stacked notation we can write the upper and lower state-linear chance constraints as

[TABLE]

where $\boldsymbol{\tilde{h}}$ and $\boldsymbol{\tilde{b}}$ parameterize the upper and lower half-planes of the state constraints and $\boldsymbol{\tilde{\theta}}_{u}$ and $\boldsymbol{\tilde{\theta}}_{l}$ denote the probability values per time-step, all stacked and indexed by $u$ and $l$ respectively. Analogously, the action constraints of the closed-loop system can be formulated

[TABLE]

where $\boldsymbol{\lambda}_{u}=\sqrt{2\boldsymbol{\tilde{f}}_{u}^{\intercal}\boldsymbol{\tilde{\Sigma}}_{\boldsymbol{\tilde{a}}}\boldsymbol{\tilde{f}}_{u}}\odot\boldsymbol{\operatorname{erf}}^{-1}(\boldsymbol{1}-2\boldsymbol{\tilde{\vartheta}}_{u})$ and $\boldsymbol{\lambda}_{l}=\sqrt{2\boldsymbol{\tilde{f}}_{l}^{\intercal}\boldsymbol{\tilde{\Sigma}}_{\boldsymbol{\tilde{a}}}\boldsymbol{\tilde{f}}_{l}}\odot\boldsymbol{\operatorname{erf}}^{-1}(2\boldsymbol{\tilde{\vartheta}}_{l}-\boldsymbol{1})$ , $\boldsymbol{\tilde{f}}$ and $\boldsymbol{\tilde{z}}$ are the stacked half-plane parameters of the action constraints and $\boldsymbol{\tilde{\vartheta}}_{u},\boldsymbol{\tilde{\vartheta}}_{l}$ are the stacked upper and lower bound probabilities per time-step. The operator $\odot$ denotes the element-wise multiplication.

II-F Chance-Constrained Trajectory Optimization

Based on the formulations introduced in Section II-D and Section II-E, it is possible to construct an optimization problem around the reference trajectory to find a sequence of feedforward terms $\boldsymbol{\tilde{k}}$ that maintain the Constraints (8-11).

The resulting optimization is a quadratic program with linear constraints in $\boldsymbol{\tilde{k}}$ . Thus, the probabilistic problem reduces to a deterministic one, which can be solved efficiently with many numerical solvers, for example, qpOASES [25] within the CasADi framework [26]. The complete dynamic programming and optimization loop is described in Algorithm 1 and is summarized as follows: During an initial forward pass, we obtain $N$ trajectory samples, around which the dynamics is linearized for each time-step. The linearized dynamics is used to perform the backward pass of iLQG and obtain the feedback and feedforward controllers along the reference trajectory. These controllers are then used to formulate the closed-loop linearized system with the stacked notation and to warm-start the quadratic program. The solution of the constrained program returns the optimal feedforward sequence $\boldsymbol{k}_{t}$ , which is used to perform the next forward pass and linearization. Following [2], we also use the hyperparameter $\alpha$ that scales the feedforward control in order to keep the next forward pass of the non-linear system in a valid trust-region around the linear-quadratic approximations.
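To give a feel for how the probabilistic problem turns into a deterministic one, the following toy sketch solves a one-step, scalar version of the program. It is a deliberately simplified stand-in, not the stacked quadratic program of the paper and not a use of qpOASES or CasADi: with a single decision variable and a single linear constraint, the concave quadratic is maximized by clipping the unconstrained optimum to the feasible bound. The function name, the assumptions $b>0$ and $s_{r}=a_{r}=0$, and all numbers are hypothetical.

```python
import math
from statistics import NormalDist


def solve_scalar_ccto_step(a_hat, b, d, K, mu0, var0, noise_var,
                           s_goal, s_max, M, D, theta):
    """Maximise the expected quadratic reward over the feedforward term k
    subject to Pr(s_1 <= s_max) >= 1 - theta, for s_1 = a_hat s_0 + b k + d + w
    and action a = K s_0 + k. Assumes b > 0 and zero reference trajectory."""
    if b <= 0.0:
        raise ValueError("this sketch assumes b > 0")
    # The state variance does not depend on the feedforward term k.
    var1 = a_hat ** 2 * var0 + noise_var
    # sqrt(2 var1) * erfinv(1 - 2 theta) equals sqrt(var1) * Phi^{-1}(1 - theta).
    lam = math.sqrt(var1) * NormalDist().inv_cdf(1.0 - theta)
    # Unconstrained maximiser of
    #   -M * ((a_hat*mu0 + b*k + d - s_goal)**2 + var1)
    #   -D * ((K*mu0 + k)**2 + K**2 * var0)
    k_free = (M * b * (s_goal - a_hat * mu0 - d) - D * K * mu0) / (M * b ** 2 + D)
    # Deterministic constraint, linear in k: a_hat*mu0 + b*k + d + lam <= s_max.
    k_max = (s_max - a_hat * mu0 - d - lam) / b
    k_opt = min(k_free, k_max)
    mean1 = a_hat * mu0 + b * k_opt + d
    return k_free, k_opt, mean1, var1


k_free, k_opt, mean1, var1 = solve_scalar_ccto_step(
    a_hat=0.75, b=0.5, d=0.0, K=-0.5, mu0=0.0, var0=0.01, noise_var=0.01,
    s_goal=2.0, s_max=1.0, M=1.0, D=0.1, theta=0.01,
)
prob = NormalDist(mean1, math.sqrt(var1)).cdf(1.0)
print(f"unconstrained k = {k_free:.4f}, constrained k = {k_opt:.4f}")
print(f"Pr(s_1 <= s_max) = {prob:.4f}")
```

In this example the goal lies beyond the state limit, so the unconstrained optimum would drive the mean past the bound; the constrained solution backs off until the limit is respected with exactly the requested probability.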

III Empirical Evaluation

We evaluate our approach on two highly non-linear dynamical tasks, the Furuta pendulum [27] and a Cart-Pole environment. Both problems are under-actuated and have state and action constraints. We consider quadratic reward functions for both experiments and set the probability values for violating the constraints to $\theta_{u}=\theta_{l}=\vartheta_{u}=\vartheta_{l}=0.01$.

Furuta Pendulum Swing-Up

In the Furuta pendulum, the state is represented by the angles of both links and the corresponding angular velocities. Only the horizontal link is actuated and is subject to both state and action constraints. To make the environment stochastic, we introduce both action and process noise. We run our experiment under identical conditions for CCTO and iLQG. We fix the feedforward scalar $\alpha$ to $0.05$ for both algorithms and perform 20 seeded trials, each with 45 iterations and 50 rollouts per iteration. The resulting performance curve of both algorithms can be seen in Figure 1. Furthermore, we present the planned nominal trajectories, as well as the planned nominal actions of both algorithms for one trial. The filled space is the area between the minimum and maximum values of states and actions and should not be confused with a probability distribution over trajectories. The advantage of our approach is clear. CCTO reaches better overall performance with a higher final reward and smaller standard deviation (Table I). iLQG frequently and consistently plans trajectories that violate the constraints, while CCTO keeps the state and action trajectories within a feasible space. This consideration leads to an improved approximation of the non-linear system dynamics and allows CCTO to perform robust improvement steps during the optimization. This result is affirmed by the low regularization values of CCTO (Table II).

Cart-Pole Swing-Up

For the well-known Cart-Pole environment, we consider constraints on the position of the cart as well as on the action. To make the task more challenging, we again apply action and process noise, enforce harsh action constraints and limit the time horizon to 100 time steps, the equivalent of 2 seconds. We evaluate iLQG and CCTO on 20 seeded trials, each with 55 iterations and 50 rollouts per iteration. We set the feedforward scaling parameter $\alpha$ to $0.1$. Analogously to the last experiment, Figure 2 depicts the performance curve of iLQG and CCTO, as well as the spaces of planned nominal trajectories for the cart’s position and the corresponding actions. In this experiment, iLQG moves very quickly towards a local optimum and does not manage to swing the Cart-Pole up. In contrast, CCTO performs the swing-up by finding a suitable nominal trajectory in the feasible constrained space. Tables III and IV reflect the performance discrepancy between both algorithms, in terms of total rewards and needed regularization.

IV Conclusion and Future Research

We have proposed a new trajectory optimization technique, based on the framework of differential dynamic programming, that takes into consideration probabilistic chance constraints in stochastic environments with unknown non-linear dynamics. We used Boole’s inequality to conservatively relax the non-convex chance constraints, enabling us to formulate a constrained quadratic program and optimize the nominal trajectory to stay inside the feasible set defined by the probabilistic linear state and action limits. We have provided a thorough derivation of our approach and empirically demonstrated the advantage of enforcing physical limits on two simulated highly dynamical and stochastic non-linear systems. The results indicate that incorporating the chance constraints leads to higher fidelity in the online-fitted local linear-quadratic approximations, and consequently greatly influences the robustness of the iterative optimization process. This observation is reflected in very low regularizations in comparison to standard iLQG.

In future research, we will extend our optimization to include not only the nominal trajectory but also the feedback gains, and we will consider optimizing the probabilistic constraint bounds via risk allocation to achieve dynamic risk measures across time and iterations. In addition, we plan to move to the fully stochastic optimization framework of maximum-entropy iLQG [6] to avoid regularization heuristics of the DDP framework.

Bibliography

[1] M. P. Deisenroth, G. Neumann, and J. Peters, “A survey on policy search for robotics,” Foundations and Trends® in Robotics, 2013.
[2] Y. Tassa, T. Erez, and E. Todorov, “Synthesis and stabilization of complex behaviors through online trajectory optimization,” in IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012.
[3] M. P. Deisenroth, D. Fox, and C. E. Rasmussen, “Gaussian processes for data-efficient learning in robotics and control,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
[4] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, 2016.
[5] M. Deisenroth and C. E. Rasmussen, “PILCO: A model-based and data-efficient approach to policy search,” in Proceedings of the 28th International Conference on Machine Learning, 2011.
[6] S. Levine and P. Abbeel, “Learning neural network policies with guided policy search under unknown dynamics,” in Advances in Neural Information Processing Systems, 2014.
[7] H. Abdulsamad, O. Arenz, J. Peters, and G. Neumann, “State-regularized policy search for linearized dynamical systems,” in International Conference on Automated Planning and Scheduling, 2017.
[8] D. H. Jacobson and D. Q. Mayne, “Differential dynamic programming,” 1970.