TL;DR
This paper introduces a neural network-based, data-driven approach to approximate solutions of high-dimensional Hamilton-Jacobi-Bellman equations, enabling real-time feedback control for complex nonlinear systems without discretizing the state space.
Contribution
It presents a novel method that leverages physics-informed neural networks and adaptive data generation to solve high-dimensional HJB equations efficiently.
Findings
Successfully applied to 6D rigid body attitude control
Extended to systems of dimension up to 30
Achieved real-time feedback control capabilities
Abstract
Computing optimal feedback controls for nonlinear systems generally requires solving Hamilton-Jacobi-Bellman (HJB) equations, which are notoriously difficult when the state dimension is large. Existing strategies for high-dimensional problems often rely on specific, restrictive problem structures, or are valid only locally around some nominal trajectory. In this paper, we propose a data-driven method to approximate semi-global solutions to HJB equations for general high-dimensional nonlinear systems and compute candidate optimal feedback controls in real-time. To accomplish this, we model solutions to HJB equations with neural networks (NNs) trained on data generated without discretizing the state space. Training is made more effective and data-efficient by leveraging the known physics of the problem and using the partially-trained NN to aid in adaptive data generation. We demonstrate…
| BVP convergence | mean integration time | |
|---|---|---|
| 1 | 0.37 s | |
| 2 | 0.44 s | |
| 3 | 0.40 s | |
| 4 | 0.45 s | |
| 8 | 0.53 s |
| num. trajectories | training time | value RMAE | gradient RM | |
|---|---|---|---|---|
| 10 | 163 | 25 min | ||
| 20 | 128 | 48 min | ||
| 30 | 145 | 62 min |
| BVP convergence | mean integration time | ||
|---|---|---|---|
| 10 | 4 | 0.7 s | |
| 6 | 0.8 s | ||
| 10 | 1.3 s | ||
| 20 | 4 | 3.6 s | |
| 5 | 4.2 s | ||
| 6 | 4.7 s | ||
| 30 | 4 | 11.3 s | |
| 6 | 14.6 s | ||
| 8 | 19.1 s |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
\newsiamremark
remarkRemark \newsiamremarkassumptionAssumption
\headersDeep Learning for HJB EquationsT. Nakamura-Zimmerer, Q. Gong, and W. Kang
Adaptive Deep Learning for High-Dimensional Hamilton-Jacobi-Bellman Equations††thanks: Preliminary results of this work appeared in the American Control Conference 2020 [30]. \funding: The work of the first and second authors was partially supported with funding from the Defense Advanced Research Projects Agency (DARPA) grant FA8650-18-1-7842.
Tenavi Nakamura-Zimmerer Department of Applied Mathematics, Baskin School of Engineering, University of California, Santa Cruz (, ). [email protected]
Qi Gong22footnotemark: 2
Wei Kang Department of Applied Mathematics, Naval Postgraduate School, Monterery, CA, (). [email protected]
Abstract
Computing optimal feedback controls for nonlinear systems generally requires solving Hamilton-Jacobi-Bellman (HJB) equations, which are notoriously difficult when the state dimension is large. Existing strategies for high-dimensional problems often rely on specific, restrictive problem structures, or are valid only locally around some nominal trajectory. In this paper, we propose a data-driven method to approximate semi-global solutions to HJB equations for general high-dimensional nonlinear systems and compute candidate optimal feedback controls in real-time. To accomplish this, we model solutions to HJB equations with neural networks (NNs) trained on data generated without discretizing the state space. Training is made more effective and data-efficient by leveraging the known physics of the problem and using the partially-trained NN to aid in adaptive data generation. We demonstrate the effectiveness of our method by learning solutions to HJB equations corresponding to the attitude control of a six-dimensional nonlinear rigid body, and nonlinear systems of dimension up to 30 arising from the stabilization of a Burgers’-type partial differential equation. The trained NNs are then used for real-time feedback control of these systems.
keywords:
Hamilton-Jacobi-Bellman Equations, Optimal Feedback Control, Nonlinear Dynamical Systems, Deep Learning, Neural Networks, Optimization
{AMS}
49K15, 49L20, 49N35, 68T05, 90C30, 93C15, 93C20
1 Introduction
For the optimal control of nonlinear dynamical systems, it is well-known that open-loop controls are not robust to model uncertainty or disturbances. For slowly evolving processes, it is possible to use model predictive control by recomputing the open-loop optimal solutions for relatively short time horizons in the future. However, for most applications one typically desires a feedback control law, as feedback controls are inherently more robust to disturbances. In principle, optimal feedback controllers can synthesized by solving a (discretized) Hamilton-Jacobi-Bellman (HJB) equation, a partial differential equation (PDE) in spatial dimensions plus time. The size of the discretized problem increases exponentially with , making direct solution intractable for even moderately large problems. This is the so-called “curse of dimensionality.”
For this reason, there is an extensive literature on methods of finding approximate solutions for HJB equations. Some key examples include series expansions [3, 28, 23], level set methods [32], patchy dynamic programming [9, 31], semi-Lagrangian methods [5, 17], method of characteristics and Hopf formula-based algorithms [15, 11, 38], and polynomial approximation [22]. These existing methods suffer one or more of the following drawbacks: the problem’s dimension is limited; the accuracy of the solution is hard to verify for general systems; the solution may be valid only in a small region; or the system model must have certain special algebraic structure.
In [24, 25], semi-global solutions to HJB equations are computed by combining the method of characteristics with sparse state space discretization. In this approach, a two-point boundary value problem (BVP) is solved at each point in a sparse grid. These BVPs can be solved independently, making the algorithm causality-free. This property is attractive because the computation does not depend on a grid, and hence they can be applied to high-dimensional problems. The Hopf formula methods [15, 11, 38] also have this property, though it is achieved in a different way and under certain convexity/concavity assumptions. Causality-free methods are usually too slow for online computation, but they are perfectly parallelizable so can be used to generate large data sets offline. Such data sets can then be used to construct faster solutions such as sparse grid interpolants [24, 25] or, as in this paper, neural networks.
Using neural networks (NNs) as a basis for solving HJB equations is not by itself a new idea, and deep learning approaches have led to promising results; see for instance [2, 10, 35, 21, 34, 20]. To the best of our knowledge, state-of-the-art NN-based techniques generally rely on either minimizing the residual of the PDE and (artificial) boundary conditions at randomly sampled collocation points [2, 10, 35, 34]; or, due to computational limitations, approximating the control and/or HJB solution and its gradient in a small neighborhood of a nominal trajectory [21, 20]. In [13, 14], a specialized NN architecture is proposed to solve some classes of Hamilton-Jacobi equations, but this method has yet to be generalized to state-dependent HJB equations arising in optimal control. Deep learning techniques have also been proposed for solving high-dimensional stochastic optimal control problems (see e.g. [18, 19, 4]).
In this paper, we develop a computational method for solving high-dimensional HJB equations and synthesizing candidate optimal feedback controllers. Our approach is data-driven and consists of three main steps. First, we generate a small set of open-loop optimal control solutions using a causality-free algorithm based on Pontryagin’s Minimum Principle (PMP). In the second step, we use the data set to train a NN to approximate the solution to the HJB equation, called the value function. During training, we supply information about the value function gradient, which encourages the NN to learn the shape of the value function rather than just fitting point data. We also estimate the number of samples needed to obtain a good model. Additional samples are chosen in regions where the value function is difficult to learn, and are obtained quickly with the aid of the NN. In this sense our method involves adaptive sampling. Lastly, the accuracy of the NN is verified on independent data generated using the same causality free algorithm from the first step. Unlike other NN-based methods for deterministic HJB equations, our approach does not require computing expensive PDE residuals and the solution is valid over large spatial domains.
As an illustrative example, the method is applied to design an attitude controller of a rigid-body satellite equipped with momentum wheels. This is a highly nonlinear problem with spatial dimensions and control inputs. With the proposed method, we obtain a model of the value function with accuracy comparable to that obtained in [25], but require far fewer sample trajectories to do so. Scalability of the method is tested on problems of dimension , 20, and 30 arising from pseudospectral discretization of a Burgers’-type PDE. We show that the method is capable of handling these high-dimensional problems without simplifying the dynamics.
Through these examples, we demonstrate several advantages and potential capabilities of the proposed framework. These include solving HJB equations over semi-global domains with empirically validated levels of accuracy, progressive generation of rich data sets, and computationally efficient nonlinear feedback control for real-time applications. Solution of high-dimensional problems is enabled by efficient and adaptive causality-free data generation, physics-informed learning, and the inherent capacity of NNs for dealing with high-dimensional data.
1.1 Abbreviations and notation
Here we present a brief list of some of the abbreviations, terminology, and notation used in this paper.
[TABLE]
2 A causality-free method for HJB equations
We consider fixed final time optimal control problems (OCP) of the form
[TABLE]
Here is the state, is the control, and is a Lipschitz continuous vector field. is the cost functional which is composed of , the terminal cost, and , the running cost. We assume that the cost functional is convex in and . In this paper we consider the case where the final time is fixed.
For a given initial condition , many numerical methods exist to compute the optimal open-loop solution,
[TABLE]
The open-loop control Eq. 2 which solves Eq. 1 is valid for all , but only for the fixed initial condition . Due to various sources of disturbance and real-time application requirements, for practical implementation one typically desires an optimal control in closed-loop feedback form,
[TABLE]
which can be evaluated online given any and a measurement of .
To compute the optimal feedback control, we follow the standard procedure in dynamic programming (see e.g. [27]) and define the value function as the optimal cost-to-go of Eq. 1 starting at . That is,
[TABLE]
It can be shown that the value function is the unique viscosity solution [12] of the Hamilton-Jacobi-Bellman (HJB) PDE,
[TABLE]
where we denote and . Note that if the value function is , then it is the unique classical solution of Eq. 5.
To compute the control given the value function , we start by defining the Hamiltonian
[TABLE]
where is the costate. The optimal control satisfies the Hamiltonian minimization condition,
[TABLE]
If we denote the minimized Hamiltonian by , then Eq. 5 can be expressed as
[TABLE]
If Eq. 8 can be solved (in the viscosity sense), then it provides both necessary and sufficient conditions for optimality. Moreover, the optimal feedback control is computed by substituting
[TABLE]
into Eq. 7 to get
[TABLE]
This means that with available, the feedback control is obtained as the solution of an (ideally straightforward) optimization problem.
2.1 Pontryagin’s Minimum Principle
To make use of Eq. 10, we need an efficient way to approximate the value function and its gradient. Like [24, 25], rather than solve the full HJB equation Eq. 8 on a grid, we exploit the fact that the characteristics of solutions to Eq. 8 evolve according to a two-point BVP, well-known in optimal control as Pontryagin’s Minimum Principle (PMP):
[TABLE]
The two-point BVP provides a necessary condition for optimality. If we further assume that the solution is optimal, then along the characteristic we have that
[TABLE]
In [24, 25], the two-point BVP Eq. 11 is solved for each point in a sparse grid. Applying Eq. 12, the value function and its gradient are then calculated using high-dimensional interpolation. This technique is called the sparse grid characteristics method. But even in a sparse grid the number of points grows like , where is the state dimension and is the number of grid points in each dimension. Thus one may have to solve a prohibitively large number of BVPs for higher-dimensional problems. Instead of sparse grid interpolation, we use data from solved BVPs to train a NN to approximate the value function. This approach is completely grid-free and hence applicable in high dimensions.
Remark 2.1**.**
In general, the BVP admits multiple solutions which can sometimes be sub-optimal. The characteristics of the value function satisfy Eq. 11, but there may be other solutions to these equations which are sub-optimal and therefore not characteristics of the value function. In many problems it is also possible for the characteristics to intersect, giving rise to non-smooth value functions and difficulties in applying Eq. 9.
Optimality of solutions to the BVP can be guaranteed under some convexity conditions (see e.g. [29]). For most dynamical systems it is difficult to verify such conditions globally, but we can guarantee optimality locally around an equilibrium point [28]. Addressing the challenge of global optimality in a broader context is beyond the scope of the present work, so in this paper we assume that solutions to the two-point BVP Eq. 11 are optimal. Under this assumption, the relationship between PMP and the value function as given in Eq. 12 holds everywhere.
*Note the proposed method can still be applied to problems where this assumption cannot be verified. In such cases PMP remains the prevailing tool for computing candidate optimal solutions, and from these the proposed method will yield a feedback controller which satisfies necessary conditions for optimality. *
2.2 Causality-free data generation
While solving the BVP is easier than solving the full HJB equation, we know of no general algorithm that is reliable and fast enough for real-time applications. However, in our approach the real-time feedback control computation is done by a NN which is trained offline. Thus we can solve the BVP offline to generate data for training and evaluating such a NN. For this purpose, numerically solving the BVP can be manageable although it may require some parameter tuning. In this paper, we use an implementation of the BVP solver introduced in [26]. This algorithm is based on a three-stage Lobatto IIIa discretization, a collocation formula which provides a solution that is fourth-order accurate. But the algorithm is highly sensitive to the initial guess for and : there is no guarantee of convergence with an arbitrary initial guess, and in most cases a good initial guess for cannot be derived from the problem physics. Furthermore, convergence is increasingly dependent on good initializations as we increase the length of the time interval.
To overcome this difficulty, we employ the time-marching trick from [24, 25]. This is a continuation technique in which we sequentially extend the solution from an initially short time interval to the final time . Specifically, we choose a sequence of intermediate times
[TABLE]
in which is small. For the short time interval , the BVP solver converges given most initial guesses near the initial state . Then the resulting trajectory is rescaled over the longer time interval . The rescaled trajectory is used as the initial guess to solve the BVP over . We repeat this process until , at which we obtain the full solution. By appropriately tuning the time sequence , we can largely overcome the problem of sensitivity to initial guesses.
Computing many such solutions becomes expensive, which means that generating the large data sets necessary to train a NN can be difficult. With this in mind, we use the time-marching trick only to generate a small initial data set, and adaptively adding more points during training. The key to doing this efficiently is simulating the system dynamics using the partially-trained NN to close the loop. The closed-loop trajectory and predicted costate provide good guesses for the optimal state and costate, so that we can immediately solve Eq. 11 for all of . Besides being more computationally efficient than time-marching, this approach also requires no parameter tuning. Details are presented in Section 4, and numerical comparisons between this method and time-marching are given in Sections 5.2 and 6.2.
As an alternative to either of these two approaches, one could use backpropagation as suggested in [20]. However, this method does not allow one to choose initial conditions independently and so cannot be considered fully causality-free.
3 Neural network approximation of the value function
Neural networks have become a popular tool for modeling high-dimensional functions, since they are not dependent on discretizing the state space. In this paper, we apply NNs to approximate solutions of the HJB equation and evaluate the resulting feedback control in real-time. Specifically, we carry out the following steps:
Initial data generation: We compute the value function, , along trajectories from initial conditions chosen by Monte Carlo sampling. Data is generated by solving the BVP as discussed in Section 2.2. In this initial data generation step, we require relatively few data points since more data can be added later at little computational cost. 2. 2.
Model training: Given this data set, we train a NN to approximate the value function. Learning is guided by the underlying structure of the problem, specifically by asking the NN to satisfy Eq. Eq. 9. In doing so, we regularize the model and make efficient use out of small data sets. 3. 3.
Adaptive data generation: In the initial training phase we only have a small data set, so the NN only roughly approximates the value function. We now expand the data set by generating data in regions where the value function is likely to be steep or complicated, and thus difficult to learn. Generating additional data is made efficient by good initial guesses obtained from NN-in-the-loop simulations of the system dynamics. 4. 4.
Model refinement and validation: We continue training the model and increasing the size of the data set until we satisfy some convergence criteria. Then, we check the generalization accuracy of the trained NN on a new set of validation data computed at Monte Carlo sample points. 5. 5.
Feedback control: We compute the feedback control online by evaluating the gradient of the trained NN and applying PMP. Notably, evaluation of the gradient is exact and it is extremely cheap even for large , enabling real-time implementation in high-dimensional systems.
The crux of the proposed method depends on modeling the value function Eq. 4 over a semi-global domain . We present details of this process in the following subsections. In Section 3.1, we review the basic structure of feedforward NNs and describe how we train a NN to model the value function. Then in Section 3.2, we propose a simple way to incorporate information about the known solution structure into training. Finally in Section 3.4, we demonstrate how to use the trained NN for feedback control. The adaptive data generation scheme is treated separately in Section 4. The proposed method is illustrated in Section 5 by solving a practical optimal attitude control problem for a rigid body satellite, and then applied to solve larger problems in Section 6.
3.1 Feedforward neural networks
In this paper we use multilayer feedforward NNs. While many more sophisticated architectures have been developed for other applications, we find this basic architecture to be more than adequate for our purposes. Let be the function we wish to approximate and be its NN representation. Feedforward NNs approximate complicated nonlinear functions by a composition of simpler functions, namely
[TABLE]
where each layer is defined as
[TABLE]
Here and are the weight matrices and bias vectors, respectively. represents a nonlinear activation function applied component-wise to its argument; popular choices include ReLU, tanh, and other similar functions. In this paper, we use tanh for all the hidden layers. The final layer, , is typically linear, so is the identity function.
Let denote the collection of the parameters of the NN, i.e.
[TABLE]
The NN is trained by optimizing over the parameters to best approximate by . Specifically, by solving the BVP Eq. 11 from a set of randomly sampled initial conditions, we get a data set
[TABLE]
where are the inputs, are the outputs to be modeled, and are the indices of the data points. In the most naïve setting, the NN is then trained by solving the nonlinear regression problem,
[TABLE]
3.2 Physics-informed machine learning
Motivated by the development of physics-informed neural networks [33], we expect that we can improve on the rudimentary loss function in Eq. 13 by incorporating information about the underlying physics. In [33], and in particular in the context of HJB equations in [2, 10, 35, 34], the known underlying PDE and boundary conditions are imposed by minimizing a residual loss over spatio-temporal collocation points. In this approach, no data is gathered: the PDE is solved directly in the least-squares sense. But this residual must be evaluated over a large number of collocation points and can be rather expensive to compute. Thus we propose a simpler approach of modeling the costate along with the value function itself, taking full advantage of the ability to gather data along the characteristics of the HJB PDE.
Specifically, we know that the costate must satisfy Eq. Eq. 9, so we train the NN to minimize
[TABLE]
where is the gradient of the NN model with respect to the state. This quantity is calculated using automatic differentiation. In machine learning, automatic differentiation is usually used to compute gradients with respect to the model parameters, but is just as easy to apply to computing gradients with respect to inputs. This gradient is exact, so no finite difference approximations are needed. In addition, the computational graph is pre-compiled so evaluating the gradient is cheap.
Costate data is obtained for each trajectory as a natural product of solving the BVP Eq. 11. Hence we have the augmented data set,
[TABLE]
where . We now define the physics-informed learning problem,
[TABLE]
Here is a scalar weight, the loss with respect to data is
[TABLE]
and the gradient regularization is defined as
[TABLE]
Following standard practice, when computing the loss functions Eqs. 16 and 17, the output data is linearly scaled to the range to improve the scaling of the optimization problem.
A NN trained to minimize Eq. 15 learns not just to fit the value data, but it is rewarded for doing so in a way that respects the underlying structure of the problem. Gradient regularization takes the known solution structure into account; this makes it preferable to the usual or regularization, which are based on the (heuristic) principle that simpler representations of data are likely to generalize better. Furthermore, we recall that the optimal control depends explicitly on – see Eqs. Eq. 10 and Eq. 20. Accurate approximation of is therefore essential for calculating optimal controls. Our method achieves this through automatic differentiation to compute exact gradients and by minimization of the gradient loss term Eq. 17.
3.3 Model validation
In common practice, one randomly partitions the given data set Eq. 14 into a training set and validation set . During training, the loss functions Eq. 16 and Eq. 17 are calculated with respect to the training data . We then evaluate the performance of the NN against the validation data , which it did not observe during training. Good validation performance indicates that the NN generalizes well, i.e. it did not overfit the training data. We make the validation test more stringent by generating and from independently drawn initial conditions, so that the two data sets do not share any part of the same trajectories.
We consider the following error metrics for validation. First, the relative mean absolute error (RMAE) of value function prediction, which is defined as
[TABLE]
We also measure the relative mean error (RM) of gradient prediction, which is defined as
[TABLE]
We consider these error metrics instead of pointwise relative errors in order to emphasize predictive accuracy in regions where a lot of control effort is needed. This is important because we are interested in designing nonlinear controllers which are effective and efficient far away from the equilibrium.
3.4 Neural network in the closed-loop system
Once the NN is trained, evaluating at new inputs is highly efficient. Moreover, since we minimized the gradient loss Eq. 17 during training, we also expect to approximate the true gradient well. At runtime, whenever the feedback control needs to be computed, we evaluate and then solve Eq. 10 based on this approximation.
For many problems of interest, the optimization problem Eq. 10 admits an analytic or semi-analytic solution. In particular, for the important class of control affine systems with running cost convex in , we can solve Eq. 10 analytically. Suppose that the system dynamics can be written in the form
[TABLE]
where , , and the control is unconstrained. Further, suppose that the running cost is of the form
[TABLE]
for some convex function and some positive definite weight matrix . Then the Hamiltonian is
[TABLE]
Now we apply PMP, which for unconstrained control requires
[TABLE]
Letting and solving for yields the optimal feedback control law in explicit form:
[TABLE]
The resulting NN controller is then simply
[TABLE]
4 Adaptive sampling and model refinement
Since generating just a single data point requires solving a challenging BVP, it can be expensive to generate large data sets which adequately represent the value function. This necessitates training using limited data and a method to generate new data in a smart and efficient way. In this paper, effective training with small data sets is accomplished by incorporating information about the costate as discussed in Section 3.2, but also by combining progressive data generation with an efficient adaptive sampling technique.
Optimization methods in machine learning (see e.g. [6] for a comprehensive survey) are typically divided into second and first order methods. Second order methods like L-BFGS [8] rely on accurate gradient computations, and hence generally have to use the entire data set. For this reason they are often referred to as batch or full-batch methods. On the other hand, first order methods based on stochastic gradient descent (SGD) use only small subsets, or mini-batches, of the full data set. That is, at each optimization iteration , the loss functions in Eq. 13 and Eq. 15 are evaluated only on a subset with . Here denotes the number of data points in a data set . Although second order methods converge much more quickly than first order methods, the necessary gradient calculations are prohibitively expensive for large data sets. Consequently, SGD variants have become the de facto standard for machine learning applications.
But in the context of deep learning, our NNs are small and data sets smaller. Thus we expect second order methods to be superior for our purposes. With a small initial data set, which we denote by , we find that training a low-fidelity model is very fast using L-BFGS. After this initial round, we want to increase the size of the data set so that it better captures the features of the value function. We then continue training the model using this larger data set, . We continue this process until some convergence conditions are satisfied.
Our approach is similar to and inspired by a progressive batching method proposed in [7]. The primary difference is that the problem addressed in [7] is a standard machine learning problem, where a massive data set is available from the start. This allows one to increase the sample size every few iterations, and take a completely different sample from the available data. In our problem, start with only a small amount of data and we can generate more as we go, but since data generation is expensive, we would like to generate only as much as is needed.
4.1 Convergence test and sample size selection
In this section we derive a convergence test and sample size selection scheme for the purpose of progressive data generation. To start, suppose that the internal optimizer (e.g. L-BFGS) converges in optimization round and let be the available training data set. Given convergence of the internal optimizer, the first order necessary condition for optimality holds, so
[TABLE]
Here is the physics-informed loss defined in Eq. Eq. 15, and is its gradient with respect to the NN parameters . For true first order optimality, we would like the gradient to be small when evaluated over the entire continuous domain of interest, . In other words, we want
[TABLE]
where the Monte Carlo sums in Eqs. Eqs. 16 and 17 become integrals in the limit as the size of the data set approaches infinity.
The simplest way to see if Eq. 23 holds is to generate a validation data set . Then using the fact that in the limit as , one checks if, for example,
[TABLE]
for some small parameter . Convergence tests like Eq. 24 are standard in machine learning and are useful for testing generalization performance. But for many practical problems, it may be too expensive to generate enough validation data to make the test meaningful. More importantly, such tests provides no clear guidance in selecting the sample size should they not be satisfied.
In this paper, we use validation tests to quantify model accuracy after training is complete (see Section 3.3). Indeed, the ability to empirically validate solutions is a key benefit of the causality-free approach. For the purpose of determining convergence between training rounds, however, we propose a different statistically motivated test which provides information on choosing . The idea is simple: since we already assume Eq. 22 holds, then to ensure that Eq. 23 is also satisfied, it suffices to check that the error in approximating Eq. 23 by Eq. 22 is relatively small.
To motivate this more rigorously, consider a finite sample set with fixed size , and assume that the sample points are independent and identically distributed (i.i.d.)111In practice, while initial conditions are i.i.d., points at future times lie along the optimal trajectories coming from these initial conditions and are thus spatially correlated. Adaptive sampling (see Section 4.2) also introduces sample dependence. This likely reduces sample variance compared to i.i.d. data, but we still find the numerical tests useful for providing sample size guidelines. In addition, if we learn only the initial-time value function as in Section 5, then sample independence can be upheld if we forego adaptive sample placement.. By design, if are i.i.d. then the sample gradient is an unbiased estimator for the true gradient (evaluated over the entire continuous domain). That is,
[TABLE]
where denotes the population mean over all possible finite sample sets with fixed size . Intuitively, Eq. 25 implies that if Eq. 22 holds, then on average we also have Eq. 23, as desired. But we must control the mean square error (MSE) of the estimator, which is given by
[TABLE]
To simplify this, using linearity of the expectation we obtain
[TABLE]
and then by construction of the loss function,
[TABLE]
Using the simplifying assumption that are i.i.d., this becomes
[TABLE]
If the estimation error is small, then the sample mean is likely to be a good approximation of the true mean. Hence we expect that will also be small as desired. To this end, we require that the root MSE not be too large compared to the expected gradient. Specifically, we check if
[TABLE]
where is a scalar parameter. On the right hand side we use the norm instead of the as it is less sensitive to outliers in the loss gradient. In practice we find that this makes the test less likely to suggest unreasonably large sample sizes.
In practice, evaluating of the true population variances on the left hand side of Eq. 28 is computationally intractable. But we can approximate these by the corresponding sample variances222Computing a large number of individual gradients can still be too costly, so we often evaluate sample variances over a smaller subset of the training data. taken over all data , which we denote by :
[TABLE]
Similarly, we approximate the expected gradient on the right hand side of Eq. 28 by the sample gradient and arrive at the following practical convergence criterion:
[TABLE]
If the convergence test Eq. 29 is satisfied, then it is likely that the expected gradient is also small. In other words, we expect that the parameters satisfies the first order optimality conditions evaluated over the entire domain, so we can stop optimization. Satisfaction of Eq. 29 does not imply that the trained model is good – merely that seeing more data would probably not improve it significantly. On the other hand, when the criterion is not met, then it guides us in selecting the next sample size . Concretely, suppose that the ratio of the sample variance to the sample gradient doesn’t change significantly by increasing the size of the data set, i.e.
[TABLE]
Then the appropriate choice of to satisfy Eq. 29 after the next round is such that
[TABLE]
where is a scalar parameter which prevents the data set size from growing too quickly. Throughout this paper we use .
The convergence test Eq. 29 and sample size selection scheme Eq. 30 derived above are close to that used in [7], except that we employ the norm of the sample gradient in the denominator instead of the norm. We prefer the norm because it is less sensitive to outliers in the loss gradient. Intuitively, this improves robustness by making the test less likely to suggest unreasonably large sample sizes. We also contribute a different derivation, coming from the perspective of progressive data generation as opposed to sampling from a large pre-existing data set. Finally, like [7] our results are not specific to learning solutions to the HJB equation. They can be applied to many data-driven optimization problems where data is scarce but can be generated over time. Notably, these results facilitate the use of existing algorithms for second order and constrained optimization in such applications.
4.2 Adaptive data generation with NN warm start
The sample size selection criterion Eq. 30 we propose indicates how many data are necessary to satisfy the convergence test Eq. 29, assuming a uniform sampling from the domain. In practice, since all the data we generate will be new, we can choose to generate new data where it is needed most, hence the term adaptive sampling. This condition for generating new data can be interpreted in many ways. In this paper, we concentrate samples where is large. Regions of the value function with large gradients tend to be steep or complicated, and thus may benefit from having more data to learn from. Furthermore, these regions correspond to places where the control effort is large and hence we would like controllers to be especially accurate there.
Specifically, for each initial condition we want to integrate, we can first randomly sample a set of candidate initial conditions from . A quick pass through the NN yields the predicted gradient at all candidate points:
[TABLE]
We then choose the point(s) with the largest predicted gradient norms and solve the BVP Eq. 11 for each of these. To aid in solving these BVPs, instead of using the time-marching trick described in Section 2.2, we simulate the system dynamics using the partially-trained NN as the closed-loop controller and predicting along the trajectory. In most cases, this yields an approximate solution which is reasonably close to the optimal state and costate. By supplying this trajectory as an initial guess to the BVP solver, we then quickly and reliably obtain a solution to the BVP for the full time interval . This process is repeated for new initial conditions until we obtain the desired amount of data (each trajectory may contain hundreds of data points). We refer to this technique as a NN warm start. A summary of the full training procedure is given in Algorithm 1.
Algorithm 1 enables us to build up a rich data set and a high-fidelity model of . Moreover, the data set is not constrained to lie within a small neighborhood of some nominal trajectory. It can contain points from the entire domain , and we can concentrate more data near complicated features of the value function. As we progressively refine the NN model, we can adjust the gradient loss weight , as well as other hyperparameters such as the internal optimizer convergence tolerance and the number of terms in the L-BFGS Hessian approximation. As the NN is already partially-trained, fewer iterations should be needed for convergence in each round so we can afford to make each iteration more expensive.
5 Application to rigid body attitude control
To illustrate the capabilities of proposed method, we consider the six-state rigid body model of a satellite studied by Kang and Wilcox [24, 25]. With the sparse grid characteristics method, they interpolate the value function at initial time, , and use this for moving horizon feedback control of the nonlinear system. We use their successful results as a baseline for evaluating our method.
Let be an inertial frame of orthonormal vectors and let be a body frame. The state of the satellite is then written as . Here is the attitude of the satellite represented in Euler angles,
[TABLE]
in which , , and are the angles of rotation around , , and , respectively, in the order . These are also commonly called roll, pitch, and yaw. denotes the angular velocity in the body frame,
[TABLE]
For details see [16]. The state dynamics are
[TABLE]
Here are matrix-valued functions defined as
[TABLE]
and
[TABLE]
Further, is a combination of the inertia matrices of the momentum wheels and the rigid body without wheels, is the total constant angular momentum of the system, and is a constant matrix where is the number of momentum wheels. To control the system, we apply a torque .
We consider the fully-actuated case where . Let
[TABLE]
The optimal control problem is
[TABLE]
Here
[TABLE]
and
[TABLE]
Finally, we consider initial conditions in the domain
[TABLE]
In [25], to avoid discretizing time the value function is approximated only at initial time . In order to facilitate a fair comparison we do the same. This means that we model , i.e. the NN does not take time as an input variable. Consequently the control is implemented with a time-independent moving horizon rather than as a time-dependent optimal control. In other words, at each time when we evaluate the control, we assume and return . Controlling the system using moving horizon feedback is standard practice. It is also reasonable for the present case because the problem dynamics are time-invariant and the time horizon is relatively long. Because of this we observe near-optimal performance from the moving horizon controller.
5.1 Learning the value function
In this section, we present numerical results of our implementation of a NN for modeling the initial-time value function of the rigid body attitude control problem Eq. 31. To obtain data, we uniformly sample initial conditions from the domain defined in Eq. 32, and for each initial value, solve the two-point BVP Eq. 11 using time-marching and the SciPy [37] implementation of the three-stage Lobatto IIIa algorithm in [26]. Each integrated trajectory contains around 100 data points on average, but we use only initial time data, . For validation, we generate a data set containing data points (at ), and keep this fixed throughout all the tests. As a baseline, the sparse grid characteristics method with grid points achieves a RMAE of on this validation data set.
We implement a standard feedforward NN in TensorFlow 1.11 [1] and train it to approximate . The NN has three hidden layers with 64 neurons in each, but many alternate configurations of depth and width also work. For optimization, we use the SciPy interface for the L-BFGS optimizer [37, 8]. Figure 1 displays the results of a series of tests in which we vary the weight on the value gradient loss term Eq. 17 and the size of the training data set. Results are compared to those obtained in [25].
We highlight that with just 512 data points, we can train NNs with better accuracy than the sparse grid characteristics method with points. Thus for this problem, the proposed method is about 90 times as data-efficient. With 8192 data points, the NN can be almost four times as accurate as the sparse grid characteristics method. This level of accuracy with small data sets is obtained only with physics-informed learning. In particular, NNs trained by pure regression Eq. 13 cannot match the accuracy of the sparse grid characteristics method, as shown in Fig. 1 for the case with . Accuracy improves as we increase but with diminishing returns for . While physics-informed learning is more costly, it facilitates the use of much smaller data sets, and the increased training time is still quite short.
5.2 Training with adaptive data generation
Performing a thorough systematic study of the adaptive sampling and model refinement technique proposed in Section 4 is rather complicated, since a successful implementation depends on various hyperparameter settings, which can and perhaps should change each optimization round. Results also depend on random chance, since data points are chosen in a (partially) random way and the randomly-initialized NN training problem is highly non-convex. For this reason, in this section we show a just few conservative results which we feel illustrate the potential of the method.
Figure 2 shows the progress of the validation error during training when using adaptive sampling starting from a data set with points. We set the gradient loss weight to and the convergence parameter in Eq. 29 to . After each round, we check the convergence criterion Eq. 29 and increase the number of training data according to Eq. 30. Each data set includes all previously generated data, and we generate additional data as needed through Algorithm 1. With these configurations, the model passes the convergence test after seven training rounds and observing a total of samples.
The final value function accuracy is : over twice as accurate as the sparse grid method with about twenty times fewer data, and the gradient prediction accuracy is . As shown in Fig. 2, the gradient predictions of the network trained using the adaptive algorithm are just as accurate as a network trained on a fixed data set of samples. That is to say, the adaptive sampling method facilitates more acurate gradient predictions using fewer data. These results highlight the main advantages of the adaptive sampling and model refinement method: the ability to overcome an initial lack of data, efficiently generate a large data set, and improve gradient prediction accuracy which is needed for effective control. To fully realize the potential of the method, hyperparameters like , , and internal optimizer parameters need to be adjusted in each round. Development of algorithms to do this adaptively remains a topic for future research.
Next, we investigate the convergence of the BVP solver with time-marching and NN warm start. Results are given in Table 2 and Table 2, respectively. For these tests, we randomly sample candidate initial conditions and pick 1000 points with the largest predicted gradient norm, . Initial conditions with large gradient norm tend to be located in regions where the value function is steep and the control effort is large, and may thus be more difficult to solve. The set of initial conditions is fixed for all tests.
In the first row of Table 2, we attempt to solve the BVP with no time-marching, i.e. over the entire time interval without constructing any initial guess. In this case, the proportion of convergent solutions is extremely small, obviating the need for good initial guesses. As shown in Table 2, we reliably obtain solutions for this problem when we use at least time intervals. We note that the initial conditions are purposefully chosen to be difficult – if we simply take uniform samples from the domain , the proportion of convergent solutions increases significantly.
In Table 2, we present results using NN warm start. We train several NNs on a data set of only 64 points. Because the data set is so small, each NN takes only seconds to finish training. We also experiment with using different gradient loss weights for each NN. This directly impacts the accuracy in predicting the initial-time costate, , which in turn is key to synthezing optimal controls.
Even with these low-fidelity models, the rate of BVP convergence is just as high as when using time intervals for time-marching. The quality of initial guesses improves with better costate prediction, and it is not difficult to exceed convergence. For this problem, the speed of the two methods is about the same. However, when we consider higher-dimensional problems in Section 6.2, we find that NN warm start significantly improves both reliability and efficiency.
5.3 Closed-loop simulation
In this section we perform numerical simulations of the rigid body dynamics, demonstrating that the NN feedback controller is capable of stabilizing the system. Using Eq. 21 we calculate the optimal feedback control law
[TABLE]
Recall that because we are using a time-independent value function model, the control is implemented as time-independent moving horizon feedback. Since and are constant matrices, we pre-compute the product . Hence evaluation of the control requires only a forward pass through the computational graph of and a matrix multiplication.
In Fig. 3, we plot a typical closed-loop trajectory starting from a randomly sampled initial condition. To make the simulation more realistic, we implement the controller using a zero-order-hold with a sample rate of 10 [Hz]. In addition, we corrupt inputs to the controller with Gaussian white noise with standard deviation . That is, for all , we apply the control
[TABLE]
where
[TABLE]
In spite of this, the NN controller successfully stabilizes the system. Furthermore, the total cost of the closed-loop trajectory is , about more than the optimal cost . For comparison, a linear quadratic regulator (LQR) for Eq. 31 accumulates a total cost of , which is more than the optimal cost. Finally, short computation time is critical for implementation in real systems, and this is achieved here as each evaluation of the control takes only a couple milliseconds on both an NVIDIA RTX 2080Ti GPU and a 2012 MacBook Pro.
6 Application to control of Burgers’-type PDE
In this section, we test our method on high-dimensional nonlinear systems arising from a Chebyshev pseudospectral (PS) discretization of a one-dimensional forced Burgers’-type PDE. An infinite-horizon version of this problem is studied in [22], in which the value function is approximated using a polynomial Galerkin technique. We note that in [22], separability of the nonlinear dynamics is required to compute the high-dimensional integrals necessary in the Galerkin formulation. Our method does not require this restriction, although it does apply to this problem.
As in [22], let satisfy the following one-dimensional controlled PDE with Dirichlet boundary conditions:
[TABLE]
For notational convenience we have written , and as before we denote , , and . The scalar-valued control is actuated only on , the support of the indicator function
[TABLE]
The PDE-constrained optimal control problem is
[TABLE]
Here
[TABLE]
and we set
[TABLE]
In this problem, the goal of stabilizing is made more challenging by the added reaction term, , which renders the origin unstable. This can be seen clearly in Fig. 4(a).
To solve Eq. 35 using our framework, we perform Chebyshev PS collocation to transform the PDE Eq. 34 into a system of ordinary differential equations (ODEs). Following [36], let
[TABLE]
where is the number of collocation points. After accounting for boundary conditions, we collocate at internal (non-boundary) Chebyshev points, , , where . The discretized state is defined as
[TABLE]
and the PDE Eq. 34 becomes a system of ODEs in dimensions:
[TABLE]
In the above, “” denotes element-wise multiplication (Hadamard product), is the discretized indicator function, and are the internal parts of the first and second order Chebyshev differentiation matrices, which are obtained by deleting the first and last rows and columns of the full matrices. This discretization automatically enforces the boundary conditions. Finally, since is collocated at Chebyshev nodes, the inner product appearing in the cost function is conveniently approximated by Clenshaw-Curtis quadrature [36]:
[TABLE]
where are the internal Clenshaw-Curtis quadrature weights and is calculated element-wise, i.e. . Now the original OCP Eq. 35 can be reformulated as an ODE-constrained problem,
[TABLE]
6.1 Learning high-dimensional value functions
The state dimension of the OCP Eq. 36 can be adjusted, presenting a good opportunity to test the scalability of our algorithms. For this problem, we learn the value function with time-dependence, rather than just as in Section 5. Consequently, the resulting controls can be implemented as time-dependent controls or with a moving horizon. We consider the following domain of initial conditions:
[TABLE]
Using the proposed adaptive deep learning framework, we approximate solutions to Eq. 36 in , 20, and 30 dimensions. We focus on demonstrating what is possible using our approach, rather than carrying out a detailed study of its effectiveness under different parameter tunings. In [22] an infinite-horizon version of the problem is solved up to twelve dimensions, but the accuracy of the solution is not readily verifiable. The ability to conveniently measure model accuracy for general high-dimensional problems with no known analytical solution is a key advantage of our framework.
For each discretized OCP, , 20, and 30, we apply the time-marching strategy to build an initial training data set from 30 uniformly sampled initial conditions, , . For each initial condition , the BVP solver outputs an optimal trajectory , evaluated at collocation points chosen by the solver. Typically this can be a few hundred per initial condition, depending on the state dimension and the BVP solver tolerances. Since these data sets can be get quite large, we often train on randomly selected subsets of the data. This can significantly improve training speed without sacrificing accuracy. When neeeded, we solve additional BVPs to expand the data set as described in Section 4.2. We use the same NN architecture as in Section 5, with three hidden layers with 64 neurons each. We set , 1.3, and 1.8 for , 20, and 30, respectively, and use in all cases.
In Table 3, we present validation accuracy results for the trained NNs. We include the RMAE in predicting the value function and the RM error in predicting the costate, . Accuracy is measured empirically on independently generated validation data sets comprised of trajectories from 50 randomly selected initial conditions. We find that the trained NNs accurately predict both the value function and its gradient, even in 30 dimensions.
Table 3 also shows the total number of sample trajectories seen by the NN, including the initial data . It may seem surprising that we are able to reach the same level of accuracy in higher dimensions with similar numbers of sample trajectories. This happens because the BVP solver usually needs more collocation points for larger problems, thus producing more data per trajectory. Consequently, fewer trajectories are needed to fulfill the data set size recommendation Eq. 30. Similarly, in Section 5 we used data only for , so we needed thousands of trajectories to fill the state space. This suggests that learning the time-dependent value function can be more efficient than learning only. Note that, if preferred, the time-dependent controller can still be used with a moving horizon like in Section 5.
Lastly, Table 3 shows the training time for each NN, including time spent testing convergence and generating additional trajectories on the fly, but not time spent generating the initial data. Generating data becomes the most expensive computation as increases, but even so we find that computational effort scales reasonably with the problem dimension. Furthermore, it is possible to obtain a rough low-fidelity NN model in just minutes as shown in Table 5, which in turn allows for more efficient data generation. This demonstrates the viability of the proposed method for solving high-dimensional optimal control problems.
6.2 NN warm start for fast and reliable BVP solutions
In our experience, generating the initial training data set can be the most computationally demanding part of the process, especially as the problem dimension increases. Consequently, for difficult high-dimensional problems it may be impractical to generate a large-enough data set from scratch. This obstacle can be largely overcome by using partially-trained/low-fidelity NNs to aid in further data generation. In this section, we briefly compare the reliability and speed of BVP convergence between our two strategies: time-marching and NN warm start. These experiments demonstrate the importance of NN guesses for high-dimensional data generation.
For each of , 20, and 30, we randomly sample a set of 1000 candidate points from the domain defined in Eq. 37. From these we choose 100 points with the largest predicted value gradient. The set of initial conditions is fixed for each . Next we proceed as in Section 5.2, solving each BVP by time-marching with different number of time-marching iterations, . We tune each time sequence to improve convergence as much as possible. Results are summarized in Table 5. We then solve the same BVPs directly over the whole time interval with NN warm start. These NNs are trained on fixed data sets containing only 30 trajectories, but with different gradient loss weights , resulting in varying costate prediction accuracy. We also limit the number of L-BFGS iterations so that each model is trained only for a short time. Results are given in Table 5.
As before, we find that even NNs with relatively large costate prediction error enable consistently convergent BVP solutions. Time-marching also works once the sequence of time steps is properly tuned, but the speed of this method scales poorly with . Now the advantage of utilizing NNs to aid in data generation becomes clear: when is large, the average time needed for convergence when using NN warm start is drastically lower than that of the time-marching trick. This approach also requires no tuning of the time-marching sequence. Because low-fidelity NNs are quick to train, training such a NN and then using it to aid in data generation is the most efficient strategy for building larger data sets.
6.3 Closed-loop simulations
In this section we use simulations to demonstrate that the feedback control output by the trained NN not only stabilizes the high-dimensional system, but that it is close to the true optimal control. The optimal feedback control law can again be calculated with Eq. 21, from which we obtain
[TABLE]
In Fig. 4, we plot the uncontrolled (Fig. 4(a)) and closed-loop controlled dynamics (Fig. 4(b)), starting from two different initial conditions, and , where the dimension of the discretized system is . For both of these initial conditions (and almost all others tested), the NN controller successfully stabilizes the open-loop unstable origin. Further, as shown in Fig. 4(c), the NN-generated controls are very close to the true optimal controls which are calculated by solving the associated BVPs. Finally, the speed of online control computation is not sensitive to the problem dimension: each evaluation still takes just milliseconds on both an NVIDIA RTX 2080Ti GPU and a 2012 MacBook Pro.
7 Conclusion
In this paper, we have developed a novel machine learning framework for solving HJB equations and designing candidate optimal feedback controllers. Unlike many other state of the art techniques, our method does not require finite difference approximations of the gradient nor strict restrictions on the structure of the dynamics. The causality-free algorithm we use for data generation enables application to high-dimensional systems and validation of model accuracy. We also emphasize that while our method is data-driven, by leveraging the costate data we are able to train more physically-consistent models and better controllers with surprisingly small data sets.
The proposed method is not only a consumer of data, but through adaptive data generation it can also be used build rich data sets with points anywhere in a semi-global domain. Thus the value function and control are valid for large ranges of dynamic states, rather than just in the neighborhood of some nominal trajectory. Furthermore, data can be generated specifically near complicated regions of the value function or where the required control effort is large. This in turn allows us to train more accurate NN models and synthesize better-performing controllers, or employ other data-driven methods.
We have demonstrated the possibility for use of the framework in a practical setting by designing candidate optimal feedback controllers for a six-dimensional nonlinear rigid body. The potential for scalability of the method is demonstrated by solving HJB equations in up to 30 dimensions using limited data, and empirical validation indicates that the NN models are good approximations of the value function. How well the proposed techniques work for even larger problems remains an open question. Indeed, understanding the scalability of deep learning methods in general is still an active area of research. Nevertheless, we are encouraged by the simulations in Section 6 which suggest that the method may scale quite well for moderately high-dimensional problems. In addition, the computational burden associated with an increase in dimensionality is incurred entirely offline: due to the structure of NNs, increasing the dimension has a negligible impact on the speed of online control calculation.
These promising results leave plenty of room for future development. Of special interest are extensions of the framework to solve problems with free final time and state and control constraints, which appear ubiquitously in practical applications. Such problems typically give rise to non-unique solutions of PMP and non-smooth value functions, thus presenting substantial challenges for both data generation and neural network modeling. Overcoming these obstacles would open the door to solving many important and difficult optimal control problems.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] M. Abadi, A. Agarwal, P. Barham, et al. , Tensor Flow: Large-scale machine learning on heterogeneous systems , 2016, https://arxiv.org/abs/1603.04467 .
- 2[2] M. Abu-Khalaf and F. L. Lewis , Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach , Automatica, 41 (2005), pp. 779–791, https://doi.org/10.1016/j.automatica.2004.11.034 . · doi ↗
- 3[3] E. Al’brekht , On the optimal stabilization of nonlinear systems , J. Appl. Math. Mech., 25(5) (1961), pp. 1254–1266, https://doi.org/10.1016/0021-8928(61)90005-3 . · doi ↗
- 4[4] A. Bachouch, C. Huré, N. Langrené, and H. Pham , Deep neural networks algorithms for stochastic control problems on finite horizon: numerical applications , 2018, https://arxiv.org/abs/1812.05916 .
- 5[5] O. Bokanowski, J. Garcke, M. Griebel, and I. Klompmaker , An adaptive sparse grid semi-Lagrangian scheme for first order Hamilton-Jacobi Bellman equations , J. Sci. Comput., 55 (2013), pp. 575–605, https://doi.org/10.1007/s 10915-012-9648-x . · doi ↗
- 6[6] L. Bottou, F. E. Curtis, and J. Nocedal , Optimization methods for large-scale machine learning , SIAM Rev., 60 (2018), pp. 223–311, https://doi.org/10.1137/16M 1080173 . · doi ↗
- 7[7] R. H. Byrd, G. M. Chin, J. Nocedal, and Y. Wu , Sample size selection in optimization methods for machine learning , Math. Program., 134 (2012), pp. 127–155, https://doi.org/10.1007/s 10107-012-0572-5 . · doi ↗
- 8[8] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu , A limited memory algorithm for bound constrained optimization , SIAM J. Sci. Comput., 16 (1995), pp. 1190–1208, https://doi.org/10.1137/0916069 . · doi ↗
