Connections Between Adaptive Control and Optimization in Machine Learning

Joseph E. Gaudio; Travis E. Gibson; Anuradha M. Annaswamy; Michael A. Bolender; Eugene Lavretsky

arXiv:1904.05856·math.OC·April 17, 2020

TL;DR

This paper explores the deep connections between adaptive control and machine learning optimization, revealing shared concepts and proposing new avenues for algorithm analysis and improvement.

Contribution

It uncovers fundamental links between adaptive control and machine learning optimization, leading to novel insights and solutions for higher order learning problems.

Findings

01 Identifies similarities in update laws between the fields

02 Discusses shared concepts like stability and performance

03 Provides new analysis opportunities and solves a higher order learning problem

Abstract

This paper demonstrates many immediate connections between adaptive control and optimization methods commonly employed in machine learning. Starting from common output error formulations, similarities in update law modifications are examined. Concepts in stability, performance, and learning, common to both fields are then discussed. Building on the similarities in update laws and common concepts, new intersections and opportunities for improved algorithm analysis are provided. In particular, a specific problem related to higher order learning is solved through insights obtained from these intersections.

Equations

$$y(t)=f_{1}(\phi(t),\theta^{*})$$

$$\dot{x}(t)=f_{2}(x(t),u(t),\theta^{*}),\qquad y(t)=f_{3}(x(t),u(t),\theta^{*})$$

$$\hat{y}(t)=f(\phi(t),\theta(t))$$

$$\dot{\theta}(t)=g_{1}(e_{y}(t),\phi(t))$$

$$u(t)=g_{2}(e_{y}(t),\phi(t),\theta(t)),\qquad \dot{\theta}(t)=g_{3}(e_{y}(t),\phi(t),\theta(t))$$

$$e_{y}(t)=\tilde{\theta}^{T}(t)\phi(t)$$

$$e_{y}(t)=W(s)[\tilde{\theta}^{T}(t)\phi(t)]$$

$$\hat{y}_{k}=f(\phi_{k},\theta_{k}).$$

$$\dot{\theta}(t)=-\gamma\nabla_{\theta}L(\theta(t))=-\gamma\phi(t)e_{y}(t).$$

$$\theta_{k+1}=\theta_{k}-\gamma_{k}\nabla_{\theta}L(\theta_{k})$$

$$\dot{\theta}(t)=-\gamma\left[\nabla_{\theta}L(\theta(t))+\sigma\mathcal{G}(\theta(t),e_{y}(t))\right]$$

$$\theta_{k+1}=\theta_{k}-\gamma_{k}\left[\nabla_{\theta}L(\theta_{k})+\sigma\nabla_{\theta}\mathcal{R}(\theta_{k})\right].$$

$$\dot{\theta}(t)=\begin{cases}-\gamma\nabla_{\theta}L(\theta(t)),&\mathcal{D}(e_{y})>d_{0}+\epsilon\\ 0,&\mathcal{D}(e_{y})\leq d_{0}+\epsilon\end{cases}$$

$$\text{Proj}(\theta_{i},\zeta_{i})=\begin{cases}\dfrac{\theta_{i,\max}^{2}-\theta_{i}^{2}}{\theta_{i,\max}^{2}-\theta_{i,\max}^{\prime 2}}\zeta_{i},&\theta_{i}\in\Omega_{i}\wedge\theta_{i}\zeta_{i}>0\\ \zeta_{i},&\text{otherwise}\end{cases}$$

$$\dot{\theta}(t)=-\gamma\,\text{Proj}\left[\theta(t),\nabla_{\theta}L(\theta(t))\right].$$

$$\Pi_{\Theta}(\bar{\theta})\triangleq\underset{\theta\in\Theta}{\arg\min}\,\lVert\theta-\bar{\theta}\rVert$$

$$\bar{\theta}_{k+1}=\theta_{k}-\gamma_{k}\nabla_{\theta}L(\theta_{k}),\qquad \theta_{k+1}=\Pi_{\Theta}(\bar{\theta}_{k+1}).$$

$$\begin{aligned}\dot{\theta}(t)&=-\gamma\Gamma(t)\nabla_{\theta}L(\theta(t))\\ \dot{\Gamma}(t)&=\begin{cases}\Upsilon\Gamma(t)-\dfrac{\Gamma(t)\phi(t)\phi^{T}(t)\Gamma(t)}{\mathcal{N}(t)},&\lVert\Gamma(t)\rVert\leq\Gamma_{\max}\\ 0,&\text{otherwise}\end{cases}\end{aligned}$$

$$\bar{\theta}_{k+1}=\theta_{k}-\gamma_{k}m_{k}/V_{k}^{1/2},\qquad \theta_{k+1}=\Pi_{\Theta}(\bar{\theta}_{k+1})$$

$$\dot{e}(t)=Ae(t)+b\tilde{\theta}^{T}(t)\hat{\phi}(t)+b\theta^{*T}\tilde{\phi}(t),\qquad e_{y}(t)=ce(t).$$

$$V=\gamma^{-1}\tilde{\theta}^{T}\tilde{\theta}+e^{T}Pe+\alpha\tilde{\phi}^{T}\bar{P}\tilde{\phi}.$$

$$\dot{V}=-e^{T}Qe-\alpha\tilde{\phi}^{T}\bar{Q}\tilde{\phi}+2e^{T}Pb\theta^{*T}\tilde{\phi}$$

$$\int_{t_{0}}^{T}e^{T}Qe\,dt-\int_{t_{0}}^{T}\delta(t)\,dt\leq-\int_{t_{0}}^{T}\dot{V}\,dt=V(t_{0})-V(T).$$

$$\text{regret}_{T}=\sum_{k=1}^{T}\mathcal{C}_{k}(\theta_{k})-\min_{\theta\in\Theta}\sum_{k=1}^{T}\mathcal{C}_{k}(\theta)$$

$$\text{continuous regret}_{T}=\int_{t_{0}}^{T}e^{T}Qe\,dt-\int_{t_{0}}^{T}\bar{\delta}(t)\,dt$$

$$\mathcal{C}(\bar{\theta}_{T})-\mathcal{C}(\theta^{*})\leq\frac{1}{T}\sum_{k=1}^{T}\left[\mathcal{C}(\theta_{k})-\mathcal{C}(\theta^{*})\right]=\frac{\text{regret}_{T}}{T}$$

$$\theta_{k+1}=\vartheta_{k}-\gamma\nabla_{\theta}L(\vartheta_{k}),\qquad \vartheta_{k}=\theta_{k}+\beta(\theta_{k}-\theta_{k-1})$$

$$\dot{\vartheta}(t)=-\gamma\nabla_{\theta}L(\theta(t)),\qquad \dot{\theta}(t)=-\beta(\theta(t)-\vartheta(t))\mathcal{N}(t)$$

Full text

Connections Between Adaptive Control and Optimization in Machine Learning

Joseph E. Gaudio

Massachusetts Institute of Technology

Travis E. Gibson

Brigham and Women’s Hospital and Harvard Medical School

Anuradha M. Annaswamy

Massachusetts Institute of Technology

Michael A. Bolender

Air Force Research Laboratory

Eugene Lavretsky

The Boeing Company

(April 11, 2019)

1 Introduction

The fields of adaptive control and machine learning have evolved in parallel over the past few decades, with a significant overlap in goals, problem statements, and tools. Machine learning as a field has focused on computer based systems that improve through experience [1, 2, 3, 4, 5, 6]. Often times the process of learning is encapsulated in the form of a parameterized model, whose parameters are learned in order to approximate a function. Optimization methods are commonly employed to reduce the function approximation error using any and all available data. The field of adaptive control, on the other hand, has focused on the process of controlling engineering systems in order to accomplish regulation and tracking of critical variables of interest (e.g. speed in automotive systems, position and force in robotics, Mach number and altitude in aerospace systems, frequency and voltage in power systems) in the presence of uncertainties in the underlying system models, changes in the environment, and unforeseen variations in the overall infrastructure [7, 8, 9, 10, 11]. The approach used for accomplishing such regulation and tracking in adaptive control is the learning of underlying parameters through an online estimation algorithm. Stability theory is employed for enabling guarantees for the safe evolution of the critical variables, and convergence of the regulation and tracking errors to zero.

Learning parameters of a model in both machine learning and adaptive control occurs through the use of input-output data. In both cases, the main algorithm used for updating the parameters is based on a gradient descent-like algorithm [11]. Related tools of analysis, convergence, and robustness in both fields have a tremendous amount of similarity. As the scope of problems in both fields increases, the associated complexity and challenges increase as well. Therefore it is highly attractive to understand these similarities and connections so that the two communities can develop new methods for addressing new challenges. In this paper, we discuss the similarities and connections in detail between the fields of adaptive control and machine learning. Using these connections, we state and provide a solution for a new problem in machine learning using methods developed in adaptive control.

In this paper, the adaptive control perspective will be presented in continuous time with machine learning material presented in discrete time. The paper organization is as follows. We introduce the formulation of output errors commonly employed in adaptive control and machine learning with their associated update laws in Section 2. Numerous connections between the two fields are then made with respect to the underlying parameter update laws in Section 3 and important concepts in Section 4. Examples of intersections between both fields are provided in Section 5, with concluding remarks in Section 6.

2 Problem Statements

In this section, we state typical problems that are addressed in the areas of adaptive control and machine learning. In both cases, we illustrate the role of learning, the input-output data used, and the overall problem that is desired to be solved.

2.1 Adaptive Control

The main goal in adaptive control is to carry out problems such as estimation or tracking in the presence of parametric uncertainties. The underlying model that relates inputs, outputs, and the unknown parameters is assumed to stem from either the underlying physics or from data-driven approaches. Often these models take the form

$$y(t)=f_{1}(\phi(t),\theta^{*}) \tag{1}$$

or

$$\dot{x}(t)=f_{2}(x(t),u(t),\theta^{*}),\qquad y(t)=f_{3}(x(t),u(t),\theta^{*}) \tag{2}$$

where $u\in\mathbb{R}^{m}$ is an exogenous input, $x\in\mathbb{R}^{n}$ denotes the state, $y\in\mathbb{R}^{p}$ corresponds to output measurements, $\phi\in\mathbb{R}^{N}$ corresponds to measured and computed variables, and $\theta^{*}\in\mathbb{R}^{N}$ denotes the uncertain parameter. In an estimation problem, the goal is to estimate the state $x$ in (2) and output $y$ in both (1), (2), alongside the unknown parameter $\theta^{*}$ simultaneously, using all available variables. In a control problem, the goal is to determine a control input $u$ so that the output $y$ in (2) follows a desired output $\hat{y}$ .

A typical approach taken in order to solve the estimation problem in (1) is to choose an estimator structure of the form

$$\hat{y}(t)=f(\phi(t),\theta(t)) \tag{3}$$

where $\theta\in\mathbb{R}^{N}$ denotes the estimate of $\theta^{*}$ and adjust $\theta$ so that the estimation error $e_{y}=\hat{y}-y$ is minimized, i.e., choose a function $g_{1}(e_{y},\phi)$ with

$$\dot{\theta}(t)=g_{1}(e_{y}(t),\phi(t)) \tag{4}$$

so that the estimator has bounded signals, $e_{y}(t)$ converges to zero and $\theta(t)$ converges to $\theta^{*}$ . Similarly, the control problem consists of constructing an output tracking error $e_{y}=\hat{y}-y$ , where $\hat{y}$ denotes the desired output that $y$ is required to track. The goal is to then choose functions $g_{2}(e_{y},\phi,\theta)$ and $g_{3}(e_{y},\phi,\theta)$ so that the control input $u$ and a control parameter estimate $\theta$ can be chosen as

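$$u(t)=g_{2}(e_{y}(t),\phi(t),\theta(t)),\qquad \dot{\theta}(t)=g_{3}(e_{y}(t),\phi(t),\theta(t)) \tag{5}$$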

leading to closed-loop signals remaining bounded, $e_{y}(t)$ converging to zero and $\theta(t)$ converging to its true value $\theta^{*}$ . Denote the corresponding parameter errors as $\tilde{\theta}=\theta-\theta^{*}$ .

In order to derive the function $g_{1}$ for the estimation problem in (1) and the functions $g_{2}$ and $g_{3}$ for the control problem in (2) so as to realize the underlying goals, a stability framework together with an error model approach is often employed in adaptive control. The error model approach consists of identifying the basic relationship between the two errors that are commonly present in these adaptive systems, which are the estimation (or tracking) error $e_{y}$ and the parameter error $\tilde{\theta}$ . While the estimation error is measurable and correlated with the parameter error, the parameter error is unknown but adjustable through the parameter estimate. In order to determine the update laws $g_{i}$ , the relationship (error model) that relates these two errors is used as a cue.

Two types of error models frequently occur in adaptive systems, and are presented below (see Figure 1). The first corresponds to the case when the relation in (1) is linear, and the underlying error model is simply of the form (cf. [11])

$$e_{y}(t)=\tilde{\theta}^{T}(t)\phi(t) \tag{6}$$

and as a result, the function $g_{1}$ in (4) can be determined simply using the gradient rule that minimizes $\lVert e_{y}\rVert^{2}$ . The second is of the form (cf. [11])

$$e_{y}(t)=W(s)[\tilde{\theta}^{T}(t)\phi(t)] \tag{7}$$

where $W(s)[\zeta]$ denotes a dynamic operator operating on $\zeta(t)$ . It has been shown in the adaptive control literature [7, 8, 9, 10, 11] that for specific classes of dynamic operators $W(s)$ , a stable, gradient-like rule can be determined for adjusting $\tilde{\theta}$ . Most of these results apply uniformly to the case when $u$ and $y$ are scalars or vectors, with the latter introducing additional technicalities. In this paper we consider the case where inputs and outputs are scalars for notational simplicity, and to focus on the core of the learning problem with multi-dimensional regressors $\phi$ and parameter estimates $\theta$ . Often the unknown parameter $\theta^{*}$ is assumed to reside in a compact convex set, which we will denote as $\Theta$ .

2.2 Machine Learning

Machine learning is a broad field encompassing a wide variety of learning techniques and problems such as classification and regression. A large portion of machine learning considers supervised learning problems, where regressors $\phi$ and outputs $y$ are related to one another in an unknown algebraic manner [1, 2, 3, 4, 5, 6]. A typical approach taken in order to perform classification or regression is to choose an output estimator $\hat{y}_{k}$ parameterized with adjustable weights $\theta_{k}$ as

$$\hat{y}_{k}=f(\phi_{k},\theta_{k}). \tag{8}$$

A common form of the estimator as in (8) is that of neural networks, where the parameters $\theta_{k}$ represent the adjustable weights in the network [1, 2, 3, 4, 5].

Similar to adaptive control, $\theta_{k}$ is often adjusted using the output error $e_{y,k}=\hat{y}_{k}-y_{k}$. A loss function $L:\Theta\rightarrow\mathbb{R}$ of $e_{y,k}$ is minimized through the adjustable weights. An example loss function for regression is $\ell_{p}$ loss (with $p\in\mathbb{N}$, $p>0$ and even) $L(\theta_{k})=(1/p)\lVert e_{y,k}\rVert_{p}^{p}$. For binary classification ($y_{k}\in\{-1,1\}$) common loss functions include hinge loss $L(\theta_{k})=\max(0,1-y_{k}\hat{y}_{k})$, and logistic loss $L(\theta_{k})=\ln(1+\text{exp}(-y_{k}\hat{y}_{k}))$. Additionally, as in empirical risk minimization (ERM) [12], the total loss function considered for the purpose of a parameter update may be an average of loss functions over $m$ samples: $(1/m)\sum_{i=1}^{m}L_{i}(\theta_{k})$. The above descriptions make it clear that the structure of the estimation problem in adaptive control and in machine learning is strikingly similar. In the next section, we examine the nature of the adjustment of $\theta_{k}$.

2.3 Common Update Laws

As previously stated, the goal in adaptive control is to design a rule to adjust $\theta$ in an online continuous manner using knowledge of $\phi$ and $e_{y}$ such that $e_{y}$ tends toward zero. Given that the output errors may be corrupted by noise, an iterative, gradient-like update is usually employed. To do so for the algebraic error model (6), consider the squared loss cost function: $L(\theta(t))=(1/2)e_{y}^{2}(t)$ . The gradient of this function with respect to the parameters can be expressed as: $\nabla_{\theta}L(\theta(t))=\phi(t)e_{y}(t)$ . The standard gradient flow update law [7] may be expressed as follows with user-designed gain parameter $\gamma>0$ as

$$\dot{\theta}(t)=-\gamma\nabla_{\theta}L(\theta(t))=-\gamma\phi(t)e_{y}(t). \tag{9}$$

For dynamical error models such as (7), a stability approach rather than a gradient based one is taken using Lyapunov methods, which leads to an adaptive law identical to (9) for a class of dynamic systems $W(s)$ that are strictly positive real [7, 13].

The common update law for supervised machine learning problems, gradient descent, is akin to the time varying regression law (9) in discrete time, and of the form

$$\theta_{k+1}=\theta_{k}-\gamma_{k}\nabla_{\theta}L(\theta_{k}) \tag{10}$$

where the “stepsize” $\gamma_{k}$ is usually chosen as a decreasing function of time [15, 16, 17, 18, 19], a standard feature of stochastic gradient algorithms. Gradient descent is not the update law in all of machine learning, as the field is broad (for example, Bayesian methods often use sampling-based techniques such as Markov chain Monte Carlo), but even in probabilistic inference, gradient-based methods can also be used, cf. variational inference [14].
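As a minimal sketch of the discrete-time update (10) applied to the algebraic error model (6), the following Python snippet assumes a linear estimator $\hat{y}_{k}=\theta_{k}^{T}\phi_{k}$ and the squared loss, so that the gradient is $\phi_{k}e_{y,k}$. The function name `gradient_step`, the true parameter, the random regressors, and the constant step size (used here for simplicity in a noise-free setting, instead of a decreasing one) are all illustrative assumptions, not part of the original paper.

```python
import numpy as np

def gradient_step(theta, phi, y, gamma):
    """One step of (10) for L = 0.5 * e_y**2 with the linear estimator y_hat = theta @ phi."""
    e_y = theta @ phi - y            # output error e_y = y_hat - y
    grad = phi * e_y                 # gradient of the squared loss with respect to theta
    return theta - gamma * grad, e_y

rng = np.random.default_rng(0)
theta_star = np.array([2.0, -1.0, 0.5])   # hypothetical true parameter
theta = np.zeros(3)
for k in range(2000):
    phi = rng.standard_normal(3)           # hypothetical regressor sample
    y = theta_star @ phi                   # noise-free measurement
    theta, e_y = gradient_step(theta, phi, y, gamma=0.05)

param_error = np.linalg.norm(theta - theta_star)
print(param_error)
```

Because the random regressors excite every direction of the parameter space, the parameter error shrinks toward zero in this example; with a less rich regressor only the output error would be guaranteed to shrink.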

3 Connections: Update Law

This section details a variety of connections between adaptive control and the optimization methods commonly used in machine learning as viewed from the perspective of their common update laws (9), (10).

3.1 $\sigma$ -Modification, $e$ -Modification, and Regularization

While the update laws in (9) and (10) are designed primarily to reduce the output error $e_{y}$, there are several secondary reasons to modify them, chiefly robustness to perturbations stemming from disturbances, noise, and other unmodeled causes. We outline these modifications in this section.

3.1.1 Adaptive Control

Historically the adaptive update law in (9) has been modified to ensure robustness in the presence of bounded disturbances as

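$$\dot{\theta}(t)=-\gamma\left[\nabla_{\theta}L(\theta(t))+\sigma\mathcal{G}(\theta(t),e_{y}(t))\right] \tag{11}$$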

where $\sigma>0$ is a tuneable parameter that scales the extra term $\mathcal{G}$ . Common choices for $\mathcal{G}$ include the $\sigma$ -modification $\mathcal{G}=\theta$ [20], and the $e$ -modification $\mathcal{G}=\lVert e_{y}\rVert\theta$ [21].

3.1.2 Machine Learning

Regularization is often included in a machine learning optimization problem in order to help cope with overfitting by including constraints on the parameter, thus resulting in an augmented loss function [1, 2, 4, 5, 16, 3, 18, 17]: $\bar{L}(\theta)=L(\theta)+\sigma\mathcal{R}(\theta)$ where $\sigma>0$ is a tunable parameter, often referred to as a Lagrange multiplier. The gradient descent update (10) for this augmented loss function is often referred to as the “regularized follow the leader” algorithm in online learning [17] and may be expressed as

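$$\theta_{k+1}=\theta_{k}-\gamma_{k}\left[\nabla_{\theta}L(\theta_{k})+\sigma\nabla_{\theta}\mathcal{R}(\theta_{k})\right]. \tag{12}$$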

The common choice of $\ell_{p}$ regularization in machine learning of $\mathcal{R}=(1/p)\lVert\theta\rVert_{p}^{p}$ with $p=2$ , (as in ridge regression), coincides with the $\sigma$ -modification [20], as then $\nabla_{\theta}\mathcal{R}=\mathcal{G}$ . Given that the dimension of the parameter vector may be large, a sparse representation is often obtained with $\ell_{1}$ regularization (as in lasso), with $\mathcal{R}=\lVert\theta\rVert_{1}$ [2, 3, 4, 5].
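The correspondence between $\ell_{2}$ regularization and the $\sigma$-modification can be sketched numerically. The snippet below assumes a linear estimator with squared loss and $\mathcal{R}(\theta)=(1/2)\lVert\theta\rVert_{2}^{2}$, so that $\nabla_{\theta}\mathcal{R}=\theta=\mathcal{G}$; the function name `regularized_step`, the data, the noise level, and the gain values are hypothetical choices for illustration.

```python
import numpy as np

def regularized_step(theta, phi, y, gamma, sigma):
    """One step of (12) with squared loss and R(theta) = 0.5 * ||theta||_2^2."""
    e_y = theta @ phi - y
    grad_loss = phi * e_y
    grad_reg = theta                 # gradient of 0.5 * ||theta||^2, i.e. G = theta
    return theta - gamma * (grad_loss + sigma * grad_reg)

rng = np.random.default_rng(1)
theta_star = np.array([1.0, -2.0])   # hypothetical true parameter
theta_plain = np.zeros(2)            # sigma = 0: unmodified update (10)
theta_reg = np.zeros(2)              # sigma > 0: regularized update (12)
for k in range(5000):
    phi = rng.standard_normal(2)
    y = theta_star @ phi + 0.1 * rng.standard_normal()   # noisy measurement
    theta_plain = regularized_step(theta_plain, phi, y, gamma=0.01, sigma=0.0)
    theta_reg = regularized_step(theta_reg, phi, y, gamma=0.01, sigma=0.5)

print(np.linalg.norm(theta_plain), np.linalg.norm(theta_reg))
```

The extra term pulls the estimate toward the origin, so the regularized estimate has a smaller norm than the unregularized one, at the cost of a bias away from $\theta^{*}$.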

3.2 Deadzone Modification and Early Stopping

This subsection details common modifications of the adaptive law adopted to cease updating the parameter estimate after sufficient tuning.

3.2.1 Adaptive Control

Another method to increase robustness in the presence of bounded disturbances is to apply a “dead zone” [22] to the update in (9) as

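$$\dot{\theta}(t)=\begin{cases}-\gamma\nabla_{\theta}L(\theta(t)),&\mathcal{D}(e_{y})>d_{0}+\epsilon\\ 0,&\mathcal{D}(e_{y})\leq d_{0}+\epsilon\end{cases} \tag{13}$$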

where $d_{0}>0$ is the dead zone width that may correspond to an upper bound on the disturbance, and $\epsilon>0$ is a small constant. The function $\mathcal{D}$ is a non-negative metric on the output error to stop adaptation in desired regions of the output space. A common choice is $\mathcal{D}=\lVert e_{y}\rVert$ such that adaptation stops after a small output error is achieved above a noise level with upper bound $d_{0}$ .
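A discrete-time analogue of the dead-zone law (13) can be sketched as follows, assuming the common choice $\mathcal{D}=\lVert e_{y}\rVert$ for a scalar output, a linear estimator, and the squared loss. The function name `dead_zone_step` and all numerical values are illustrative assumptions.

```python
import numpy as np

def dead_zone_step(theta, phi, y, gamma, d0, eps):
    """Euler-style analogue of (13) with D(e_y) = |e_y| and y_hat = theta @ phi."""
    e_y = theta @ phi - y
    if abs(e_y) > d0 + eps:
        return theta - gamma * phi * e_y   # outside the dead zone: adapt as in (9)
    return theta.copy()                    # inside the dead zone: freeze the estimate

theta0 = np.zeros(2)
phi = np.array([1.0, 0.0])
frozen = dead_zone_step(theta0, phi, y=0.05, gamma=0.5, d0=0.1, eps=0.01)
adapted = dead_zone_step(theta0, phi, y=1.0, gamma=0.5, d0=0.1, eps=0.01)
print(frozen, adapted)
```

In the first call the output error magnitude (0.05) is below $d_{0}+\epsilon$, so the estimate is left unchanged; in the second it is 1.0, so a gradient step is taken.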

3.2.2 Machine Learning

The training process is often stopped early in machine learning applications as a method to deal with overfitting [2, 3, 4, 5, 23]. This may be done by using multiple data sets and stopping the parameter update process (10) when the loss computed for a validation data set starts to increase [23]. Early stopping is often needed when training neural networks due to their large number of parameters [2, 3, 4, 5], and it can act as regularization [24].

3.3 Projection

It is often desirable to define a compact region a priori for the parameters $\theta$, such that during the learning process the parameters are not allowed to leave that region. In physical systems there are natural constraints which may aid in the design of that region, and for non-physical systems, the constraints are often engineered by the algorithm designer.

3.3.1 Adaptive Control

A continuous projection algorithm is commonly employed to provide for robustness of the adaptive update law in the presence of unmodeled dynamics [25, 26, 27]. One such implementation is

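$$\text{Proj}(\theta_{i},\zeta_{i})=\begin{cases}\dfrac{\theta_{i,\max}^{2}-\theta_{i}^{2}}{\theta_{i,\max}^{2}-\theta_{i,\max}^{\prime 2}}\zeta_{i},&\theta_{i}\in\Omega_{i}\wedge\theta_{i}\zeta_{i}>0\\ \zeta_{i},&\text{otherwise}\end{cases} \tag{14}$$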

where $\Omega$ , $\theta_{i,\max}$ , $\theta_{i,\max}^{\prime}$ define a user-specified boundary layer region inside of $\Theta$ (see [26]). The update law in (9) may then be modified as

$$\dot{\theta}(t)=-\gamma\,\text{Proj}\left[\theta(t),\nabla_{\theta}L(\theta(t))\right]. \tag{15}$$

3.3.2 Machine Learning

Projected gradient descent methods have a long history in optimization. The following projection operation finds the point in a convex set which is closest to a specified point, and may be defined as

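$$\Pi_{\Theta}(\bar{\theta})\triangleq\underset{\theta\in\Theta}{\arg\min}\,\lVert\theta-\bar{\theta}\rVert \tag{16}$$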

which may be employed in the update sequence [15, 16, 17, 18, 19]

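$$\bar{\theta}_{k+1}=\theta_{k}-\gamma_{k}\nabla_{\theta}L(\theta_{k}),\qquad \theta_{k+1}=\Pi_{\Theta}(\bar{\theta}_{k+1}). \tag{17}$$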
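The projected gradient update (16), (17) can be illustrated with a short sketch in which $\Theta$ is assumed to be a Euclidean ball of radius `radius` centered at the origin, for which the closest-point projection has a closed form. The set, the quadratic loss, and the function names `project_ball` and `projected_gradient_step` are assumptions made for this example only.

```python
import numpy as np

def project_ball(theta_bar, radius):
    """Closest point to theta_bar in the ball {theta : ||theta||_2 <= radius}, as in (16)."""
    norm = np.linalg.norm(theta_bar)
    return theta_bar if norm <= radius else theta_bar * (radius / norm)

def projected_gradient_step(theta, grad, gamma, radius):
    """One step of (17): unconstrained gradient step followed by projection."""
    theta_bar = theta - gamma * grad
    return project_ball(theta_bar, radius)

c = np.array([3.0, 4.0])             # unconstrained minimizer, outside the unit ball
theta = np.zeros(2)
for k in range(50):
    grad = theta - c                 # gradient of the hypothetical loss 0.5 * ||theta - c||^2
    theta = projected_gradient_step(theta, grad, gamma=0.5, radius=1.0)

print(theta)
```

The iterates settle at the point of the unit ball closest to `c`, namely `c / ||c||`, which is the constrained minimizer of this loss.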

3.4 Adaptive Gains and Stepsizes

3.4.1 Adaptive Control

The following parameter update law for the algebraic error model (6) is one example which alters the gain of the standard update law (9) as a function of the time varying regressors $\phi$ [7, 10] (this update law has not been proven stable for the error model in (7)):

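$$\begin{aligned}\dot{\theta}(t)&=-\gamma\Gamma(t)\nabla_{\theta}L(\theta(t))\\ \dot{\Gamma}(t)&=\begin{cases}\Upsilon\Gamma(t)-\dfrac{\Gamma(t)\phi(t)\phi^{T}(t)\Gamma(t)}{\mathcal{N}(t)},&\lVert\Gamma(t)\rVert\leq\Gamma_{\max}\\ 0,&\text{otherwise}\end{cases}\end{aligned} \tag{18}$$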

where $\Upsilon\geq 0$ is a forgetting factor and $\mathcal{N}(t)$ is a normalizing signal, with common choice $\mathcal{N}(t)=(1+\mu\phi^{T}(t)\phi(t))$ for $\mu>0$ chosen appropriately (see for example [10] for a discussion of the choice of parameters). It can be seen that the update for $\Gamma$ may be integrated and used in the update for $\theta$ to result in a gain adaptive to the regressor $\phi$ .
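One way to see how the gain $\Gamma$ adapts to the regressor is to integrate (18) numerically. The sketch below uses a forward Euler discretization with step `h`, the spectral norm for $\lVert\Gamma\rVert$, the squared loss for the algebraic error model (6), and the normalizer $\mathcal{N}(t)=1+\mu\phi^{T}(t)\phi(t)$. The discretization, the regressor signal, and every numerical value are assumptions chosen for illustration.

```python
import numpy as np

def adaptive_gain_step(theta, Gamma, phi, y, h, gamma, upsilon, mu, Gamma_max):
    """One forward Euler step of size h for (18) with y_hat = theta @ phi."""
    e_y = theta @ phi - y
    theta_next = theta - h * gamma * (Gamma @ phi) * e_y
    if np.linalg.norm(Gamma, 2) <= Gamma_max:
        normalizer = 1.0 + mu * (phi @ phi)              # N(t) = 1 + mu * phi^T phi
        Gamma_dot = upsilon * Gamma - (Gamma @ np.outer(phi, phi) @ Gamma) / normalizer
    else:
        Gamma_dot = np.zeros_like(Gamma)                 # gain is frozen beyond Gamma_max
    return theta_next, Gamma + h * Gamma_dot

theta_star = np.array([1.0, -0.5])     # hypothetical true parameter
theta, Gamma = np.zeros(2), np.eye(2)
h = 0.01
for k in range(3000):
    t = k * h
    phi = np.array([np.sin(t), np.cos(2.0 * t)])         # hypothetical regressor signal
    theta, Gamma = adaptive_gain_step(theta, Gamma, phi, theta_star @ phi,
                                      h, gamma=1.0, upsilon=0.1, mu=1.0, Gamma_max=10.0)

print(np.linalg.norm(theta - theta_star), Gamma)
```

The gain matrix stays symmetric under this update, and directions that are strongly excited by $\phi$ see their gain reduced, while the forgetting factor $\Upsilon$ counteracts that reduction.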

3.4.2 Machine Learning

Adaptive step size methods [28, 29, 30, 31] have seen widespread use in machine learning problems due to their ability to handle sparse and small gradients by adjusting the step size as a function of features as they are processed online. Define the following: $g_{k}=\nabla_{\theta}L(\theta_{k})$ , $m_{k}=\mathcal{F}_{1,k}(g_{1},\ldots,g_{k})$ , $V_{k}=\mathcal{F}_{2,k}(g_{1},\ldots,g_{k})$ for user defined averaging functions $\mathcal{F}_{1,k}$ , $\mathcal{F}_{2,k}$ . A common update law for adaptive step size methods [31] can then be seen to be similar to (17) as

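$$\bar{\theta}_{k+1}=\theta_{k}-\gamma_{k}m_{k}/V_{k}^{1/2},\qquad \theta_{k+1}=\Pi_{\Theta}(\bar{\theta}_{k+1}) \tag{19}$$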

where the following parameterizations are common [31]: (i) projected gradient descent (17), with $\mathcal{F}_{1,k}=g_{k}$, $\mathcal{F}_{2,k}=I$; (ii) AdaGrad [28], with $\mathcal{F}_{1,k}=g_{k}$, $\mathcal{F}_{2,k}=\epsilon I+\text{diag}(\sum_{i=1}^{k}g_{i}^{2})$, where $g_{i}^{2}=g_{i}\odot g_{i}$; and (iii) Adam [30], with $\mathcal{F}_{1,k}=(1-\beta_{1})\sum_{i=1}^{k}\beta_{1}^{k-i}g_{i}$, $\mathcal{F}_{2,k}=(1-\beta_{2})\text{diag}(\sum_{i=1}^{k}\beta_{2}^{k-i}g_{i}^{2})$. It can be noted that the normalization in these update laws is a function of the gradient, which can be compared to the normalization by the regressor in (18).
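The averaging functions $\mathcal{F}_{1,k}$ and $\mathcal{F}_{2,k}$ for AdaGrad and Adam can be written out directly from the definitions above. This sketch assumes $\Theta=\mathbb{R}^{N}$ (so the projection is the identity), stores the diagonal of $V_{k}$ as a vector, and keeps the whole gradient history to mirror the formulas; practical implementations use running recursions instead. The small `eps` added in the Adam denominator is a numerical safeguard assumed here and is not part of the parameterization stated in the text.

```python
import numpy as np

def adagrad_update(theta, grads, gamma, eps=1e-8):
    """(19) with F1 = g_k and F2 = eps*I + diag(sum_i g_i^2); grads = [g_1, ..., g_k]."""
    m = grads[-1]
    v = eps + np.sum(np.square(grads), axis=0)           # diagonal of V_k
    return theta - gamma * m / np.sqrt(v)

def adam_update(theta, grads, gamma, beta1=0.9, beta2=0.999, eps=1e-8):
    """(19) with exponentially weighted F1 and F2 (no bias correction)."""
    k = len(grads)
    exponents = np.arange(k - 1, -1, -1)                 # k - i for i = 1, ..., k
    m = ((1 - beta1) * beta1 ** exponents) @ np.asarray(grads)
    v = ((1 - beta2) * beta2 ** exponents) @ np.square(grads)
    return theta - gamma * m / (np.sqrt(v) + eps)

g1 = np.array([2.0, -4.0])                               # hypothetical first gradient
theta_adagrad = adagrad_update(np.zeros(2), [g1], gamma=0.1)
theta_adam = adam_update(np.zeros(2), [g1], gamma=0.1)
print(theta_adagrad, theta_adam)
```

After a single step both methods move each coordinate by an amount that depends only on the sign of the gradient entry, not its magnitude, which is the gradient-based normalization discussed above.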

4 Connections: Tools and Concepts

This section details concepts and tools common to both machine learning and adaptive control.

4.1 Lyapunov Functions and Regret

Stability and convergence tools in adaptive control and online machine learning are analyzed in this section.

4.1.1 Adaptive Control

Suppose we consider the error model in (7) where $W(s)=c(sI-A)^{-1}b$ , and a corresponding state space representation of the form [7]

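$$\dot{e}(t)=Ae(t)+b\tilde{\theta}^{T}(t)\hat{\phi}(t)+b\theta^{*T}\tilde{\phi}(t),\qquad e_{y}(t)=ce(t). \tag{20}$$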

The term $\tilde{\phi}$ is due to exponentially decaying terms in the regressor $\phi$. That is, $\tilde{\phi}=\hat{\phi}-\phi$ and $\dot{\tilde{\phi}}=\Lambda\tilde{\phi}$ for a Hurwitz matrix $\Lambda\in\mathbb{R}^{N\times N}$ (this formulation is common in the design of non-minimal adaptive observers [7]). It can be noted that $\hat{\phi}\rightarrow\phi$ as $t\rightarrow\infty$, since $\Lambda$ is Hurwitz. Also, for $\hat{\phi}=\phi$, (20) is the same as (7). A Hurwitz matrix $\Lambda$ implies the existence of a positive definite matrix $\bar{P}=\bar{P}^{T}\in\mathbb{R}^{N\times N}$ and $0<\bar{Q}=\bar{Q}^{T}\in\mathbb{R}^{N\times N}$ such that $\Lambda^{T}\bar{P}+\bar{P}\Lambda=-\bar{Q}$. Stability is often proven in adaptive control by the use of a Lyapunov function $V$, such as

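$$V=\gamma^{-1}\tilde{\theta}^{T}\tilde{\theta}+e^{T}Pe+\alpha\tilde{\phi}^{T}\bar{P}\tilde{\phi}. \tag{21}$$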

It should be noted that the last two terms in $V$ are not needed for the algebraic error model in (6). The time derivative of the Lyapunov function may then be stated using the update law in (9) and the KYP lemma [7] as

$$\dot{V}=-e^{T}Qe-\alpha\tilde{\phi}^{T}\bar{Q}\tilde{\phi}+2e^{T}Pb\theta^{*T}\tilde{\phi} \tag{22}$$

where $\dot{V}\leq 0$ for $\alpha>4\lVert Pb\rVert^{2}\lVert\theta^{*}\rVert^{2}/(\min\text{eig}(Q)\cdot\min\text{eig}(\bar{Q}))$. It can be shown [7] that $\delta(t)\triangleq 2e^{T}Pb\theta^{*T}\tilde{\phi}$ is an exponentially decaying signal with $\tilde{\phi},e\in\mathcal{L}_{2}\cap\mathcal{L}_{\infty}$. By integrating $\dot{V}$ from $t_{0}$ to $T$, we obtain

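$$\int_{t_{0}}^{T}e^{T}Qe\,dt-\int_{t_{0}}^{T}\delta(t)\,dt\leq-\int_{t_{0}}^{T}\dot{V}\,dt=V(t_{0})-V(T). \tag{23}$$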

Given that $\dot{V}\leq 0$ , $V(t_{0})-V(T)\leq V(t_{0})<\infty$ .
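For the algebraic error model (6), only the first term of the Lyapunov function is needed, and with the update law (9) its derivative is $\dot{V}=2\gamma^{-1}\tilde{\theta}^{T}\dot{\theta}=-2e_{y}^{2}\leq 0$. The following sketch checks this decrease numerically along a forward Euler discretization of (9); the step size `h`, the gain, the regressor signal, and the true parameter are assumed values for illustration.

```python
import numpy as np

gamma, h = 2.0, 0.01                   # adaptation gain and Euler step (assumed values)
theta_star = np.array([1.0, -0.5])     # hypothetical true parameter
theta = np.zeros(2)
V = []
for k in range(3000):
    t = k * h
    phi = np.array([np.sin(t), np.cos(2.0 * t)])   # hypothetical regressor signal
    theta_tilde = theta - theta_star
    V.append(theta_tilde @ theta_tilde / gamma)    # V = theta_tilde^T theta_tilde / gamma
    e_y = theta_tilde @ phi                        # error model (6)
    theta = theta - h * gamma * phi * e_y          # Euler step of the update law (9)
V = np.array(V)

print(V[0], V[-1])
```

With a sufficiently small Euler step the sampled values of $V$ never increase, mirroring the continuous-time statement $\dot{V}\leq 0$.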

4.1.2 Machine Learning

In online learning, efficiency of an algorithm is often analyzed using the notion of “regret” as

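$$\text{regret}_{T}=\sum_{k=1}^{T}\mathcal{C}_{k}(\theta_{k})-\min_{\theta\in\Theta}\sum_{k=1}^{T}\mathcal{C}_{k}(\theta) \tag{24}$$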

where regret can be seen to correspond to the sum of the time varying convex costs $\mathcal{C}_{k}$ associated with the choice of the time varying parameter estimate $\theta_{k}$ , minus the cost associated with the best static parameter estimate choice, over a time horizon of $T$ steps [15, 16, 17, 19]. Suppose we consider a quadratic cost $\mathcal{C}_{k}=e_{k}^{T}Qe_{k}$ , $Q=Q^{T}>0$ . A continuous time limit of (24) leads to an integral as

$$\text{continuous regret}_{T}=\int_{t_{0}}^{T}e^{T}Qe\,dt-\int_{t_{0}}^{T}\bar{\delta}(t)\,dt \tag{25}$$

where $\bar{\delta}(t)$ is an exponentially decaying signal which is due to nonzero initial conditions in (7) or similarly in (20) (this may be seen by setting $\theta(t)\equiv\theta^{*}$ in (7) or (20), thus resulting in an exponentially decreasing $e^{T}Qe$). Note that this exponentially decaying term is absent in the time varying regression case (6). A strong similarity can thus be seen between (23) and (25).

It is desired to have regret grow sub-linearly with time, such that average regret, $(1/T)\text{regret}_{T}$ , goes to zero in the limit $T\rightarrow\infty$ , to provide for an efficient algorithm [17]. Average regret can be connected to convergence in the case of a constant cost and by applying Jensen’s inequality as [17]

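$$\mathcal{C}(\bar{\theta}_{T})-\mathcal{C}(\theta^{*})\leq\frac{1}{T}\sum_{k=1}^{T}\left[\mathcal{C}(\theta_{k})-\mathcal{C}(\theta^{*})\right]=\frac{\text{regret}_{T}}{T} \tag{26}$$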

where $\bar{\theta}_{T}=(1/T)\sum_{k=1}^{T}\theta_{k}$ is the average parameter estimate. Here sub-linear regret helps show convergence of the costs in (26). For adaptive control, convergence of state/output errors is shown from a similar integral which is akin to constant regret upper bounded by $V(t_{0})$ in (23).
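The relation between average regret and the cost at the averaged iterate can be checked on a small example. The sketch below assumes a constant convex cost $\mathcal{C}(\theta)=(1/2)\lVert\theta-c\rVert^{2}$ with $\Theta=\mathbb{R}^{2}$, so that $\theta^{*}=c$, and runs gradient descent with a decreasing step size; the cost, the step size schedule, and the horizon are hypothetical choices.

```python
import numpy as np

c = np.array([1.0, -2.0])              # minimizer of the assumed constant cost

def cost(theta):
    """Constant convex cost C(theta) = 0.5 * ||theta - c||^2."""
    return 0.5 * np.sum((theta - c) ** 2)

T = 200
theta = np.zeros(2)
iterates = []
for k in range(1, T + 1):
    iterates.append(theta.copy())                      # theta_k, the estimate played at step k
    theta = theta - (0.5 / np.sqrt(k)) * (theta - c)   # gradient step with decreasing stepsize
iterates = np.array(iterates)

regret = sum(cost(th) for th in iterates) - T * cost(c)   # (24) with a constant cost
theta_avg = iterates.mean(axis=0)                          # average parameter estimate
print(cost(theta_avg) - cost(c), regret / T)
```

By Jensen's inequality the first printed number cannot exceed the second, and both shrink as the horizon `T` grows when regret is sub-linear.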

4.2 Unmodeled Dynamics and Generalization

This section discusses robustness to unforeseen perturbations such as unmodeled dynamics and unseen data.

4.2.1 Adaptive Control

Models used to design adaptive controllers, including the examples of (6) and (7), are linearized approximations with a certain amount of modeling error. As such, they may only hold about an operating point and need to contend with unmodeled dynamics. This implies that any stabilizing controller must be designed not only to adapt to parametric uncertainties, but also to be robust to unmodeled dynamics. In addition, constraints on the state and input may also be present in adaptive control problems [32, 33]. Analysis becomes more complicated when considering such unmodeled dynamics and constraints, resulting in non-global guarantees. Many of the update law modifications in adaptive control from Section 3 were initially derived to ensure robustness in such cases.

4.2.2 Machine Learning

This same notion of robustness to modeling errors exists in machine learning in which an estimator $\hat{y}$ is constructed from a finite training data set, often with a finite number of tuneable parameters. It is then desired that this estimator produces a low prediction error based on a test data set consisting of not just seen data, but unseen data as well. Generalization in machine learning thus refers to the concept of a designed estimator having low loss when applied to new problems. In particular it can be seen that in specific cases, generalization pertains to stability, where algorithms that are stable and train in a small amount of time result in a small generalization error [34, 35].

4.3 Persistence of Excitation and Stochastic Perturbations

This section discusses conditions under which parameter estimates can be guaranteed to converge to their true values.

4.3.1 Adaptive Control

Persistence of excitation (PE) of the system regressor in adaptive control is a condition that has been shown to be necessary and sufficient for parameter convergence [36]. It can be shown that if the regressor $\phi$ is persistently exciting, then the algebraic error model (6) parameter estimation error $\tilde{\theta}(t)$ converges to zero uniformly in time [7]. Similar conditions can be imposed for the dynamical error model (7) and update law (9) [7]. The PE condition essentially corresponds to certain spectral conditions being satisfied by the regressor [37]; in particular, [38] established a condition on spectral lines of signals. Parameter convergence can also occur through the use of “the hybrid algorithm”, “the integral algorithm”, “the algorithm with time-varying adaptive gains”, and “the algorithm using multiple models” as is discussed in [7]. A detailed exposition of system identification and parameter convergence in both deterministic and stochastic cases can be found in [39, 40, 41, 42, 43]. Another way to think of the PE condition is that it leads to a perfect test error, since it provides for convergence of the parameter error to zero, and therefore zero output/state error once transients decay to zero.
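A standard way to state the PE condition is that the integral of $\phi\phi^{T}$ over every window of some fixed length is bounded below by a positive multiple of the identity. The sketch below approximates that integral over a single window by a Riemann sum and reports its smallest eigenvalue; the function name `min_excitation`, the window, and the two regressor signals are assumptions for illustration, and a full PE check would repeat this over all windows.

```python
import numpy as np

def min_excitation(phi_samples, dt):
    """Smallest eigenvalue of the Riemann-sum approximation of the integral of phi phi^T."""
    gram = dt * phi_samples.T @ phi_samples
    return np.linalg.eigvalsh(gram)[0]     # eigvalsh returns eigenvalues in ascending order

dt = 0.01
t = np.arange(0.0, 2.0 * np.pi, dt)
phi_rich = np.column_stack([np.sin(t), np.cos(t)])                   # excites both directions
phi_poor = np.column_stack([np.ones_like(t), 2.0 * np.ones_like(t)]) # constant: one direction only

print(min_excitation(phi_rich, dt), min_excitation(phi_poor, dt))
```

The sinusoidal regressor yields a strictly positive lower bound (close to $\pi$ over one period), whereas the constant regressor yields a smallest eigenvalue that is zero up to rounding, so it cannot be persistently exciting in two dimensions.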

4.3.2 Machine Learning

Many machine learning problems consider the case when stochastic perturbations are present. In this context, significant improvements may be possible by leveraging well known concepts in system identification [43]. For example, [44] purposely includes a Gaussian random input into a dynamical system in order to provide for PE by construction. Such stochastic perturbations can guarantee a PE condition only in the limit, when infinite samples can be obtained. In order to address the realistic case of finite samples, approaches in machine learning algorithms for system identification and control have attempted to obtain performance bounds with probability $1-p_{f}$ for $p_{f}\in(0,1)$ (the performance bound usually scales inversely with $p_{f}$ as well). The probability of failure given by the choice of $p_{f}$ allows for error due to the presence of finite samples.

4.4 Tracking vs Exploration

The concept of exploration can be viewed as the opposite of tracking, with the former often employed in machine learning while the latter is one of the main control goals.

4.4.1 Adaptive Control

The goal of adaptive control is to adjust the parameter $\theta$ in such a way as to minimize the output error $e_{y}$ in (6) and (7). It can be seen from the error models in (6) and (7) with the update in (9) that as the output error $e_{y}$ goes to zero, learning slows, and that it is possible for a large parameter error to remain even with zero output (or tracking) error. That is, in many adaptive control applications, stability and tracking are successfully accomplished even without parameter convergence.
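The gap between tracking and learning can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the paper's own experiment: the error model (6) is read as $e_{y}=\tilde{\theta}^{T}\phi$, the update (9) as $\dot{\theta}=-\gamma\phi e_{y}$, and the integration uses a forward Euler step. The function `simulate`, the regressors, and all numerical values are hypothetical.

```python
import math

def simulate(phi, theta_star, gamma=1.0, dt=1e-3, t_end=20.0):
    """Euler integration of the assumed update theta_dot = -gamma * phi * e_y,
    with e_y = phi^T (theta - theta_star). Returns (|e_y|, parameter error norm)."""
    theta = [0.0, 0.0]
    e_y = 0.0
    for i in range(int(round(t_end / dt))):
        p = phi(i * dt)
        # Output error of the assumed algebraic error model.
        e_y = sum(p_i * (th - ts) for p_i, th, ts in zip(p, theta, theta_star))
        # Gradient-like first-order update.
        theta = [th - dt * gamma * p_i * e_y for th, p_i in zip(theta, p)]
    param_err = math.sqrt(sum((th - ts) ** 2 for th, ts in zip(theta, theta_star)))
    return abs(e_y), param_err

theta_star = [1.0, -2.0]  # hypothetical true parameter

# Constant regressor: only one direction of the parameter space is excited.
ey_const, err_const = simulate(lambda t: (1.0, 1.0), theta_star)

# Rotating regressor: every direction is excited over time.
ey_rot, err_rot = simulate(lambda t: (math.sin(t), math.cos(t)), theta_star)

print(ey_const, err_const)
print(ey_rot, err_rot)
```

With the constant regressor the output error decays to essentially zero while the parameter error settles at a large nonzero value; with the rotating regressor both decay.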

4.4.2 Machine Learning

In many machine learning methods, including reinforcement learning, there exist explicit modifications to update laws to promote exploration of the parameter space. These modifications include restarting trajectories with random initial conditions, adding random perturbations to algorithms, and driving the system towards non-zero error regions [44, 45, 46]. This preference for exploration and learning over stability is motivated by the desire to find optimal parameters of a system. Stability is not always crucial as models are often trained with offline data on a computer, allowing for many iterations without the financial cost of failure present in physical systems (i.e. a nonzero probability of failure $p_{f}$ is acceptable).

4.5 Convergence Guarantees

Notions of convergence guarantees are of importance in both fields, and are discussed here.

4.5.1 Adaptive Control

Adaptive control problems are often parameterized in a specific way such that $e_{y}$ goes to zero asymptotically as in (6) and (7). Parameter convergence is shown to occur in these cases with a persistence of excitation condition (see Section 4.3.1). The specific parameterizations in the output space ensure that a global minimum of $e_{y}=0$ exists and is unique. In the absence of PE, standard adaptive control algorithms converge to one of the many local minima in the parameter space (i.e. $\dot{\tilde{\theta}}\rightarrow 0$ but $\tilde{\theta}\neq 0$ ) [7].

4.5.2 Machine Learning

Machine learning has grown rapidly in recent years, as demonstrated by highly popular and well attended conferences such as ICML and NeurIPS, where the main focus is often empirical performance on large scale problems rather than rigorous proofs of stability. A notable exception is an emerging body of work consisting of optimization-centric problem formulations and the examination of the loss landscape, where recent results have shown that, in certain classes of problems, local minima are nearly equivalent to global minima in terms of performance on test data [47, 48, 49].

4.6 Neural Networks

This section discusses neural networks, a topic common to both fields.

4.6.1 Adaptive Control

Gradient based methods that solve for estimates of unknown parameters via back propagation, in what would develop into the foundations of neural networks, have been used for decades in control, with early examples consisting of finding optimal trajectories [50] in flight control [51], and resource allocation problems [52] (see [53] for a brief history). Since then, the use of neural networks in control systems has expanded to include stabilizing nonlinear dynamical systems [54]. Design and analysis of stable controllers based on neural networks was taken up by the adaptive control community due to the similarities of gradient-like update laws used in neural networks and adaptive control. The adaptive control community developed a well established literature for the use of neural networks in nonlinear dynamical systems in the 1990s [54, 55, 56, 57, 58].

4.6.2 Machine Learning

The use of neural networks in the machine learning community has greatly expanded recently due to the increase in computing power available and an increase in applications [5, 59, 60]. Recurrent neural networks [61, 62, 63], while often similar in structure to nonlinear dynamical systems, have historically been trained in a manner similar to feed-forward neural networks [64] using back propagation through time [65] (Hebbian learning [66] based approaches have also been considered). While a theoretical understanding of why deep neural networks work as well as they do for given problems has been lacking, the machine learning community has worked to rigorously analyze sub-classes of deep neural network architectures such as deep linear networks [67, 68]. The update laws employed in training deep neural networks often include selections of the update law modifications discussed in Section 3. For an overview of the history of neural networks see [69].

4.7 Other Parameterization Schemes

In addition to neural networks as discussed in the previous section, other parameterizations are often considered in adaptive control and machine learning.

4.7.1 Adaptive Control

Adaptive control schemes often consider the case where an unknown parameter occurs linearly with respect to a regressor vector $\phi$ and may be related to an output error $e_{y}$ algebraically (6) or dynamically (7). Often the vector $\phi$ is a nonlinear function of the state of the system or reference model in order to approximate a more general nonlinear function $D$ as $D(x)=\theta^{*T}\phi(x)$ [70]. Common parameterizations for unknown nonlinearities include Gaussian radial basis functions [70]. Another class of parameterizations consists of nonlinearly parameterized uncertainty $D(\theta^{*},\phi)$ in dynamical systems, for which there exist stabilizing adaptive control methods [71, 72].

4.7.2 Machine Learning

Parametric methods are common in machine learning as well, and are useful in many regression and classification based tasks [1, 2, 3, 4, 5]. However, Bayesian based approaches are also widespread in areas such as topic models [73], clustering [74] and graphical models [75]. Additionally, new results in high dimensional statistics are increasingly being considered in which the model may be of higher dimension than the sample size [76].

5 Advantageous Combinations of Machine Learning and Adaptive Control Tools

Given the enormous number of similarities in problem statements, tools, concepts, and algorithms, it is natural to examine what benefits accrue by combining insights obtained in these two different communities. Two examples of such an exercise are delineated in this section.

5.1 Higher Order Learning

Many of the update laws addressed thus far were first-order in nature, and made use of gradient-like quantities for learning. A question of increasing interest in the ML community is when accelerated learning can occur for higher-order learning methods. Higher order learning methods are commonly used in machine learning practice [77, 59, 60] as they can provide for a guaranteed bound on a faster rate of convergence. In particular, Nesterov’s accelerated method [78] was able to certify a convergence rate of $O(1/k^{2})$ as compared to the standard gradient descent (10) rate of $O(1/k)$ for a class of convex functions. A parameterization of Nesterov’s accelerated method may be stated as

[TABLE]

where $\beta>0$ is a design parameter that weighs the effect of past parameters. Continuous time problem formulations have been explored in [79, 80], with rate-matching discretizations established in [81, 82, 83]. Many of these methods, however, become inadequate for time varying inputs.
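To make the role of $\beta$ concrete, the sketch below compares plain gradient descent with one commonly used two-step form of Nesterov's method, $\nu_{k}=\theta_{k}+\beta(\theta_{k}-\theta_{k-1})$, $\theta_{k+1}=\nu_{k}-\alpha\nabla f(\nu_{k})$. This form is an assumption and may differ from the parameterization the paper uses in (27). The quadratic cost, the step size $\alpha$, the constant choice of $\beta$ (a standard one for strongly convex quadratics), and all function names are illustrative.

```python
def grad(theta, h):
    # Gradient of the diagonal quadratic f(theta) = 0.5 * sum(h_i * theta_i^2).
    return [h_i * th for h_i, th in zip(h, theta)]

def cost(theta, h):
    return 0.5 * sum(h_i * th * th for h_i, th in zip(h, theta))

def gradient_descent(theta0, h, alpha, iters):
    theta = list(theta0)
    for _ in range(iters):
        g = grad(theta, h)
        theta = [th - alpha * g_i for th, g_i in zip(theta, g)]
    return theta

def nesterov(theta0, h, alpha, beta, iters):
    theta_prev = list(theta0)
    theta = list(theta0)
    for _ in range(iters):
        # Extrapolate using the previous iterate, weighted by beta.
        nu = [th + beta * (th - tp) for th, tp in zip(theta, theta_prev)]
        g = grad(nu, h)
        theta_prev = theta
        # Gradient step taken from the extrapolated point.
        theta = [n - alpha * g_i for n, g_i in zip(nu, g)]
    return theta

h = [1.0, 100.0]                      # curvatures; condition number 100
alpha = 1.0 / max(h)                  # step size 1/L
kappa = max(h) / min(h)
beta = (kappa ** 0.5 - 1.0) / (kappa ** 0.5 + 1.0)

theta0 = [1.0, 1.0]
f_gd = cost(gradient_descent(theta0, h, alpha, 100), h)
f_nag = cost(nesterov(theta0, h, alpha, beta, 100), h)
print(f_gd, f_nag)
```

After the same number of iterations the accelerated iterate reaches a much lower cost on this ill-conditioned problem, which is the behavior the rate comparison above describes.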

Adaptive update laws which include additional levels of integration appeared in the “higher order tuners” in [84, 85], and take the form

[TABLE]

where $\mathcal{N}(t)\triangleq(1+\mu\phi^{T}(t)\phi(t))$ for a $\mu>0$ . This update law can be seen to be the standard first order update (9) passed through a time varying filter normalized by the regressor. It was shown in [86] that (28) can provide for rates comparable to accelerated methods in machine learning for static features [80]. In addition, in contrast to (27), the update law in (28) can be shown to be stable in the presence of time varying regressors as in (6), as well as in adaptive control applications with error model as in (7) [86]. This extension of accelerated methods in machine learning to include time varying and dynamic error models was only possible by leveraging techniques from adaptive control [86].
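The description above (a first-order update passed through a filter normalized by $\mathcal{N}(t)$) can be sketched in simulation. Equation (28) is not reproduced in this text, so the form used here is an assumption consistent with that description and not necessarily the authors' exact law: $\dot{\vartheta}=-\gamma\phi e_{y}$, $\dot{\theta}=-\beta(\theta-\vartheta)\mathcal{N}(t)$, with $e_{y}=\tilde{\theta}^{T}\phi$. The auxiliary state `vartheta`, the gains, and the forward Euler discretization are illustrative choices.

```python
import math

def higher_order_tuner(phi, theta_star, gamma=1.0, beta=1.0, mu=2.0,
                       dt=1e-3, t_end=20.0):
    """Euler simulation of an assumed two-state tuner:
       vartheta_dot = -gamma * phi * e_y           (first-order update)
       theta_dot    = -beta * N(t) * (theta - vartheta)   (normalized filter)
       with N(t) = 1 + mu * phi^T phi."""
    vartheta = [0.0, 0.0]
    theta = [0.0, 0.0]
    for i in range(int(round(t_end / dt))):
        p = phi(i * dt)
        e_y = sum(p_i * (th - ts) for p_i, th, ts in zip(p, theta, theta_star))
        norm = 1.0 + mu * sum(p_i * p_i for p_i in p)
        d_vartheta = [-gamma * p_i * e_y for p_i in p]
        d_theta = [-beta * norm * (th - vt) for th, vt in zip(theta, vartheta)]
        vartheta = [vt + dt * d for vt, d in zip(vartheta, d_vartheta)]
        theta = [th + dt * d for th, d in zip(theta, d_theta)]
    return theta

theta_star = [1.0, -2.0]  # hypothetical true parameter

# Time varying regressor, the setting the paragraph above is concerned with.
theta_final = higher_order_tuner(lambda t: (math.sin(t), math.cos(t)), theta_star)
ht_err = math.sqrt(sum((th - ts) ** 2 for th, ts in zip(theta_final, theta_star)))
print(theta_final, ht_err)
```

In this run the estimate stays bounded and approaches the true parameter even though the regressor changes continuously in time.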

5.2 Improved Algorithm Performance Bounds

Regret analysis common in online machine learning (see Section 4.1.2) can result in overly conservative bounds for the performance of an algorithm. In particular, in online projected gradient descent (17) for regression (6) with squared output error cost $\mathcal{C}=(1/2)e_{y}^{2}$ , regret analysis guarantees $\text{regret}_{T}=O(\sqrt{T})$ (cf. [17]). For the same regret cost function, one can guarantee $\text{regret}_{T}=O(1)$ (constant regret) using adaptive control methods. (For regression as in (6), regret contains a sum of non-negative costs and is therefore a non-decreasing function of the time horizon $T$ . Thus $O(1)$ regret is the best achievable regret.)
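A continuous-time analogue shows why a constant bound is plausible. The following is a sketch under stated assumptions, not the paper's discrete-time argument: the error model (6) is read as $e_{y}=\tilde{\theta}^{T}\phi$ with constant $\theta^{*}$, the update (9) as $\dot{\theta}=-\gamma\phi e_{y}$, and regret is measured against $\theta^{*}$, whose cost is zero.

```latex
% Assumed forms: e_y = \tilde{\theta}^T \phi, \quad \dot{\theta} = -\gamma \phi e_y,
% \quad \tilde{\theta} = \theta - \theta^*, \quad \theta^* constant.
V(t) = \frac{1}{2\gamma}\,\|\tilde{\theta}(t)\|^{2},
\qquad
\dot{V}(t) = \frac{1}{\gamma}\,\tilde{\theta}^{T}(t)\,\dot{\theta}(t)
           = -\tilde{\theta}^{T}(t)\,\phi(t)\,e_{y}(t)
           = -e_{y}^{2}(t).

% Integrating over [0, T] and using V(T) \ge 0:
\int_{0}^{T} \tfrac{1}{2}\,e_{y}^{2}(t)\,dt
  = \tfrac{1}{2}\bigl(V(0) - V(T)\bigr)
  \le \frac{1}{4\gamma}\,\|\tilde{\theta}(0)\|^{2}.
```

The accumulated cost is bounded by a constant that depends only on the initial parameter error and the gain, independent of the horizon $T$.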

6 Conclusions

This paper explored many immediate connections between adaptive control and machine learning, both through common update laws as well as common concepts. Adaptive control as a field has focused on mathematical rigor and guaranteed convergence. The rapid advances in machine learning on the other hand have brought about a plethora of new techniques and problems for learning. This paper was written to elucidate the numerous common connections between both fields such that results from both may be leveraged together to solve new problems.

7 Acknowledgements

The authors acknowledge Dr. Michael I. Jordan for useful discussions. This work was supported by the Air Force Research Laboratory, Collaborative Research and Development for Innovative Aerospace Leadership (CRDInAL), Thrust 3 - Control Automation and Mechanization grant FA 8650-16-C-2642 and the Boeing Strategic University Initiative.

Bibliography

[1] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Edition. John Wiley & Sons, 2001.
[2] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[3] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
[4] B. Efron and T. Hastie, Computer Age Statistical Inference: Algorithms, Evidence and Data Science. Cambridge University Press, 2016.
[5] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[6] M. I. Jordan and T. M. Mitchell, “Machine learning: Trends, perspectives, and prospects,” Science, vol. 349, pp. 255–260, Jul. 2015.
[7] K. S. Narendra and A. M. Annaswamy, Stable Adaptive Systems. NJ: Prentice-Hall, Inc., 1989. (out of print).
[8] S. Sastry and M. Bodson, Adaptive Control: Stability, Convergence and Robustness. Prentice-Hall, 1989.
