A probabilistic incremental proximal gradient method

Ömer Deniz Akyildiz; Émilie Chouzenoux; Víctor Elvira; Joaquín Míguez

arXiv:1812.01655·math.OC·June 20, 2019·IEEE Signal Process. Lett.


TL;DR

This paper introduces the probabilistic incremental proximal gradient (PIPG) method, a probabilistic interpretation of the incremental proximal gradient algorithm that propagates the uncertainty of the solution estimate over iterations and enables the use of Bayesian filters for large-scale optimization.

Contribution

It develops a novel probabilistic interpretation of the incremental proximal gradient algorithm and integrates Bayesian filtering techniques for improved uncertainty management.

Findings

1. Enables uncertainty propagation in optimization iterations
2. Allows use of Kalman and extended Kalman filters in optimization
3. Facilitates large-scale regularized problem solving

Abstract

In this paper, we propose a probabilistic optimization method, named probabilistic incremental proximal gradient (PIPG) method, by developing a probabilistic interpretation of the incremental proximal gradient algorithm. We explicitly model the update rules of the incremental proximal gradient method and develop a systematic approach to propagate the uncertainty of the solution estimate over iterations. The PIPG algorithm takes the form of Bayesian filtering updates for a state-space model constructed by using the cost function. Our framework makes it possible to utilize well-known exact or approximate Bayesian filters, such as Kalman or extended Kalman filters, to solve large-scale regularized optimization problems.

Equations

$\min_{\boldsymbol{\theta}\in\mathbb{R}^{d}} f(\boldsymbol{\theta})+g(\boldsymbol{\theta})$ (1)

$\overline{\boldsymbol{\theta}}_{k}=\operatorname{prox}_{\gamma f_{k},\mathbf{V}_{k-1}}(\overline{\boldsymbol{\theta}}_{k-1})$, $\forall k\in\{1,\dots,n\}$ (4)

$f_{k}(\boldsymbol{\theta})=\frac{1}{2}(y_{k}-\mathbf{x}_{k}^{\top}\boldsymbol{\theta})^{2}$, $\forall\boldsymbol{\theta}\in\mathbb{R}^{d}$ (5)

$f_{k}(\boldsymbol{\theta})=\frac{1}{2}(y_{k}-h_{k}(\boldsymbol{\theta}))^{2}$, $\forall\boldsymbol{\theta}\in\mathbb{R}^{d}$ (6)

$\overline{\boldsymbol{\theta}}_{k}=\operatorname{prox}_{\gamma f_{k},\mathbf{V}_{k-1}}\big(\overline{\boldsymbol{\theta}}_{k-1}-\gamma\mathbf{V}_{k-1}\nabla g(\overline{\boldsymbol{\theta}}_{k-1})\big)$ (7)

$g(\boldsymbol{\theta})=\frac{1}{2}\|\mathbf{A}\boldsymbol{\theta}\|_{2}^{2}$, $\forall\boldsymbol{\theta}\in\mathbb{R}^{d}$ (8)

$\mathbf{M}_{k}=\mathbf{I}_{d}-\gamma\mathbf{V}_{k-1}\mathbf{A}^{\top}\mathbf{A}$, $\forall k\in\{1,\dots,n\}$ (14)

$m_{\mathbf{V}}(\overline{\boldsymbol{\theta}})=\overline{\boldsymbol{\theta}}-\gamma\mathbf{V}\nabla g(\overline{\boldsymbol{\theta}})$

$\mathbf{M}_{k}=\mathbf{I}_{d}-\gamma\mathbf{V}_{k-1}\nabla^{2}g(\overline{\boldsymbol{\theta}}_{k-1})$, $\forall k\in\{1,\dots,n\}$

Full text

A probabilistic incremental proximal gradient method

Ömer Deniz Akyildiz, Émilie Chouzenoux, Víctor Elvira, Joaquín Míguez

Ö. D. Akyildiz is with the Dept. of Computer Science and Dept. of Statistics at University of Warwick and Alan Turing Institute, London, UK. É. Chouzenoux is with the Center for Visual Computing, INRIA Saclay, CentraleSupélec, Gif-sur-Yvette, France. V. Elvira is with IMT Lille Douai & CRIStAL laboratory (UMR CNRS 9189), Villeneuve d’Ascq, France. J. Míguez is with the Dept. of Signal Theory and Communications, Universidad Carlos III de Madrid, Leganés, Spain, 28912.

Ö. D. A. is funded by the Lloyds Register Foundation programme on Data Centric Engineering through the London Air Quality project and supported by The Alan Turing Institute for Data Science and AI under EPSRC grant EP/N510129/1. J. M. acknowledges the support of the Spanish Agencia Estatal de Investigación (TEC2015-69868-C2-1-R ADVENTURE) and the Office of Naval Research (N00014-19-1-2226). V. E. and É. C. acknowledge the support from the Agence Nationale de la Recherche of France under PISCES (ANR-17-CE40-0031-01) and MAJIC (ANR-17-CE40-0004-01) projects.

Index Terms:

Probabilistic optimization, stochastic gradient, proximal algorithms, extended Kalman filtering

I Introduction

In this paper, we are interested in optimization problems of the form

$$\min_{\boldsymbol{\theta}\in\mathbb{R}^{d}} f(\boldsymbol{\theta})+g(\boldsymbol{\theta}), \tag{1}$$

with $f({\boldsymbol{\theta}})=\sum_{k=1}^{n}f_{k}({\boldsymbol{\theta}})$ where, for $1\leq k\leq n$, the functions $f_{k}:{\mathbb{R}}^{d}\to{\mathbb{R}}$ are nonlinear least squares terms, i.e., $f_{k}=\frac{1}{2}(y_{k}-h_{k}(\cdot))^{2}$, with $y_{k}\in{\mathbb{R}}$ and $h_{k}:{\mathbb{R}}^{d}\to{\mathbb{R}}$ a nonlinear differentiable mapping. Moreover, $g:{\mathbb{R}}^{d}\to{\mathbb{R}}$ is a twice-differentiable regularizer. Because classical optimization schemes may be inefficient for solving (1) when $n$ is very large, stochastic or incremental optimization methods have gained significant momentum. In particular, stochastic gradient descent (SGD) [1] has become widely popular for such problems. At each SGD iteration, a mini-batch of component functions is randomly selected and a gradient step with respect to this mini-batch is performed. A number of SGD variants have since been developed (see [2, 3] for a review).

The objective function in Eq. (1) has a sum structure. Therefore, it opens the door to algorithms that are more efficient than gradient methods, such as proximal splitting methods [4, 5]. In particular, the proximal gradient (PG) method minimizes a sum of two terms, one of them smooth, by alternating a gradient step on the differentiable term and a proximal update on the other, thereby fully exploiting the structure of the cost function. Naturally, stochastic extensions of proximal methods have become increasingly popular in the machine learning literature, see, e.g., [6, 7, 8, 9, 10]. The optimization method under consideration in this paper is known as the incremental proximal gradient (IPG) algorithm [7], and can be understood as an incremental version of the stochastic proximal gradient method [8, 9]. Similarly to its batch version PG, the IPG method solves (1) by using the gradient of $g$ and the proximal operator of $f_{k}$ (or vice versa) at each iteration to move within the parameter space. Therefore, the IPG takes advantage of the structure of the cost function while staying computationally efficient for large $n$.

In this paper, we propose a probabilistic IPG (PIPG) method to solve the problem in Eq. (1). The PIPG algorithm reads as an approximate inference method in a probabilistic state-space model (SSM) tailored to the loss function. To be specific, it takes the form of an extended Kalman filter (EKF) that infers the hidden states of this SSM. This setting yields a probabilistic interpretation which enables the quantification of the uncertainty of the estimates at any time, extending our previous work [11], which only focused on the case $g=0$. The posterior covariance matrix involved in the PIPG updates plays the role of a variable metric. Thus, another key advantage of PIPG is that it provides an adaptive rule for the metric update within the IPG scheme. Note that the PIPG method is related to the class of probabilistic numerical methods (see, e.g., [12, 13, 14]), extending such methods to large-scale optimization problems. Related work includes [15], which emphasizes the links between Kalman filtering and the online natural gradient method, which can be viewed as an SGD within a specific variable metric. In [16], connections between LMS and Kalman filters are exploited to propose a new algorithm. In [17], the author proposes some variance reduction strategies for SGD, relying on a Kalman interpretation. In contrast, in this work we take advantage of the structure of the cost function itself and we focus on the connection between Kalman and proximal methods.

The paper is organized as follows. In Section II, we briefly review the background. In Section III, we introduce the new scheme and the update rules in detail. In Section IV, we demonstrate the performance of our method on a ridge regression problem and a nonlinear sparse filter identification problem. We conclude with Section V.

II Background

Let us start by defining the proximal operator [18] (see also http://proximity-operator.net/).

Definition 1.

The proximal operator of a convex, proper, lower semi-continuous function $f:{\mathbb{R}}^{d}\to{\mathbb{R}}$ within the metric induced by a symmetric positive definite (SPD) matrix ${\mathbf{V}}_{0}\in{\mathbb{R}}^{d\times d}$ is defined as $\operatorname{prox}_{f,{\mathbf{V}}_{0}}({\boldsymbol{\theta}}_{0})=\operatorname*{argmin}_{{\boldsymbol{\theta}}\in{\mathbb{R}}^{d}}f({\boldsymbol{\theta}})+\frac{1}{2}\|{\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{0}\|_{2,{\mathbf{V}}_{0}}^{2}$, where $\|{\boldsymbol{\theta}}\|_{2,{\mathbf{V}}}:=({\boldsymbol{\theta}}^{\top}{\mathbf{V}}^{-1}{\boldsymbol{\theta}})^{1/2}$ is the Mahalanobis norm.
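As an illustration of Definition 1, the following sketch evaluates the proximal operator of $\gamma f$ for a single quadratic term $f({\boldsymbol{\theta}})=\frac{1}{2}(y-{\mathbf{x}}^{\top}{\boldsymbol{\theta}})^{2}$ in the metric induced by ${\mathbf{V}}$. The closed form is obtained by setting the gradient of the prox objective to zero and applying the Sherman–Morrison identity. The function name `prox_quadratic` and all numerical values are assumptions made for the example; they do not come from the paper.

```python
import numpy as np

def prox_quadratic(theta0, V, x, y, gamma):
    """prox_{gamma*f, V}(theta0) for f(theta) = 0.5 * (y - x @ theta)**2.

    Minimizes gamma*f(theta) + 0.5*(theta - theta0)^T V^{-1} (theta - theta0).
    """
    Vx = V @ x
    residual = y - x @ theta0
    return theta0 + Vx * residual / (1.0 / gamma + x @ Vx)

# Illustrative values (assumptions for this example only).
rng = np.random.default_rng(0)
d = 4
B = rng.standard_normal((d, d))
V = B @ B.T + np.eye(d)            # a symmetric positive definite metric
theta0 = rng.standard_normal(d)
x = rng.standard_normal(d)
y, gamma = 0.7, 0.5

theta_prox = prox_quadratic(theta0, V, x, y, gamma)

# First-order optimality: the gradient of the prox objective vanishes at the minimizer.
grad = -gamma * x * (y - x @ theta_prox) + np.linalg.solve(V, theta_prox - theta0)
```

The vector `grad` should be numerically zero, which confirms that the closed form is the minimizer of the strictly convex prox objective.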

Let us now present the Kalman updates from [11], which perform Bayesian inference for the model $p({\boldsymbol{\theta}})={\mathcal{N}}({\boldsymbol{\theta}};\overline{{\boldsymbol{\theta}}}_{0},{\mathbf{V}}_{0})$ and $p(y_{k}|{\boldsymbol{\theta}})={\mathcal{N}}(y_{k};{\mathbf{x}}_{k}^{\top}{\boldsymbol{\theta}},\gamma^{-1})$, where $\gamma>0$, $\overline{{\boldsymbol{\theta}}}_{0}\in\mathbb{R}^{d}$, the SPD matrix ${\mathbf{V}}_{0}\in\mathbb{R}^{d\times d}$, and ${\mathbf{x}}_{k}\in\mathbb{R}^{d}$, for $k=1,\ldots,n$, are predefined values, and $(y_{k},{\boldsymbol{\theta}})$ are random variables in $\mathbb{R}$ and $\mathbb{R}^{d}$, respectively. For this model, assuming that the inputs ${\mathbf{x}}_{1:k}$ are fixed and the likelihood factorizes as $p(y_{1:k}|{\boldsymbol{\theta}})=\prod_{i=1}^{k}p(y_{i}|{\boldsymbol{\theta}})$ (i.e., the observations are conditionally independent), the mean and the covariance of the Gaussian posterior $p({\boldsymbol{\theta}}|y_{1:k})=\mathcal{N}({\boldsymbol{\theta}};\overline{{\boldsymbol{\theta}}}_{k},{\mathbf{V}}_{k})$ can be written as [11]

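$$\overline{\boldsymbol{\theta}}_{k}=\overline{\boldsymbol{\theta}}_{k-1}+\frac{\mathbf{V}_{k-1}\mathbf{x}_{k}\,(y_{k}-\mathbf{x}_{k}^{\top}\overline{\boldsymbol{\theta}}_{k-1})}{\gamma^{-1}+\mathbf{x}_{k}^{\top}\mathbf{V}_{k-1}\mathbf{x}_{k}}, \tag{2}$$

$$\mathbf{V}_{k}=\mathbf{V}_{k-1}-\frac{\mathbf{V}_{k-1}\mathbf{x}_{k}\mathbf{x}_{k}^{\top}\mathbf{V}_{k-1}}{\gamma^{-1}+\mathbf{x}_{k}^{\top}\mathbf{V}_{k-1}\mathbf{x}_{k}}. \tag{3}$$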

Note that at the last iteration, with $k=n$, the Gaussian posterior $p({\boldsymbol{\theta}}|y_{1:n})$ is computed exactly, with parameters given by Eqs. (2)–(3). The sequence $(\overline{{\boldsymbol{\theta}}}_{k})_{1\leq k\leq n}$ turns out to be identical to the first $n$ iterations of the incremental proximal method (IPM) recursion [6, 7] applied to Problem (1):

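$$\overline{\boldsymbol{\theta}}_{k}=\operatorname{prox}_{\gamma f_{k},\mathbf{V}_{k-1}}(\overline{\boldsymbol{\theta}}_{k-1}),\quad(\forall k\in\{1,\ldots,n\}) \tag{4}$$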

when $g=0$ and

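$$f_{k}(\boldsymbol{\theta})=\frac{1}{2}(y_{k}-\mathbf{x}_{k}^{\top}\boldsymbol{\theta})^{2},\quad(\forall\boldsymbol{\theta}\in\mathbb{R}^{d}) \tag{5}$$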

for all $k\in\left\{1,\ldots,n\right\}$, where $({\mathbf{V}}_{k})_{1\leq k\leq n}$ are specified as in (3) (see Props. 4.2–4.4 in [19] for a proof). This viewpoint has been extended in [11] to nonlinear least squares, where

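$$f_{k}(\boldsymbol{\theta})=\frac{1}{2}(y_{k}-h_{k}(\boldsymbol{\theta}))^{2},\quad\forall\boldsymbol{\theta}\in\mathbb{R}^{d}, \tag{6}$$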

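for all $k\in\left\{1,\ldots,n\right\}$. In Eq. (6), each $h_{k}:{\mathbb{R}}^{d}\to{\mathbb{R}}$ is a differentiable function, possibly nonlinear. Thus, the IPM iteration of Eq. (4) may not be available in closed form. One can implement the EKF for a model with prior $p({\boldsymbol{\theta}})={\mathcal{N}}({\boldsymbol{\theta}};\overline{{\boldsymbol{\theta}}}_{0},{\mathbf{V}}_{0})$ and likelihood $p(y_{k}|{\boldsymbol{\theta}})={\mathcal{N}}(y_{k};h_{k}({\boldsymbol{\theta}}),\gamma^{-1})$, by linearizing $(h_{k})_{1\leq k\leq n}$. Denoting ${\mathbf{d}}_{k}=\nabla h_{k}(\overline{{\boldsymbol{\theta}}}_{k-1})$, we obtain the update rules [11, 19]

$$\overline{\boldsymbol{\theta}}_{k}=\overline{\boldsymbol{\theta}}_{k-1}+\frac{\mathbf{V}_{k-1}\mathbf{d}_{k}\,(y_{k}-h_{k}(\overline{\boldsymbol{\theta}}_{k-1}))}{\gamma^{-1}+\mathbf{d}_{k}^{\top}\mathbf{V}_{k-1}\mathbf{d}_{k}},\qquad\mathbf{V}_{k}=\mathbf{V}_{k-1}-\frac{\mathbf{V}_{k-1}\mathbf{d}_{k}\mathbf{d}_{k}^{\top}\mathbf{V}_{k-1}}{\gamma^{-1}+\mathbf{d}_{k}^{\top}\mathbf{V}_{k-1}\mathbf{d}_{k}},$$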

for $k\in\left\{1,\ldots,n\right\}$ . Since the EKF is an approximate Bayesian scheme, multiple passes over the dataset can be performed.
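The update described above can be sketched in a few lines. The function below performs one linearized (EKF-style) update for the likelihood ${\mathcal{N}}(y_{k};h_{k}({\boldsymbol{\theta}}),\gamma^{-1})$, written from the standard Kalman update formulas; the names `ekf_step`, `h`, and `grad_h` and the synthetic data are assumptions for illustration. As a sanity check, when $h_{k}$ is linear the recursion should reproduce the batch Gaussian posterior after one pass.

```python
import numpy as np

def ekf_step(theta, V, y, h, grad_h, gamma):
    """One update for the likelihood N(y; h(theta), 1/gamma), linearized at theta."""
    d_k = grad_h(theta)                       # gradient of h at the current mean
    Vd = V @ d_k
    s = 1.0 / gamma + d_k @ Vd                # innovation variance (scalar)
    theta_new = theta + Vd * (y - h(theta)) / s
    V_new = V - np.outer(Vd, Vd) / s
    return theta_new, V_new

# Sanity check in the linear case h_k(theta) = x_k @ theta (synthetic data, assumed values).
rng = np.random.default_rng(1)
n, d, gamma = 50, 3, 2.0
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n) / np.sqrt(gamma)

theta0, V0 = np.zeros(d), 10.0 * np.eye(d)
theta, V = theta0.copy(), V0.copy()
for x_k, y_k in zip(X, y):
    theta, V = ekf_step(theta, V, y_k, lambda t: x_k @ t, lambda t: x_k, gamma)

# Batch Gaussian posterior for the same prior and likelihood.
V_batch = np.linalg.inv(np.linalg.inv(V0) + gamma * X.T @ X)
theta_batch = V_batch @ (np.linalg.solve(V0, theta0) + gamma * X.T @ y)
```

For a nonlinear `h`, the same function applies unchanged, but the result is only an approximation of the posterior, which is why several passes over the data may help.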

III A Probabilistic IPG method

We now focus on solving the optimization problem in Eq. (1) when $g\neq 0$. The structure of the cost function suggests the use of the IPG iteration [6, 7]. We consider a variable-metric extension of the IPG. In particular, given (1), the first $n$ iterations of the variable-metric IPG update read as

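$$\overline{\boldsymbol{\theta}}_{k}=\operatorname{prox}_{\gamma f_{k},\mathbf{V}_{k-1}}\big(\overline{\boldsymbol{\theta}}_{k-1}-\gamma\mathbf{V}_{k-1}\nabla g(\overline{\boldsymbol{\theta}}_{k-1})\big), \tag{7}$$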

with $\overline{{\boldsymbol{\theta}}}_{0}\in\mathbb{R}^{d}$, and $({\mathbf{V}}_{k})_{k\geq 0}$ some predefined SPD matrices in $\mathbb{R}^{d\times d}$. The update (7) can be viewed as an incremental version of the batch variable-metric PG method that has been extensively studied recently in the optimization literature [20, 21]. In the sequel, we propose a probabilistic interpretation of the IPG which leads to a new update rule for the variable-metric matrices. We first consider the linear case (i.e., quadratic $f$ and $g$) for the sake of simplicity, since all computations are tractable and the inference can be performed exactly. Then we present our general version of the PIPG that encompasses a wider class of cost functions.

III-A Linear-Quadratic case

Let us first assume that $(f_{k})_{1\leq k\leq n}$ is defined as in (5) and

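$$g(\boldsymbol{\theta})=\frac{1}{2}\|\mathbf{A}\boldsymbol{\theta}\|_{2}^{2}\quad(\forall\boldsymbol{\theta}\in\mathbb{R}^{d}), \tag{8}$$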

with ${\mathbf{A}}\in\mathbb{R}^{m\times d}$ , $m\geq 1$ . Note that ${\mathbf{A}}$ is assumed to be known. Using (8), we can write (7) as

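$$\widetilde{\boldsymbol{\theta}}_{k}=(\mathbf{I}_{d}-\gamma\mathbf{V}_{k-1}\mathbf{A}^{\top}\mathbf{A})\,\overline{\boldsymbol{\theta}}_{k-1}, \tag{9}$$

$$\overline{\boldsymbol{\theta}}_{k}=\operatorname{prox}_{\gamma f_{k},\mathbf{V}_{k-1}}(\widetilde{\boldsymbol{\theta}}_{k}), \tag{10}$$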

for $k=1,\ldots,n$ . The key observation here is that Eqs. (9)–(10) can be seen as approximate (Kalman) filtering recursions [22]. To be specific, Eq. (9) can be seen as the analog to the prediction step within a Kalman filter. Similarly, the update (10) can be seen as a Bayesian update using (5), see Eq. (2) [11]. However, Eqs. (9)–(10) are different from a Kalman filter, where there would be an update of the covariance matrix between (9)–(10). Therefore, inspired by Eqs. (9)–(10), we propose the use of the following state-space model,

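$$p(\boldsymbol{\theta}_{0})=\mathcal{N}(\boldsymbol{\theta}_{0};\overline{\boldsymbol{\theta}}_{0},\mathbf{V}_{0}), \tag{11}$$

$$p(\boldsymbol{\theta}_{k}\mid\boldsymbol{\theta}_{k-1})=\mathcal{N}(\boldsymbol{\theta}_{k};\mathbf{M}_{k}\boldsymbol{\theta}_{k-1},\mathbf{0}_{d\times d}), \tag{12}$$

$$p(y_{k}\mid\boldsymbol{\theta}_{k})=\mathcal{N}(y_{k};\mathbf{x}_{k}^{\top}\boldsymbol{\theta}_{k},\gamma^{-1}), \tag{13}$$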

where $\mathbf{0}_{d\times d}$ is the zero matrix in ${\mathbb{R}}^{d\times d}$, and

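$$\mathbf{M}_{k}=\mathbf{I}_{d}-\gamma\mathbf{V}_{k-1}\mathbf{A}^{\top}\mathbf{A}\quad(\forall k\in\{1,\ldots,n\}), \tag{14}$$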

with $\mathbf{I}_{d}$ the identity matrix of $\mathbb{R}^{d\times d}$. Now, assume that the pair $(\overline{{\boldsymbol{\theta}}}_{k-1},{\mathbf{V}}_{k-1})$ is given. We propose to apply filtering recursions for the model (11)–(13), which leads to the PIPG updates for the linear-quadratic case. The recursions now consist of a predictive step for the mean and covariance

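$$\widetilde{\boldsymbol{\theta}}_{k}=\mathbf{M}_{k}\overline{\boldsymbol{\theta}}_{k-1}, \tag{15}$$

$$\widetilde{\mathbf{V}}_{k}=\mathbf{M}_{k}\mathbf{V}_{k-1}\mathbf{M}_{k}^{\top}, \tag{16}$$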

respectively, and the update of the mean and covariance,

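$$\overline{\boldsymbol{\theta}}_{k}=\widetilde{\boldsymbol{\theta}}_{k}+\frac{\widetilde{\mathbf{V}}_{k}\mathbf{x}_{k}\,(y_{k}-\mathbf{x}_{k}^{\top}\widetilde{\boldsymbol{\theta}}_{k})}{\gamma^{-1}+\mathbf{x}_{k}^{\top}\widetilde{\mathbf{V}}_{k}\mathbf{x}_{k}}, \tag{17}$$

$$\mathbf{V}_{k}=\widetilde{\mathbf{V}}_{k}-\frac{\widetilde{\mathbf{V}}_{k}\mathbf{x}_{k}\mathbf{x}_{k}^{\top}\widetilde{\mathbf{V}}_{k}}{\gamma^{-1}+\mathbf{x}_{k}^{\top}\widetilde{\mathbf{V}}_{k}\mathbf{x}_{k}}, \tag{18}$$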

respectively, with $({\mathbf{M}}_{k})_{1\leq k\leq n}$ defined in (14). It is worth noting that, in Eqs. (9) and (10), a single ${\mathbf{V}}_{k-1}$ is used for both iterations. In the corresponding iterations in the proposed method, i.e., Eqs. (15) and (17), we make use of ${\mathbf{V}}_{k-1}$ and $\widetilde{{\mathbf{V}}}_{k}$ , respectively. In this case, one pass over the dataset is enough since the posterior is exact for the model (11)–(13).
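A minimal sketch of one pass of the linear-quadratic recursions is given below. It is written from the standard Kalman prediction and update formulas for the model (11)–(13), so the exact arrangement of terms in Eqs. (15)–(18) may differ. The function name `pipg_linear_step`, the prior ${\mathbf{V}}_{0}=\mathbf{I}_{d}$, the zero initial mean, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def pipg_linear_step(theta, V, x, y, A, gamma):
    """One PIPG iteration for quadratic f_k and g(theta) = 0.5 * ||A theta||^2."""
    d = theta.size
    M = np.eye(d) - gamma * V @ (A.T @ A)     # transition matrix, Eq. (14)
    theta_pred = M @ theta                    # predicted mean
    V_pred = M @ V @ M.T                      # predicted covariance
    Vx = V_pred @ x
    s = 1.0 / gamma + x @ Vx                  # innovation variance (scalar)
    theta_new = theta_pred + Vx * (y - x @ theta_pred) / s   # updated mean
    V_new = V_pred - np.outer(Vx, Vx) / s                    # updated covariance
    return theta_new, V_new

# Small synthetic ridge problem (all values below are illustrative assumptions).
rng = np.random.default_rng(0)
d, n, gamma, lam = 5, 500, 0.1, 1e-2
A = np.sqrt(lam) * np.eye(d)
theta_star = np.linspace(-1.0, 1.0, d)
X = rng.standard_normal((n, d))
y = X @ theta_star + rng.standard_normal(n)

theta, V = np.zeros(d), np.eye(d)
for x_k, y_k in zip(X, y):
    theta, V = pipg_linear_step(theta, V, x_k, y_k, A, gamma)

rel_err = np.linalg.norm(theta - theta_star) / np.linalg.norm(theta_star)
```

Note how the prediction step uses `V` (that is, ${\mathbf{V}}_{k-1}$) through `M`, while the update step uses the predicted covariance `V_pred`, mirroring the distinction drawn above between the IPG and PIPG iterations.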

III-B General case

In this section, we present the PIPG algorithm for the general nonlinear case. To be specific, we focus on functions $(f_{k})_{1\leq k\leq n}$ taking the form (6). Moreover, we consider a general function $g$ that we assume to be twice differentiable. In this case, the variable-metric IPG update given in (7) does not usually yield analytically tractable computations. Moreover, the Kalman recursions, as we presented in the previous section, do not apply. To see this, first consider the mapping $m_{\mathbf{V}}:{\mathbb{R}}^{d}\to{\mathbb{R}}^{d}$, where

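$$m_{\mathbf{V}}(\overline{\boldsymbol{\theta}})=\overline{\boldsymbol{\theta}}-\gamma\mathbf{V}\nabla g(\overline{\boldsymbol{\theta}}),$$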

for some given $\overline{{\boldsymbol{\theta}}}\in{\mathbb{R}}^{d}$ , ${\mathbf{V}}\in{\mathbb{R}}^{d\times d}$ SPD and $\gamma>0$ . Except when $g$ is quadratic, the above mapping is nonlinear, making it impossible to propagate the uncertainty for the gradient step in (7) as it was done in (9). Moreover, when $(f_{k})_{1\leq k\leq n}$ are chosen as in (6), it may be complicated to realize the proximal step given in (7). To alleviate both problems, we can use the EKF [11, 22]. To this end, we build the model

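$$p(\boldsymbol{\theta}_{0})=\mathcal{N}(\boldsymbol{\theta}_{0};\overline{\boldsymbol{\theta}}_{0},\mathbf{V}_{0}), \tag{19}$$

$$p(\boldsymbol{\theta}_{k}\mid\boldsymbol{\theta}_{k-1})=\mathcal{N}(\boldsymbol{\theta}_{k};m_{\mathbf{V}_{k-1}}(\boldsymbol{\theta}_{k-1}),\mathbf{0}_{d\times d}), \tag{20}$$

$$p(y_{k}\mid\boldsymbol{\theta}_{k})=\mathcal{N}(y_{k};h_{k}(\boldsymbol{\theta}_{k}),\gamma^{-1}). \tag{21}$$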

In order to apply the EKF to the model (19)–(21), which will lead to the PIPG algorithm, we need to linearize the transition model and the observation model. At iteration $k$, given the pair $(\overline{{\boldsymbol{\theta}}}_{k-1},{\mathbf{V}}_{k-1})$, we define the transition matrix,

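$$\mathbf{M}_{k}=\mathbf{I}_{d}-\gamma\mathbf{V}_{k-1}\nabla^{2}g(\overline{\boldsymbol{\theta}}_{k-1})\quad(\forall k\in\{1,\ldots,n\}),$$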

with $\nabla^{2}g$ the Hessian map of $g$ . Finally, the PIPG updates can be computed: first the predicted mean and covariance

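$$\widetilde{\boldsymbol{\theta}}_{k}=m_{\mathbf{V}_{k-1}}(\overline{\boldsymbol{\theta}}_{k-1}), \tag{22}$$

$$\widetilde{\mathbf{V}}_{k}=\mathbf{M}_{k}\mathbf{V}_{k-1}\mathbf{M}_{k}^{\top}+\mathbf{Q}, \tag{23}$$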

respectively (note that, although the dynamical model (20) is deterministic, i.e., the process covariance matrix is zero, we have introduced ${\mathbf{Q}}$ in (23), an SPD matrix that accounts for the linearization error made by the EKF), and then the updated mean and covariance

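$$\overline{\boldsymbol{\theta}}_{k}=\widetilde{\boldsymbol{\theta}}_{k}+\frac{\widetilde{\mathbf{V}}_{k}\mathbf{d}_{k}\,(y_{k}-h_{k}(\widetilde{\boldsymbol{\theta}}_{k}))}{\gamma^{-1}+\mathbf{d}_{k}^{\top}\widetilde{\mathbf{V}}_{k}\mathbf{d}_{k}}, \tag{24}$$

$$\mathbf{V}_{k}=\widetilde{\mathbf{V}}_{k}-\frac{\widetilde{\mathbf{V}}_{k}\mathbf{d}_{k}\mathbf{d}_{k}^{\top}\widetilde{\mathbf{V}}_{k}}{\gamma^{-1}+\mathbf{d}_{k}^{\top}\widetilde{\mathbf{V}}_{k}\mathbf{d}_{k}}, \tag{25}$$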

respectively, where ${{\mathbf{d}}}_{k}=\nabla h_{k}(\widetilde{{\boldsymbol{\theta}}}_{k})$ . The algorithm is iterated for $k=1,\ldots,n$ and referred to as the PIPG method.
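The general iteration can be sketched as follows, under the assumption that the recursions take the usual EKF form: a gradient step on $g$ for the mean, a covariance prediction through the linearized transition matrix plus ${\mathbf{Q}}$, and a linearized measurement update. The function name `pipg_step` and its callback arguments are hypothetical. The example uses a sigmoid observation function and, purely as a placeholder, a quadratic $g$; all numerical values are illustrative.

```python
import numpy as np

def pipg_step(theta, V, y, h, grad_h, grad_g, hess_g, gamma, Q):
    """One PIPG iteration: linearized gradient (prediction) step, then EKF update."""
    d = theta.size
    M = np.eye(d) - gamma * V @ hess_g(theta)        # linearized transition matrix
    theta_pred = theta - gamma * V @ grad_g(theta)   # predicted mean, m_V(theta)
    V_pred = M @ V @ M.T + Q                         # predicted covariance
    d_k = grad_h(theta_pred)                         # observation gradient at the prediction
    Vd = V_pred @ d_k
    s = 1.0 / gamma + d_k @ Vd                       # innovation variance (scalar)
    theta_new = theta_pred + Vd * (y - h(theta_pred)) / s
    V_new = V_pred - np.outer(Vd, Vd) / s
    return theta_new, V_new

# One step on toy values (assumptions for this example only).
rng = np.random.default_rng(3)
d, gamma, lam = 3, 0.5, 0.1
theta, V = rng.standard_normal(d), np.eye(d)
x, y = rng.standard_normal(d), 0.3

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

theta1, V1 = pipg_step(
    theta, V, y,
    h=lambda t: sigmoid(x @ t),
    grad_h=lambda t: sigmoid(x @ t) * (1.0 - sigmoid(x @ t)) * x,
    grad_g=lambda t: lam * t,                 # placeholder quadratic regularizer
    hess_g=lambda t: lam * np.eye(d),
    gamma=gamma, Q=1e-4 * np.eye(d),
)
```

Passing $g$ and $h_{k}$ as callbacks keeps the step generic: a quadratic $g$ or a linear $h_{k}$ recovers the simplifications mentioned in Remark 1 below.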

Remark 1.

Note that Eqs. (22)–(25) are the most general recursions for our method. Like in the linear case presented in Section III-A, sometimes we can simplify the computations. For instance, if $m_{{\mathbf{V}}_{k-1}}(\cdot)$ yields a linear mapping for $g$ while $f$ is a nonlinear least squares loss as in (6), then Eqs. (22)–(23) simplify into (15)–(16). Similarly, when $m_{{\mathbf{V}}_{k-1}}(\cdot)$ is nonlinear, and $f$ is quadratic as in (5), then Eqs. (24)–(25) simplify into (17)–(18).

Remark 2.

Although the choice of metrics has been studied in the batch case [20, 23], no practical ways for choosing them are available in the incremental setting to the best of our knowledge. The PIPG scheme provides a natural recipe on how to update the metric matrices $({\mathbf{V}}_{k})_{1\leq k\leq n}$ in the form of a sequence of posterior covariance matrices.

Remark 3.

As mentioned earlier, in the linear and tractable case the PIPG updates given by (15)–(18), are guaranteed to provide, after $k=n$ iterations (i.e., after a single pass of the data), the exact mean and covariance parameters of the Gaussian posterior associated to the state-space model (11)-(13). However, the convergence analysis for the general recursions (22)–(25) (with inexact Kalman updates) would need further investigation that we leave for future work.

IV Numerical results

In this section, we present two experiments in order to illustrate the performance of PIPG in the context described in Sections III-A and III-B.

IV-A Ridge regression

We first consider the linear-quadratic case described in Section III-A. We set ${\mathbf{A}}=\sqrt{\lambda}\mathbf{I}_{d}$. The sought signal ${\boldsymbol{\theta}}^{\star}\in\mathbb{R}^{d}$ is generated as a realization of a multivariate Gaussian variable with $d=100$. We then simulate $n=100,000$ noisy observations $y_{k}={\mathbf{x}}_{k}^{\top}{\boldsymbol{\theta}}^{\star}+\eta_{k}$ with $\eta_{k}\sim\mathcal{N}(0,1)$ for $k=1,\ldots,n$. The PIPG recursions (15)–(18) are run for $n$ iterations and for $40$ step-size values $\gamma$ within the range $[0.005,0.2]$. We also compare the results to the IPG obtained using (9)–(10) with $\mathbf{V}_{k}=\mathbf{I}_{d}$ for $k=1,\ldots,n$. IPG was run with a decaying step-size of the form $\gamma/k^{0.51}$, for the same range of step-size values as PIPG. Note that running the IPG with a constant step-size causes the algorithm to diverge; therefore, we do not show those results. For both methods, we access the data $(y_{k},{\mathbf{x}}_{k})_{1\leq k\leq n}$ in a random order, hence the time dependency of $({\mathbf{x}}_{k})_{1\leq k\leq n}$ does not affect our results. We compute the relative mean squared error (RMSE) between the current estimate ${\boldsymbol{\theta}}_{k}$ and the true filter coefficient vector ${\boldsymbol{\theta}}^{\star}$ as $\text{E}_{k}=\|{\boldsymbol{\theta}}_{k}-{\boldsymbol{\theta}}^{\star}\|/\|{\boldsymbol{\theta}}^{\star}\|$. The regularization parameter is set to $\lambda=10^{-2}$ so as to minimize the final RMSE.
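For reference, the IPG baseline described above can be sketched as follows. With $\mathbf{V}_{k}=\mathbf{I}_{d}$ and ${\mathbf{A}}=\sqrt{\lambda}\mathbf{I}_{d}$, the gradient step on $g$ is a scalar shrinkage and the proximal step on $f_{k}$ has a closed form. The problem size is scaled down, and the Gaussian inputs, the zero initialization, and the names `ipg_ridge` and `relative_error` are assumptions of this sketch rather than details taken from the experiment.

```python
import numpy as np

def ipg_ridge(X, y, lam, gamma0, decay=0.51):
    """IPG with identity metric and step-size gamma0 / k**decay for ridge regression."""
    n, d = X.shape
    theta = np.zeros(d)
    for k in range(1, n + 1):
        gamma_k = gamma0 / k**decay
        x_k, y_k = X[k - 1], y[k - 1]
        theta_tilde = (1.0 - gamma_k * lam) * theta      # gradient step on g
        resid = y_k - x_k @ theta_tilde
        # Closed-form proximal step on f_k in the identity metric.
        theta = theta_tilde + gamma_k * x_k * resid / (1.0 + gamma_k * (x_k @ x_k))
    return theta

def relative_error(theta, theta_star):
    """E = ||theta - theta_star|| / ||theta_star||."""
    return np.linalg.norm(theta - theta_star) / np.linalg.norm(theta_star)

# Scaled-down synthetic experiment (illustrative values, not those of the paper).
rng = np.random.default_rng(0)
d, n, lam = 5, 2000, 1e-2
theta_star = np.linspace(-1.0, 1.0, d)
X = rng.standard_normal((n, d))
y = X @ theta_star + rng.standard_normal(n)

perm = rng.permutation(n)                    # access the data in random order
theta_ipg = ipg_ridge(X[perm], y[perm], lam, gamma0=0.1)
err = relative_error(theta_ipg, theta_star)
```

Sweeping `gamma0` over a grid and recording `err` would give a small-scale analogue of the step-size sensitivity study.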

The results are displayed in Fig. 1(a). It can be seen that PIPG shows a stable performance with respect to the step-size value. In contrast, IPG appears to be very sensitive to both the step-size tuning and the decay rate (which is not shown here). Moreover, for a wide range of step-size values, PIPG requires fewer iterations than IPG to achieve minimal RMSE. Finally, PIPG provides an estimate of the covariance as an additional output, which can be particularly useful in practical applications that require uncertainty quantification of the solution (e.g., biomedical data processing, financial data analytics).

IV-B Sparse nonlinear regression

Let us now apply the proposed method to the more challenging problem of sparse system identification [24, 25] under a nonlinear observation model. Given a real-valued discrete-time input signal $(x_{k})_{k\in\mathbb{Z}}$, the output of the system at time $k\in\left\{1,\ldots,n\right\}$ is defined as $y_{k}=h({\mathbf{x}}_{k}^{\top}{\boldsymbol{\theta}})+w_{k}$, where ${\mathbf{x}}_{k}=[x_{k-d+1},\ldots,x_{k}]^{\top}\in{\mathbb{R}}^{d}$ (assuming circulant boundaries), $w_{k}\sim{\mathcal{N}}(0,\gamma^{-1})$ are i.i.d. measurement noise samples, and ${\boldsymbol{\theta}}\in\mathbb{R}^{d}$ represents the unknown filter taps. A sigmoid nonlinearity $h(u)={1}/(1+\exp(-u))$ for $u\in{\mathbb{R}}$ is introduced in the system response, modeling for instance some saturation of the sensor. We set the input signal $(x_{k})_{k\in\mathbb{Z}}$ as in [26], $x_{k}=ax_{k-1}+\eta_{k}$, with $a=0.8$, $\eta_{k}\sim{\mathcal{N}}(0,1)$ and $x_{0}\sim{\mathcal{N}}(0,1)$. We run the PIPG recursions (22)–(25) from Section III-B, where we set $h_{k}({\boldsymbol{\theta}})=h({\mathbf{x}}_{k}^{\top}{\boldsymbol{\theta}})$ for every ${\boldsymbol{\theta}}\in{\mathbb{R}}^{d}$, and the regularization function $g$ is chosen as the smoothed $\ell_{2}-\ell_{1}$ regularization function [27], i.e., $g({\boldsymbol{\theta}})=\lambda\left(\sum_{i=1}^{d}\left(1+\theta_{i}^{2}/\delta^{2}\right)^{1/2}-1\right)$ with $\lambda>0$ and $\delta>0$ the smoothing parameter. Such a regularizer promotes sparsity, since the $\ell_{1}$ norm is obtained as $\delta\to 0$. The measurement noise variance is $\gamma^{-1}=1$. Note that the parameter $\gamma$ is also the step-size in the proposed method, as we will discuss below. The filter length is $d=50$ and the output of the system is observed at every time $k\in\{1,\ldots,n\}$ with $n=300,000$. Regularization parameters are set manually to $(\lambda,\delta)=(10^{-5},0.1)$ so as to reach the best performance in terms of RMSE.
We initialize the PIPG algorithm with a prior distribution with large uncertainty, namely ${\mathbf{V}}_{0}=v_{0}\mathbf{I}_{d}$, where $v_{0}=100$. The process noise covariance matrix, which models the linearization errors in our method, is chosen as ${\mathbf{Q}}=q\mathbf{I}_{d}$ with $q=10^{-4}$. We set $\gamma=1$, in accordance with the noise model. Note that, in general, $\gamma$ is an unknown parameter that is to be set by the user depending on the approximate noise level. For comparison, we implement a stochastic gradient descent (SGD) with learning rate $\gamma^{\textnormal{sgd}}_{k}=\frac{\alpha_{0}}{1+\alpha_{1}k}$ for $k\in\{1,\ldots,n\}$, where $\alpha_{0}=1$ and $\alpha_{1}=10^{-4}$ are chosen to reach an optimal decrease. Note that, for this model, it is not possible to implement the IPG since $(f_{k})_{1\leq k\leq n}$ are not easily proximable.
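The general recursions require the gradient and Hessian of the regularizer. A possible sketch for the smoothed $\ell_{2}-\ell_{1}$ function above is given here, with derivatives worked out by hand and checked against central finite differences. The additive constant in $g$ is implemented as written above and has no effect on the derivatives. The function names and the test point are assumptions for illustration.

```python
import numpy as np

def g(theta, lam, delta):
    """Smoothed l2-l1 regularizer, as written in the text."""
    return lam * (np.sum(np.sqrt(1.0 + theta**2 / delta**2)) - 1.0)

def grad_g(theta, lam, delta):
    """Gradient: lam * theta_i / (delta^2 * sqrt(1 + theta_i^2 / delta^2))."""
    r = np.sqrt(1.0 + theta**2 / delta**2)
    return lam * theta / (delta**2 * r)

def hess_g(theta, lam, delta):
    """Hessian (diagonal): lam / (delta^2 * (1 + theta_i^2 / delta^2)^(3/2))."""
    r = np.sqrt(1.0 + theta**2 / delta**2)
    return np.diag(lam / (delta**2 * r**3))

lam, delta = 1e-5, 0.1                      # regularization values quoted in the text
theta = np.array([-1.0, -0.05, 0.0, 0.2, 1.5])   # arbitrary test point (assumption)

# Central finite differences along each coordinate direction.
eps = 1e-6
E = np.eye(theta.size)
fd_grad = np.array([(g(theta + eps * e, lam, delta) - g(theta - eps * e, lam, delta)) / (2 * eps)
                    for e in E])
fd_hess = np.array([(grad_g(theta + eps * e, lam, delta) - grad_g(theta - eps * e, lam, delta)) / (2 * eps)
                    for e in E])
```

Because the Hessian is diagonal, the transition matrix $\mathbf{I}_{d}-\gamma{\mathbf{V}}_{k-1}\nabla^{2}g$ only requires a column scaling of ${\mathbf{V}}_{k-1}$, although the covariance prediction itself still involves dense $d\times d$ products.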

Fig. 1(b), which displays the RMSE evolution for both algorithms, shows that the PIPG method stabilizes in fewer iterations than SGD, which is a significant practical advantage when access to the dataset is limited. From Fig. 1(c)–(d), it can be seen that the PIPG method in Fig. 1(c) provides a better estimate together with the uncertainty bars $(2\sigma_{i})_{1\leq i\leq d}$. A key feature of PIPG is that it provides estimates of the covariance matrix, which quantify the uncertainty on the parameters. The behavior of the entries of $({\mathbf{V}}_{k})_{1\leq k\leq n}$ can be seen in Fig. 2, along with some comments. Note that computing this $d\times d$ matrix increases the computational complexity, as PIPG scales as $\mathcal{O}(d^{2})$ while SGD scales as $\mathcal{O}(d)$.

V Conclusions

We have proposed a probabilistic incremental optimization method which quantifies and propagates the uncertainty over its estimates. In the case of regularized nonlinear least squares, we have reinterpreted the classical IPG method as an approximate inference method in a state-space model. The extension of IPG to the probabilistic setting enables us to quantify the uncertainties inherent in the numerical problem or caused by modeling errors. Our probabilistic interpretation also allows the use of accelerated variable-metric updates, whose metric matrices are derived in an automatic and well-defined way. Future investigations will be devoted to the analysis of the convergence of the PIPG iterates, and to the reduction of its complexity by means of suitable approximations.

Bibliography

[1] H. Robbins and S. Monro, “A stochastic approximation method,” Annals of Mathematical Statistics, vol. 22, pp. 400–407, 1951.
[2] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” arXiv:1606.04838, 2016.
[3] M. Pereyra, P. Schniter, E. Chouzenoux, J.-C. Pesquet, J.-Y. Tourneret, A. Hero, and S. McLaughlin, “A survey of stochastic simulation and optimization methods in signal processing,” IEEE J. Sel. Top. Signal Process., vol. 10, no. 2, pp. 224–241, Mar. 2016.
[4] P. L. Combettes and J.-C. Pesquet, “Proximal splitting methods in signal processing,” in Fixed-point algorithms for inverse problems in science and engineering. Springer, 2011, pp. 185–212.
[5] N. Parikh, S. Boyd et al., “Proximal algorithms,” Foundations and Trends® in Optimization, vol. 1, no. 3, pp. 127–239, 2014.
[6] D. P. Bertsekas, “Incremental gradient, subgradient, and proximal methods for convex optimization: A survey,” Optimization for Machine Learning, vol. 2010, pp. 1–38, 2011.
[7] D. P. Bertsekas, “Incremental proximal methods for large scale convex optimization,” Mathematical Programming, vol. 129, no. 2, pp. 163–195, 2011.
[8] L. Rosasco, S. Villa, and B. C. Vũ, “Convergence of stochastic proximal gradient algorithm,” arXiv preprint arXiv:1403.5074, 2014.
