Perturbed Proximal Descent to Escape Saddle Points for Non-convex and Non-smooth Objective Functions

Zhishen Huang; Stephen Becker

arXiv:1901.08958·cs.LG·August 13, 2019

TL;DR

This paper introduces a novel algorithm for non-convex, non-smooth optimization that effectively escapes saddle points, extending previous results from smooth to non-smooth settings.

Contribution

It provides the first known theoretical results for escaping saddle points in non-smooth optimization using a perturbed proximal descent method.

Findings

01

First theoretical guarantees for non-smooth saddle point escape

02

Algorithm successfully finds local minima in non-smooth problems

03

Extends saddle point analysis to non-smooth optimization

Abstract

We consider the problem of finding local minimizers in non-convex and non-smooth optimization. Under the assumption of strict saddle points, positive results have been derived for first-order methods. We present the first known results for the non-smooth case, which requires different analysis and a different algorithm.

Equations

$\operatorname*{minimize}_{\mathbf{x}\in\mathbb{R}^{d}}\ \Phi(\mathbf{x}):=f(\mathbf{x})+g(\mathbf{x})$

$\mathbf{x}_{t+1}=\mathrm{prox}_{\eta g}\circ(I-\eta\nabla f)(\mathbf{x}_{t})$

$\|G(\mathbf{x})\|\leq\varepsilon\ \textrm{ and }\ \lambda\big(\nabla^{2}f(\mathbf{x})\big)_{\min}\geq-\sqrt{\rho\varepsilon}$

$O\Big(\frac{L(\Phi(\mathbf{x}_{0})-\Phi^{\star})}{\varepsilon^{2}}\ln^{4}\big(\frac{dL\Delta_{\Phi}}{\varepsilon^{2}\delta}\big)\Big)$

$\tilde{f}_{\mathbf{x}}(\mathbf{y}):=f(\mathbf{x})+\nabla f(\mathbf{x})^{T}(\mathbf{y}-\mathbf{x})+\frac{1}{2}(\mathbf{y}-\mathbf{z})^{T}\mathcal{H}(\mathbf{y}-\mathbf{z})$

$\mathscr{F}:=\eta L\frac{\gamma^{3}}{\rho^{2}}\cdot\ln^{-3}\big(\frac{d\kappa}{\delta}\big),\qquad\mathscr{S}:=\sqrt{\eta L}\,\frac{\gamma}{\rho}\cdot\ln^{-1}\big(\frac{d\kappa}{\delta}\big)$

$T=\min\Big\{\inf_{t}\big\{t\mid\tilde{f}_{\mathbf{u}_{0}}(\mathbf{u}_{t})-f(\mathbf{u}_{0})+g(\mathbf{u}_{t})-g(\mathbf{u}_{0})\leq-3\mathscr{F}\big\},\ \hat{c}\mathscr{T}\Big\}$

$\tilde{\mathbf{u}}_{t+1}$ , $\mathbf{u}_{t+1}$

$\|\mathbf{u}_{t}-\mathbf{u}_{0}\|$

$\mathbf{v}_{t}=\mathbf{w}_{t}-\mathbf{u}_{t}$

$$\begin{aligned}\|\tilde{\mathbf{v}}_{t+1}\|&=\|(I-\eta\nabla f)\circ\mathrm{prox}_{\eta g}(\tilde{\mathbf{w}}_{k})-(I-\eta\nabla f)\circ\mathrm{prox}_{\eta g}(\tilde{\mathbf{u}}_{k})\|\\&=\|\mathbf{w}_{k}-\mathbf{u}_{k}-\eta(\nabla f(\mathbf{w}_{k})-\nabla f(\mathbf{u}_{k}))\|\\&\geq\|\mathbf{w}_{k}-\mathbf{u}_{k}\|-\eta L\|\mathbf{w}_{k}-\mathbf{u}_{k}\|=(1-\eta L)\|\mathbf{w}_{k}-\mathbf{u}_{k}\|\\&\geq(1-\eta L)\big(\|\tilde{\mathbf{w}}_{k}-\tilde{\mathbf{u}}_{k}\|-2\eta\lambda\sqrt{d}\big)=(1-\eta L)\big(\|\tilde{\mathbf{v}}_{k}\|-2\eta\lambda\sqrt{d}\big)\\&\geq(1-\eta L)^{t}\|\tilde{\mathbf{v}}_{1}\|-2\eta\lambda\sqrt{d}\sum_{i=1}^{t}(1-\eta L)^{i}\\&=(1-\eta L)^{t}\|\tilde{\mathbf{v}}_{1}\|-2\lambda\sqrt{d}\,\frac{(1-\eta L)\big(1-(1-\eta L)^{t}\big)}{L}\end{aligned}$$

$\|\tilde{\mathbf{v}}_{t+1}\|\geq(1-\eta L)^{t}\mu r(1+\eta\gamma\theta)-2\lambda\sqrt{d}\,\frac{(1-\eta L)\big(1-(1-\eta L)^{t}\big)}{L}$

$\|\mathbf{v}_{t+1}\|\geq\|\tilde{\mathbf{v}}_{t+1}\|-2\eta\lambda\sqrt{d}\geq(1-\eta L)^{t}\mu r(1+\eta\gamma\theta)-2\lambda\sqrt{d}\,\frac{(1-\eta L)\big(1-(1-\eta L)^{t}\big)+\eta L}{L}$

$\lambda<\frac{(1-\eta L)^{\hat{c}\mathscr{T}}\mu\frac{1}{\kappa(\ln\frac{d\kappa}{\delta})^{2}}\sqrt{\eta}L^{\frac{3}{2}}\frac{\gamma}{\rho}(1+\eta\gamma\theta)}{2\sqrt{d}\big[(1-\eta L)\big(1-(1-\eta L)^{\hat{c}\mathscr{T}}\big)+\eta L\big]}$

$\|\mathcal{P}_{\mathbb{E}^{\perp}}\mathrm{prox}_{\eta g}(\mathbf{x})\|\leq K\|\mathcal{P}_{\mathbb{E}}\mathrm{prox}_{\eta g}(\mathbf{x})\|$

$\big(\hat{\mathbf{v}}_{\mathrm{move}}(\mathbf{x})\big)_{i}=\begin{cases}-\eta\lambda\cdot\mathrm{sgn}(x_{i})&\text{if }|x_{i}|\geq\eta\lambda\\-x_{i}&\text{if }|x_{i}|<\eta\lambda\end{cases}\quad\text{i.e.}\quad\hat{\mathbf{v}}_{\mathrm{move}}(\mathbf{x})=\min\{|\mathbf{x}|,\eta\lambda\mathbbm{1}\}\otimes\mathrm{sgn}(-\mathbf{x})$

$\lambda<\bigg\|\frac{\mathrm{Proj}_{\bm{n}}\mathbf{x}}{\hat{\mathbf{v}}_{\mathrm{move}}\cdot\hat{\bm{n}}}\bigg\|=\bigg\|\frac{\mathbf{x}\cdot\hat{\bm{n}}}{\hat{\mathbf{v}}_{\mathrm{move}}\cdot\hat{\bm{n}}}\bigg\|\leq\frac{\|\mathbf{x}\|}{\|\hat{\mathbf{v}}_{\mathrm{move}}\cdot\hat{\bm{n}}\|}$

$\lambda<\frac{C}{\|\hat{\mathbf{v}}_{\mathrm{move}}\cdot\hat{\bm{n}}\|}$

$T=\min\Big\{\inf_{t}\big\{t\mid\tilde{f}_{\mathbf{w}_{0}}(\mathbf{w}_{t})+g(\mathbf{w}_{t})-f(\mathbf{w}_{0})-g(\mathbf{w}_{0})\leq-3\mathscr{F}\big\},\ \hat{c}\mathscr{T}\Big\}$

$\tilde{\mathbf{v}}_{t+1}=(I-\eta\mathcal{H}-\eta\Delta^{\prime}_{t})\mathbf{v}_{t}$

$\|\mathbf{v}_{t}\|\leq 200(\mathscr{S}\cdot\hat{c})$

$\tilde{\psi}_{t+1}$ , $\tilde{\varphi}_{t+1}$

for all $t<T$ , $\varphi_{t}\leq 4\zeta t\cdot\psi_{t}$
Full text

Zhishen Huang, Stephen Becker (ORCID 0000-0002-1932-8159)

Dept. of Applied Math., University of Colorado, Boulder, USA. Email: {zhishen.huang,stephen.becker}@colorado.edu

This is the extended version of the paper that contains the proofs.

Keywords: Saddle points · Proximal gradient descent · Non-smooth optimization

1 Introduction

We consider the problem of finding approximate local minimizers of the problem

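$$\operatorname*{minimize}_{\mathbf{x}\in\mathbb{R}^{d}}\ \Phi(\mathbf{x}):=f(\mathbf{x})+g(\mathbf{x})\tag{1}$$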

where $f(\mathbf{x})$ is not convex but smooth (and with full domain), and $g(\mathbf{x})$ is convex but not smooth. Many optimization problems in engineering, signal processing and machine learning can be cast in this framework, where $f$ is a smooth loss function, and $g$ is a non-smooth regularizer such as a norm. For example, our model captures regularized neural networks [11], where the regularization can induce sparsity as an alternative to dropout. In this paper, for simplicity we restrict our discussion to $g(\mathbf{x})=\lambda\|\mathbf{x}\|_{1}$ , where $\lambda\geq 0$ is a constant, but many of the results apply to more general choices of $g$ . The first-order condition is $0\in\nabla f(\mathbf{x})+\partial g(\mathbf{x})$ , and any $\mathbf{x}$ satisfying this condition is called a “stationary point” (see [2] for background on the subdifferential $\partial g$ ). All local minimizers are stationary points, but not vice-versa. We define a “saddle point” to be any stationary point where the Hessian is indefinite (and therefore not a local minimizer). This paper extends a recent line of work [13] to analyze when we can expect to find a local minimizer. It has been argued that in many machine learning problems, finding any local minimizer is often enough for good performance, but finding a saddle point is not useful [9].

The fact that $g$ is non-smooth is crucially important, and it does more than just complicate the analysis, as it also requires a new algorithm. In the smooth case, $f$ is often minimized using gradient descent or an accelerated variant [16] with a fixed stepsize. Naïvely extending gradient descent to apply to (1) leads to subgradient descent with fixed-stepsize. Unfortunately, this method fails to converge as the example $d=1,\lambda=1$ and $f=0$ shows [18] since for a generic choice of the initial point, the sequence is not Cauchy.
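The failure mode in the example $d=1,\lambda=1,f=0$ can be reproduced in a few lines. The following is a minimal sketch, not code from the paper: the starting point `0.35` and stepsize `0.1` are arbitrary illustrative choices, and `sign(0) = 0` is used as the subgradient at the origin.

```python
def sign(x: float) -> float:
    """Sign of x as a float; sign(0) = 0 is a valid subgradient of |x| at 0."""
    return float((x > 0) - (x < 0))


def subgradient_step(x: float, eta: float, lam: float = 1.0) -> float:
    """One fixed-stepsize subgradient step on lam*|x| (here f = 0)."""
    return x - eta * lam * sign(x)


def prox_step(x: float, eta: float, lam: float = 1.0) -> float:
    """One proximal step on lam*|x|, i.e. soft thresholding by eta*lam."""
    return sign(x) * max(abs(x) - eta * lam, 0.0)


def run(step, x0: float, eta: float, n: int) -> list:
    """Apply `step` n times starting from x0 and return all iterates."""
    xs = [x0]
    for _ in range(n):
        xs.append(step(xs[-1], eta))
    return xs


sub = run(subgradient_step, 0.35, 0.1, 50)   # ends up hopping across 0
prox = run(prox_step, 0.35, 0.1, 50)         # reaches 0 and stays there
print(sub[-2:], prox[-1])
```

The subgradient iterates keep jumping by a full stepsize around the minimizer, so consecutive iterates never get close to each other, while the proximal iterates land on the minimizer after finitely many steps.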

Instead of gradient descent, we use a perturbed version of proximal gradient descent. For a real-valued convex lower semi-continuous function $g$ , define the “proximity” operator (or “prox” for short) as the map $\mathrm{prox}_{g}(\mathbf{y})=\mathop{\rm argmin}_{\mathbf{x}}g(\mathbf{x})+\frac{1}{2}\|\mathbf{x}-\mathbf{y}\|^{2}$ (throughout the paper, for vectors we use $\|\cdot\|$ to denote the Euclidean norm). Equivalently, $\mathrm{prox}_{g}=(I+\partial g)^{-1}$ , and thus the first-order condition is equivalent to $\mathbf{x}=\mathrm{prox}_{\eta g}[\mathbf{x}-\eta\nabla f(\mathbf{x})]$ for any $\eta>0$ . Proximal gradient descent is the iteration $\mathbf{x}_{t+1}=\mathrm{prox}_{\eta g}[\mathbf{x}_{t}-\eta\nabla f(\mathbf{x}_{t})]$ , so it immediately follows that if the sequence converges, it converges to a stationary point. Convergence of the sequence is known to follow from mild assumptions on $f$ and $g$ , the stepsize $\eta$ , and boundedness of the sequence $\{\mathbf{x}_{t}\}$ [1].
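For $g=\lambda\|\cdot\|_{1}$ the prox is soft thresholding, so the iteration and the fixed-point form of the first-order condition can be checked numerically. This is a minimal sketch under an assumption made purely for illustration: the smooth part is the convex quadratic $f(\mathbf{x})=\frac{1}{2}\|\mathbf{x}-\mathbf{b}\|^{2}$ (not a function from the paper), chosen because its minimizer with the $\ell_{1}$ term is known in closed form.

```python
import numpy as np


def soft_threshold(y, tau):
    """prox of tau*||.||_1: shrink every entry toward zero by tau."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)


def prox_grad(grad_f, x0, eta, lam, n_iter=500):
    """Iterate x <- prox_{eta g}[x - eta * grad_f(x)] with g = lam*||.||_1."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = soft_threshold(x - eta * grad_f(x), eta * lam)
    return x


# Illustrative smooth part (an assumption): f(x) = 0.5*||x - b||^2.
b = np.array([2.0, 0.3, -1.5])
grad_f = lambda x: x - b
lam, eta = 0.5, 0.5

x_star = prox_grad(grad_f, np.zeros(3), eta, lam)
# First-order condition in fixed-point form: x = prox_{eta g}[x - eta grad f(x)].
residual = x_star - soft_threshold(x_star - eta * grad_f(x_star), eta * lam)
print(x_star, residual)
```

For this particular $f$ the limit coincides with `soft_threshold(b, lam)`, which gives an independent check of the fixed point.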

We define a second-order stationary point to be a first-order stationary point $\mathbf{x}$ that additionally satisfies $\nabla^{2}f(\mathbf{x})\succ 0$ , which is a sufficient condition for $\mathbf{x}$ to be a local minimizer. Our main contribution is showing that under suitable assumptions, a perturbed version of proximal gradient descent will generate a sequence that converges to an approximate second-order stationary point. We make assumptions on the second-order behavior of $f$ , similar to assumptions under which it is known that gradient descent will always converge to a second-order stationary point except for adversarially chosen starting points [14] — in contrast to Newton’s method, which is attracted to all stationary points. However, even in the smooth case when the sequence converges, gradient descent converges arbitrarily slowly [10] in the presence of a saddle point, so perturbation is necessary. In the non-smooth case, perturbation is even more important due to the proximal nature of the algorithm.

A toy example: Gaussian Bump

Consider the function $\Phi:\mathbb{R}^{2}\rightarrow\mathbb{R},\ \mathbf{x}=(x,y)\mapsto\frac{1}{2}(x^{2}-y^{2})\mathrm{e}^{-\frac{x^{2}+y^{2}}{5}}+\frac{1}{100}h_{100}(\mathbf{x})$ where $h_{100}(\mathbf{x})$ is the Huber function with parameter 100 [3]. The choice of this combination of Huber parameter and the magnitude of the Huber function ensures that the origin is a saddle point. The Huber function approximates the $\ell_{1}$ norm. The plot is shown in Fig. 2.

This function has two local minima and a saddle point at $(0,0)$ . Because the Huber function is smooth and has a known proximity operator, we can treat it as either part of the smooth $f$ component or the non-smooth $g$ component, and therefore run either gradient descent or proximal gradient descent. We experiment with both algorithms, randomly picking initial points at $\mathbf{x}_{0}=(0.3,0.01)+\bm{\xi}$ where $\bm{\xi}$ is sampled uniformly from $\mathbb{B}_{0}(\frac{1}{10}\|\mathbf{x}_{0}\|)$ , and varying the stepsize $\eta$ , with the maximum number of iterations fixed at 1000. Figure 2 shows the empirical success rate of finding a local minimizer (as opposed to converging to the saddle point at $(0,0)$ ).

We observe that the range of stable stepsizes is wider for the proximal descent algorithm than for gradient descent, and that the success rate of proximal descent is as high as that of gradient descent. This example motivates adopting proximal descent over gradient descent in real applications, for better stability and equivalent, if not better, accuracy.

A coincidence

In this toy example, the saddle point at $(0,0)$ happens to be a fixed point of the proximal operator of $\eta\lambda\|\mathbf{x}\|_{1}$ . Soft thresholding, as the proximal operator of $\lambda\|\mathbf{x}\|_{1}$ is known [7], has an attracting region that sets nearby points to zero. The radius of the attracting region (per dimension) is $\eta\lambda$ , thus if $\|\mathbf{x}_{t_{0}}-\eta\nabla f(\mathbf{x}_{t_{0}})\|_{\infty}\leq\eta\lambda$ for some iteration $t_{0}$ , then $\mathbf{x}_{t}=0$ for all $t>t_{0}$ . Proximal gradient descent performs even better when the saddle point is not in the attracting region.
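The attracting region can be observed directly. The sketch below takes the smooth part of the toy example, $f(x,y)=\frac{1}{2}(x^{2}-y^{2})\mathrm{e}^{-(x^{2}+y^{2})/5}$ , and pairs it with an $\ell_{1}$ term instead of the Huber term. The values $\eta=0.5$ , $\lambda=0.1$ and the starting point are illustrative assumptions, not settings from the experiments.

```python
import numpy as np


def soft_threshold(y, tau):
    """prox of tau*||.||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)


def grad_f(p):
    """Gradient of f(x, y) = 0.5*(x^2 - y^2)*exp(-(x^2 + y^2)/5)."""
    x, y = p
    e = np.exp(-(x * x + y * y) / 5.0)
    q = 0.5 * (x * x - y * y)
    return np.array([e * (x - 0.4 * x * q), e * (-y - 0.4 * y * q)])


eta, lam = 0.5, 0.1                 # attracting radius eta*lam = 0.05 per coordinate
x = np.array([0.02, 0.03])          # a point close to the saddle at the origin

# Condition from the text: ||x - eta*grad f(x)||_inf <= eta*lam.
inside = np.max(np.abs(x - eta * grad_f(x))) <= eta * lam

iterates = [x]
for _ in range(5):
    x = soft_threshold(x - eta * grad_f(x), eta * lam)
    iterates.append(x)
print(inside, iterates[1:])
```

Once the condition holds, the next iterate is exactly the origin, and because $\nabla f(0)=0$ every later iterate stays there. Unperturbed proximal descent cannot leave this saddle on its own.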

Structure of the paper

Section 2 states the algorithm, followed by section 3 where the theoretical guarantee is presented with proof. Section 4 shows numerical experiments.

1.1 Related literature

Second order methods for smooth objectives

Some recent second order methods, mainly based on either cubic-regularized Newton methods as in [17] or based on trust-region methods (as in Curtis et al. [8]), have been shown to converge to $\varepsilon$ -approximate local minimizers of smooth non-convex objective functions in $\mathcal{O}(\varepsilon^{-1.5})$ iterations. See [6, 13, 21] for a more thorough review of these methods. We do not consider these methods further due to the high cost of solving for the Newton step in large dimensions.

First order methods for smooth objectives

We focus on first order methods because each step is cheaper and these methods are more frequently adopted by the deep learning community. Xu et al. in [20] and Allen-Zhu et al. in [21] develop Negative-Curvature (NC) search algorithms, which find descent directions corresponding to negative eigenvalues of the Hessian matrix. The NC search routines avoid using either Hessian or Hessian-vector information directly, and they can be applied in both online and deterministic scenarios. In the online setting, combining an NC search routine with first-order stochastic methods gives the algorithms NEON- $\mathcal{A}$ [20] and NEON2+SGD [21] with iteration cost $\mathcal{O}(\frac{d}{\varepsilon^{3.5}})$ and $\mathcal{O}(\varepsilon^{-3.5})$ respectively (the latter still depends on dimension, whose induced complexity is at least $\ln^{2}(d)$ ), and these methods generate a sequence that converges to an approximate local minimum with high probability. In the offline setting, Jin et al. in [13] provide a stochastic first order method that finds an approximate local minimizer with high probability at computational cost $\mathcal{O}(\frac{\ln^{4}(d)}{\varepsilon^{2}})$ . Combining NEON2 with gradient descent or SVRG, the cost to find an approximate local minimum is $\mathcal{O}(\varepsilon^{-2})$ , whose dependence on dimension is not specified but is at least $\ln^{2}(d)$ . These methods make Lipschitz continuity assumptions about the gradient and Hessian, so they do not apply to non-smooth optimization.

A recent preprint [15] approaches the problem of finding local minima using the forward-backward envelope technique developed in [19], where the assumption about the smoothness of objective function is weakened to local smoothness instead of global smoothness.

Non-smooth objectives

In the offline setting, Boţ et al. propose a proximal algorithm for minimizing non-convex and non-smooth objective functions in [5]. They show convergence to KKT points instead of approximate second-order stationary points. Other work [1, 4] relies on the Kurdyka-Łojasiewicz inequality and shows convergence to stationary points in the sense of the limiting subdifferential, which is not the same as a local minimizer or approximate second-order stationary point. In the online setting, Reddi et al. demonstrated in [12] that proximal descent with a variance reduction technique (proxSVRG) has linear convergence to a first-order stationary point, but not to a local minimizer.

2 Algorithm

The algorithm takes as input a starting vector $\mathbf{x}_{0}$ , the gradient Lipschitz constant $L$ , the Hessian Lipschitz constant $\rho$ , the second-order stationary point tolerance $\varepsilon$ , a positive constant $c$ , a failure probability $\delta$ , and an estimated function value gap $\Delta_{\Phi}$ . The key parameter for Algorithm 1 is the constant $c$ . It should be large enough that the perturbation is significant enough to escape saddle points, yet not so large that the stepsize becomes unreasonable and the iterates diverge. The output of the algorithm is an $\varepsilon$ -second-order stationary point (see Def. 3).
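Algorithm 1 itself is not reproduced in this text, so the following is only a rough sketch of what a perturbed proximal descent loop can look like. It assumes the perturbation pattern of Jin et al. [13] (perturb when the iterate looks stationary, then accept the pre-perturbation point if the objective has not dropped enough) with the gradient step replaced by a proximal step. The arguments `g_thres`, `f_thres`, `radius` and `t_thres` are hypothetical placeholders supplied by the caller, not the constants the paper derives from $L,\rho,\varepsilon,c,\delta,\Delta_{\Phi}$ , and the test function is an illustrative double well, not an example from the paper.

```python
import numpy as np


def soft_threshold(y, tau):
    """prox of tau*||.||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)


def perturbed_prox_descent(phi, grad_f, x0, eta, lam, g_thres, f_thres,
                           radius, t_thres, max_iter=5000, seed=0):
    """Proximal gradient descent with occasional random perturbations.

    All thresholds are caller-supplied placeholders (hypothetical names),
    not the constants used in the paper's Algorithm 1.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    x_anchor, t_noise = x.copy(), -t_thres - 1
    for t in range(max_iter):
        grad_map = x - soft_threshold(x - eta * grad_f(x), eta * lam)
        if np.linalg.norm(grad_map) <= g_thres and t - t_noise > t_thres:
            # Looks stationary: remember the point, then perturb uniformly in a ball.
            x_anchor, t_noise = x.copy(), t
            xi = rng.normal(size=x.shape)
            xi *= radius * rng.uniform() ** (1.0 / x.size) / np.linalg.norm(xi)
            x = x + xi
        if t - t_noise == t_thres and phi(x) - phi(x_anchor) > -f_thres:
            # The perturbation did not buy enough decrease: accept the anchor.
            return x_anchor
        x = soft_threshold(x - eta * grad_f(x), eta * lam)
    return x


# Illustrative double well: saddle of f at the origin, minima near (+-1, 0).
lam = 1e-4
phi = lambda x: 0.25 * x[0]**4 - 0.5 * x[0]**2 + 0.5 * x[1]**2 + lam * np.abs(x).sum()
grad_f = lambda x: np.array([x[0]**3 - x[0], x[1]])

x_out = perturbed_prox_descent(phi, grad_f, np.zeros(2), eta=0.1, lam=lam,
                               g_thres=1e-6, f_thres=1e-2, radius=0.1, t_thres=200)
print(x_out, phi(x_out))
```

Started exactly at the saddle, the unperturbed iteration would never move. With the perturbation the iterates slide into one of the two wells, and the routine stops once a later perturbation no longer produces a meaningful decrease.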

3 Escaping Saddle Points through Perturbed Proximal Descent

The main step in the algorithm is a proximal gradient descent step applied to $f+g$ , defined as

$$\mathbf{x}_{t+1}=\mathrm{prox}_{\eta g}\circ(I-\eta\nabla f)(\mathbf{x}_{t})\tag{3}$$

One motivation of preferring proximal descent to gradient descent, as shown in Figure 2, is the stability of the algorithm with respect to stepsize change. The proximal step is similar to the implicit/backward Euler scheme, as equation (3) can be written as $\mathbf{x}_{t+1}=\mathbf{x}_{t}-\eta\big{(}\nabla f(\mathbf{x}_{t})+\partial g(\mathbf{x}_{t+1})\big{)}$ . From this perspective, we expect that proximal descent will demonstrate at least the same convergence speed as gradient descent and stronger stability with respect to hyperparameter setting.
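The implicit-Euler reading can be verified for $g=\lambda\|\cdot\|_{1}$ : after one proximal step, the vector $\mathbf{s}=(\mathbf{x}_{t}-\mathbf{x}_{t+1})/\eta-\nabla f(\mathbf{x}_{t})$ should be a subgradient of $g$ at the new point $\mathbf{x}_{t+1}$ , not at $\mathbf{x}_{t}$ . A minimal sketch follows; the vector `grad` stands in for $\nabla f(\mathbf{x}_{t})$ and its values are arbitrary.

```python
import numpy as np


def soft_threshold(y, tau):
    """prox of tau*||.||_1: shrink every entry toward zero by tau."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)


def implied_subgradient(x, grad, eta, lam):
    """Return (x_next, s) such that x_next = x - eta*(grad + s)."""
    x_next = soft_threshold(x - eta * grad, eta * lam)
    s = (x - x_next) / eta - grad
    return x_next, s


def in_l1_subdifferential(s, x_next, lam, tol=1e-9):
    """Check that s lies in the subdifferential of lam*||.||_1 at x_next."""
    nz = x_next != 0
    ok_nonzero = np.all(np.abs(s[nz] - lam * np.sign(x_next[nz])) <= tol)
    ok_zero = np.all(np.abs(s[~nz]) <= lam + tol)
    return bool(ok_nonzero and ok_zero)


x = np.array([1.0, 0.02, -0.7])
grad = np.array([0.4, -0.1, 0.3])   # stand-in for grad f(x_t); arbitrary values
eta, lam = 0.5, 0.2

x_next, s = implied_subgradient(x, grad, eta, lam)
print(x_next, s, in_l1_subdifferential(s, x_next, lam))
```

On coordinates where $\mathbf{x}_{t+1}$ is nonzero, $\mathbf{s}$ equals $\lambda\,\mathrm{sgn}(\mathbf{x}_{t+1})$ . On coordinates that were thresholded to zero, it takes some value in $[-\lambda,\lambda]$ .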

Definition 1 (Gradient Mapping)

Consider a function $\Phi(\mathbf{x})=f(\mathbf{x})+g(\mathbf{x})$ . The gradient mapping is defined as $G^{f,g}_{\eta}(\mathbf{x})\mathrel{\mathop{:}}=\mathbf{x}-\mathrm{prox}_{\eta g}[\mathbf{x}-\eta\nabla f(\mathbf{x})]$

In the rest of this paper, the super- and subscript of the gradient mapping are not specified, as it is always clear that $f$ represents the smooth nonconvex part of $\Phi$ , $g$ represents $\lambda\|\mathbf{x}\|_{1}$ , and $\eta$ is the stepsize used in the algorithm. Observe that if $g\equiv 0$ , the gradient mapping as defined above reduces to $\eta\nabla f(\mathbf{x})$ , i.e. the gradient of $f$ scaled by the stepsize.
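A short numerical check of Definition 1 for $g=\lambda\|\cdot\|_{1}$ . The quadratic $f(\mathbf{x})=\frac{1}{2}\|\mathbf{x}-\mathbf{b}\|^{2}$ is an illustrative assumption, picked because the stationary point of $f+\lambda\|\cdot\|_{1}$ is then available in closed form.

```python
import numpy as np


def soft_threshold(y, tau):
    """prox of tau*||.||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)


def gradient_mapping(x, grad_f, eta, lam):
    """G(x) = x - prox_{eta g}[x - eta*grad_f(x)] with g = lam*||.||_1."""
    return x - soft_threshold(x - eta * grad_f(x), eta * lam)


b = np.array([2.0, 0.3, -1.5])
grad_f = lambda x: x - b            # f = 0.5*||x - b||^2 (illustrative)
eta = 0.5

x = np.array([0.5, -0.2, 1.0])
G_smooth = gradient_mapping(x, grad_f, eta, lam=0.0)   # g == 0

x_stat = soft_threshold(b, 0.5)     # minimizer of f + 0.5*||x||_1
G_stat = gradient_mapping(x_stat, grad_f, eta, lam=0.5)
print(G_smooth, G_stat)
```

With `lam=0.0` the mapping equals `eta * grad_f(x)`, and at the stationary point it vanishes, matching Definition 2.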

Definition 2 (First order stationary points)

For a function $\Phi(\mathbf{x})$ , define first order stationary points as the points which satisfy $G(\mathbf{x})=0.$

Definition 3 ( $\varepsilon$ -second-order stationary point)

Consider a function $\Phi(\mathbf{x})=f(\mathbf{x})+g(\mathbf{x})$ . A point $\mathbf{x}$ is an $\varepsilon$ -second-order stationary point if

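$$\|G(\mathbf{x})\|\leq\varepsilon\quad\textrm{and}\quad\lambda\big(\nabla^{2}f(\mathbf{x})\big)_{\min}\geq-\sqrt{\rho\varepsilon}$$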

where $\lambda(\cdot)_{\min}$ is the smallest eigenvalue.
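Definition 3 translates into a two-part test: a small gradient mapping and a Hessian whose smallest eigenvalue is not too negative. The sketch below is one possible way to code that test for $g=\lambda\|\cdot\|_{1}$ . The function name and the two quadratic test problems are assumptions made for illustration, and `rho=1.0` is simply a valid (non-tight) Hessian Lipschitz constant for a quadratic.

```python
import numpy as np


def soft_threshold(y, tau):
    """prox of tau*||.||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)


def is_eps_second_order_stationary(x, grad_f, hess_f, eta, lam, eps, rho):
    """Check ||G(x)|| <= eps and lambda_min(hess f(x)) >= -sqrt(rho*eps)."""
    G = x - soft_threshold(x - eta * grad_f(x), eta * lam)
    lam_min = np.linalg.eigvalsh(hess_f(x))[0]   # eigenvalues come back ascending
    return bool(np.linalg.norm(G) <= eps and lam_min >= -np.sqrt(rho * eps))


def quadratic(A):
    """Return gradient and Hessian callables for f(x) = 0.5 * x^T A x."""
    return (lambda x: A @ x), (lambda x: A)


origin = np.zeros(2)
grad_s, hess_s = quadratic(np.diag([1.0, -1.0]))   # saddle at the origin
grad_b, hess_b = quadratic(np.diag([1.0, 2.0]))    # bowl with minimum at the origin

at_saddle = is_eps_second_order_stationary(origin, grad_s, hess_s,
                                           eta=0.1, lam=0.0, eps=0.01, rho=1.0)
at_minimum = is_eps_second_order_stationary(origin, grad_b, hess_b,
                                            eta=0.1, lam=0.0, eps=0.01, rho=1.0)
print(at_saddle, at_minimum)
```

Both points have a zero gradient mapping. Only the eigenvalue condition separates the saddle from the minimizer.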

The first Lipschitz assumption below is standard [3], and the assumption on the Hessian was used in [13] (for example, it is true if $f$ is quadratic).

Assumption A1 (Lipschitz Properties)

$\nabla f$ is $L$ -Lipschitz continuous and $\nabla^{2}f$ is $\rho$ -Lipschitz continuous. We write $\mathcal{H}$ as shorthand for $\nabla^{2}f(\mathbf{x})$ when $\mathbf{x}$ is clear from context.

Assumption A2 (Moderate Nonsmooth Term)

The magnitude of $\|\mathbf{x}\|_{1}$ term, which is denoted by $\lambda$ , satisfies inequalities (7) and (9).

Theorem 3.1 (Main)

There exists an absolute constant $c_{\max}$ such that if $f(\cdot)$ satisfies A1 and A2, then for any $\delta>0,\varepsilon\leq\frac{L^{2}}{\rho},\Delta_{\Phi}\geq\Phi(\mathbf{x}_{0})-\Phi^{\star}$ , and constant $c\leq c_{\max}$ , with probability $1-\delta$ , the output of $\text{PPD}(\mathbf{x}_{0},L,\rho,\varepsilon,c,\delta,\Delta_{\Phi})$ will be an $\varepsilon$ -second-order stationary point, and the algorithm terminates within the following number of iterations:

$$O\bigg(\frac{L(\Phi(\mathbf{x}_{0})-\Phi^{\star})}{\varepsilon^{2}}\ln^{4}\Big(\frac{dL\Delta_{\Phi}}{\varepsilon^{2}\delta}\Big)\bigg)$$

Remark

Assuming $\varepsilon\leq\frac{L^{2}}{\rho}$ does not lead to loss of generality. Recall that the second-order condition is $\lambda\big{(}\nabla^{2}f(\mathbf{x}^{\star})\big{)}_{\min}\geq-\sqrt{\rho\varepsilon}$ . When $\varepsilon\geq\frac{L^{2}}{\rho}$ , we always have $-\sqrt{\rho\varepsilon}\leq-L\leq\lambda\big{(}\nabla^{2}f(\mathbf{x}^{\star})\big{)}_{\min}$ , where the second inequality holds because the Lipschitz constant $L$ of $\nabla f$ bounds the eigenvalues of $\nabla^{2}f(\mathbf{x})$ in absolute value. Consequently, when $\varepsilon\geq\frac{L^{2}}{\rho}$ , the second-order condition holds automatically, so every point with $\|G(\mathbf{x})\|\leq\varepsilon$ is already an $\varepsilon$ -second-order stationary point.

For the proof of the main theorem, we introduce some notation and units that simplify the statements.

For matrices we use $\|\cdot\|$ to denote spectral norm. The operator $\mathcal{P}_{\mathcal{S}}(\cdot)$ denotes projection onto set $\mathcal{S}$ . Define the local approximation of the smooth part of the objective function by

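$$\tilde{f}_{\mathbf{x}}(\mathbf{y}):=f(\mathbf{x})+\nabla f(\mathbf{x})^{T}(\mathbf{y}-\mathbf{x})+\frac{1}{2}(\mathbf{y}-\mathbf{z})^{T}\mathcal{H}(\mathbf{y}-\mathbf{z})$$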

Units

With the condition number of the Hessian matrix $\kappa\mathrel{\mathop{:}}=\frac{L}{\gamma}\geq 1$ , we define the following units to simplify the statements in the proof:

[TABLE]

3.1 Lemma: Iterates remain bounded if stuck near a saddle point

Lemma 1

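For any constant $\hat{c}\geq 3$ , there exists an absolute constant $c_{\max}$ such that the following holds. For any $\delta\in(0,\frac{d\kappa}{e}]$ , let $f(\cdot),\tilde{\mathbf{x}}$ satisfy the conditions in Lemma 6. For any initial point $\mathbf{u}_{0}$ with $\|{\mathbf{u}_{0}-\tilde{\mathbf{x}}}\|\leq 2\mathscr{S}/(\kappa\cdot\ln(\frac{d\kappa}{\delta}))$ , define:

$$T=\min\Big\{\inf_{t}\big\{t\mid\tilde{f}_{\mathbf{u}_{0}}(\mathbf{u}_{t})-f(\mathbf{u}_{0})+g(\mathbf{u}_{t})-g(\mathbf{u}_{0})\leq-3\mathscr{F}\big\},\ \hat{c}\mathscr{T}\Big\}$$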

then, for any $\eta\leq c_{\max}/L$ , we have for all $t<T$ that $\|{\mathbf{u}_{t}-\tilde{\mathbf{x}}}\|\leq 100(\mathscr{S}\cdot\hat{c})$ .

Proof

We show that if the function value does not decrease, then all the iterates must remain in a small ball. The proximal descent updates the solution as

[TABLE]

Without loss of generality, set $\mathbf{u}_{0}=0$ to be the origin. For any $t\in\mathbb{N}$ ,

[TABLE]

Jin et al. prove in [13] by induction that if $\|\mathbf{u}_{t}\|\leq 100(\mathscr{S}\cdot\hat{c})$ , then $\|\tilde{\mathbf{u}}_{t+1}\|\leq 100(\mathscr{S}\cdot\hat{c})$ . Consequently, $\|\mathbf{u}_{t+1}\|\leq 100(\mathscr{S}\cdot\hat{c})$ .

We point out that it is implicitly assumed that $\frac{2\mathscr{S}}{\kappa\cdot\ln(\frac{d\kappa}{\delta})}\ll\hat{c}$ , so that for all $t<T$ , $\|\tilde{\mathbf{x}}\|\ll\|\mathbf{u}_{t}\|$ , and the relation $\|\mathbf{u}_{t}-\tilde{\mathbf{x}}\|\leq\|\mathbf{u}_{t}\|+\|\tilde{\mathbf{x}}\|\leq 100(\mathscr{S}\cdot\hat{c})$ holds.

3.2 Preparation for Building Pillars

Lemma 2 (Existence of lower bound for the difference sequence $\{\mathbf{v}_{t}\}_{t=1}^{T}$ )

For iteration sequences $\{\mathbf{w}_{t}\}$ and $\{\mathbf{u}_{t}\}$ defined in Lemma 4, define the difference sequence as

$$\mathbf{v}_{t}=\mathbf{w}_{t}-\mathbf{u}_{t}$$

There exists a positive lower bound for $\{\mathbf{v}_{t}\}$ when $t<\hat{c}\mathscr{T}$ .

Proof

To show that the lower bound for the iteration difference $\{\mathbf{v}_{t}\}_{t=1}^{T}$ exists, we first bound the sequence $\tilde{\mathbf{v}}_{t+1}$ . Define the difference between the proximal operator of the $l_{1}$ penalty term and its argument as $\mathcal{D}_{g}[\mathbf{x}]=\mathrm{prox}_{g}[\mathbf{x}]-\mathbf{x}=\min\{\lambda\mathbbm{1},|\mathbf{x}|\}\otimes\,\mathrm{sgn}(-\mathbf{x})$ , where $\otimes$ is the Hadamard product and the minimum is taken elementwise. We notice that $\|\mathcal{D}_{\eta\lambda\|\cdot\|_{1}}[\mathbf{x}]\|\leq\eta\lambda\sqrt{d}$ . Thus, $\|\mathbf{w}_{k}-\mathbf{u}_{k}\|=\|\tilde{\mathbf{w}}_{k}-\tilde{\mathbf{u}}_{k}+(\mathcal{D}_{\eta g}[\tilde{\mathbf{w}}_{k}]-\mathcal{D}_{\eta g}[\tilde{\mathbf{u}}_{k}])\|\geq\|\tilde{\mathbf{w}}_{k}-\tilde{\mathbf{u}}_{k}\|-2\eta\lambda\sqrt{d}$ .
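Both facts used here, the bound $\|\mathcal{D}_{\eta\lambda\|\cdot\|_{1}}[\mathbf{x}]\|\leq\eta\lambda\sqrt{d}$ and the resulting lower bound on the distance between two soft-thresholded points, are easy to sanity-check numerically. The sketch below uses random vectors standing in for $\tilde{\mathbf{w}}_{k}$ and $\tilde{\mathbf{u}}_{k}$ . The dimension, scale, and parameter values are arbitrary assumptions.

```python
import numpy as np


def soft_threshold(y, tau):
    """prox of tau*||.||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)


def prox_shift(y, tau):
    """D[y] = prox(y) - y, the displacement caused by soft thresholding."""
    return soft_threshold(y, tau) - y


rng = np.random.default_rng(0)
d, eta, lam = 50, 0.1, 0.3
tau = eta * lam

w_tilde = rng.normal(scale=0.2, size=d)   # stand-in for the pre-prox iterate w~_k
u_tilde = rng.normal(scale=0.2, size=d)   # stand-in for the pre-prox iterate u~_k

# Closed form of the displacement: min{tau, |y|} (elementwise) times sgn(-y).
closed_form = np.minimum(tau, np.abs(w_tilde)) * np.sign(-w_tilde)
shift_norm = np.linalg.norm(prox_shift(w_tilde, tau))

# ||w_k - u_k|| versus ||w~_k - u~_k|| - 2*eta*lam*sqrt(d).
lhs = np.linalg.norm(soft_threshold(w_tilde, tau) - soft_threshold(u_tilde, tau))
rhs = np.linalg.norm(w_tilde - u_tilde) - 2 * tau * np.sqrt(d)
print(shift_norm, tau * np.sqrt(d), lhs, rhs)
```

Each coordinate moves by at most $\eta\lambda$ , which is where the $\sqrt{d}$ factor comes from. The second inequality is the reverse triangle inequality applied to the two displacements.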

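$$\begin{aligned}\|\tilde{\mathbf{v}}_{t+1}\|&=\|(I-\eta\nabla f)\circ\mathrm{prox}_{\eta g}(\tilde{\mathbf{w}}_{k})-(I-\eta\nabla f)\circ\mathrm{prox}_{\eta g}(\tilde{\mathbf{u}}_{k})\|\\&=\|\mathbf{w}_{k}-\mathbf{u}_{k}-\eta(\nabla f(\mathbf{w}_{k})-\nabla f(\mathbf{u}_{k}))\|\\&\geq\|\mathbf{w}_{k}-\mathbf{u}_{k}\|-\eta L\|\mathbf{w}_{k}-\mathbf{u}_{k}\|=(1-\eta L)\|\mathbf{w}_{k}-\mathbf{u}_{k}\|\\&\geq(1-\eta L)\big(\|\tilde{\mathbf{w}}_{k}-\tilde{\mathbf{u}}_{k}\|-2\eta\lambda\sqrt{d}\big)=(1-\eta L)\big(\|\tilde{\mathbf{v}}_{k}\|-2\eta\lambda\sqrt{d}\big)\\&\geq(1-\eta L)^{t}\|\tilde{\mathbf{v}}_{1}\|-2\eta\lambda\sqrt{d}\sum_{i=1}^{t}(1-\eta L)^{i}\\&=(1-\eta L)^{t}\|\tilde{\mathbf{v}}_{1}\|-2\lambda\sqrt{d}\,\frac{(1-\eta L)\big(1-(1-\eta L)^{t}\big)}{L}\end{aligned}$$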

As $\tilde{\mathbf{v}}_{1}=(I-\eta\nabla f)\mathbf{v}_{0}=(I-\eta\nabla f)\mu r\mathbf{e}_{1}=\mu r(\mathbf{e}_{1}-\eta\nabla^{2}f(\bm{\xi})\theta\mathbf{e}_{1})=\mu r(1+\eta\gamma\theta)\mathbf{e}_{1}$ , where $\theta\in(0,1)$ , we have

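$$\|\tilde{\mathbf{v}}_{t+1}\|\geq(1-\eta L)^{t}\mu r(1+\eta\gamma\theta)-2\lambda\sqrt{d}\,\frac{(1-\eta L)\big(1-(1-\eta L)^{t}\big)}{L}$$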

To compare $\|\mathbf{v}_{t}\|$ and $\|\tilde{\mathbf{v}}_{t}\|$ ,

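$$\|\mathbf{v}_{t+1}\|\geq\|\tilde{\mathbf{v}}_{t+1}\|-2\eta\lambda\sqrt{d}\geq(1-\eta L)^{t}\mu r(1+\eta\gamma\theta)-2\lambda\sqrt{d}\,\frac{(1-\eta L)\big(1-(1-\eta L)^{t}\big)+\eta L}{L}$$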

Therefore, as long as

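$$\lambda<\frac{(1-\eta L)^{\hat{c}\mathscr{T}}\mu\frac{1}{\kappa(\ln\frac{d\kappa}{\delta})^{2}}\sqrt{\eta}L^{\frac{3}{2}}\frac{\gamma}{\rho}(1+\eta\gamma\theta)}{2\sqrt{d}\big[(1-\eta L)\big(1-(1-\eta L)^{\hat{c}\mathscr{T}}\big)+\eta L\big]}\tag{7}$$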

the difference sequence $\{\|\mathbf{v}_{t}\|\}$ has a positive lower bound on its norm.

Lemma 3 (Preservation of subspace projection monotonicity after prox of $l_{1}$ in rotated coordinate with small $\lambda$ )

Denote the subspace of $\mathbb{R}^{n}$ spanned by $\{\mathbf{e}_{1}\}$ as $\mathbb{E}$ , while the complement subspace spanned by $\{\mathbf{e}_{2},\cdots,\mathbf{e}_{n}\}$ as $\mathbb{E}^{\perp}$ . For a given vector $\mathbf{x}$ chosen from a lower bounded set $\mathcal{X}$ , i.e. $\forall\,\mathbf{x}\in\mathcal{X}$ , $\|\mathbf{x}\|\geq C$ for some constant $C>0$ , assume $\|\mathcal{P}_{\mathbb{E}^{\perp}}\mathbf{x}\|\leq K\|\mathcal{P}_{\mathbb{E}}\mathbf{x}\|$ , where $0<K\leq 1$ is a constant. If the parameter $\lambda$ for the $l_{1}$ penalty term is small enough, then

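$$\|\mathcal{P}_{\mathbb{E}^{\perp}}\mathrm{prox}_{\eta g}(\mathbf{x})\|\leq K\|\mathcal{P}_{\mathbb{E}}\mathrm{prox}_{\eta g}(\mathbf{x})\|$$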

Proof

We want to find a constraint on $\lambda$ such that, when $\lambda$ is small enough, if the projections in the original coordinates satisfy the monotonicity relation $\|\mathcal{P}_{\mathbb{E}^{\perp}}\mathbf{x}\|\leq K\|\mathcal{P}_{\mathbb{E}}\mathbf{x}\|$ , then this relation is preserved after the proximal operator of $l_{1}$ is applied to the input vector.

Naturally there exists a normal vector, denoted as $\hat{\bm{n}}_{\text{boundary}}\equiv\hat{\bm{n}}$ , for the boundary hyperplane on which $\|\mathcal{P}_{\mathbb{E}^{\perp}}\mathbf{x}\|=K\|\mathcal{P}_{\mathbb{E}}\mathbf{x}\|$ . By moving along $\hat{\bm{n}}$ , a point approaches the boundary most efficiently. Any vector inside the hyperplane is perpendicular to $\hat{\bm{n}}$ , which we denote as $\hat{\bm{n}}^{\perp}$ .

Define

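$$\big(\hat{\mathbf{v}}_{\mathrm{move}}(\mathbf{x})\big)_{i}=\begin{cases}-\eta\lambda\cdot\mathrm{sgn}(x_{i})&\text{if }|x_{i}|\geq\eta\lambda\\-x_{i}&\text{if }|x_{i}|<\eta\lambda\end{cases}\quad\text{i.e.}\quad\hat{\mathbf{v}}_{\mathrm{move}}(\mathbf{x})=\min\{|\mathbf{x}|,\eta\lambda\mathbbm{1}\}\otimes\mathrm{sgn}(-\mathbf{x})$$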

where $\otimes$ is the Hadamard product, and the minimum is taken elementwise. Because $\mathrm{prox}_{\eta g}(\mathbf{x})=\mathbf{x}+\hat{\mathbf{v}}_{\textrm{move}}$ , a sufficient condition to be imposed on $\lambda$ to guarantee the preservation of projection monotonicity $\|\mathcal{P}_{\mathbb{E}^{\perp}}\mathrm{prox}_{\eta g}(\mathbf{x})\|\leq K\|\mathcal{P}_{\mathbb{E}}\mathrm{prox}_{\eta g}(\mathbf{x})\|$ is that

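$$\lambda<\bigg\|\frac{\mathrm{Proj}_{\bm{n}}\mathbf{x}}{\hat{\mathbf{v}}_{\mathrm{move}}\cdot\hat{\bm{n}}}\bigg\|=\bigg\|\frac{\mathbf{x}\cdot\hat{\bm{n}}}{\hat{\mathbf{v}}_{\mathrm{move}}\cdot\hat{\bm{n}}}\bigg\|\leq\frac{\|\mathbf{x}\|}{\|\hat{\mathbf{v}}_{\mathrm{move}}\cdot\hat{\bm{n}}\|}$$

which means that the displacement caused by applying the $l_{1}$ proximal operator (soft shrinkage), projected onto the direction of $\hat{\bm{n}}$ , is less than the distance from $\mathbf{x}$ to the boundary hyperplane, so the vector stays on the same side of the boundary after moving.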

Therefore, as long as

$$\lambda<\frac{C}{\|\hat{\mathbf{v}}_{\mathrm{move}}\cdot\hat{\bm{n}}\|}\tag{9}$$

the monotonicity of projection onto subspaces can be preserved.

Remark 1 for Lemma 3

As an example in $\mathbb{R}^{2}$ , set $K=1$ ; we visualise the shift caused by the proximal operator and the boundary of the projection-monotonicity preserving region. Assume $\mathbf{e}_{1,2}$ form the orthonormal basis of Cartesian coordinates in the standard position. The directional vector for the region division boundary is $\displaystyle\hat{\mathbf{e}}_{\mathrm{boundary}}=\hat{\bm{n}}^{\perp}=\frac{\pm\hat{\mathbf{e}}_{1}\pm\hat{\mathbf{e}}_{2}}{\sqrt{2}}$ , and $\hat{\mathbf{e}}_{\mathrm{boundary}}^{\perp}=\hat{\bm{n}}$ is the corresponding perpendicular directional vector. For the $l_{1}$ norm, $\hat{\mathbf{v}}_{\mathrm{move}}$ is $(\pm 1,\pm 1)$ .

Remark 2 for Lemma 3

We point out that the upper bound for the parameter $\lambda$ is related to the alignment of the eigenspace of $\mathcal{H}$ . If the eigenspace of $\mathcal{H}$ is aligned with the canonical orthonormal basis of $\mathbb{R}^{d}$ , then $\lambda\in(0,\infty)$ . The most stringent restriction on the upper bound of $\lambda$ applies when $\hat{\mathbf{v}}_{\textrm{move}}$ is parallel to $\hat{\bm{n}}$ .

3.3 Lemma: Perturbed iterates will escape the saddle point

Lemma 4

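There exist absolute constants $c_{\max},\hat{c}$ such that the following holds. For any $\delta\in(0,\frac{d\kappa}{e}]$ , let $f(\cdot),\tilde{\mathbf{x}}$ satisfy the conditions in Lemma 6, and let the sequences $\{\mathbf{u}_{t}\},\{\mathbf{w}_{t}\}$ satisfy the conditions in Lemma 6. Define:

$$T=\min\Big\{\inf_{t}\big\{t\mid\tilde{f}_{\mathbf{w}_{0}}(\mathbf{w}_{t})+g(\mathbf{w}_{t})-f(\mathbf{w}_{0})-g(\mathbf{w}_{0})\leq-3\mathscr{F}\big\},\ \hat{c}\mathscr{T}\Big\}$$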

then, for any $\eta\leq c_{\max}/L$ , if $\|{\mathbf{u}_{t}-\tilde{\mathbf{x}}}\|\leq 100(\mathscr{S}\cdot\hat{c})$ for all $t<T$ , we will have $T<\hat{c}\mathscr{T}$ .

Proof

We show that if the iterate sequence starting from $\mathbf{u}_{0}$ does not provide sufficient function value decrease before time $T$ , then the other iterate sequence, which starts from $\mathbf{w}_{0}$ , will achieve that decrease. Ultimately, we will prove $T<\hat{c}\mathscr{T}$ . We establish the inequality about $T$ by considering the difference between $\mathbf{w}_{t}$ and $\mathbf{u}_{t}$ . Define $\mathbf{v}_{t}=\mathbf{w}_{t}-\mathbf{u}_{t}$ . By the assumption of Lemma 4, $\mathbf{v}_{0}=\mu[\mathscr{S}/(\kappa\cdot\ln(\frac{d\kappa}{\delta}))]\mathbf{e}_{1}$ , $\mu\in[\delta/(2\sqrt{d}),1].$

We bound $\|\mathbf{v}_{t}\|$ from both sides for all $t<T$ to obtain an inequality about $T$ .

Recall that the proximal descent updates the solution as

[TABLE]

Simple algebraic computation gives

$$\tilde{\mathbf{v}}_{t+1}=(I-\eta\mathcal{H}-\eta\Delta^{\prime}_{t})\mathbf{v}_{t}\tag{10}$$

where $\Delta^{\prime}_{t}=\int_{0}^{1}\nabla^{2}f(\mathbf{u}_{t}+\theta\mathbf{v}_{t})\,\mathrm{d}\theta-\mathcal{H}$ , and $\tilde{\mathbf{v}}_{t}=\tilde{\mathbf{w}}_{t}-\tilde{\mathbf{u}}_{t}$ .

Consider $\|\tilde{\mathbf{u}}_{t}\|$ and $\|\tilde{\mathbf{w}}_{t}\|$ . Because $\mathbf{v}_{0}=\tilde{\mathbf{v}}_{0}$ , we have $\|\tilde{\mathbf{w}}_{0}-\tilde{\mathbf{x}}\|\leq\|\tilde{\mathbf{u}}_{0}-\tilde{\mathbf{x}}\|+\|\tilde{\mathbf{v}}_{0}\|\leq 2\mathscr{S}/(\kappa\cdot\ln(\frac{d\kappa}{\delta}))$ . By the same logic as in the proof of Lemma 1, we see $\|\tilde{\mathbf{u}}_{t}\|\leq 100(\mathscr{S}\cdot\hat{c})$ , and $\|\tilde{\mathbf{w}}_{t}\|\leq 100(\mathscr{S}\cdot\hat{c})$ . (The same relations hold for $\|\mathbf{u}_{t}\|$ and $\|\mathbf{w}_{t}\|$ respectively.) As a result, $\|\tilde{\mathbf{v}}_{t}\|\leq\|\tilde{\mathbf{w}}_{t}\|+\|\tilde{\mathbf{u}}_{t}\|\leq 200(\mathscr{S}\cdot\hat{c})$ for all $t<T$ . Also,

$$\|\mathbf{v}_{t}\|\leq 200(\mathscr{S}\cdot\hat{c})\tag{11}$$

Equation (11) and the Lipschitz continuity of the Hessian give, for $t<T$ , $\|\Delta^{\prime}_{t}\|\leq\rho(\|\mathbf{u}_{t}\|+\|\mathbf{v}_{t}\|+\|\tilde{\mathbf{x}}\|)\leq\rho\mathscr{S}(300\hat{c}+1)=\frac{\zeta}{\eta}$ , where $\zeta=\eta\rho\mathscr{S}(300\hat{c}+1)$ .

Let $\psi_{t}$ denote the norm of $\mathbf{v}_{t}$ projected onto the $\mathbf{e}_{1}$ direction ( $\mathcal{S}$ ), and $\varphi_{t}$ the norm of $\mathbf{v}_{t}$ projected onto the remaining subspace ( $\mathcal{S}^{c}$ ). Similarly, let $\tilde{\psi}_{t}$ denote the norm of $\tilde{\mathbf{v}}_{t}$ projected onto $\mathcal{S}$ , and $\tilde{\varphi}_{t}$ the norm of $\tilde{\mathbf{v}}_{t}$ projected onto $\mathcal{S}^{c}$ .

Equation (10) gives

[TABLE]

To obtain the lower bound of $\|\mathbf{v}_{t}\|$ , we prove the following relation as preparation:

$$\text{for all }t<T,\quad\varphi_{t}\leq 4\zeta t\cdot\psi_{t}\tag{14}$$

By the hypothesis of Lemma 4, we know $\varphi_{0}=0$ , thus the base case of the induction holds. Assume equation (14) holds for all $\tau\leq t$ . Then for $t+1\leq T$ we have

[TABLE]

By choosing $\sqrt{c_{\max}}\leq\frac{1}{300\hat{c}+1}\min\{\frac{1}{2\sqrt{2}},\frac{1}{4\hat{c}}\}$ , and $\eta\leq\frac{c_{\max}}{L}$ , we have $4\zeta(t+1)\leq 4\zeta T\leq 4\eta\rho\mathscr{S}(300\hat{c}+1)\hat{c}\mathscr{T}=4\sqrt{\eta L}(300\hat{c}+1)\hat{c}\leq 1$ . This gives $4(1+\gamma\eta)\psi_{t}\geq 4\psi_{t}\geq(1+1)\sqrt{2\psi_{t}^{2}}\geq(1+4\zeta(t+1))\sqrt{\psi_{t}^{2}+\varphi_{t}^{2}}$ . i.e.

[TABLE]
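The step $4\zeta T\leq 1$ used above can be unpacked as follows. The unit $\mathscr{T}$ is not written out in this text, so the derivation assumes $\mathscr{T}=\ln(\frac{d\kappa}{\delta})/(\eta\gamma)$ , as in Jin et al. [13], together with $\mathscr{S}=\sqrt{\eta L}\,\frac{\gamma}{\rho}\ln^{-1}(\frac{d\kappa}{\delta})$ ; both choices are consistent with the identity $\eta\rho\mathscr{S}\mathscr{T}=\sqrt{\eta L}$ that the proof relies on.

$$\begin{aligned}\eta\rho\,\mathscr{S}\,\mathscr{T}&=\eta\rho\cdot\sqrt{\eta L}\,\frac{\gamma}{\rho}\ln^{-1}\Big(\frac{d\kappa}{\delta}\Big)\cdot\frac{\ln(\frac{d\kappa}{\delta})}{\eta\gamma}=\sqrt{\eta L},\\4\zeta T&\leq 4\,\eta\rho\mathscr{S}(300\hat{c}+1)\,\hat{c}\mathscr{T}=4\sqrt{\eta L}\,(300\hat{c}+1)\,\hat{c}\\&\leq 4\sqrt{c_{\max}}\,(300\hat{c}+1)\,\hat{c}\leq 4\cdot\frac{1}{(300\hat{c}+1)\,4\hat{c}}\cdot(300\hat{c}+1)\,\hat{c}=1.\end{aligned}$$

The third line uses $\eta\leq c_{\max}/L$ , and the last inequality uses the second term of the minimum in the bound on $\sqrt{c_{\max}}$ .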

Connecting two parts of equation (3.3), we obtain

[TABLE]

Now we switch our focus to the eigenspace of Hessian $\mathcal{H}$ . Assume the orthonormal basis for the eigensapce of $\mathcal{H}$ is $\{\mathbf{e}_{1},\mathbf{e}_{2},\cdots,\mathbf{e}_{d}\}$ . The order of dimension aligns with the increasing order of the corresponding eigenvalues. This coordinate transformation does not lead to loss of generality, as it is unitary.

By lemma 2, we know the iteration difference sequence $\mathbf{v}_{t}$ has a positive lower bound in terms of 2-norm. Therefore, by lemma 3, with the virtue of equation (17) $\sqrt{\sum_{i=2}^{d}(\mathbf{e}_{i}^{T}\tilde{\mathbf{v}}_{t+1})^{2}}\leq 4\zeta(t+1)\|\mathbf{e}_{1}^{T}\tilde{\mathbf{v}}_{t+1}\|$ , we still have the projection monotonicity on the subspace of eigenspace of $\mathcal{H}$ , i.e.

[TABLE]

This completes the induction.

Recalling that $4\zeta(t+1)\leq 1$ , we have $\varphi_{t}\leq 4\zeta t\psi_{t}\leq\psi_{t}$ , which gives

[TABLE]

where the last inequality follows from $\zeta=\eta\rho\mathscr{S}(300\hat{c}+1)\leq\sqrt{c_{\max}}(300\hat{c}+1)\gamma\eta\cdot\ln^{-1}(\frac{d\kappa}{\delta})\leq\frac{\gamma\eta}{2\sqrt{2}}$ .

Finally, combining (11) and (18), we have for all $t<T$ :

[TABLE]

This implies

[TABLE]

The last inequality holds because $\delta\in(0,\frac{d\kappa}{e}]$ implies $\ln(\frac{d\kappa}{\delta})\geq 1$ . Choosing the constant $\hat{c}$ large enough that $2+\ln(400\hat{c})\leq\hat{c}$ , we obtain $T<\hat{c}\mathscr{T}$ , which finishes the proof.
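To see how large $\hat{c}$ must be for the condition $2+\ln(400\hat{c})\leq\hat{c}$ , a quick numerical check suffices. This is only an illustrative sketch; the helper name `satisfies` is ours.

```python
import math

def satisfies(c_hat: float) -> bool:
    """Check the condition 2 + ln(400 * c_hat) <= c_hat."""
    return 2 + math.log(400 * c_hat) <= c_hat

# c_hat - 2 - ln(400 * c_hat) is increasing for c_hat > 1, so once the
# condition holds for some integer it holds for all larger values.
smallest_integer = next(c for c in range(1, 1000) if satisfies(c))
print(smallest_integer)  # 11
```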

3.4 Combining Previous Results

Lemma 5

There exists a universal constant $c_{\max}$ such that the following holds. For any $\delta\in(0,\frac{d\kappa}{e}]$ , let $f(\cdot)$ and $\tilde{\mathbf{x}}$ satisfy the conditions in Lemma 6, and without loss of generality let $\mathbf{e}_{1}$ be the eigenvector of $\nabla^{2}f(\tilde{\mathbf{x}})$ associated with the minimum eigenvalue. Consider two gradient descent sequences $\{\mathbf{u}_{t}\},\{\mathbf{w}_{t}\}$ with initial points $\mathbf{u}_{0},\mathbf{w}_{0}$ satisfying (with radius $r=\mathscr{S}/(\kappa\cdot\ln(\frac{d\kappa}{\delta}))$ ):

[TABLE]

Then, for any stepsize $\eta\leq c_{\max}/L$ , and any $T\geq\frac{1}{c_{\max}}\mathscr{T}$ , we have:

[TABLE]

Proof

Without loss of generality, let $\tilde{\mathbf{x}}=0$ be the origin. Let $(c^{(2)}_{\max},\hat{c})$ be the absolute constants for which Lemma 4 holds, and let $c^{(1)}_{\max}$ be the absolute constant for which Lemma 1 holds given this choice of $\hat{c}$ . We choose $c_{\max}\leq\min\{c^{(1)}_{\max},c^{(2)}_{\max}\}$ so that the learning rate $\eta\leq c_{\max}/L$ is small enough for both Lemma 1 and Lemma 4 to hold. Let $T^{\star}\mathrel{\mathop{:}}=\hat{c}\mathscr{T}$ and define:

[TABLE]

We consider the following two cases:

Case $T^{\prime}\leq T^{\star}$ :

In this case, by Lemma 1, we know $\|{\mathbf{u}_{T^{\prime}-1}}\|\leq O(\mathscr{S})$ , and therefore

[TABLE]

By choosing $c_{\max}$ small enough and $\eta\leq c_{\max}/L$ , this gives:

[TABLE]

The first and second inequalities use the Hessian Lipschitz property of the smooth function $f$ , together with $\|\mathbf{u}_{0}-\tilde{\mathbf{x}}\|\leq O(\mathscr{S})$ and $\|\mathbf{u}_{T^{\prime}}-\mathbf{u}_{0}\|\leq O(\mathscr{S})$ . Choosing $c_{\max}\leq\min\{1,\frac{1}{\hat{c}}\}$ gives $\eta\leq\frac{1}{L}$ , so by the sufficient decrease lemma for proximal descent, each proximal descent iteration decreases the function value. Therefore, for any $T\geq\frac{1}{c_{\max}}\mathscr{T}\geq\hat{c}\mathscr{T}=T^{\star}\geq T^{\prime}$ , we have:

[TABLE]

Case $T^{\prime}>T^{\star}$ :

In this case, by Lemma 1, we know $\|{\mathbf{u}_{t}}\|\leq O(\mathscr{S})$ for all $t\leq T^{\star}$ . Define

[TABLE]

By Lemma 4, we immediately have $T^{\prime\prime}\leq T^{\star}$ . Applying the same argument as in the case $T^{\prime}\leq T^{\star}$ , we have for all $T\geq\frac{1}{c_{\max}}\mathscr{T}$ that $f(\mathbf{w}_{T})+g(\mathbf{w}_{T})-f(\mathbf{w}_{0})-g(\mathbf{w}_{0})\leq f(\mathbf{w}_{T^{\star}})+g(\mathbf{w}_{T^{\star}})-f(\mathbf{w}_{0})-g(\mathbf{w}_{0})\leq-2.7\mathscr{F}$ .

3.5 Main Lemma

Lemma 6 (Main Lemma)

There exists a universal constant $c_{\max}$ such that, for $f(\cdot)$ satisfying A1 and any $\delta\in(0,\frac{d\kappa}{e}]$ , the following holds. Suppose we start at a point $\tilde{\mathbf{x}}$ satisfying the following conditions:

[TABLE]

Let $\mathbf{x}_{0}=\tilde{\mathbf{x}}+\bm{\xi}$ , where $\bm{\xi}$ is drawn from the uniform distribution over the ball of radius $\mathscr{S}/(\kappa\cdot\ln(\frac{d\kappa}{\delta}))$ , and let $\mathbf{x}_{t}$ be the iterates of gradient descent from $\mathbf{x}_{0}$ . Then, for stepsize $\eta\leq c_{\max}/L$ , with probability at least $1-\delta$ , the following holds for any $T\geq\frac{1}{c_{\max}}\mathscr{T}$ :

[TABLE]
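The perturbation $\bm{\xi}$ in the lemma is uniform over a Euclidean ball. One standard way to draw such a sample is sketched below: a Gaussian direction normalized to unit length, scaled by $r\,U^{1/d}$ with $U$ uniform on $[0,1)$ . The function name `sample_uniform_ball` and the numerical values are illustrative assumptions.

```python
import numpy as np

def sample_uniform_ball(center, radius, rng):
    """Draw one point uniformly from the Euclidean ball B_center(radius)."""
    d = center.size
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)        # uniform on the unit sphere
    scale = radius * rng.uniform() ** (1.0 / d)   # radial law for a uniform ball
    return center + scale * direction

rng = np.random.default_rng(0)
x_tilde = np.zeros(10)            # placeholder for the point being perturbed
r = 0.1                           # placeholder radius
samples = np.array([sample_uniform_ball(x_tilde, r, rng) for _ in range(1000)])
print(np.linalg.norm(samples - x_tilde, axis=1).max())
```

The exponent $1/d$ matters: in high dimension almost all of the ball's volume lies near its surface, so most samples have norm close to $r$ .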

Proof

Denote $T_{\frac{1}{L}}(\mathbf{x})=\mathrm{prox}_{\frac{1}{L}g}\big{[}\mathbf{x}-\frac{1}{L}\nabla f(\mathbf{x})\big{]}$ . The first-order stationarity condition is equivalent to $\|\tilde{\mathbf{x}}-T_{\frac{1}{L}}(\tilde{\mathbf{x}})\|=\|\nabla f(\tilde{\mathbf{x}})+\partial g\big{(}T_{\frac{1}{L}}(\tilde{\mathbf{x}})\big{)}\|\leq\mathscr{G}$ , where $\partial g$ is the subdifferential of the function $g$ .
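For $g(\mathbf{x})=\lambda\|\mathbf{x}\|_{1}$ the proximal operator is componentwise soft-thresholding, so the operator $T_{\frac{1}{L}}$ is easy to write down. The sketch below uses the standard scaling $G(\mathbf{x})=L\,(\mathbf{x}-T_{\frac{1}{L}}(\mathbf{x}))$ for the gradient mapping, which is an assumption here, and a toy quadratic $f(\mathbf{x})=\frac{1}{2}\|\mathbf{x}-\mathbf{b}\|^{2}$ that is not from the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """prox of t * ||.||_1: shrink every entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def T(x, grad_f, L, lam):
    """T_{1/L}(x) = prox_{g/L}(x - grad_f(x)/L) for g = lam * ||.||_1."""
    return soft_threshold(x - grad_f(x) / L, lam / L)

def gradient_mapping(x, grad_f, L, lam):
    """Assumed standard scaling G(x) = L * (x - T_{1/L}(x))."""
    return L * (x - T(x, grad_f, L, lam))

# Toy smooth part f(x) = 0.5 * ||x - b||^2, with gradient x - b and L = 1.
b = np.array([2.0, -0.5, 0.05])
grad_f = lambda x: x - b
lam, L = 0.1, 1.0

x_star = soft_threshold(b, lam)   # minimizer of f + lam * ||.||_1
print(x_star, gradient_mapping(x_star, grad_f, L, lam))
```

At `x_star` the gradient mapping vanishes, which is the exact version of the first-order stationarity condition; the third coordinate is set exactly to zero because $|b_{3}|\leq\lambda$ .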

As $g(\mathbf{x})=\lambda\|\mathbf{x}\|_{1}$ has Lipschitz constant $\lambda$ , we have

[TABLE]

Notice

[TABLE]

Adding the perturbation increases the function value, in the worst case, by:

[TABLE]

where the last inequality follows from the fact that $\lambda\ll\min\{1,l\}$ per equation (7).

On the other hand, let the radius be $r=\frac{\mathscr{S}}{\kappa\cdot\ln(\frac{d\kappa}{\delta})}$ . We know $\mathbf{x}_{0}$ is drawn from the uniform distribution over $\mathbb{B}_{\tilde{\mathbf{x}}}(r)$ . Let $\mathcal{X}_{\text{stuck}}\subset\mathbb{B}_{\tilde{\mathbf{x}}}(r)$ denote the set of bad starting points: if $\mathbf{x}_{0}\in\mathcal{X}_{\text{stuck}}$ , then $\Phi(\mathbf{x}_{T})-\Phi(\mathbf{x}_{0})>-2.7\mathscr{F}$ (the iterates are stuck at a saddle point); otherwise, if $\mathbf{x}_{0}\in\mathbb{B}_{\tilde{\mathbf{x}}}(r)-\mathcal{X}_{\text{stuck}}$ , we have $\Phi(\mathbf{x}_{T})-\Phi(\mathbf{x}_{0})\leq-2.7\mathscr{F}$ .

By applying Lemma 5, we know that for any $\mathbf{x}_{0}\in\mathcal{X}_{\text{stuck}}$ , it is guaranteed that $(\mathbf{x}_{0}\pm\mu r\mathbf{e}_{1})\not\in\mathcal{X}_{\text{stuck}}$ for $\mu\in[\frac{\delta}{2\sqrt{d}},1]$ . Let $I_{\mathcal{X}_{\text{stuck}}}(\cdot)$ denote the indicator function of the set $\mathcal{X}_{\text{stuck}}$ , and write $\mathbf{x}=(x^{(1)},\mathbf{x}^{(-1)})$ , where $x^{(1)}$ is the component along the $\mathbf{e}_{1}$ direction and $\mathbf{x}^{(-1)}$ is the remaining $(d-1)$ -dimensional vector. Let $\mathbb{B}^{(d)}(r)$ denote the $d$ -dimensional ball of radius $r$ . By calculus, this gives an upper bound on the volume of $\mathcal{X}_{\text{stuck}}$ :

[TABLE]

Then, we immediately have the ratio:

[TABLE]

The second-to-last inequality follows from the property of the Gamma function that $\frac{\Gamma(x+1)}{\Gamma(x+1/2)}<\sqrt{x+\frac{1}{2}}$ for $x\geq 0$ . Therefore, with probability at least $1-\delta$ , $\mathbf{x}_{0}\not\in\mathcal{X}_{\text{stuck}}$ . In this case, we have:

[TABLE]

which finishes the proof.
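The volume argument in the proof above can be sanity-checked numerically. The sketch below verifies the Gamma function inequality on a grid and evaluates the resulting volume ratio under one assumption of ours: that $\mathcal{X}_{\text{stuck}}$ is contained in a slab of width at most $\delta r/\sqrt{d}$ along $\mathbf{e}_{1}$ , which is how we read the range of $\mu$ . Under that assumption the ratio is at most $\delta\cdot\Gamma(\frac{d}{2}+1)/(\sqrt{\pi d}\,\Gamma(\frac{d}{2}+\frac{1}{2}))$ . Function names are illustrative.

```python
import math

def gamma_ratio(x: float) -> float:
    """Gamma(x + 1) / Gamma(x + 1/2), computed stably through lgamma."""
    return math.exp(math.lgamma(x + 1) - math.lgamma(x + 0.5))

def stuck_fraction_factor(d: int) -> float:
    """Factor multiplying delta in the assumed slab bound on the volume ratio."""
    return gamma_ratio(d / 2) / math.sqrt(math.pi * d)

# Gamma(x+1)/Gamma(x+1/2) < sqrt(x + 1/2) on a grid of x >= 0.
grid = [0.1 * k for k in range(0, 10001)]
gamma_ok = all(gamma_ratio(x) < math.sqrt(x + 0.5) for x in grid)

# The factor stays below 1 in every dimension tested, so the ratio is <= delta.
factor_ok = all(stuck_fraction_factor(d) <= 1.0 for d in range(1, 1001))
print(gamma_ok, factor_ok, stuck_fraction_factor(1), stuck_fraction_factor(1000))
```

The factor equals $\frac{1}{2}$ for $d=1$ and tends to $1/\sqrt{2\pi}$ as $d$ grows, so the bound $\delta$ on the stuck fraction is not tight, but it is dimension-free.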

3.6 Main Theorem, and its Proof

Lemma 7 (Sufficient Decrease Lemma for Proximal Descent, [3])

Assume the function $f$ is real-valued and lower semi-continuous. Then for any $L\in(\frac{L}{2},\infty)$ where $\eta=\frac{1}{L}$ , we have $\Phi(\mathbf{x}_{t})-\Phi(\mathbf{x}_{t+1})\geq\frac{L-\frac{L}{2}}{L^{2}}\|G_{\frac{1}{L}}(\mathbf{x}_{t})\|^{2}.$
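The sufficient decrease inequality can be checked numerically in its squared-norm form $\Phi(\mathbf{x}_{t})-\Phi(\mathbf{x}_{t+1})\geq\frac{1}{2L}\|G_{\frac{1}{L}}(\mathbf{x}_{t})\|^{2}$ for $\eta=\frac{1}{L}$ , as given in [3]. The sketch below uses a random indefinite quadratic as a stand-in for a non-convex $f$ ; the test problem, the scaling $G(\mathbf{x})=L(\mathbf{x}-\mathbf{x}^{+})$ , and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, lam = 5, 0.01
A = rng.normal(size=(d, d))
A = (A + A.T) / 2                 # symmetric, generally indefinite
L = np.linalg.norm(A, 2)          # gradient Lipschitz constant of f

def phi(x):
    """Phi(x) = f(x) + g(x) with f(x) = 0.5 x^T A x and g = lam * ||x||_1."""
    return 0.5 * x @ A @ x + lam * np.abs(x).sum()

def prox_step(x):
    """One proximal gradient step with step size 1/L."""
    v = x - (A @ x) / L
    return np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)

def decrease_gap(x):
    """(Phi(x) - Phi(x+)) - ||G(x)||^2 / (2L); nonnegative if the lemma holds."""
    x_next = prox_step(x)
    G = L * (x - x_next)
    return (phi(x) - phi(x_next)) - (G @ G) / (2 * L)

gaps = [decrease_gap(rng.normal(size=d)) for _ in range(200)]
print(min(gaps))
```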

3.6.1 Proof of the Main Theorem

Proof

Let $\tilde{c}_{\max}$ be the absolute constant allowed in Lemma 6 when it is given the following parameters: $\eta=\frac{c}{L}$ , $\gamma=\sqrt{\rho\varepsilon}$ , and $\delta=\frac{dL}{\sqrt{\rho\varepsilon}}\mathrm{e}^{-\chi}$ . In this theorem, we let $c_{\max}=\min\{\tilde{c}_{\max},1/2\}$ , and choose any constant $c\leq c_{\max}$ .

In this proof, we will in fact obtain a point satisfying the following condition:

[TABLE]

Since $c\leq 1$ and $\chi\geq 1$ , we have $\frac{\sqrt{c}}{\chi^{2}}\leq 1$ , which implies that any $\mathbf{x}$ satisfying Eq. (19) is also an $\varepsilon$ -second-order stationary point.

Starting from $\mathbf{x}_{0}$ , if $\mathbf{x}_{0}$ does not satisfy Eq. (19), there are only two possibilities:

1. $\|{G(\mathbf{x}_{0})}\|>g_{\text{thres}}$ : In this case, Algorithm 1 will not add a perturbation. By Lemma 7:

[TABLE]

2. $\|{G(\mathbf{x}_{0})}\|\leq g_{\text{thres}}$ : In this case, Algorithm 1 will add a perturbation of radius $r$ , and will perform proximal gradient descent (without perturbations) for the next $t_{\text{thres}}$ steps. Algorithm 1 will then check the termination condition. If the condition is not met, we must have:

[TABLE]

This means on average every step decreases the function value by

[TABLE]

In case 1, we can repeat this argument for $t=1$ , and in case 2, we can repeat this argument for $t=t_{\text{thres}}$ . Hence, as long as Algorithm 1 has not terminated, on average every step decreases the function value by at least $\frac{c^{3}}{\chi^{4}}\cdot\frac{\varepsilon^{2}}{L}$ . However, we clearly cannot decrease the function value by more than $\Phi(\mathbf{x}_{0})-\Phi^{\star}$ , where $\Phi^{\star}$ is the function value at the global minimum. This means Algorithm 1 must terminate within the following number of iterations:

[TABLE]
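The counting argument is simple arithmetic: a total budget of $\Phi(\mathbf{x}_{0})-\Phi^{\star}$ divided by the guaranteed average per-step decrease $\frac{c^{3}}{\chi^{4}}\cdot\frac{\varepsilon^{2}}{L}$ . A minimal sketch, with a hypothetical helper `max_iterations` and placeholder parameter values:

```python
import math

def max_iterations(delta_phi, L, eps, c, chi):
    """Budget Phi(x_0) - Phi* divided by the average per-step decrease."""
    per_step_decrease = (c**3 / chi**4) * eps**2 / L
    return delta_phi / per_step_decrease

# Placeholder values: Phi(x_0) - Phi* = 1, L = 1, eps = 0.1, c = 0.5, chi = 12.
bound = max_iterations(delta_phi=1.0, L=1.0, eps=0.1, c=0.5, chi=12.0)
print(bound)
```

With these values the bound is $\chi^{4}L\,(\Phi(\mathbf{x}_{0})-\Phi^{\star})/(c^{3}\varepsilon^{2})=16{,}588{,}800$ iterations, which shows how strongly the constants $c$ and $\chi$ enter the worst-case count.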

Finally, we would like to ensure that when Algorithm 1 terminates, the point it finds is actually an $\varepsilon$ -second-order stationary point. The algorithm can only terminate when the gradient mapping is small and the function value does not decrease after a perturbation and $t_{\text{thres}}$ iterations. We shall show that every time we add a perturbation to the iterate $\tilde{\mathbf{x}}_{t}$ , if $\lambda_{\min}(\nabla^{2}f(\tilde{\mathbf{x}}_{t}))<-\sqrt{\rho\varepsilon}$ , then we have $\Phi(\mathbf{x}_{t+t_{\text{thres}}})-\Phi(\tilde{\mathbf{x}}_{t})\leq-\Phi_{\text{thres}}$ . Thus, whenever the current point is not an $\varepsilon$ -second-order stationary point, the algorithm cannot terminate.

According to Algorithm 1, we immediately know $\|{G(\tilde{\mathbf{x}}_{t})}\|\leq g_{\text{thres}}$ (otherwise we would not add a perturbation at time $t$ ). By Lemma 6, this event happens with probability at least $1-\frac{dL}{\sqrt{\rho\varepsilon}}e^{-\chi}$ each time. On the other hand, during one entire run of Algorithm 1, the number of times we add perturbations is at most:

[TABLE]

By the union bound, with high probability Lemma 6 holds for all of these perturbations, and as a result Algorithm 1 works correctly. The probability of this is at least

[TABLE]

Recall our choice of $\chi=3\max\{\ln(\frac{dL\Delta_{f}}{c\varepsilon^{2}\delta}),4\}$ . Since $\chi\geq 12$ , we have $\chi^{3}e^{-\chi}\leq e^{-\chi/3}$ , which gives:

[TABLE]

which finishes the proof.
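The elementary inequality $\chi^{3}e^{-\chi}\leq e^{-\chi/3}$ used in the last step is equivalent to $\ln\chi\leq\frac{2\chi}{9}$ , and can be checked on a grid. The sketch below also evaluates the stated choice of $\chi$ for placeholder parameter values; the function names are ours.

```python
import math

def chi_value(d, L, delta_f, c, eps, delta):
    """chi = 3 * max(ln(d * L * Delta_f / (c * eps^2 * delta)), 4)."""
    return 3 * max(math.log(d * L * delta_f / (c * eps**2 * delta)), 4)

def inequality_holds(chi):
    """Check chi^3 * exp(-chi) <= exp(-chi / 3)."""
    return chi**3 * math.exp(-chi) <= math.exp(-chi / 3)

chi = chi_value(d=20, L=1.0, delta_f=1.0, c=0.5, eps=0.1, delta=0.01)
grid_ok = all(inequality_holds(12 + 0.1 * k) for k in range(1000))
print(chi, grid_ok)
```

The maximum with $4$ in the definition guarantees $\chi\geq 12$ regardless of the problem parameters, which is exactly the range on which the inequality is needed.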

Remarks on large $\lambda$

We point out that when $\lambda$ is large enough that the $g$ term alters the local landscape of the objective function $\Phi(\mathbf{x})$ , new local minima are inevitably introduced, and the stability of saddle points may change. We hypothesize that perturbed proximal descent will still converge to an $\varepsilon$ -second-order stationary point regardless of the magnitude of $\lambda$ .

An example of the new local minima introduced by large $\lambda$ is shown in Fig. 3(b). New wrinkles appear on the four legs of the octopus function as $\lambda$ increases from $1$ to $10$ . If an iteration starts in the neighborhood of these creases, it can converge to the bottom of a crease. Fig. 3(c) is an extreme scenario where the original landscape of the octopus function is completely altered to conform to the behavior of the $\ell_{1}$ penalty term.

3.7 From $\varepsilon$ -second-order stationary point to local minimizers

Assumption A3 (Nondegenerate Saddle)

For all stationary points $\mathbf{x}_{c}$ , $\exists\,m>0$ such that $\displaystyle\min_{i=1,2,\cdots,d}|\lambda_{i}(\nabla^{2}f(\mathbf{x}_{c}))|>m>0$ , where $\lambda_{i}$ are the eigenvalues (not to be confused with the parameter $\lambda$ ).

With this nondegenerate saddle assumption, the main theorem can be strengthened to the following corollary. Its proof is immediate: set $\varepsilon$ in the main theorem to $m^{2}/\rho$ and observe that no eigenvalue of $\nabla^{2}f$ lies between $-\sqrt{\rho\varepsilon}$ and the first positive eigenvalue.

Corollary 1

There exists an absolute constant $c_{\max}$ such that if $f(\cdot)$ satisfies assumptions A1, A2 and A3, then for any $\delta>0,\Delta_{\Phi}\geq\Phi(\mathbf{x}_{0})-\Phi^{\star}$ , constant $c\leq c_{\max}$ , and $\varepsilon=\frac{m^{2}}{\rho}$ , with probability $1-\delta$ , the output of $\text{PPD}(\mathbf{x}_{0},L,\rho,\varepsilon,c,\delta,\Delta_{f})$ will be a local minimizer of $f+\lambda\|\mathbf{x}\|_{1}$ , and the algorithm terminates within the following number of iterations:

[TABLE]
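The eigenvalue argument behind the corollary can be spelled out in one line. The display below is a sketch that treats the returned point $\mathbf{x}$ as a stationary point to which A3 applies; that identification is our reading, not a statement made in the text.

```latex
\varepsilon=\frac{m^{2}}{\rho}
\;\Longrightarrow\;
\sqrt{\rho\varepsilon}=m,
\qquad
\lambda_{\min}\big(\nabla^{2}f(\mathbf{x})\big)\geq-\sqrt{\rho\varepsilon}=-m
\ \text{ and }\
\min_{i}\big|\lambda_{i}\big(\nabla^{2}f(\mathbf{x})\big)\big|>m
\;\Longrightarrow\;
\lambda_{\min}\big(\nabla^{2}f(\mathbf{x})\big)>m>0 .
```

In words: the second-order condition excludes eigenvalues below $-m$ , and A3 excludes eigenvalues in $[-m,m]$ , so every eigenvalue of the Hessian is positive.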

4 Numerical Experiment

We set $f$ to be the “octopus” function described in [10] and use perturbed proximal descent to minimize the objective function $\Phi(\mathbf{x})=f(\mathbf{x})+\lambda\|\mathbf{x}\|_{1}$ . Plots of the octopus function defined on $\mathbb{R}^{2}$ for various $\lambda$ are shown in Figure 3.

The “octopus” family of functions is parameterized by $\tau$ , which controls the width of the “legs,” and $M$ and $\gamma$ which characterize how sharp each side is surrounding a saddle point, related to the Lipschitz constant. The example illustrated in Fig. 3 uses parameters $M=\mathrm{e},\gamma=1,\tau=\mathrm{e}$ .

We are interested in the octopus family of functions because it can be generalized to any dimension $d$ , and it has $d-1$ saddle points (not counting the origin) which are known to slow down standard gradient descent algorithms. The usual minimization iteration sequence, if starting at the maximum value of the octopus function, will successively go through each saddle point before reaching the global minimum, thus rendering the iteration progress easy to track and visualize.

Specifics of Octopus Function

We define the octopus function in the first quadrant of $\mathbb{R}^{d}$ ; by even reflection, it is then extended to all other quadrants.

Define the auxiliary gluing functions as

[TABLE]

Define the gluing function and gluing balance constant respectively as

[TABLE]

For a given $i=1,\cdots,d-1$ , when $6\tau\geq x_{1},\cdots,x_{i-1}\geq 2\tau,\tau\geq x_{i}\geq 0,\tau\geq x_{i+1},\cdots,x_{d}\geq 0$

[TABLE]

and if $6\tau\geq x_{1},\cdots,x_{i-1}\geq 2\tau,2\tau\geq x_{i}\geq\tau,\tau\geq x_{i+1},\cdots,x_{d}\geq 0$ , we have

[TABLE]

and for $i=d$ , if $6\tau\geq x_{1},\cdots,x_{d-1}\geq 2\tau,\tau\geq x_{d}\geq 0$

[TABLE]

and if $6\tau\geq x_{1},\cdots,x_{d-1}\geq 2\tau,2\tau\geq x_{d}\geq\tau$

[TABLE]

and if $6\tau\geq x_{1},\cdots,x_{d}\geq 2\tau$ ,

[TABLE]

Remark

All saddle points occur at $(\pm 4\tau,\pm 4\tau,\cdots,\pm 4\tau,0,0,\cdots,0)$ , and the global minimum is at $(\pm 4\tau,\cdots,\pm 4\tau)$ . Regions of the form $[2\tau,6\tau]\times\cdots\times[2\tau,6\tau]\times[\tau,2\tau]\times[0,\tau]\times\cdots\times[0,\tau]$ are transition zones described by the gluing functions, which connect separate pieces to make $f$ a continuous function. The octopus function can be constructed first in the first quadrant and then extended to all other quadrants by even reflection. A typical descent algorithm applied to the octopus generates iterations that take multiple turns, like walking down a spiral staircase, each staircase leading to a new dimension.
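The locations in the remark can be enumerated directly. The sketch below lists the first-quadrant saddle points (the first $k$ coordinates equal to $4\tau$ , the rest zero, for $k=1,\cdots,d-1$ ) and the first-quadrant global minimum; reflections give the remaining ones. The function names are illustrative and the octopus function itself is not implemented here.

```python
def first_quadrant_saddles(d, tau):
    """Saddle points (4*tau, ..., 4*tau, 0, ..., 0) with k = 1, ..., d-1 nonzeros."""
    return [tuple([4 * tau] * k + [0.0] * (d - k)) for k in range(1, d)]

def first_quadrant_minimum(d, tau):
    """Global minimizer (4*tau, ..., 4*tau) in the first quadrant."""
    return tuple([4 * tau] * d)

saddles = first_quadrant_saddles(d=5, tau=1.0)
print(len(saddles), saddles[0], first_quadrant_minimum(5, 1.0))
```

This matches the count of $d-1$ saddle points (not counting the origin) that a descent sequence passes before reaching the minimum.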

4.1 Results

We apply perturbed proximal descent (PPD) to the octopus function plus $0.01\|\mathbf{x}\|_{1}$ for dimensions $d=2,5,10,20$ . We set the constant $c=3$ . For comparison, we apply perturbed gradient descent (PGD) as well, since $\|\mathbf{x}\|_{1}$ is differentiable almost everywhere; for both algorithms, the norm of the perturbation $\bm{\xi}$ is $0.1$ .

We see that PPD successfully finds the local minimum in the first three cases within 1000 iterations, and in the case of $d=20$ , PPD almost finds the local minimum within 1000 iterations. In contrast, unperturbed proximal descent (PD), gradient descent (GD), and perturbed gradient descent (PGD) sequences are trapped near saddle points.
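To illustrate the qualitative behavior reported here, the following is a minimal sketch of a perturbed proximal descent loop on a two-dimensional toy problem. It is not the octopus experiment and not necessarily the paper's Algorithm 1: the toy function $f(x,y)=\frac{1}{2}x^{2}-\frac{1}{2}y^{2}+\frac{1}{4}y^{4}$ , the thresholds `g_thres`, `t_thres`, `phi_thres`, the step size, and the uniform-ball perturbation are all assumptions chosen for illustration.

```python
import numpy as np

LAM = 0.01   # l1 weight, matching the 0.01 used in the experiments
ETA = 0.1    # step size; assumed to be below 1/L on the region visited

def phi(z):
    """Toy objective: f has a saddle at the origin and minima at y = +-1."""
    x, y = z
    return 0.5 * x**2 - 0.5 * y**2 + 0.25 * y**4 + LAM * np.abs(z).sum()

def grad_f(z):
    x, y = z
    return np.array([x, -y + y**3])

def prox_step(z):
    """One proximal gradient step: gradient step, then soft-thresholding."""
    v = z - ETA * grad_f(z)
    return np.sign(v) * np.maximum(np.abs(v) - ETA * LAM, 0.0)

def grad_map_norm(z):
    return np.linalg.norm(z - prox_step(z)) / ETA

def ppd(z0, rng, g_thres=1e-3, t_thres=100, phi_thres=1e-3,
        radius=0.1, max_iter=2000):
    z = np.asarray(z0, dtype=float)
    t_noise, z_tilde = -t_thres - 1, None
    for t in range(max_iter):
        # Perturb when the gradient mapping is small and no recent perturbation.
        if grad_map_norm(z) <= g_thres and t - t_noise > t_thres:
            z_tilde, t_noise = z.copy(), t
            xi = rng.normal(size=z.size)
            xi *= radius * rng.uniform() ** (1.0 / z.size) / np.linalg.norm(xi)
            z = z + xi
        # Termination check, t_thres steps after the last perturbation.
        if t - t_noise == t_thres and phi(z) - phi(z_tilde) > -phi_thres:
            return z_tilde
        z = prox_step(z)
    return z

# Unperturbed proximal descent started exactly at the saddle never moves.
stuck = prox_step(np.zeros(2))
# A point slightly off the saddle, as after a perturbation, slides to a minimum.
escaped = np.array([0.0, 0.05])
for _ in range(500):
    escaped = prox_step(escaped)
result = ppd(np.zeros(2), np.random.default_rng(0))
print(stuck, escaped, result, phi(result))
```

Note that with $\lambda>0$ the $\ell_{1}$ term creates a small basin of width about $\lambda$ around the origin in this toy problem, so a perturbation whose $y$ component is smaller than $\lambda$ does not escape; this is one way to see why the perturbation radius must be large relative to $\lambda$ .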

5 Conclusion

This paper provides an algorithm to minimize a non-convex function plus an $\ell_{1}$ penalty of small magnitude, with a probabilistic guarantee that the returned result is an approximate second-order stationary point, and hence, for a large class of functions, a local minimum instead of a saddle point. The iteration complexity is $\mathcal{O}(\varepsilon^{-2})$ , and the result depends on the dimension as $\mathcal{O}(\ln^{4}d)$ .

A limitation of the result is that the magnitude of the $\ell_{1}$ penalty must be small for our theoretical result to hold. We also note that a large $\lambda$ creates new local minima in the objective function, altering the original landscape. Our future work will address the case of large $\lambda$ in the iteration process.

Bibliography

[1] H. Attouch, J. Bolte, and B.F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming, pages 1–39, 2011.
[2] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer-Verlag, New York, 2 edition, 2017.
[3] A. Beck. First-Order Methods in Optimization. MOS-SIAM Series on Optimization, 2017.
[4] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Prog., 146(1-2):459–494, 2014.
[5] R.I. Bot, E.R. Csetnek, and D-K Nguyen. A proximal minimization algorithm for structured nonconvex and nonsmooth problems. arXiv preprint arXiv:1805.11056v1 [math.OC], 2018.
[6] Y. Carmon, J. Duchi, O. Hinder, and A. Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751–1772, 2018.
[7] P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. SIAM Multiscale Model. Simul., 4(4):1168–1200, 2005.
[8] F.E. Curtis, D.P. Robinson, and M. Samadi. A trust region algorithm with a worst-case iteration complexity of $\mathcal{O}(\epsilon^{-3/2})$ for nonconvex optimization. Mathematical Programming, 162(1):1–32, Mar 2017.
