Perturbed Proximal Descent to Escape Saddle Points for Non-convex and Non-smooth Objective Functions
Zhishen Huang, Stephen Becker

TL;DR
This paper introduces a novel algorithm for non-convex, non-smooth optimization that effectively escapes saddle points, extending previous results from smooth to non-smooth settings.
Contribution
It provides the first known theoretical results for escaping saddle points in non-smooth optimization using a perturbed proximal descent method.
Findings
First theoretical guarantees for non-smooth saddle point escape
Algorithm successfully finds local minima in non-smooth problems
Extends saddle point analysis to non-smooth optimization
Abstract
We consider the problem of finding local minimizers in non-convex and non-smooth optimization. Under the assumption of strict saddle points, positive results have been derived for first-order methods. We present the first known results for the non-smooth case, which requires different analysis and a different algorithm.
Dept. of Applied Math., University of Colorado, Boulder, USA. Email: {zhishen.huang,stephen.becker}@colorado.edu
ORCID (Stephen Becker): 0000-0002-1932-8159
This is the extended version of the paper that contains the proofs.
Keywords: Saddle points; proximal gradient descent; non-smooth optimization.
1 Introduction
We consider the problem of finding approximate local minimizers of the problem
$$\min_{\mathbf{x}} \; f(\mathbf{x}) + g(\mathbf{x}) \qquad (1)$$
where $f$ is not convex but smooth (and with full domain), and $g$ is convex but not smooth. Many optimization problems in engineering, signal processing and machine learning can be cast in this framework, where $f$ is a smooth loss function, and $g$ is a non-smooth regularizer such as the $\ell_1$ norm. For example, our model captures $\ell_1$-regularized neural networks [11], where the regularization can induce sparsity as an alternative to dropout. In this paper, for simplicity we restrict our discussion to $g(\mathbf{x}) = \lambda\|\mathbf{x}\|_1$, where $\lambda$ is a constant, but many of the results apply to more general choices of $g$. The first-order condition is $0 \in \nabla f(\mathbf{x}) + \partial g(\mathbf{x})$, and any $\mathbf{x}$ satisfying this condition is called a “stationary point” (see [2] for background on the subdifferential $\partial g$). All local minimizers are stationary points, but not vice-versa. We define a “saddle point” to be any stationary point where the Hessian $\nabla^2 f$ is indefinite (and therefore not a local minimizer). This paper extends a recent line of work [13] to analyze when we can expect to find a local minimizer. It has been argued that in many machine learning problems, finding any local minimizer is often enough for good performance, but finding a saddle point is not useful [9].
The fact that $g$ is non-smooth is crucially important: it does more than just complicate the analysis, as it also requires a new algorithm. In the smooth case, the objective is often minimized using gradient descent or an accelerated variant [16] with a fixed stepsize. Naïvely extending gradient descent to apply to (1) leads to subgradient descent with a fixed stepsize. Unfortunately, this method fails to converge, as the example in [18] shows: for a generic choice of the initial point, the sequence of iterates is not Cauchy.
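The failure of fixed-stepsize subgradient descent can be seen in a minimal sketch. The setting below ($f = 0$ and $g(x) = |x|$ in one dimension) is an assumption chosen for illustration and may differ from the example in [18]; the function name and the values of `x0` and `eta` are hypothetical.

```python
import numpy as np

def subgradient_descent_abs(x0, eta, steps):
    """Fixed-stepsize subgradient descent on g(x) = |x| (assumed setting, f = 0)."""
    xs = [x0]
    x = x0
    for _ in range(steps):
        # np.sign(x) is a subgradient of |x|; at x = 0 it returns 0, which is valid.
        x = x - eta * np.sign(x)
        xs.append(x)
    return np.array(xs)

xs = subgradient_descent_abs(x0=0.37, eta=0.1, steps=50)
# Distance between consecutive iterates over the last few steps.
tail_gaps = np.abs(np.diff(xs[-10:]))
print(xs[-4:], tail_gaps)
```

For a generic starting point the iterates never land exactly on the minimizer, so every step has length `eta` and the sequence keeps jumping across zero instead of settling.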
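Instead of gradient descent, we use a perturbed version of proximal gradient descent. For a real-valued convex lower semi-continuous function $g$, define the “proximity” operator (or “prox” for short) as the map $\mathrm{prox}_g(\mathbf{y}) = \arg\min_{\mathbf{x}} \, g(\mathbf{x}) + \frac{1}{2}\|\mathbf{x}-\mathbf{y}\|^2$ (throughout the paper, for vectors we use $\|\cdot\|$ to denote the Euclidean norm). Equivalently, $\mathrm{prox}_g = (I+\partial g)^{-1}$, and thus the first-order condition is equivalent to $\mathbf{x} = \mathrm{prox}_{\eta g}\big(\mathbf{x}-\eta\nabla f(\mathbf{x})\big)$ for any $\eta>0$. Proximal gradient descent is the iteration $\mathbf{x}_{t+1} = \mathrm{prox}_{\eta g}\big(\mathbf{x}_t-\eta\nabla f(\mathbf{x}_t)\big)$, so it immediately follows that if the sequence $(\mathbf{x}_t)$ converges, it converges to a stationary point. Convergence of the sequence is known to follow from mild assumptions on $f$ and $g$, the stepsize $\eta$, and boundedness of the sequence [1].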
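As a small illustration of the proximal gradient iteration, the sketch below assumes $g = \lambda\|\cdot\|_1$, whose prox is soft thresholding, and uses a hypothetical smooth term $f(\mathbf{x}) = \frac12\|\mathbf{x}-\mathbf{b}\|^2$ picked only because its minimizer is known in closed form. The names `soft_threshold`, `prox_grad_step` and `b` are not from the paper.

```python
import numpy as np

def soft_threshold(y, tau):
    """Prox of tau * ||.||_1: shrink each entry toward zero by tau."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def prox_grad_step(x, grad_f, eta, lam):
    """One proximal gradient step for f + lam * ||.||_1 with stepsize eta."""
    return soft_threshold(x - eta * grad_f(x), eta * lam)

# Illustrative smooth term f(x) = 0.5 * ||x - b||^2 (an assumption for the demo).
b = np.array([3.0, -0.2, 0.5])
grad_f = lambda x: x - b
lam, eta = 1.0, 0.5

x = np.zeros(3)
for _ in range(100):
    x = prox_grad_step(x, grad_f, eta, lam)

# For this f, the minimizer of f + lam*||.||_1 is soft_threshold(b, lam).
x_star = soft_threshold(b, lam)
print(x, x_star)
```

The limit is a fixed point of the prox-gradient map, which is the first-order condition stated above.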
We define a second-order stationary point to be a first-order stationary point that additionally satisfies $\nabla^2 f(\mathbf{x}) \succ 0$, which is a sufficient condition for $\mathbf{x}$ to be a local minimizer. Our main contribution is showing that under suitable assumptions, a perturbed version of proximal gradient descent will generate a sequence that converges to an approximate second-order stationary point. We make assumptions on the second-order behavior of $f$, similar to assumptions under which it is known that gradient descent will always converge to a second-order stationary point except for adversarially chosen starting points [14] (in contrast to Newton’s method, which is attracted to all stationary points). However, even in the smooth case when the sequence converges, gradient descent converges arbitrarily slowly [10] in the presence of a saddle point, so perturbation is necessary. In the non-smooth case, perturbation is even more important due to the proximal nature of the algorithm.
A toy example: Gaussian Bump
Consider the function where is the Huber function with parameter 100 [3]. The choice of this combination of Huber parameter and the magnitude of the Huber function ensures that the origin is a saddle point. The Huber function approximates the $\ell_1$ norm. The plot is shown in Fig. 2.
This function has two local minima and a saddle point at the origin. Because the Huber function is both smooth and has a known proximity operator, we can treat it as either part of the smooth component or the non-smooth component, and therefore run either gradient descent or proximal gradient descent. We experiment with both algorithms, randomly picking initial points at where is sampled uniformly from , and varying the stepsize $\eta$, with the maximum number of iterations fixed at 1000. Figure 2 shows the empirical success rate of finding a local minimizer (as opposed to converging to the saddle point at the origin).
We observe that the range of stable stepsizes for the proximal descent algorithm is wider than that of gradient descent, and the success rate of proximal descent is as high as that of gradient descent. This example motivates us to adopt proximal descent over gradient descent in real applications for better stability and equivalent, if not better, accuracy.
A coincidence
In this toy example, the saddle point at the origin happens to be a fixed point of the proximal operator of the non-smooth term. Soft thresholding, as the proximal operator of the $\ell_1$ norm is known [7], has an attracting region that sets nearby points to $0$. The radius of the attracting region (per dimension) is , thus if for some iteration , then for all . Proximal gradient descent performs even better when the saddle point is not in the attracting region.
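The attracting region of soft thresholding can be made concrete with a short sketch. It assumes $g = \lambda\|\cdot\|_1$ with stepsize $\eta$, so that the per-coordinate threshold is $\eta\lambda$, and uses a hypothetical quadratic $f(\mathbf{x}) = \frac12(x_1^2 - x_2^2)$ with a saddle at the origin rather than the Gaussian bump of the toy example.

```python
import numpy as np

def soft_threshold(y, tau):
    """Prox of tau * ||.||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

# Illustrative f(x) = 0.5 * (x1^2 - x2^2): the origin is a saddle of f.
grad_f = lambda x: np.array([x[0], -x[1]])
eta, lam = 0.1, 0.1  # assumed values; the soft-threshold level is eta * lam = 0.01

def step(x):
    """One proximal gradient step on f + lam * ||.||_1."""
    return soft_threshold(x - eta * grad_f(x), eta * lam)

x_near = np.array([0.005, 0.005])  # inside the attracting region
x_far = np.array([0.0, 0.5])       # outside it, along the negative-curvature direction

trapped = step(x_near)   # every coordinate is thresholded to zero
escaped = step(x_far)    # moves away from the saddle along x2
print(trapped, escaped)
```

A perturbation therefore has to be large enough to leave this region, otherwise the next proximal step simply returns the iterate to the saddle.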
Structure of the paper
Section 2 states the algorithm, followed by section 3 where the theoretical guarantee is presented with proof. Section 4 shows numerical experiments.
1.1 Related literature
Second order methods for smooth objectives
Some recent second order methods, mainly based on either cubic-regularized Newton methods as in [17] or on trust-region methods (as in Curtis et al. [8]), have been shown to converge to $\varepsilon$-approximate local minimizers of smooth non-convex objective functions in $\mathcal{O}(\varepsilon^{-3/2})$ iterations. See [6, 13, 21] for a more thorough review of these methods. We do not consider these methods further due to the high cost of solving for the Newton step in large dimensions.
First order methods for smooth objectives
We focus on first order methods because each step is cheaper and these methods are more frequently adopted by the deep learning community. Xu et al. in [20] and Allen-Zhu et al. in [21] develop Negative-Curvature (NC) search algorithms, which find a descent direction corresponding to negative eigenvalues of the Hessian matrix. The NC search routines avoid using either Hessian or Hessian-vector information directly, and they can be applied in both online and deterministic scenarios. In the online setting, combining an NC search routine with first-order stochastic methods gives the algorithms NEON- [20] and NEON2+SGD [21] with iteration cost and respectively (the latter still depends on dimension, whose induced complexity is at least ), and these methods generate a sequence that converges to an approximate local minimum with high probability. In the offline setting, Jin et al. in [13] provide a stochastic first order method that finds an approximate local minimizer with high probability at computational cost . Combining NEON2 with gradient descent or SVRG, the cost to find an approximate local minimum is , whose dependence on dimension is not specified but at least . These methods make Lipschitz continuity assumptions about the gradient and Hessian, so they do not apply to non-smooth optimization.
A recent preprint [15] approaches the problem of finding local minima using the forward-backward envelope technique developed in [19], where the assumption about the smoothness of objective function is weakened to local smoothness instead of global smoothness.
Non-smooth objectives
In the offline setting, Boţ et al. propose a proximal algorithm for minimizing non-convex and non-smooth objective functions in [5]. They show convergence to KKT points instead of approximate second-order stationary points. Other work [1, 4] relies on the Kurdyka-Łojasiewicz inequality and shows convergence to stationary points in the sense of the limiting subdifferential, which is not the same as a local minimizer or approximate second-order stationary point. In the online setting, Reddi et al. demonstrated in [12] that proximal descent with the variance reduction technique (proxSVRG) has linear convergence to a first-order stationary point, but not to a local minimizer.
2 Algorithm
The algorithm takes as input a starting vector $\mathbf{x}_0$, the gradient Lipschitz constant $L$, the Hessian Lipschitz constant $\rho$, the second-order stationary point tolerance $\varepsilon$, a positive constant $c$, a failure probability $\delta$, and an estimated function value gap. The key parameter for Algorithm 1 is the constant $c$. It should be large enough that the perturbation is significant enough to escape saddle points, but not so large that the stepsize leaves a reasonable range and the iteration becomes unstable. The output of the algorithm is an $\varepsilon$-second-order stationary point (see Def. 3).
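The overall control flow of a perturbed proximal descent loop can be sketched as follows. This is a hedged illustration, not the paper's Algorithm 1: the thresholds `g_thres`, `f_thres`, `t_thres` and the radius `r` are passed in directly as hypothetical parameters instead of being derived from $L$, $\rho$, $\varepsilon$, $c$ and the failure probability, and the test function `f` is an assumption chosen to have a saddle at the origin.

```python
import numpy as np

def soft_threshold(y, tau):
    """Prox of tau * ||.||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def perturbed_prox_descent(x0, f, grad_f, lam, eta, g_thres, f_thres, t_thres, r,
                           max_iter=10_000, seed=0):
    """Sketch: proximal descent on f + lam*||.||_1 with occasional random perturbations."""
    rng = np.random.default_rng(seed)
    phi = lambda x: f(x) + lam * np.abs(x).sum()
    x = np.asarray(x0, dtype=float)
    t_noise, x_saved = -t_thres - 1, None
    for t in range(max_iter):
        x_next = soft_threshold(x - eta * grad_f(x), eta * lam)
        grad_map = (x - x_next) / eta
        # Perturb when the gradient mapping is small and no recent perturbation was made.
        if np.linalg.norm(grad_map) <= g_thres and t - t_noise > t_thres:
            x_saved, t_noise = x.copy(), t
            xi = rng.normal(size=x.shape)
            xi *= r * rng.uniform() ** (1.0 / x.size) / np.linalg.norm(xi)  # uniform in ball
            x = x_saved + xi
            x_next = soft_threshold(x - eta * grad_f(x), eta * lam)
        # t_thres steps after a perturbation: stop if the objective did not drop enough.
        if x_saved is not None and t - t_noise == t_thres and phi(x) - phi(x_saved) > -f_thres:
            return x_saved
        x = x_next
    return x

# Illustrative f with a saddle at the origin and minima near x2 = +/- 1.
f = lambda x: 0.5 * x[0] ** 2 - 0.5 * x[1] ** 2 + 0.25 * x[1] ** 4
grad_f = lambda x: np.array([x[0], -x[1] + x[1] ** 3])
x_out = perturbed_prox_descent(np.zeros(2), f, grad_f, lam=1e-3, eta=0.1,
                               g_thres=1e-3, f_thres=1e-2, t_thres=200, r=0.5)
print(x_out)
```

Started exactly at the saddle, unperturbed proximal descent would never move; here the first perturbation lets the iterates descend, and the run stops once a later perturbation fails to produce a further decrease.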
3 Escaping Saddle Points through Perturbed Proximal Descent
The main step in the algorithm is a proximal gradient descent step applied to the objective $f+g$, defined as
$$\mathbf{x}_{t+1}=\mathrm{prox}_{\eta g}\big[\mathbf{x}_{t}-\eta\nabla f(\mathbf{x}_{t})\big] \qquad (3)$$
One motivation for preferring proximal descent to gradient descent, as shown in Figure 2, is the stability of the algorithm with respect to changes in stepsize. The proximal step is similar to the implicit/backward Euler scheme, as equation (3) can be written as $\mathbf{x}_{t+1}=\mathbf{x}_{t}-\eta\big(\nabla f(\mathbf{x}_{t})+\partial g(\mathbf{x}_{t+1})\big)$. From this perspective, we expect that proximal descent will demonstrate at least the same convergence speed as gradient descent and stronger stability with respect to hyperparameter settings.
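The backward-Euler reading of the proximal step can be checked numerically. The sketch below assumes $g = \lambda\|\cdot\|_1$ and an illustrative indefinite quadratic $f$; it recovers the vector $\mathbf{s}$ with $\mathbf{x}_{t+1} = \mathbf{x}_t - \eta(\nabla f(\mathbf{x}_t) + \mathbf{s})$ and verifies that $\mathbf{s}$ is a subgradient of $g$ at the new point $\mathbf{x}_{t+1}$.

```python
import numpy as np

def soft_threshold(y, tau):
    """Prox of tau * ||.||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

lam, eta = 0.5, 0.2  # assumed values for the demo
# Gradient of the illustrative f(x) = 0.5*x1^2 - 0.5*x2^2 + x3^2.
grad_f = lambda x: np.array([x[0], -x[1], 2.0 * x[2]])

x = np.array([1.0, -0.05, 0.3])
x_next = soft_threshold(x - eta * grad_f(x), eta * lam)

# Solve x_next = x - eta * (grad_f(x) + s) for s.
s = (x - x_next) / eta - grad_f(x)

# s must lie in the subdifferential of lam*||.||_1 at x_next:
# lam * sign(x_next_i) on nonzero entries, and in [-lam, lam] on zero entries.
nonzero = x_next != 0
print(x_next, s)
```

The subgradient is evaluated at the new iterate rather than the old one, which is what makes the step implicit in $g$.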
Definition 1 (Gradient Mapping)
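Consider a function $f+g$. The gradient mapping is defined as
$$G^{f,g}_{\eta}(\mathbf{x})=\frac{1}{\eta}\Big(\mathbf{x}-\mathrm{prox}_{\eta g}\big[\mathbf{x}-\eta\nabla f(\mathbf{x})\big]\Big).$$
In the rest of this paper, the super- and subscript of the gradient mapping are not specified, as it is always clear that $f$ represents the smooth nonconvex part of the objective, $g$ represents the non-smooth term, and $\eta$ is the stepsize used in the algorithm. Observe that the gradient map is just the gradient of $f$ if $g \equiv 0$.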
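A minimal sketch of the gradient mapping follows, assuming the standard definition $G(\mathbf{x}) = \frac{1}{\eta}\big(\mathbf{x} - \mathrm{prox}_{\eta g}[\mathbf{x} - \eta\nabla f(\mathbf{x})]\big)$ from [3] with $g = \lambda\|\cdot\|_1$. The function names and the test problem $f(\mathbf{x}) = \frac12\|\mathbf{x}-\mathbf{b}\|^2$ are hypothetical.

```python
import numpy as np

def soft_threshold(y, tau):
    """Prox of tau * ||.||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def gradient_mapping(x, grad_f, eta, lam):
    """Gradient mapping of f + lam*||.||_1 at x with stepsize eta (assumed definition)."""
    return (x - soft_threshold(x - eta * grad_f(x), eta * lam)) / eta

b = np.array([3.0, -0.2])
grad_f = lambda x: x - b          # gradient of the illustrative f(x) = 0.5*||x - b||^2
x = np.array([0.7, -1.1])

# With no non-smooth term (lam = 0) the gradient mapping equals the gradient of f.
g_smooth = gradient_mapping(x, grad_f, eta=0.5, lam=0.0)

# At the minimizer of f + ||.||_1, which is [2, 0] for this f, the mapping vanishes.
g_at_min = gradient_mapping(np.array([2.0, 0.0]), grad_f, eta=0.5, lam=1.0)
print(g_smooth, g_at_min)
```

The norm of this quantity plays the role that the gradient norm plays in the smooth setting when deciding whether an iterate is close to stationary.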
Definition 2 (First order stationary points)
For a function , define first order stationary points as the points which satisfy
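Definition 3 ($\varepsilon$-second-order stationary point)
Consider a function $f+g$. A point $\mathbf{x}$ is an $\varepsilon$-second-order stationary point if
$$\|G(\mathbf{x})\|\leq\varepsilon \quad\text{and}\quad \lambda\big(\nabla^{2}f(\mathbf{x})\big)_{\min}\geq-\sqrt{\rho\varepsilon},$$
where $G$ is the gradient mapping of Definition 1 and $\lambda(\cdot)_{\min}$ is the smallest eigenvalue.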
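A check for approximate second-order stationarity might look like the sketch below. It assumes the two conditions are a small gradient mapping, $\|G(\mathbf{x})\| \le \varepsilon$, and a Hessian eigenvalue bound, $\lambda_{\min}(\nabla^2 f(\mathbf{x})) \ge -\sqrt{\rho\varepsilon}$; the second is the condition recalled in the Remark after the main theorem, while the first is an assumption modelled on the smooth case in [13]. The function name and test matrices are hypothetical.

```python
import numpy as np

def is_second_order_stationary(grad_map, hessian, eps, rho):
    """Return True if ||grad_map|| <= eps and lambda_min(hessian) >= -sqrt(rho * eps)."""
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    lam_min = np.linalg.eigvalsh(hessian)[0]
    return bool(np.linalg.norm(grad_map) <= eps and lam_min >= -np.sqrt(rho * eps))

eps, rho = 1e-2, 1.0              # so the curvature tolerance is -sqrt(rho*eps) = -0.1
zero_grad_map = np.zeros(2)

strict_saddle = np.diag([1.0, -1.0])     # strongly negative curvature: rejected
nearly_flat = np.diag([1.0, -0.05])      # curvature within the tolerance: accepted

at_saddle = is_second_order_stationary(zero_grad_map, strict_saddle, eps, rho)
at_flat = is_second_order_stationary(zero_grad_map, nearly_flat, eps, rho)
print(at_saddle, at_flat)
```

Forming and diagonalizing the Hessian is only practical in small dimensions; the algorithm itself never does this and relies on perturbations instead.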
The first Lipschitz assumption below is standard [3], and the assumption on the Hessian was used in [13] (for example, it is true if $f$ is quadratic).
Assumption A1 (Lipschitz Properties)
$\nabla f$ is $L$-Lipschitz continuous and $\nabla^2 f$ is $\rho$-Lipschitz continuous. We write as shorthand for when is clear from context.
Assumption A2 (Moderate Nonsmooth Term)
The magnitude of the $\ell_1$ term, which is denoted by $\lambda$, satisfies inequalities (7) and (9).
Theorem 3.1 (Main)
There exists an absolute constant such that if satisfies A1 and A2, then for any , and constant , with probability , the output of will be a -second order stationary point, and terminate in iterations:
[TABLE]
Remark
Assuming $\varepsilon\leq L^{2}/\rho$ does not lead to loss of generality. Recall the second order condition is specified as $\lambda\big(\nabla^{2}f(\mathbf{x}^{\star})\big)_{\min}\geq-\sqrt{\rho\varepsilon}$. When $\varepsilon\geq L^{2}/\rho$, we always have $-\sqrt{\rho\varepsilon}\leq-L\leq\lambda\big(\nabla^{2}f(\mathbf{x}^{\star})\big)_{\min}$, where the second inequality follows from the fact that the Lipschitz constant $L$ of $\nabla f$ is an upper bound for $\nabla^{2}f$ in norm. Consequently, when $\varepsilon\geq L^{2}/\rho$, every $\varepsilon$-first-order stationary point is automatically an $\varepsilon$-second-order stationary point.
For the proof of the main theorem, we introduce some notation and units to simplify the statements of the proofs.
For matrices we use to denote spectral norm. The operator denotes projection onto set . Define the local approximation of the smooth part of the objective function by
[TABLE]
Units
With the condition number of the Hessian matrix , we define the following units to simplify the proof statements:
[TABLE]
3.1 Lemma: Iterates remain bounded if stuck near a saddle point
Lemma 1
For any constant , there exists absolute constant : for any , let satisfies the condition in Lemma 6, for any initial point with , define:
[TABLE]
then, for any , we have for all that .
Proof
We show if the function value did not decrease, then all the iteration updates must be constrained in a small ball. The proximal descent updates the solution as
[TABLE]
Without loss of generality, set to be the origin. For any ,
[TABLE]
Jin et al. prove in [13] by induction that if , then . Consequently, .
We point out that it is implicitly assumed that , so that for all , , and the relation holds.
3.2 Preparation for Building Pillars
Lemma 2 (Existence of lower bound for the difference sequence )
For iteration sequences and defined in Lemma 4, define the difference sequence as
[TABLE]
There exists a positive lower bound for when .
Proof
To show that the lower bound for iteration difference exists, we consider bounding the iteration sequence first. Define the difference between the proximal of penalty term and its coimage as , where is Hadamard product and the minimum is taken elementwise. We notice that . Thus, .
[TABLE]
As , where , we have
[TABLE]
To compare and ,
[TABLE]
Therefore, as long as
[TABLE]
the difference sequence has a positive lower bound on its norm.
Lemma 3 (Preservation of subspace projection monotonicity after prox of in rotated coordinate with small )
Denote the subspace of spanned by as , while the complement subspace spanned by as . For a given vector chosen from a lower bounded set , i.e. , for some constant , assume , where is a constant. If the parameter for the penalty term is small enough, then
[TABLE]
Proof
We want to find a constraint on such that when is small enough, if the projection in the original coordinate demonstrates the monotonicity relation , this monotonicity relation will be preserved after proximal operator of is applied on the input vector.
Naturally there exists a normal vector, denoted as , for the boundary hyperplane on which . By moving along , a point approaches the boundary most efficiently. Any vector inside the hyperplane is perpendicular to , which we denote as .
Define
[TABLE]
where is the Hadamard product, and the minimum is taken elementwise. Because , a sufficient condition to be imposed on to guarantee the preservation of projection monotonicity is that
[TABLE]
which means that the displacement caused by applying the proximal operator (soft shrinkage), projected onto the direction of , is less than the distance from to the boundary hyperplane, so the vector stays on the same side of the boundary after moving.
Therefore, as long as
[TABLE]
the monotonicity of projection onto subspaces can be preserved.
Remark 1 for Lemma 3
As an example in , set ; we visualise the shift caused by the proximal operator and the boundary of the projection-monotonicity preserving region. Assume are the orthonormal basis of Cartesian coordinates in the standard position. The directional vector for the region division boundary is , and is the corresponding perpendicular directional vector. For norm, is .
Remark 2 for Lemma 3
We point out that the upper bound for the parameter is related to the alignment of the eigenspace of . If the eigenspace of is aligned with canonical orthonormal basis of , then . The most stringent restriction on the upper bound of applies when is parallel to .
3.3 Lemma: Perturbed iterates will escape the saddle point
Lemma 4
There exists absolute constant such that: for any , let satisfies the condition in Lemma 6, and sequences satisfy the conditions in Lemma 6, define:
[TABLE]
then, for any , if for all , we will have .
Proof
We show that if the iterate sequence before time starting from does not provide sufficient function value decrease, the other iterate sequence, which starts from , will be able to achieve the function value decrease purpose. Ultimately, we will prove . We establish the inequality about by considering the difference between and . Define . The assumption of the lemma 4, ,
We bound from both sides for all to obtain an inequality about .
Recall that the proximal descent updates the solution as
[TABLE]
Simple algebraic computation gives
[TABLE]
where , and .
Consider and . Because , we have . With same logic in the proof for lemma 1, we see , and . (Same relation hold for and respectively.) As a result, for all . Also,
[TABLE]
Equation (11) and Hessian Lipschitz gives for , , where .
Denote be the norm of projected onto direction (), and be the norm of projected onto the remaining subspace (), while be the norm of projected onto , and be the norm of projected onto .
Equation (10) gives
[TABLE]
To obtain the lower bound of , we prove the following relation as preparation:
[TABLE]
By hypothesis of lemma 4, we know , thus the base case of induction holds. Assume equation (14) is true for , for , we have
[TABLE]
By choosing , and , we have . This gives . i.e.
[TABLE]
Connecting two parts of equation (3.3), we obtain
[TABLE]
Now we switch our focus to the eigenspace of the Hessian . Assume the orthonormal basis for the eigenspace of is . The order of dimension aligns with the increasing order of the corresponding eigenvalues. This coordinate transformation does not lead to loss of generality, as it is unitary.
By lemma 2, we know the iteration difference sequence has a positive lower bound in terms of 2-norm. Therefore, by lemma 3, with the virtue of equation (17) , we still have the projection monotonicity on the subspace of eigenspace of , i.e.
[TABLE]
This completes the induction.
Recall that , we thus have , which gives
[TABLE]
where the last inequality follows from .
Finally, combining (11) and (18), we have for all :
[TABLE]
This implies
[TABLE]
The last inequality is due to , we have . By choosing the constant to be large enough to satisfy , we will have , which finishes the proof.
3.4 Combining Previous Results
Lemma 5
There exists a universal constant , for any , let satisfies the conditions in Lemma 6, and without loss of generality let be the minimum eigenvector of . Consider two gradient descent sequences with initial points satisfying: (denote radius )
[TABLE]
Then, for any stepsize , and any , we have:
[TABLE]
Proof
Without loss of generality, let be the origin. Let be the absolute constant so that Lemma 4 holds, and let be the absolute constant that makes Lemma 1 hold based on our current choice of . We choose so that our learning rate is small enough to make both Lemma 1 and Lemma 4 hold. Let and define:
[TABLE]
Let’s consider following two cases:
Case :
In this case, by Lemma 1, we know , and therefore
[TABLE]
By choosing small enough and , this gives:
[TABLE]
The first and second inequalities exploit the Hessian Lipschitz property of the smooth function , and , . By choosing . We know , and by the sufficient decrease lemma for proximal descent, each proximal descent iteration decreases the function value. Therefore, for any , we have:
[TABLE]
Case :
In this case, by Lemma 1, we know for all . Define
[TABLE]
By Lemma 4, we immediately have . Apply same argument as in the case , we have for all that .
3.5 Main Lemma
Lemma 6 (Main Lemma)
There exists universal constant , for satisfies A1, for any , suppose we start with point satisfying following conditions:
[TABLE]
Let where come from the uniform distribution over ball with radius , and let be the iterates of gradient descent from . Then, when stepsize , with at least probability , we have following for any :
[TABLE]
Proof
Denote $T_{\frac{1}{L}}(\mathbf{x})=\mathrm{prox}_{\frac{1}{L}g}\big[\mathbf{x}-\frac{1}{L}\nabla f(\mathbf{x})\big]$. The first order stationary condition is equivalent to $\|\tilde{\mathbf{x}}-T_{\frac{1}{L}}(\tilde{\mathbf{x}})\|=\|\nabla f(\tilde{\mathbf{x}})+\partial g\big(T_{\frac{1}{L}}(\tilde{\mathbf{x}})\big)\|\leq\mathscr{G}$, where $\partial g$ is the subgradient of the function $g$.
As has Lipschitz constant , we have
[TABLE]
Notice
[TABLE]
By adding perturbation, in worst case we increase function value by:
[TABLE]
where the last inequality follows from the fact that per equation (7).
On the other hand, let radius . We know comes from the uniform distribution over . Let denote the set of bad starting points so that if , then (thus stuck at a saddle point); otherwise if , we have .
By applying Lemma 5, we know for any , it is guaranteed that where . Denote by the indicator function of being inside set ; and vector , where is the component along direction, and is the remaining dimensional vector. Recall that is the -dimensional ball with radius . By calculus, this gives an upper bound on the volume of :
[TABLE]
Then, we immediately have the ratio:
[TABLE]
The second last inequality is by the property of Gamma function that as long as . Therefore, with at least probability , . In this case, we have:
[TABLE]
which finishes the proof.
3.6 Main Theorem, and its Proof
Lemma 7 (Sufficient Decrease Lemma for Proximal Descent, [3])
Assume the function is real-valued and lower semi-continuous. Then for any where , we have
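The sufficient decrease property can be checked numerically. The sketch below assumes the form given in [3] for stepsize $1/L$, namely that the objective decreases by at least $\frac{1}{2L}\|G(\mathbf{x})\|^2$ after one proximal gradient step, where $G(\mathbf{x}) = L\big(\mathbf{x} - T_{1/L}(\mathbf{x})\big)$; the exact statement used in the paper may be phrased differently. The indefinite quadratic `f`, the value of `lam` and all names are hypothetical.

```python
import numpy as np

def soft_threshold(y, tau):
    """Prox of tau * ||.||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

# Illustrative non-convex f(x) = 0.5 * x^T A x; its gradient is L-Lipschitz with L = 2.
A = np.diag([1.0, -0.5, 2.0])
L, lam = 2.0, 0.3
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x
phi = lambda x: f(x) + lam * np.abs(x).sum()

def prox_grad_map(x):
    """T_{1/L}(x) = prox_{g/L}[x - grad_f(x)/L] for g = lam*||.||_1."""
    return soft_threshold(x - grad_f(x) / L, lam / L)

rng = np.random.default_rng(0)
gaps = []
for _ in range(100):
    x = rng.normal(size=3)
    tx = prox_grad_map(x)
    G = L * (x - tx)                       # gradient mapping with stepsize 1/L
    gaps.append(phi(x) - phi(tx) - (G @ G) / (2 * L))

min_gap = min(gaps)   # should be nonnegative up to rounding error
print(min_gap)
```

The bound does not need convexity of `f`, only the Lipschitz gradient, which is why it can be applied along the whole trajectory in the proof.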
3.6.1 Proof of the Main Theorem
Proof
Denote to be the absolute constant allowed in lemma 6 when it is given following parameters , , and . In this theorem, we let , and choose any constant .
In this proof, we will actually achieve some point satisfying following condition:
[TABLE]
Since , , we have , which implies any satisfy Eq.(19) is also a -second-order stationary point.
Starting from , we know if does not satisfy Eq.(19), there are only two possibilities:
: In this case, Algorithm 1 will not add perturbation. By lemma 7:
[TABLE]
: In this case, Algorithm 1 will add a perturbation of radius , and will perform proximal gradient descent (without perturbations) for the next steps. Algorithm 1 will then check termination condition. If the condition is not met, we must have:
[TABLE]
This means on average every step decreases the function value by
[TABLE]
In case 1, we can repeat this argument for and in case 2, we can repeat this argument for . Hence, we can conclude that as long as Algorithm 1 has not terminated yet, on average, every step decreases the function value by at least . However, we clearly cannot decrease the function value by more than , where is the function value of the global minimum. This means Algorithm 1 must terminate within the following number of iterations:
[TABLE]
Finally, we would like to ensure when Algorithm 1 terminates, the point it finds is actually an -second-order stationary point. The algorithm can only terminate when the gradient mapping is small, and the function value does not decrease after a perturbation and iterations. We shall show every time when we add perturbation to iterate , if , then we will have . Thus, whenever the current point is not an -second-order stationary point, the algorithm cannot terminate.
According to Algorithm 1, we immediately know (otherwise we will not add perturbation at time ). By lemma 6, we know this event happens with probability at least each time. On the other hand, during one entire run of Algorithm 1, the number of times we add perturbations is at most:
[TABLE]
By the union bound, for all these perturbations, with high probability lemma 6 is satisfied. As a result Algorithm 1 works correctly. The probability of that is at least
[TABLE]
Recall our choice of . Since , we have , this gives:
[TABLE]
which finishes the proof.
Remarks on large
We point out that when is large enough so that the term alters the local landscape of the objective function , it is inevitable that new local minima will be introduced to the landscape of the objective function, and potentially change the stability of saddle points. We hypothesize that perturbed proximal descent will still converge to an -second-order stationary point regardless of the magnitude of .
An example for the new local minima introduced by large is Fig. 3(b). We see new wrinkles are introduced to the four legs of the octopus function as increases from to . If an iteration starts in the neighborhood of creases, it can converge to the bottom of the creases. Fig. 3(c) is an extreme scenario where the original landscape of the octopus function is completely altered to conform to the behavior of penalty term.
3.7 From -second-order stationary point to local minimizers
Assumption A3 (Nondegenerate Saddle)
For all stationary points , such that , where are the eigenvalues (not to be confused with the parameter ).
With this nondegenerate saddle assumption, the main theorem can be strengthened to the following corollary, whose proof is immediate as one sets the value in the main theorem as and realizes that there is no eigenvalue of existing between and the first positive eigenvalue.
Corollary 1
There exists an absolute constant such that if satisfies assumptions A1, A2 and A3, then for any , constant , and , with probability , the output of will be a local minimizer of , and terminate in iterations:
[TABLE]
4 Numerical Experiment
We set to be the “octopus” function described in [10] and use perturbed proximal descent to minimize the objective function . Plots of octopus function defined in for various are shown in Figure 3.
The “octopus” family of functions is parameterized by , which controls the width of the “legs,” and and which characterize how sharp each side is surrounding a saddle point, related to the Lipschitz constant. The example illustrated in Fig. 3 uses parameters .
We are interested in the octopus family of functions because it can be generalized to any dimension , and it has saddle points (not counting the origin) which are known to slow down standard gradient descent algorithms. The usual minimization iteration sequence, if starting at the maximum value of the octopus function, will successively go through each saddle point before reaching the global minimum, thus rendering the iteration progress easy to track and visualize.
Specifics of Octopus Function
We define the octopus function in the first quadrant of . Then, by even function reflection, the octopus function can be continued to all other quadrants.
Define the auxiliary gluing functions as
[TABLE]
Define the gluing function and gluing balance constant respectively as
[TABLE]
For a given , when
[TABLE]
and if , we have
[TABLE]
and for , if
[TABLE]
and if
[TABLE]
and if ,
[TABLE]
Remark
All saddle points happen at , and the global minimum is at . Regions in the form of are transition zones described by the gluing functions which connect separate pieces to make a continuous function. The octopus function can be constructed first in the first quadrant, and then using even function reflection to define it in all other quadrants. A typical descent algorithm applied to the octopus generates iterations that take multiple turns like walking down a spiral staircase, each staircase leading to a new dimension.
4.1 Results
We apply the perturbed proximal descent (PPD) on the octopus function plus when the dimension varies between . We set the constant . For comparison, we apply perturbed gradient descent (PGD) as well since is differentiable almost everywhere; for both algorithms, the norm of the perturbation is .
We see that PPD successfully finds the local minimum in the first three cases within 1000 iterations, and in the case of , PPD almost finds the local minimum within 1000 iterations. In contrast, unperturbed proximal descent (PD), gradient descent (GD), and perturbed gradient descent (PGD) sequences are trapped near saddle points.
5 Conclusion
This paper provides an algorithm to minimize a non-convex function plus a penalty of small magnitude, with a probabilistic guarantee that the returned result is an approximate second-order stationary point, and hence for a large class of functions, a local minimum instead of a saddle point. The complexity is of and the result depends on dimension in .
The deficiency of the result is that the magnitude of penalty needs to be small to let our theoretical result hold. Meanwhile, we also notice that a large will lead to creation of new local minima to the objective function altering the original landscape. Our future work will address the case of large in the iteration process.
- [1] H. Attouch, J. Bolte, and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming, pages 1–39, 2011.
- [2] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer-Verlag, New York, 2nd edition, 2017.
- [3] A. Beck. First-Order Methods in Optimization. MOS-SIAM Series on Optimization, 2017.
- [4] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Prog., 146(1-2):459–494, 2014.
- [5] R. I. Boţ, E. R. Csetnek, and D.-K. Nguyen. A proximal minimization algorithm for structured nonconvex and nonsmooth problems. arXiv preprint arXiv:1805.11056v1 [math.OC], 2018.
- [6] Y. Carmon, J. Duchi, O. Hinder, and A. Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751–1772, 2018.
- [7] P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. SIAM Multiscale Model. Simul., 4(4):1168–1200, 2005.
- [8] F. E. Curtis, D. P. Robinson, and M. Samadi. A trust region algorithm with a worst-case iteration complexity of $\mathcal{O}(\epsilon^{-3/2})$ for nonconvex optimization. Mathematical Programming, 162(1):1–32, Mar 2017.
