Gradient projection and conditional gradient methods for constrained nonconvex minimization
Maxim Balashov, Boris Polyak, Andrey Tremba

TL;DR
This paper introduces gradient projection and conditional gradient methods for solving smooth non-convex minimization problems on manifolds, providing convergence guarantees under minimal assumptions and demonstrating their effectiveness.
Contribution
It develops and analyzes gradient projection and Frank-Wolfe algorithms with convergence guarantees for constrained non-convex optimization on manifolds.
Findings
The gradient projection method converges linearly under the Ležanski-Polyak-Lojasiewicz condition.
The conditional gradient method achieves global convergence with a linear rate under certain conditions.
The methods are applicable to optimization problems on spheres and smooth manifolds.
Abstract
Minimization of a smooth function on a sphere or, more generally, on a smooth manifold, is the simplest non-convex optimization problem, and it has many applications. Our goal is to propose a version of the gradient projection algorithm for its solution and to obtain results that guarantee convergence of the algorithm under some minimal natural assumptions. We use the Ležanski-Polyak-Lojasiewicz condition on a manifold to prove the global linear convergence of the algorithm. Another method well fitted for the problem is the conditional gradient (Frank-Wolfe) algorithm. We examine some conditions which guarantee global convergence of the full-step version of the method with a linear rate.
M. V. Balashov, B. T. Polyak, A. A. Tremba
V. A. Trapeznikov Institute of Control Sciences of Russian Academy of Sciences, 65 Profsoyuznaya street, Moscow 117997, Russia.
Key words and phrases:
Minimization on a sphere, smooth functions, proximally smooth set, strongly convex set, gradient projection method, Ležanski-Polyak-Lojasiewicz condition, Frank-Wolfe method, nonconvex optimization
2010 Mathematics Subject Classification:
Primary: 49J53, 90C26, 90C52. Secondary: 46N10, 65K10.
1. Introduction
Consider minimization of a smooth function $f$ on a closed set $Q$ in the Euclidean space
$$\min_{x \in Q} f(x), \qquad Q \subset \mathbb{R}^n. \tag{1}$$
Traditionally the set $Q$ and the function $f$ are assumed to be convex; in such a convex setting the problem is well studied and numerous algorithms are known, see e.g. [7, 8, 29, 30] for details. We plan to address the situation when the function and/or the set are nonconvex.
The function $f$ under consideration is smooth with a Lipschitz continuous gradient, but nonconvex. Regarding the set $Q$, we mostly suppose that it is proximally smooth [36, 10, 9]. In particular we consider the following important cases:
- $Q = \{x \in \mathbb{R}^n : \|x\| = 1\}$ (minimization on the sphere),
- $Q = \{x \in \mathbb{R}^n : g(x) = 0\}$ for a smooth map $g$ (equality type constraints),
- $Q$ is the boundary of a strongly convex set.
Minimization on the sphere has numerous applications, for instance finding the minimal eigenvalue of a symmetric matrix $A$ (then $f(x) = (Ax, x)$ and $Q$ is the unit sphere) or choosing the step-size in trust-region methods [11]. The sphere is obviously nonconvex, thus (1) is an example of a nonconvex optimization problem. The pioneering work in the field is [25]. A special case of problem (1) (for quadratic $f$) has been studied by Hager [16]; the solution can be reduced to solving 1D equations. Later publications include [1, 28, 26, 35, 37]; in most of them $Q$ is a smooth (Riemannian) manifold. In general, however, research in nonconvex optimization is much less intensive than in the convex case. The main approaches use generalized convexity on the set and/or consider geodesic-related steps. In contrast, we use neither of these. There are numerous methods for optimization with equality type constraints, see e.g. [31, Chapter 7], [7]. However, most of them generate points which are not admissible (they do not belong to $Q$), while our purpose is to develop methods with admissible iterations.
The contribution of the present paper is threefold:
- (1) We propose a new approach to the gradient projection algorithm for constrained optimization, based on the idea of upper approximation of the objective function. The resulting method is the gradient projection algorithm with constant step-size; it differs from the versions proposed in [25]. Moreover, we prove its convergence to a stationary point without any convexity-like assumptions.
- (2) We generalize the Polyak-Lojasiewicz condition, well known in unconstrained optimization, to the class of problems (1). Under this assumption we prove linear convergence of the gradient projection algorithm (with projection on the tangent subspace and a variant combined with the Newton method) to a global extremum under the assumption of proximal smoothness of the manifold $Q$. As an example we consider a quadratic form on the unit sphere.
- (3) For approximately linear objective functions we propose a new version of the Frank-Wolfe (conditional gradient) method and establish its linear convergence to a global minimum in problem (1). We prove linear convergence of the method for a surface which is the boundary of a strongly convex set. Note that such a surface is not necessarily smooth.
In the paper [4] linear convergence of the gradient projection algorithm was proved for a proximally smooth set with constant and for a strongly convex function with constant of strong convexity and Lipschitz constant under assumption . The last inequality is essential for linear convergence of the method. In subsection 3.1 we prove convergence of the standard gradient projection algorithm to a stationary point of the problem (1) for any function with the Lipschitz continuous gradient and for any proximally smooth set.
In subsection 3.2 we extend the Polyak-Lojasiewicz condition (well known in unconstrained minimization) [27, 33, 19] to the constrained case (1) with a differentiable function $f$ and a smooth manifold $Q$. We note that this generalization is in fact a variant of the error bound condition for the case of a smooth set $Q$.
On the basis of the new definition from subsection 3.2 we prove in subsection 3.3 convergence of the gradient projection algorithm for the unit sphere. For the case of a general smooth and proximally smooth manifold of codimension 1 we prove it in subsection 3.4. In contrast with other approaches, our algorithms are variants of the gradient projection algorithm with admissible points and a linear rate of convergence.
In subsection 3.5 we consider the situation when the gradient projection algorithm can be finalized by use of the Newton method. This is standard practice, except that we are dealing with a nonconvex problem.
Concluding section 3, we prove the Ležanski-Polyak-Lojasiewicz condition for the quadratic function on the sphere. Thus we extend the result of [17] by clarifying the constant in this condition. It is essential for estimating the error of the gradient projection algorithm in the case of the unit sphere.
In section 4 we consider application of the Frank-Wolfe method to problem (1) in the case when $Q$ is the boundary of a strongly convex set of given radius. It is well known that if the set is a convex compact and $f$ is a Lipschitz differentiable convex function, then the method converges (with respect to the objective function) with a sublinear rate. In [14] the authors discuss the choice of step-size in the method. In [26] the author proved that, under certain assumptions, in the case of a convex compact set and a (nonconvex) function with a Lipschitz continuous gradient the Frank-Wolfe method converges to a stationary point of problem (1) with a sublinear rate.
We prove a linear rate of convergence with respect to the point. In subsection 4.1 we prove it for a so-called approximately linear function and in subsection 4.2 for a function with gradient domination. In both subsections the general idea is that the radius of “curvature” of the level sets of our function is larger than the radius of strong convexity. This leads to the results. We only point out that we understand the notion of *radius of “curvature”* in the sense of the supporting principles for proximally smooth and strongly convex sets.
For completeness, in the Appendix we prove a necessary condition of minimum in problem (1) for a proximally smooth set and a function with a Lipschitz continuous gradient.
All the mentioned parts are tied together by the possibility of a certain “spherical” approximation (via the supporting principles) of the set/surface in problem (1).
Most of the mentioned results hold in a real Hilbert space. Sometimes obvious modifications are needed in the infinite-dimensional case, for example compactness of the set in Theorem 1.
2. Definitions and main notations
Let be the ball with center and radius . For a set the sets , , *are the closure, the interior and the boundary *of , respectively. We also denote by the edge of the surface .
Let $P_Q$ be the operator of metric projection on the set $Q$, i.e. $P_Q x = \arg\min_{y \in Q} \|x - y\|$. In general, $P_Q x$ can be set-valued for nonconvex sets, but for proximally smooth sets (see below) it is single-valued (provided $x$ is close enough to $Q$).
For a closed set $Q$ *the normal cone of proximal normals* (or simply the normal cone) at a point $x \in Q$ is defined as follows [9]
[TABLE]
If the set $Q$ is convex, then the cone of proximal normals coincides with the normal cone in the sense of convex analysis.
A closed set $Q$ is called *proximally smooth* with constant $R > 0$ [36, 10, 9] if the distance function to $Q$ is continuously Fréchet differentiable on the set of points whose distance to $Q$ lies strictly between $0$ and $R$. The equivalent properties for a proximally smooth set with constant $R$ are
-
is a single-valued continuous function,
-
supporting principle: , , if and only if .
Note that the mapping is upper semicontinuous for a proximally smooth set with constant . For a point we have and it is sufficient to prove upper semicontinuity on the boundary . Choose , , and , , with . Suppose that , the last means that is not upper semicontinuous at the point . By the supporting principle for proximally smooth sets
[TABLE]
The sets converge to the set in the Hausdorff metric, thus for sufficiently large . A contradiction.
If we consider a continuously differentiable -dimensional surface without edge which is also proximally smooth with constant , then for any point the normal cone is 1-dimensional subspace. If and then the surface is trapped between the supporting spheres (see supporting principle 2)):
[TABLE]
A closed convex set is called *strongly convex of radius* $r > 0$ if it can be represented as an intersection of closed balls of radius $r$ [36, 5]. There are a few equivalent properties of strong convexity:
-
A convex compact set is strongly convex of radius if and only if for any pair of points the ball with center of radius belongs to [22, 34, 2].
-
Another equivalent property of strong convexity is the *supporting principle*: for any and , , we have
[TABLE]
- The set is strongly convex of radius if and only if for any unit vectors and for supporting elements , we have the next inequality [36, Proposition 2.8]
[TABLE]
If is a smooth manifold, then the subspace is the tangent subspace to at a point , i.e.
[TABLE]
where for all .
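Define *the lower level set* of the function .
We say that the function $f$ has a *Lipschitz continuous gradient* with constant $L_1$ if
$$\|\nabla f(x) - \nabla f(y)\| \le L_1 \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^n. \tag{3}$$
It is well known that (3) implies the upper bound for $f$ [29, Lemma 1.2.3]:
$$f(y) \le f(x) + (\nabla f(x), y - x) + \frac{L_1}{2}\|y - x\|^2 \tag{4}$$
for any $x, y \in \mathbb{R}^n$.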
We write , , if there exist and such that for all .
For a differentiable vector function denote the Jacobi matrix as
[TABLE]
and we treat as columns.
The set is the set of *stationary points* of the differentiable function on the set (which is associated with the problem (1)) if for any point we have . The last inclusion is a necessary condition of optimality for a proximally smooth set and a smooth function . We prove it in the Appendix for completeness.
3. The gradient projection algorithm
The gradient projection algorithm for (1) in the convex case was proposed in [18, 22]. The simplest version (with constant step-size) looks as follows: for an iterate $x_k$ it generates the new point $x_{k+1}$ as the minimizer over $Q$ of the quadratic upper bound on $f$ (with a constant $C$ in place of the Lipschitz constant in (4)) or, equivalently, as the projection of the gradient step on $Q$:
Gradient Projection Algorithm (GPA1)
Step 1. Choose a constant $C > 0$, an initial point $x_0 \in Q$ and put $k = 0$.
Step 2. Repeat
$$x_{k+1} = P_Q\Big(x_k - \frac{1}{C}\nabla f(x_k)\Big), \qquad k = 0, 1, \dots \tag{5}$$
The choice of the constant $C$ is equivalent to the choice of the constant step-size $t = 1/C$. Below we shall consider the extensions of the method for a nonconvex set $Q$ and a nonconvex function $f$.
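The constant-step iteration of GPA1 can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `gradient_projection` and `project_unit_circle`, the test objective and the step value are all hypothetical choices, not taken from the paper, and the admissible set is taken to be the unit circle so that the projection is explicit.

```python
import math
from typing import Callable, List

Vector = List[float]


def project_unit_circle(y: Vector) -> Vector:
    """Metric projection onto the unit circle (not unique at the origin)."""
    norm = math.sqrt(sum(c * c for c in y))
    if norm == 0.0:
        raise ValueError("projection of the origin onto the circle is not unique")
    return [c / norm for c in y]


def gradient_projection(
    grad: Callable[[Vector], Vector],
    project: Callable[[Vector], Vector],
    x0: Vector,
    step: float,
    iterations: int,
) -> Vector:
    """Iterate x <- project(x - step * grad(x)) with a constant step."""
    x = project(x0)
    for _ in range(iterations):
        g = grad(x)
        x = project([xi - step * gi for xi, gi in zip(x, g)])
    return x


# Illustrative objective (an assumption): f(x) = x[0] + 0.5 * x[1]**2,
# whose minimizer on the unit circle is (-1, 0).
def grad_f(x: Vector) -> Vector:
    return [1.0, x[1]]


x_min = gradient_projection(grad_f, project_unit_circle, [0.0, 1.0], 0.5, 100)
```

Every iterate stays on the circle, which is the admissibility property the text emphasizes; the step value 0.5 is just a convenient number for this toy problem and is not derived from the conditions of the theorems.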
One of the possible ways of extension is gradient projection along geodesics proposed by Luenberger [25]. It is not hard to construct geodesics on the sphere (arcs of great circles), but the original algorithm in [25] requires one-dimensional minimization at each iteration. Another problem is that for an arbitrary manifold the construction of geodesics is a hard procedure. Thus we avoid geodesics and deal with the gradient projection method with constant step-size in the form (5).
3.1. The case of an arbitrary proximally smooth set . General algorithm
The next result shows that for any function with the Lipschitz continuous gradient and for any proximally smooth set iterations of the standard gradient projection algorithm (5) are well-defined and converge to a stationary point of the function on the set for the appropriate choice of the step-size.
**Theorem 1.**
Let be a bounded proximally smooth set with constant . Suppose that the function is the Lipschitz continuous with constant and its gradient is also the Lipschitz continuous with constant . Take with . Then for any GPA1 with converges to the set of stationary points : and
[TABLE]
for all .
**Proof.** Define for each natural $k$ the function
[TABLE]
It’s easy to see (due to the Lipschitz continuity of gradient ) that
[TABLE]
for all and
[TABLE]
Hence the distance from the point to the set is less than and the metric projection is defined uniquely by the definition of proximally smooth set. We have
[TABLE]
Assume that for some sequence which is generated by the gradient projection algorithm. Then there is a number and a subsequence with for all . Consider a converging subsequence of the sequence (that again is denoted by ) and . Then from the necessary conditions of minimum of the function on the set we get
[TABLE]
in other words
[TABLE]
Passing to the limit as , using upper semicontinuity of the normal cone and the property we have
[TABLE]
Thus , a contradiction. ∎
With the help of Theorem 1 we can find a stationary point with error , namely we can find a point with .
Stationary-point Algorithm
Step 1. Choose and . Put , , and set .
Step 2. Perform a Step 2 (5) of GPA1 with .
Step 3. If , increase and continue to the Step 2.
Otherwise stop the algorithm and return as the solution.
The algorithm performs at most $N = \big\lfloor \frac{2\Delta f}{C\delta^{2}} \big\rfloor + 1$ steps. Assume the contrary, that for . Then the assumptions of Theorem 1 hold, thus and after steps we get
[TABLE]
i.e. a contradiction. When the algorithm stops with , by the optimality condition for the function we get
[TABLE]
Using the Lipschitz continuity of we obtain
[TABLE]
Note that the parameter of the step depends on Lipschitz constant of the function because the point should not go very far from the set . Moreover, conditions on constant mean that the step-size satisfies the inequality .
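The Stationary-point Algorithm above can be sketched as a projection loop with a stopping rule. The exact test in Step 3 did not survive in this text, so the sketch assumes that the loop stops as soon as the displacement between consecutive iterates falls below `delta`; that rule, the function names and the quadratic test objective are assumptions made for illustration only.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def normalize(y: Vector) -> Vector:
    """Projection onto the unit sphere for a nonzero vector."""
    norm = math.sqrt(sum(c * c for c in y))
    if norm == 0.0:
        raise ValueError("cannot project the origin onto the sphere")
    return [c / norm for c in y]


def find_stationary_point(
    grad: Callable[[Vector], Vector],
    project: Callable[[Vector], Vector],
    x0: Vector,
    step: float,
    delta: float,
    max_iterations: int = 10_000,
) -> Tuple[Vector, int]:
    """Run projected gradient steps until the iterate moves less than delta."""
    x = project(x0)
    for k in range(1, max_iterations + 1):
        g = grad(x)
        x_next = project([xi - step * gi for xi, gi in zip(x, g)])
        move = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_next, x)))
        if move < delta:  # assumed stopping test, see the lead-in
            return x_next, k
        x = x_next
    raise RuntimeError("stopping test not met within max_iterations")


# Illustrative objective: f(x) = 0.5 * (x0^2 + 2 x1^2 + 3 x2^2) on the unit sphere.
def grad_f(x: Vector) -> Vector:
    return [1.0 * x[0], 2.0 * x[1], 3.0 * x[2]]


x_stat, steps = find_stationary_point(grad_f, normalize, [1.0, 1.0, 1.0], 0.2, 1e-8)
```

A small displacement translates into a small tangential component of the gradient, which is the kind of approximate stationarity the surrounding estimates quantify.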
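**Example 1.**
Sometimes the projection can be found explicitly, for example for the unit sphere $S = \{x \in \mathbb{R}^n : \|x\| = 1\}$:
$$P_S y = \frac{y}{\|y\|} \quad (y \ne 0), \qquad x_{k+1} = \frac{x_k - t\nabla f(x_k)}{\|x_k - t\nabla f(x_k)\|}, \tag{6}$$
thus the algorithm is the gradient-projection method with constant step-size $t$.
By Proposition 1 the first-order optimality condition in (1) for the unit sphere means that $\nabla f(x)$ is collinear with $x$, and it can be immediately rewritten in the form
$$(I - xx^T)\nabla f(x) = 0 \tag{7}$$
(here $I$ is the unit matrix) or as
$$\nabla f(x) = (\nabla f(x), x)\,x. \tag{8}$$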
3.2. The Ležanski-Polyak-Lojasiewicz condition on a manifold.
Now the main task is to propose conditions which guarantee convergence of the method to the global minimum in the problem (1) and to estimate the rate of convergence.
In unconstrained minimization we have such powerful tool as convexity; gradient method for convex functions converges to global minima while for strongly convex functions one has linear rate of convergence. There are extensions of convexity for minimization on manifolds, see e.g. the monograph [35] and the paper [28]. Unfortunately there exist no (globally) convex functions on compact manifolds [28], thus we need some other tools.
However in the unconstrained case there are conditions which validate convergence for nonconvex functions. Probably the first one is due to T. Ležanski [23, 24]. He considered a problem of unconstrained minimization for a Lipschitz differentiable function such that there exists a positive continuous function with
[TABLE]
Under these assumptions he proved for the convergence of the gradient descent algorithm with linear rate. The same assumption
[TABLE]
(where ) was considered in [27, 33]. Sometimes [19] this is referred to as the Polyak-Lojasiewicz condition (the works of Ležanski were not widely known). Thus it is fair to call the above condition the Ležanski-Polyak-Lojasiewicz (LPL) condition.
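For orientation, the standard argument behind the linear rate in the unconstrained case fits in three lines. The derivation below is a sketch under assumptions that are not taken from this text: the gradient is Lipschitz with constant $L_1$, the step-size is $1/L_1$, and the condition is normalized as $\|\nabla f(x)\|^2 \ge 2\mu\,(f(x) - f^*)$; the paper's own normalization of the constant may differ.

```latex
\begin{aligned}
f(x_{k+1}) &\le f(x_k) - \frac{1}{2L_1}\,\|\nabla f(x_k)\|^2
  && \text{(upper bound for } x_{k+1} = x_k - \tfrac{1}{L_1}\nabla f(x_k)\text{)} \\
&\le f(x_k) - \frac{\mu}{L_1}\,\bigl(f(x_k) - f^*\bigr)
  && \text{(assumed condition } \|\nabla f(x)\|^2 \ge 2\mu\,(f(x) - f^*)\text{)} \\
\Longrightarrow\quad f(x_{k+1}) - f^* &\le \Bigl(1 - \frac{\mu}{L_1}\Bigr)\bigl(f(x_k) - f^*\bigr).
\end{aligned}
```

No convexity is used in this chain, which is why a condition of this type is a natural candidate for the constrained nonconvex setting.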
Analogously we can propose the analog of LPL condition for the constrained minimization of a differentiable function on a smooth manifold . Define by the metric projection on the tangent subspace to the manifold at the point . Note that is the polar cone (subspace) for the tangent space .
**Definition 1.**
Let be a manifold and be a differentiable function. Let , , , . We shall say that the Ležanski-Polyak-Lojasiewicz LPL condition with exponent holds for the function on the set if
[TABLE]
for all .
If then we shall call (9) simply the LPL condition for the function on the manifold .
We note that we consider only a manifold whose edge has empty intersection with the set . For example, can be a manifold without edge.
Note that if is given by the system of full rank then for all . Here is the identity operator in .
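For a single constraint (codimension 1) the projection onto the tangent subspace reduces to removing the component along the constraint gradient. The sketch below illustrates only that special case; the function name and the example are hypothetical, and for several constraints one would use the standard orthogonal projector built from the Jacobi matrix, which is not implemented here.

```python
from typing import List

Vector = List[float]


def project_onto_tangent_hyperplane(v: Vector, normal: Vector) -> Vector:
    """Orthogonal projection of v onto the subspace {h : (normal, h) = 0}."""
    nn = sum(n * n for n in normal)
    if nn == 0.0:
        raise ValueError("the constraint gradient must be nonzero (full rank)")
    coef = sum(vi * ni for vi, ni in zip(v, normal)) / nn
    return [vi - coef * ni for vi, ni in zip(v, normal)]


# Unit sphere written as g(x) = ||x||^2 - 1 = 0, so the gradient of g is 2x.
x = [0.0, 0.0, 1.0]
grad_g = [2.0 * c for c in x]
tangential = project_onto_tangent_hyperplane([1.0, 2.0, 3.0], grad_g)
```

At the north pole of the sphere the tangent subspace is the horizontal plane, so the projection simply zeroes the last coordinate of the vector.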
**Example 2.**
In the case LPL condition reads
[TABLE]
for all .
Later we shall consider quadratic case ( is a quadratic function) and confirm fulfillment of condition (10).
Now consider a special 2D example which exhibits possible situations.
**Example 3.**
Let (the set is the circle with center and radius ). Let , where is a parameter. We have
[TABLE]
and .
3.1. Suppose that . Consider and , , with . Note that is equivalent to . The angle between tangent lines to the circle and curve at the point asymptotically equals when . Substituting to the equation we obtain that , . Hence , . On the other hand . Thus the exponent in the LPL condition equals .
3.2. Suppose that . Consider and , , with . The angle between tangent lines to the circle and curve at the point asymptotically equals when and when . Hence , . On the other hand when . Thus the exponent in the LPL condition equals .
We conjecture that these two situations are the only possible ones for quadratic objects.
**Conjecture 1.**
Let be a symmetric matrix, , and . Suppose that the set is a quadric nonempty surface, i.e. , and there exists a unique point of the global minimum for the problem . Then the exponent in the LPL condition near the global minimum equals or .
3.3. The gradient projection algorithm on the unit sphere
Next we consider a special case when . In this case all projections can be explicitly calculated.
**Theorem 2.**
Suppose that is the Lipschitz function with constant , is the Lipschitz function with constant and . Under LPL condition (10) algorithm (6) with converges to a point of minimum with linear rate.
**Proof.** First, let us describe the quantitative connection between the term of optimality condition (7) and the residual .
After simple arithmetical calculations we get
[TABLE]
and
[TABLE]
and calculations are well defined for any nonstationary point , see conditions (7), (8).
Fix a point . We have
[TABLE]
By the previous formula, (4) and definition of by (6) with the next estimate holds
[TABLE]
Hence
[TABLE]
and thus implies . Denoting and using condition (10) we get
[TABLE]
or
[TABLE]
From the latter inequality it follows that
[TABLE]
where .
Now prove the convergence with respect to . Note that obvious condition implies . Thus for and we get
[TABLE]
Hence and is a Cauchy sequence. This implies its convergence to a point with a linear rate, while the inequality and continuity of give . ∎
3.4. The gradient projection algorithm with the metric projection on the tangent plane
The next version of the gradient projection algorithm uses the metric projection of the point on the tangent plane to the set , is a function, at the point . After this step we localize the next point on some segment and find it by bisection of the segment.
For a point , , denote i.e. the tangent subspace to the surface at the point .
**Lemma 1.**
Assume that the function is the Lipschitz continuous with constant and its gradient is also the Lipschitz continuous with constant . Let be a continuously differentiable function and be a surface without edge and a proximally smooth set with constant with for all . Put
[TABLE]
and fix . Let , , ,
[TABLE]
Then
[TABLE]
The maximal value of the function is , and for all .
The Lemma leads to the following algorithm, preferably with .
Gradient Projection on Tangent Hyperplane (GPA2)
Step 1. Let satisfy Lemma 1 condition. Set , , and .
Step 2. Make a step and project onto tangent hyperplane:
[TABLE]
,
Step 3. Find the intersection of the segment with the surface (e.g. by iterative bisection of the segment)
[TABLE]
Step 4. Increase and continue to the Step 2.
**Proof.** The maximality of is obvious. Let us prove (12). By (2) the segment
[TABLE]
has a (unique) intersection with the set . The point can be found by bisection of the segment. See Figure 1 for details.
We have
[TABLE]
,
[TABLE]
[TABLE]
and
[TABLE]
Substituting the last inequality and (15) in Formula (13) we get (12).∎
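The bisection used in Step 3 of GPA2 can be sketched as follows. The endpoints of the segment come from formulas that are not reproduced in this text, so the sketch takes them as given inputs and only assumes that the constraint function changes sign between them; the function name and the circle example are hypothetical.

```python
import math
from typing import Callable, List

Vector = List[float]


def segment_surface_intersection(
    g: Callable[[Vector], float],
    a: Vector,
    b: Vector,
    tol: float = 1e-12,
    max_iterations: int = 200,
) -> Vector:
    """Find a point of the segment [a, b] where g vanishes, by bisection."""

    def point(s: float) -> Vector:
        return [ai + s * (bi - ai) for ai, bi in zip(a, b)]

    g_lo, g_hi = g(a), g(b)
    if g_lo == 0.0:
        return list(a)
    if g_hi == 0.0:
        return list(b)
    if (g_lo > 0.0) == (g_hi > 0.0):
        raise ValueError("g must change sign along the segment")
    lo, hi = 0.0, 1.0
    for _ in range(max_iterations):
        mid = 0.5 * (lo + hi)
        g_mid = g(point(mid))
        if g_mid == 0.0 or hi - lo < tol:
            return point(mid)
        if (g_mid > 0.0) == (g_lo > 0.0):
            lo = mid  # the root lies in the upper half of the bracket
        else:
            hi = mid  # the root lies in the lower half of the bracket
    return point(0.5 * (lo + hi))


# Illustrative surface: the unit circle g(x) = x0^2 + x1^2 - 1.
def g_circle(x: Vector) -> float:
    return x[0] ** 2 + x[1] ** 2 - 1.0


p = segment_surface_intersection(g_circle, [0.5, 0.5], [1.0, 1.0])
```

Each bisection step halves the bracket, so the cost of recovering an admissible point is logarithmic in the required accuracy.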
Note that if the function in Lemma 1 has the Lipschitz continuous gradient with Lipschitz constant and there exists with for all , , then the set is proximally smooth with constant [36, Proposition 4.15].
**Theorem 3.**
Suppose that conditions of Lemma 1 hold, , and the function satisfies the LPL condition with constant on the set . Then the GPA2 with initial condition , converges with linear rate to the minimum point.
**Proof.** Put , where . From the LPL condition for the function on the surface
[TABLE]
for all by Lemma 1 we have
[TABLE]
[TABLE]
Now consider the rate of convergence with respect to the point. By (14) we get
[TABLE]
Using (12) for and we obtain that
[TABLE]
[TABLE]
Due to inequalities for this implies . The end of the proof is standard (compare the proof of Theorem 2). ∎
**Example 4.**
Let be a and proximally smooth with constant manifold without edge, be a strongly convex function (with constant of strong convexity ) with the Lipschitz continuous gradient. Suppose that is the Lipschitz function with constant on the level set and . Then the function satisfies the LPL condition on the set . We shall give a sketch of proof for this fact.
By [4, Lemma 2.1] the function has unique minimum . Fix a point and put . Choose a positive number from the conditions of Lemma 1, less than from [4, Formula (8)] and . Then by strong convexity of the function by [4] we have linear rate of convergence for the GPA1 with step . From Theorem 2.3 [4] for we get
[TABLE]
where [4, Formula (8)] and does not depend upon .
Let , . Then from the definition of we have , thus we get the inequality . By Formula (14) we obtain that . Hence by the Pythagoras theorem
[TABLE]
[TABLE]
[TABLE]
where .
Suppose that . Put . By the supporting principle for proximally smooth sets
[TABLE]
[TABLE]
In the case the last formula also takes place. From (4) and (16) we obtain that
[TABLE]
[TABLE]
3.5. The gradient method combined with the Newton method on the unit sphere
We describe a combination of the gradient projection algorithm and the Newton method for finding a stationary point of the problem . We shall assume that .
Consider again the problem with . Define with the help of the function . For any define the number as a solution of the extremal problem
[TABLE]
thus . Denote by the variable ,
[TABLE]
Fix , where , . Define also
[TABLE]
the element of minimal absolute value in the spectrum of the matrix .
Suppose that and is the Lipschitz continuous with Lipschitz constant on the set , where . Then in the case
[TABLE]
the modified Newton method starting from the point converges with super-linear rate [21, Chapter X, §4, Theorem 1].
Note that .
Gradient Projection — Newton Method (GPA3)
Step 1. Take , and put , .
Step 2. (GPA2 phase) While , perform Steps 2-4 of GPA2, increasing . If , proceed to Step 3.
Step 3. Put , , .
Step 4. (Newton phase) do Newton steps for equation , increasing :
[TABLE]
Conditions of Lemma 1 are satisfied at Step 2. Thus we can do steps (12) of the GPA2 and decrease the function:
[TABLE]
Put . It is easy to see that, by the inequality , we switch to the Newton method at Step 4 after no more than steps of the gradient projection algorithm. If condition (17) is valid at the point , the modified Newton method starting from converges with super-linear rate.
3.6. Quadratic form
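Consider the homogeneous quadratic function $f(x) = (Ax, x)$ with a symmetric real matrix $A$. Denote by $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$ the eigenvalues of $A$, by $e_1, \dots, e_n$ the corresponding eigenvectors, and suppose that $\lambda_1 < \lambda_2$. Then the minimal value of $f$ on the unit sphere is $\lambda_1$ and the two global minimizers are $\pm e_1$. All other eigenvectors are stationary points, but not local minima. Thus problem (1) is equivalent to the problem of finding the minimal eigenvalue and eigenvector, and algorithm (6) has the form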
[TABLE]
Probably, the first gradient-like algorithm for eigenvalue problem has been proposed by Kantorovich [20, Section 3.4]. He converted eigenvalue problem to unconstrained minimization of Rayleigh quotient and obtained the algorithm where was taken from 1D minimization of . One can see that this method has the same form as (18), but has more complicated step-size rule. Kantorovich proved linear convergence of the algorithm.
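The projected gradient iteration for the quadratic form can be sketched as a normalized step along $-Ax$. The sketch absorbs any constant factor of the gradient into the step-size, so `step` below is an assumption and need not coincide with the step in (18); the function names and the 2-by-2 matrix are illustrative only.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(A: Matrix, x: Vector) -> Vector:
    return [sum(a * c for a, c in zip(row, x)) for row in A]


def normalize(y: Vector) -> Vector:
    norm = math.sqrt(sum(c * c for c in y))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [c / norm for c in y]


def min_eigenpair(
    A: Matrix, x0: Vector, step: float, iterations: int
) -> Tuple[float, Vector]:
    """Iterate x <- normalize(x - step * A x); return Rayleigh quotient and x."""
    x = normalize(x0)
    for _ in range(iterations):
        Ax = matvec(A, x)
        x = normalize([xi - step * axi for xi, axi in zip(x, Ax)])
    rayleigh = sum(xi * axi for xi, axi in zip(x, matvec(A, x)))
    return rayleigh, x


# Symmetric test matrix with eigenvalues 1 and 3; the eigenvector for the
# minimal eigenvalue is (1, -1) / sqrt(2) up to sign.
A = [[2.0, 1.0], [1.0, 2.0]]
lam, vec = min_eigenpair(A, [1.0, 0.0], 0.2, 100)
```

The starting point has a nonzero component along the minimal eigenvector, in line with the requirement discussed in the text; with a zero component the iteration cannot reach the global minimizer.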
We analyse iterative process (18) by use of the above presented results. For we shall prove convergence to or depending on the sign of . Suppose that (if , there is no convergence to global minimum). It is obvious that if , the same is true for decompositions of all iterates , thus all of them remain in the open half-sphere . We also introduce for .
**Lemma 2.**
Fix . Suppose that are eigenvalues of . Then the quadratic function satisfies the LPL condition on the set with ,
**Proof.** Express any point through the residual vector : . From we have . Put . Using notation for the unit vector (see (19)), and for the diagonal matrix with strictly positive diagonal elements
[TABLE]
we have the next obvious equalities below
[TABLE]
It is clear that .
From the equality we get
[TABLE]
By the latter expression has the form
[TABLE]
for all . The inequality holds for , and thus the LPL condition takes place with .∎
Thus for any we get (denominator from (11) is bounded from above by for the quadratic function), while asymptotically , and .
Condition can be weakened. If , then the function also satisfies the LPL condition on the set of the unit sphere for any in the basis from eigenvectors.
Indeed, put , , . Then
[TABLE]
For the function we have the LPL condition on the set with constant , i.e.
[TABLE]
or equivalently
[TABLE]
4. The Frank-Wolfe method
The Frank-Wolfe method (also known as the conditional gradient method) has been proposed for minimization of a convex quadratic function on a convex set [15] and later was extended for general convex objectives, see e.g. [31] and recent survey [14]. The idea of the method for problem (1) is to solve (on each step) the auxiliary problem
[TABLE]
find that minimizes and take the next point . The method requires minimization of a linear function on the admissible set at each iteration. There are also some extensions of the method for nonconvex objective functions and for matrix optimization [26, 37].
4.1. Minimization of an approximately linear function
However our problem (1) deals with non-convex admissible set . We consider a special version of the FW method for our problem as a limiting version of the gradient projection method (6). Indeed suppose that the function is approximately linear (see (23) below) on the set , i.e., informally, constant is small enough in comparison with other parameters. For this extreme case method (6) turns into the next iteration process
[TABLE]
This is exactly the FW method with full step: we take the linearized function, find its minimum on $Q$ and proceed to the minimum point. Notice that standard versions of the FW method only make a step in the direction of the minimizer; the full-step version diverges in the general case.
Full-step Frank-Wolfe method (FFW)
Step 1. Take and set .
Step 2. Solve auxiliary problem
[TABLE]
Step 3. Update
Step 4. Increase and go to Step 2.
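The full-step scheme above can be sketched for the unit sphere, where the linear auxiliary problem has the closed-form solution $-c/\|c\|$ for a nonzero vector $c$. The objective below is an assumed example of a function dominated by its linear part; the names `full_step_frank_wolfe`, `C_VEC`, `D` and `EPS` are hypothetical and the constants are not checked against condition (23).

```python
import math
from typing import Callable, List

Vector = List[float]


def linear_minimizer_on_unit_sphere(c: Vector) -> Vector:
    """Minimizer of the linear function (c, x) over the unit sphere, c != 0."""
    norm = math.sqrt(sum(ci * ci for ci in c))
    if norm == 0.0:
        raise ValueError("the linear problem is degenerate for a zero gradient")
    return [-ci / norm for ci in c]


def full_step_frank_wolfe(
    grad: Callable[[Vector], Vector], x0: Vector, iterations: int
) -> Vector:
    """Replace the iterate by the minimizer of the linearized objective."""
    x = list(x0)
    for _ in range(iterations):
        x = linear_minimizer_on_unit_sphere(grad(x))
    return x


# Assumed objective: f(x) = (c, x) + 0.5 * eps * sum(d_i * x_i^2) with small eps.
C_VEC = [3.0, 4.0, 0.0]
D = [1.0, 2.0, 3.0]
EPS = 0.1


def grad_f(x: Vector) -> Vector:
    return [ci + EPS * di * xi for ci, di, xi in zip(C_VEC, D, x)]


x_fw = full_step_frank_wolfe(grad_f, [0.0, 0.0, 1.0], 50)
```

The limit is a fixed point of the map, i.e. the gradient there is antiparallel to the point itself, which is the stationarity condition on the sphere; with a larger `EPS` the same loop may fail to converge, matching the remark that the full-step version diverges in general.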
To get the rigorous validation of method (22) we need specification of the above mentioned approach. A function defined on the ball with -Lipschitz gradient (3) is called approximately linear on if
[TABLE]
**Theorem 4.**
Suppose that (23) holds. There are just two stationary points in problem (1) , and FFW method (22) converges to for arbitrary with linear rate
[TABLE]
Theorem 4 is close to Theorem 4.3 from [32], where minimization on instead of has been considered. However, the solutions of these two problems coincide under condition (24). The proof of Theorem 4 follows from the following fact regarding strongly convex sets of radius and functions with the Lipschitz continuous gradient [3].
Suppose that is a strongly convex set of radius and a function has the Lipschitz continuous gradient with constant , and . Then the iteration process ,
[TABLE]
converges to the unique solution of the problem with linear rate:
[TABLE]
We shall further consider a closed surface in which is the boundary of some strongly convex set of radius , i.e. . It is worth noting that is not necessarily smooth.
4.2. Another gradient domination condition
We introduce a sort of the gradient domination condition, formulated at a stationary point of (1). Assume that the next inequality
[TABLE]
holds. This condition resembles the sharp minimum condition [31].
**Theorem 5.**
Let be a strongly convex set of radius , , . Suppose that is a function with the Lipschitz continuous gradient with constant . If and then , i.e. is the strict global minimum of the function on the set , and hence on the boundary .
**Proof.** Put . Then by the supporting principle for strongly convex sets
[TABLE]
Fix a number and a unit vector such that
[TABLE]
We claim that .
By Formula (4)
[TABLE]
From (27) we get . Hence or . By inequality we obtain that
[TABLE]
Further
[TABLE]
and . The last two formulae and (28) give the next estimate
[TABLE]
and, bearing in mind (29), we have . By inclusions (26), (27) .∎
**Corollary 1.**
If and other assumptions of Theorem 5 hold then is also a global minimum of on the set , but this minimum is not necessarily strict.
The particular example is given by the set and the function .
**Corollary 2.**
If in Theorem 5 and other assumptions of Theorem 5 hold then is a stationary point of the function on the set but not necessarily the minimum point.
**Example 5.**
Consider an example in . Fix , let .
Define the function
[TABLE]
and for all .
Consider the problem . We have and thus is Lipschitz continuous with constant .
The point is a stationary point: , , i.e. and . But is not a local minimum. The solution of the problem is with .∎
Define the function for . The function is convex, monotonically increasing and for . By convexity of we have for any that . Note that if .
For any real number define as follows
[TABLE]
**Theorem 6.**
Let be a strongly convex set of radius , , . Suppose that is a function with a Lipschitz continuous gradient with constant , and . Fix a point , and . Then the iterations
[TABLE]
of the FFW method converge to the point with linear rate:
[TABLE]
for all .
Note that by Theorem 5 is the strict global minimum of on the set . Also due to the choice of .
**Proof.** Put .
From the inequality (which follows by induction) and the inclusion the sine of the angle between and is estimated as follows
[TABLE]
Put , . From the triangle we have . By strong convexity of the set and inequality we obtain that
[TABLE]
∎
**Theorem 7.**
Let be a strongly convex set of radius , , . Suppose that is a function with a Lipschitz continuous gradient with constant and for any point we have the inequality . Then for any choice of the initial point the iterations
[TABLE]
converge to the global strict minimum with linear rate:
[TABLE]
for all .
**Proof.** Put and .
Suppose that for all .
We prove that .
[TABLE]
From the supporting principle for strongly convex sets
[TABLE]
or
[TABLE]
Hence we obtain the following estimate
[TABLE]
and .
For any unit vectors and numbers we have . Using the last inequality and strong convexity of the set we get
[TABLE]
Thus the sequence converges, . Passing to the limit as in the inclusion and using upper semicontinuity of the normal cone we have , . By Theorem 5 the point is the strict global minimum. ∎
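To see this kind of linear convergence on a strongly convex set that is not a ball, here is a small sketch under stated assumptions. None of the specifics come from the paper: the set is the ellipse (x1/a)^2 + (x2/b)^2 <= 1 with a = 2, b = 1, the function is f(x) = 0.5 * ||x - c||^2 with the hypothetical point c = (0, 6), the full step is read as x_{k+1} = argmin over x in Q of (f'(x_k), x), and `lmo_ellipse`, `ffw_ellipse`, `grad_f` are illustrative names. The linear subproblem over the ellipse is solved in closed form.

```python
import math


def lmo_ellipse(g, a, b):
    """argmin of <g, x> over the ellipse (x1/a)^2 + (x2/b)^2 <= 1, for g != 0."""
    scale = math.sqrt((a * g[0]) ** 2 + (b * g[1]) ** 2)
    return [-a * a * g[0] / scale, -b * b * g[1] / scale]


def ffw_ellipse(grad, x0, a, b, steps):
    """Full-step Frank-Wolfe: every new iterate is the output of the linear oracle."""
    xs = [list(x0)]
    for _ in range(steps):
        xs.append(lmo_ellipse(grad(xs[-1]), a, b))
    return xs


a, b = 2.0, 1.0   # semi-axes of the hypothetical ellipse
c = (0.0, 6.0)    # hypothetical point defining f(x) = 0.5 * ||x - c||^2


def grad_f(x):
    return [x[0] - c[0], x[1] - c[1]]


x_star = [0.0, 1.0]  # point of the ellipse closest to c, i.e. the minimizer of f
xs = ffw_ellipse(grad_f, [2.0, 0.0], a, b, steps=40)
errors = [math.dist(x, x_star) for x in xs]
ratios = [e1 / e0 for e0, e1 in zip(errors, errors[1:])]
print(max(ratios), errors[-1])
```

In this toy instance the gradient has Lipschitz constant 1, the ellipse is strongly convex with radius a^2/b = 4, and the gradient norm on the set is at least 5. The observed per-step error ratios stay below 0.8 and approach that value, so the iterates converge geometrically. These numbers describe the example only and are not constants quoted from the theorem.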
We now consider examples that show the importance of the condition in both Theorems 6 and 7.
**Example 6.**
Suppose that and are the function and the set from Example 5. Notice that in this case . Take a starting point for the FFW algorithm with , . Then and one step of the FFW method leads us to the stationary point . However, the extremum is the point , see Figure 2.
**Example 7.**
Consider the set . Let . For the function we have , , i.e. ,
[TABLE]
and . Put and , , with . As we have seen in Subsection 3.2, the angle between the tangent lines to the circle and the curve at the point asymptotically equals when . Starting the FFW algorithm from the point we obtain the next point . We have when . The latter means that
[TABLE]
Thus for any sufficiently small we get . Hence there is no linear rate of convergence.
5. Appendix
**Proposition 1.**
Suppose that the set is proximally smooth with constant , the function has a Lipschitz continuous gradient with constant and the point is a local minimum in the problem . Then .
**Proof.** If then by the definition of the normal cone.
Assume that . We prove the proposition by contradiction. Put . Suppose that . Then by the supporting principle for a proximally smooth set we have
[TABLE]
For a set define by the intersection of all closed balls of radius each of which contains the set . By [6, Lemmata 4.13, 4.16] there exists a continuous curve with endpoints and such that . By the inclusion there exists with ,
[TABLE]
Let be a 2-dimensional plane, . Choose a point , , , . The angle between the arcs and at the point (see Figure 3) is strictly positive and hence the angle between and is less than . Thus there exists (that does not depend on the vector ) with .
By Formula (4)
[TABLE]
and
[TABLE]
Hence
[TABLE]
for sufficiently small . This is a contradiction. ∎
Acknowledgements
The work was supported by the Russian Science Foundation (Project 16-11-10015).
- [1] P.-A. Absil, R. Mahony, R. Sepulchre, Optimization Algorithms on Matrix Manifolds, Princeton University Press, Princeton and Oxford, 2008.
- [2] M. V. Balashov, Uniformly convex subsets of the Hilbert space with modulus of convexity of the second order, Journal of Math. Anal. Appl., 377:2 (2011), 754–761.
- [3] M. V. Balashov, Maximization of a function with Lipschitz continuous gradient, Journal of Mathematical Sciences, 209:1 (2015), 12–18.
- [4] M. V. Balashov, About the gradient projection algorithm for a strongly convex function and a proximally smooth set, Journal of Convex Analysis, 24:2 (2017), 493–500.
- [5] M. V. Balashov, E. S. Polovinkin, M-strongly convex subsets and their generating sets, Sbornik: Mathematics, 191:1 (2000), 25–60.
- [6] M. V. Balashov, G. E. Ivanov, Weakly convex and proximally smooth sets in Banach spaces, Izv. RAN. Ser. Mat., 73:3 (2009), 23–66.
- [7] D. Bertsekas, Mathematical Programming, Athena-Publishing, 2013.
- [8] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
