Accelerated method of finding for the minimum of arbitrary Lipschitz convex function
I.M. Prudnikov

TL;DR
This paper introduces a novel optimization method for nonsmooth convex functions that achieves superlinear convergence by using a variable-dependent averaging technique, enabling second-order methods on smooth approximations.
Contribution
It develops a new approximation approach using set-valued mappings that transforms nonsmooth convex functions into twice differentiable convex functions for faster optimization.
Findings
Achieves superlinear convergence rate.
Transforms nonsmooth functions into smooth approximations.
Enables second-order optimization methods on nonsmooth problems.
Abstract
The goal of the paper is the development of an optimization method with a superlinear convergence rate for a nonsmooth convex function. The optimization uses an approximation similar to Steklov integral averaging. The difference is that the averaging is performed over a variable-dependent set, called a set-valued mapping (SVM), that satisfies simple conditions. The novelty of the approach is that such an approximation yields twice continuously differentiable convex functions, for whose optimization second-order methods are applied. An estimate of the convergence rate of the method is given.
Igor Mihailovich Prudnikov, Scientific Center of Smolensk Federal Medical University, Smolensk, Russia, 214000
email: pim [email protected]
The accelerated method for finding the minimum of a nonsmooth finite convex function
Igor M. Prudnikov
Abstract
The goal of the paper is the development of an optimization method with a superlinear convergence rate for a nonsmooth convex function. The optimization uses an approximation similar to Steklov integral averaging. The difference is that the averaging is performed over a variable-dependent set, called a set-valued mapping (SVM), that satisfies simple conditions. The novelty of the approach is that such an approximation yields twice continuously differentiable convex functions, for whose optimization second-order methods are used. The rate of convergence of the method is estimated.
Keywords:
Lipschitz functions; convex functions; generalized gradients; necessary and sufficient conditions of optimality; Steklov integral; Clarke subdifferential; Lebesgue integrals; generalized matrices of second derivatives; Newton optimization methods for Lipschitz functions
MSC:
49J52 90C30 90C31
Journal: JOTA
1 Introduction
Nonsmooth (non-differentiable) or insufficiently smooth functions are widely used in economics, data processing, control theory, artificial intelligence and other fields. Typical examples are functions obtained by taking the minimum or maximum of other functions.
Nonsmooth functions may not have derivatives at some points. It is known that a Lipschitz function is differentiable almost everywhere (a.e.) in $\mathbb{R}^n$ [rademacher]. At points of non-differentiability, generalized gradients are used instead of gradients. Optimization methods for such functions differ from the optimization methods for smooth (differentiable) functions.
In this paper the author continues research on constructing an optimization method for Lipschitz functions using Steklov integrals and similar integrals in which the set over which the averaging is taken is a function of the variable.
This approach gives twice differentiable functions whose stationary points coincide with the stationary points of the original function, in contrast to the case when the averaging is done over sets independent of $x$. For such functions second-order optimization methods can be used; they are tested for arbitrary convex functions, with an estimate of the convergence rate.
If the gradients are discontinuous as functions of the variables, then in the general case it is very difficult to construct optimization methods and estimate their convergence rates. Approximating the original function by a polynomial and then optimizing the resulting smooth function by known methods [pshenichnyidanilin] does not solve the optimization problem, since it leads to the appearance of new extremum points located far from the extremum points of the original function.
Separating fictitious extremum points from real ones is as complex a problem as the initial one. Therefore, the theory of nonsmooth functions developed its own methods, based on the properties of the generalized gradients of Lipschitz functions. Here it is worth mentioning the works [pshenichnyidanilin]-[clarke] of N. Z. Shor, B. N. Pshenichny, V. F. Demyanov, E. A. Nurminsky, F. Clarke, R. T. Rockafellar, L. N. Polyakova.
To construct accelerated optimization methods for nonsmooth functions, it is necessary to define constructions to which second-order optimization methods are applicable. These constructions must be such that existing extremum points do not disappear and new ones do not appear.
The paper proposes exactly such a method for smoothing nonsmooth functions. The resulting function is continuously differentiable. Applying the averaging operation to it once more gives a twice differentiable function.
If we average over sets depending on the variable $x$, then we obtain a continuously differentiable function whose stationary points coincide with the stationary points of the original function. If we repeat the averaging procedure, we get twice differentiable functions to which second-order optimization methods with accelerated convergence can be applied.
With the help of the functions defined in this way, it is possible to pass from local optimization of nonsmooth functions to local optimization of smooth functions, and also to estimate the rate of convergence to an extremum point. This is important because it makes it possible to develop accelerated optimization methods for functions with discontinuous gradients. As far as the author knows, similar constructions have not been proposed previously.
2 Smoothing integral functions
Let $f(\cdot): \mathbb{R}^n \to \mathbb{R}$ be a Lipschitz function with constant $L$, and let $x_*$ be a point of its local minimum (maximum) in $\mathbb{R}^n$. As is known, a necessary condition for an extremum of a Lipschitz function at the point $x_*$ is that zero belongs to the Clarke subdifferential $\partial_{CL} f(x_*)$ calculated at $x_*$, i.e.
$$0 \in \partial_{CL} f(x_*).$$
Any point at which this condition holds is called a stationary point. Not all stationary points are minimum or maximum points.
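As a small illustration of the stationarity condition, the following sketch approximates the Clarke subdifferential of a one-dimensional piecewise-smooth function by the interval between its one-sided difference quotients and tests whether that interval contains zero. The helper names `clarke_interval_1d`, `is_stationary` and `kink`, the step `eps` and the tolerance `tol` are assumptions made for this example and are not taken from the paper; the interval description is valid only in this one-dimensional piecewise-smooth setting.

```python
def clarke_interval_1d(f, x, eps=1e-6):
    """Approximate the Clarke subdifferential of a piecewise-smooth 1-D function at x.

    Assumption: in one dimension it is the interval spanned by the left and
    right derivative limits, estimated here by one-sided difference quotients.
    """
    left = (f(x) - f(x - eps)) / eps
    right = (f(x + eps) - f(x)) / eps
    return min(left, right), max(left, right)


def is_stationary(f, x, tol=1e-4):
    """Return True if zero lies (up to tol) in the approximate subdifferential."""
    lo, hi = clarke_interval_1d(f, x)
    return lo - tol <= 0.0 <= hi + tol


def kink(x):
    """Stationary at 0, although 0 is neither a minimum nor a maximum."""
    return x if x < 0 else x * x


print(clarke_interval_1d(abs, 0.0))   # interval close to (-1, 1)
print(is_stationary(abs, 0.0))        # minimum point of |x|
print(is_stationary(abs, 1.0))        # smooth point with slope 1
print(is_stationary(kink, 0.0))       # stationary, but not an extremum
```

The last call illustrates the remark that a stationary point need not be a minimum or a maximum: the approximate subdifferential of `kink` at zero is close to the interval from 0 to 1, which contains zero, while the function changes sign there.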
Let us take an arbitrary convex compact set $D \subset \mathbb{R}^n$, $0 \in \operatorname{int} D$. We introduce the definition of the $\varepsilon(D)$-stationary point.
Definition 1. A point $x$ is called an $\varepsilon(D)$-stationary point of the function $f$ if the set $x + D$ includes a stationary point of the function $f$.
This definition agrees with the definition of the $\varepsilon$-stationary point for convex functions [rocafellar], because for strongly convex functions the distance from a stationary point to the minimum point can be estimated by the difference of the values of the function at these points.
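Define the function
$$\varphi(x) = \frac{1}{\mu(D)} \int_{D} f(x+y)\,dy,$$
where $\mu(D)$ is the measure of the domain $D$, $\mu(D) > 0$.
Obviously, $\varphi$ is continuous. Let us show that $\varphi$ is a Lipschitz function with Lipschitz constant equal to the Lipschitz constant $L$ of the function $f$. Indeed,
$$|\varphi(x_1) - \varphi(x_2)| = \frac{1}{\mu(D)} \left| \int_{D} \bigl(f(x_1+y) - f(x_2+y)\bigr)\,dy \right|$$
$$\leq \frac{1}{\mu(D)} \int_{D} L \|x_1 - x_2\|\,dy = L \|x_1 - x_2\|.$$
The function $f$ is Lipschitz, and therefore it is differentiable a.e. in $\mathbb{R}^n$ [rademacher]. Let $N_f$ denote the set of points of differentiability of the function $f$ in $\mathbb{R}^n$. It is known that $N_f$ is everywhere dense in $\mathbb{R}^n$ and, in particular, in $x + D$, because $\mu(D) > 0$ by assumption.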
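To make the averaging concrete, here is a minimal one-dimensional sketch. It assumes that the averaged function has the Steklov form $\frac{1}{\mu(D)}\int_D f(x+y)\,dy$ with $D = [-h, h]$, and it takes $f(x) = |x|$ as the test function. The names `steklov_average` and `phi_abs`, the half-width `h` and the number of subintervals `n` are illustrative assumptions, not notation from the paper. The closed form used for comparison follows from direct integration of $|t|$ over $[x-h, x+h]$.

```python
def steklov_average(f, x, h, n=2000):
    """Average f over [x - h, x + h] using the composite trapezoid rule."""
    step = 2.0 * h / n
    total = 0.5 * (f(x - h) + f(x + h))
    for i in range(1, n):
        total += f(x - h + i * step)
    return total * step / (2.0 * h)


def phi_abs(x, h):
    """Closed form of the average of |t| over [x - h, x + h]."""
    if abs(x) <= h:
        return (x * x + h * h) / (2.0 * h)
    return abs(x)


h = 1.0
for x in (-2.0, -0.3, 0.0, 0.5, 3.0):
    # numerical average next to the closed form
    print(x, steklov_average(abs, x, h), phi_abs(x, h))
```

In this example the kink of $|x|$ at zero is replaced by a parabola on $[-h, h]$, while the function is unchanged outside that interval, and the Lipschitz constant does not grow.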
The following theorem was proved in [proudintegapp1].
Theorem 2.1
For an arbitrary Lipschitz function $f$, the function
$$\varphi(x) = \frac{1}{\mu(D)} \int_{D} f(x+y)\,dy,$$
where $D$ is any domain in $\mathbb{R}^n$ and $\mu(D)$ is the measure of the domain $D$, $\mu(D) > 0$, is a continuously differentiable function with the derivative
$$\varphi'(x) = \frac{1}{\mu(D)} \int_{D} f'(x+y)\,dy. \tag{2}$$
Remark 2.1
Here we use Lebesgue integration.
Remark 2.2
The derivatives of the function $f$ are taken at those points where they exist.
It was also proved in [proudintegapp1] that if $f$ is Lipschitz, then $\varphi'$ is also a Lipschitz function.
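The derivative formula can be checked numerically in one dimension. The sketch below assumes $f(x) = |x|$ and $D = [-h, h]$, so that the average of $f'$ over $[x-h, x+h]$ equals $(f(x+h) - f(x-h))/(2h)$ by the fundamental theorem of calculus. The names `phi_abs`, `dphi_abs` and `central_difference` and the step `delta` are assumptions of this example.

```python
def phi_abs(x, h):
    """Average of |t| over [x - h, x + h] (closed form)."""
    if abs(x) <= h:
        return (x * x + h * h) / (2.0 * h)
    return abs(x)


def dphi_abs(x, h):
    """Average of sign(t) over [x - h, x + h]: (|x + h| - |x - h|) / (2h)."""
    return (abs(x + h) - abs(x - h)) / (2.0 * h)


def central_difference(g, x, delta=1e-6):
    """Symmetric difference quotient used as an independent derivative estimate."""
    return (g(x + delta) - g(x - delta)) / (2.0 * delta)


h = 1.0
for x in (-2.0, -0.5, 0.0, 0.3, 2.5):
    numeric = central_difference(lambda t: phi_abs(t, h), x)
    print(x, numeric, dphi_abs(x, h))
```

In this example the derivative of the averaged function is the clipped linear function $x/h$ on $[-h, h]$ and $\pm 1$ outside, so it is continuous and Lipschitz with constant $1/h$, although $f'$ itself jumps at zero.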
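Consider the function
$$\psi(x) = \frac{1}{\mu(D)} \int_{D} \varphi(x+y)\,dy.$$
Since $\varphi$ is Lipschitz, we have
$$\psi'(x) = \frac{1}{\mu(D)} \int_{D} \varphi'(x+y)\,dy.$$
Since $\varphi'$ is continuous, $\psi$ is a continuously differentiable function. Since $\varphi'$ is Lipschitz, we can differentiate (2). As a result, we have
$$\psi''(x) = \frac{1}{\mu(D)} \int_{D} \varphi''(x+y)\,dy, \tag{3}$$
i.e. $\psi$ is a twice continuously differentiable function.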
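The effect of the second averaging can be seen in the same one-dimensional example. Assuming $f(x) = |x|$ and $D = [-h, h]$, the second derivative of the doubly averaged function is the average of the second derivative of the singly averaged one, which in one dimension equals a difference quotient of its first derivative. The sketch compares this with the "tent" function obtained by hand; all function names are assumptions of this example.

```python
def dphi_abs(t, h):
    """Derivative of the single average of |t| over [t - h, t + h]."""
    return max(-1.0, min(1.0, t / h))


def d2psi_abs(x, h):
    """Second derivative of the double average: (phi'(x + h) - phi'(x - h)) / (2h)."""
    return (dphi_abs(x + h, h) - dphi_abs(x - h, h)) / (2.0 * h)


def tent(x, h):
    """Hand-derived form: piecewise linear, peak 1/h at 0, zero for |x| >= 2h."""
    return max(0.0, 1.0 - abs(x) / (2.0 * h)) / h


h = 0.5
for x in (-1.5, -1.0, -0.3, 0.0, 0.25, 1.0, 2.0):
    print(x, d2psi_abs(x, h), tent(x, h))
```

In this example the second derivative is continuous and piecewise linear with slope at most $1/(2h^2)$, so it satisfies a Lipschitz condition, which is the property that second-order methods rely on.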
It can be shown proudintegapp that the function is Lipschitz with a constant , depending on the set . If is a ball or a cube in , then we can take , where is the diameter of the set , is the Lipschitz constant of the function .
Remark 2.3
The integration in (3) is understood, as before, in the sense of Lebesgue.
If $x_*$ is a point of local maximum or minimum of the function $f$, then for sufficiently small and the point is also the local minimum or maximum point of the function $\varphi$. But unlike the function $f$, the function $\varphi$ is continuously differentiable. A similar statement is true for the function $\psi$, i.e. the point is a point of local minimum or maximum of the function $\psi$. But unlike the functions $f$ and $\varphi$, the function $\psi$ is twice continuously differentiable, and its matrix of second mixed derivatives satisfies the Lipschitz condition. To optimize $\psi$ we can use second-order methods.
The functions $\varphi$ and $\psi$ also retain many properties of the function $f$. An important property for applications is that if $f$ is convex with respect to all or some of the variables, then $\varphi$ and $\psi$ are also convex with respect to the same variables [proudintegapp].
Let us see which stationary points the function $\varphi$ has. According to formula (2), a stationary point of the function $\varphi$ is a point $x$ for which
$$\int_{D} f'(x+y)\,dy = 0. \tag{4}$$
We will show that a stationary point of the function $f$ belongs to the set $x + D$.
The integral in (4) can be represented with any degree of accuracy in the form
[TABLE]
where , are subregions of the set , are their measures,
[TABLE]
The sum (5) is the convex hull of the vectors . Really,
[TABLE]
where and
According to the equality (4), the sum (6) can be made arbitrarily small for large (for small ). Since the convex hull of any vectors is a closed set and the convex hull of generalized gradients is a collinear vector to some generalized gradient of the function at a point , , we obtain that the sum (6) is a vector tending to zero generalized gradient as . In other words, there exists a point , with a zero generalized gradient of the function .
Therefore, a stationary point of the function $f$ belongs to the set $x + D$. Hence, by definition, $x$ is an $\varepsilon(D)$-stationary point of $f$. Thus, the following theorem is proved.
Theorem 2.2
All stationary points of the function $\varphi$ are $\varepsilon(D)$-stationary points of the function $f$.
Similar reasoning is true for the function $\psi$.
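A one-dimensional example shows how a stationary point of the averaged function relates to a stationary point of the original one. The sketch assumes $f(x) = \max(x, -2x)$, whose minimizer is $x = 0$, and averaging over $[x-h, x+h]$. In that case the derivative of the average is $(f(x+h) - f(x-h))/(2h)$, and solving by hand gives the root $x = h/3$. The names `f`, `dphi` and `stationary_point_of_phi` and the use of bisection are assumptions of this example.

```python
def f(x):
    """Convex test function that is nonsmooth at its minimizer x = 0."""
    return max(x, -2.0 * x)


def dphi(x, h):
    """Derivative of the average of f over [x - h, x + h] (1-D case)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def stationary_point_of_phi(h, iters=200):
    """Bisection on [-h, h], where dphi(-h) = -2 < 0 < 1 = dphi(h)."""
    lo, hi = -h, h
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dphi(mid, h) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


for h in (1.0, 0.1, 0.001):
    x = stationary_point_of_phi(h)
    # the interval [x - h, x + h] contains the minimizer 0 of f
    print(h, x, x - h <= 0.0 <= x + h)
```

The stationary point of the averaged function is not the minimizer of $f$ itself, but the averaging set centered at it contains the minimizer, and the gap shrinks in proportion to $h$.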
Corollary 2.1
All stationary points of the function are the stationary points of the function or the stationary points of the function .
Corollary 2.2
If is a local minimum point of the function , for which there exists a neighborhood where
[TABLE]
then there exists a convex compact set and a point , where and , i.e. the point is the stationary point of the function .
The same is true for the local maximum point of the function .
To find the stationary points of the function $f$, we apply second-order optimization methods to the function $\psi$. A numerical optimization method will be given whose rate of convergence to a stationary point of the function $f$ is faster than any geometric progression.
3 Search algorithm for stationary points of the Lipschitz function
Let us take a sequence of sets $D_k$ with non-empty interior whose diameters tend to zero as $k$ increases. Let be for as . We introduce a sequence of the functions
[TABLE]
and
[TABLE]
Let the inequality be true for the matrix of the second mixed derivatives of the function . It is proved in proudintegapp that We will consider instead of the function the function :
[TABLE]
for any fixed point and .
As a result, the inequality
[TABLE]
is true where is the matrix of the second mixed derivatives of the function with respect to the variable .
Note that if the function is bounded below, then the function is also bounded below for any points and from . Also, it is clear that , where is the gradient of the function at the point .
We assume that the functions and are bounded below and reach their infimum at some points.
Search method for a stationary point
Let the point $x_k$ at the $k$-th step have already been built. Construct the point $x_{k+1}$. We put by definition .
- Calculate
- Find a non-negative integer for which
[TABLE]
- We assume , .
- With increasing we decrease such that the inequality
[TABLE]
holds for some sequence , where . Go to step 1.
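The general shape of such a search can be sketched in one dimension. The formulas for the direction and for the step test are not recoverable from this text, so the sketch substitutes a standard regularized Newton direction and an Armijo-type step-halving test; it should be read as an illustration under these assumptions, not as the author's method. It minimizes $f(x) = |x - 1|$ through the double average of $|u|$ over $[-h, h]$, using closed forms derived for this example. The names `psi`, `dpsi`, `d2psi`, `newton_search` and the parameters `h`, `eps`, `c`, `tol` are hypothetical.

```python
def psi(u, h):
    """Double average of |u| over [-h, h] (closed form, 1-D)."""
    a = abs(u)
    if a >= 2.0 * h:
        return a
    return 2.0 * h / 3.0 + a * a / (2.0 * h) - a ** 3 / (12.0 * h * h)


def dpsi(u, h):
    """First derivative of psi."""
    a = abs(u)
    sign = 1.0 if u >= 0.0 else -1.0
    if a >= 2.0 * h:
        return sign
    return sign * (a / h - a * a / (4.0 * h * h))


def d2psi(u, h):
    """Second derivative of psi (a tent function, zero for |u| >= 2h)."""
    a = abs(u)
    return 0.0 if a >= 2.0 * h else (1.0 - a / (2.0 * h)) / h


def newton_search(x0, h=0.5, eps=0.05, c=0.25, tol=1e-7, max_iter=200):
    """Damped Newton iteration on psi(x - 1, h) + eps * (x - x_k)**2.

    Assumption: the quadratic term eps * (x - x_k)**2 keeps the second
    derivative positive where psi is locally linear.
    """
    x = x0
    for _ in range(max_iter):
        g = dpsi(x - 1.0, h)
        if abs(g) <= tol:
            break
        p = -g / (d2psi(x - 1.0, h) + 2.0 * eps)   # regularized Newton direction
        t = 1.0
        for _ in range(50):                         # step halving: t = 2**(-i)
            trial = psi(x + t * p - 1.0, h) + eps * (t * p) ** 2
            if trial <= psi(x - 1.0, h) + c * t * g * p:
                break
            t *= 0.5
        x += t * p
    return x


print(newton_search(5.0))   # approaches the minimizer x = 1 of |x - 1|
```

Far from the kink the smoothed function is linear, so the first steps are shortened by the halving test; once the iterate enters the region where the second derivative is positive, the full step is accepted, which matches the qualitative behavior described in the text.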
Let us show that as and the number mentioned in operation exists. Expand the function in a neighborhood of the point in the Taylor series
[TABLE]
where is a uniformly infinitesimal function in .
As soon as , then . Consequently, . Therefore, we can rewrite (10) in the form
[TABLE]
Since is a uniformly infinitesimal function with respect to , the inequality
[TABLE]
is true for large where as .
From (11) we have
[TABLE]
[TABLE]
The value tends to zero as . Therefore, for small and, consequently, for small , we get
[TABLE]
It follows from here that the inequality
[TABLE]
is true for sufficiently small and any . Therefore, tends to zero as , since otherwise, as follows from (13), the function would decrease in value along the direction at the -th step. This contradicts the lower boundedness of the function for all and .
We will show that when the requirements of the step are fulfilled, the function is uniformly infinitesimal in and . From (10) for we have
[TABLE]
We will use the mean value theorem. Then
[TABLE]
for . Substituting the resulting expression into (14), we have
[TABLE]
We use the mean value theorem again for the derivatives of the function
[TABLE]
Therefore
[TABLE]
It follows from the Lipschitz property of the gradient with the constant that the following estimate
[TABLE]
is true if (7) is satisfied.
It follows from here that the functions and are uniformly infinitesimal with respect to and . Therefore, for small the inequality (12) will be correct for . Consequently, the inequality (8) is satisfied for and the process goes with the full step .
Theorem 3.1
Any limit point of the sequence $\{x_k\}$ constructed according to the algorithm 1-4 is a stationary point of the function $f$.
Proof. We have already proved that for small the process goes with the full step . Since the functions are uniformly bounded below on and the inequality (13) is true for all and , then and for . Therefore, the sequence has limit points.
The following equalities
[TABLE]
are correct where all in (10) are uniformly infinitesimal in .
It follows from the definition of the function that the gradient is a convex hull of the generalized gradients of the function .
Taking into account what was said above about and , and also the upper semicontinuity of the Clarke subdifferential mapping [demrub1], [clarke], we conclude that the inclusion $0 \in \partial_{CL} f(\bar{x})$ holds at a limit point $\bar{x}$, i.e. $\bar{x}$ is a stationary point of the function $f$. The theorem is proved.
To estimate the rate of convergence, we assume that is convex and almost everywhere
[TABLE]
From proudintegapp1 it follows that is also convex and for some
[TABLE]
Define the function
[TABLE]
for each and where is positive number depending on and tending to zero as . To search for a stationary point of the function , we use the algorithm described below.
Let
[TABLE]
Since as , we assume that for all .
We first introduce the coherence conditions, which give the rules by which the parameters and tend to infinity in a coordinated way. We will write them briefly in the form of the dependence . Denote by the constant bounding from above the norm of the matrix . During the optimization process we satisfy the coherence conditions:
1. as ;
2. for convergence with a superlinear rate, we require that
[TABLE]
as , where is an upper bound of the function obtained from the expansion (10) of the function at the $k$-th step. It is clear that as .
Conditions 1 and 2 can be easily satisfied. At first the optimization process runs with constant . As soon as the step size becomes sufficiently small, which means that is large enough, we increase , decrease the diameter and, consequently, increase so as to satisfy coherence conditions 1 and 2. As we shall see below, is the coefficient of proportionality between and . Therefore, we are able to estimate from the coefficient of proportionality between and and thus to satisfy condition 2 of the coherence conditions.
Superlinear optimization method for finding the minimum point of any finite convex function
Let the point $x_k$ have already been found. Construct the point $x_{k+1}$.
- Calculate the $k$-th step.
[TABLE]
- Find a non-negative integer for which
[TABLE]
- We put , .
- Calculate for
[TABLE]
- If
[TABLE]
for an arbitrarily chosen sequence , then we increase such that the inequality
[TABLE]
remains in force.
- Go to step 1 and continue until the step size becomes less than the specified value.
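The idea of coordinating the size of the averaging set with the progress of the iteration can be sketched as follows. This is not a reconstruction of the steps above, whose formulas are missing from this text; it is a simplified schedule in which a damped Newton method runs with a fixed half-width `h` until the gradient of the smoothed function is of order `h`, after which `h` is reduced. The test function $f(x) = \max(x, -2x) + x^2/2$ with minimizer $x = 0$, the closed forms of its double average over $[-h, h]$, and all names and parameters (`shrink`, `h_min`, `c`) are assumptions of this example.

```python
def psi_abs(u, h):
    """Double average of |u| over [-h, h] (closed form, 1-D)."""
    a = abs(u)
    if a >= 2.0 * h:
        return a
    return 2.0 * h / 3.0 + a * a / (2.0 * h) - a ** 3 / (12.0 * h * h)


def dpsi_abs(u, h):
    a = abs(u)
    sign = 1.0 if u >= 0.0 else -1.0
    return sign if a >= 2.0 * h else sign * (a / h - a * a / (4.0 * h * h))


def d2psi_abs(u, h):
    a = abs(u)
    return 0.0 if a >= 2.0 * h else (1.0 - a / (2.0 * h)) / h


# f(x) = max(x, -2x) + x**2 / 2 = 1.5|x| - 0.5x + 0.5x**2; its double average:
def psi(x, h):
    return 1.5 * psi_abs(x, h) - 0.5 * x + 0.5 * x * x + h * h / 3.0


def dpsi(x, h):
    return 1.5 * dpsi_abs(x, h) - 0.5 + x


def d2psi(x, h):
    return 1.5 * d2psi_abs(x, h) + 1.0      # always >= 1 (strong convexity)


def smoothed_newton(x0, h0=1.0, shrink=0.25, h_min=1e-5, c=0.25, max_inner=100):
    """Damped Newton on psi(., h); h is reduced once the gradient is below h."""
    x, h = x0, h0
    while h >= h_min:
        for _ in range(max_inner):
            g = dpsi(x, h)
            if abs(g) <= h:                 # inner accuracy tied to h
                break
            p = -g / d2psi(x, h)
            t = 1.0
            for _ in range(50):             # step halving
                if psi(x + t * p, h) <= psi(x, h) + c * t * g * p:
                    break
                t *= 0.5
            x += t * p
        h *= shrink                         # shrink the averaging set
    return x


print(smoothed_newton(3.0))   # close to the minimizer x = 0 of f
```

For a fixed `h` the iteration converges to the minimizer of the smoothed function, which in this example lies at a distance of about `h / 3` from the minimizer of `f`; only the coordinated reduction of `h` brings the iterates to the minimizer of `f` itself. No claim about the convergence rate of this simplified schedule is intended.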
Let us prove that the sequence converges to a minimum point of the function with superlinear speed.
Theorem 3.2
The sequence $\{x_k\}$, constructed according to the algorithm 1-3, converges to a unique stationary point of the function $f$. For large $k$ the following estimate of the rate of convergence of the method holds
[TABLE]
where as .
Proof. As above, we are able to show that for sufficiently large k the process goes with a full step, i.e. . From the decomposition
[TABLE]
for
[TABLE]
we have
[TABLE]
It is easy to check that
[TABLE]
But it is obvious that . Therefore . Since the function has the continuous second derivative, satisfying a Lipschitz condition, then are the uniformly infinitesimal functions in . From here
[TABLE]
From the expression
[TABLE]
we have the evaluation
[TABLE]
where , as . For large we achieve that the inequality
[TABLE]
is correct (the condition of coherence). Therefore, the sequence converges to a single point and
[TABLE]
As soon as
[TABLE]
then
[TABLE]
Thus, the inequality (15) is proved.
Remark 3.1
The inequality (15) proves the superlinear convergence rate of the optimization method. Indeed, the coefficient between and is equal to , where as .
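What a superlinear rate means in practice can be seen by looking at the ratios of successive errors: for a geometric progression they stay constant, while for a superlinearly convergent sequence they tend to zero. The sketch below is a generic illustration and not the method of this paper: it uses the classical Newton iteration for the smooth function $F(x) = \cosh x$, whose minimizer is $x = 0$. The names `newton_errors` and `ratios` are assumptions of this example.

```python
import math


def newton_errors(x0, steps=3):
    """Errors |x_k - 0| of Newton's method x <- x - F'(x)/F''(x) for F = cosh."""
    x = x0
    errors = [abs(x)]
    for _ in range(steps):
        x -= math.tanh(x)          # F'(x) / F''(x) = sinh(x) / cosh(x)
        errors.append(abs(x))
    return errors


def ratios(errors):
    """Successive ratios e_{k+1} / e_k."""
    return [b / a for a, b in zip(errors, errors[1:])]


print(ratios(newton_errors(1.0)))                  # ratios shrink toward zero
print(ratios([0.5 ** k for k in range(4)]))        # geometric: constant 0.5
```

A sequence whose error ratios tend to zero eventually converges faster than any geometric progression, which is the sense in which the estimate above is called superlinear.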
4 Conclusion
Methods for finding a stationary point of a Lipschitz function and a minimum point of an arbitrary convex function are proposed in this paper. To achieve a high rate of convergence, the diameter of the set over which the integral averaging is done must be reduced consistently with the decreasing step length of the optimization process. Rules for the consistent reduction of the step lengths and the diameters of the sets are given.
References
- (1) Rademacher H. Über partielle und totale Differenzierbarkeit I. Math. Ann. 89 (1919), 340-359.
- (2) Pshenichny B.N., Danilin Yu.M. Numerical methods in extremum problems. M.: Nauka, 1975. 319 P.
- (3) Pshenichny B.N. Convex analysis and extremum problems. M.: Nauka, 1980. 320 P.
- (4) Rockafellar R.T. Convex analysis. New York: Wiley, 1972.
- (5) Demyanov V.F., Rubinov A.M. The basis of nonsmooth analysis. The quasidifferential calculation. M.: Nauka, 1990. 432 P.
- (6) Prudnikov I.M. $C^{2}(D)$ integral approximation of nonsmooth functions preserving $\varepsilon(D)$ extreme points // Papers of Institute of Mathematics and Mechanics of Ural Branch RAN. 2010. P. 159-169.
- (7) Prudnikov I.M. Integral approximation of Lipschitz functions // Vestnik of St. Petersburg University. Ser. 10. 2010. Issue 2. P. 70-83.
- (8) Clarke F. Optimization and nonsmooth analysis. M.: Nauka, 1988. 280 P.
