Accelerated method of finding for the minimum of arbitrary Lipschitz convex function

I.M. Prudnikov

arXiv:1904.00606·math.OC·August 3, 2023


TL;DR

This paper introduces a novel optimization method for nonsmooth convex functions that achieves superlinear convergence by using a variable-dependent averaging technique, enabling second-order methods on smooth approximations.

Contribution

It develops a new approximation approach using set-valued mappings that transforms nonsmooth convex functions into twice differentiable convex functions for faster optimization.

Findings

01. Achieves superlinear convergence rate.

02. Transforms nonsmooth functions into smooth approximations.

03. Enables second-order optimization methods on nonsmooth problems.

Abstract

The goal of the paper is the development of an optimization method with a superlinear convergence rate for a nonsmooth convex function. The optimization uses an approximation similar to Steklov integral averaging. The difference is that the averaging is performed over a variable-dependent set, called a set-valued mapping (SVM), satisfying simple conditions. The novelty of the approach is that such an approximation yields twice continuously differentiable convex functions, to which second-order optimization methods are applied. An estimate of the convergence rate of the method is given.

Equations

$$0\in\partial_{CL}f(x_{*}).$$

$$\varphi(x)=\frac{1}{\mu(D)}\int_{D}f(x+y)\,dy,$$

$$|\varphi(x_{1})-\varphi(x_{2})|\leq\frac{1}{\mu(D)}\int_{D}|f(x_{1}+y)-f(x_{2}+y)|\,dy\leq\frac{1}{\mu(D)}\int_{D}L\|x_{1}-x_{2}\|\,dy\leq L\|x_{1}-x_{2}\|,\quad x_{1},x_{2}\in R^{n}.$$

$$\varphi(x)=\frac{1}{\mu(D)}\int_{D}f(x+y)\,dy,$$

$$\varphi^{\prime}(x)=\frac{1}{\mu(D)}\int_{D}f^{\prime}(z+x)\,dz.$$

$$\phi(x)=\frac{1}{\mu(D)}\int_{D}\varphi(x+y)\,dy.$$

$$\phi^{\prime}(x)=\frac{1}{\mu(D)}\int_{D}\varphi^{\prime}(z+x)\,dz.$$

$$\phi^{\prime\prime}(x)=\frac{1}{\mu(D)}\int_{D}\varphi^{\prime\prime}(z+x)\,dz,$$

$$\varphi^{\prime}(x_{*})=\frac{1}{\mu(D)}\int_{D}f^{\prime}(z+x_{*})\,dz=0.$$

$$\frac{1}{\mu(D)}\sum_{i=1}^{N}f^{\prime}(z_{i}+x_{*})\mu(D_{i}),$$

$$\sum_{i=1}^{N}\mu(D_{i})=\mu(D).$$

$$\frac{1}{\mu(D)}\sum_{i=1}^{N}f^{\prime}(z_{i}+x_{*})\mu(D_{i})=\sum_{i=1}^{N}\frac{\mu(D_{i})}{\mu(D)}f^{\prime}(z_{i}+x_{*})=\sum_{i=1}^{N}\alpha_{i}f^{\prime}(z_{i}+x_{*}),$$

$$f(z)\geq f(x_{*})\quad\forall z\in S,$$

$$\varphi_{s}(x)=\frac{1}{\mu(D_{s})}\int_{D_{s}}f(x+y)\,dy$$

$$\Phi_{s}(x)=\frac{1}{\mu(D_{s})}\int_{D_{s}}\varphi_{s}(x+y)\,dy.$$

$$\tilde{\Phi}_{s}(y,x)=\Phi_{s}(y)+L_{s}\|y-x\|^{2},$$

$$L_{s}\|z\|^{2}\leq(\nabla^{2}\tilde{\Phi}_{s}(x,x)z,z)\leq 3L_{s}\|z\|^{2}\quad\forall z\in R^{n},$$

$$\tilde{\Phi}_{s,k}(x_{k}+2^{-l_{k}}\triangle_{k})\leq\tilde{\Phi}_{s,k}(x_{k})-2^{-2l_{k}}\frac{L_{s}}{2}\|\triangle_{k}\|^{2}.$$

$$\frac{3\|\triangle_{k}\|}{d(D_{s})}<\varepsilon_{k}$$

$$\tilde{\Phi}_{s,k}(x_{k}+\alpha\triangle_{k})=\tilde{\Phi}_{s,k}(x_{k})+\alpha(\nabla\tilde{\Phi}_{s,k}(x_{k}),\triangle_{k})+o_{s,k}(\|\alpha\triangle_{k}\|),$$

$$\tilde{\Phi}_{s,k}(x_{k}+\alpha\triangle_{k})=\tilde{\Phi}_{s,k}(x_{k})-\alpha(\nabla^{2}\tilde{\Phi}_{s,k}(x_{k})\triangle_{k},\triangle_{k})+o_{s,k}(\|\alpha\triangle_{k}\|).$$

$$o_{s,k}(\alpha\|\triangle_{k}\|)\leq\frac{\alpha\|\triangle_{k}\|}{N_{s}(\alpha\|\triangle_{k}\|)}$$

$$\tilde{\Phi}_{s,k}(x_{k}+\alpha\triangle_{k})\leq\tilde{\Phi}_{s,k}(x_{k})-\alpha L_{s}\|\triangle_{k}\|^{2}+\frac{\alpha\|\triangle_{k}\|}{N_{s}(\alpha\|\triangle_{k}\|)}=\tilde{\Phi}_{s,k}(x_{k})-\alpha\|\triangle_{k}\|\left(L_{s}\|\triangle_{k}\|-\frac{1}{N_{s}(\alpha\|\triangle_{k}\|)}\right).$$

$$L_{s}\|\triangle_{k}\|-\frac{1}{N_{s}(\alpha\|\triangle_{k}\|)}\geq\frac{L_{s}\|\triangle_{k}\|}{2}.$$

$$\tilde{\Phi}_{s,k}(x_{k}+\alpha\triangle_{k})\leq\tilde{\Phi}_{s,k}(x_{k})-\alpha\frac{L_{s}}{2}\|\triangle_{k}\|^{2}$$

$$o_{s,k}(\|\triangle_{k}\|)=\tilde{\Phi}_{s,k}(x_{k}+\triangle_{k})-\tilde{\Phi}_{s,k}(x_{k})-(\nabla\tilde{\Phi}_{s,k}(x_{k}),\triangle_{k}).$$

$$\tilde{\Phi}_{s,k}(x_{k}+\triangle_{k})-\tilde{\Phi}_{s,k}(x_{k})=(\nabla\tilde{\Phi}_{s,k}(x_{k}+\xi\triangle_{k}),\triangle_{k})$$

$$o_{s,k}(\|\triangle_{k}\|)=(\nabla\tilde{\Phi}_{s,k}(x_{k}+\xi\triangle_{k}),\triangle_{k})-(\nabla\tilde{\Phi}_{s,k}(x_{k}),\triangle_{k}).$$


Taxonomy

Topics: Optimization and Variational Analysis

Full text


Igor Mihailovich Prudnikov, Scientific Center of Smolensk Federal Medical University, Smolensk, Russia, 214000

Email: pim [email protected]

The accelerated method for finding the minimum of a nonsmooth finite convex function

Igor M. Prudnikov


Abstract

The goal of the paper is the development of an optimization method with a superlinear convergence rate for a nonsmooth convex function. The optimization uses an approximation similar to Steklov integral averaging. The difference is that the averaging is performed over a variable-dependent set, called a set-valued mapping (SVM), satisfying simple conditions. The novelty of the approach is that such an approximation yields twice continuously differentiable convex functions, for the optimization of which second-order methods are used. The rate of convergence of the method is estimated.

Keywords:

Lipschitz functions; convex functions; generalized gradients; necessary and sufficient conditions of optimality; Steklov integral; Clarke subdifferential; Lebesgue integrals; generalized matrices of second derivatives; Newton optimization methods for Lipschitz functions

MSC:

49J52 90C30 90C31

Journal: JOTA

1 Introduction

Nonsmooth (non-differentiable) or insufficiently smooth functions are widely used in economics, data processing, control theory, artificial intelligence and other fields. Examples are functions obtained by taking a minimum or a maximum.

Nonsmooth functions may fail to have derivatives at some points. It is known that a Lipschitz function is differentiable almost everywhere (a.e.) in $R^{n}$ [rademacher]. At points of non-differentiability, generalized gradients are used instead of gradients. The optimization methods for such functions differ from the optimization methods for smooth (differentiable) functions.

In this paper the author continues research on the construction of an optimization method for Lipschitz functions using Steklov integrals and similar integrals in which the set over which the averaging is taken depends on the variable.

This approach gives twice differentiable functions whose stationary points coincide with the stationary points of the original function, in contrast to the case when the averaging is done over sets independent of $x$. For such functions one can use second-order optimization methods, which are examined here for arbitrary convex functions together with an estimate of the convergence rate.

If the gradients are discontinuous as functions of the variables, then in the general case it is very difficult to construct optimization methods and to estimate their convergence rates. Using a polynomial approximation of the original function and passing to the optimization of a smooth function by known methods [pshenichnyidanilin] does not solve the optimization problem, since it leads to the appearance of new extremum points located far from the extremum points of the original function.

Separating fictitious extremum points from real ones is as complex a problem as the initial one. Therefore, the theory of nonsmooth functions developed its own methods, based on the properties of the generalized gradients of Lipschitz functions. Here it is worth mentioning the works [pshenichnyidanilin]-[clarke] of N. Z. Shor, B. N. Pshenichny, V. F. Demyanov, E. A. Nurminsky, F. Clarke, R. T. Rockafellar and L. N. Polyakova.

To construct accelerated optimization methods for nonsmooth functions, it is necessary to find constructions to which second-order optimization methods are applicable. For this, the constructions must be such that extremum points do not disappear and new ones do not appear.

The paper proposes exactly such a method for smoothing nonsmooth functions. The resulting function is continuously differentiable. If we apply the averaging operation to it again, we obtain a twice differentiable function.

If we apply averaging over sets depending on the variable $x$, then we obtain a continuously differentiable function whose stationary points coincide with the stationary points of the original function. If we repeat the averaging procedure, we get twice differentiable functions to which second-order optimization methods with accelerated convergence can be applied.

With the help of the functions so defined, it is possible to pass from the local optimization of nonsmooth functions to the local optimization of smooth functions, and also to estimate the rate of convergence to an extremum point. This is important because it makes it possible to develop accelerated optimization methods for functions with discontinuous gradients. As far as the author knows, similar constructions have not been proposed previously.

2 Smoothing integral functions

Let $f(\cdot):R^{n}\rightarrow R$ be a Lipschitz function with constant $L$, and let $x_{*}$ be its local minimum (maximum) point in $R^{n}$. As is known, the necessary extremum condition at the point $x_{*}$ for the Lipschitz function $f(\cdot)$ is that zero belongs to the Clarke subdifferential $\partial_{CL}f(\cdot)$ calculated at the point $x_{*}$, i.e.

$$0\in\partial_{CL}f(x_{*}).$$

Any point at which this condition holds is called a stationary point. Not all stationary points are minimum or maximum points.

Let us take an arbitrary convex compact set $D\subset R^{n}$, $0\in\mbox{int}\,D$. We introduce the definition of an $\varepsilon(D)$ stationary point.

Definition 1. A point $x_{\varepsilon}$ is called an $\varepsilon(D)$ stationary point of the function $f(\cdot)$ if the set $x_{\varepsilon}+D$ includes a stationary point of the function $f(\cdot)$.

This definition agrees with the definition of an $\varepsilon$ stationary point for convex functions [rocafellar], because for strongly convex functions the distance from an $\varepsilon$ stationary point to the minimum point can be estimated by the difference of the values of the function $f(\cdot)$ at these points.

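Define the function $\varphi(\cdot):R^{n}\rightarrow R$

$$\varphi(x)=\frac{1}{\mu(D)}\int_{D}f(x+y)\,dy,$$

where $\mu(D)$ is the measure of the domain $D$, $\mu(D)>0$.

Obviously, $\varphi(\cdot)$ is continuous. Let us show that $\varphi(\cdot)$ is a Lipschitz function with the same Lipschitz constant as the function $f(\cdot)$. Indeed,

$$|\varphi(x_{1})-\varphi(x_{2})|\leq\frac{1}{\mu(D)}\int_{D}|f(x_{1}+y)-f(x_{2}+y)|\,dy\leq\frac{1}{\mu(D)}\int_{D}L\|x_{1}-x_{2}\|\,dy\leq L\|x_{1}-x_{2}\|,\quad x_{1},x_{2}\in R^{n}.$$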
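As a minimal illustration of the averaging above, the following sketch evaluates $\varphi$ numerically in one dimension. The choices $f(x)=|x|$, $D=[-r,r]$, the midpoint quadrature and the helper names `steklov_average` and `phi_exact` are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def steklov_average(f, x, r, n=20001):
    """Approximate phi(x) = (1/mu(D)) * integral_D f(x + y) dy for D = [-r, r]."""
    # Midpoint rule on n equal cells of [-r, r]; the mean of the samples
    # equals (1/(2r)) * sum_i f(x + y_i) * (2r/n).
    y = -r + (np.arange(n) + 0.5) * (2.0 * r / n)
    return float(np.mean(f(x + y)))

def phi_exact(x, r):
    """Closed form of the average of |x| over [-r, r] (hand-derived for this example)."""
    return (x * x + r * r) / (2.0 * r) if abs(x) <= r else abs(x)

r = 0.5
xs = np.linspace(-2.0, 2.0, 41)
phi_vals = np.array([steklov_average(np.abs, x, r) for x in xs])
# Largest slope between neighbouring samples; it should not exceed L = 1,
# the Lipschitz constant of f(x) = |x|.
max_slope = float(np.max(np.abs(np.diff(phi_vals)) / np.diff(xs)))
```

In this example the kink of $|x|$ at zero is replaced by a parabola on $[-r,r]$, while outside that interval the average coincides with $|x|$.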

The function $f(\cdot)$ is Lipschitz, and therefore it is differentiable a.e. in $R^{n}$ [rademacher]. Let $N(f)$ denote the set of points of differentiability of the function $f(\cdot)$ in $R^{n}$. It is known that $N(f)$ is everywhere dense in $R^{n}$ and, in particular, in $D$, since $\mu(D)>0$ by assumption.

The following theorem was proved in [proudintegapp1].

Theorem 2.1

For an arbitrary Lipschitz function $f(\cdot):R^{n}\rightarrow R$ the function

$$\varphi(x)=\frac{1}{\mu(D)}\int_{D}f(x+y)\,dy,$$

where $D$ is any domain in $R^{n}$ with $0\in\mbox{int}\,D$ and $\mu(D)>0$ is the measure of the domain $D$, is a continuously differentiable function with the derivative

$$\varphi^{\prime}(x)=\frac{1}{\mu(D)}\int_{D}f^{\prime}(z+x)\,dz.$$

Remark 2.1

Here we use Lebesgue integration.

Remark 2.2

The derivatives of the function $f(\cdot)$ are taken at those points where they exist.

It was also proved in [proudintegapp1] that if $f(\cdot)$ is Lipschitz, then $\varphi^{\prime}(\cdot)$ is also a Lipschitz function.
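For orientation, here is a worked one-dimensional instance of the derivative formula of Theorem 2.1, under the illustrative assumptions $f(x)=|x|$ and $D=[-r,r]$ (neither is taken from the paper). The derivative $f^{\prime}(t)=\mathrm{sign}(t)$ exists for all $t\neq 0$, and averaging it gives

```latex
\varphi^{\prime}(x)=\frac{1}{2r}\int_{-r}^{r}\operatorname{sign}(x+z)\,dz
=\frac{(r+x)-(r-x)}{2r}=\frac{x}{r}\quad\text{for }|x|\leq r,
\qquad
\varphi^{\prime}(x)=\operatorname{sign}(x)\quad\text{for }|x|>r.
```

In this example $\varphi^{\prime}$ is continuous and Lipschitz with constant $1/r$, although $f^{\prime}$ itself jumps at zero.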

Consider the function

$$\phi(x)=\frac{1}{\mu(D)}\int_{D}\varphi(x+y)\,dy.$$

Since $\varphi(\cdot)$ is Lipschitz, we have

$$\phi^{\prime}(x)=\frac{1}{\mu(D)}\int_{D}\varphi^{\prime}(z+x)\,dz.$$

Since $\varphi^{\prime}(\cdot)$ is continuous, $\phi(\cdot)$ is a continuously differentiable function. Since $\varphi^{\prime}(\cdot)$ is Lipschitz, we can differentiate (2). As a result, we have

$$\phi^{\prime\prime}(x)=\frac{1}{\mu(D)}\int_{D}\varphi^{\prime\prime}(z+x)\,dz,\tag{3}$$

i.e. $\phi(\cdot)$ is a twice continuously differentiable function.

It can be shown [proudintegapp] that the function $\phi^{\prime\prime}(\cdot)$ is Lipschitz with a constant $\tilde{L}$ depending on the set $D$. If $D$ is a ball or a cube in $\mathbb{R}^{n}$, then we can take $\tilde{L}=\frac{2L}{d^{2}}$, where $d$ is the diameter of the set $D$ and $L$ is the Lipschitz constant of the function $f(\cdot)$.
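The constant $\tilde{L}=2L/d^{2}$ can be checked on a one-dimensional example. The sketch below assumes $f(x)=|x|$ (so $L=1$) and $D=[-r,r]$ (so $d=2r$). For this case the second derivative of the twice-averaged function has the closed form $\phi^{\prime\prime}(x)=\max(0,2r-|x|)/(2r^{2})$, which was derived by hand for this illustration and is not taken from the paper; the name `d2_phi` is likewise hypothetical.

```python
import numpy as np

def d2_phi(x, r):
    """Second derivative of the double average of |x| over D = [-r, r] (1D, hand-derived)."""
    return np.maximum(0.0, 2.0 * r - np.abs(x)) / (2.0 * r * r)

r = 0.25
d = 2.0 * r                      # diameter of D
L = 1.0                          # Lipschitz constant of f(x) = |x|
xs = np.linspace(-1.0, 1.0, 4001)
vals = d2_phi(xs, r)
# Empirical Lipschitz constant of phi'' estimated from neighbouring samples.
lip_estimate = float(np.max(np.abs(np.diff(vals)) / np.diff(xs)))
bound = 2.0 * L / d**2           # the constant quoted in the text
```

In this particular example the estimated Lipschitz constant of $\phi^{\prime\prime}$ agrees with $2L/d^{2}$, so the quoted constant is attained here.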

Remark 2.3

The integration in (3) is understood, as before, in the sense of Lebesgue.

If $x$ is a point of local maximum or minimum of the function $f(\cdot)$, then for sufficiently small $r>0$ and $D=S^{n-1}_{r}(0)=\{z\in R^{n}\mid\|z\|\leq r\}$ the point $x$ is also a local minimum or maximum point of the function $\varphi(\cdot)$. But unlike the function $f(\cdot)$, the function $\varphi(\cdot)$ is continuously differentiable. The same is true for the function $\phi(\cdot)$, i.e. the point $x$ is a point of local minimum or maximum of the function $\phi(\cdot)$. But unlike the functions $f(\cdot)$ and $\varphi(\cdot)$, the function $\phi(\cdot)$ is twice continuously differentiable, and its matrix of second mixed derivatives satisfies a Lipschitz condition. To optimize $\phi(\cdot)$ we can use second-order methods.

The functions $\varphi(\cdot)$ and $\phi(\cdot)$ also retain many properties of the function $f(\cdot)$. An important property for applications is that if $f(\cdot)$ is convex with respect to all or some of the variables, then $\varphi(\cdot)$ and $\phi(\cdot)$ are also convex with respect to the same variables [proudintegapp].

Let us see which stationary points the function $\varphi(\cdot)$ has. According to the formula (2), a stationary point $x_{*}$ of the function $\varphi(\cdot)$ is a point for which

$$\varphi^{\prime}(x_{*})=\frac{1}{\mu(D)}\int_{D}f^{\prime}(z+x_{*})\,dz=0.\tag{4}$$

We will show that a stationary point of the function $f(\cdot)$ belongs to the set $x_{*}+D$.

The integral in (4) can be represented with any degree of accuracy $\delta>0$ in the form

$$\frac{1}{\mu(D)}\sum_{i=1}^{N}f^{\prime}(z_{i}+x_{*})\mu(D_{i}),\tag{5}$$

where $N=N(\delta)$, $D_{i}\subset D$, $i\in 1:N$, are subregions of the set $D$, $\mu(D_{i})$ are their measures,

$$\sum_{i=1}^{N}\mu(D_{i})=\mu(D).$$

The sum (5) belongs to the convex hull of the vectors $f^{\prime}(z_{i}+x_{*})$. Indeed,

$$\frac{1}{\mu(D)}\sum_{i=1}^{N}f^{\prime}(z_{i}+x_{*})\mu(D_{i})=\sum_{i=1}^{N}\frac{\mu(D_{i})}{\mu(D)}f^{\prime}(z_{i}+x_{*})=\sum_{i=1}^{N}\alpha_{i}f^{\prime}(z_{i}+x_{*}),\tag{6}$$

where $\alpha_{i}=\frac{\mu(D_{i})}{\mu(D)}$, $\alpha_{i}\geq 0$, and $\sum_{i=1}^{N}\alpha_{i}=1$.

According to the equality (4), the sum (6) can be made arbitrarily small for large $N=N(\delta)$ (for small $\delta$). Since the convex hull of finitely many vectors is a closed set, and a convex combination of generalized gradients is collinear to some generalized gradient of the function $f(\cdot)$ at a point $x_{*}+\bar{z}\in x_{*}+D$, $\bar{z}\in D$, the sum (6) is a vector that tends, as $N\rightarrow\infty$, to a zero generalized gradient. In other words, there exists a point $x_{*}+\bar{z}\in x_{*}+D$ at which the function $f(\cdot)$ has a zero generalized gradient.

Therefore, the stationary point $x_{*}+\bar{z}$ of the function $f(\cdot)$ belongs to the set $x_{*}+D$. Hence, by definition, $x_{*}$ is an $\varepsilon(D)$ stationary point. Thus, the following theorem is proved.

Theorem 2.2

All stationary points of the function $\varphi(\cdot)$ are $\varepsilon(D)$ stationary points of the function $f(\cdot)$.
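Theorem 2.2 can be illustrated on an asymmetric one-dimensional example. The sketch assumes $f(x)=\max(x,-2x)$ and $D=[-r,r]$ (both chosen for this illustration only, as are the helper names `dphi` and `find_zero`). The averaged derivative then vanishes at $x_{*}=r/3$ rather than at the minimizer $0$ of $f$, but $0$ still lies in $x_{*}+D$.

```python
def dphi(x, r):
    """Averaged derivative (1/(2r)) * integral_{-r}^{r} f'(x + z) dz for f(x) = max(x, -2x)."""
    # f'(t) = 1 for t > 0 and f'(t) = -2 for t < 0.
    pos = min(max(x + r, 0.0), 2.0 * r)   # length of {z in [-r, r] : x + z > 0}
    neg = 2.0 * r - pos                   # length of {z in [-r, r] : x + z < 0}
    return (1.0 * pos - 2.0 * neg) / (2.0 * r)

def find_zero(g, lo, hi, tol=1e-12):
    """Bisection for a nondecreasing function g with g(lo) < 0 < g(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

r = 0.3
x_star = find_zero(lambda x: dphi(x, r), -1.0, 1.0)
# The stationary point 0 of f lies in x_star + D = [x_star - r, x_star + r].
contains_minimizer = (x_star - r) <= 0.0 <= (x_star + r)
```

The stationary point of the averaged function is shifted away from the minimizer of $f$, but by no more than the size of $D$, which is what the $\varepsilon(D)$ terminology expresses.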

Similar reasoning is true for the function $\phi(\cdot)$ .

Corollary 2.1

All stationary points of the function $\phi(\cdot)$ are the $\varepsilon(D)$ stationary points of the function $\varphi(\cdot)$ or the $\varepsilon(2D)$ stationary points of the function $f(\cdot)$ .

Corollary 2.2

If $x_{*}$ is a local minimum point of the function $f(\cdot)$ for which there exists a neighborhood $S$, $x_{*}\in\mbox{int}\,S$, where

$$f(z)\geq f(x_{*})\quad\forall z\in S,$$

then there exist a convex compact set $D$ and a point $y\in S$ such that $\varphi^{\prime}(y)=0$ and $x_{*}\in y+D\subset S$, i.e. the point $y$ is an $\varepsilon(D)$ stationary point of the function $f(\cdot)$.

The same is true for a local maximum point of the function $f(\cdot)$.

To find the $\varepsilon(2D)$ stationary points of the function $f(\cdot)$, we apply second-order optimization methods to the function $\phi(\cdot)$. A numerical optimization method will be given whose rate of convergence to a stationary point of the function $f(\cdot)$ is faster than any geometric progression.

3 Search algorithm for stationary points of the Lipschitz function $f(\cdot)$

Let us take a sequence of sets $\{D_{s}\}$, $s=1,2,\dots$, with non-empty interior whose diameters $d(D_{s})$ tend to zero as $s$ increases. Let $D_{s}=B^{n}_{r_{s}}(0)=\{v\in\mathbb{R}^{n}\mid\|v\|\leq r_{s}\}$ with $r_{s}\rightarrow+0$ as $s\rightarrow\infty$. We introduce the sequence of functions

$$\varphi_{s}(x)=\frac{1}{\mu(D_{s})}\int_{D_{s}}f(x+y)\,dy$$

and

$$\Phi_{s}(x)=\frac{1}{\mu(D_{s})}\int_{D_{s}}\varphi_{s}(x+y)\,dy.$$

Let the inequality $\|\Phi^{\prime\prime}_{s}(\cdot)\|\leq{L}_{s}$ hold for the matrix of the second mixed derivatives of the function $\Phi_{s}(\cdot)$. It is proved in [proudintegapp] that $L_{s}=\frac{L}{d(D_{s})}.$ Instead of the function $\Phi_{s}(\cdot)$ we will consider the function $\tilde{\Phi}_{s}(\cdot,x):\mathbb{R}^{n}\rightarrow\mathbb{R}$:

$$\tilde{\Phi}_{s}(y,x)=\Phi_{s}(y)+L_{s}\|y-x\|^{2},$$

for any fixed point $x\in\mathbb{R}^{n}$ and $y\in\mathbb{R}^{n}$.

As a result, the inequality

$$L_{s}\|z\|^{2}\leq(\nabla^{2}\tilde{\Phi}_{s}(x,x)z,z)\leq 3L_{s}\|z\|^{2}\quad\forall z\in\mathbb{R}^{n},$$

holds, where $\nabla^{2}\tilde{\Phi}_{s}(\cdot,x)=\tilde{\Phi^{\prime\prime}}_{s}(\cdot,x)$ is the matrix of the second mixed derivatives of the function $\tilde{\Phi}_{s}(\cdot,x)$ with respect to the variable $y$.

Note that if the function $\Phi_{s}(\cdot)$ is bounded below, then the function $\tilde{\Phi}_{s}(\cdot,x)$ is also bounded below for any points $x$ and $y$ from $\mathbb{R}^{n}$. Also, it is clear that $\nabla\tilde{\Phi}_{s}(x,x)=\nabla\Phi_{s}(x)$, where $\nabla\tilde{\Phi}_{s}(x,x)$ is the gradient of the function $\tilde{\Phi}_{s}(\cdot,x)$ at the point $y=x$.
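The gradient identity and the bounds on the second derivatives follow from differentiating $\tilde{\Phi}_{s}(y,x)=\Phi_{s}(y)+L_{s}\|y-x\|^{2}$ with respect to $y$. A short derivation, assuming that the bound $\|\Phi_{s}^{\prime\prime}(y)\|\leq L_{s}$ holds at the point $y$ considered ($I$ denotes the identity matrix):

```latex
\begin{aligned}
\nabla_{y}\tilde{\Phi}_{s}(y,x)&=\nabla\Phi_{s}(y)+2L_{s}(y-x),
\qquad
\nabla_{y}^{2}\tilde{\Phi}_{s}(y,x)=\Phi_{s}^{\prime\prime}(y)+2L_{s}I,\\
-L_{s}\|z\|^{2}\leq(\Phi_{s}^{\prime\prime}(y)z,z)\leq L_{s}\|z\|^{2}
&\;\Longrightarrow\;
L_{s}\|z\|^{2}\leq(\nabla_{y}^{2}\tilde{\Phi}_{s}(y,x)z,z)\leq 3L_{s}\|z\|^{2}.
\end{aligned}
```

Setting $y=x$ in the first formula removes the term $2L_{s}(y-x)$ and gives $\nabla\tilde{\Phi}_{s}(x,x)=\nabla\Phi_{s}(x)$.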

We assume that the functions $f(\cdot)$ and $\Phi_{s}(\cdot)$ are bounded below and reach their infimum at some points.

Search method for a stationary point

Let the point $x_{k}$ at the $k$-th step have already been constructed. Construct the point $x_{k+1}$. We put by definition $\tilde{\Phi}_{s,k}(\cdot)=\tilde{\Phi}_{s}(\cdot,x_{k})$.

1. Calculate $\triangle_{k}=-(\nabla^{2}\tilde{\Phi}_{s,k}(x_{k}))^{-1}\nabla\tilde{\Phi}_{s,k}(x_{k}).$
2. Find a non-negative integer $l_{k}$ for which

$$\tilde{\Phi}_{s,k}(x_{k}+2^{-l_{k}}\triangle_{k})\leq\tilde{\Phi}_{s,k}(x_{k})-2^{-2l_{k}}\frac{L_{s}}{2}\|\triangle_{k}\|^{2}.\tag{8}$$

3. Put $x_{k+1}=x_{k}+2^{-l_{k}}\triangle_{k}$, $k=k+1$.
4. With increasing $k$, decrease $d(D_{s})$ so that the inequality

$$\frac{3\|\triangle_{k}\|}{d(D_{s})}<\varepsilon_{k}\tag{9}$$

holds for some sequence $\{\varepsilon_{k}\}$, where $\varepsilon_{k}\rightarrow+0$. Go to step 1.
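The search method above can be sketched in one dimension as follows. Everything specific in this sketch is an assumption made for illustration and is not the author's implementation: the test function $f(x)=|x|$, the set $D_{s}=[-r,r]$, the hand-derived closed forms of $\Phi_{s}$ and its derivatives, the choice $L_{s}=1/r$ (a bound on $|\Phi_{s}^{\prime\prime}|$ for this example), the sequence $\varepsilon_{k}=1/(k+1)$, and the rule of halving $r$.

```python
def Phi(x, r):
    """Double average of |x| over [-r, r] (closed form, 1D)."""
    a = abs(x)
    if a >= 2.0 * r:
        return a
    return 2.0 * r / 3.0 + a * a / (2.0 * r) - a**3 / (12.0 * r * r)

def dPhi(x, r):
    """First derivative of Phi."""
    a = abs(x)
    if a >= 2.0 * r:
        return 1.0 if x > 0 else -1.0
    return x / r - x * a / (4.0 * r * r)

def d2Phi(x, r):
    """Second derivative of Phi."""
    return max(0.0, 2.0 * r - abs(x)) / (2.0 * r * r)

def search(x0, r0=1.0, iters=60):
    x, r = x0, r0
    for k in range(iters):
        Ls = 1.0 / r                        # assumed bound on |Phi''| in this example
        def Phi_tilde(y, xk=x, rk=r, Lk=Ls):
            return Phi(y, rk) + Lk * (y - xk) ** 2
        grad = dPhi(x, r)                   # gradient of Phi_tilde at y = x_k
        hess = d2Phi(x, r) + 2.0 * Ls       # second derivative of Phi_tilde at y = x_k
        step = -grad / hess                 # step 1: Newton-type direction
        for l in range(60):                 # step 2: smallest l passing the descent test
            t = 2.0 ** (-l)
            if Phi_tilde(x + t * step) <= Phi_tilde(x) - t * t * (Ls / 2.0) * step**2:
                break
        x = x + t * step                    # step 3
        eps_k = 1.0 / (k + 1)               # step 4: some sequence eps_k -> +0
        # Halve r only while 3|step|/d < eps_k still holds for the new diameter d = r.
        while r > 1e-12 and 3.0 * abs(step) / r < eps_k:
            r *= 0.5
    return x, r

x_final, r_final = search(3.0)
```

With these choices the iterates approach the minimizer $0$ of $|x|$ while the averaging radius shrinks. The fixed term $L_{s}\|y-x_{k}\|^{2}$ keeps the step damped, so this particular sketch illustrates the mechanics of steps 1-4 rather than the superlinear rate discussed later.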

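Let us show that $\|\triangle_{k}\|\rightarrow+0$ as $k\rightarrow\infty$ and that the number $l_{k}$ used in the search method exists. Expand the function $\tilde{\Phi}_{s,k}(\cdot)$ in a neighborhood of the point $x_{k}$ in the Taylor series

$$\tilde{\Phi}_{s,k}(x_{k}+\alpha\triangle_{k})=\tilde{\Phi}_{s,k}(x_{k})+\alpha(\nabla\tilde{\Phi}_{s,k}(x_{k}),\triangle_{k})+o_{s,k}(\|\alpha\triangle_{k}\|),\tag{10}$$

where $o_{s,k}(\|\cdot\|)$ is a uniformly infinitesimal function in $k$.

Since $\triangle_{k}=-(\nabla^{2}\tilde{\Phi}_{s,k}(x_{k}))^{-1}\nabla\tilde{\Phi}_{s,k}(x_{k})$, we have $\nabla\tilde{\Phi}_{s,k}(x_{k})=-\nabla^{2}\tilde{\Phi}_{s,k}(x_{k})\triangle_{k}$. Consequently, $(\nabla\tilde{\Phi}_{s,k}(x_{k}),\triangle_{k})=-(\nabla^{2}\tilde{\Phi}_{s,k}(x_{k})\triangle_{k},\triangle_{k})$. Therefore, we can rewrite (10) in the form

$$\tilde{\Phi}_{s,k}(x_{k}+\alpha\triangle_{k})=\tilde{\Phi}_{s,k}(x_{k})-\alpha(\nabla^{2}\tilde{\Phi}_{s,k}(x_{k})\triangle_{k},\triangle_{k})+o_{s,k}(\|\alpha\triangle_{k}\|).\tag{11}$$

Since $o_{s,k}(\|\cdot\|)$ is a uniformly infinitesimal function with respect to $k$, the inequality

$$o_{s,k}(\alpha\|\triangle_{k}\|)\leq\frac{\alpha\|\triangle_{k}\|}{N_{s}(\alpha\|\triangle_{k}\|)}$$

holds for large $k$, where $N_{s}(\alpha\|\triangle_{k}\|)\rightarrow\infty$ as $\alpha\|\triangle_{k}\|\rightarrow 0$.

From (11) we have

$$\tilde{\Phi}_{s,k}(x_{k}+\alpha\triangle_{k})\leq\tilde{\Phi}_{s,k}(x_{k})-\alpha L_{s}\|\triangle_{k}\|^{2}+\frac{\alpha\|\triangle_{k}\|}{N_{s}(\alpha\|\triangle_{k}\|)}=\tilde{\Phi}_{s,k}(x_{k})-\alpha\|\triangle_{k}\|\left(L_{s}\|\triangle_{k}\|-\frac{1}{N_{s}(\alpha\|\triangle_{k}\|)}\right).$$

The value $\frac{1}{N_{s}(\alpha\|\triangle_{k}\|)}$ tends to zero as $\alpha\|\triangle_{k}\|\rightarrow 0$. Therefore, for small $\|\triangle_{k}\|$ and, consequently, for small $\|\nabla\tilde{\Phi}_{s,k}(x_{k})\|$, we get

$$L_{s}\|\triangle_{k}\|-\frac{1}{N_{s}(\alpha\|\triangle_{k}\|)}\geq\frac{L_{s}\|\triangle_{k}\|}{2}.$$

It follows that the inequality

$$\tilde{\Phi}_{s,k}(x_{k}+\alpha\triangle_{k})\leq\tilde{\Phi}_{s,k}(x_{k})-\alpha\frac{L_{s}}{2}\|\triangle_{k}\|^{2}\tag{13}$$

holds for sufficiently small $\|\nabla\tilde{\Phi}_{s,k}(x_{k})\|$ and any $\alpha\in[0,1]$. Therefore, $\|\nabla\tilde{\Phi}_{s,k}(x_{k})\|$ tends to zero as $k\rightarrow\infty$, since otherwise, as follows from (13), the function $\tilde{\Phi}_{s,k}(\cdot)$ would decrease by the value $\alpha\frac{{L_{s}}}{2}\|\triangle_{k}\|^{2}$ along the direction $\triangle_{k}$ at the $k$-th step. This contradicts the lower boundedness of the function $\tilde{\Phi}_{s,k}(\cdot)$ for all $k$ and $s$.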

We will show that when the requirements of step $4$ are fulfilled, the function $o_{s,k}(\cdot)$ is uniformly infinitesimal in $k$ and $s$. From (10) for $\alpha=1$ we have

$$o_{s,k}(\|\triangle_{k}\|)=\tilde{\Phi}_{s,k}(x_{k}+\triangle_{k})-\tilde{\Phi}_{s,k}(x_{k})-(\nabla\tilde{\Phi}_{s,k}(x_{k}),\triangle_{k}).\tag{14}$$

We use the mean value theorem. Then

$$\tilde{\Phi}_{s,k}(x_{k}+\triangle_{k})-\tilde{\Phi}_{s,k}(x_{k})=(\nabla\tilde{\Phi}_{s,k}(x_{k}+\xi\triangle_{k}),\triangle_{k})$$

for $\xi\in[0,1].$ Substituting this expression in (14), we have

$$o_{s,k}(\|\triangle_{k}\|)=(\nabla\tilde{\Phi}_{s,k}(x_{k}+\xi\triangle_{k}),\triangle_{k})-(\nabla\tilde{\Phi}_{s,k}(x_{k}),\triangle_{k}).$$

We use the mean value theorem again for the derivatives of the function $\tilde{\Phi}_{s,k}(\cdot):$

[TABLE]

Therefore

[TABLE]

It follows from the Lipschitz property of the gradient $\nabla\Phi_{s}(\cdot)$ with the constant $L_{s}=\frac{L}{d(D_{s})}$ that the estimate

[TABLE]

holds if (7) is satisfied.

It follows that the functions $o_{s,k}(\cdot)$ and $\frac{1}{N_{s}(\alpha\|\triangle_{k}\|)}$ are uniformly infinitesimal with respect to $k$ and $s$. Therefore, for small $\|\triangle_{k}\|$ the inequality (12) holds for $\alpha=1$. Consequently, the inequality (8) is satisfied for $l_{k}=0$ and the process goes with the full step $\triangle_{k}$.
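A complementary observation, not taken from the paper: if the bounds $L_{s}\|z\|^{2}\leq(\nabla^{2}\tilde{\Phi}_{s,k}(y)z,z)\leq 3L_{s}\|z\|^{2}$ are assumed to hold at every point $y$ of the segment between $x_{k}$ and $x_{k}+\triangle_{k}$, then a second-order Taylor expansion with the remainder at an intermediate point $\xi$ of that segment shows that the line-search inequality holds for every $\alpha=2^{-l}\leq\frac{1}{2}$, so a suitable $l_{k}$ always exists:

```latex
\begin{aligned}
\tilde{\Phi}_{s,k}(x_{k}+\alpha\triangle_{k})
&=\tilde{\Phi}_{s,k}(x_{k})+\alpha(\nabla\tilde{\Phi}_{s,k}(x_{k}),\triangle_{k})
+\frac{\alpha^{2}}{2}(\nabla^{2}\tilde{\Phi}_{s,k}(\xi)\triangle_{k},\triangle_{k})\\
&\leq\tilde{\Phi}_{s,k}(x_{k})-\alpha L_{s}\|\triangle_{k}\|^{2}
+\frac{3\alpha^{2}}{2}L_{s}\|\triangle_{k}\|^{2}\\
&\leq\tilde{\Phi}_{s,k}(x_{k})-\alpha^{2}\frac{L_{s}}{2}\|\triangle_{k}\|^{2}
\qquad\text{for }0\leq\alpha\leq\tfrac{1}{2}.
\end{aligned}
```

The first inequality uses $(\nabla\tilde{\Phi}_{s,k}(x_{k}),\triangle_{k})=-(\nabla^{2}\tilde{\Phi}_{s,k}(x_{k})\triangle_{k},\triangle_{k})\leq-L_{s}\|\triangle_{k}\|^{2}$, and the second is equivalent to $2\alpha^{2}\leq\alpha$.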

Theorem 3.1

Any limit point of the sequence $\{x_{k}\}$ , constructed according to the algorithm 1-4, is a stationary point of the function $f(\cdot)$ .

Proof. We have already proved that for small $\|\triangle_{k}\|$ the process goes with the full step $\triangle_{k}.$ Since the functions $\tilde{\Phi}_{s,k}(\cdot)$ are uniformly bounded below in $k,s$ and the inequality (13) holds for all $k$ and $s$, we have $\|\triangle_{k}\|\rightarrow 0$ and $\nabla\tilde{\Phi}_{s,k}(\cdot)\rightarrow 0$ as $s,k\rightarrow\infty$. Therefore, the sequence $\{x_{k}\}$ has limit points.

The following equalities

[TABLE]

hold, where all $o_{s,k}(\cdot)$ in (10) are uniformly infinitesimal in $k,s$.

It follows from the definition of the function ${\Phi}_{s}(\cdot)$ that the gradient $\nabla{\Phi}_{s}(\cdot)$ belongs to the convex hull of the generalized gradients of the function $f(\cdot)$.

Taking into account what is said above about $\|\triangle_{k}\|$ and $\nabla\tilde{\Phi}_{s,k}(\cdot)$, and also the upper semicontinuity of the Clarke subdifferential mapping [demrub1], [clarke], we conclude that the inclusion $0\in\partial_{CL}f(x_{*})$ holds at a limit point $x_{*}$, i.e. $x_{*}$ is a stationary point of the function $f(\cdot)$. The theorem is proved. $\Box$

To estimate the rate of convergence, we assume that $f(\cdot)$ is convex and almost everywhere

[TABLE]

From [proudintegapp1] it follows that $\Phi_{s}(\cdot)$ is also convex and for some $m_{s},M_{s}>0$

[TABLE]

Define the function

[TABLE]

for each $k$ and $y\in\mathbb{R}^{n}$, where $\tilde{L}_{k}>0$ is a positive number depending on $k$ and tending to zero as $k\rightarrow\infty$. To search for a stationary point of the function $f(\cdot)$, we use the algorithm described below.

Let

[TABLE]

Since $m_{s,k}\rightarrow m$ as $s(k)\rightarrow\infty$, we assume that $m_{s,k}\leq m$ for all $s(k)$.

We first introduce the conditions of coherence, which give the rules by which the parameters $k$ and $s$ tend to infinity in a coordinated way. We write them briefly in the form of a dependence $s=s(k)$. Denote by $L_{s(k)}$ the constant bounding from above the norm of the matrix $\nabla^{2}\tilde{\Phi}_{s(k),k}(\cdot)$: $\|\nabla^{2}\tilde{\Phi}_{s(k),k}(\cdot)\|\leq L_{s,k}$. During the optimization process we satisfy the following conditions of coherence:

1. $L_{s,k}\|\Delta_{k}\|\rightarrow 0$ as $s(k)\rightarrow\infty$;
2. for convergence with a superlinear rate, we require that

[TABLE]

as $s(k),k\rightarrow\infty$, where $\frac{\|\Delta_{k}\|}{N_{s,k}(\|\Delta_{k}\|)}$ is an upper bound of the function $o_{s(k),k}(\cdot)$ obtained from the expansion (10) of the function $\tilde{\Phi}_{s,k}(\cdot)$ at the $k$-th step. It is clear that $N_{s(k),k}(\|\Delta_{k}\|)\rightarrow\infty$ as $k\rightarrow\infty$.

Conditions 1 and 2 are easy to satisfy. At first the optimization process runs with constant $s$. As soon as the step size $\|\Delta_{k}\|$ becomes sufficiently small, which means that $N_{s(k),k}(\|\Delta_{k}\|)$ is large enough, we increase $s$, decrease the diameter $d(D_{s})$ and, consequently, increase $L_{s(k)}$ so as to satisfy the conditions of coherence 1 and 2. As we shall see below, $q_{s(k)}$ is the coefficient of proportionality between $\|\Delta_{k+1}\|$ and $\|\Delta_{k}\|$. Therefore, we can estimate $q_{s(k)}$ from the ratio of $\|\Delta_{k+1}\|$ to $\|\Delta_{k}\|$ and thus satisfy condition 2 of the conditions of coherence.

Superlinear optimization method for finding the minimum point of any finite convex function $f(\cdot)$

Let a point $x_{k}$ have already been found. Construct the point $x_{k+1}$.

Calculate the $k$-th step.

[TABLE]

Find a non-negative integer $l_{k}$ for which

[TABLE]

Put $x_{k+1}=x_{k}+2^{-l_{k}}\triangle_{k}$, $k=k+1$.
Calculate for $k=k+1$

[TABLE]

If

[TABLE]

for an arbitrarily chosen sequence $\{\varepsilon_{k}\}$, $\varepsilon_{k}\rightarrow+0$, then we increase $s$ so that the inequality

[TABLE]

remains in force.

Go to step 1 and continue until the step size becomes less than the specified value.

Let us prove that the sequence $\{x_{k}\}$ converges to a minimum point of the function $f(\cdot)$ at a superlinear rate.

Theorem 3.2

The sequence $\{x_{k}\}$, constructed according to the algorithm 1-3, converges to a unique stationary point $x_{*}$ of the function $\Phi(\cdot)$. For large $k$ the following estimate of the rate of convergence of the method holds

[TABLE]

where $\nu(\triangle_{k})\rightarrow_{k}0$ as $\|\triangle_{k}\|\rightarrow_{k}0$.

Proof. As above, we can show that for sufficiently large $k$ the process goes with a full step, i.e. $l_{k}=0$. From the decomposition

[TABLE]

for

[TABLE]

we have

[TABLE]

It is easy to check that

[TABLE]

But it is obvious that $\nabla\Phi_{s}(x_{k+1})=\nabla\tilde{\Phi}_{s,k+1}(x_{k+1})$. Therefore $\nabla\tilde{\Phi}_{s,k+1}(x_{k+1})=\hat{o}_{s,k}(\|\triangle_{k}\|)$. Since the function $\tilde{\Phi}_{s,k}(\cdot)$ has a continuous second derivative satisfying a Lipschitz condition, $o_{s,k}(\cdot),\tilde{o}_{s,k}(\cdot),\hat{o}_{s,k}(\cdot)$ are uniformly infinitesimal functions in $k$. Hence

[TABLE]

From the expression

[TABLE]

we have the evaluation

[TABLE]

where $N_{s,k}(\|\triangle_{k}\|)\rightarrow\infty$ as $\|\triangle_{k}\|\rightarrow 0$. For large $k$ we can ensure that the inequality

[TABLE]

holds (the condition of coherence). Therefore, the sequence $\{x_{k}\}$ converges to a single point $x_{*}$ and

[TABLE]

As soon as

[TABLE]

then

[TABLE]

Thus, the inequality (15) is proved. $\Box$

Remark 3.1

The inequality (15) proves the superlinear convergence rate of the optimization method. Indeed, the coefficient between $\|x_{k+1}-x_{*}\|$ and $\|x_{1}-x_{*}\|$ is equal to $q_{k}^{k}$, where $q_{k}\rightarrow 0$ as $k\rightarrow\infty$.
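To make the notion of a superlinear rate in Remark 3.1 concrete, the sketch below compares a linearly convergent error sequence with one whose contraction factor $q_{k}$ tends to zero. The specific sequences (a fixed factor $1/2$ and $q_{k}=1/(k+1)$) and the helper name `error_sequence` are illustrative assumptions and are not derived from the paper's method.

```python
def error_sequence(q_of_k, e0=1.0, steps=10):
    """Return errors e_0, ..., e_steps with e_k = q_of_k(k) * e_{k-1}."""
    errors = [e0]
    for k in range(1, steps + 1):
        errors.append(q_of_k(k) * errors[-1])
    return errors

linear = error_sequence(lambda k: 0.5)                 # fixed factor: linear rate
superlinear = error_sequence(lambda k: 1.0 / (k + 1))  # factor -> 0: superlinear rate

# Successive ratios of errors: constant in the first case, decreasing to 0 in the second.
linear_ratios = [b / a for a, b in zip(linear, linear[1:])]
superlinear_ratios = [b / a for a, b in zip(superlinear, superlinear[1:])]
```

A sequence of the second kind eventually decreases faster than any geometric progression, which is the sense in which the rate is called superlinear.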

4 Conclusion

Methods for finding a stationary point of a Lipschitz function and a minimum point of an arbitrary convex function are proposed in this paper. To achieve a high rate of convergence, the diameter $d(D)$ of the set $D$ over which the integral averaging is performed must be reduced consistently with the decrease of the step length of the optimization process. Rules for the consistent reduction of the step lengths and of the diameters of the sets $D_{s}$ are given.

Bibliography

1. Rademacher H. Über partielle und totale Differenzierbarkeit I. Math. Ann. 89 (1919), 340-359.
2. Pshenichny B.N., Danilin Yu.M. Numerical methods in extremal problems. M.: Nauka, 1975. 319 p.
3. Pshenichny B.N. Convex analysis and extremal problems. M.: Nauka, 1980. 320 p.
4. Rockafellar R.T. Convex analysis. New York: Wiley, 1972.
5. Demyanov V.F., Rubinov A.M. Foundations of nonsmooth analysis and quasidifferential calculus. M.: Nauka, 1990. 432 p.
6. Prudnikov I.M. $C^{2}(D)$ integral approximation of nonsmooth functions preserving $\varepsilon(D)$ extreme points // Papers of the Institute of Mathematics and Mechanics of the Ural Branch of RAN. 2010. P. 159-169.
7. Prudnikov I.M. Integral approximation of Lipschitz functions // Vestnik of St. Petersburg University. Ser. 10. 2010. Issue 2. P. 70-83.
8. Clarke F. Optimization and nonsmooth analysis. M.: Nauka, 1988. 280 p.
