Advances in Low-Memory Subgradient Optimization
Pavel Dvurechensky, Alexander Gasnikov, Evgeni Nurminski, Fedor Stonyakin

TL;DR
This paper reviews recent advances in low-memory subgradient algorithms for non-smooth convex optimization, highlighting techniques that improve speed and efficiency for large-scale problems with minimal storage requirements.
Contribution
It introduces modern methods like Nesterov smoothing, Universal Mirror Prox, and adaptive Mirror Descent, emphasizing their theoretical complexity bounds and practical applications.
Findings
Universal Mirror Prox algorithm for variational inequalities.
Primal-dual Mirror Descent method optimal for Lipschitz problems.
Application to sparse Truss Topology Design problem.
Pavel E. Dvurechensky: Weierstrass Institute for Applied Analysis and Stochastics, Mohrenstr. 39, Berlin, 10117, Germany, and Institute for Information Transmission Problems RAS, Bolshoy Karetny per. 19, build. 1, Moscow, 127051, Russia
Alexander V. Gasnikov: Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow Region, 141701, Russia
Evgeni A. Nurminski: Far Eastern Federal University, Russky ostrov, Vladivostok, 690000, Russia
Fedor S. Stonyakin: V.I. Vernadsky Crimean Federal University, 4 V. Vernadsky Ave, Simferopol, 295007, and Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow Region, 141701
Abstract
One of the main goals in the development of non-smooth optimization is to cope with high-dimensional problems by decomposition, duality or Lagrangian relaxation, which greatly reduces the number of variables at the cost of worsening the differentiability of the objective or constraints. The small or medium dimensionality of the resulting non-smooth problems allows the use of bundle-type algorithms to achieve higher rates of convergence and obtain higher accuracy, which of course comes at the cost of additional memory requirements, typically of the order of , where is the number of variables of the non-smooth problem. However, with the rapid development of more and more sophisticated models in industry, economy, finance, etc., such memory requirements are becoming too hard to satisfy. This raised interest in subgradient-based low-memory algorithms, and later developments in this area significantly improved over their early variants while still preserving memory requirements. To review these developments, this chapter is devoted to black-box subgradient algorithms with minimal requirements for the storage of the auxiliary results that are necessary to execute these algorithms. To provide a historical perspective, this survey starts with the original result of N.Z. Shor, which opened this field with the application to the classical transportation problem. The theoretical complexity bounds for smooth and non-smooth convex and quasi-convex optimization problems are then briefly presented to introduce the relevant fundamentals of non-smooth optimization. Special attention in this section is given to the adaptive step-size policy, which aims to attain the lowest complexity bounds.
Unfortunately, the non-differentiability of the objective function in convex optimization essentially worsens the theoretical lower bounds for the rate of convergence in subgradient optimization compared to the smooth case, but there are different modern techniques that allow one to solve non-smooth convex optimization problems faster than the lower complexity bounds dictate. In this work, particular attention is given to the Nesterov smoothing technique, the Nesterov Universal approach, and the Legendre (saddle point) representation approach. The new results on Universal Mirror Prox algorithms represent the original parts of the survey. To demonstrate the application of non-smooth convex optimization algorithms to the solution of huge-scale extremal problems, we consider convex optimization problems with non-smooth functional constraints and propose two adaptive Mirror Descent methods. The first method is of the primal-dual variety and is proved to be optimal in terms of lower oracle bounds for the class of Lipschitz-continuous convex objectives and constraints. The advantages of applying this method to the sparse Truss Topology Design problem are discussed in some detail. The second method can be applied to the solution of convex and quasi-convex optimization problems and is optimal in the sense of complexity bounds. The concluding part of the survey contains the important references that characterize recent developments in non-smooth convex optimization.
Introduction
We consider a finite-dimensional non-differentiable convex optimization problem (COP)
[TABLE]
where denotes a finite-dimensional space of primal variables and is a finite convex function, not necessarily differentiable. For a given point the subgradient oracle returns the value of the objective function at that point and a subgradient . We do not make any assumption about the choice of from . As we are mainly interested in computational issues related to solving (1), we assume that this problem is solvable and has a nonempty and bounded set of solutions .
This problem enjoys a considerable popularity due to its important theoretical properties and numerous applications in large-scale structured optimization, discrete optimization, exact penalization in constrained optimization, and others. Non-smooth optimization theory made it possible to solve in an efficient way classical discrete min-max problems ddm2002 , -approximation and others, at the same time opening new approaches in bi-level, monotropic programming, two-stage stochastic optimization, to name a few.
As a major step in the development of different algorithmic ideas, we can start with the subgradient algorithm due to Shor (see Shor2012 for an overview and references to the earliest works).
1 Example Application: Transportation Problem and The First Subgradient Algorithm
From a utilitarian point of view, the development of non-smooth (convex) optimization started with the classical transportation problem
[TABLE]
which is widely used in many applications.
By dualizing this problem with respect to the balancing constraints we can convert (2) into a dual problem of the kind
[TABLE]
where are dual variables associated with the balancing constraints in (2) and is the dual function defined as
[TABLE]
and is the Lagrange function of the problem:
[TABLE]
By rearranging terms in this expression we can obtain the following expression for the dual function
[TABLE]
where
[TABLE]
is the indicator function of the set which is the feasible set of the dual problem.
Of course, by explicitly writing the feasibility constraints for (3) we obtain the linear dual transportation problem with fewer variables but with a much higher number of constraints. This is bad news for the textbook simplex method, so many specialized algorithms were developed. One of them was the simple-minded method of generalized gradient, which started the development of non-smooth optimization.
This method relies on subgradient of concave function which we can transform into convex just by changing signs and replacing with
[TABLE]
and ask for its minimization.
According to convex analysis TR1970 , the subdifferential exists for any , and in this case is simply equal to the (constant) vector of the linear objective in the interior of . The situation becomes more complicated when happens to be at the boundary of : the subdifferential set ceases to be a singleton and even becomes unbounded. Roughly speaking, certain linear manifolds are added to , but we will not go into details here. The difficulty is that if we mimic the gradient method of the kind
[TABLE]
with a certain step-size , we inevitably violate the dual feasibility constraints as . Then the dual function (1) becomes undefined, and correspondingly the subdifferential set becomes undefined as well.
There are at least two simple ways to overcome this difficulty. One is to incorporate into the gradient method certain operations that restore feasibility; an appropriate candidate is the orthogonal projection operation, where one can make use of the special structure of the constraints and sparsity. However, it will still require computing a projection operator of the kind for basis matrices , with a rather uncertain number of iterations and for matrices of size around . Neither computer speeds nor memory sizes at that time were up to the demands of solving problems of , which was required by GOSPLAN!
The second, ingenious way was to take into account that if , which is required anyway for the solvability of the transportation problem in closed form, then the flow variables may be uniformly bounded by and the dual function can be redefined as
[TABLE]
where the penalty function is easily computed by component-wise maximization:
[TABLE]
where . Then the dual objective function becomes finite and the optimization problem becomes unconstrained, so we can use the simple subgradient method with very low requirements for memory and computations.
Actually even tighter bounds can be imposed on the flow variables which may be advantageous for computational reasons.
In both cases there is the fundamental problem of recovering an optimal primal solution from the dual one. This problem was studied by many authors, and recent advances in this area can be found in the excellent paper by A. Nedic and A. Ozdaglar nedoz09 . Theoretically speaking, nonzero positive values of , where are the exact optimal solutions of the dual problem (3), signal that the corresponding optimal primal flow is equal to zero. Hopefully, after excluding these variables we obtain a nondegenerate basis and can compute the remaining variables by simple and efficient linear algebra, especially taking into account the unimodularity of the basis.
However, the theoretical gap between zeros and non-zeros is exponentially small even for integer data of modest length, therefore we need an accuracy unattainable by modern 64- to 128-bit hardware. Also, real-life computations are always accompanied by numerical noise, and we face the hard choice of in fact guessing which dual constraints are active and which are not.
To connect the transportation problem with non-smooth optimization notice that the penalty function is finite with the subdifferential which can be represented as a set of matrices
[TABLE]
so the subdifferential set is a convex hull of up to extreme points, an enormous number even for a transportation problem of modest size. Nevertheless, it is easy to get at least a single member of the subdifferential and consider the simplest version of the subgradient method:
[TABLE]
where is a given starting point, is a fixed step-size and is a normalized subgradient . Of course we assume that , otherwise is already a solution.
Of course, there is no hope of a classical convergence result such that , but the remarkable theorem of Shor Shor79 establishes that this very simple algorithm provides an approximate solution, at least theoretically.
Theorem 1.1
Let be a finite convex function with a subdifferential , and let the sequence be obtained by the recursive rule
[TABLE]
with and is a normalized subgradient at the point . Then for any there is an infinite set such that for any
[TABLE]
The statement of the theorem is illustrated on Fig. 1 together with the idea of the proof.
The detailed proof of the theorem goes as follows. Let and estimate
[TABLE]
The last term in fact equals
[TABLE]
where is a hyperplane, orthogonal to and passing through the point , so
[TABLE]
If for any then
[TABLE]
therefore
[TABLE]
when . This contradiction proves that there is such that or .
To complete the proof notice that by convexity and therefore
[TABLE]
By setting we obtain .
By replacing in (11) by and repeating the reasoning above we obtain such that , then in the same manner and so on with , which completes the proof.
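The behaviour described by Theorem 1.1 can be observed numerically. The following is a minimal sketch of the fixed-step normalized subgradient method, not the authors' code; the test function $f(x)=\|x\|_1$, the starting point and the step-size value are assumptions chosen only for illustration.

```python
import numpy as np

def shor_fixed_step(f, subgrad, x0, step, n_iter):
    """Fixed-step normalized subgradient method; returns the best point seen."""
    x = np.asarray(x0, dtype=float)
    best_x, best_f = x.copy(), f(x)
    for _ in range(n_iter):
        g = subgrad(x)
        norm = np.linalg.norm(g)
        if norm == 0.0:            # zero subgradient: x is already a minimizer
            return x, f(x)
        x = x - step * g / norm    # move a fixed distance along -g / ||g||
        fx = f(x)
        if fx < best_f:
            best_x, best_f = x.copy(), fx
    return best_x, best_f

# Hypothetical test problem: f(x) = ||x||_1, minimized at the origin.
f = lambda x: np.abs(x).sum()
subgrad = lambda x: np.sign(x)     # a valid subgradient of the l1-norm
x_best, f_best = shor_fixed_step(f, subgrad, x0=[3.0, -2.0], step=0.1, n_iter=200)
```

The iterates do not converge in the classical sense: they keep oscillating at a distance comparable to the step size, while the best recorded value stays close to the minimum.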
2 Complexity Results for Convex Optimization
In this section we describe the complexity results for non-smooth convex optimization problems. Most of the results mentioned below can be found in the books nemirovsky1983problem ; polyak1987introduction ; nesterov2018lectures ; bubeck2015 ; ben-tal2015lectures . We start with the ‘small dimensional problems’, when
[TABLE]
where is a number of oracle calls (number of subgradient calculations or/and calculations of separation hyperplane to some simple set at a given point).
Let’s consider convex optimization problem
[TABLE]
where – is a compact and simple set (it’s significant here!). Based on at least subgradient calculations (in general, oracle calls) we would like to find such a point that
[TABLE]
where is the optimal value of the function in (13) and is a solution of (13). The lower and upper bounds for the oracle complexity are (up to a multiplier which has logarithmic dependence on some characteristic of the set )
[TABLE]
where $\Delta f=\sup_{\vec{x},\vec{y}\in Q}\left\{f(\vec{y})-f(\vec{x})\right\}$. The center of gravity method Levin1965 ; Newman1965 converges according to this estimate. The center of gravity method in is a simple binary search method brent1973algorithms . But in this method is hard to implement: the complexity of an iteration is too high, because it requires a center of gravity oracle bubeck2015 . The well-known ellipsoid method Shor85 ; nemirovsky1983problem requires oracle calls and iteration complexity. (Here and below, for all (large) : with some constants and . Typically, . If , then .) In Vaidya1989 ; bubeck2015 a special version of the cutting plane method was proposed. This method (Vaidya’s method) requires oracle calls and has iteration complexity . In lee2015faster a method with oracle calls and iteration complexity was proposed. Unfortunately, at the moment it is not obvious that this method is very practical, due to the large log-factors in .
Based on the ellipsoid method, in the late 1970s Leonid Khachiyan showed khachiyan1979polynomial that LP is in P in bit complexity. Let us briefly explain the idea. The main question is whether is solvable or not, where , and all elements of and are integers. We would also like to find one of the exact solutions . Up to a logarithmic factor in complexity, this problem is equivalent to finding the exact solution of an LP problem with integer , and . We consider only inequality constraints, as it is known that to find the exact solution of one can use the polynomial Gaussian elimination algorithm with arithmetic operations (a.o.) complexity.
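The remark that the center of gravity method reduces to binary search can be made concrete, assuming the one-dimensional case is the one meant there. The sketch below halves an interval using only the sign of a subgradient at its midpoint (the center of gravity of an interval); the test function is a hypothetical example.

```python
def bisection_minimize(subgrad, lo, hi, n_iter):
    """Halve [lo, hi] using only the sign of a subgradient at the midpoint."""
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        g = subgrad(mid)
        if g > 0:        # minimizer lies to the left of mid
            hi = mid
        elif g < 0:      # minimizer lies to the right of mid
            lo = mid
        else:            # zero subgradient: mid is a minimizer
            return mid
    return 0.5 * (lo + hi)

# Hypothetical example: f(x) = |x - 0.3| + 0.5 * x on [-1, 1], minimized at x = 0.3.
def subgrad(x):
    kink = 1.0 if x > 0.3 else (-1.0 if x < 0.3 else 0.0)
    return kink + 0.5

x_hat = bisection_minimize(subgrad, -1.0, 1.0, n_iter=40)
```

Each oracle call halves the localization interval, which is the one-dimensional instance of the geometric rate of the cutting-plane family.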
Let us introduce
[TABLE]
If is compatible, then there exists such that , otherwise
[TABLE]
Thus, the question of compatibility of is equivalent to the problem of finding minimum of the following non-smooth convex optimization problem
[TABLE]
The approach of khachiyan1979polynomial is to apply the ellipsoid method to this problem with . From the complexity of this method, it follows that in -bit arithmetic with cost of PC memory one can find (if it exists) in a.o.
Note that in the ideal arithmetic with real numbers it is still an open question blum2012complexity whether it is possible to find the exact solution of an LP problem (with the data given by real numbers) in polynomial time in the ideal arithmetic ( – costs ).
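To illustrate the reduction of compatibility to non-smooth minimization, here is a minimal floating-point sketch of the central-cut ellipsoid method applied to the maximal residual $\max_i (a_i^T x - b_i)$. It is an assumption-laden illustration: the exact form of the function used in the text, the rounding scheme and the choice of accuracy are not reproduced, and the initial ball radius and the test system are hypothetical.

```python
import numpy as np

def ellipsoid_feasibility(A, b, radius, max_iter=1000):
    """Central-cut ellipsoid method: look for x with A x <= b inside a ball.

    Requires dimension n >= 2 (the update formula divides by n^2 - 1).
    """
    n = A.shape[1]
    c = np.zeros(n)                    # ellipsoid center
    P = (radius ** 2) * np.eye(n)      # ellipsoid {x: (x-c)^T P^{-1} (x-c) <= 1}
    for _ in range(max_iter):
        residual = A @ c - b
        i = int(np.argmax(residual))
        if residual[i] <= 0.0:         # max_i (a_i^T c - b_i) <= 0: c is feasible
            return c
        g = A[i]                       # subgradient of the max-residual function at c
        Pg = P @ g
        gt = Pg / np.sqrt(g @ Pg)
        c = c - gt / (n + 1)           # shift the center away from the violated constraint
        P = (n * n / (n * n - 1.0)) * (P - (2.0 / (n + 1)) * np.outer(gt, gt))
    return None                        # no feasible point found within max_iter

# Hypothetical system: the box 0.5 <= x_1, x_2 <= 1.5 written as A x <= b.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.5, -0.5, 1.5, -0.5])
x_feas = ellipsoid_feasibility(A, b, radius=10.0)
```

Each cut keeps the half of the ellipsoid that can still contain feasible points, so the volume shrinks geometrically until the center becomes feasible.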
Now let us consider ‘large dimensional problems’
[TABLE]
Table 1 describes (for more details see ben-tal2015lectures ; bubeck2015 ; nesterov2018lectures ) optimal estimates for the number of oracle calls for the convex optimization problem (13) in the case when . Now is not necessarily a compact set.
Here is a “distance” (up to a -factor) between starting point and the nearest solution
[TABLE]
Let’s describe the optimal method in the simplest case: , polyak1987introduction ; nesterov2009primal-dual . Define
[TABLE]
The main iterative process is (for simplicity we’ll denote an arbitrary element of as )
[TABLE]
Assume that under
[TABLE]
where .
Hence, from (14), (15) we have
[TABLE]
[TABLE]
[TABLE]
Here we choose (if $\vec{x}_{\ast}$ isn’t unique, we choose the one nearest to )
[TABLE]
[TABLE]
[TABLE]
If
[TABLE]
then
[TABLE]
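A numerical sanity check of this type of estimate: the sketch below runs the subgradient method with a constant step and averages the iterates. The step $h = R/(M\sqrt{N})$ and the resulting bound $MR/\sqrt{N}$ are the standard textbook choices and are assumed here, since the displayed formulas are not reproduced; the test function is hypothetical.

```python
import numpy as np

def subgradient_method(subgrad, x0, R, M, N):
    """Constant-step subgradient method; returns the average of x^0, ..., x^{N-1}."""
    x = np.asarray(x0, dtype=float)
    h = R / (M * np.sqrt(N))        # assumed step size (standard choice)
    x_sum = np.zeros_like(x)
    for _ in range(N):
        x_sum += x
        x = x - h * subgrad(x)
    return x_sum / N

# Hypothetical test problem: f(x) = ||x - a||_1 with optimal value 0.
a = np.array([1.0, -2.0, 0.5])
f = lambda x: np.abs(x - a).sum()
subgrad = lambda x: np.sign(x - a)
R = np.linalg.norm(a)               # distance from x^0 = 0 to the solution
M = np.sqrt(a.size)                 # bound on the 2-norm of any subgradient
N = 400
x_bar = subgradient_method(subgrad, np.zeros(3), R, M, N)
bound = M * R / np.sqrt(N)
```

On this example the accuracy of the averaged point stays below `bound`, in agreement with the $O(MR/\sqrt{N})$ rate.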
Note that the precise lower bound for fixed-step first-order methods for the class of convex optimization problems with (15) is Drori-Teboulle2016
[TABLE]
Inequality (18) means that (see also Table 1)
[TABLE]
So, one can note that if we use in (14)
[TABLE]
the result (18) holds with nesterov2009primal-dual
[TABLE]
If we put in (19),
[TABLE]
like in (17), the result similar to (18) also holds
[TABLE]
not only for the convex functions, but also for quasi-convex functions Polyak1969 ; nesterov1989 :
[TABLE]
Note that
[TABLE]
Hence, for all ,
[TABLE]
therefore
[TABLE]
Inequality (20) shows that we need assumption (15) to hold only with .
For the general (constrained) case (13) we introduce a norm and some prox-function , which is continuous and 1-strongly convex with respect to , i.e. , for all . We also introduce Bregman’s divergence ben-tal2015lectures
[TABLE]
We set , where is a solution of (13) (if is not unique, then we assume that is minimized ). The natural generalization of the iteration process (14) is the Mirror Descent algorithm nemirovskii1979efficient ; ben-tal2015lectures , which iterates as
[TABLE]
For this iteration process instead of (16) we have
[TABLE]
where for all $\vec{x}$: $V[\vec{x}](\vec{x}_{*})\leq 2V[\vec{x}^{0}](\vec{x}_{*})=2R^{2}$, see also Section 4.
Analogues of formulas (17), (18), (20) are also valid
[TABLE]
where
[TABLE]
and
[TABLE]
In ben-tal2015lectures the authors discuss how to choose for different simple convex sets . One of these examples (the unit simplex) will be considered below. Note that in all these examples one can guarantee that ben-tal2015lectures :
[TABLE]
Note, that if , then , ,
[TABLE]
[TABLE]
that corresponds to the standard gradient-type iteration process (14).
Example (unit simplex). We have
[TABLE]
[TABLE]
For ,
[TABLE]
The main result here is
[TABLE]
Note, that if we use -norm and here, we will have higher iteration complexity (2-norm projections on unit simplex) and
[TABLE]
Since typically , it is worth using the -norm.
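The unit simplex example can be sketched in code. With the entropy prox-function the Mirror Descent step has a closed multiplicative form; this is the well-known exponentiated-gradient update, assumed here because the displayed formulas are not reproduced. The constant step and the bound $M_\infty\sqrt{2\ln n/N}$ are the standard ones for a uniform starting point, and the linear test objective is hypothetical.

```python
import numpy as np

def entropic_mirror_descent(subgrad, n, M_inf, N):
    """Mirror Descent on the unit simplex with the entropy prox-function."""
    x = np.full(n, 1.0 / n)                       # uniform starting point
    h = np.sqrt(2.0 * np.log(n) / N) / M_inf      # assumed constant step size
    x_sum = np.zeros(n)
    for _ in range(N):
        x_sum += x
        w = x * np.exp(-h * subgrad(x))           # multiplicative update
        x = w / w.sum()                           # renormalize onto the simplex
    return x_sum / N

# Hypothetical test problem: f(x) = <c, x> on the simplex, minimum value min(c).
c = np.array([0.3, -0.2, 0.9, 0.1])
subgrad = lambda x: c
N = 500
M_inf = np.abs(c).max()                           # bound on the infinity-norm of subgradients
x_bar = entropic_mirror_descent(subgrad, c.size, M_inf, N)
gap = c @ x_bar - c.min()
bound = M_inf * np.sqrt(2.0 * np.log(c.size) / N)
```

No Euclidean projection onto the simplex is needed; the step costs one exponentiation and one normalization per iteration.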
Assume now that in (13) is additionally -strongly convex in norm:
[TABLE]
Let
[TABLE]
where
[TABLE]
Then Lacost-Julien2012
[TABLE]
Hence (see also Table 1),
[TABLE]
This bound is also un-improvable up to a constant factor nemirovsky1983problem ; nesterov2018lectures .
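A minimal sketch of a scheme achieving this kind of rate in the Euclidean setting, under the assumption that the step sizes $2/(\mu(t+1))$ and the averaging weights proportional to $t$ are used, as in the approach cited above as Lacost-Julien2012; the displayed formulas of the text are not reproduced. The test function, the feasible ball and the constants are hypothetical.

```python
import numpy as np

def strongly_convex_subgradient(subgrad, project, x1, mu, T):
    """Projected subgradient method with steps 2/(mu*(t+1)) and weights ~ t."""
    x = np.asarray(x1, dtype=float)
    weighted_sum = np.zeros_like(x)
    for t in range(1, T + 1):
        weighted_sum += t * x                           # weight t for the t-th point
        x = project(x - 2.0 / (mu * (t + 1)) * subgrad(x))
    return 2.0 * weighted_sum / (T * (T + 1))

# Hypothetical problem: f(x) = 0.5*mu*||x||^2 + ||x - a||_1 on the unit ball.
mu = 1.0
a = np.array([0.3, -0.2])
f = lambda x: 0.5 * mu * (x @ x) + np.abs(x - a).sum()
subgrad = lambda x: mu * x + np.sign(x - a)
project = lambda v: v / max(1.0, np.linalg.norm(v))     # projection onto the unit ball
T = 1000
x_bar = strongly_convex_subgradient(subgrad, project, np.zeros(2), mu, T)
f_star = f(a)                          # x* = a, since |a_i| <= 1/mu for every i
M = mu * 1.0 + np.sqrt(a.size)         # bound on subgradient norms over the ball
bound = 2.0 * M ** 2 / (mu * (T + 1))
```

Giving later iterates larger weights is what removes the logarithmic factor that uniform averaging would introduce.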
3 Looking into the Black-Box
In this section we consider how the special structure of a problem can be used to solve non-smooth optimization problems with the convergence rate , which is faster than the lower bound for the general class of non-smooth convex problems nemirovsky1983problem . Nevertheless, there is no contradiction, as additional structure is used and we are looking inside the black-box.
3.1 Nesterov’s smoothing
In this subsection, following nesterov2005smooth , we consider the problem
[TABLE]
where is a linear operator, is a continuous convex function on , are convex compacts, is convex function with -Lipschitz-continuous gradient.
Let us consider an example of with . Then,
[TABLE]
, , and is the ball in 1-norm.
The main idea of Nesterov is to add regularization inside the definition of in (21). More precisely, a prox-function (see definition in Section 2) is introduced for the set and a smoothed counterpart for is defined as
[TABLE]
and is the optimal solution of this maximization problem.
Theorem 3.1 (nesterov2005smooth )
The function is well defined, convex and continuously differentiable at any with . Moreover, is Lipschitz continuous with constant .
Here the adjoint operator is defined by equality , and the norm of the operator is defined by .
Since is bounded, is a uniform approximation for the function , namely, for all ,
[TABLE]
where .
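A small worked instance of the smoothing idea, under assumptions not taken from the text: the identity operator, the unit simplex as the dual set and the entropy prox-function $d(u)=\ln n+\sum_i u_i\ln u_i$. In this case $f(x)=\max_i x_i$, the smoothed function is $f_\mu(x)=\mu\ln\big(\tfrac{1}{n}\sum_i e^{x_i/\mu}\big)$, its gradient is the maximizer $u_\mu(x)$ (a softmax vector), and $f_\mu(x)\le f(x)\le f_\mu(x)+\mu\ln n$.

```python
import numpy as np

def smoothed_max(x, mu):
    """Entropy-smoothed max_i x_i and its gradient (a softmax vector)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.max()) / mu                    # shift by the max for numerical stability
    e = np.exp(z)
    value = x.max() + mu * np.log(e.sum() / x.size)
    grad = e / e.sum()                        # maximizer u_mu(x) on the simplex
    return value, grad

# Hypothetical data for checking the uniform approximation property.
x = np.array([0.2, -1.0, 0.7, 0.65])
mu = 0.1
value, grad = smoothed_max(x, mu)
upper = value + mu * np.log(x.size)           # f_mu(x) + mu * max of the prox-function
```

Decreasing `mu` tightens the approximation but increases the Lipschitz constant of the gradient, which is the trade-off behind the choice of the smoothing parameter.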
Then, the idea is to choose sufficiently small and apply accelerated gradient method to minimize on . We use accelerated gradient method from dvurechensky2017adaptive ; dvurechensky2018computational which is different from the original method of nesterov2005smooth .
Theorem 3.2 (dvurechensky2017adaptive ; dvurechensky2018computational )
Let the sequences , be generated by Algorithm 1. Then, for all , it holds that
[TABLE]
Following the same steps as in the proof of Theorem 3 in nesterov2005smooth , we obtain
Theorem 3.3
Let Algorithm 1 be applied to minimize on with , where . Then, after iterations, we have
[TABLE]
Proof
Applying Theorem 3.2 to , and using (22), we obtain
[TABLE]
Substituting the value of from the theorem statement, we finish the proof.
A generalization of the smoothing technique for the case of non-compact sets , which is especially interesting when dealing with problems dual to problems with linear constraints, can be found in tran-dinh2015smooth . Ubiquitous entropic regularization of optimal transport cuturi2013sinkhorn can be seen as a particular case of the application of smoothing technique, especially in the context of Wasserstein barycenters cuturi2014fast ; uribe2018distributed ; dvurechensky2018decentralize .
3.2 Nemirovski’s Mirror Prox
In his paper nemirovski2004prox , Nemirovski considers problem (21) in the following form
[TABLE]
pointing to the fact that this problem is as general as (21). Indeed, the change of variables and the feasible set allows one to make linear. His idea is to consider problem (29) directly as a convex-concave saddle point problem and the associated weak variational inequality (VI).
[TABLE]
where the operator
[TABLE]
is monotone, i.e. , and Lipschitz-continuous, i.e. . With the appropriate choice of norm on and prox-function for , see Section 5 in nemirovski2004prox , the Lipschitz constant for can be estimated as .
Theorem 3.4 (nemirovski2004prox )
Assume that is monotone and -Lipschitz-continuous. Then, for any and any ,
[TABLE]
Moreover, if the VI is associated with a convex-concave saddle point problem, i.e.
- ,
- with convex compact sets ,
- $\Phi(\vec{z})=\Phi(\vec{x},\vec{u})=\begin{pmatrix}\nabla_{\vec{x}}f(\vec{x},\vec{u})\\ -\nabla_{\vec{u}}f(\vec{x},\vec{u})\end{pmatrix}$ for a continuously differentiable function which is convex in and concave in ,
then
[TABLE]
Choosing appropriately the norm in the space and applying Mirror Prox algorithm to solve problem (29) as a saddle point problem, we obtain that the saddle point error in the l.h.s. of (35) decays as . This is slightly worse than the rate in (27) since the accelerated gradient method allows the faster decay for the smooth part . An accelerated Mirror Prox method with the same rate as in (27) can be found in chen2017accelerated .
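In the Euclidean setup the Mirror Prox iteration reduces to the extragradient scheme: one projected step to an intermediate point and a second step from the original point using the operator evaluated at the intermediate one. The sketch below is an illustration under that assumption; the bilinear test problem, the step size and the starting point are hypothetical.

```python
import numpy as np

def extragradient(Phi, project, z0, step, N):
    """Euclidean Mirror Prox (extragradient); returns the average of intermediate points."""
    z = np.asarray(z0, dtype=float)
    w_sum = np.zeros_like(z)
    for _ in range(N):
        w = project(z - step * Phi(z))   # extrapolation step
        z = project(z - step * Phi(w))   # update using the operator at w
        w_sum += w
    return w_sum / N

# Hypothetical saddle point problem: min_x max_u x*u on [-1, 1]^2, solution (0, 0).
Phi = lambda z: np.array([z[1], -z[0]])  # (df/dx, -df/du) for f(x, u) = x*u
project = lambda z: np.clip(z, -1.0, 1.0)
z_bar = extragradient(Phi, project, z0=[0.8, -0.6], step=0.5, N=200)
# Saddle point error: max_u f(x_bar, u) - min_x f(x, u_bar) = |x_bar| + |u_bar|.
gap = abs(z_bar[0]) + abs(z_bar[1])
```

A plain projected gradient step on this bilinear problem would spiral away from the solution; evaluating the operator at the extrapolated point is what yields the $O(1/N)$ decay of the averaged gap.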
4 Non-Smooth Optimization in Large Dimensions
The optimization of non-smooth functionals with constraints attracts widespread interest in large-scale optimization and its applications ben-tal1997robust ; nesterov2014primal-dual . Subgradient methods for non-smooth optimization have a long history, starting with the method for deterministic unconstrained problems in the Euclidean setting in shor1967generalized and the generalization for constrained problems in polyak1967general , where the idea of switching steps between the direction of a subgradient of the objective and the direction of a subgradient of the constraint was suggested. The non-Euclidean extension, usually referred to as Mirror Descent, originated in nemirovskii1979efficient ; nemirovsky1983problem and was later analyzed in beck2003mirror . An extension for constrained problems was proposed in nemirovsky1983problem , see also a recent version in beck2010comirror . To prove a faster convergence rate of Mirror Descent for a strongly convex objective in the unconstrained case, the restart technique nemirovskii1985optimal ; nemirovsky1983problem ; nesterov1983method was used in juditsky2012first-order . Usually, the step-size and stopping rule for Mirror Descent require knowledge of the Lipschitz constant of the objective function and of the constraint, if any. Adaptive step-sizes, which do not require this information, are considered in nemirovskii1979efficient for problems without inequality constraints, and in beck2010comirror for constrained problems.
Formally speaking, we consider the following convex constrained minimization problem
[TABLE]
where is a convex closed subset of a finite-dimensional real vector space , , are convex functions.
We assume to be a non-smooth Lipschitz-continuous function and the problem (36) to be regular. The latter means that there exists a point in the relative interior of the set such that .
Note that, although problem (36) contains only one inequality constraint, the considered algorithms allow one to solve more general problems with a number of constraints given as . The reason is that these constraints can be aggregated and represented as an equivalent constraint given by , where .
We consider two adaptive Mirror Descent methods bayandina2018mirror for the problem (36). Both methods have complexity and are optimal.
We consider algorithms which are based on the Mirror Descent method. Thus, we start with the description of the proximal setup and basic properties of the Mirror Descent step. Let be a finite-dimensional real vector space and be its dual. We denote the value of a linear function at by . Let be some norm on , be its dual, defined by $\|\vec{g}\|_{E,*}=\max_{\vec{x}}\left\{\langle\vec{g},\vec{x}\rangle : \|\vec{x}\|_{E}\leq 1\right\}$. We use to denote any subgradient of a function at a point .
Given a vector , and a vector , the Mirror Descent step is defined as
[TABLE]
We make the simplicity assumption, which means that is easily computable.
The following lemma ben-tal2015lectures describes the main property of the Mirror Descent step.
Lemma 1
Let be some convex function over a set , be a step-size, . Let the point be defined by . Then, for any ,
[TABLE]
The following analog of Lemma 1 for -subgradient holds.
Lemma 2
Let be some convex function over a set , be a step-size, . Let the point be defined by . Then, for any ,
[TABLE]
We consider problem (36) in two different settings, namely, non-smooth Lipschitz-continuous objective function and general objective function , which is not necessarily Lipschitz-continuous, e.g. a quadratic function. In both cases, we assume that is non-smooth and is Lipschitz-continuous
[TABLE]
Let be a solution to (36). We say that a point is an -solution to (36) if
[TABLE]
All methods considered in this section (Algorithms 3 and 4) are applicable when an -subgradient is used instead of the usual subgradient. In this case we can get an -solution :
[TABLE]
The methods we describe are based on Polyak’s switching subgradient method polyak1967general for constrained convex problems, also analyzed in nesterov2010introduction , and the Mirror Descent method originated in nemirovsky1983problem ; see also nemirovskii1979efficient .
4.1 Convex Non-Smooth Objective Function
In this subsection, we assume that is a non-smooth Lipschitz-continuous function
[TABLE]
Let be a solution to (36) and assume that we know a constant such that
[TABLE]
For example, if is a compact set, one can choose .
Theorem 4.1
Assume that inequalities (40) and (43) hold and a known constant is such that . Then, Algorithm 3 stops after not more than
[TABLE]
iterations and is an -solution to (36) in the sense of (41).
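Since the listing of Algorithm 3 is not reproduced here, the following is only a sketch of an adaptive switching scheme in its spirit, written for the Euclidean prox-setup without projection. The step sizes $h_k=\varepsilon/M_k^2$ (with $M_k$ the norm of the subgradient used at step $k$), the stopping rule $\sum_k M_k^{-2}\ge 2\Theta_0^2/\varepsilon^2$ and the weighted output over productive steps are assumptions; all function names and the test problem are hypothetical.

```python
import numpy as np

def adaptive_switching_md(f_sub, g, g_sub, x0, theta0_sq, eps, max_iter=100000):
    """Switching subgradient scheme with adaptive steps (Euclidean sketch)."""
    x = np.asarray(x0, dtype=float)
    weighted_sum = np.zeros_like(x)   # sum of h_k * x^k over productive steps
    weight = 0.0                      # sum of h_k over productive steps
    inv_sq_sum = 0.0                  # sum of 1 / M_k^2 over all steps
    for _ in range(max_iter):
        productive = g(x) <= eps      # constraint nearly satisfied: step along f
        d = f_sub(x) if productive else g_sub(x)
        m_sq = float(d @ d)           # squared norm of the subgradient in use
        h = eps / m_sq                # adaptive step: no Lipschitz constant needed
        if productive:
            weighted_sum += h * x
            weight += h
        x = x - h * d
        inv_sq_sum += 1.0 / m_sq
        if inv_sq_sum >= 2.0 * theta0_sq / eps ** 2:
            break
    return weighted_sum / weight

# Hypothetical problem: minimize ||x - a||_2 subject to x_1 + x_2 - 1 <= 0.
a = np.array([2.0, 2.0])
f = lambda x: np.linalg.norm(x - a)
f_sub = lambda x: (x - a) / np.linalg.norm(x - a)
g = lambda x: x[0] + x[1] - 1.0
g_sub = lambda x: np.array([1.0, 1.0])
eps = 0.05
x_star = np.array([0.5, 0.5])                  # projection of a onto the half-plane
theta0_sq = 0.5 * float(x_star @ x_star)       # 0.5 * ||x^0 - x*||^2 with x^0 = 0
x_hat = adaptive_switching_md(f_sub, g, g_sub, np.zeros(2), theta0_sq, eps)
```

The method never evaluates a Lipschitz constant: the observed subgradient norms enter both the step size and the stopping rule, so iterations with small subgradients bring the stopping moment closer.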
Let us now show that Algorithm 3 allows one to reconstruct an approximate solution to the problem which is dual to (36). We consider a special type of problem (36) with given by
[TABLE]
Then, the dual problem to (36) is
[TABLE]
where are Lagrange multipliers.
We slightly modify the assumption (44) and assume that the set is bounded and that we know a constant such that
[TABLE]
As before, denote , . Let . Then a subgradient of is used to make the -th step of Algorithm 3. To find this subgradient, it is natural to find an active constraint such that and use to make a step. Denote by the index of the active constraint whose subgradient is used to make a non-productive step at iteration . In other words, and . We define an approximate dual solution at step as
[TABLE]
and modify Algorithm 3 to return a pair .
Theorem 4.2
Assume that the set is bounded, the inequalities (40) and (43) hold and a known constant is such that . Then, modified Algorithm 3 stops after not more than
[TABLE]
iterations and the pair returned by this algorithm satisfies
[TABLE]
Now we consider an interesting example of a huge-scale problem nesterov2014subgradient ; nesterov2014primal-dual with a sparse structure. We would like to illustrate two important ideas. Firstly, consideration of the dual problem can simplify the solution, if it is possible to reconstruct the solution of the primal problem by solving the dual problem. Secondly, for special sparse non-smooth piecewise linear functions we suggest a very efficient implementation of one subgradient iteration nesterov2014subgradient . In such cases simple subgradient methods (for example, Algorithm 3) can be useful due to the relatively inexpensive cost of iterations.
Recall (see e.g. nesterov2014primal-dual ) that the Truss Topology Design problem consists in finding the best mechanical structure resisting an external force with an upper bound for the total weight of the construction. Its mathematical formulation looks as follows:
[TABLE]
where is a vector of external forces, is a vector of virtual displacements of nodes in , is a vector of bars, and is the total weight of construction. The compliance matrix has the following form:
[TABLE]
where are the vectors describing the interactions of two nodes connected by an arc. These vectors are very sparse: for 2D-model they have at most 4 nonzero elements.
Let us rewrite the problem (50) as a Linear Programming problem.
[TABLE]
Note that for the inequality in the third line we do not need any assumption.
Denote by the optimal solution of the optimization problem in the brackets. Then there exist multipliers such that
[TABLE]
where , and . Multiplying the first equation in (52) by , we get
[TABLE]
Note that the first equation in (52) can be written as
[TABLE]
Let us now reconstruct the solution of the primal problem. Denote
[TABLE]
Then, in view of (54) we have , and . Thus, the pair (55) is feasible for the primal problem. On the other hand,
[TABLE]
Thus, the duality gap in the chain (51) is zero, and the pair , defined by (55) is the optimal solution of the primal problem.
The above discussion allows us to concentrate on the following (dual) Linear Programming problem:
[TABLE]
which we can solve by the primal-dual Algorithm 3.
Assume that we have a local truss: each node is connected with only a few neighbors. This allows us to exploit the sparsity of the vectors (). In this case the computational cost of each iteration grows as nesterov2014subgradient ; nesterov2014primal-dual .
In nesterov2014subgradient a special class of huge-scale problems with sparse subgradients was considered. According to nesterov2014subgradient , for smooth functions this is a very rare feature. For example, for a quadratic function the gradient is usually dense even for a sparse matrix .
However, the subgradients of the non-smooth function (see (56) above) are sparse provided that all vectors share this property. This fact is based on the following observation. For the function with a sparse matrix the vector is a subgradient at the point . Then the standard subgradient step
[TABLE]
changes only a few entries of the vector , and the vector also differs from in only a few positions. Thus, the function value can be easily updated provided that we have an efficient procedure for recomputing the maximum of values.
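The observation about sparse updates can be sketched in code. The following is an illustrative sketch, not the implementation of nesterov2014subgradient : it assumes the objective has the form `f(x) = max_i <a_i, x>` with sparse rows stored as dictionaries, keeps the products `y = A x` in sync after each step, and uses a binary tree so that the maximum is recomputed in logarithmic time per changed entry. The names `MaxTree`, `build_state` and `sparse_max_step` are hypothetical.

```python
import math


class MaxTree:
    """Binary tree over m values: O(log m) update, maximum stored at the root."""

    def __init__(self, values):
        self.size = 1
        while self.size < len(values):
            self.size *= 2
        self.node = [(-math.inf, -1)] * (2 * self.size)
        for i, v in enumerate(values):
            self.node[self.size + i] = (v, i)
        for p in range(self.size - 1, 0, -1):
            self.node[p] = max(self.node[2 * p], self.node[2 * p + 1])

    def update(self, i, value):
        p = self.size + i
        self.node[p] = (value, i)
        p //= 2
        while p >= 1:
            self.node[p] = max(self.node[2 * p], self.node[2 * p + 1])
            p //= 2

    def top(self):
        return self.node[1]  # (maximum value, index of a maximizing row)


def build_state(rows, x):
    """Column index, products y = A x, and a max-tree over y."""
    cols = {}
    for i, row in enumerate(rows):
        for j, a in row.items():
            cols.setdefault(j, []).append((i, a))
    y = [sum(a * x[j] for j, a in row.items()) for row in rows]
    return cols, y, MaxTree(y)


def sparse_max_step(rows, cols, x, y, tree, h):
    """One step x <- x - h * a_i for an active row i, updating y = A x in place."""
    _, i_star = tree.top()
    touched = set()
    for j, a in rows[i_star].items():      # only the support of the active row
        delta = -h * a
        x[j] += delta
        for i, a_ij in cols[j]:            # only rows that contain column j
            y[i] += a_ij * delta
            touched.add(i)
    for i in touched:
        tree.update(i, y[i])
    return i_star
```

The cost of one step depends on the number of nonzeros in the active row and in the touched columns, not on the full dimension; a dense recomputation of `A x` is needed only once, in `build_state`.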
Note that the objective functional in (56) is linear, and the per-iteration costs of Algorithm 3 and of the simple switching subgradient scheme considered in nesterov2014subgradient are comparable. At the same time, the step productivity condition is simpler for Algorithm 3 than for the switching subgradient scheme of nesterov2014subgradient . Therefore the main observations of nesterov2014subgradient remain valid for Algorithm 3.
4.2 General Convex and Quasi-Convex Objective Functions
In this subsection, we assume that the objective function in (36) might not satisfy (43) and, hence, its subgradient could be unbounded. One of the examples is a quadratic function. We also assume that inequality (44) holds.
We further consider ideas from nesterov2010introduction ; nesterov2015subgradient and adapt them to problem (36) in such a way that our algorithm allows a non-Euclidean proximal setup, as Mirror Descent does, and does not require knowledge of the constant . Following nesterov2010introduction , given a function for each subgradient at a point , we define
[TABLE]
The following result gives a complexity estimate for Algorithm 4 in terms of . Below we use this theorem to establish a complexity result for a smooth objective .
Theorem 4.3
Assume that inequality (40) holds and a known constant is such that . Then, Algorithm 4 stops after not more than
[TABLE]
iterations and it holds that and .
To obtain the complexity of our algorithm in terms of the values of the objective function , we define the non-decreasing function
[TABLE]
and use the following lemma from nesterov2010introduction .
Lemma 3
Assume that is a convex function. Then, for any ,
[TABLE]
Corollary 1
Assume that the objective function in (36) is given as , where , are differentiable with Lipschitz-continuous gradient
[TABLE]
Then is -solution to (36) in the sense of (41), where
[TABLE]
Remark 1
According to nesterov1989 ; nesterov2018lectures , Lemma 3 also holds for quasi-convex objective functions Polyak1969 :
[TABLE]
This means that results of this subsection are valid for quasi-convex objectives.
Remark 2
Since the functional constraints are Lipschitz and, generally speaking, non-smooth, the obtained estimate for the number of iterations means that the proposed method is optimal in terms of oracle evaluations: iterations are sufficient to achieve the required accuracy for the class of objective functionals considered in this section. Note also that Algorithm 3 applies to the considered classes of constrained problems with convex objective functionals of different levels of smoothness. However, since the objective functional does not, generally speaking, satisfy the Lipschitz condition, the optimality of Algorithm 3 cannot be established in the general situation (for example, for an objective with a Lipschitz-continuous gradient). More precisely, it may happen that at productive steps the norms of the (sub)gradients of the objective functional are large, which delays reaching the stopping criterion of Algorithm 3.
5 Universal Methods
In this section we consider problem
[TABLE]
where is a convex set and is a convex function with Hölder-continuous subgradient
[TABLE]
with . The case corresponds to non-smooth optimization and the case corresponds to smooth optimization. The goal of this section is to present the Universal Accelerated Gradient method first proposed by Nesterov nesterov2015universal . This method is a black-box method which does not require the knowledge of constants and works in accordance with the lower complexity bound obtained in nemirovsky1983problem .
The main idea is based on the observation that a non-smooth convex function can be bounded from above by a quadratic function shifted slightly upward. More precisely, for any ,
[TABLE]
where
[TABLE]
The next idea is to apply an accelerated gradient method with a backtracking procedure to adapt to the unknown with appropriately chosen . The method we present is based on the accelerated gradient method from dvurechensky2017adaptive ; dvurechensky2018computational and thus differs from the original method of nesterov2015universal .
Inequality (64) guarantees that the backtracking procedure in the inner cycle is finite.
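The backtracking idea can be illustrated on a simpler, non-accelerated gradient step. The sketch below is not the accelerated method presented in this section: it assumes an unconstrained Euclidean setting, accepts a trial constant `L` as soon as the quadratic upper model shifted by `eps / 2` holds at the new point, and halves `L` after each accepted step so that the estimate can also decrease. The function name, the default values and the lower floor on `L` are hypothetical.

```python
def universal_gradient(f, subgrad, x0, eps, L0=1.0, iters=1000):
    """Non-accelerated universal gradient sketch with backtracking on L.

    f(x)       -> objective value
    subgrad(x) -> a (sub)gradient of f at x (list of floats)
    eps        -> target accuracy; eps / 2 is the allowed shift of the model
    Returns the best point found and its objective value.
    """
    x = list(x0)
    L = L0
    best_x, best_f = list(x), f(x)
    for _ in range(iters):
        fx, g = f(x), subgrad(x)
        if all(gi == 0.0 for gi in g):
            return x, fx  # zero subgradient: x is a minimizer
        while True:
            y = [xi - gi / L for xi, gi in zip(x, g)]
            step = [yi - xi for yi, xi in zip(y, x)]
            model = (fx
                     + sum(gi * si for gi, si in zip(g, step))
                     + 0.5 * L * sum(si * si for si in step)
                     + 0.5 * eps)
            fy = f(y)
            if fy <= model:   # shifted quadratic upper bound holds: accept L
                break
            L *= 2.0          # otherwise increase the trial constant
        x = y
        if fy < best_f:
            best_x, best_f = list(y), fy
        L = max(L / 2.0, 1e-12)  # let the estimate decrease again
    return best_x, best_f
```

No smoothness constant is passed in: for a non-smooth objective the accepted values of `L` grow roughly like the inverse of `eps`, while for a smooth one they settle near the Lipschitz constant of the gradient. The acceleration used in the chapter is omitted here to keep the backtracking logic visible.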
Theorem 5.1 (nesterov2015universal )
Let satisfy (63). Then,
[TABLE]
Moreover, the number of oracle calls is bounded by
[TABLE]
Translating this rate of convergence into the language of complexity, we see that to obtain a solution with accuracy the number of iterations is no more than
[TABLE]
i.e. is optimal.
In his paper, Nesterov considers a more general composite optimization problem
[TABLE]
where is a simple convex function, and obtains the same complexity guarantees. Universal methods were extended to strongly convex problems by a restart technique in roulet2017sharpness , to non-convex optimization in ghadimi2015generalized , and to non-convex optimization with inexact oracle in dvurechensky2017gradient . As we can see from (64), the universal accelerated gradient method is connected to smooth problems with inexact oracle. The study of accelerated gradient methods with inexact oracle was initiated in aspremont2008smooth and was substantially developed in devolder2014first ; dvurechensky2016stochastic ; bogolubsky2016learning ; dvurechensky2017gradient , including stochastic optimization problems and strongly convex problems. A universal method with inexact oracle can be found in dvurechensky2017universal . Experiments in nesterov2015universal show that the universal method accelerates to rate for non-smooth problems with a special "smoothing friendly" structure (see Section 3). This is especially interesting for traffic flow modeling problems, which possess such a structure baimurzina2017universal .
Now we consider a universal analog of A.S. Nemirovski's proximal mirror method for variational inequalities with a Hölder-continuous operator. More precisely, we consider a universal extension of Algorithm 2 which makes it possible to solve smooth and non-smooth variational inequalities without prior knowledge of the smoothness. The main idea of this method is the adaptive choice of the constants and the level of smoothness in the minimized prox-mappings at each iteration. These constants are related to the Hölder constant of the operator, and the method finds a suitable constant at each iteration.
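The adaptive choice of the constant can be sketched for the Euclidean case. The code below is an illustration in the spirit of the universal method, not the chapter's Algorithm 6: it assumes Euclidean prox-mappings (projections), an acceptance test that compares the inner product of the operator difference with the squared distances plus a slack of `eps / 2`, and an output averaged with weights inversely proportional to the accepted constants. The function name, its arguments and the lower floor on `M` are hypothetical.

```python
def universal_extragradient(op, project, z0, eps, iters, M0=1.0):
    """Euclidean extragradient sketch with an adaptively chosen constant M.

    op(z)      -> value of the monotone operator at z (list of floats)
    project(z) -> Euclidean projection of z onto the feasible set
    Returns the average of the extrapolation points with weights 1 / M_k.
    """
    def dist_sq(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    z = project(list(z0))
    M = M0
    avg = [0.0] * len(z)
    weight = 0.0
    for _ in range(iters):
        gz = op(z)
        M = max(M / 2.0, 1e-12)  # optimistic start: try a smaller constant
        while True:
            w = project([zi - gi / M for zi, gi in zip(z, gz)])
            gw = op(w)
            z_next = project([zi - gi / M for zi, gi in zip(z, gw)])
            lhs = sum((a - b) * (wi - zn)
                      for a, b, wi, zn in zip(gw, gz, w, z_next))
            rhs = 0.5 * M * (dist_sq(w, z) + dist_sq(z_next, w)) + 0.5 * eps
            if lhs <= rhs:       # acceptance test holds for this M
                break
            M *= 2.0
        weight += 1.0 / M
        avg = [a + wi / M for a, wi in zip(avg, w)]
        z = z_next
    return [a / weight for a in avg]
```

For the bilinear saddle point problem of minimizing over `x` and maximizing over `y` the product `x * y` on the box `[-1, 1]^2`, the operator is `op = lambda z: [z[1], -z[0]]`, and the averaged output approaches the saddle point at the origin.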
Theorem 5.2 (dvurechensky2018generalized )
For any and any ,
[TABLE]
Note that if , we can construct the following adaptive stopping criterion for our algorithm
[TABLE]
Next, we consider the case of Hölder-continuous operator and show that Algorithm 6 is universal. Assume for some and
[TABLE]
holds. The following inequality is a generalization of (64) for VI. For any and ,
[TABLE]
[TABLE]
where
[TABLE]
So, we have
[TABLE]
Let us estimate the number of iterations needed to achieve a given quality of the variational inequality solution.
Corollary 2 (Universal Method for VI)
Assume that the operator is Hölder continuous with constant for some and . Also assume that the set is bounded. Then, for all , we have
[TABLE]
As it follows from (77), if , (74) holds. Thus, for all , we have and
[TABLE]
(78) holds. Here is defined in (76). ∎
Let us add some remarks.
Remark 3
Since the algorithm does not use the values of parameters and , we obtain the following iteration complexity bound
[TABLE]
to achieve
[TABLE]
Using the same reasoning as in nesterov2015universal , we estimate the number of oracle calls for Algorithm 6. The number of oracle calls on each iteration is equal to . At the same time, and, hence, . Thus, the total number of oracle calls is
[TABLE]
where we used that .
Thus, the number of oracle calls of Algorithm 6 does not exceed:
[TABLE]
Remark 4
We can apply this method to convex-concave saddle point problems of the form
[TABLE]
where are convex compact sets in , is convex in and concave in , and there exist and constants such that
[TABLE]
[TABLE]
for all .
It is possible to achieve an acceptable approximation :
[TABLE]
for the saddle point of problem (80) in no more than
[TABLE]
iterations, which indicates the optimality of the proposed method, at least for and . In practice, however, experiments show that (81) can be achieved much faster due to the adaptivity of the method.
6 Concluding remarks
Modern numerical methods for non-smooth convex optimization problems typically exploit the structure of the problem. We start with one of the most powerful examples of this type. For the geometric median search problem there exists an efficient method that significantly outperforms the lower complexity bounds described above cohen2016geometric . In Machine Learning we typically meet problems with hidden affine structure and small effective dimension (SVM), which allow us to use different smoothing techniques allen2016optimal . A description of one of these techniques (Nesterov's smoothing technique) can be found in this survey. Another popular technique is based on averaging the function over a small ball centered at the point under consideration duchi2012randomized . A huge number of data science applications lead to composite optimization problems with a non-smooth composite term (LASSO). For this class of problems accelerated (fast) gradient methods are typically applied beck2009fast , nesterov2013gradient , lan2016gradient . This approach (composite optimization) has recently been extended to a more general class of problems tyurin2017fast . In various Image Processing applications one can find many non-smooth problem formulations with saddle-point structure, that is, the objective function has a Legendre representation. In this case one can apply special versions of accelerated (primal-dual) methods chambolle2011first-order , chen2014optimal , lan2016accelerated . The Universal Mirror Prox method described above demonstrates an alternative approach, which can be applied in a rather general context. Unfortunately, most of these techniques are beyond the scope of this survey. However, we include in the survey a description of the Universal Accelerated Gradient Descent algorithm tyurin2017fast , which in the general case can also be applied to a wide variety of problems.
Another important direction in Non-smooth Convex Optimization is huge-scale optimization for sparse problems nesterov2014subgradient . The basic idea, which reduces huge dimension to non-smoothness, is as follows:
[TABLE]
is equivalent to the single non-smooth constraint:
[TABLE]
We demonstrated this idea above on the Truss Topology Design example.
One should note that this survey concentrates only on deterministic convex optimization problems, but one of the most beautiful things in non-smooth optimization is that stochasticity nemirovsky1983problem , duchi2016introductory , juditsky2011firstI , juditsky2011firstII and the online context hazan2016introduction in general do not change the complexity estimates (up to a logarithmic factor in the strongly convex case). As an example of the stochastic (randomized) approach one can mention the work anikin2015efficient , which reformulates the Google problem as a non-smooth convex optimization problem. A special randomized Mirror Descent algorithm makes it possible to solve this problem almost independently of the number of vertices.
Finally, let us note that in decentralized distributed non-smooth (stochastic) convex optimization, optimal methods have appeared over the last few years lan2017communication , uribe2017optimal , bubeckoptimal .
Acknowledgements.
The article was supported in its major parts by the grant 18-29-03071 mk from Russian Foundation for Basic Research. E. Nurminski acknowledges the partial support from the project 1.7658.2017/6.7 of Ministry of Science and Higher Professional Education in Section 1. The work of A. Gasnikov, P. Dvurechensky and F. Stonyakin in Section 3 was partially supported by Russian Foundation for Basic Research grant 18-31-20005 mol_a_ved. The work of F. Stonyakin in Subsection 4.1 was supported by Russian Science Foundation grant 18-71-00048.
- 1. Allen-Zhu, Z., Hazan, E.: Optimal black-box reductions between optimization objectives. In: Advances in Neural Information Processing Systems, pp. 1614–1622 (2016)
- 2. Anikin, A., Gasnikov, A., Gornov, A., Kamzolov, D., Maximov, Y., Nesterov, Y.: Efficient numerical methods to solve sparse linear equations with application to pagerank. arXiv preprint arXiv:1508.07607 (2015)
- 3. Baimurzina, D., Gasnikov, A., Gasnikova, E., Dvurechensky, P., Ershov, E., Kubentaeva, M., Lagunovskaya, A.: Universal similar triangulars method for searching equilibriums in traffic flow distribution models. arXiv:1701.02473 (2017)
- 4. Bayandina, A., Dvurechensky, P., Gasnikov, A., Stonyakin, F., Titov, A.: Mirror descent and convex optimization problems with non-smooth inequality constraints. In: P. Giselsson, A. Rantzer (eds.) Large-Scale and Distributed Optimization, chap. 8, pp. 181–215. Springer International Publishing (2018). DOI 10.1007/978-3-319-97478-1_8. arXiv:1710.06612
- 5. Beck, A., Ben-Tal, A., Guttmann-Beck, N., Tetruashvili, L.: The comirror algorithm for solving nonsmooth constrained convex problems. Operations Research Letters 38(6), 493–498 (2010). DOI 10.1016/j.orl.2010.08.005. URL http://www.sciencedirect.com/science/article/pii/S0167637710001094
- 6. Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3), 167–175 (2003). DOI 10.1016/S0167-6377(02)00231-6. URL http://dx.doi.org/10.1016/S0167-6377(02)00231-6
- 7. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202 (2009). DOI 10.1137/080716542. URL https://doi.org/10.1137/080716542
- 8. Ben-Tal, A., Nemirovski, A.: Robust truss topology design via semidefinite programming. SIAM J. Optim. 7(4), 991–1016 (1997)
