Random minibatch subgradient algorithms for convex problems with functional constraints
Angelia Nedić, Ion Necoara

TL;DR
This paper introduces randomized minibatch subgradient algorithms for non-smooth convex optimization problems with complex functional constraints, providing convergence analysis and demonstrating minibatch benefits.
Contribution
It proposes novel subgradient algorithms with random minibatch feasibility updates for convex problems with level set constraints, analyzing their convergence behavior.
Findings
Convergence rates are sublinear and optimal for the class of problems.
Rates explicitly depend on minibatch size, showing when minibatching improves performance.
Algorithms handle constraints given as convex level sets, not just simple sets.
A. Nedić, School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, USA. Email: [email protected].
I. Necoara, Department of Automatic Control and Systems Engineering, University Politehnica Bucharest, 060042 Bucharest, Romania. Email: [email protected] (corresponding author).
(Received: 28 February 2019 / Accepted: date)
Abstract
In this paper we consider non-smooth convex optimization problems with (possibly) infinite intersection of constraints. In contrast to the classical approach, where the constraints are usually represented as intersection of simple sets, which are easy to project onto, in this paper we consider that each constraint set is given as the level set of a convex but not necessarily differentiable function. For these settings we propose subgradient iterative algorithms with random minibatch feasibility updates. At each iteration, our algorithms take a subgradient step aimed at only minimizing the objective function and then a subsequent subgradient step minimizing the feasibility violation of the observed minibatch of constraints. The feasibility updates are performed based on either parallel or sequential random observations of several constraint components. We analyze the convergence behavior of the proposed algorithms for the case when the objective function is strongly convex and with bounded subgradients, while the functional constraints are endowed with a bounded first-order black-box oracle. For a diminishing stepsize, we prove sublinear convergence rates for the expected distances of the weighted averages of the iterates from the constraint set, as well as for the expected suboptimality of the function values along the weighted averages. Our convergence rates are known to be optimal for subgradient methods on this class of problems. Moreover, the rates depend explicitly on the minibatch size and show when minibatching helps a subgradient scheme with random feasibility updates.
Keywords:
Convex minimization, functional constraints, subgradient algorithms, random minibatch projection algorithms, convergence rates.
1 Introduction
The large number of terms in the objective function and/or the large number of constraints in most practical optimization applications have made stochastic optimization an essential tool in many areas of applied mathematics, such as machine learning and statistics MouBac:11 ; Vap:98 , constrained control PatNec:17 , sensor networks BlaHer:06 , computer science KunBac:18 , inverse problems BotHei:12 , operations research and finance RocUry:00 . For example, in machine learning applications the optimization algorithms involve numerical computation of parameters for a system designed to make decisions based on yet unseen data MouBac:11 ; Vap:98 . In particular, in support vector machines one maps the data into a higher dimensional input space and constructs an optimal separating hyperplane in this space by learning, possibly online, the hyperplanes corresponding to each data point in the training set Vap:98 . This leads to a convex optimization problem with a large number of functional constraints.
Contributions. To deal with such optimization problems having (possibly) infinite number of functional constraints, we propose subgradient methods with random feasibility updates. At each iteration, the algorithms take a subgradient step aimed at only minimizing the objective function, followed by a feasibility step for minimizing the feasibility violation of the observed minibatch of convex constraints achieved through the Polyak’s subgradient iteration Pol:67 ; Pol:01 . The feasibility updates in the first algorithm are performed using parallel random observations of several constraint components, while in the second algorithm we consider sequential random observations of constraints. Both algorithms are reminiscent of a learning process where we try to learn the constraint set while simultaneously minimizing an objective function. The proposed algorithms are applicable to the situation where the whole constraint set of the problem is not known in advance, but it is rather learned in time through observations. Also, these algorithms are of interest for (non-smooth) constrained optimization problems where the constraints are known but their number is either large or not finite.
We study the convergence properties of the proposed random minibatch subgradient algorithms for the case when the objective function need not be differentiable but is strongly convex, while the functional constraints are accessed through a bounded first-order black-box oracle. In doing so, we avoid the need for projections onto the constraint set, which may be computationally expensive. For a diminishing stepsize, we prove sublinear convergence rates of order $\mathcal{O}(1/k)$, where $k$ is the iteration counter, for the expected distances of the weighted averages of the iterates from the constraint set, as well as for the expected suboptimality of the function values along the weighted averages. Our convergence rates are known to be optimal for this class of subgradient schemes for solving non-smooth convex problems with functional constraints. Moreover, our rates depend explicitly on the minibatch size and show when minibatching works for a subgradient method with random feasibility updates. To the best of our knowledge, this is the first work proving that subgradient methods with random minibatch feasibility steps are better than their non-minibatch variants. More explicitly, the convergence estimate for the parallel algorithm depends on a key parameter $L$ (see eq. (15) below), which determines whether minibatching helps ($L < 1$) or not ($L = 1$) and how much (the smaller $L$, the better the complexity), see Theorem 2. For the sequential variant, we show that minibatching always helps and the complexity depends exponentially on the minibatch size (see Theorem 3).
Related works. In spite of its wide applicability, the study of efficient solution methods for optimization problems with many constraints is still limited. The most prominent work is the stochastic gradient descent (SGD) MouBac:11 ; NemJud:09 ; PolJud:92 . Even though SGD is a well-developed methodology, it only applies to optimization problems with simple constraints, requiring the whole feasible set to be projectable. A line of work known as alternating projections focuses on applying random projections for solving problems that involve the intersection of a (possibly infinite) number of sets. The case when the objective function is not present in the formulation, which corresponds to the convex feasibility problem, is studied e.g. in BauBor:96 ; KeyZho:16 ; Ned:10 ; NecRic:18 . For this particular setting, Ned:10 ; NecRic:18 combine the smoothing technique with (minibatch) SGD, leading to stochastic alternating projection algorithms having linear convergence rates. In PatNec:17 stochastic proximal point type steps are combined with alternating projections for solving stochastic optimization problems with infinite intersection of sets. Stochastic forward-backward algorithms have also been applied to solve optimization problems with many constraints. However, the papers introducing those general algorithms focus on proving only asymptotic convergence results and do not derive convergence rates, or they assume the number of constraints is finite, which is more restrictive than our setting BiaHac:17 ; SheTeb:14 ; WanChe:15 . In the case where the number of constraints is finite and the objective function is deterministic, Nesterov’s smoothing framework is studied in BotHen:13 ; OuyGra:12 ; TraFer:18 in the setting of accelerated proximal gradient methods.
Incremental subgradient methods or primal-dual approaches were also proposed for solving convex optimization problems with finite intersection of simple sets through an exact penalty reformulation in Ber:11 ; KunBac:18 .
The paper most related to our work is Ned:11 , see also Pol:67 ; Pol:69 ; Pol:01 , where iterative subgradient methods with random feasibility steps are proposed for solving convex problems with functional constraints. Our algorithms are minibatch extensions of the algorithm proposed in Ned:11 . Moreover, in Ned:11 only sublinear convergence rates of order $\mathcal{O}(1/\sqrt{k})$ are established for convex objective functions, while in this paper we show that $\mathcal{O}(1/k)$ rates are valid under a relaxed strong convexity condition. Finally, since we deal with minibatching and a relaxed strong convexity assumption, our convergence analysis requires additional insights that differ from that of Ned:11 . Similarly, in PatNec:17 a stochastic optimization problem with infinite intersection of sets is considered and stochastic proximal point steps are combined with alternating projections for solving it. However, in order to prove sublinear convergence rates $\mathcal{O}(1/k)$, PatNec:17 requires strongly convex and smooth objective functions, while our results are valid for a more relaxed strong convexity condition and possibly non-smooth functions. Lastly, PatNec:17 assumes the projectability of individual sets, whereas in our case, the constraints might not be projectable.
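Notation. The inner product of two vectors $x$ and $y$ in $\mathbb{R}^n$ is denoted by $\langle x, y \rangle$, while $\|x\|$ denotes the standard Euclidean norm. We write $\mathrm{dist}(x, X)$ for the distance of a vector $x$ from a closed convex set $X$, i.e., $\mathrm{dist}(x, X) = \min_{y \in X} \|y - x\|$, while $\Pi_X[x]$ denotes the projection of $x$ onto $X$, i.e., $\Pi_X[x] = \arg\min_{y \in X} \|y - x\|^2$. For a scalar $a$, we write $a^+ = \max\{a, 0\}$. For a convex function $g$, we denote by $s_g(x)$ a subgradient of $g$ at $x$ and by $\partial g(x)$ the set of all subgradients of $g$ at $x$. If $g$ is differentiable at $x$, then its gradient is denoted $\nabla g(x)$. We write $\mathsf{P}[\omega]$ and $\mathsf{E}[\omega]$ to denote respectively the probability distribution and the expectation of a random variable $\omega$. Finally, the big $\mathcal{O}$ notation, i.e. $a_k = \mathcal{O}(b_k)$, means that there exist $C > 0$ and $k_0$ such that $a_k \le C\, b_k$ for all $k \ge k_0$.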
Outline. The content of the paper is as follows. In Section 1.1 we introduce our problem of interest and the main assumptions. In Section 2 we propose a parallel random minibatch subgradient algorithm and derive its convergence rate, while in Section 3 the sequential variant is analyzed. Finally, in Section 4 we discuss some extensions, while in Section 5 we report some preliminary numerical results.
1.1 Problem formulation
In this paper we are interested in solving the following convex constrained minimization problem:
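$$\min_{x \in Y} \; f(x) \quad \text{subject to} \quad g_\omega(x) \le 0 \;\; \forall \omega \in \Omega, \tag{1}$$
where $\Omega$ is an arbitrary collection of indices and $Y \subseteq \mathbb{R}^n$ is a closed convex set. The objective function $f$ and all constraint functions $g_\omega$ are assumed convex. We also assume that the optimization problem (1) has finite optimum and we let $f^*$ and $X^*$ denote the optimal value and the optimal set, respectively,
$$f^* = \min_{x \in X} f(x), \qquad X^* = \{x \in X : f(x) = f^*\}, \qquad \text{where } X = \{x \in Y : g_\omega(x) \le 0 \;\; \forall \omega \in \Omega\}.$$
We work under the premise that the collection $\Omega$ is large, possibly infinite (even uncountable). Such problems have many applications in engineering, machine learning, computer science, operations research and finance MouBac:11 ; Vap:98 ; PatNec:17 ; BotHei:12 ; RocUry:00 . Let us now formally state the assumptions on the functions $f$ and $g_\omega$, with $\omega \in \Omega$, of problem (1).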
Assumption 1
Let the following hold:
- (a)
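The set $Y$ is closed, convex and simple (i.e., easy for projection). The constraint set $X$ and the optimal set $X^*$ are non-empty.
- (b)
The objective function $f$ is strongly convex on the set $Y$ with a constant $\mu > 0$, i.e.:
$$f(y) \ge f(x) + \langle s_f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2 \quad \forall x, y \in Y, \; s_f(x) \in \partial f(x).$$
The subgradients of the function $f$ are uniformly bounded on the set $Y$, i.e., there is $M_f > 0$ such that
$$\|s_f(x)\| \le M_f \quad \forall s_f(x) \in \partial f(x), \; x \in Y.$$
- (c)
The functional constraints $g_\omega$ are convex, not necessarily differentiable, and have bounded subgradients on the set $Y$, i.e., there is $M_g > 0$ such that
$$\|s_g(x)\| \le M_g \quad \forall s_g(x) \in \partial g_\omega(x), \; x \in Y, \; \omega \in \Omega.$$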
We assume that the domains of definition of the functions $f$ and $g_\omega$ contain $Y$. It follows immediately from Assumption 1(b) that (see e.g., NecNes:15 ):
[TABLE]
Note that the conditions of Assumption 1(b) may look contradictory since the following relations need to hold:
[TABLE]
where the second inequality follows from the convexity of $f$ and the third one from the Cauchy-Schwarz inequality. This implies that for any . Note that this inequality is always valid provided that the set $Y$ is compact and our optimization model (1) allows us to impose such an assumption on the set $Y$. Moreover, when the sets $X_\omega = \{x : g_\omega(x) \le 0\}$ are simple for the projection operation, then one may choose an alternative equivalent description of the constraint sets by letting $g_\omega(x) = \mathrm{dist}(x, X_\omega)$ for all $\omega \in \Omega$. Note that in this case $g_\omega = g_\omega^+$ for all $\omega$. Furthermore, at any $x \notin X_\omega$ the gradient of this distance function is $(x - \Pi_{X_\omega}[x]) / \mathrm{dist}(x, X_\omega)$, thus the subgradients are bounded with $M_g = 1$ in this case. Therefore, our approach is more general than those from most of the existing works, which usually assume projectability of each $X_\omega$ (see also the Related works paragraph from Section 1).
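The remark on projectable sets can be illustrated with a small Python sketch. It assumes, purely for illustration, that one constraint set is the Euclidean unit ball, describes it through its distance function, and checks two things: the subgradient has unit norm, and a Polyak-type step with unit stepsize lands exactly on the projection. The helper names `project_ball` and `dist_and_subgrad` are hypothetical and not taken from the paper.

```python
import math

def norm(x):
    return math.sqrt(sum(t * t for t in x))

def project_ball(x, radius=1.0):
    """Euclidean projection onto the ball {y : ||y|| <= radius}."""
    nrm = norm(x)
    if nrm <= radius:
        return list(x)
    return [radius * t / nrm for t in x]

def dist_and_subgrad(x, radius=1.0):
    """Value and a subgradient of g(x) = dist(x, ball)."""
    px = project_ball(x, radius)
    diff = [a - b for a, b in zip(x, px)]
    dist = norm(diff)
    if dist == 0.0:
        return 0.0, None  # x is inside the ball: g(x) = 0, no step needed
    return dist, [t / dist for t in diff]

x = [3.0, 4.0]
dist, d = dist_and_subgrad(x)
# Polyak-type step with stepsize 1: x - g(x) / ||d||^2 * d
z = [xi - dist / norm(d) ** 2 * di for xi, di in zip(x, d)]
p = project_ball(x)
```

In this example the step `z` and the projection `p` coincide up to rounding, which is the sense in which the functional description with distance functions recovers the usual projection step.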
2 Parallel random minibatch subgradient algorithm
To solve the convex problem with functional constraints (1), we first propose a subgradient method with parallel random minibatch feasibility updates. More precisely, our first algorithm is a parallel minibatch extension of the algorithm proposed in Ned:11 , leading to the following iterative process:
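**Algorithm (parallel case)**
Choose $x_0 \in Y$, minibatch size $N \ge 1$, and stepsizes $\alpha_k > 0$ and $\beta > 0$. For $k \ge 1$ repeat:
Draw samples $\omega_k^1, \ldots, \omega_k^N \sim \mathsf{P}$.
Compute the following updates:
$$v_k = \Pi_Y\big[x_{k-1} - \alpha_k s_f(x_{k-1})\big] \tag{3a}$$
$$z_k^i = v_k - \beta \, \frac{g^+_{\omega_k^i}(v_k)}{\|d_k^i\|^2} \, d_k^i, \quad i = 1, \ldots, N \tag{3b}$$
$$x_k = \Pi_Y\Big[\frac{1}{N} \sum_{i=1}^N z_k^i\Big] \tag{3c}$$
Here, $\alpha_k > 0$ and $\beta > 0$ are deterministic stepsizes and recall that $s_f(x)$ denotes a subgradient of $f$ at $x$ and $a^+ = \max\{a, 0\}$. The method takes one subgradient step for the objective function, followed by $N$ feasibility updates in parallel, which are then averaged and projected onto the set $Y$. In a parallel implementation, we assume $N$ available cores collocated on the same machine, of which one is designated as a central core; the central core sends $v_k$ to all other cores, which perform the update (3b) and send their updates $z_k^i$ to the central core; finally the central core performs the average step (3c) and the optimality step (3a). We note that at each feasibility update step, $N$ random constraints are selected from the collection of the constraint sets according to the probability distribution $\mathsf{P}$, i.e., the index variable $\omega_k^i$ is random with values in the set $\Omega$. The vector $d_k^i$ is chosen as $d_k^i \in \partial g^+_{\omega_k^i}(v_k)$ if $g^+_{\omega_k^i}(v_k) > 0$ and $d_k^i = d$ for some $d \neq 0$ if $g^+_{\omega_k^i}(v_k) = 0$. When $g^+_{\omega_k^i}(v_k) = 0$, we have $z_k^i = v_k$ for any choice of $d \neq 0$. Note that the feasibility step (3b) has the special form of Polyak’s subgradient iteration, see e.g., Pol:67 ; Pol:01 . Moreover, when the sets $X_\omega$ are projectable, then one chooses $g_\omega(x) = \mathrm{dist}(x, X_\omega)$ for all $\omega \in \Omega$ and the update (3b) becomes a usual projection step:
$$z_k^i = v_k - \beta \big(v_k - \Pi_{X_{\omega_k^i}}[v_k]\big).$$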
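The following Python sketch illustrates the parallel scheme on a toy problem. It reads the updates as a projected subgradient step for the objective, followed by Polyak-type feasibility steps on a random minibatch of constraints that are averaged and projected back onto the simple set. Several ingredients are assumptions made only for this illustration: the objective $0.5\|x - p\|^2$, the box $[-5, 5]^n$ as the simple set, three linear inequality constraints sampled uniformly, the stepsize $\alpha_k = 2/(\mu(k+1))$, and returning the last iterate instead of the weighted average analyzed in the paper.

```python
import random

def project_box(x, lo=-5.0, hi=5.0):
    """Projection onto the simple set Y = [lo, hi]^n."""
    return [min(max(xi, lo), hi) for xi in x]

def polyak_step(v, a, b, beta):
    """Feasibility step for g(x) = <a, x> - b, using the subgradient d = a."""
    viol = max(sum(ai * vi for ai, vi in zip(a, v)) - b, 0.0)  # g^+(v)
    if viol == 0.0:
        return list(v)  # constraint satisfied: the step leaves v unchanged
    scale = beta * viol / sum(ai * ai for ai in a)
    return [vi - scale * ai for vi, ai in zip(v, a)]

def parallel_minibatch(p, A, b, N=2, beta=1.0, mu=1.0, iters=5000, seed=0):
    rng = random.Random(seed)
    x = project_box([0.0] * len(p))
    for k in range(1, iters + 1):
        alpha = 2.0 / (mu * (k + 1))              # assumed diminishing stepsize
        grad = [xi - pi for xi, pi in zip(x, p)]  # gradient of 0.5*||x - p||^2
        v = project_box([xi - alpha * gi for xi, gi in zip(x, grad)])
        batch = [rng.randrange(len(A)) for _ in range(N)]  # uniform sampling
        zs = [polyak_step(v, A[j], b[j], beta) for j in batch]
        avg = [sum(z[t] for z in zs) / N for t in range(len(v))]
        x = project_box(avg)                      # average, then project onto Y
    return x

# Toy data: x1 <= 1, x2 <= 1, x1 + x2 <= 1.5; the solution is (0.75, 0.75).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 1.0, 1.5]
x_hat = parallel_minibatch([2.0, 2.0], A, b)
```

The feasibility steps inside one iteration are independent of each other, so the list comprehension building `zs` is the part that could be distributed over cores.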
The initial point $x_0 \in Y$ is selected randomly with an arbitrary distribution. The projection on the set $Y$ in the updates (3a) and (3c) is used to ensure that each $v_k$ and $x_k$ remain in the set $Y$, over which the functions $f$ and $g_\omega$ are assumed to have bounded subgradients. Our next assumption deals with the random variables $\omega_k^i$ for $i = 1, \ldots, N$ chosen according to the probability distribution $\mathsf{P}$. For this, we introduce the sigma-field induced by the history of the method, i.e., by the realizations of the initial point $x_0$ and the variables $\omega_t^i$ up to main iteration $k$:
[TABLE]
which contains the same information as the set . For notational convenience, we will allow by letting . We impose the following assumption.
Assumption 2
There exists a constant such that
[TABLE]
Assumption 2 does not require that are conditionally independent, given . For example, when the collection is finite, the indices can be selected randomly without replacement, i.e., given the realizations of , the index can be random with realizations in . As another example, the index set can be partitioned in disjoint sets , and each can be uniformly distributed over the index set . Such a sampling allows for a parallel computation of all in the algorithm (3). One can also combine the preceding two possibilities, by using a smaller partition of the set , and in each of the partitions choose the corresponding sequentially, without replacement. Assumption 2 is crucial in our convergence analysis of method (3). It summarizes all the information we need regarding the distributions of the random variables and the initial point . A discussion on the equivalence between the Assumption 2 and the linear regularity condition for the sets can be found in Ned:10 ; Ned:11 ; NecRic:18 . When each set is given by a linear inequality , one can verify that the intersection of these halfspaces over any arbitrary index set is linearly regular provided that the sequence is bounded, see BurFer:93 ; FerNec:19 . Hence, Assumption 2 is also satisfied in this case. Moreover, Assumption 2 holds provided that the interior of the intersection over the arbitrary index set has an interior point Pol:01 . However, Assumption 2 holds for more general sets, e.g., when a strengthened Slater condition holds for a collection of convex functional constraints , such as the generalized Robinson condition, as detailed in Corollary 2 of LewPan:98 .
2.1 Preliminary results
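In this section, we derive some preliminary results for later use in the convergence analysis of method (3). We start by recalling a basic property of the projection operation on a closed convex set $Y$ Ned:10 :
$$\|\Pi_Y[x] - y\|^2 \le \|x - y\|^2 - \|\Pi_Y[x] - x\|^2 \quad \forall x \in \mathbb{R}^n, \; y \in Y. \tag{4}$$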
We now show that the parameter in Assumption 2 satisfies the following inequality:
Lemma 1
Let Assumption 1(c) and Assumption 2 hold. Then, we have:
[TABLE]
Proof
Let be such that . Then, there exists such that the convex function satisfies . Consequently, for any we also have , and using convexity of , we obtain:
[TABLE]
or equivalently
[TABLE]
On the other hand for those for which we automatically have
[TABLE]
In conclusion, for any there holds:
[TABLE]
Combining the preceding inequality and Assumption 2, we obtain:
[TABLE]
which proves our relation .
We now derive a relation between the iterates and .
Lemma 2
Let Assumptions 1(a) and 1(b) hold. Let be obtained via equation (3a) for a given . Then, for the unique optimal solution of the problem (1) and , we have:
[TABLE]
Proof
Using the standard analysis of the projected subgradient method and the fact that the subgradients of are uniformly bounded on , we have for the optimal solution of (1) the following inequality, see e.g., Pol:67 ; Pol:69 :
[TABLE]
We provide a lower bound on . We consider two choices, namely, one is based on the strong convexity of and the other is based on considering another intermediate point. By the strong convexity of , we have
[TABLE]
where the second inequality follows from the optimality conditions for and the last inequality follows from the Cauchy-Schwarz inequality and the boundedness of the subgradients of on . The other choice consists of adding and subtracting , which yields
[TABLE]
where the last inequality follows by the convexity of and the Cauchy-Schwarz inequality. By Assumption 1(b), the subgradients of are uniformly bounded on and hence, also on , implying that
[TABLE]
We now let be arbitrary. By multiplying relation (2.1) with and relation (7) with , and by adding the resulting relations, we obtain
[TABLE]
By using the estimate (2.1) in relation (5), we obtain
[TABLE]
and after re-arranging some of the terms we get the relation of the lemma.
Remark 1
The best choice for the parameter is not apparent at this point. It is important to have it in order to have the function value involved in the expression, but it can be that will just do fine.
We next state a result that will be used to provide a basic relation between the iterates and . The relation is stated in a generic form, and its proof can be found in Pol:69 ; Pol:67 .
Lemma 3
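Let $g$ be a convex function over a closed convex set $Y$, and let $z$ be given by
$$z = \Pi_Y\Big[x - \beta \, \frac{g^+(x)}{\|d\|^2} \, d\Big],$$
where $x \in Y$, $\beta > 0$ and $d \in \partial g^+(x)$ with $d \neq 0$. Then, for any $y \in Y$ such that $g^+(y) = 0$, we have
$$\|z - y\|^2 \le \|x - y\|^2 - \beta (2 - \beta) \, \frac{(g^+(x))^2}{\|d\|^2}.$$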
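The decrease guaranteed by a Polyak-type feasibility step can be checked numerically. The sketch below assumes the standard form of this result: for a convex $g$, a point $x$, a subgradient $d$ at $x$ and any $y$ with $g(y) \le 0$, the step $z = x - \beta \, g^+(x) d / \|d\|^2$ satisfies $\|z - y\|^2 \le \|x - y\|^2 - \beta(2 - \beta)(g^+(x))^2 / \|d\|^2$. The test function $g(x) = \|x\|_1 - 1$, the unconstrained setting (no projection onto a simple set) and the sampling ranges are choices made only for this illustration.

```python
import random

def g(x):
    """g(x) = ||x||_1 - 1, a convex and non-smooth function."""
    return sum(abs(t) for t in x) - 1.0

def subgrad(x):
    """A subgradient of g at x (componentwise sign)."""
    return [(t > 0) - (t < 0) for t in x]

def polyak_step(x, beta):
    """Return the step z and the quantity (g^+(x))^2 / ||d||^2."""
    viol = max(g(x), 0.0)
    if viol == 0.0:
        return list(x), 0.0
    d = subgrad(x)
    nd2 = float(sum(t * t for t in d))
    return [xi - beta * viol / nd2 * di for xi, di in zip(x, d)], viol ** 2 / nd2

def sqdist(u, w):
    return sum((a - b) ** 2 for a, b in zip(u, w))

rng = random.Random(1)
worst_gap = float("inf")
for _ in range(1000):
    beta = rng.uniform(0.05, 1.95)
    x = [rng.uniform(-3, 3) for _ in range(4)]
    y = [rng.uniform(-0.25, 0.25) for _ in range(4)]  # ||y||_1 <= 1, so g(y) <= 0
    z, decrease = polyak_step(x, beta)
    # gap = right-hand side minus left-hand side of the inequality
    gap = sqdist(x, y) - beta * (2 - beta) * decrease - sqdist(z, y)
    worst_gap = min(worst_gap, gap)
```

A non-negative `worst_gap` (up to rounding) over all trials is what the inequality predicts.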
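In the analysis, we will also make use of the relation for averages, stating that for given vectors $z^1, \ldots, z^N \in \mathbb{R}^n$ and their average $\bar{z} = \frac{1}{N} \sum_{i=1}^N z^i$, the following relation is valid for any vector $w \in \mathbb{R}^n$:
$$\|\bar{z} - w\|^2 = \frac{1}{N} \sum_{i=1}^N \|z^i - w\|^2 - \frac{1}{N} \sum_{i=1}^N \|z^i - \bar{z}\|^2. \tag{11}$$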
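The relation for averages used here is presumably the standard identity $\|\bar{z} - w\|^2 = \frac{1}{N}\sum_i \|z^i - w\|^2 - \frac{1}{N}\sum_i \|z^i - \bar{z}\|^2$, where $\bar{z}$ is the average of the vectors $z^i$. Under that assumption, a quick numerical check in Python on random data looks as follows (the dimensions and the random data are arbitrary):

```python
import random

def sqdist(u, w):
    return sum((a - b) ** 2 for a, b in zip(u, w))

rng = random.Random(0)
N, n = 5, 3
zs = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(N)]
w = [rng.gauss(0, 1) for _ in range(n)]
zbar = [sum(z[t] for z in zs) / N for t in range(n)]  # average of the z^i

lhs = sqdist(zbar, w)
rhs = sum(sqdist(z, w) for z in zs) / N - sum(sqdist(z, zbar) for z in zs) / N
```

The second sum on the right is the spread of the minibatch around its average, which is why averaging several feasibility steps can give a larger decrease than a single step.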
Now we provide a basic relation for the iterate upon completion of the randomly sampled feasibility updates.
Lemma 4
Let Assumption 1(a) hold. Let be obtained via updates (3b) and (3c) for a given and . Then, the following relation holds:
[TABLE]
where is the total variation of the minibatch subgradients, i.e.,
[TABLE]
Proof
By the projection property (4) and the definition of , we have for any that:
[TABLE]
By the definition we have . Thus, by using relation (11) for the collection , we have for any ,
[TABLE]
Letting in the preceding relation and combining the resulting relation with (12), we obtain
[TABLE]
Now, we use the definition of the iterates in algorithm (3) and Lemma 3, with . Thus, we obtain for any (for which we would have for any realization of ) and for any ,
[TABLE]
Hence, it follows that for any ,
[TABLE]
From the definition of the iterates in algorithm (3), we see that
[TABLE]
By defining
[TABLE]
we have
[TABLE]
Therefore, we obtain for any ,
[TABLE]
The statement of the lemma follows by letting in the preceding relation and using the fact that .
Let us define the following parameters:
[TABLE]
From Jensen’s inequality it follows that $L \le 1$. However, there are also convex functions such that $L < 1$. We postpone the derivation of such examples of functional constraints satisfying condition $L < 1$ until Section 2.3. The parameter $L$ will play a key role in our derivations below. In particular, we obtain the following simplification for Lemma 4.
Lemma 5
Let Assumptions 1(a) and 1(c) hold. Let as defined in (15) and be obtained via updates (3b) and (3c) for a given and extrapolated stepsize . Then, the following relation holds:
[TABLE]
Proof
Note that the total variation of the minibatch subgradients can be written equivalently as:
[TABLE]
Using the previous expression of and the definitions of and from (15) in Lemma 4, we get:
[TABLE]
By Assumption 1(c) each function has bounded subgradients uniformly on . Hence, we have , which used in the previous inequality implies the statement of the lemma.
Note that the previous result shows that we can use an extrapolated stepsize $\beta \in (0, 2/L)$ in minibatch settings instead of the typical $\beta \in (0, 2)$ used e.g. in Ned:11 . Clearly, when $L < 1$ we have $2/L > 2$ and consequently, such extrapolation will accelerate convergence of the parallel algorithm. This can be also observed in simulations (see e.g. Fig. 3 below). Moreover, the largest decrease in Lemma 5 is obtained by maximizing $\beta (2 - \beta L)$, that is, the optimal stepsize is $\beta = 1/L$. We now combine Lemma 2 and Lemma 5 to provide a basic relation for the subsequent analysis.
Lemma 6
Consider the method in (3), and let Assumption 1 hold. Let the stepsize be such that for all and stepsize , with defined in (15). Then, the iterates of the method (3) satisfy the following recurrence for the optimal solution and for all :
[TABLE]
where is arbitrary.
Proof
Let be the unique optimal solution of problem (1). Then, we use Lemma 2 for so that for all , we have
[TABLE]
Using the same reasoning as in the proof of Lemma 5 for the inequality (14) with gives:
[TABLE]
Combining the preceding two relations yields
[TABLE]
We next approximate the term that is linear in , i.e. , with a sum of two quadratic terms, one of which is in the order of , as:
[TABLE]
for any arbitrary . Substituting the preceding estimate in (16), we obtain the stated relation.
2.2 Convergence rates
In this section we derive the convergence rates of Algorithm (3). For this, we first provide a recurrence relation for the iterates in expectation, which is the key relation for our convergence rate results. Note that according to Lemma 1 and . In the sequel we provide a detailed convergence analysis for the non-trivial case . The other case, i.e. , implies almost sure feasibility for any generated by the parallel algorithm, with , and it will be discussed in Remark 3.
Theorem 2.1
Consider the iterative process (3), and let Assumption 1 and Assumption 2 hold. Let the stepsizes be such that for all and , with defined in (15), and assume . Then, for the algorithm (3), by defining , we have almost surely for all ,
[TABLE]
Proof
From Lemma 6, by taking the conditional expectation on the past , we have almost surely for all ,
[TABLE]
where is arbitrary. By Assumption 2, it follows that
[TABLE]
Hence
[TABLE]
Taking the conditional expectation on the past in the relation of Lemma 5, and using relation (20), we obtain almost surely
[TABLE]
where we denote
[TABLE]
Recall that we assume , then (since ). Hence, . By dividing with , we further obtain
[TABLE]
Substituting the preceding estimate in relation (20), yields
[TABLE]
We now use estimate (23) in relation (17), and thus obtain
[TABLE]
By the definition of (see (22)), we have
[TABLE]
Hence,
[TABLE]
and by letting , the desired relation follows.
We now turn our attention to the stepsize . We consider of the form:
[TABLE]
for some diminishing sequence as detailed below. Indeed, for this choice, the recurrence from Theorem 2.1 becomes:
[TABLE]
where recall that . Let be given by
[TABLE]
Since the sequence is decreasing, we have
[TABLE]
implying that
[TABLE]
Using this estimate in (24), we obtain
[TABLE]
Next, we note that
[TABLE]
Dividing (2.2) by and using the preceding inequality we have for all , after taking total expectations and rearranging terms:
[TABLE]
Summing these over , for some , we obtain
[TABLE]
Using the definition of , (28) implies
[TABLE]
We finally obtain by the linearity of the expectation operation:
[TABLE]
Define for the sum
[TABLE]
Define also the following weighted averages (convex combinations)
[TABLE]
with , hence satisfying . Using convexity of the function and of the norm-squared, we have
[TABLE]
If we define , then (33) becomes:
[TABLE]
The next theorem summarizes the convergence rates that follow from the previous discussion. For simplicity of the exposition, we omit the constants and express the rates only in terms of the dominant powers of :
Theorem 2.2
Let Assumption 1 and Assumption 2 hold and the stepsizes and , with defined in (15). Let also assume . Then, and . Moreover, the following sublinear rates for suboptimality and feasibility violation hold for the average sequence generated by the parallel algorithm (3):
[TABLE]
Proof
From the recurrence (35), omitting the constants but keeping the terms depending on , we get the following convergence rates in terms of these weighted averages and :
[TABLE]
Since and using Jensen’s inequality, we get the following convergence rate for the feasibility violation of the constraints:
[TABLE]
Since and , by the subgradient boundedness of on , it follows that
[TABLE]
which combined with , yields also the following convergence rate for suboptimality
[TABLE]
which proves our theorem.
We observe that the convergence estimate for the feasibility violation depends explicitly on the minibatch size via the key parameter . For the optimal stepsize we get and . Hence, is large provided that (small). Note that if , then does not depend on and hence complexity does not improve with minibatch size . However, as long as (and it can be also the case that ), then becomes large, which shows that minibatching improves complexity. To the best of our knowledge, this is the first time that a subgradient method with random minibatch feasibility updates is shown to be better than its non-minibatch variant. We have identified as the key quantity determining whether minibatching helps () or not (), and how much (the smaller , the more it helps). Note also that the suboptimality estimate contains a term which does not depend on the minibatch size as it happens for feasibility violation estimate. This is natural, since the minibatch feasibility steps have no effect on the minimization step of the objective function.
Remark 2
Note that the convergence rates for feasibility and suboptimality are known to be optimal for the stochastic subgradient method for solving the optimization problem (1) under Assumption 1, see NemYud:83 ; Nes:04 . Moreover, the iterative process (3) does not require knowledge of the subgradient norm bounds and from Assumption 1, nor the constant from Assumption 2. These values affect only the constants in the convergence rates; they are not needed for the stepsize selection. The stepsize requires only knowledge of some estimate of the strong convexity constant . Moreover, since , we can use e.g., stepsize . Of course, a larger stepsize leads to faster convergence. Hence, if and it can be computed, then we should choose an extrapolated steplength for some small. When cannot be computed explicitly, we propose to approximate it online with , and use at each iteration an adaptive extrapolated stepsize of the form for some (see also the discussion from Section 4, equation (48)).
Remark 3
The convergence rates from Theorem 2.2 hold for the non-trivial case . Note that the inequality is always satisfied, provided that . On the other hand, the case (e.g., and ) turns out to be the ideal case, since then we have from (21) that . Therefore, in this ideal case we achieve almost sure feasibility for the sequence generated by the parallel algorithm (see (3)) after one step:
[TABLE]
Using this feasibility relation in the same derivations from Section 2.2 we also get a suboptimality estimate for the average sequence as in Theorem 2.2:
[TABLE]
Clearly, from Jensen’s inequality we also have almost sure feasibility for the average sequence :
[TABLE]
We skip these details since the proof is the same as for the non-trivial case.
2.3 Example of functional constraints having $L < 1$
Let us recall the definition of the parameters and from (15):
[TABLE]
From Jensen’s inequality we have and consequently . On the other hand, Theorem 2.2 shows that is beneficial for a subgradient scheme with minibatch feasibility updates. In this section we provide an example of functional constraints for which . Let us consider linear inequality constraints for the convex problem (1):
[TABLE]
Without loss of generality we assume for all . Let us define the matrix and the subset of indexes selected at the current iteration . We also denote and denote the submatrix of having the rows indexed in the set . With these notations and using that for all , then can be written explicitly as (assuming that ):
[TABLE]
where the first inequality follows from the definition of the maximal eigenvalue of a matrix, the second inequality follows from the fact that , and the third inequality holds strictly provided that the submatrix has at least rank two. In conclusion, if the matrix has e.g. full row rank and consider a sampling of based on a given probability P, then satisfies:
[TABLE]
Note that for particular sampling rules we can compute efficiently, such as when we consider a uniform distribution over a fixed partition of of equal size. The reader may find other examples of functional constraints satisfying and we believe that this paper opens a window of opportunities for algorithmic research in this direction.
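The eigenvalue argument above can be checked numerically. The following is a minimal sketch, not the paper's exact constant from (15) or (37): it assumes unit-norm rows and evaluates the illustrative quantity `minibatch_ratio`, defined here as the largest eigenvalue of the Gram matrix of a row block divided by the block size. The names `minibatch_ratio`, `blocks` and the problem sizes are hypothetical.

```python
import numpy as np

def minibatch_ratio(A_J):
    """Largest eigenvalue of A_J A_J^T divided by the number of rows N.

    Assumes the rows of A_J have unit Euclidean norm, so trace(A_J A_J^T) = N
    and the returned value lies in [1/N, 1].
    """
    N = A_J.shape[0]
    gram = A_J @ A_J.T
    lam_max = np.linalg.eigvalsh(gram)[-1]  # eigenvalues come in ascending order
    return lam_max / N

rng = np.random.default_rng(0)
m, n, N = 12, 20, 4                              # hypothetical sizes
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)    # normalize the rows

# Fixed partition of the row indices into blocks of equal size N.
blocks = np.arange(m).reshape(-1, N)
ratios = np.array([minibatch_ratio(A[J]) for J in blocks])

# Under uniform sampling over the blocks, the expected ratio is the plain mean.
mean_ratio = ratios.mean()
```

In this sketch every block has rank larger than one, so each ratio is strictly below 1. A block made of N copies of the same row has rank one and gives a ratio equal to 1, which is the boundary case excluded by the rank condition.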
3 Sequential random minibatch subgradient algorithm
In this section we consider a sequential variant of the algorithm (3) defined in terms of the following iterative process:
**Algorithm (sequential case)**
Choose , minibatch size , and stepsizes and . For repeat:
Draw samples .
Compute the following updates:
(38a)
(38b)
(38c)
This method takes, as in the parallel variant, one subgradient step for the objective function, followed by sequential feasibility updates. As before, the vector is chosen as if , and for some if . Note that in this variant, the feasibility updates use the projection on in order to confine the intermediate iterates and to the set , where ’s and (for the last step) are assumed to have uniformly bounded subgradients.
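The structure described above (one projected subgradient step, then a pass of projected feasibility steps over the sampled constraints) can be sketched as follows. This is an illustration under stated assumptions, not the paper's updates (38a)-(38c) verbatim: the feasibility step is assumed to be of Polyak type, the objective is a hypothetical strongly convex quadratic, the constraints are linear, the simple set is a box, the stepsize is taken as 2/(mu (k+1)), and the averaging weights are assumed proportional to the iteration counter.

```python
import numpy as np

def project_box(x, lo, hi):
    """Projection onto the simple set, taken here to be a box (assumption)."""
    return np.clip(x, lo, hi)

def sequential_minibatch_subgradient(c, A, b, N, iters, beta=1.0, mu=1.0,
                                     lo=-10.0, hi=10.0, seed=0):
    """Sketch for: min 0.5*||x - c||^2  s.t.  A x <= b,  lo <= x <= hi."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    x_avg = np.zeros(n)
    weight_sum = 0.0
    for k in range(1, iters + 1):
        alpha = 2.0 / (mu * (k + 1))           # assumed diminishing stepsize
        # Objective step: gradient of 0.5*||x - c||^2 is x - c.
        z = project_box(x - alpha * (x - c), lo, hi)
        # Sequential feasibility steps over N randomly sampled constraints.
        for i in rng.integers(0, m, size=N):
            violation = max(A[i] @ z - b[i], 0.0)
            z = project_box(z - beta * violation / (A[i] @ A[i]) * A[i], lo, hi)
        x = z
        # Running weighted average with weights proportional to k (assumption).
        weight_sum += k
        x_avg += (k / weight_sum) * (x - x_avg)
    return x, x_avg

# Hypothetical toy instance: project c = (2, 2) onto three half-spaces.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 1.0, 1.5])
c = np.array([2.0, 2.0])
x_last, x_avg = sequential_minibatch_subgradient(c, A, b, N=3, iters=2000)
```

For this toy instance the constrained minimizer is (0.75, 0.75), so both the last iterate and the weighted average should end up close to that point.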
In this section we analyze the convergence properties of this new algorithm (38). Given , the update of is the same as in the parallel method (3), thus Lemma 2 still applies here. We need an analog of Lemma 5.
Lemma 7
Let Assumptions 1(a) and 1(c) hold. Let be generated by algorithm (38) with . Then, the following relations are valid:
[TABLE]
Proof
We start with the definition of in (38b) and Lemma 3, with . Thus, we obtain for all (which satisfies for any realization of ) and for all ,
[TABLE]
By using , we have for all ,
[TABLE]
The distance relation for -iterates follows by taking the minimum over on both sides of inequality (39). By summing relations (39) over , and by using and , we obtain for any ,
[TABLE]
The distance relation follows by taking the minimum over on both sides of the preceding inequality.
Taking in Lemma 2 we get:
[TABLE]
and using the inequality for from Lemma 7 in , yields:
[TABLE]
Taking the conditional expectation on and , and using Assumption 2, gives
[TABLE]
Using the iterated expectation rule, we obtain
[TABLE]
which, when combined with the distance relation of Lemma 7 gives for all
[TABLE]
Recall that according to Lemma 1. In the subsequent analysis we consider the non-trivial case . The ideal case allows one to obtain feasibility in expectation in one step and a convergence rate result similar to the one in Remark 3. Hence, using the definition of , i.e., , and letting (since we assume and ), we have for all ,
[TABLE]
implying that for all ,
[TABLE]
[TABLE]
By summing over
[TABLE]
However,
[TABLE]
Finally, we get
[TABLE]
and consequently
[TABLE]
Let us denote . It is clear that as . Taking expectation in (40) and using the previous inequality we get an analog of Lemma 6:
[TABLE]
for any . Let us consider the same stepsize as for the parallel scheme, i.e. , choose , and take the full expectation, to get the following recurrence (analog to Theorem 2.1):
[TABLE]
Using now , then and we get:
[TABLE]
Since, for all , dividing (43) by and using the preceding inequality we have for all :
[TABLE]
Summing these over , for some , we obtain the following recurrence relation for the algorithm (38):
[TABLE]
Using the same definition for the weighted averages and from (32) and in (45), we get the main recurrence for the sequential variant (38):
[TABLE]
The next theorem summarizes the convergence rates that follow from the recurrence relation (46) of the sequential algorithm (38).
Theorem 3.1
Let Assumption 1 and Assumption 2 hold and the stepsizes and . Let also and . Then, the following sublinear rates for suboptimality and feasibility violation hold for the average sequence from (32) generated by the sequential algorithm (38):
[TABLE]
Proof
Defining the same average sequences and as in (32), we get the following convergence rates (omitting the constants but keeping the terms depending on ):
[TABLE]
Hence, we get the following convergence rate for the feasibility violation of the constraints that depends explicitly on the minibatch size via the term :
[TABLE]
Using the same reasoning as in the proof of Theorem 2.2, we also get the following convergence rate for suboptimality:
[TABLE]
which proves the statements of the theorem.
We observe that for the sequential algorithm (38) as well, the convergence estimate for the feasibility violation depends explicitly on the minibatch size via the term (recall that as ). Since is an increasing sequence in , it follows that the larger the minibatch size, the better the complexity of the sequential algorithm (38) in terms of constraint feasibility. In conclusion, for the sequential variant our rates prove that minibatching always helps and the feasibility estimate depends exponentially on the minibatch size . On the other hand, the suboptimality estimate contains a term that does not depend on the minibatch size , unlike the feasibility violation estimate. Recall that for the parallel algorithm we proved that minibatching works only for and the estimates depend linearly on .
4 Extensions
In this section we discuss some possible extensions of the framework presented in this paper related to the objective function, algorithms and stepsizes. Some of these extensions will be considered in our future work.
First, from our convergence analysis it is easy to see that the derivations remain valid for a larger class of objective functions in the model (1). More precisely, we can replace the boundedness of the subgradients of (Assumption 1(b)), i.e. , with a more general assumption, namely that there exist two constants such that the (sub)gradients of satisfy the following inequality:
[TABLE]
Clearly, this condition covers the class of functions with bounded subgradients, e.g. take , and also the class of functions with Lipschitz continuous gradients Nes:04 . Indeed, if there is such that the gradients satisfy:
[TABLE]
then we immediately get
[TABLE]
which proves our inequality for and . Our convergence analysis can be easily adapted to this more general assumption; however, the recurrence relations become more cumbersome. For example, the recurrence from Lemma 2 now becomes:
[TABLE]
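For the Lipschitz gradient case mentioned in this first extension, the bound on the gradient norm can be derived in two lines. The sketch below uses assumed notation (a fixed reference point $\bar{x}$ and a Lipschitz constant $L$), since the paper's own symbols and constants are not reproduced here; it relies only on the elementary inequality $\|a\|^2 \le 2\|a-b\|^2 + 2\|b\|^2$.

```latex
\|\nabla f(x)\|^2
  \;\le\; 2\,\|\nabla f(x) - \nabla f(\bar{x})\|^2 + 2\,\|\nabla f(\bar{x})\|^2
  \;\le\; 2L^2\,\|x - \bar{x}\|^2 + 2\,\|\nabla f(\bar{x})\|^2 .
```

Under these assumptions the squared gradient norm is bounded by a constant plus a multiple of the squared distance to the reference point, which is the shape of the generalized condition.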
Second, when the objective function has an easy proximal operator we can replace the subgradient steps (3a) and (38a) by a proximal point step:
[TABLE]
An algorithm combining the proximal point step with a single feasibility step (i.e., ) has been considered in PatNec:17 and convergence rates of order have been proved provided that the objective function is smooth (i.e., it has Lipschitz continuous gradient) and strongly convex. Note that it is easy to extend that convergence analysis to the minibatch settings following the framework developed in this paper.
The third extension still concerns the objective function, considering in the composite form, i.e.:
[TABLE]
where is smooth and can be non-smooth but admits an easy proximal operator. Note that if the set is present in the optimization model (1), then it can be included in as the indicator function. For this composite objective function, steps (3a) and (38a) can be replaced by:
[TABLE]
Note that for , the indicator function of the convex set , we recover the updates (3a) and (38a). Hence, it will be interesting to extend our convergence analysis to this general composite objective function .
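A composite step of this kind can be sketched as a standard proximal gradient update: a gradient step on the smooth part followed by the proximal operator of the non-smooth part. The code below is a hedged illustration, not the paper's formula: the names `prox_gradient_step`, `grad_g` and `prox_h`, the quadratic smooth part, the l1 regularizer and the box are all assumptions chosen for the example.

```python
import numpy as np

def soft_threshold(u, tau):
    """Proximal operator of tau * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def prox_gradient_step(x, grad_g, prox_h, alpha):
    """One step: gradient step on the smooth part, then prox of the non-smooth part."""
    return prox_h(x - alpha * grad_g(x), alpha)

# Hypothetical smooth part g(x) = 0.5 * ||x - c||^2, so grad g(x) = x - c.
c = np.array([3.0, -0.2, 0.5])
grad_g = lambda x: x - c

# Case 1: h = lam * ||x||_1, whose prox is soft-thresholding with level alpha * lam.
lam = 1.0
prox_l1 = lambda u, alpha: soft_threshold(u, alpha * lam)
v_l1 = prox_gradient_step(np.zeros(3), grad_g, prox_l1, alpha=1.0)

# Case 2: h = indicator of the box [-1, 1]^3, whose prox is the projection.
# The step then reduces to a projected gradient step.
prox_box = lambda u, alpha: np.clip(u, -1.0, 1.0)
v_box = prox_gradient_step(np.zeros(3), grad_g, prox_box, alpha=1.0)
```

The second case illustrates the remark about the indicator function: when the non-smooth part is the indicator of a convex set, its proximal operator is the projection onto that set, independently of the stepsize.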
Finally, in the parallel algorithm (see (3)) the feasibility steps depend on an extrapolated stepsize . When cannot be computed explicitly, we propose to approximate it online with , and use at each iteration an adaptive extrapolated stepsize of the form:
[TABLE]
for some sufficiently small. The convergence rate of the parallel algorithm for this adaptive choice (48) of the stepsize will be analyzed in our future work (see e.g., NecNed:19 for some preliminary results related to the convex feasibility problem).
5 Preliminary numerical results
Many data-driven optimization applications can be formulated as convex optimization problems with an objective function composed of a quadratic term and a regularizer, subject to constraints (the so-called constrained Lasso), of the form:
[TABLE]
where the problem is parametrised by the data (measurements) , is an appropriate linear operator (e.g., the forward operator, the circular convolution) and is another linear operator (e.g., the identity, the finite difference or the wavelet transform). Additionally, we impose constraints of the form , where . The constrained Lasso arises e.g. in image deblurring or denoising, computerised tomography or some inverse problems, see e.g. BotHei:12 . Note that for this formulation the strong convexity Assumption 1(b) holds for full column rank matrices (see e.g., NecNes:15 ) and the linear regularity Assumption 2 also holds (see e.g. NecRic:18 ). Moreover, the set is compact, so the objective function has bounded subgradients, and the functional constraints are linear; consequently, Assumption 1 also holds.
In our experiments we use synthetic data, where is a Toeplitz-like matrix and the finite difference operator (as in image deblurring BotHei:12 ). We also generate randomly, with constraints. We consider a partition of of equal size , i.e., . Hence, . We compute as in (37) for this partition. We consider full iterations, i.e. we plot the behavior of the algorithms over epochs (number of passes over all the rows of matrix ).
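A synthetic setup of this kind could be generated along the following lines. This is a sketch under assumptions, not the authors' data generator: the Gaussian banded kernel, the problem sizes, the way feasibility of a reference point is enforced, and all names (`toeplitz_like`, `finite_difference`, `make_synthetic_problem`, `C`, `d`, `blocks`) are hypothetical.

```python
import numpy as np

def toeplitz_like(n, bandwidth=3):
    """Banded symmetric Toeplitz matrix with a Gaussian kernel (assumed blur model)."""
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    return np.where(dist <= bandwidth, np.exp(-0.5 * dist.astype(float) ** 2), 0.0)

def finite_difference(n):
    """First-order forward difference operator of shape (n - 1, n)."""
    D = np.zeros((n - 1, n))
    rows = np.arange(n - 1)
    D[rows, rows] = -1.0
    D[rows, rows + 1] = 1.0
    return D

def make_synthetic_problem(n=50, m=40, N=8, seed=0):
    """Data for a constrained Lasso instance with m linear inequality constraints."""
    rng = np.random.default_rng(seed)
    A = toeplitz_like(n)
    D = finite_difference(n)
    x_ref = rng.standard_normal(n)
    b = A @ x_ref + 0.01 * rng.standard_normal(n)      # noisy measurements
    C = rng.standard_normal((m, n))
    C /= np.linalg.norm(C, axis=1, keepdims=True)      # unit-norm constraint rows
    d = C @ x_ref + rng.uniform(0.0, 1.0, size=m)      # x_ref satisfies C x <= d
    blocks = rng.permutation(m).reshape(m // N, N)     # partition into equal blocks
    return A, D, b, C, d, blocks

A, D, b, C, d, blocks = make_synthetic_problem()
# One epoch (a full pass over all constraint rows) takes m / N minibatch iterations.
iterations_per_epoch = blocks.shape[0]
```

With a partition into blocks of size N, counting progress in epochs rather than iterations puts different minibatch sizes on the same footing, since each epoch touches every constraint row once.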
In the first set of experiments we compare the parallel (see (3)) and sequential (see (38)) algorithms for different minibatch sizes and on a constrained Lasso problem with . The plots in Fig. 1 present the convergence behavior of these algorithms in terms of feasibility violation of the average point over full iterations : parallel algorithm (left) and sequential algorithm (right). As we can see from Fig. 1, increasing the minibatch size usually leads to better convergence for both algorithms.
Then, we compare the parallel algorithm with the extrapolated stepsize and the sequential algorithm with . The results on a problem of dimension and minibatch size are displayed in Fig. 2: suboptimality (left) and feasibility violation (right) in the average point over full iterations. We observe a faster convergence for the sequential algorithm, as our theory also predicted.
Finally, we compare the parallel algorithm (3) based on our extrapolated stepsize and a variant with fixed stepsize . The results on a constrained Lasso problem of dimension and minibatch size are displayed in Fig. 3: suboptimality (left) and feasibility violation (right). We observe that extrapolation substantially accelerates the parallel algorithm in terms of the feasibility criterion. Note also that all the plots show a rate for the average sequence in the feasibility criterion, thus supporting our theoretical findings.
6 Conclusions
In this paper we have considered (non-smooth) convex optimization problems with a (possibly) infinite intersection of constraints. For solving this general class of convex problems we have proposed subgradient algorithms with random minibatch feasibility steps. At each iteration, our algorithms first take a step for minimizing the objective function and then a subsequent step minimizing the feasibility violation of the observed minibatch of constraints. The feasibility updates were performed based on either parallel or sequential random observations of several constraint components. For a diminishing stepsize and for strongly convex objective functions, we have proved sublinear convergence rates for the expected distances of the weighted averages of the iterates from the constraint set, as well as for the expected suboptimality of the function values along the weighted averages. Our convergence rates are optimal for subgradient methods with random feasibility steps for solving this class of non-smooth convex problems. Moreover, the rates depend explicitly on the minibatch size. To our knowledge, this work is the first to derive conditions under which minibatching works for subgradient methods with random minibatch feasibility updates and to show how much better their complexity is compared to the non-minibatch variants. Finally, our convergence analysis shows that for the sequential algorithm minibatching always helps and the feasibility estimate depends exponentially on the minibatch size, while for the parallel algorithm we proved that minibatching works only when some parameter of the optimization problem is strictly less than 1. The numerical results also support the convergence results.
Acknowledgements.
This research was supported by the National Science Foundation under CAREER grant CMMI 07-42538 and by the Executive Agency for Higher Education, Research and Innovation Funding (UEFISCDI), Romania, PNIII-P4-PCE-2016-0731, project ScaleFreeNet, no. 39/2017.
- (1) D. Blatt and A.O. Hero, Energy based sensor network source localization via projection onto convex sets, IEEE Transactions on Signal Processing, 54(9): 3614–3619, 2006.
- (2) H. Bauschke and J. Borwein, On projection algorithms for solving convex feasibility problems, SIAM Review, 38(3): 367–376, 1996.
- (3) R.I. Bot and C. Hendrich, A double smoothing technique for solving unconstrained nondifferentiable convex optimization problems, Computational Optimization and Applications, 54(2): 239–262, 2013.
- (4) R.I. Bot and T. Hein, Iterative regularization with general penalty term - theory and application to $L_1$ and TV regularization, Inverse Problems, 28(10): 1–19, 2012.
- (5) J. Burke and M. Ferris, Weak sharp minima in mathematical programming, SIAM Journal of Control and Optimization, 31(6): 1340–1359, 1993.
- (6) P. Bianchi, W. Hachem, and A. Salim, A constant step forward-backward algorithm involving random maximal monotone operators, arXiv preprint (arXiv:1702.04144), 2017.
- (7) D.P. Bertsekas, Incremental proximal methods for large scale convex optimization, Mathematical Programming, 129(2): 163–195, 2011.
- (8) O. Fercoq, A. Alacaoglu, I. Necoara and V. Cevher, Almost surely constrained convex optimization, International Conference on Machine Learning (ICML), 2019.
