Parallel Random Block-Coordinate Forward-Backward Algorithm: A Unified Convergence Analysis
Saverio Salzo, Silvia Villa

TL;DR
This paper introduces a unified convergence analysis for a parallel, random block-coordinate forward-backward algorithm, demonstrating its convergence properties and rates under various conditions.
Contribution
It provides a comprehensive convergence analysis for a flexible parallel block-coordinate algorithm with arbitrary update probabilities and stepsizes.
Findings
Almost sure weak convergence in convex and infinite-dimensional settings
O(1/n) convergence rate for mean function values
Linear convergence under strong convexity and error bounds
Abstract
We study the block-coordinate forward-backward algorithm in which the blocks are updated in a random and possibly parallel manner, according to arbitrary probabilities. The algorithm allows different stepsizes along the block-coordinates to fully exploit the smoothness properties of the objective function. In the convex case and in an infinite dimensional setting, we establish almost sure weak convergence of the iterates and the asymptotic rate o(1/n) for the mean of the function values. We derive linear rates under strong convexity and error bound conditions. Our analysis is based on an abstract convergence principle for stochastic descent algorithms which allows to extend and simplify existing results.
Saverio Salzo (Istituto Italiano di Tecnologia, Via Melen, 83, 16152 Genova, Italy) and Silvia Villa (Università degli Studi di Genova, Via Dodecaneso, 35, 16146 Genova, Italy). Supported by the H2020-MSCA-RISE project NoMADS-GA No. 777826 and by Gruppo Nazionale per l’Analisi Matematica, la Probabilità e le loro Applicazioni (GNAMPA) of the Istituto Nazionale di Alta Matematica (INdAM).
Keywords. Convex optimization, parallel algorithms, random block-coordinate descent, arbitrary sampling, error bounds, stochastic quasi-Fejér sequences, forward-backward algorithm, convergence rates.
AMS Mathematics Subject Classification: 65K05, 90C25, 90C06, 49M27
1 Introduction and problem setting
Random block-coordinate descent algorithms are nowadays among the methods of choice for solving large scale optimization problems [28, 34, 41]. Indeed, they have low complexity and low memory requirements and, additionally, they are amenable to distributed and parallel implementations [32, 34]. In the last decade a number of works have appeared on the topic, addressing several aspects: the way the block sampling is performed, the composite structure, the partial separability, the smoothness and geometrical properties of the objective function, accelerations, and iteration complexity [4, 5, 13, 21, 23, 24, 28, 30, 31, 33, 34, 38].
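In this work we consider the following optimization problem
$$\underset{\mathsf{x}\in\mathsf{H}}{\text{minimize}}\ \ \mathsf{f}(\mathsf{x})+\mathsf{g}(\mathsf{x}),\qquad \mathsf{g}(\mathsf{x})=\sum_{i=1}^{m}\mathsf{g}_{i}(\mathsf{x}_{i}), \tag{1.1}$$
where $\mathsf{H}$ is the direct sum of $m$ separable real Hilbert spaces $(\mathsf{H}_{i})_{1\leq i\leq m}$, that is,
$$\mathsf{H}=\bigoplus_{i=1}^{m}\mathsf{H}_{i}\quad\text{and}\quad(\forall\,\mathsf{x},\mathsf{y}\in\mathsf{H})\quad\langle\mathsf{x},\mathsf{y}\rangle=\sum_{i=1}^{m}\langle\mathsf{x}_{i},\mathsf{y}_{i}\rangle,$$
and the following assumptions hold:
- (H1) $\mathsf{f}\colon\mathsf{H}\to\mathbb{R}$ is convex and differentiable,
- (H2) for every $i\in\{1,\dots,m\}$, $\mathsf{g}_{i}\colon\mathsf{H}_{i}\to\,]-\infty,+\infty]$ is proper, convex, and lower semicontinuous.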
The object of this study is a stochastic algorithm, called parallel random block-coordinate forward-backward algorithm, that depends on a random variable $\varepsilon$ satisfying the following hypothesis
- (H3) $\varepsilon=(\varepsilon_{1},\dots,\varepsilon_{m})$ is a random variable with values in $\{0,1\}^{m}$ such that, for every $i\in\{1,\dots,m\}$, $\mathsf{p}_{i}:=\mathsf{P}(\varepsilon_{i}=1)>0$, and $\mathsf{P}\big(\varepsilon=(0,\dots,0)\big)=0$.
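Algorithm 1.1.
Let $(\varepsilon^{n})_{n\in\mathbb{N}}$ be a sequence of independent copies of $\varepsilon$. Let $(\gamma_{i})_{1\leq i\leq m}\in\mathbb{R}_{++}^{m}$ and $x^{0}\equiv\mathsf{x}^{0}\in\operatorname{dom}\mathsf{g}$ be a constant random variable. Iterate, for $n=0,1,\dots$ and for $i=1,\dots,m$,
$$x_{i}^{n+1}=x_{i}^{n}+\varepsilon_{i}^{n}\Big(\operatorname{prox}_{\gamma_{i}\mathsf{g}_{i}}\big(x_{i}^{n}-\gamma_{i}\nabla_{i}\mathsf{f}(x^{n})\big)-x_{i}^{n}\Big). \tag{1.2}$$
For every $n\in\mathbb{N}$, we denote by $\mathfrak{E}_{n}$ the sigma-algebra generated by $\varepsilon^{0},\dots,\varepsilon^{n}$.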
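The iteration can be illustrated on a small Lasso-type instance. The following is a minimal sketch, not the authors' implementation: it assumes that each update has the usual forward-backward form $x_i^{n+1}=x_i^n+\varepsilon_i^n\big(\operatorname{prox}_{\gamma_i\mathsf{g}_i}(x_i^n-\gamma_i\nabla_i\mathsf{f}(x^n))-x_i^n\big)$, takes $\mathsf{f}(\mathsf{x})=\tfrac12\|A\mathsf{x}-b\|^2$ and $\mathsf{g}_i=\lambda|\cdot|$ with scalar blocks, and selects exactly `tau` blocks uniformly at random at each iteration. The function names, the data layout (a list of rows), and the conservative choice $\nu_i=\tau L_i$ with $\gamma_i=1/\nu_i$ are assumptions made for illustration only.

```python
import random


def soft_threshold(t, kappa):
    """Proximity operator of kappa * |.| evaluated at the scalar t."""
    if t > kappa:
        return t - kappa
    if t < -kappa:
        return t + kappa
    return 0.0


def objective(A, b, lam, x):
    """F(x) = 0.5 * ||A x - b||^2 + lam * ||x||_1, with A given as a list of rows."""
    residual = [sum(a * xj for a, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
    return 0.5 * sum(r * r for r in residual) + lam * sum(abs(xj) for xj in x)


def parallel_random_bcfb(A, b, lam, tau, n_iter, seed=0):
    """Sketch of the iteration with scalar blocks and exactly tau blocks per step."""
    rng = random.Random(seed)
    m = len(A[0])
    # Blockwise Lipschitz constants of the partial gradients: L_i = ||A[:, i]||^2.
    L = [sum(row[i] ** 2 for row in A) for i in range(m)]
    # Assumed conservative smoothness parameters nu_i = tau * L_i and
    # stepsizes gamma_i = 1 / nu_i, which satisfy gamma_i < 2 / nu_i.
    gamma = [1.0 / (tau * Li) for Li in L]
    x = [0.0] * m
    history = [objective(A, b, lam, x)]
    for _ in range(n_iter):
        selected = rng.sample(range(m), tau)  # indices i with eps_i^n = 1
        residual = [sum(a * xj for a, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
        # Every selected block uses the gradient at the same point x^n,
        # so the block updates are independent and could run in parallel.
        updates = {}
        for i in selected:
            grad_i = sum(row[i] * r for row, r in zip(A, residual))
            updates[i] = soft_threshold(x[i] - gamma[i] * grad_i, gamma[i] * lam)
        for i, value in updates.items():
            x[i] = value
        history.append(objective(A, b, lam, x))
    return x, history
```

With this stepsize choice the recorded objective values are nonincreasing along every run, which is the kind of almost sure descent behaviour discussed later in the paper; larger stepsizes are possible when sharper smoothness parameters are available.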
In Algorithm 1.1, the role of the random variable $\varepsilon^{n}$ is to select, at iteration $n$, the blocks to update in parallel (those indexed in $\{i\,|\,\varepsilon_{i}^{n}=1\}$). When all block-coordinates are simultaneously updated at each iteration, Algorithm 1.1 reduces to the (deterministic) forward-backward algorithm, which converges only if the stepsizes are appropriately set. More specifically, if $\nabla\mathsf{f}$ is $L$-Lipschitz continuous, then convergence is ensured if the stepsizes are all equal and strictly less than $2/L$ [6, 7]. This fact is proved by using the so-called descent lemma, i.e.,
$$(\forall\,\mathsf{x},\mathsf{v}\in\mathsf{H})\quad\mathsf{f}(\mathsf{x}+\mathsf{v})\leq\mathsf{f}(\mathsf{x})+\langle\nabla\mathsf{f}(\mathsf{x}),\mathsf{v}\rangle+\frac{L}{2}\|\mathsf{v}\|^{2}. \tag{1.3}$$
Indeed, (1.3) is itself an assumption concerning the smoothness of $\mathsf{f}$, since it is well-known to be equivalent to the Lipschitz continuity of the gradient of $\mathsf{f}$ [1, Theorem 18.15]. By contrast, when the block-coordinates are updated one by one in a serial manner, it is desirable to allow moving along the block-coordinates with different stepsizes, depending on the Lipschitz constants of the partial gradients of $\mathsf{f}$ across the block-coordinates [2, 28]. So, in this case it is more appropriate to assume that a descent lemma holds on each block-coordinate subspace individually, that is,
$$(\forall\,i\in\{1,\dots,m\})(\forall\,\mathsf{x}\in\mathsf{H})(\forall\,\mathsf{v}_{i}\in\mathsf{H}_{i})\quad\mathsf{f}(\mathsf{x}+\mathsf{v})\leq\mathsf{f}(\mathsf{x})+\langle\nabla_{i}\mathsf{f}(\mathsf{x}),\mathsf{v}_{i}\rangle+\frac{L_{i}}{2}\|\mathsf{v}_{i}\|^{2}, \tag{1.4}$$
where $\mathsf{v}=(0,\dots,0,\mathsf{v}_{i},0,\dots,0)$ ($\mathsf{v}_{i}$ occurring at the $i$-th position), for some positive constants $L_{i}$’s. In the setting of Algorithm 1.1, multiple block-coordinates may be updated in parallel at each iteration, according to the random sampling $\varepsilon$. Therefore, it is reasonable to assume that there exists $(\nu_{i})_{1\leq i\leq m}\in\mathbb{R}_{++}^{m}$ so that one of the generalized smoothness conditions below holds
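- (S1) $(\forall\,\mathsf{x},\mathsf{v}\in\mathsf{H})\quad\mathsf{E}[\mathsf{f}(\mathsf{x}+\varepsilon\odot\mathsf{v})]\leq\mathsf{f}(\mathsf{x})+\mathsf{E}[\langle\nabla\mathsf{f}(\mathsf{x}),\varepsilon\odot\mathsf{v}\rangle]+\frac{1}{2}\mathsf{E}\big[\sum_{i=1}^{m}\varepsilon_{i}\nu_{i}\|\mathsf{v}_{i}\|^{2}\big]$,
- (S2) $(\forall\,\mathsf{x},\mathsf{v}\in\mathsf{H})\quad\mathsf{f}(\mathsf{x}+\varepsilon\odot\mathsf{v})\leq\mathsf{f}(\mathsf{x})+\langle\nabla\mathsf{f}(\mathsf{x}),\varepsilon\odot\mathsf{v}\rangle+\frac{1}{2}\sum_{i=1}^{m}\varepsilon_{i}\nu_{i}\|\mathsf{v}_{i}\|^{2}$ $\ \mathsf{P}$-a.s.,
where $\varepsilon\odot\mathsf{v}=(\varepsilon_{i}\mathsf{v}_{i})_{1\leq i\leq m}$. Conditions S1 and S2 can be interpreted as descent lemmas on random block-coordinate subspaces, depending on the chosen random sampling of the block-coordinates. They reduce to (1.4), with $L_{i}=\nu_{i}$, if the sampling selects only one block at a time almost surely (see Section 3.2). We call $(\nu_{i})_{1\leq i\leq m}$ the smoothness parameters of $\mathsf{f}$. Then, similarly to the deterministic case, we will adopt the following stepsize rule
$$(\forall\,i\in\{1,\dots,m\})\quad\gamma_{i}<\frac{2}{\nu_{i}}. \tag{1.5}$$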
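Another smoothness condition suitable for Algorithm 1.1, which was considered in [23], is
- (S3) $(\forall\,\mathsf{x},\mathsf{v}\in\mathsf{H})\quad\mathsf{f}(\mathsf{x}+\mathsf{v})\leq\mathsf{f}(\mathsf{x})+\langle\nabla\mathsf{f}(\mathsf{x}),\mathsf{v}\rangle+\frac{1}{2}\sum_{i=1}^{m}\nu_{i}\|\mathsf{v}_{i}\|^{2}$.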
Note that S3 $\Rightarrow$ S2 $\Rightarrow$ S1, which in turn implies the Lipschitz continuity of the gradient of $\mathsf{f}$ (see Theorem 3.1iv). So, possibly with different values of the $\nu_{i}$’s, the above conditions are all equivalent. The point is that in the parallel setting (where multiple blocks are updated in parallel at each iteration), S1 may be fulfilled with values of $\nu_{i}$ that are much smaller than those related to the other two conditions, which ultimately allows one to significantly increase the stepsizes and hence speed up the convergence. Moreover, S1 makes parallelization particularly effective on problems with a sparse structure and superior to the serial strategy (which updates a single block per iteration). See the discussion after Theorem 4.9. The critical role played by assumption S1 in the analysis of parallel randomized block-coordinate descent methods was pointed out in [31, 34, 35, 38]. There, it was called expected separable overapproximation (ESO) inequality. Condition S2 is new and serves to guarantee that Algorithm 1.1 is almost surely descending (Proposition 4.7), which is a property that is especially relevant when error bound conditions hold (see Section 4.3). Note that in [34] the issue of monotonicity of the algorithm was addressed for each sampling separately without any general guidance. Finally, we stress that, except for [4, 5] (which study the convergence of the iterates only), in all previous works the stepsizes $\gamma_{i}$’s are set equal to $1/\nu_{i}$. This is an unnecessary limitation that we remove, so as to match the standard stepsize rule of the forward-backward algorithm [6, 7].
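The chain of implications between the three smoothness conditions can be checked in two lines. The sketch below assumes the conditions take their usual form: S3 is the descent inequality $\mathsf{f}(\mathsf{x}+\mathsf{v})\leq\mathsf{f}(\mathsf{x})+\langle\nabla\mathsf{f}(\mathsf{x}),\mathsf{v}\rangle+\frac12\sum_i\nu_i\|\mathsf{v}_i\|^2$ for all $\mathsf{x},\mathsf{v}$; S2 is the same inequality with $\mathsf{v}$ replaced by $\varepsilon\odot\mathsf{v}=(\varepsilon_i\mathsf{v}_i)_{1\leq i\leq m}$, required to hold almost surely; and S1 is the corresponding inequality in expectation. These forms are assumptions of the sketch and should be compared with the statements of the conditions.

```latex
% S3 => S2: apply S3 with v replaced by \varepsilon\odot v and use \varepsilon_i^2=\varepsilon_i,
% which holds because \varepsilon_i takes values in \{0,1\}.
\mathsf{f}(\mathsf{x}+\varepsilon\odot\mathsf{v})
\leq \mathsf{f}(\mathsf{x})+\langle\nabla\mathsf{f}(\mathsf{x}),\varepsilon\odot\mathsf{v}\rangle
     +\frac{1}{2}\sum_{i=1}^{m}\nu_{i}\|\varepsilon_{i}\mathsf{v}_{i}\|^{2}
=    \mathsf{f}(\mathsf{x})+\langle\nabla\mathsf{f}(\mathsf{x}),\varepsilon\odot\mathsf{v}\rangle
     +\frac{1}{2}\sum_{i=1}^{m}\varepsilon_{i}\nu_{i}\|\mathsf{v}_{i}\|^{2}.

% S2 => S1: take expectations on both sides; every term is integrable
% because \varepsilon takes finitely many values.
\mathsf{E}\big[\mathsf{f}(\mathsf{x}+\varepsilon\odot\mathsf{v})\big]
\leq \mathsf{f}(\mathsf{x})+\mathsf{E}\big[\langle\nabla\mathsf{f}(\mathsf{x}),\varepsilon\odot\mathsf{v}\rangle\big]
     +\frac{1}{2}\mathsf{E}\Big[\sum_{i=1}^{m}\varepsilon_{i}\nu_{i}\|\mathsf{v}_{i}\|^{2}\Big].
```

The converse direction is where the values of the $\nu_i$’s may change, which is why the conditions are equivalent only up to different smoothness parameters.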
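Remark 1.2.
For every $i\in\{1,\dots,m\}$, the canonical embedding of $\mathsf{H}_{i}$ into $\mathsf{H}$ is the operator $\mathsf{J}_{i}\colon\mathsf{H}_{i}\to\mathsf{H}$, $\mathsf{x}_{i}\mapsto(0,\dots,0,\mathsf{x}_{i},0,\dots,0)$, where $\mathsf{x}_{i}$ occurs in the $i$-th position. Then Algorithm 1.1 can be written as
$$x^{n+1}=x^{n}+\sum_{i=1}^{m}\varepsilon_{i}^{n}\mathsf{J}_{i}\Big(\operatorname{prox}_{\gamma_{i}\mathsf{g}_{i}}\big(x_{i}^{n}-\gamma_{i}\nabla_{i}\mathsf{f}(x^{n})\big)-x_{i}^{n}\Big). \tag{1.6}$$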
1.1 Main contributions and comparison to previous work
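In the following we summarize the main contributions of this paper, where, for the sake of brevity, we set ${\mathsf{F}}=\mathsf{f}+\mathsf{g}$. We assume that H1–H3 are satisfied and that S1 is met. Then, the following hold.
- Algorithm 1.1 is descending in expectation and $\mathsf{E}[{\mathsf{F}}(x^{n})]\to\inf{\mathsf{F}}$, even if the infimum is not attained. If $\operatorname{argmin}{\mathsf{F}}\neq\varnothing$, then $\mathsf{E}[{\mathsf{F}}(x^{n})]-\min{\mathsf{F}}=o(1/n)$. In addition, a nonasymptotic bound for $\mathsf{E}[{\mathsf{F}}(x^{n})]-\min{\mathsf{F}}$ of order $O(1/n)$ holds. Finally, there exists a random variable $x_{*}$ with values in $\operatorname{argmin}{\mathsf{F}}$ such that $x^{n}\rightharpoonup x_{*}$ $\mathsf{P}$-a.s. See Theorem 4.9.
- If ${\mathsf{F}}$ is strongly convex or satisfies an error bound condition of Luo-Tseng type (see Section 4.3), then the iterates, as well as the corresponding function values generated by Algorithm 1.1, converge linearly in expectation. See Theorem 4.10, Theorem 4.16, and Theorem 4.19.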
Our results advance the state-of-the-art in the study of random block-coordinate descent methods under several aspects. We comment on this below.
- While convergence of the function values has been intensively studied in the related literature (see e.g., [16, 21, 23, 28, 29, 34, 35, 38]), surprisingly, in a convex setting, convergence of the iterates has been investigated only recently in [4], but with stepsizes set according to the global Lipschitz constant of $\nabla\mathsf{f}$. See also [12], which addresses the convergence of the iterates in the framework of primal-dual algorithms with a serial and uniform block sampling. We improve the existing results, since we show convergence of the iterates for Algorithm 1.1 in an infinite dimensional setting even when the stepsizes are chosen according to condition S1, which can incorporate the block Lipschitz constants of the gradient of $\mathsf{f}$ and is at the basis of the effectiveness of the parallel block-coordinatewise approach.
- The worst case asymptotic rate $o(1/n)$ for the mean of the function values is new in the setting of stochastic algorithms.
- Our analysis spotlights an abstract convergence principle for stochastic descent algorithms (Theorem 4.1) which is essentially a special form of the stochastic quasi-Fejér monotonicity property, involving also the values of the objective functions. This principle, previously investigated in a deterministic setting in [36], makes it possible to prove in a unified way both the almost sure convergence of the iterates and rates of convergence for the mean of the function values.
- As a by-product of the above analysis we single out an inequality (Proposition 4.4) which is pivotal for studying the convergence under error bound conditions, improving the results and simplifying the analysis in [23].
- We allow for parallel and arbitrary sampling of the blocks in a composite setting. The benefit of such sampling in terms of convergence rate was first investigated in [35] for a strongly convex and smooth objective function. In [30] a composite objective optimization problem was analyzed but for a slightly different algorithm. The rest of the studies deal either with parallel uniform sampling of the blocks [34], or with the case where a single block is updated at each iteration [21, 28].
- We also allow for stepsizes larger than those considered in the literature [16, 21, 23, 28, 29, 34, 35, 38], since we can let the stepsizes go beyond $1/\nu_{i}$ and be arbitrarily close to $2/\nu_{i}$, matching the standard rule for the forward-backward algorithm. This provides additional flexibility to the algorithm. Indeed, in the strongly convex case we show that the optimal stepsizes are strictly larger than $1/\nu_{i}$.
The rest of the paper is organized as follows. In Section 2 we give notation and basic facts. Section 3 shows how to determine the smoothness parameters when $\mathsf{f}$ features a partially separable structure. In Section 4 we carry out the convergence analysis and give the related theorems. Finally, Section 5 shows three applications and Section 6 provides some numerical experiments.
2 Notation and background
Notation.
We define , , for every integer , , and for every , . Scalar products and norms in Hilbert spaces are denoted by and respectively. If is a bounded linear operator between real Hilbert spaces, is its transpose operator, that is, the one satisfying , for every . Let be separable real Hilbert spaces and let be their direct sum. For every and we set . We will consider random variables with underlying probability space taking values in or . We use the default font for random variables and sans serif font for their realizations. The expected value operator is denoted by . A copy of a random variable is random variable having the same distribution of the given one. Let . The direct sum operator , where is the identity operator on , is the positive bounded linear operator on acting as . defines an equivalent inner product on
[TABLE]
which gives the norm . If and , we set . Let be proper, convex, and lower semicontinuous. The domain of is and the set of minimizers of is . The subdifferential of in the metric is the multivalued operator
[TABLE]
In case , it is simply denoted by . Clearly . If the function is differentiable, then, for every , and for all , . The proximity operator of in the metric is defined as
[TABLE]
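As a concrete scalar instance of a proximity operator in a weighted metric, the following sketch assumes the usual definition $\operatorname{prox}^{\mathsf{W}}_{\varphi}(\mathsf{x})=\operatorname{argmin}_{\mathsf{z}}\ \varphi(\mathsf{z})+\tfrac12\|\mathsf{z}-\mathsf{x}\|^2_{\mathsf{W}}$ and takes $\varphi=\kappa|\cdot|$ on the real line with $\mathsf{W}$ equal to multiplication by a weight $w>0$. In that case the minimizer is soft thresholding with threshold $\kappa/w$. The function names are hypothetical, and the grid search is included only as a numerical sanity check of the closed form.

```python
def prox_abs_weighted(x, kappa, w=1.0):
    """Closed form of argmin_z kappa*|z| + (w/2)*(z - x)**2 for kappa >= 0, w > 0."""
    threshold = kappa / w
    if x > threshold:
        return x - threshold
    if x < -threshold:
        return x + threshold
    return 0.0


def brute_force_prox(x, kappa, w=1.0, radius=5.0, steps=100001):
    """Grid-search minimizer of the same objective on [-radius, radius]."""
    best_z, best_val = None, float("inf")
    for k in range(steps):
        z = -radius + 2.0 * radius * k / (steps - 1)
        val = kappa * abs(z) + 0.5 * w * (z - x) ** 2
        if val < best_val:
            best_z, best_val = z, val
    return best_z
```

A larger weight $w$ makes the quadratic term dominate, so the threshold shrinks and the output stays closer to the input point.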
Referring to the functions in (1.1), we denote by and the moduli of strong convexity of and respectively, in the norm , where and the ’s are the stepsizes occurring in Algorithm 1.1. This means that and that, for every ,
[TABLE]
Note that, since is separable, by taking in (2.2), we have
[TABLE]
Remark 2.1**.**
If 1 is satisfied, the ’s are chosen as in (1.5), and (according to the convergence theorems), then we have
[TABLE]
Indeed, let , and , . It follows from (2.1) with (where is defined in Remark 1.2) and 1 that
[TABLE]
Thus, (2.4) follows.
Fact 2.2** ([10, Example 5.1.5]).**
Let and be independent random variables with values in the measurable spaces and respectively. Let be measurable and suppose that . Then , where for all , .
Fact 2.3**.**
Let be a random variable with values in and, for all , . Then and, for every , .
Fact 2.4** ([17]).**
Let be a decreasing sequence in . If , then, for every , and a_{n}=o\big{(}1/(n+1)\big{)}.
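For completeness, here is a short argument for a fact of this type, under the assumption that the statement reads as follows: $(a_n)_{n\in\mathbb{N}}$ is a decreasing sequence in $\mathbb{R}_+$ with $\sum_{n\in\mathbb{N}}a_n<+\infty$, and the conclusions are the bound $a_n\leq\frac{1}{n+1}\sum_{k\in\mathbb{N}}a_k$ and the rate $a_n=o(1/(n+1))$. The exact form of the bound in the original statement is an assumption here; the argument itself is elementary.

```latex
% Bound: the terms a_0,...,a_n are all at least a_n, and the remaining terms are nonnegative.
(n+1)\,a_{n}\leq\sum_{k=0}^{n}a_{k}\leq\sum_{k\in\mathbb{N}}a_{k}
\quad\Longrightarrow\quad
a_{n}\leq\frac{1}{n+1}\sum_{k\in\mathbb{N}}a_{k}.

% Little-o: for n >= 1, each of the last n-\lfloor n/2\rfloor terms is at least a_n, so
\frac{n}{2}\,a_{n}\leq\big(n-\lfloor n/2\rfloor\big)\,a_{n}
\leq\sum_{k=\lfloor n/2\rfloor+1}^{n}a_{k}
\leq\sum_{k>\lfloor n/2\rfloor}a_{k}\xrightarrow[n\to\infty]{}0,

% because the tail of a convergent series vanishes. Hence, for n >= 1,
(n+1)\,a_{n}\leq 2n\,a_{n}\to 0,
\qquad\text{that is,}\qquad a_{n}=o\big(1/(n+1)\big).
```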
3 Determining the smoothness parameters
In this section we provide a few scenarios for which the relaxed smoothness conditions S1 and S2 can be fully exploited, attaining tight values for the $\nu_{i}$’s. This ultimately allows one to take larger stepsizes and improves rates of convergence. In [31, 38] an extensive analysis of cases in which S1 is satisfied is presented.
3.1 General estimates.
We consider the following setting.
- \rmH4
The function is such that
[TABLE]
where, for every , is a convex differentiable function defined on a real Hilbert space and, for every , is a bounded linear operator. Moreover, , where, for all , I_{k}=\big{\{}i\in[m]\,|\,\mathsf{U}_{k,i}\neq 0\big{\}}, and .
We will also consider one of the following conditions.
- \rmL1
For every there exists such that, for every , the function is -Lipschitz continuous. 2. \rmL2
For every , is -Lipschitz continuous and for every , , the ranges of and are orthogonal.
Assumption H4 concerns the partial separability of the function $\mathsf{f}$. Depending on the number of the nonzero operators $\mathsf{U}_{k,i}$, each component of $\mathsf{f}$ might depend only on few block-variables: if every component depends on a single block, $\mathsf{f}$ is fully separable, whereas if some component depends on all the blocks, $\mathsf{f}$ is not separable. Note that L1 is equivalent to (1.4) and, since $\mathsf{f}$ is convex, implies the global Lipschitz continuity of the gradient of $\mathsf{f}$ (Corollary A.2). So either L1 or L2 implies the global Lipschitz smoothness of $\mathsf{f}$. However, considering the constants appearing in L1 or L2 leads in general to a finer analysis of the smoothness properties of $\mathsf{f}$, eventually determining smoothness parameters that are smaller than the global Lipschitz constant of $\nabla\mathsf{f}$. Instances of problem (1.1) where $\mathsf{f}$ has the structure shown in H4 occur very often in applications. In particular, a prominent example is that of the Lasso problem, which will be discussed in Section 5.1. The following theorem, which is proved in Appendix B, relates the smoothness parameters to the block Lipschitz constants of the partial gradients of $\mathsf{f}$ and to the Lipschitz constants of the gradients of its components in (3.1), as well as to the distribution of the random variable $\varepsilon$.
Theorem 3.1.
Assume H1 and H4 and let $(\nu_{i})_{1\leq i\leq m}\in\mathbb{R}_{++}^{m}$. Then the following hold.
- (i) L1 $\Rightarrow$ S1 provided that
[TABLE]
- (ii) L1 $\Rightarrow$ S2 provided that
[TABLE]
- (iii) L2 $\Rightarrow$ S1 provided that
[TABLE]
- (iv) S1 $\Rightarrow$ L1 with, for all , . In particular, S1 implies that $\mathsf{f}$ is Lipschitz smooth.
Remark 3.2**.**
- (i)
Suppose that in 1, for all , , is -Lipschitz smooth, and, for all , , the canonical embedding of into (see Remark 1.2). Then, 2 holds and, for every , . Hence, in view of Theorem 3.1iii, 1 is met with . This setting was studied in [23]. 2. (ii)
If is -Lipschitz continuous, then 1 is satisfied with, for every , . Therefore, we cover the analysis of the random block-coordinate forward-backward algorithm given in [4, 5] which set the stepsizes as . 3. (iii)
Let, for every , {\mathsf{f}}_{k}(\mathsf{x})={\mathsf{g}}_{k}\big{(}\sum_{i=1}^{m}\mathsf{U}_{k,i}\mathsf{x}_{i}\big{)}. If, for every , 1 (resp. 2) holds for with , then 1 (resp. 2) holds for with . 4. (iv)
Using similar ideas as in the proof of [34, Theorem 12] we show in Appendix B that item i in Theorem 3.1 remains true with
[TABLE]
Remark 3.3**.**
Referring to Theorem 3.1, for all , we have , where
[TABLE]
is the maximum number of blocks processed in parallel. Indeed, since we have . Moreover, since \max_{1\leq k\leq p}\big{(}\sum_{i\in I_{k}}\varepsilon_{i}\big{)}\geq\max_{1\leq i\leq m}\varepsilon_{i}, we have . The inequality is immediate, while the last one derives from the following
[TABLE]
3.2 The smoothness parameters for some special block samplings.
Here we show how to compute (or estimate) the constants and in Theorem 3.1 and Remark 3.2iv, and the related , in some relevant scenarios, when 1 and 1 are satisfied.
Arbitrary parallel sampling.
It follows from Theorem 3.1ii and Remark 3.3 that for an arbitrary (possibly nonuniform) block sampling , 2 is satisfied provided that , for every . Additionally, if we denote by the blockwise Lipschitz constants of the gradient of the function \mathsf{x}\mapsto{\mathsf{g}}_{k}\big{(}\sum_{i=1}^{m}\mathsf{U}_{k,i}\mathsf{x}_{i}\big{)}, then we derive from Remark 3.2iii and the above discussion that 2 holds with111If , then . . However, the above estimates are rather conservative and can be improved for special choices of the block sampling as we will show below. We refer to [31, 35] for further results on nonuniform samplings.
Serial sampling or full separability.
Suppose that or . Then, Remark 3.3 yields . Moreover, recalling Theorem 3.1ii-iv, this also shows that 1, 2, and 1 are indeed equivalent with the same smoothness parameters . So, conditions 1 or 2 find their justification only in the parallel case () and when is not fully separable ().
Fully Parallel.
If , then for every . This yields a fully parallel (deterministic) algorithm. Moreover, since \mathsf{P}\big{(}\varepsilon=(1,\dots,1)\big{)}=1, we have and hence 2 holds with . Actually, also 1 holds with (see Corollary A.2iv).
Uniform samplings.
Suppose that . The sampling is uniform if , with . In this case if we denote by the average number of block updates per iteration, we have and hence , for every . In [34] several types of uniform samplings are studied. In the following we single out two of them. The sampling is said to be doubly uniform if any two sets of blocks with the same number of blocks have the same probability to be chosen. In formula, this means that for every such that , . For such sampling one directly derives from (3.3) in Remark 3.2iv (see Appendix B) that
[TABLE]
A special type of doubly uniform sampling is the -nice sampling in which -a.s. for some . In this case (3.4) reduces to
[TABLE]
Now, according to Remark 3.2iv, if we set, for every , , then condition 1 holds. Additionally, if we denote by the blockwise Lipschitz constants of the gradient of the function \mathsf{x}\mapsto{\mathsf{g}}_{k}\big{(}\sum_{i=1}^{m}\mathsf{U}_{k,i}\mathsf{x}_{i}\big{)}, then we derive from Remark 3.2iii and (3.5) that 1 is satisfied with . This result provides possibly even smaller values for the parameters and was given, in the special setting of Remark 3.2i, in [13, 38].
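The $\tau$-nice sampling described above can be sketched as follows. The function names are hypothetical: `tau_nice_sampling` draws a set of exactly `tau` blocks uniformly among all subsets of that size and returns the corresponding 0/1 vector, and `empirical_probabilities` estimates the marginal selection probabilities by Monte Carlo. By symmetry each block should be selected with probability $\tau/m$, which is what one expects from a uniform sampling performing $\tau$ block updates per iteration.

```python
import random


def tau_nice_sampling(m, tau, rng):
    """Return eps in {0,1}^m with exactly tau ones, uniform over all such vectors."""
    chosen = set(rng.sample(range(m), tau))
    return [1 if i in chosen else 0 for i in range(m)]


def empirical_probabilities(m, tau, n_draws, seed=0):
    """Monte Carlo estimate of P(eps_i = 1) for each block i."""
    rng = random.Random(seed)
    counts = [0] * m
    for _ in range(n_draws):
        eps = tau_nice_sampling(m, tau, rng)
        for i, e in enumerate(eps):
            counts[i] += e
    return [c / n_draws for c in counts]
```

Since every draw selects exactly `tau` blocks, the estimated probabilities always sum to `tau`, regardless of the number of draws.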
4 Convergence analysis
In the rest of the paper, referring to Algorithm 1.1, we set
[TABLE]
where is the identity operator on , and
[TABLE]
Then, we have
[TABLE]
and, recalling (1.2), that for every such that ,
[TABLE]
Note that and are functions of the random variables only, hence they are both discrete random variables, which are measurable with respect to .
4.1 An abstract principle for stochastic convergence
We provide an abstract convergence principle for stochastic descent algorithms in the same spirit of [36, Theorem 3.10]. It simultaneously addresses the convergence of the iterates and that of the function values.
Theorem 4.1**.**
Let be a separable real Hilbert space with norm . Let be a proper, lower semicontinuous, and convex function and set and . Let be a sequence of -valued random variables such that and, for every , is -summable. Consider the following conditions
- (P1) $(\mathsf{E}[{\mathsf{\Phi}}(x^{n})])_{n\in\mathbb{N}}$ is decreasing.
- (P2) There exist a sequence of sub-sigma algebras of such that, and is -measurable, a sequence of -measurable real-valued positive random variables such that , and such that, for every and ,
[TABLE]
- (P3) There exist and , sequences of -valued random variables, such that , , and -a.s.
Assume P1 and that $(\inf_{n\in\mathbb{N}}\mathsf{E}[{\mathsf{\Phi}}(x^{n})]>-\infty)\Rightarrow$ P2. Then, the following hold.
- (i)
. 2. (ii)
Suppose that . Then and,
[TABLE] 3. (iii)
Suppose that 3 holds and . Then, there exists a random variable taking values in such that -a.s.
Proof.
Taking the expectation in (4.5), we obtain
[TABLE]
i: Since is decreasing, . Thus, the statement is true if . Suppose that and let . Then, 2 holds and the right hand side of (4.6), being summable, converges to zero. Therefore, . Since is arbitrary in , .
ii: Let . Then, . Hence 2 holds and (4.6) yields
[TABLE]
Therefore, . Since is decreasing, the statement follows from Fact 2.4.
iii: Let . Then 2 holds and, since , we derive from (4.5) that,
[TABLE]
Note that and are -measurable. Moreover and hence -a.s. Therefore is a stochastic quasi-Fejér sequence with respect to [11]. Then, in view of [4, Proposition 2.3(iv)] it is sufficient to prove that the weak limit points of are contained in -a.s. By assumption 3 there exist two sequences of -valued random variables and and , such that, for every , , , . Let and let be a subsequence of such that , for some . Then,
[TABLE]
Since is weakly-strongly closed [1], we have , so . ∎
Remark 4.2**.**
Inequalities similar to (4.5) appear implicitly in the analysis of several deterministic and stochastic algorithms [3, 18, 26], to get rate of convergence for the function values. Moreover, (4.5) is related also to the concept introduced in [20], in a deterministic setting.
4.2 Convergence under convexity and strong convexity assumptions
In this section we address the convergence of Algorithm 1.1 in the convex and strongly convex case. The main results are the rate of convergence for the mean of the function values and the almost sure weak convergence of the iterates. We start by recalling a standard result (see [36, Lemma 3.12(iii)]). Here we give a slightly more general version, including the moduli of strong convexity. The proof is given in Appendix B for the reader’s convenience.
Lemma 4.3**.**
Let be a real Hilbert space. Let be differentiable and convex with modulus of strong convexity and be proper, lower semicontinuous, and convex with modulus of strong convexity . Let and set . Then, for every ,
[TABLE]
Proposition 4.4**.**
Let 1–1 be satisfied. Let and suppose that 1 holds. Let be generated by Algorithm 1.1 with, for every , . Set and . Let be as in (4.1) and and be the moduli of strong convexity of and respectively, in the norm . Set . Then,
[TABLE]
Proof.
Let and . Since for all , , we derive from Lemma 4.3, written in the norm , and (4.3) that
[TABLE]
Now, we majorize . By Fact 2.2 and Fact 2.3, we have
[TABLE]
Moreover,
[TABLE]
where in the last inequality we used that
[TABLE]
which was obtained from (2.3) with
[TABLE]
Therefore,
[TABLE]
Next, it follows from (4.3), 1, and Fact 2.2 that
[TABLE]
Then, we derive from (4.11) that
[TABLE]
The statement follows from (4.9), considering that
[TABLE]
Proposition 4.5**.**
Let 1–1 be satisfied. Let and be as in (4.1) and be generated by Algorithm 1.1. Let and be an -valued random variable which is measurable w.r.t. . Then
[TABLE]
and .
Proof.
It follows from (4.3), Fact 2.2, and Fact 2.3 that
[TABLE]
The second equation follows from (4.12), by choosing . ∎
The following result is a stochastic version of [36, Proposition 3.15].
Proposition 4.6**.**
Let 1–1 be satisfied. Let and suppose that 1 holds. Let be generated by Algorithm 1.1 with, for every , . Set and . Let and be as in (4.1) and and be the moduli of strong convexity of and respectively, in the norm . Set . Then, the following hold.
- (i)
* is decreasing.* 2. (ii)
Suppose that . Then,
[TABLE] 3. (iii)
For every and every
[TABLE]
Proof.
Let and . Since
[TABLE]
we derive from (4.8), multiplied by , that
[TABLE]
Then for an -valued -measurable random variable , Proposition 4.5 yields
[TABLE]
Taking in (4.14), we have
[TABLE]
which plugged into (4.14), with , gives iii. Moreover, taking the expectation in (4.15), we obtain
[TABLE]
which gives i. Finally, set for all , \xi_{n}=\mathsf{E}\big{[}{\mathsf{F}}(x^{n})-{\mathsf{F}}(x^{n+1})\big{|}\mathfrak{E}_{n-1}\big{]}\geq 0. Then
[TABLE]
This shows that if , then is -integrable and hence it is -a.s. finite. Then ii follows from (4.15) and Proposition 4.5. ∎
Proposition 4.7**.**
Under the same assumptions as Proposition 4.6, suppose that condition S1 is replaced by condition S2. Then
[TABLE]
Proof.
We derive from 2 (since has the same distribution of ) and (4.3) that
[TABLE]
Therefore, summing (4.10), from to , we have
[TABLE]
Hence -a.s. ∎
Proposition 4.8**.**
Under the assumptions of Proposition 4.6, suppose in addition that is bounded from below. Then, there exist and , sequences of -valued random variables, such that the following hold.
- (i)
* -a.s.* 2. (ii)
* and -a.s.*
Proof.
It follows from (4.2) that, , for all and . Hence
[TABLE]
Set and let be such that, for every ,
[TABLE]
Clearly is measurable and hence it is a random variable. Moreover, for every ,
[TABLE]
Now, since is bounded from below, Proposition 4.6ii yields that is summable -a.s. and hence -a.s. The statement follows from the fact that is Lipschitz continuous (see Theorem 3.1iv). ∎
Now we are ready to state one of the main convergence results of this paper. On the one hand, it extends to the stochastic setting a well-known convergence rate of the (deterministic) forward-backward algorithm [7, 15, 36]. On the other hand, it proves the almost sure weak convergence of the iterates of Algorithm 1.1 in the convex case. We stress that none of the works [21, 23, 31, 33, 34, 35, 38] addresses this latter aspect. To the best of our knowledge, [4] is the only work that proves almost sure weak convergence of the iterates. However, in [4, Corollary 5.11] the stepsize is set according to the (global) Lipschitz constant of $\nabla\mathsf{f}$, which, in general, leads to smaller stepsizes and worse upper bounds on convergence rates. See the subsequent discussion.
Theorem 4.9**.**
Let 1–1 be satisfied. Let and suppose that 1 holds. Let be generated by Algorithm 1.1 with, for every , . Set and . Let be as in (4.1) and set , , and . Then, the following hold.
- (i)
. 2. (ii)
Suppose that . Then and, for every integer ,
[TABLE]
Moreover, there exists a random variable taking values in such that -a.s.
Proof.
Proposition 4.6iii with gives, for all and ,
[TABLE]
where
[TABLE]
Note that the random variables ’s are discrete with finite range and is decreasing. Moreover, . Therefore, the statement follows from Theorem 4.1 and Proposition 4.8. ∎
Discussion.
In the following we examine some crucial aspects related to Algorithm 1.1. We suppose that 1 and 1 hold and that for every , with .
The benefit of a parallel block update.
Here we discuss the advantage of updating multiple blocks in parallel instead of just a single block. We consider the setting of a $\tau$-nice uniform block sampling, which was described in Section 3.2. In this case, for every , with , and, since , we have, for every , . Moreover, we can set for every , with defined as in (3.5). In order to compare different choices of $\tau$, we normalize the iterations so as to match the same computational cost per iteration of the standard (full parallel) forward-backward algorithm (FB). It follows from (4.18) that after iterations of Algorithm 1.1, which have the same total computational cost of iterations of FB, we have
[TABLE]
where . Now, since, , we see that if and , then is close to 1 (and is close to zero), so that nearly does not depend on , as long as remains sufficiently small. For instance, in Section 6 we consider the setting where and . In such case, if we let , the corresponding ’s and ’s are essentially the same so that the right hand side of (4.19) does not change much. Therefore, the above options for require the same total amount of computations (i.e., block-coordinate updates) and lead essentially to the same improvement in the objective function. However, a parallel implementation, say with on a CPU with cores, will be times faster than a serial implementation () which uses only one core per iteration. In summary, in the large scale ( large) and sparse () setting, the parallel strategy ( and equal to the number of CPU cores) is definitely advantageous provided that is sufficiently small compared to .
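The effect described in this paragraph can be tabulated numerically. The sketch below assumes that the constant attached to a $\tau$-nice sampling has the form familiar from the ESO literature, $\beta=1+\frac{(\eta-1)(\tau-1)}{\max\{1,m-1\}}$, where $m$ is the number of blocks and $\eta$ bounds the number of blocks each component of $\mathsf{f}$ depends on. This formula, the function names, and the numerical values used below are assumptions for illustration and should be checked against (3.5).

```python
def eso_beta(m, eta, tau):
    """Assumed tau-nice constant: 1 + (eta - 1) * (tau - 1) / max(1, m - 1)."""
    return 1.0 + (eta - 1) * (tau - 1) / max(1, m - 1)


def beta_table(m, eta, taus):
    """Return (tau, beta) pairs for candidate numbers of blocks updated in parallel."""
    return [(tau, eso_beta(m, eta, tau)) for tau in taus]
```

For instance, `beta_table(100000, 20, [1, 2, 4, 8, 16])` gives constants that all stay within one percent of 1, so under this assumed formula the admissible stepsizes barely shrink as `tau` grows while the blocks can be processed on separate cores. At the other extreme, `tau = m` gives `beta = eta`.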
Comparison with [4].
The almost sure weak convergence of the iterates of Algorithm 1.1 is also obtained in [4], but with stepsizes set according to the global Lipschitz constant of the gradient of . Let be the Lipschitz constant of and note that is also Lipschitz smooth in the norm , defined by the operator , with constant (see Corollary A.2iv). Therefore, the results in [4] can be applied in the original norm or in the norm . In this respect we note that since
[TABLE]
the implementation in the norm is nothing but Algorithm 1.1 with stepsizes . In both cases Corollary 5.11 in [4], applied in the corresponding norms, proves weak convergence of the iterates for Algorithm 1.1 with stepsizes and respectively. However, Theorem 4.9, together with Theorem 3.1, allows to set the stepsizes as . Since may be much smaller than and, in view of Remark 3.3, it is always smaller than , Theorem 4.9 provides a significant improvement over [4] in terms of flexibility in the stepsizes.
The advantage over the standard FB.
We consider the forward-backward algorithm (FB) in the original norm of and in the norm . All the remarks about the stepsizes discussed in the previous paragraph apply also here. Moreover, in the case , standard convergence rate for FB (see e.g., [7]) yields that after iterations, we have
[TABLE]
depending on which of the two above implementations of FB we consider. In order to appropriately compare the rates (4.20) with that of Algorithm 1.1 given in Theorem 4.9 in the following we set and analyze two choices of the block sampling.
- (i) Assume that we perform a -nice block sampling. Then we saw that the (normalized) convergence rate of Algorithm 1.1 is (4.19). We first note that (4.19) reduces to the second inequality in (4.20) when and . Comparing the bounds in (4.20) with (4.19) (with ) we see that, if we assume that the terms and are of about the same magnitude, then Algorithm 1.1 always features a better rate than FB if implemented in the norm (since ), whereas if FB is implemented in the original norm of , Algorithm 1.1 is still a better choice provided that .
- (ii) Suppose that the block sampling performs on average updates per iteration and that, for every , is proportional to the Lipschitz constant , that is, (provided that ). In this case, as stated at the beginning of Section 3.2, we can let and . Then, (4.18) becomes for and ,
[TABLE]
where and . Here we see that Algorithm 1.1 can be superior to FB if , under the assumption that .
We now provide an additional convergence theorem, analyzing the strongly convex case, which extends [21, Theorem 1] and [38, Theorem 3] to an arbitrary (not necessarily uniform) sampling and to the more general stepsize rule (1.5). The proof is still based on Proposition 4.6iii and is postponed to Appendix B.
**Theorem 4.10.**
Under the same assumptions as Theorem 4.9, let and be the moduli of strong convexity of and respectively, in the norm , and suppose that and that . Then, for every ,
[TABLE]
where
[TABLE]
**Remark 4.11.**
Let and . Let and be the moduli of strong convexity of and respectively, in the norm . Then it is easy to see that and . Moreover, as in Remark 2.1, one can also see that . Then, (4.21) becomes
[TABLE]
One can check that the maximum of with respect to is
[TABLE]
which is achieved at
[TABLE]
Note that if (as is normally the case), then and .
**Remark 4.12.**
Suppose that and are the moduli of strong convexity of and respectively in the original norm. Let, for every , and set . Then and . Therefore, the optimal stepsizes are achieved for
[TABLE]
and the corresponding rate in Theorem 4.10 becomes
[TABLE]
**Remark 4.13.**
Suppose that the block sampling is uniform, that is, for all and let, for every , . Then and Theorem 4.10 reduces to
[TABLE]
This result was obtained in [38, Theorem 3], which is in turn a generalization of [21, Theorem 1], treating the serial case (). Thus, Theorem 4.10 and the subsequent Remark 4.11 show that the rate in (4.26) can indeed be improved by choosing .
4.3 Linear convergence under error bound conditions
In this section we analyze the convergence of Algorithm 1.1 under error bound conditions. We improve and simplify the results given in [23]. In the rest of the section we assume 1 and 2. Moreover, we let , , , and suppose .
We consider the following condition, which was studied in [8] in connection with the proximal gradient method and is known as Luo-Tseng error bound condition [22].
- (EB)
For some , we have
[TABLE]
**Remark 4.14.**
- (i) Another popular error bound condition is that of the metric subregularity of the subdifferential. More precisely, is -metrically subregular on with respect to the metric [8, 15] if for some the following holds
[TABLE]
- (ii) 1 and (4.28) are equivalent if , since in that case and .
- (iii) Since and , it follows that if for every , , then (4.28) holds with constant .
- (iv) [8, Theorem 3.5] yields that for any Hilbert norm , . So, if 1 holds on , then (4.28) holds on with . In [8, Theorem 3.4-3.5] also the reverse implication was shown when is Lipschitz smooth and is a sublevel set of .
**Remark 4.15.**
In [8, Corollary 3.6] condition 1 was shown to be equivalent to the following quadratic growth condition (also called -conditioning in [15])
[TABLE]
on every sublevel set . Moreover, the relationships between the constants are and . Finally, if the quadratic growth condition (4.29) holds, then (4.28) holds on , with .
We now analyze the convergence of Algorithm 1.1 under condition 1.
**Theorem 4.16.**
Under the assumptions of Theorem 4.9, suppose that and that 1 holds on a set such that $\mathsf{X}\supset\{x^{n}\,|\,n\in\mathbb{N}\}$ -a.s. with . Then,
[TABLE]
Moreover, there exists a random variable which takes values in such that -a.s. and $\mathsf{E}[\lVert x^{n}-x_{*}\rVert_{\mathsf{W}}]=O\big((1-\mathsf{p}_{\min}\min\{1,(2-\delta)/(2c_{\mathsf{X},\mathsf{\Gamma}^{-1}})\})^{n/2}\big)$.
Proof.
Let and . Then, (4.8) with , yields
[TABLE]
Since (4.31) holds for all , using 1, (4.15), and Proposition 4.5, we have
[TABLE]
which can be equivalently written as
[TABLE]
Therefore,
[TABLE]
which gives (4.30). Now we set $\rho=1-\mathsf{p}_{\min}\min\{1,(2-\delta)/(2c_{\mathsf{X},\mathsf{\Gamma}^{-1}})\}$ and . Then, Jensen's inequality, (4.16), and (4.30) yield
[TABLE]
Therefore, since , we have . Hence -a.s., which means that is a Cauchy sequence -a.s. Now, Theorem 4.9ii yields that there exists a random variable with values in such that -a.s. Therefore, -a.s. Finally, let . Then, for every ,
[TABLE]
Hence, letting , we have . Therefore, it follows from (4.33) that
[TABLE]
**Remark 4.17.**
- (i) The rate given in Theorem 4.16 matches the one given in [8, Theorem 3.2] for the deterministic case ().
- (ii) In Theorem 4.16, the constant depends on the stepsizes ’s which in turn depend on (usually with ). Therefore, the optimal value of in the rate (4.30) can be determined after specifying the expression of . We did so in the special application of Section 5.2.
- (iii) In [23, Definition 5.2], in relation to Algorithm 1.1 but with uniform block sampling and assuming 1 and , the following error bound condition is considered
[TABLE]
for some constants and . The authors show several examples in which such a condition is satisfied with and possibly . The above error bound looks more general than 1. However, for the purpose of analyzing Algorithm 1.1 and under the assumptions considered in [23] this is not the case. Indeed, in [23, equation (3.11)] it was shown that the algorithm is descending almost surely (alternatively, note that 1 implies 2 which in turn, in view of Proposition 4.7, ensures the descending property), so -a.s. Therefore, since , if (4.34) holds on a set containing -a.s. the set , then 1 holds on -a.s. with . Thus, Theorem 4.16 applies accordingly. Moreover, [23, Theorem 5.5] gives the linear rate
[TABLE]
Then, we have and hence
[TABLE]
This shows that Theorem 4.16 improves the rate in [23, Theorem 5.5]. Moreover, the analysis given here, relying on Proposition 4.6, relaxes the assumptions and is significantly simpler.
- (iv) It follows from [23, Theorem 6.8] that if is a quadratic function and is an indicator function of a polyhedral set, then (4.34) is satisfied on . Therefore, if is bounded, then 1 holds on with and Theorem 4.16 can be applied, since .
- (v) Several works address the convergence of random coordinate descent methods under error bound conditions. We mention [16] which considers a serial sampling and stepsizes set according to the global Lipschitz constant of and [14], which analyzes restarting procedures for accelerated and parallel coordinate descent methods using assumptions 1 and (4.29).
**Remark 4.18.**
Often error bound conditions or quadratic growth conditions are satisfied when is a sublevel set (see Remark 4.15). So, in such scenarios, except when is a sublevel set of , in order to fulfill the assumption -a.s. in Theorem 4.16, it is desirable for Algorithm 1.1 to be (a.s.) descending. This occurs if condition 2 holds (Proposition 4.7), whereas, in general, 1 does not guarantee any such descending property. However, especially when , condition 2 may be much more restrictive than 1, thus leading to a significant reduction of the stepsizes, which ultimately slows down the convergence. The next result shows that Algorithm 1.1 can be slightly modified so as to ensure the descending property while keeping the validity of Theorem 4.16.
**Theorem 4.19.**
Let 1–1 be satisfied. Let and suppose that 1 holds. Suppose in addition that , and that 1 holds on the set with . Let be generated by the following variation of Algorithm 1.1
[TABLE]
Then the conclusions of Theorem 4.16 still hold.
Proof.
It follows from the definition of that . Therefore . Recalling (4.2) and (4.3), algorithm (4.35) can be alternatively written as
[TABLE]
and we have . Then we can essentially repeat the argument in the proof of Theorem 4.16. First we note that (4.8) and hence (4.31) holds with replaced by . This follows from the definition of . Moreover, also (4.13) holds with replaced by and hence we derive (with and ) that
[TABLE]
Then, we have
[TABLE]
and hence, since , (4.32) still holds (for the new definition of ). Thus, (4.30) follows. As for the second part of the statement, we note that, since , by (4.37), we have . Moreover, it follows from the definitions of and in algorithm (4.36) that
[TABLE]
and hence, by Proposition 4.5, we have
[TABLE]
In the end (4.16) with still holds and the proof can continue as in that of Theorem 4.16. ∎
5 Applications
In this section we show some relevant optimization problems for which the theoretical analysis of Algorithm 1.1 can be particularly useful.
5.1 The Lasso problem
Many papers study the convergence of coordinate descent methods for the Lasso problem and recent works prove linear convergence (see, e.g., [16, 23, 25]). In [16] a random serial update of blocks is considered, while [25] analyzes the general framework of feasible descent methods, which includes (nonrandom) cyclic coordinate methods. In the following we discuss our contribution in comparison with [23]. Let and . We consider the problem
[TABLE]
We denote by and the -th column and -th row of respectively. Since
[TABLE]
1 holds and . Moreover, since and recalling Remark 3.2i, conditions 1 and 2 are satisfied with and respectively. Then, Algorithm 1.1 (assuming that each block is made of one coordinate only) writes as
[TABLE]
where the soft thresholding operator is defined as and is the canonical basis of . Now, define . Then multiplying the equation in (5.2) by and subtracting by both terms, the algorithm is equivalently written as
[TABLE]
showing that each iteration costs multiplications, where is the maximum number of block updates per iteration.

We now address the determination of the smoothness parameters . We first give a general rule which holds for an arbitrary sampling. Recalling Theorem 3.1iii and Remark 3.2i and noting that , if we set , then 1 and hence 2 holds. This choice was considered in [23]. Moreover, according to the discussion at the beginning of Section 3.2 other options for satisfying 2 are or . This latter choice is better than the second one and, if we assume that the nonzero entries of are of about the same magnitude, it is also better than the first one. Next, we consider the special case of the -nice sampling, which allows one to reduce the ’s while satisfying 1. Recalling the corresponding discussion in Section 3.2, we have the following alternatives: (1) set, for every , ; (2) set for every , .

Finally, we make a few remarks on the convergence properties of algorithm (5.3). Since the objective function in (5.1) satisfies a quadratic growth condition on its sublevel sets [15, Example 3.8], Remark 4.15, Remark 4.18, and Theorem 4.16 yield linear convergence of algorithm (5.3) provided that 2 holds. By contrast, Theorem 4.19 ensures that if we modify algorithm (5.3) so that we accept the next iterate only if , then the resulting algorithm converges linearly under condition 1. If the violation of the monotonicity condition above occurs only a few times along all the iterations, this modification does not increase much the computational cost of the algorithm (see also Section 6). We stress that both Theorem 4.16 and Theorem 4.19 also ensure almost sure convergence and linear convergence in mean of the iterates of (5.2). This latter result is new and is especially relevant in this context, since the iterates carry sparsity information.
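The residual-based form of the iteration described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`soft_threshold`, `parallel_cd_lasso`) are hypothetical, the objective is taken as 0.5*||Ax - b||^2 + lam*||x||_1, the sampling is a tau-nice one, and the smoothness parameters are set to the deliberately conservative value nu_i = tau*||a^i||^2, which is safe for any sampling updating at most tau coordinates but is not one of the sharper rules discussed in the text.

```python
import numpy as np

def soft_threshold(t, kappa):
    """Soft thresholding: sign(t) * max(|t| - kappa, 0), applied entrywise."""
    return np.sign(t) * np.maximum(np.abs(t) - kappa, 0.0)

def parallel_cd_lasso(A, b, lam, tau, n_iter, delta=1.0, seed=0):
    """Random parallel coordinate proximal gradient for
    0.5*||Ax - b||^2 + lam*||x||_1 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m = A.shape[1]
    col_sq = (A ** 2).sum(axis=0)      # ||a^i||^2: coordinate-wise Lipschitz constants
    nu = tau * col_sq                  # ASSUMPTION: conservative smoothness parameters
    gamma = delta / nu                 # stepsizes gamma_i = delta / nu_i
    x = np.zeros(m)
    r = -np.array(b, dtype=float)      # residual r = A x - b, kept up to date
    history = [0.5 * r @ r + lam * np.abs(x).sum()]
    for _ in range(n_iter):
        I = rng.choice(m, size=tau, replace=False)   # tau-nice sampling
        grad_I = A[:, I].T @ r                       # partial gradients from the residual
        x_new_I = soft_threshold(x[I] - gamma[I] * grad_I, gamma[I] * lam)
        r += A[:, I] @ (x_new_I - x[I])              # only tau columns are touched
        x[I] = x_new_I
        history.append(0.5 * r @ r + lam * np.abs(x).sum())
    return x, np.array(history)
```

With this conservative choice and `delta=1.0` the objective values in `history` are nonincreasing, so the acceptance test of the modified algorithm would never reject a step; with the smaller parameters tailored to the tau-nice sampling, occasional violations become possible.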
5.2 Computing the minimal norm solution of a linear system
Let and (the range of ). Let us consider the problem
[TABLE]
Here, we denote by and the -th row and the -th column of . The dual problem is
[TABLE]
which is a smooth convex optimization problem. Moreover, if is the solution of (5.4) and (the primal-dual relationship), then we have
[TABLE]
Then, the dual problem is clearly of the form (1.1), with , and 1 and 1 are satisfied, assuming that each block is made of one coordinate only, with and . So, Algorithm 1.1 applied to (5.5), turns into
[TABLE]
Now, setting and multiplying the above equality by , we have
[TABLE]
Since, , it is easy to see, through a singular value decomposition of , that, for every , (where is the minimum singular value of ) [15, Example 3.6]. So, in view of Remark 4.14ii-iii, 1 is satisfied on the entire space with constant . Therefore, if, for every , with , Theorem 4.16 and (5.6) ensure the linear convergence of the iterates ’s towards the solution of (5.4) with rate $\big(1-\mathsf{p}_{\min}\min\{1,\gamma_{\min}\sigma^{2}_{\min}(\mathsf{A})(2-\delta)/2\}\big)^{1/2}$. We remark that (5.7) is nothing but a stochastic gradient descent algorithm on the problem
[TABLE]
Since , we have thus shown the linear convergence rate
[TABLE]
of the stochastic gradient descent with arbitrary and possibly variable batch size for least squares problems. This also shows that the best rate is achieved for . We finally note that in the serial case, that is, if for every , multiplying equation (5.7) by , we have
[TABLE]
Therefore, since in this case , we can choose the stepsizes such that (so that ) and hence is a solution of the -th equation of the linear system . Moreover, is the projection of onto the affine space defined by the equation [41]. Thus, this method is nothing but the randomized Kaczmarz method [37] and we proved linear convergence for general probabilities ’s, although the constants we derive are not optimal (see [19, 37, 41]).
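The serial case discussed above, where each step projects the current iterate onto the hyperplane of one randomly chosen equation, can be illustrated as follows. This is a minimal sketch of the classical randomized Kaczmarz iteration with user-supplied row probabilities; the function name and arguments are hypothetical, the rows of `A` are assumed nonzero, and the system is assumed consistent. Starting from zero keeps the iterates in the range of the transpose of `A`, so the limit is the minimal norm solution.

```python
import numpy as np

def randomized_kaczmarz(A, b, probs, n_iter, seed=0):
    """Randomized Kaczmarz with arbitrary row probabilities (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m, d = A.shape
    row_sq = (A ** 2).sum(axis=1)                # squared norms of the rows
    x = np.zeros(d)                              # zero start: iterates stay in range(A^T)
    rows = rng.choice(m, size=n_iter, p=probs)   # serial sampling, one row per step
    for i in rows:
        a = A[i]
        # Orthogonal projection of x onto the hyperplane {z : <a, z> = b_i}.
        x -= (a @ x - b[i]) / row_sq[i] * a
    return x
```

For a consistent system, the output can be compared with `np.linalg.pinv(A) @ b`, which is the minimal norm solution.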
5.3 Regularized empirical risk minimization
Let be a separable real Hilbert space. Regularized empirical risk estimation solves the following optimization problem
[TABLE]
where is the training set (input-output pairs), , , is the loss function, which is convex in the second variable, and is a regularization parameter. The dual problem of (5.8) is
[TABLE]
where is the Fenchel conjugate of and is the Gram matrix of . Moreover, the solutions of the primal and dual problems are characterized by the following KKT conditions
[TABLE]
where is the subdifferential of . Note also that the first of (5.10) gives the link between the dual and the primal variable and, if , then it holds . Now, the dual problem (5.9) is of the form (1.1) and hence Algorithm 1.1 can be applied. The following examples give implementation details for two specific losses.
**Example 5.1 (Ridge regression).**
The least squares loss is . Then and, in this case, (5.9) reduces to
[TABLE]
which is strongly convex with modulus and has solution . Since is smooth and , conditions 1 and 1 hold with and Algorithm 1.1 (with ) becomes
[TABLE]
Moreover, multiplying (5.11) by , defining , and recalling that , we have
[TABLE]
Note that, since the dual problem is strongly convex with modulus , then it follows from Theorem 4.10, Remark 4.12, and Theorem C.1i that, setting, for every , and , we have
[TABLE]
Now, we compare algorithm (5.12) with the stochastic gradient descent on problem (5.8). Assume that for some . Then, and we can take , and set and , so that algorithm (5.12) turns into
[TABLE]
If we apply stochastic gradient descent with batch size and stepsize directly on the primal problem (5.8) (multiplied by ), and recalling that , we have
[TABLE]
Then, comparing (5.13) and (5.14) we see that, provided that for every , they differ only in the replacement . We stress that the stepsize in the stochastic gradient descent algorithm (5.14) is normally set according to the spectral norm of , which may be difficult to compute. By contrast, in algorithm (5.13) the stepsizes ’s are simply set as , so they allow possibly much longer steps and also do not require any SVD computation.
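To make the point about stepsizes concrete, here is a hedged sketch of a parallel random coordinate gradient method on a dual ridge regression problem. The scaling is an assumption and need not match (5.11): the dual objective is taken as 0.5*u^T (K + lam*I) u - y^T u, whose minimizer solves (K + lam*I) u = y. The function name is hypothetical and the stepsizes use the conservative rule gamma_i = 1/(tau*(K_ii + lam)), which only reads the diagonal of the Gram matrix and needs no spectral computation.

```python
import numpy as np

def parallel_cd_ridge_dual(K, y, lam, tau, n_iter, seed=0):
    """Parallel random coordinate gradient descent on the ASSUMED dual
    0.5*u^T (K + lam*I) u - y^T u (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = y.size
    Q = K + lam * np.eye(n)
    L = np.diag(Q).copy()              # coordinate-wise Lipschitz constants K_ii + lam
    gamma = 1.0 / (tau * L)            # ASSUMPTION: conservative stepsizes, no SVD needed
    u = np.zeros(n)
    for _ in range(n_iter):
        I = rng.choice(n, size=tau, replace=False)   # tau-nice sampling
        grad_I = Q[I] @ u - y[I]                     # partial gradients at the current u
        u[I] -= gamma[I] * grad_I                    # all sampled coordinates move at once
    return u
```

With a linear kernel, `K = X @ X.T`, the primal ridge solution under this scaling is recovered as `X.T @ u`.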
**Example 5.2 (Support vector machines).**
The hinge loss is . Then we have and the dual problem (5.9) is
[TABLE]
Then Algorithm 1.1 on the dual turns into a parallel random block-coordinate projected gradient descent method. Moreover, it follows from Remark 4.17iv that the objective in (5.15) satisfies 1 on its domain. Therefore, it follows from Theorem 4.16, Theorem 3.1i, and Theorem C.1ii that converges linearly to zero, provided that, for all , and .
6 Numerical Experiments
In this section we consider a Lasso problem, that is,
[TABLE]
where is generated with random entries uniformly distributed in so that each row is sparse and with and a sparse vector in . We implement Algorithm 1.1 with () and a -nice uniform sampling, as described in Section 5.1. We present two experiments. The first compares conditions 1 and 2 for the determination of the stepsizes . The second one investigates the role played by . In all the experiments we empirically checked that the algorithm is essentially descending, in the sense that during the iterations there are very few violations of the descent property and those that occur have low magnitude. So, since the objective function in (6.1) satisfies 1 on the sublevel sets, by virtue of Theorem 4.19, linear convergence holds.
Condition 1 vs 2 and the effectiveness of the parallel strategy.
We compare conditions 1 and 2 for the stepsize selection and check the critical role played by 1 for the effectiveness of the parallel strategy on problems with sparse structure. Here we set . In Figure 1, Algo1 uses smoothness parameters specifically designed for the -nice sampling, that is, with (making 1 satisfied), while Algo2 uses a more conservative choice for the smoothness parameters which is valid for any sampling updating a maximum of blocks per iteration, that is with (making also 2 satisfied).
In the left diagram, we considered a large scale setting with . In that case may be much smaller than leading to significantly larger stepsizes. Moreover, and more importantly, we note that as long as is small enough the behavior of Algo1 does not depend on (indeed perform equally well), whereas this is not true for Algo2. This feature of 1, first noted in [34], has been already discussed after Theorem 4.9 and is at the basis of the effectiveness of the parallel strategy. Indeed in the small- regime described above, the various versions of Algo1 depicted in Figure 1(left) have the same total computation cost ( block-coordinate updates), but the parallel implementation on cores is times faster than the serial one (). Finally, in the right diagram of Figure 1 we show a scenario in which is larger (the problem is less sparse). In such a situation we see that the difference between the two stepsize selection criteria is less evident for . Moreover, Algo1 is more sensitive to (compare and ), so that the benefit of the parallel scheme is reduced.
The effect of
Here we study the effect of over-relaxing the stepsizes, meaning choosing . We compare Algorithm 1.1 with and several choices of . Figure 2 considers different scenarios for the degree of separability of . In those cases we see that choosing usually speeds up the convergence, depending on the parameter of partial separability of and the number of parallel block updates. This fact seems not to occur when both and are very small.
Appendix A Structured Lipschitz smoothness
In this section we discuss the Lipschitz smoothness properties of under the hypotheses 1 and 1 and we prove Theorem 3.1. Most of the results presented in this section are basically given in [34]. However, here they are rephrased in our notation and extended to our more general assumptions.
**Proposition A.1.**
Let be a convex function satisfying assumptions 1 and 1. Let be a nonempty subset of and let be such that , for every . Then for every and such that , we have
[TABLE]
Proof.
Let and, for every , set . Then
[TABLE]
Now, for every , we have
[TABLE]
Therefore, using the convexity of each we have
[TABLE]
It follows from the definition of that . Hence, switching the order of summation, and using the fact that is Lipschitz smooth with constant , we have
[TABLE]
**Corollary A.2.**
Let be a convex function satisfying 1 and 1. Let , , and be such that, for every , . Let , , and . Then, the function is Lipschitz smooth
- (i) in the metric defined by with constant ;
- (ii) in the metric , with constant (this constant is if the ’s are set according to (1.5) with the ’s as in Theorem 3.1i);
- (iii) in the (original) metric of with constant ;
- (iv) in the metric , with constant .
Proof.
i: It follows from Proposition A.1 with and noting that and then invoking the characterization of the Lipschitz continuity of the gradient through the descent lemma (see [1, Theorem 18.15(iii)]).
ii: It follows from (A.1) by choosing , , and noting that and then invoking [1, Theorem 18.15(iii)].
iii: It follows from ii with .
iv: It follows from ii with . ∎
**Remark A.3.**
If ( is not partially separable), Corollary A.2-iii-iv establishes that
[TABLE]
We show that the above bounds are tight. Indeed, if we consider , where and , then we have (where is the -th column of ), so that . Instead, since , the Lipschitz constant of is . It is well-known that if is rank one, then , so in this case the Lipschitz constant of is exactly . Moreover, if in addition the columns of have the same norm, then and hence . We finally note that if is an orthonormal matrix, then and hence .
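The tightness claims in this remark can be checked numerically. The sketch below assumes the least squares function 0.5*||Ax - b||^2 with one coordinate per block, so that the block constants are the squared column norms and the global constant is the squared spectral norm; the helper name `lipschitz_constants` is hypothetical.

```python
import numpy as np

def lipschitz_constants(A):
    """Global and coordinate-wise Lipschitz constants of the gradient of
    x -> 0.5*||Ax - b||^2 (ASSUMED form of the function in the remark)."""
    L_blocks = (A ** 2).sum(axis=0)          # L_i = ||a^i||^2, squared column norms
    L_global = np.linalg.norm(A, 2) ** 2     # L = ||A||^2, squared spectral norm
    return L_global, L_blocks

rng = np.random.default_rng(0)

# Generic matrix: max_i L_i <= L <= sum_i L_i.
L, Li = lipschitz_constants(rng.standard_normal((7, 4)))
generic_ok = Li.max() <= L + 1e-9 and L <= Li.sum() + 1e-9

# Rank-one matrix: the upper bound is attained, L = sum_i L_i.
L1, L1i = lipschitz_constants(np.outer(rng.standard_normal(7), rng.standard_normal(4)))
rank_one_tight = np.isclose(L1, L1i.sum())

# Orthonormal matrix: L = 1 = L_i for every i.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
LQ, LQi = lipschitz_constants(Q)
orthonormal_ok = np.isclose(LQ, 1.0) and np.allclose(LQi, 1.0)
```

The two extreme cases show how far apart the global constant and the sum of the block constants can be: they coincide in the rank-one case and differ by a factor equal to the number of columns in the orthonormal case.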
**Corollary A.4.**
Let be a function satisfying 1 and 1. Let and . Then,
[TABLE]
Proof.
It follows from Proposition A.1 with , , and $q_{i}=1/\big(\max_{1\leq k\leq p}\mathrm{card}(I\cap I_{k})\big)=1/\big(\max_{1\leq k\leq p}\sum_{i\in I_{k}}\epsilon_{i}\big)$. ∎
**Remark A.5.**
Most of the above results appear in [34] for the special case that for and for . In particular, see [34, Theorem 8].
**Proposition A.6.**
Let be a function satisfying 1 and suppose that, for every , is -Lipschitz smooth. Set for every , . Then the following holds.
- (i) is Lipschitz smooth with constant in the original metric of (in [4, Corollary 5.11] the worse constant was considered);
- (ii) satisfies assumption 1 with ;
- (iii) Suppose that, for every and for every , , the ranges of the operators and are orthogonal. Then, for every , ,
[TABLE]
Proof.
i: For every , let , . Let . We have
[TABLE]
Therefore, we have
[TABLE]
ii: It follows from (A.5) with that
[TABLE]
hence is a Lipschitz constant of .
iii: Since if , it follows from (A.4) that
[TABLE]
**Remark A.7.**
If and are orthogonal to each other, then
[TABLE]
and hence .
Appendix B Additional proofs
Proof of Theorem 3.1.
ii: It follows from (A.2) that, point-wise it holds
[TABLE]
Moreover, since $\beta_{2}=\operatorname*{ess\,sup}\big(\max_{1\leq k\leq p}\sum_{i\in I_{k}}\varepsilon_{i}\big)$, we have that -a.s. The statement follows.
i: It follows by taking the expectation in (B.1) and noting that
[TABLE]
where we used the fact that for every discrete random variable , .
iii: It follows from Proposition A.6iii that 1 holds with .
iv: Let and and set . Then
[TABLE]
Hence, it follows from 1 that . This shows that is Lipschitz smooth w.r.t. the -th block coordinate with Lipschitz constant . The global Lipschitz smoothness of follows from Corollary A.2. ∎
We have and $\mathsf{f}(\mathsf{x}+\mathsf{v})=\sum_{k=1}^{p}\mathsf{g}_{k}\big(\mathsf{U}_{k}\mathsf{x}+\sum_{i=1}^{m}\mathsf{U}_{k,i}\mathsf{v}_{i}\big)$. Moreover, . We let and define
[TABLE]
We clearly have , and , . Moreover,
[TABLE]
Therefore,
[TABLE]
Now, if is such that and we set , then we have and
[TABLE]
Hence $\psi_{k}\big(\sum_{i\in I_{k}}\mathsf{U}_{k,i}\varepsilon_{i}\mathsf{v}_{i}\big)\leq(1/t)\sum_{i=1}^{m}\varepsilon_{i}\psi_{k}(t\mathsf{U}_{k,i}\mathsf{v}_{i})$ on the event . Then,
[TABLE]
Plugging the above inequality in (B.3) we get
[TABLE]
where in the last inequality we used 1. So, setting as in (3.3), if , then 1 holds. Note that in deriving (B), if for some there are no such that , the corresponding term $\max_{k\in\varnothing}\mathsf{P}\big(\sum_{i\in I_{k}}\varepsilon_{i}=t\,|\,\varepsilon_{i}=1\big)$ can be set to zero. ∎
Proof of formula (3.4).
Since in the proof of (3.3) in Remark 3.2iv, we only use the fact that , we can assume, without loss of generality, that, for every , . Let . Then, since the block sampling is doubly uniform, we have that $\mathsf{P}\big(\sum_{i\in I_{k}}\varepsilon_{i}=t\,|\,\varepsilon_{i}=1\big)$ does not depend on such that . Therefore, (3.3) becomes $\beta_{1,i}=\sum_{t=1}^{\eta}t\,\mathsf{P}\big(\sum_{j\in I_{k}}\varepsilon_{j}=t\,|\,\varepsilon_{i}=1\big)$, for some such that . Hence
[TABLE]
where and (with ). Now, since and , we have that $\tilde{\mathsf{p}}=\big(\mathsf{E}[(\sum_{i=1}^{m}\varepsilon_{i})^{2}]-m\mathsf{p}\big)/(m(m-1))$, which plugged into (B.6) gives (3.4).
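The identity for the pair probability used above can be sanity-checked by brute force on a tau-nice sampling, which is doubly uniform. The sketch below assumes that the quantity with the tilde denotes the probability that two distinct blocks are updated together; for a tau-nice sampling the number of updated blocks is always tau, so the second moment in the formula equals tau squared and the marginal probability is tau/m. Function names are hypothetical and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction
from itertools import combinations

def tau_nice_pair_probability(m, tau):
    """P(blocks 0 and 1 are both selected) under a tau-nice sampling,
    computed by enumerating all subsets of size tau (requires m >= 2)."""
    subsets = list(combinations(range(m), tau))
    hits = sum(1 for S in subsets if 0 in S and 1 in S)
    return Fraction(hits, len(subsets))

def pair_probability_formula(m, tau):
    """(E[(sum_i eps_i)^2] - m*p) / (m*(m-1)) specialized to a tau-nice sampling."""
    p = Fraction(tau, m)                 # marginal probability of updating a block
    second_moment = Fraction(tau * tau)  # the number of updated blocks is always tau
    return (second_moment - m * p) / (m * (m - 1))
```

Both functions return tau*(tau-1)/(m*(m-1)), the familiar pair probability of a uniformly random subset of size tau.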
Proof of Lemma 4.3.
Let . It follows from the definition of that . Therefore, hence
[TABLE]
Now, we note that . Then,
[TABLE]
and hence
[TABLE]
Since , the statement follows. ∎
**Lemma B.1.**
Let . Then the largest constant satisfying the following inequality
[TABLE]
is
[TABLE]
Proof.
Property (B.7) is equivalent to
[TABLE]
Therefore,
[TABLE]
Now, since
[TABLE]
it follows from (B.9) that
[TABLE]
Therefore, the statement follows. ∎
Proof of Theorem 4.10.
We first note that, since, , the conclusion of Proposition 4.6iii can be stated as follows:
[TABLE]
Let and set for brevity , and . Then, (B.11) yields
[TABLE]
Let . Then the above inequality can be rewritten as
[TABLE]
Now, we derive from (2.1)-(2.2) that
[TABLE]
Therefore, it follows from Lemma B.1 (with and ) that
[TABLE]
where
[TABLE]
Then, by (B.12) and (B.13), we have that
[TABLE]
and hence, taking the expectation, and applying the resulting inequality recursively, we have,
[TABLE]
To conclude it is sufficient to note that, since
[TABLE]
we have
[TABLE]
Appendix C Some results on duality theory
In this section, for the reader’s convenience, we recap the results obtained in [9]. Let and be two lower semicontinuous and convex functions defined on Hilbert spaces, and let be a bounded linear operator. In this section we suppose that is -strongly convex. We consider the following optimization problems in duality (in the sense of Fenchel-Rockafellar)
[TABLE]
We define the duality gap function , and recall that
[TABLE]
So, the duality gap function bounds the primal and dual objectives. We have the following theorem
**Theorem C.1.**
Suppose that . Then the following holds:
- (i) Suppose that is -strongly convex. Let and set . Then,
[TABLE]
- (ii) Suppose that is -Lipschitz continuous. Let be such that and set . Then, we have
[TABLE]
Moreover, if is a random variable taking values in and such that and we set , then .
References
- [1] H.H. Bauschke, P.L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd ed., Springer, New York, 2017.
- [2] A. Beck, L. Tetruashvili, On the convergence of the block coordinate descent type methods, SIAM J. Optim., vol. 23, n. 4, pp. 2037–2060, 2013.
- [3] D.P. Bertsekas, Incremental proximal methods for large scale convex optimization, Math. Program. Ser. B, pp. 129–163, 2011.
- [4] P.L. Combettes, J.-C. Pesquet, Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping, SIAM J. Optim., vol. 25, n. 2, pp. 1121–1248, 2015.
- [5] P.L. Combettes, J.-C. Pesquet, Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping II: mean-square and linear convergence, Math. Program. Ser. B, pp. 1–19, 2018.
- [6] P.L. Combettes, V.R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Model. Simul., vol. 4, pp. 1168–1200, 2005.
- [7] D. Davis, W. Yin, Convergence rate analysis of several splitting schemes, in Splitting Methods in Communication, Imaging, Science, and Engineering (R. Glowinski, S.J. Osher, and W. Yin, Eds.), pp. 115–163, Springer, Cham, 2016.
- [8] D. Drusvyatskiy, A.S. Lewis, Error bounds, quadratic growth, and linear convergence of proximal methods, Math. Oper. Res., vol. 43, pp. 919–948, 2018.
