Linear Convergence of Variable Bregman Stochastic Coordinate Descent Method for Nonsmooth Nonconvex Optimization by Level-set Variational Analysis
Lei Zhao, Daoli Zhu

TL;DR
This paper proves the linear convergence of a variable Bregman stochastic coordinate descent method for broad nonsmooth, nonconvex problems, using a novel level-set variational analysis approach.
Contribution
It introduces a new convergence analysis for VBSCD in nonsmooth nonconvex optimization, establishing linear rates under certain conditions.
Findings
Almost sure convergence to critical points
Linear convergence rate under level-set error bound
Applicable to large-scale nonconvex nonsmooth problems
Table 1: Linear convergence results for stochastic coordinate descent-type methods. Cells whose contents were not recoverable are left empty.
| Paper | Problem property | Algorithm | Theoretical results |
| Nesterov, 2012 [8] | | | linear |
| Lu & Xiao, 2015 [6] | | | linear |
| Patrascu & Necoara, 2015 [10] | | | linear |
| This paper | LS-EB | Bregman function and parameter satisfy Assumption 2 | |
Acknowledgments: This research was supported by NSFC grants 71471112 and 71871140.
Author affiliation (both authors): Antai College of Economics and Management and Sino-US Global Logistics Institute, Shanghai Jiao Tong University, Shanghai, China
Abstract
Large-scale nonconvex and nonsmooth problems have attracted considerable attention in compressed sensing, big data optimization and machine learning. Developing effective methods for them remains the main challenge of current research. Stochastic coordinate descent type methods have been widely used to solve large-scale optimization problems. In this paper, we establish the convergence of the variable Bregman stochastic coordinate descent (VBSCD) method for a broad class of nonsmooth and nonconvex optimization problems: any accumulation point of the sequence generated by VBSCD is almost surely a critical point. Moreover, we develop a new variational approach on level sets aimed at convergence rate analysis. If the level-set subdifferential error bound holds, we derive a linear rate of convergence for the expected values of the objective function and of the random variables generated by VBSCD.
Keywords: Level-set subdifferential error bound, Variable Bregman stochastic coordinate descent method, Linear convergence, Variational approach, Nonsmooth nonconvex optimization, Level-set based error bounds
1 Introduction
This paper considers the following nonconvex and nonsmooth optimization problem:
[TABLE]
where is a -smooth function (may be nonconvex) and is a continuous semi-convex function. Moreover, , and .
1.1 Notations and assumptions
Throughout this paper, and denote the Euclidean scalar product of and its corresponding norm respectively.
Let be a subset of and be any point in . Define . When , we set .
Moreover, we use to denote the set of -smooth functions from to ; is the set of continuous functions from to .
Given an , let , set and .
Additionally, the subdifferential used throughout the paper is the limiting subdifferential, which is standard in variational analysis [13].
Throughout the remainder of this paper, we make the following assumption on and .
Assumption 1
- (i)
* is a nonconvex differentiable function with convex. Its gradient is *Lipschitz continuous on such that
[TABLE]
- (ii)
* and is a convex set. Moreover, is semi-convex on with modulus , one has*
[TABLE]
- (iii)
* is level-bounded i.e., the set is bounded (possibly empty) for every .*
From (i) and (ii), is a convex set. In addition, as a consequence of (iii), the optimal value of (P) is finite and the optimal solution set of (P) is non-empty. Moreover, the set of all critical points of is denoted by .
We note that the well-known SCAD penalty [3] and MCP penalty [15] are both semi-convex.
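The semi-convexity of these two penalties can be checked numerically. The sketch below is illustrative only: it assumes the common definition that a function is semi-convex with modulus rho when adding (rho/2) t^2 makes it convex, uses the standard closed forms of MCP and SCAD in one dimension, and picks hypothetical parameter values (`lam`, `gamma`, `a`). The moduli 1/gamma for MCP and 1/(a-1) for SCAD follow from the second derivatives of those closed forms. The helper names are not from the paper.

```python
import numpy as np

def mcp(t, lam, gamma):
    """Minimax concave penalty (MCP), evaluated elementwise."""
    s = np.abs(t)
    return np.where(s <= gamma * lam,
                    lam * s - s**2 / (2.0 * gamma),
                    0.5 * gamma * lam**2)

def scad(t, lam, a):
    """SCAD penalty, evaluated elementwise (requires a > 2)."""
    s = np.abs(t)
    middle = (2.0 * a * lam * s - s**2 - lam**2) / (2.0 * (a - 1.0))
    return np.where(s <= lam, lam * s,
                    np.where(s <= a * lam, middle, 0.5 * (a + 1.0) * lam**2))

def is_convex_on_grid(h, grid, tol=1e-9):
    """Convexity test on a uniform grid: all second differences are >= -tol."""
    v = h(grid)
    return bool(np.all(v[:-2] - 2.0 * v[1:-1] + v[2:] >= -tol))

# Hypothetical parameter choices, used only for illustration.
lam, gamma, a = 1.0, 3.0, 3.7
grid = np.linspace(-5.0, 5.0, 2001)

# Adding (rho/2) t^2 with rho = 1/gamma (MCP) or rho = 1/(a-1) (SCAD).
mcp_shifted = lambda t: mcp(t, lam, gamma) + 0.5 * (1.0 / gamma) * t**2
scad_shifted = lambda t: scad(t, lam, a) + 0.5 * (1.0 / (a - 1.0)) * t**2

print(is_convex_on_grid(mcp_shifted, grid), is_convex_on_grid(scad_shifted, grid))
print(is_convex_on_grid(lambda t: mcp(t, lam, gamma), grid))
```

The grid test is a necessary condition rather than a proof, but it separates the cases clearly: the shifted penalties pass, while the raw penalties fail on their concave segments.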
To construct an algorithm for (P), we introduce Bregman distance function and parameter .
Let be a twice differentiable strongly convex function. The Bregman distance function associated with is defined by
[TABLE]
Then we have , and . We make the following standing assumption on function and parameter .
Assumption 2
- (i)
* is strongly convex with and with its gradient being -Lipschitz.*
- (ii)
The parameter satisfies: .
Under this assumption, the Bregman distance function satisfies:
[TABLE]
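As an illustration of the Bregman distance and of the two-sided quadratic bound that strong convexity and a Lipschitz gradient give, here is a minimal sketch. It assumes the convention D(x, y) = K(y) - K(x) - <grad K(x), y - x> (the argument order in the paper's own definition is not legible in this text), a quadratic kernel chosen for illustration, and the names `m` and `M` for the strong convexity modulus and the gradient Lipschitz constant.

```python
import numpy as np

def bregman_distance(K, grad_K, x, y):
    """D(x, y) = K(y) - K(x) - <grad K(x), y - x> (assumed convention)."""
    return K(y) - K(x) - grad_K(x) @ (y - x)

# Illustrative kernel: K(x) = 0.5 * x^T Q x with Q symmetric positive definite.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
Q = B @ B.T + np.eye(4)
K = lambda x: 0.5 * x @ Q @ x
grad_K = lambda x: Q @ x

# For this kernel, m and M are the extreme eigenvalues of Q.
eigs = np.linalg.eigvalsh(Q)
m, M = eigs[0], eigs[-1]

x, y = rng.standard_normal(4), rng.standard_normal(4)
d = bregman_distance(K, grad_K, x, y)
sq = float(np.sum((x - y) ** 2))

# Two-sided bound: (m/2)||x - y||^2 <= D(x, y) <= (M/2)||x - y||^2.
print(0.5 * m * sq <= d <= 0.5 * M * sq)
```

With the Euclidean kernel K(x) = 0.5 ||x||^2 both constants equal one and the Bregman distance reduces to half the squared Euclidean distance.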
Now we are ready to introduce the variable Bregman stochastic coordinate descent method.
1.2 Variable Bregman Stochastic Coordinate Descent method
In this subsection we introduce the Variable Bregman Stochastic Coordinate Descent (VBSCD) method. First, recall that a variable Bregman distance-like function has the form
[TABLE]
We propose to solve (P) by generating a sequence using the following VBSCD method:
**Variable Bregman Stochastic Coordinate Descent method (VBSCD)**
Initialize
for , do
[TABLE]
end for
Here we note that if , VBSCD refines the classical random coordinate descent scheme (see [6, 8, 10, 11]).
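To give a concrete picture of a randomized coordinate scheme of this kind, the following is a minimal sketch under stated assumptions, not the authors' algorithm. The VBSCD update itself is not legible in this text, so the sketch assumes a fixed Euclidean kernel K(x) = 0.5 ||x||^2, blocks of size one, a least-squares smooth part, and an l1 penalty (convex, hence semi-convex with modulus zero). Under those assumptions each step is a coordinate proximal gradient step with soft-thresholding. All function names and the test problem are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal map of t * |.| at the scalar v."""
    return np.sign(v) * max(abs(v) - t, 0.0)

def objective(A, b, lam, x):
    """F(x) = 0.5 * ||Ax - b||^2 + lam * ||x||_1."""
    r = A @ x - b
    return 0.5 * r @ r + lam * np.sum(np.abs(x))

def random_coordinate_prox_descent(A, b, lam, eps, n_iters, seed=0):
    """Randomized coordinate proximal gradient with a Euclidean kernel."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    x = np.zeros(n)
    history = [objective(A, b, lam, x)]
    for _ in range(n_iters):
        i = rng.integers(n)                     # index drawn uniformly at random
        grad_i = A[:, i] @ (A @ x - b)          # i-th partial derivative of the smooth part
        x[i] = soft_threshold(x[i] - eps * grad_i, eps * lam)
        history.append(objective(A, b, lam, x))
    return x, history

# Small synthetic problem; eps = 1/L with L the gradient Lipschitz constant.
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
L = np.linalg.norm(A, 2) ** 2
x_hat, hist = random_coordinate_prox_descent(A, b, lam=0.1, eps=1.0 / L, n_iters=500)
print(hist[0], hist[-1])
```

With the step size at most 1/L every step is a descent step, so the recorded objective values are non-increasing along each realization. A variable Bregman version would replace the squared Euclidean term in the coordinate subproblem by an iteration-dependent Bregman distance.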
1.3 Error bounds and their relationship
In this subsection, let . We introduce four types of error bounds that are commonly used in the convergence rate analysis of algorithms.
For given positive numbers and , let and
[TABLE]
We now introduce the concept of the level-set subdifferential error bound (LS-EB).
Definition 1.1 **(Level-set subdifferential error bound (LS-EB))**
The proper lower semicontinuous function is said to satisfy the level-set subdifferential error bound condition at if there exist , , and such that the following inequality holds:
[TABLE]
For a given Bregman proximal mapping, we can introduce the Bregman proximal error bound. The Bregman proximal mapping is defined as follows.
**Bregman Proximal Mapping**
The Bregman proximal mapping is defined by
[TABLE]
The Bregman proximal error bound (BP-EB) is defined as follows.
Definition 1.2 **(Bregman proximal error bound (BP-EB))**
Given a Bregman function along with , we say that the Bregman proximal error bound holds at , if there exist , and such that
[TABLE]
The well-known Kurdyka-Łojasiewicz property is defined as follows (see [1, 4]).
Definition 1.3 **(Kurdyka-Łojasiewicz property (KL))**
The proper lower semicontinuous function is said to satisfy the Kurdyka-Łojasiewicz (KL) property at , if there exist , and such that the following inequality holds:
[TABLE]
By introducing , we have Luo and Tseng’s error bound (LT-EB) as follows (see [7, 10, 14]).
Definition 1.4 **(Luo and Tseng’s error bound (LT-EB))**
We say the Luo-Tseng error bound holds if, for any , there exist constants and such that
[TABLE]
whenever , .
Additionally, we introduce two standard assumptions, which are commonly used together with the above error bounds.
Assumption 3 **([7, 10, 14])**
There exists such that whenever with .
Assumption 4 **(Growth condition [1])**
For any there exist , and such that
[TABLE]
We note that, according to Zhu and Deng [16], the combination of BP-EB, semi-convexity and Assumption 3 implies LS-EB. If is convex, LT-EB implies BP-EB. Additionally, if is semi-convex, KL and LS-EB are equivalent. Moreover, LT-EB is a global error bound (EB), whereas BP-EB, KL and LS-EB are local error bounds. Therefore, LS-EB is the weakest EB-type condition (see Figure 1).
1.4 Related works
In this subsection, we compare the VBSCD method with existing theoretical results on the linear convergence of stochastic coordinate descent-type methods (see Table 1). This paper differs from existing research in that it analyzes convergence and the convergence rate under a local error bound condition (the level-set subdifferential error bound); that is, the error bound condition is only required to hold in a neighborhood of a given point. We analyze three cases in this paper: the critical point, the local minimum and the global minimum.
1.5 Main contributions and outline of this paper
In this paper, we propose a variable Bregman stochastic coordinate descent (VBSCD) method for (P), based on the Variable Bregman Proximal Gradient (VBPG) method (Zhu and Deng, 2019 [16]; Cohen, 1980 [2]). In each iteration, the method updates a block of variables chosen at random according to the uniform distribution. The sequence generated by our algorithm is proven to converge to a critical point of problem (P). Moreover, we develop a new variational approach on level sets aimed at convergence rate analysis. Both the Q-linear and R-linear convergence rates are analyzed in this paper.
The remainder of this paper is organized as follows. Section 2 is devoted to the properties of the Bregman-type mappings and functions. In Section 3 we establish the convergence of VBSCD. In Section 4, the linear convergence rates of three cases are analyzed.
2 Basic properties of Bregman type mappings and functions
The analysis of convergence and of the rate of convergence of the VBSCD method relies essentially on Bregman type mappings and functions. Given a Bregman function and a positive , the following mappings and functions play a key role in this analysis.
**Bregman Proximal Envelope Function (BP Envelope Function)**
The BP envelope function is defined by
[TABLE]
**Coordinate Bregman Proximal Mapping**
The coordinate Bregman proximal mapping is defined by
[TABLE]
which can be viewed as the optimizer of optimization problem (APi(k)), where is replaced by random variable and is replaced by . In other words, if the index is a random variable, then is a random output.
Lemma 2.1
Let a Bregman function and parameter be given. Let the index be chosen from with equal probability. Suppose that Assumptions 1 and 2 hold. Then for any , the following assertions are true.
- (i)
(see (2)) and (see (6)) are single-valued.
- (ii)
For any given function , we have
[TABLE]
- (iii)
For , for any mapping , we have
[TABLE]
Specifically, it guarantees that
[TABLE]
Proof.
- (i)
Since is semi-convex with modulus in Assumption 1 and in Assumption 2, statement (i) of this lemma follows from Proposition 2.4 of Zhu and Deng [16].
- (ii), (iii)
Trivial.
The above mappings and functions enjoy favorable properties, which are summarized in the following propositions.
Proposition 2.1
(Global properties of Bregman type mappings and functions)* Let a Bregman function and parameter be given. Let the index be chosen from with equal probability. Suppose that Assumptions 1 and 2 hold. Then for any ,*
- (i)
and ;
- (ii)
$N\mathbb{E}_{i}F\big(\hat{T}_{i,D,\epsilon}(x)\big)-(N-1)F(x)\leq E_{D,\epsilon}(x)-\frac{N}{2}\Big(\frac{m}{\overline{\epsilon}}-L\Big)\mathbb{E}_{i}\|x-\hat{T}_{i,D,\epsilon}(x)\|^{2}.$
Proof.
- (i)
Since is the optimizer of minimization problem in (2), we have that
[TABLE]
Take in above inequality, by the fact and we have
[TABLE]
By the gradient Lipschitz of and , we have that
[TABLE]
with .
- (ii)
By the definition of BP Envelope Function we have that
[TABLE]
By (15), we have
[TABLE]
For any , by (14) with , we have
[TABLE]
It follows that
[TABLE]
For any , again using (14) with , we have
[TABLE]
Therefore
[TABLE]
Together ((ii)), (20), (21) and (22), we have
[TABLE]
The following proposition provides an upper bound for the function value under the level-set subdifferential error bound condition.
Proposition 2.2
(Uniform estimate of value proximity by stochastic coordinate Bregman proximal mappings)* Let a Bregman function and parameter be given. Let the index be chosen from with equal probability. Suppose that Assumptions 1 and 2 hold. Moreover, assume the level-set subdifferential error bound condition holds at for positive numbers , and . If with , then there exist positive numbers , , and such that*
- (i)
;
- (ii)
$N\mathbb{E}_{i}F\big(\hat{T}_{i,D,\epsilon}(x)\big)-(N-1)F(x)-\overline{F}\leq E_{D,\epsilon}(x)-\overline{F}\leq\theta_{2}\operatorname{dist}^{2}(x,[F\leq\overline{F}])$;
- (iii)
$N\mathbb{E}_{i}F\big(\hat{T}_{i,D,\epsilon}(x)\big)-(N-1)F(x)-\overline{F}\leq N^{2}\kappa\,\mathbb{E}_{i}\|\hat{T}_{i,D,\epsilon}(x)-x\|^{2}$;
- (iv)
$F(x)-\overline{F}\leq b\left[F(x)-\mathbb{E}_{i}F\big(\hat{T}_{i,D,\epsilon}(x)\big)\right]$;
- (v)
.
Proof.
- (i)
This statement follows from Theorem 3.1 in Zhu and Deng [16].
- (ii)
The first inequality of this statement follows from statement (ii) of Proposition 2.1 in this paper. The second inequality follows from Proposition 3.1 in Zhu and Deng [16].
- (iii)
Combining statements (i) and (ii), we have
[TABLE]
- (iv)
From statement (iii) of this Proposition, we have that
[TABLE]
where .
- (v)
This statement follows directly from statement (iv) of this proposition.
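The rearrangement behind the last step can be written out explicitly. The exact wording of statement (v) is not legible in this text, so the display below should be read as what statement (iv) yields for $b>0$, not as a quotation of (v).

```latex
\begin{align*}
F(x)-\overline{F}
  &\le b\left[F(x)-\mathbb{E}_{i}F\big(\hat{T}_{i,D,\epsilon}(x)\big)\right] \\
\Longleftrightarrow\quad
b\left[\mathbb{E}_{i}F\big(\hat{T}_{i,D,\epsilon}(x)\big)-\overline{F}\right]
  &\le (b-1)\big(F(x)-\overline{F}\big) \\
\Longleftrightarrow\quad
\mathbb{E}_{i}F\big(\hat{T}_{i,D,\epsilon}(x)\big)-\overline{F}
  &\le \Big(1-\frac{1}{b}\Big)\big(F(x)-\overline{F}\big).
\end{align*}
```

In this form the expected objective gap after one randomized step is at most a fixed fraction $1-1/b$ of the current gap, which is the kind of contraction a linear rate is built from.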
Moreover, we introduce the following Property A which will be used in the convergence rate analysis.
Property (A) Let be given. We say satisfies Property A if we have $F\big(\hat{T}_{i,D,\epsilon}(x)\big)\geq F(\overline{x})=\overline{F}$, .
Lemma 2.2
Let Bregman function and parameter be given. Suppose Assumptions 1 and 2 hold. Let with be given. If satisfies Property A, then for all , we have that and .
Proof. Since satisfies Property A, $F\big(\hat{T}_{i,D,\epsilon}(x)\big)\geq\overline{F}$, . By statement (i) of Proposition 2.1 and , we have that
[TABLE]
with . Since , consequently,
[TABLE]
Since , it follows that , . It follows that .
3 Convergence analysis for VBSCD
In this section, we discuss the convergence behavior of the sequences generated by the VBSCD method. In Sections 3 and 4, we assume that the variable Bregman functions and parameters uniformly satisfy Assumption 2. In algorithm VBSCD, the indices , are random variables. After iterations, the VBSCD method generates a random output . We denote by the filtration generated by the random variable , i.e.,
[TABLE]
Additionally, we define that , is the conditional expectation w.r.t. and the conditional expectation in terms of given as . Several basic properties of sequences and are summarized in the following proposition.
Proposition 3.1
Suppose that Assumptions 1 and 2 hold. Let be a sequence generated by the VBSCD method. Then the following assertions hold:
- (i)
, and ;
- (ii)
* with is some random variable, and ;*
- (iii)
The random variable sequence generated by VBSCD is almost surely bounded;
- (iv)
Any cluster point of a realization sequence generated by VBSCD is a critical point of .
Proof.
- (i)
The claim follows directly from (i) of Proposition 2.1.
- (ii)
Taking the expectation of on both sides of statement (i) of this proposition, we have
[TABLE]
By the Robbins-Siegmund lemma [12], we have where is some random variable and . Further, due to the almost sure convergence of sequence , it easily follows that . Together with statement (i) of this proposition we have .
Moreover, implies that .
- (iii)
By statement (ii), where is some random variable; the almost sure boundedness of then follows from Assumption 1, since is level-bounded.
- (iv)
By statement (ii) of Proposition 2.3-2.5 in Zhu and Deng 2019 [16] and in statement (ii) of this proposition, we have that any cluster point of a realization sequence generated by VBSCD is a critical point of .
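For reference, the Robbins-Siegmund lemma invoked in the proof of statement (ii) can be stated as follows. The symbols $V_k$, $U_k$, $\beta_k$, $\zeta_k$ are generic names chosen here and are not taken from the paper.

```latex
\textbf{Lemma (Robbins-Siegmund).}
Let $(V_k)$, $(U_k)$, $(\beta_k)$ and $(\zeta_k)$ be nonnegative random sequences
adapted to a filtration $(\mathcal{F}_k)$, with
$\sum_{k}\beta_k<\infty$ and $\sum_{k}\zeta_k<\infty$ almost surely, and suppose
\[
  \mathbb{E}\left[V_{k+1}\mid\mathcal{F}_k\right]
  \le (1+\beta_k)\,V_k - U_k + \zeta_k
  \qquad\text{for all } k\ge 0 .
\]
Then $V_k$ converges almost surely to a finite random variable,
and $\sum_{k}U_k<\infty$ almost surely.
```

A natural way to apply it here, presumably the one intended, is to take $V_k$ as the objective value shifted by the finite optimal value of (P), $\beta_k=\zeta_k=0$, and $U_k$ as the descent term from statement (i).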
4 Linear Convergence of VBSCD
This section will provide the linear convergence of VBSCD under the level-set subdifferential error bound condition. First we propose a lemma which will be used in the convergence rate analysis.
Lemma 4.1
Suppose Assumptions 1 and 2 hold. Let be a sequence generated by VBSCD. If , then there exists a positive number , independent of , such that , .
Proof. This follows directly from basic properties of expectation.
Under the level-set subdifferential error bound condition, the next proposition shows that the sequence of random variables generated by the VBSCD method almost surely belongs to .
Proposition 4.1
(Almost surely finite length property of sequence )* Suppose Assumptions 1 and 2 hold. Furthermore, we assume that the level-set subdifferential error bound holds at the point with and . Let , , and be constants given in Proposition 2.1, 2.2 and Lemma 4.1 respectively. Suppose that satisfies the following conditions:*
[TABLE]
Assume moreover that
[TABLE]
Then the following statements hold.
- (i)
, ;
- (ii)
* (finite length property), and the sequence converges to a random variable ;*
- (iii)
, and the sequence converges to a point .
Proof.
- (i)
From Assumptions, obviously, . By (i) of Proposition 3.1, we have . By using (31) and Lemma 2.2 we have and . Moreover, . Combining the triangle inequality, we have . By (30), it follows that .
Now suppose for and . Again using (31) and Lemma 2.2, we have and .
We need to show that . By the concavity of function , we have
[TABLE]
Combining this with statement (iv) of Proposition 2.2, the above inequality yields
[TABLE]
Taking the expectation with respect to in (33), we obtain
[TABLE]
or
[TABLE]
Taking the expectation with respect to in (35), we obtain
[TABLE]
Since , it follows that
[TABLE]
By Lemma 4.1, we have
[TABLE]
with .
Again combining (33) and (i) in Proposition 3.1, we have
[TABLE]
or
[TABLE]
It follows from with nonnegative and that
[TABLE]
Summing ((i)) for , we obtain
[TABLE]
By (38), we have
[TABLE]
Combining triangle inequality, (43) and (30), we have
[TABLE]
It follows that , .
Therefore, , .
- (ii)
A direct consequence of (43) is, for all ,
[TABLE]
(45) implies that the sequence converges to a random variable .
- (iii)
Taking the expectation with respect to on both sides of statement (ii) of this proposition, we have that
[TABLE]
By the convexity of and the fact , we have that
[TABLE]
Moreover, (47) implies that the sequence converges to the point .
To further study the linear convergence of VBSCD, we need the following assumption.
Assumption 5
The set is convex.
We note that locally convex and locally quasi-convex functions satisfy Assumption 5. In robust statistics, many popular functions are both quasi-convex and semi-convex, such as SCAD and MCP (see [3, 5, 15]).
4.1 Linear convergence of VBSCD under LS-EB at critical point
For given and a realization sequence generated by VBSCD, let be a cluster point of . Therefore, there is such that satisfy (29) and (30). Let be a filtration generated by the random variable ,…,, i.e.,
[TABLE]
where , is the fixed index corresponding to realization , and . By statement (iii) of Proposition 3.1, there exists a positive number such that, for all , . We then have the following linear convergence theorem.
Theorem 4.1 **(The linear convergence under LS-EB at critical point)**
Suppose Assumptions 1 and 2 hold, and let Assumption 3 hold with . Moreover, we assume that the level-set subdifferential error bound holds at the point with and . Let be constants given in Proposition 2.2. Consider the sequence of realization . Then there exists such that, for all , the following assertions are true.
- (i)
* converges to value at the -linear rate of convergence by expectation; that is, for , there are some such that*
[TABLE]
As a consequence,
[TABLE]
- (ii)
The sequence , converges to point at the -linear rate of convergence; under Assumption 5 we have .
Proof. Let be the set of accumulation points for the realization of . Since Assumption 3 holds with , we have that , . Moreover, , . Then condition (31) holds.
Together with satisfy (29) and (30), by Proposition 4.1, for all , .
- (i)
For all , by (v) of Proposition 2.2 it follows that
[TABLE]
where . Again using the fact , we take the expectation with respect to for inequality (50), we obtain that
[TABLE]
- (ii)
We now derive the R-linear rate of convergence of . Taking expectation with respect to for (i) in Proposition 3.1, we have
[TABLE]
Thus
[TABLE]
From the convexity of and above inequality, we see that
[TABLE]
where . By statement (iii) of Proposition 4.1, we have and converges to the point . Hence,
[TABLE]
This shows that converges to at the R-linear rate; that is,
[TABLE]
Since , then and . Let be the set realization of . For , , under Assumption 5, we have . Since and , we have . Then it follows that .
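To make the practical meaning of a linear rate concrete: if a nonnegative quantity a_k (for instance an expected objective gap) satisfies a_{k+1} <= q * a_k for some q in (0, 1), then a_k <= q^k * a_0, so the number of iterations needed to reach a tolerance grows only logarithmically in 1/tol. The sketch below is generic arithmetic with a hypothetical contraction factor `q`; the actual constants of Theorem 4.1 are not legible in this text and are not reproduced here.

```python
import math

def gap_bound(q, a0, k):
    """Upper bound q**k * a0 implied by the contraction a_{k+1} <= q * a_k."""
    return (q ** k) * a0

def iterations_to_tolerance(q, a0, tol):
    """Smallest k with q**k * a0 <= tol, for 0 < q < 1 and tol > 0."""
    if not (0.0 < q < 1.0):
        raise ValueError("q must lie in (0, 1)")
    if tol <= 0.0:
        raise ValueError("tol must be positive")
    if a0 <= tol:
        return 0
    # Start just below the logarithmic estimate, then step up to guard
    # against floating-point rounding in the logarithms.
    k = max(0, math.floor(math.log(tol / a0) / math.log(q)) - 1)
    while gap_bound(q, a0, k) > tol:
        k += 1
    return k

print(iterations_to_tolerance(0.5, 1.0, 1e-3))
```

For a contraction factor of the form q = 1 - 1/b, a larger b means a factor closer to one and therefore more iterations for the same tolerance.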
4.2 Linear convergence to a local minimum
Throughout this subsection, let be a local minimum on . If the level-set subdifferential error bound holds at point with and , we will show that, under Assumption 4 and a special selection of the initial point, the sequence almost surely belongs to and VBSCD converges linearly to a local minimum.
Theorem 4.2 **(The linear convergence to a local minimum)**
Suppose Assumption 1 and 2 hold. Moreover, we assume that the level-set subdifferential error bound holds at the point with and . Let , be constant given in Proposition 2.2, Assumption 4 holds with , , and . Let satisfy (29) and (30) and the sequence be generated by the VBSCD method. Then the following assertions are true.
- (i)
* converges to value at the -linear rate of convergence by expectation; that is, there are some such that*
[TABLE]
As a consequence,
[TABLE]
- (ii)
The sequence , converges to point at the -linear rate of convergence; under Assumption 5, we have and is a local minimum.
Proof. The proof of this theorem is similar to that of Theorem 4.1. The differences are as follows:
- (1)
Together Assumption 4 (Growth condition) holds with , , and and statement (i) in Proposition 3.1, we have that , implies . Since is local minimum on , then and condition (31) holds. Since satisfy (29) and (30), by Proposition 4.1, we have that , .
- (2)
Under Assumption 5, we have and . Since is a local minimum on , we have that is also a local minimum on .
4.3 Linear convergence to a global minimum
Throughout this subsection, let be a global minimum. If the level-set subdifferential error bound holds at point with and , we will show the linear convergence of VBSCD.
Theorem 4.3 **(The linear convergence to a global minimum)**
Suppose Assumptions 1 and 2 hold. Moreover, we assume that the level-set subdifferential error bound holds at the point with and . Let and be constants given in Proposition 2.2. There exist such that the inequalities
[TABLE]
imply that any realization of the sequence generated by VBSCD satisfies
- (i)
,
- (ii)
* converges to value at the -linear rate of convergence by expectation; that is, there are some such that*
[TABLE]
As a consequence,
[TABLE]
- (iii)
The sequence , converges to point at the -linear rate of convergence; under Assumption 5, we have and is a global minimum.
Proof. It is a straightforward variant of the proofs of Theorems 4.1 and 4.2.
References
- [1] Attouch, H., Bolte, J., & Svaiter, B. F. (2013). Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming, 137(1-2), 91-129.
- [2] Cohen, G. (1980). Auxiliary problem principle and decomposition of optimization problems. Journal of Optimization Theory and Applications, 32(3), 277-305.
- [3] Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348-1360.
- [4] Kazemi, E., & Wang, L. (2019). Asynchronous Delay-Aware Accelerated Proximal Coordinate Descent for Nonconvex Nonsmooth Problems. arXiv preprint arXiv:1902.01856.
- [5] Lim, C. H. (2018). An Efficient Pruning Algorithm for Robust Isotonic Regression. In Advances in Neural Information Processing Systems (pp. 219-229).
- [6] Lu, Z., & Xiao, L. (2015). On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming, 152(1-2), 615-642.
- [7] Luo, Z. Q., & Tseng, P. (1992). Error bound and convergence analysis of matrix splitting algorithms for the affine variational inequality problem. SIAM Journal on Optimization, 2(1), 43-54.
- [8] Nesterov, Y. (2012). Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2), 341-362.
