Linear convergence of distributed Dykstra's algorithm for sets under an intersection property
C.H. Jeffrey Pang

TL;DR
This paper proves that a distributed Dykstra's algorithm converges linearly when the sets intersect in a manner slightly stronger than the usual constraint qualifications.
Contribution
It establishes the linear convergence of distributed Dykstra's algorithm under a new intersection condition, broadening its theoretical applicability.
Findings
Proves linear convergence under a new intersection property.
Extends convergence results for distributed Dykstra's algorithm.
Provides theoretical foundation for algorithm performance.
Abstract
We show the linear convergence of Dykstra's algorithm for sets intersecting in a manner slightly stronger than the usual constraint qualifications.
Taxonomy
Topics: Optimization and Variational Analysis · Advanced Optimization Algorithms Research · Point processes and geometric inequalities
Linear convergence of distributed Dykstra’s algorithm for sets under an intersection property
C.H. Jeffrey Pang Department of Mathematics
National University of Singapore
Block S17 08-11
10 Lower Kent Ridge Road
Singapore 119076 [email protected]
(Date: March 2, 2024)
Abstract.
We show the linear convergence of a distributed Dykstra’s algorithm for sets intersecting in a manner slightly stronger than the usual constraint qualifications.
Key words and phrases:
Distributed optimization, Dykstra’s algorithm, linear convergence
2010 Mathematics Subject Classification:
68Q25, 68W15, 90C25, 90C30, 65K05
C.H.J. Pang acknowledges grant R-146-000-265-114 from the Faculty of Science, National University of Singapore.
1. Introduction
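Let $G=(V,E)$ be an undirected graph. For all $i\in V$, let $C_i\subset\mathbb{R}^m$ be closed convex sets, and $\bar{x}_i\in\mathbb{R}^m$. For a closed convex set $C$, let $\delta_C(\cdot)$ be its indicator function. Consider the distributed optimization problem
\[ \min_{x\in\mathbb{R}^m}\ \sum_{i\in V}\Big[\delta_{C_i}(x)+\tfrac{1}{2}\|x-\bar{x}_i\|^2\Big], \tag{1.1} \]
where communications between two vertices in $V$ occur only along edges in $E$. In Remark 2.3, we explain that we can assume that all $\bar{x}_i$ are equal to some $\bar{x}$ without losing any generality. The problem is therefore equivalent to projecting $\bar{x}$ onto $\cap_{i\in V}C_i$ in a distributed manner.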
1.1. A review of the distributed Dykstra’s splitting
In our earlier paper [Pan18a], we considered a more general problem than (1.1), in which the indicator functions can be replaced by general closed convex functions. We proposed a deterministic distributed asynchronous decentralized algorithm based on dual ascent for (1.1) that converges to the primal minimizer, and called it the distributed Dykstra’s algorithm. Our approach was motivated by work on Dykstra’s algorithm in [Dyk83, BD85, GM89, HD97]. See also [Han88]. We also remark that the dual ascent idea had been discussed in [CDV11, CDV10, ACP+17]. We refer to the introduction in [Pan18a] for a more detailed historical summary of these methods. Part of the contribution in [Pan18a] was to point out that the dual ascent idea leads to a desirable distributed optimization algorithm. We give more details of the distributed Dykstra’s algorithm in Section 2.
1.2. Linear convergence of Dykstra’s algorithm
A well known algorithm for solving (1.1) is Dykstra’s algorithm. The primal problem and its corresponding (Fenchel) dual are typically written as
[TABLE]
respectively, and solved by block coordinate maximization on the dual problem. (See [BD85, Han88, GM89].) (Note that this dual is different from (2.4).) In the case when the sets are halfspaces, linear convergence of Dykstra’s algorithm was established in [IP90], with refined rates given in [DH94]. We extended the linear rates to polyhedra in [Pan17].
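As a rough illustration of the classical (non-distributed) Dykstra iteration discussed above, the following is a minimal sketch for two halfspaces in the plane. The function names, the choice of sets, and the starting point are illustrative assumptions and are not taken from the paper. Each set keeps a correction vector `z[i]`, which plays the role of the dual variable updated by block coordinate maximization. Plain alternating projections are included only for contrast, since they stop at a feasible point that need not be the projection.

```python
def project_halfspace(p, a, b):
    """Project p onto the halfspace {x : <a, x> <= b}."""
    violation = sum(ai * pi for ai, pi in zip(a, p)) - b
    if violation <= 0:
        return list(p)
    scale = violation / sum(ai * ai for ai in a)
    return [pi - scale * ai for ai, pi in zip(a, p)]


def dykstra(x_bar, projections, sweeps):
    """Classical Dykstra sweeps: project with a per-set correction vector."""
    x = list(x_bar)
    z = [[0.0] * len(x) for _ in projections]  # one correction (dual) vector per set
    for _ in range(sweeps):
        for i, proj in enumerate(projections):
            y = [xi + zi for xi, zi in zip(x, z[i])]  # add back the old correction
            x = proj(y)
            z[i] = [yi - xi for yi, xi in zip(y, x)]  # store the new correction
    return x


def alternating_projections(x_bar, projections, sweeps):
    """Same sweeps without corrections (shown only for comparison)."""
    x = list(x_bar)
    for _ in range(sweeps):
        for proj in projections:
            x = proj(x)
    return x


# Illustrative sets: C1 = {x : x_2 <= 0} and C2 = {x : x_1 + x_2 <= 0}.
def proj_c1(p):
    return project_halfspace(p, [0.0, 1.0], 0.0)


def proj_c2(p):
    return project_halfspace(p, [1.0, 1.0], 0.0)


dykstra_point = dykstra([1.0, 1.0], [proj_c1, proj_c2], sweeps=60)
map_point = alternating_projections([1.0, 1.0], [proj_c1, proj_c2], sweeps=60)
```

In this example the projection of (1, 1) onto the intersection is the origin. The Dykstra iterate after sweep k is (2^-k, -2^-k), so the error halves at every sweep, which is a small instance of the linear convergence known for halfspaces. Alternating projections stop at (0.5, -0.5), which is feasible but is not the projection.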
A linear convergence rate of Dykstra’s algorithm assures that a high accuracy solution can be obtained in a reasonable amount of time. This would then allow the algorithm to be used as a subroutine of other optimization algorithms. For example, the distributed optimization algorithms [AH16, TSDS18] (and perhaps many others) make use of the averaged consensus algorithm as a subroutine. (The linear convergence rate of averaged consensus is used in the convergence proof of the main distributed optimization algorithm.) Since averaged consensus is a particular case of the distributed Dykstra’s algorithm with every set being the whole space, it is plausible to make use of the distributed Dykstra’s algorithm to help solve constrained distributed problems.
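The averaged consensus special case mentioned above can be sketched as follows. This is a minimal illustration under assumptions of our own: scalar node values, a path graph with four nodes, and a fixed cyclic order over the edges. The name `edge_average` and the data are hypothetical. With no constraint sets, the only operation left is averaging the two values at the ends of an edge.

```python
def edge_average(values, edges, sweeps):
    """Cyclically replace the values at the two ends of each edge by their mean."""
    x = list(values)
    for _ in range(sweeps):
        for i, j in edges:
            x[i] = x[j] = 0.5 * (x[i] + x[j])
    return x


node_values = [4.0, 0.0, 2.0, 10.0]       # illustrative starting values
path_edges = [(0, 1), (1, 2), (2, 3)]      # communication only along these edges
consensus = edge_average(node_values, path_edges, sweeps=100)
target = sum(node_values) / len(node_values)
```

Each averaging step preserves the sum of the node values, so on a connected graph the common limit is the mean of the starting values, and the distance to that mean shrinks geometrically with the number of sweeps.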
1.3. Contributions of this paper
Even though we observed linear convergence rates of the distributed Dykstra’s algorithm in the numerical experiments of [Pan18b] when some of the terms are indicator functions of closed convex sets, there seems to be no theoretical justification yet of linear convergence beyond the polyhedral case, either for Dykstra’s original algorithm or for the distributed Dykstra’s algorithm. As is well known, the intersection can be sensitive to perturbations of the sets [Kru06], so additional constraint qualifications are needed for the linear convergence of the method of alternating projections (see for example [BB96]).
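To see why some constraint qualification is needed, consider the following small sketch of the method of alternating projections (not Dykstra's algorithm) applied to a line and a disk that touch at a single point. The example and all names in it are our own illustration and do not come from the paper. The two sets intersect only at the origin, and an arbitrarily small shift of the line makes the intersection empty.

```python
import math


def project_line(p):
    """Project onto C1 = {(x, y) : y = 0}."""
    return (p[0], 0.0)


def project_disk(p, center=(0.0, 1.0), radius=1.0):
    """Project onto C2, the closed disk that is tangent to C1 at the origin."""
    dx, dy = p[0] - center[0], p[1] - center[1]
    norm = math.hypot(dx, dy)
    if norm <= radius:
        return p
    return (center[0] + radius * dx / norm, center[1] + radius * dy / norm)


def alternating_projections(p, steps):
    """Record the first coordinate after each disk-then-line projection pair."""
    history = []
    for _ in range(steps):
        p = project_line(project_disk(p))
        history.append(p[0])
    return history


history = alternating_projections((1.0, 0.0), steps=1000)
```

Starting from (1, 0), one pair of projections maps (x, 0) to (x / sqrt(1 + x^2), 0), so the k-th iterate has first coordinate 1 / sqrt(k + 1). The ratio of successive distances to the intersection tends to 1, so the convergence is sublinear even though both sets are closed and convex.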
In this paper, we prove the asymptotic linear convergence of the distributed Dykstra’s algorithm when the functions are indicator functions of sets that are not necessarily polyhedral. We assume that the sets satisfy a property on systems of intersections of sets stronger than what is typically studied in the method of alternating projections. We also make assumptions that are closely related to conditions used to prove linear convergence in proximal algorithms.
1.4. Notation
Variables in bold, like and , typically lie in the space , while variables not in bold, like and , typically lie in . All norms shall be the 2-norm. We often use “ to represent the unit vector in a given direction. For example, .
2. Preliminaries
In this section, we lay down the preliminaries of the paper.
For each , let be defined by
[TABLE]
For each , define the halfspaces to be
[TABLE]
Since the graph is connected, the intersection of all these halfspaces is the diagonal set defined by
[TABLE]
For each , define by . The setting for the distributed Dykstra’s algorithm that is easily seen to be equivalent to (1.1) is
[TABLE]
where is such that each component of , where , is equal to . Let the dual variables be , where each . The (Fenchel) dual of (2.3) can be calculated to be
[TABLE]
Proposition 2.1.
(Sparsity) If the value in (2.4) is finite, then
(1) If , then is such that for all .
(2) If , then is such that for all , and .
Proof.
The proof is elementary and exactly the same as that in [Pan18a]. (Part (1) makes use of the fact that depends on only the -th coordinate of the input, while part (2) makes use of the fact that , and if and only if the conditions in (2) hold.) ∎
In view of Proposition 2.1, the vector for all are such that if . Letting , we let the dual function be
[TABLE]
It is easy to see that differs from (2.4) by a sign and a constant. It is known that strong duality between (2.3) and (2.4) holds (even though a dual minimizer may not exist). Minimizing allows one to find the optimal value to (2.4), and also the optimal solution to (2.3). It turns out that the only variables that need to be tracked are for all and as marked above. We shall prove that converges linearly to the optimal primal solution under some additional assumptions. We refer to the -th coordinate of as . Also, if , the projection of onto , were to be zero, then takes the minimum of zero when is the primal optimal solution and are optimal multipliers.
Here is the first set of assumptions we need to prove our linear convergence result.
Assumption 2.2.
Suppose that the following assumptions hold:
(1) Let be the optimal solution to (1.1). We assume that .
(2) The are all equal for all .
(3) (Existence of dual minimizers) There exists such that and .
(4) (Regularity of the sets ) The sets satisfy a nondegeneracy constraint qualification: There is a neighborhood of and parameters and such that if the multipliers and points are such that and for all and for all , then
[TABLE]
Let be the hyperplane for all . Assume that for all , there is some constant such that .
(5) (Graph connectedness) The (undirected) graph is connected.
(6) (Semismoothness) The sets satisfy the semismoothness property of order 2 at : For a point near , let a supporting hyperplane to at with normal be . Then we have . [We know that all convex sets satisfy the property if were replaced by .] Suppose . Since , there is a such that
[TABLE]
(7) (First order property on normals) There is a neighborhood of and such that for all , if , and , then there is a such that
[TABLE]
(8) (A linear regularity property on the normal cones) Define the set of optimal multipliers to be , where
[TABLE]
Assume there is a such that
[TABLE]
We remark about Assumption 2.2(8). The linear regularity property is usually stated as for all , but we state a weaker version of it in Assumption 2.2(8) because that is what our proof needs. The stronger linear regularity is satisfied whenever the normal cones are polyhedral (see for example [BB96, Corollary 5.26]), so this assumption is quite reasonable.
Assumption 2.2(4) is stronger than the usual transversality condition typically studied in the method of alternating projections. Now that we are working with an optimization problem (1.1) rather than a feasibility problem, it may be more appropriate to compare to the Robinson constraint qualification. We seek to study this assumption further in future work.
We make the following remark.
Remark 2.3.
(On Assumption 2.2(2)) We now show that Assumption 2.2(2) does not lose any generality. Suppose that the are not all necessarily the same. Note that , where . Thus all the can be replaced by . Note that this does not mean that the primal iterate needs to be such that all its coordinates are at the start.
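The reduction in this remark rests on a standard identity, written out below as a sketch. The symbols are assumptions on our part, since the formulas in this copy of the text are incomplete: $\bar{x}_i$ denotes the point attached to vertex $i$, $V$ is the vertex set, and $\bar{x}$ is the average of the $\bar{x}_i$.

```latex
\[
\sum_{i\in V}\tfrac{1}{2}\|x-\bar{x}_i\|^2
  \;=\; \tfrac{|V|}{2}\,\|x-\bar{x}\|^2
  \;+\; \tfrac{1}{2}\sum_{i\in V}\|\bar{x}_i-\bar{x}\|^2,
\qquad
\bar{x} \;=\; \frac{1}{|V|}\sum_{i\in V}\bar{x}_i .
\]
```

The last sum does not depend on $x$, and the positive factor $|V|$ does not change the minimizer over the intersection of the sets. Replacing every $\bar{x}_i$ by the average $\bar{x}$ therefore leaves the solution of the problem unchanged.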
We now state Algorithm 2.4, which minimizes by block coordinate minimization.
To provide some intuition to Algorithm 2.4, we mention that minimizing only one at a time for some (i.e., ) reduces (2.11) to a standard proximal problem. Minimizing only one for some (i.e., ) has the natural interpretation of averaging the -th and -th components of .
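The two kinds of block updates described above can be sketched as follows. This is a simplified, synchronous illustration and not a transcription of Algorithm 2.4: the cyclic update order, the function names, and the two-node example are our own assumptions. Each node keeps a primal copy `x[i]` and a multiplier `z[i]`. A node step is a projection (the proximal map of an indicator function) with a Dykstra-style correction, and an edge step averages the two copies at its ends. Edge multipliers are not stored, because each edge constraint is a subspace and the averaging step does not depend on them.

```python
def project_halfspace(p, a, b):
    """Project p onto the halfspace {x : <a, x> <= b}."""
    violation = sum(ai * pi for ai, pi in zip(a, p)) - b
    if violation <= 0:
        return list(p)
    scale = violation / sum(ai * ai for ai in a)
    return [pi - scale * ai for ai, pi in zip(a, p)]


def distributed_dykstra(x_bar, node_projections, edges, sweeps):
    """Cyclic node (projection) and edge (averaging) updates on local copies."""
    n = len(node_projections)
    x = [list(x_bar) for _ in range(n)]           # local primal copies
    z = [[0.0] * len(x_bar) for _ in range(n)]    # node multipliers
    for _ in range(sweeps):
        for i, proj in enumerate(node_projections):
            y = [a + b for a, b in zip(x[i], z[i])]
            x[i] = proj(y)                        # proximal (projection) step
            z[i] = [a - b for a, b in zip(y, x[i])]
        for i, j in edges:
            avg = [0.5 * (a + b) for a, b in zip(x[i], x[j])]
            x[i], x[j] = avg, list(avg)           # averaging step along an edge
    return x


# Illustrative data: node 0 holds {x : x_2 <= 0}, node 1 holds {x : x_1 + x_2 <= 0}.
node_projections = [
    lambda p: project_halfspace(p, [0.0, 1.0], 0.0),
    lambda p: project_halfspace(p, [1.0, 1.0], 0.0),
]
copies = distributed_dykstra([1.0, 1.0], node_projections, [(0, 1)], sweeps=300)
```

In this two-node example both local copies approach the origin, which is the projection of (1, 1) onto the intersection of the two halfspaces. No node ever needs the set held by the other node; information moves only through the averaging step on the edge.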
Let the function be defined by . Let be the optimal solution of (2.3). Before we prove the result, we note that using a technique in [GM89], the duality gap between the primal and dual pair (2.3) and (2.4) satisfies
[TABLE]
The strategy behind our linear convergence proof is to show that the duality gap in the first line of (2.19) converges linearly to zero, which will force the last formula of (2.19) to converge linearly to zero, which in turn shows the linear convergence of to . Note that since , throughout, and for all , the first line of (2.19) can be simplified to be the in (2.5).
We make another set of assumptions on Algorithm 2.4 that will allow us to prove our linear convergence result.
Assumption 2.5.
For Algorithm 2.4, we assume that:
(1) For all and , there is a such that .
(2) .
Our plan is to prove the main result in Section 3 with Assumption 2.5(2) first, then remove it in Section 4.
3. Main result
In this section, we state and prove the main theorem on linear convergence of the distributed Dykstra’s algorithm. Our proof is split into three cases. For the first two cases, the proof in this section does not rely on Assumption 2.5(2). For the third case, we first prove our result under Assumption 2.5(2). We then show how to lift this assumption in Section 4.
Theorem 3.1.
(Linear convergence of dual value) Suppose Assumptions 2.2 and 2.5 hold. For Algorithm 2.4, there is a constant such that . Together with (2.19), this implies that the distance converges linearly to zero for all .
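A statement of this kind can be checked empirically on recorded error values. The helper below is a hypothetical diagnostic of our own and is not part of the paper: it computes the ratios of successive errors, which stay bounded away from 1 for a linearly convergent sequence and approach 1 for a sublinear one.

```python
def contraction_ratios(errors):
    """Return errors[k + 1] / errors[k] for every k with errors[k] > 0."""
    return [b / a for a, b in zip(errors, errors[1:]) if a > 0]


# Synthetic sequences used only to illustrate the two behaviours.
linear_errors = [0.5 ** k for k in range(20)]
sublinear_errors = [(k + 1) ** -0.5 for k in range(2000)]

linear_ratios = contraction_ratios(linear_errors)
sublinear_ratios = contraction_ratios(sublinear_errors)
```

For the geometric sequence every ratio equals 0.5, while for the second sequence the ratios increase toward 1. Applied to dual values or distances logged from a run, the largest ratio over the tail of the sequence gives a rough estimate of the contraction constant.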
We need positive parameters , and to be small enough so that they satisfy , and (3.53), where and are defined in (3) and (3), and the other constants are described in Assumption 2.2 and in the course of the proof. It is easy to see that the parameters , and can be chosen to satisfy these conditions.
The first two cases of the proof of Theorem 3.1 are easier than the third case. To simplify notation, we write simply as for all and , and the “” is dropped from all other variables as well. Let be the -th coordinate of , and let be , the -th coordinate of . If for some , then and are the solutions to the primal dual pair of subproblems
[TABLE]
By adjusting (3.1d), we can easily check that is the minimizer of
[TABLE]
Proof of cases 1 and 2 of Theorem 3.1.
The proof is split into 3 cases:
Case 1:
Let be . We have
[TABLE]
Also,
[TABLE]
We can assume that at index , we have and for all . We have 2 cases.
Case 1a: ,
In this case,
[TABLE]
Recall that . Since , we also have for all and for all . We have
[TABLE]
Case 1b:
Note that is an estimate of the decrease of the dual objective value. We choose so that . We have
[TABLE]
We then have
[TABLE]
Then
[TABLE]
We then have
[TABLE]
Case 2: , and .
In this case, note that . Since and are hyperplanes, there is some such that . Let be such that , and let be such that , which exists by Assumption 2.5(1). We then have
[TABLE]
Now,
[TABLE]
We have , so
[TABLE]
Hence we are done. ∎
This leaves us with Case 3, i.e.,
Case 3: , and .
By the definition of in (2.2), all components of are equal to some value, which we call . Then we have the inequalities
[TABLE]
We have
[TABLE]
and
[TABLE]
We now show that there is a constant such that . We have , where , and is the index such that for all such that . If we have , then
[TABLE]
This would then give us . The parameter can be chosen large enough so that the coefficient of is greater than , which once again leads to the conclusion in Theorem 3.1. Therefore, we shall assume
[TABLE]
throughout. We now assume Assumption 2.5(2), and let and be and respectively.
Proof of case 3 of Theorem 3.1.
We consider and , where . Recall defined as the set of optimal multipliers defined in Assumption 2.2(8). Let be
[TABLE]
where . Let be . Let be the direction . There are two subcases to consider.
Case 3a:
Since , we have
[TABLE]
We would be projecting onto for all . Let an outer approximation of be
[TABLE]
Since , we have , and so . By the design of , we have . Since and , we have . Proposition 2.1(2) implies that . So we have
[TABLE]
Hence there is some such that
[TABLE]
Then we move ahead with this (without labeling it as to save notation).
Since , we have . Note that
[TABLE]
so
[TABLE]
Let be the formula marked above. Let . We have
[TABLE]
We then project onto . Suppose is close enough to so that . Then
[TABLE]
If we assume that is close enough to so that , then , and so
[TABLE]
This means that does not satisfy the second inequality in the definition of in (3.46), so at least one of the inequalities there must be active at . We let the point be .
Claim.
Recall that . Let be , which is checked to be greater than . Suppose are chosen small enough so that the following conditions hold:
[TABLE]
Then .
We now prove the claim. For , there are three different cases.
Case 3a-1: Only the constraint in (3.46) is active at .
If that active constraint is , then by the KKT conditions, would be of the form , and hence
[TABLE]
Then
[TABLE]
Case 3a-2: Both constraints in (3.46) are active at .
Step 1: Bounding .
For all , we have
[TABLE]
The projection of onto is , where is as defined in (2.9d) and . This means that
[TABLE]
For the parameters , we note from Proposition 2.1 that , for all and , which gives
[TABLE]
Recall that , and Assumption 2.2(7). This gives and
[TABLE]
So
[TABLE]
Hence, for all , we have
[TABLE]
Also,
[TABLE]
Since , Assumption 2.2(7) shows us that . Note that for any ,
[TABLE]
We thus have
[TABLE]
Step 2: Showing is large enough.
Since both constraints in (see (3.46)) are tight at , the projection of onto is equivalent to the projection of onto . We have
[TABLE]
Note that by the KKT conditions, for some . So
[TABLE]
Then we have
[TABLE]
Also,
[TABLE]
Note that is the deflection of along the normal , i.e., for some . Moreover, we have since the two constraints in the definition of in (3.46) are tight. The distance of must be at least
[TABLE]
which concludes the proof for this case.
Case 3a-3: Only the constraint in (3.46) is active at .
We now show that this case is impossible by showing that and cannot hold at the same time. We have . By the nonexpansiveness of the projection operation, we have
[TABLE]
Define to be the point such that and . Note that is of the form with . Further arithmetic gives us
[TABLE]
Now
[TABLE]
Also, . Since can be made arbitrarily small by (3.43) and , we can assume that there is an such that throughout. So
[TABLE]
By the KKT conditions, the point has the form for some . We show that points of the form , where , cannot satisfy both and at the same time. Since is a multiple of , we can prove our results for points of the form . Now,
[TABLE]
In view of , (3.95) and (3.98), and the fact that , we can choose small enough so that . So if , then , which implies that . This completes the proof of the claim.
Let the minimizer of be . It is standard to obtain . We have
[TABLE]
Note that . Also, . Therefore
[TABLE]
This once again leads to linear convergence.
Case 3b: .
For each , define the hyperplanes , and by
[TABLE]
Recall that by Assumption 2.2(4), are big enough so that is always outside , so that is onto the boundary of (and not in the interior of ). Recall that the dual vectors after the projection are . The term in the definition of implies that is a supporting hyperplane of at with normal vector . Due to the fact that the dual function is decreasing, we have for all , so
[TABLE]
If a point is on , then the distance of the supporting hyperplane of at to the origin is by Assumption 2.2(3). (We actually have , but is enough for this part of the proof.) So we have . Since , the term is , for any , we have for all if is close enough to , which gives
[TABLE]
We have
[TABLE]
We have . Also, and , which leads us to . Recall can be arbitrarily small by (3.76). Note also that , and the latter can be arbitrarily small. Also, by Assumption 2.2(4), so can be arbitrarily small. Thus we can make . So
[TABLE]
Next, by Assumption 2.2(1), we have
[TABLE]
We have
[TABLE]
We have
[TABLE]
Since are chosen so that , we have
[TABLE]
This leads to linear convergence like in the last three lines of (3.111). ∎
4. Lifting Assumption 2.5(2)
In this section, we show how to adjust the proof of the main result in Section 3 so that Assumption 2.5(2) can be lifted. We let and be what they were in the proof of Theorem 3.1 in Section 3. We shall treat case 3a first, and then explain the similarities in case 3b.
We can assume that there is an index such that for all (which implies ) and . Let the operator be . Define as
[TABLE]
Note also that . Since is a monotone operator, the operator is nonexpansive (see for example the textbook [BC11]), which gives . We have
[TABLE]
Then
[TABLE]
The same steps as (3.109) lead us to
[TABLE]
Once again, steps similar to (3.111) give
[TABLE]
The adjustments for case 3b are similar, except that the set is set to be , and and can be replaced by and respectively.
References
- [ACP+17] F. Abboud, E. Chouzenoux, J.-C. Pesquet, J.-H. Chenot, and L. Laborelli, Dual block-coordinate forward-backward algorithm with application to deconvolution and deinterlacing of video sequences, Journal of Mathematical Imaging and Vision 59 (2017), no. 3, 415–431.
- [AH16] N.S. Aybat and E.Y. Hamedani, A primal-dual method for conic constrained distributed optimization problems, Advances in Neural Information Processing Systems 29, Curran Associates, Red Hook, NY, 2016, pp. 5049–5057.
- [BB96] H.H. Bauschke and J.M. Borwein, On projection algorithms for solving convex feasibility problems, SIAM Rev. 38 (1996), 367–426.
- [BC11] H.H. Bauschke and P.L. Combettes, Convex analysis and monotone operator theory in Hilbert spaces, Springer, 2011.
- [BD85] J.P. Boyle and R.L. Dykstra, A method for finding projections onto the intersection of convex sets in Hilbert spaces, Advances in Order Restricted Statistical Inference, Lecture Notes in Statistics, Springer, New York, 1985, pp. 28–47.
- [CDV10] P.L. Combettes, D. Dũng, and B.C. Vũ, Dualization of signal recovery problems, Set-Valued and Variational Analysis 18 (2010), 373–404.
- [CDV11] P.L. Combettes, D. Dũng, and B.C. Vũ, Proximity for sums of composite functions, Journal of Mathematical Analysis and Applications 380 (2011), no. 2, 680–688.
- [DH94] F. Deutsch and H. Hundal, The rate of convergence of Dykstra’s cyclic projections algorithm: the polyhedral case, Numer. Funct. Anal. Optimiz. 15 (1994), no. 5-6, 536–565.
