Distributed Nesterov gradient methods over arbitrary graphs
Ran Xin, Dusan Jakovetic, Usman A. Khan

TL;DR
This paper introduces a novel distributed Nesterov gradient method that operates over arbitrary graphs without requiring doubly-stochastic weights, achieving accelerated convergence compared to existing methods.
Contribution
The paper proposes the BN method that works with row- and column-stochastic weights, and a FROZEN variant that only needs row-stochastic weights, broadening applicability.
Findings
Achieves acceleration over state-of-the-art distributed optimization methods.
Works on arbitrary strongly-connected graphs without doubly-stochastic weights.
FROZEN variant reduces communication requirements at the cost of extra iterations.
Abstract
In this letter, we introduce a distributed Nesterov method, termed as , that does not require doubly-stochastic weight matrices. Instead, the implementation is based on a simultaneous application of both row- and column-stochastic weights that makes this method applicable to arbitrary (strongly-connected) graphs. Since constructing column-stochastic weights needs additional information (the number of outgoing neighbors at each agent), not available in certain communication protocols, we derive a variation, termed as FROZEN, that only requires row-stochastic weights but at the expense of additional iterations for eigenvector learning. We numerically study these algorithms for various objective functions and network parameters and show that the proposed distributed Nesterov methods achieve acceleration compared to the current state-of-the-art methods for distributed…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Distributed Nesterov gradient methods
over arbitrary graphs
Ran Xin, Dušan Jakovetić, and Usman A. Khan
RX and UAK are with the Department of Electrical and Computer Engineering, Tufts University, USA. {ran.xin@,khan@ece.}tufts.edu DJ is with the Department of Mathematics and Informatics, Faculty of Science, University of Novi Sad, Serbia. [email protected]
Abstract
In this letter, we introduce a distributed Nesterov method, termed as , that does not require doubly-stochastic weight matrices. Instead, the implementation is based on a simultaneous application of both row- and column-stochastic weights that makes this method applicable to arbitrary (strongly-connected) graphs. Since constructing column-stochastic weights needs additional information (the number of outgoing neighbors at each agent), not available in certain communication protocols, we derive a variation, termed as FROZEN, that only requires row-stochastic weights but at the expense of additional iterations for eigenvector learning. We numerically study these algorithms for various objective functions and network parameters and show that the proposed distributed Nesterov methods achieve acceleration compared to the current state-of-the-art methods for distributed optimization.
I Introduction
Distributed optimization has recently seen a surge of interest particularly with the emergence of modern signal processing and machine learning applications. A well-studied problem in this domain is finite sum minimization that also has some relevance to empirical risk formulations, i.e.,
[TABLE]
where each is a smooth and convex function available at an agent . Since the ’s depend on data that may be private to each agent and communicating large data is impractical, developing distributed solutions of the above problem have attracted a strong interest. Related work has been a topic of significant research in the areas of signal processing and control [1, 2, 3, 4], and more recently has also found coverage in the machine learning literature [5, 6, 7, 8, 9, 10].
Since the focus is on distributed implementation, the information exchange mechanism among the agents becomes a key ingredient of the solutions. Such inter-agent information exchange is modeled by a graph and significant work has focused on algorithm design under various graph topologies. The associated algorithms require two key steps: (i) consensus, i.e., reaching agreement among the agents; and, (ii) optimality, i.e., showing that the agreement is on the optimal solution. Naturally, consensus algorithms have been predominantly used as the basic building block of distributed optimization on top of which a gradient correction is added to steer the agreement to the optimal solution. Initial work thus follows closely the progress achieved in the consensus algorithms and extensions to various graph topologies, see e.g., [11, 12, 13, 5, 6, 14, 15].
Early work on consensus assumes doubly-stochastic (DS) weights [16, 17], which require the underlying graphs to be undirected (or balanced) since both incoming and outgoing weights must sum to . The subsequent work on optimization over undirected graphs includes [12] where the convergence is sublinear and [18, 19, 20] with linear convergence. For directed (and unbalanced) graphs, it is not possible to construct DS weights, i.e., the weights can be chosen such that they sum to either only on incoming edges or only on outgoing edges. Optimization over digraphs [21, 22, 23, 24, 25, 26, 27, 28] thus has been built on consensus with non-DS weights [29, 30, 31]. Required now is a division with additional iterates that learn the non- (where is a vector of all ’s) Perron eigenvector of the underlying weight matrix, see [23, 25, 26] for details. Such division causes significant conservatism and stability issues [32].
Recently, we introduced the algorithm that removes the need of eigenvector learning by utilizing both row-stochastic (RS) and column-stochastic (CS) weights, simultaneously, [33]. The algorithm thus is applicable to arbitrary strongly-connected graphs. The intuition behind using both sets of weights is as follows: Let be RS and be CS, with and , in addition to being primitive. From Perron-Frobenius theorem, we have that and . Clearly, using or alone makes an algorithm dependent on the non- Perron eigenvector ( or ) and thus the need for the aforementioned division by the iterates learning this eigenvector. Using and simultaneously, the asymptotics of are driven by, loosely speaking, , which recovers the consensus matrix, , without any scaling. It is shown in [33] that converges linearly to the optimal for smooth and strongly-convex functions.
In this letter, we study accelerated optimization over arbitrary graphs by extending with Nesterov’s momentum. We first propose that uses both RS and CS weights. Construct CS weights requires each agent to know at least its out-degree, which may not be possible in broadcast-type communication scenarios. To address this challenge, we provide an alternate algorithm, termed as FROZEN, that only uses RS weights. We show that FROZEN can be derived from with the help of a simple state transformation. Finally, we note that a rigorous theoretical analysis is beyond the scope of this letter and we present extensive simulations to highlight and verify different aspects of the proposed methods.
We now describe the rest of this paper. Section II formulates the problem and recaps the algorithm. Section III describes the two methods, and FROZEN, and Section IV provides simulations comparing the proposed methods with the state-of-the-art in distributed optimization over both convex and strongly-convex functions, and over various digraphs.
II Problem Formulation and Preliminaries
Consider agents connected over a digraph, , where is the set of agents and is the collection of edges, , such that . We define as the collection of in-neighbors of agent , i.e., the set of agents that can send information to agent . Similarly, is the set of out-neighbors of agent . Note that both and include node . The agents solve the following unconstrained optimization problem:
[TABLE]
where each is private to agent . We formalize the set of assumptions as follows.
Assumption 1**.**
The graph, , is strongly-connected.
Assumption 2**.**
Each local objective, , is -strongly-convex, , i.e., and , we have
[TABLE]
Assumption 3**.**
Each local objective, , is -smooth, i.e., its gradient is Lipschitz-continuous: and , we have, for some ,
[TABLE]
Let be the class of functions satisfying Assumption 3 and let be the class of functions that satisfy both Assumptions 2 and 3; note that . In this letter, we propose distributed algorithms to solve Problem P1 for both function classes, i.e., and . We assume that the underlying optimization is solvable in the class .
II-A Centralized Optimization: Nesterov’s Method
The gradient descent algorithm is given by
[TABLE]
where is the iteration and is the step-size. It is well known [34, 35] that the oracle complexity of this method to achieve an -accuracy is for the function class and for the function class , where is the condition number of the objective function, . There are gaps between the lower oracle complexity bounds of the function class and , and the upper complexity bounds of gradient descent [35]. This gap is closed by the seminal work [35] by Nesterov, which accelerates the convergence of the gradient descent by adding a certain momentum to gradient descent. The centralized Nesterov’s method [35] iteratively updates two variables , initialized arbitrarily with , as follows:
[TABLE]
where is the momentum parameter. For the function class , choosing leads to an optimal oracle complexity of , while for the function class , results into an optimal oracle complexity of .
II-B Distributed Optimization: The algorithm
When the objective functions are not available at a central location, distributed solutions are required to solve Problem P1. Most existing work [1, 2, 3, 11, 12, 13, 14, 18, 19, 20] is restricted to undirected graphs, since the weights assigned to neighboring agents must be doubly-stochastic. The work on directed graphs [21, 22, 25, 26, 27, 28] is largely based on push-sum consensus [29, 30] that requires eigenvector learning. Recently, algorithm was introduced in [33] that does not require eigenvector learning by utilizing a novel approach to deal with the non-doubly-stochasticity in digraphs.
We now describe the algorithm: Consider two distinct sets of weights, and , at each agent such that
[TABLE]
[TABLE]
In other words, the weight matrix, , is row-stochastic, while is column-stochastic. It is straightforward to note that the construction of row-stochastic weights, , is trivial as it each agent on its own assigns arbitrary weights to incoming information (from agents in ) such that these weights sum to . The construction of column-stochastic weights is more involved as it requires that all outgoing weights at agent must sum to and thus cannot be assigned on incoming information. The simplest way to obtain such weights is for each agent to transmit to its outgoing neighbors in . This strategy, however, requires the knowledge of the out-degree at each agent .
With the help of the row- and column-stochastic weights, we can now describe the algorithm as follows [33]:
[TABLE]
where is arbitrary and . We explain the above algorithm in the following. Eq. (2a) essentially is gradient descent where the descent direction is , instead of as used in the earlier methods [12, 24]. Eq. (2b), on the other hand, is gradient tracking, i.e., , and thus Eq. descends in the global direction, asymptotically. It is shown in [33] that converges linearly to the optimal solution for the function class .
The algorithm for undirected graphs where both weights are doubly-stochastic was studied earlier in [18, 19, 26]. It is shown in [19] that the oracle complexity with doubly-stochastic weights is . Extensions of include: non-coordinated step-sizes and heavy-ball momentum [32]; time-varying graphs [36, 37]; analysis for non-convex functions [38]. Related work on distributed Nesterov-type methods can be found in [39, 40, 41], which is restricted to undirected graphs. There is no prior work on Nesterov’s method that is applicable to arbitrary strongly-connected graphs.
III Distributed Nesterov Gradient Methods
In this section, ww introduce two distributed Nesterov gradient methods, both of which are applicable to arbitrary, strongly-connected, graphs.
III-A The algorithm
Each agent, , maintains three variables: , and , all in , where and are the local estimates of the global minimizer and is used to track the average gradient. The algorithm is described in Algorithm 1.
A valid choice for ’s at each is to choose them as , which does not require knowing the outgoing nodes but only the out-degree. For the function class , is a constant; for the function class , we choose .
III-B The FROZEN algorithm
Note that is restricted to communication protocols that allow column-stochastic weights, ’s. When this is not possible, it is desirable to have algorithms that only use row-stochastic weights. Row-stochasticity is trivially established at the receiving agent by assigning a weight to each incoming information such that the sum of weights is . To avoid CS weights altogether, we now develop a distributed Nesterov gradient method that only row-stochastic weights and show the procedure of constructing this new algorithm from .
To this aim, we first write in the vector-matrix form. Let , , and denote the concatenated vectors with ’s, ’s, ’s, and ’s, respectively. Then can be compactly written follows:
[TABLE]
where and , where is the Kronecker. Since is already row-stochastic, we seek a transformation that makes a row-stochastic matrix. Since is column-stochastic, we denote its left and right Perron eigenvectors as and . Let denote a matrix with on its main diagonal. With the help of , we define a state transformation, , and rewrite as follows:
[TABLE]
where can be easily verified to be row-stochastic. Since is the right Perron vector of , it is not locally known to any agent and thus the above equations are not practically possible to implement. We thus add an independent eigenvector learning algorithm to the above set equations and obtain FROZEN (Fast Row-stochastic OptimiZation with Nesterov’s momentum) described in Algorithm 2. The momentum parameter is chosen the same way as in .
In the above algorithm, is a vector of zeros with a at the th location and denotes the th element of a vector. We note that although the weight assignment in FROZEN is straightforward, this flexibility comes at a price:
(i) each agent must maintain an additional -dimensional vector, ;
(ii) additional iterations are required for eigenvector learning in Eq. (6b); and,
(iii) the initial condition requires each agent to have and know a unique identifier.
However, as discussed earlier, may not be applicable in some communication protocols and thus, FROZEN may be the only algorithm available. Finally, we note that when , FROZEN reduces to FROST whose detailed analysis and a linear convergence proof can be found in [27, 28].
**Generalizations and extensions: **The method we described to convert to FROZEN leads to another variant of with only CS weights, see [33] for details. The resulting methods add Nesterov’s momentum to ADDOPT and Push-DIGing [25, 26]. Since these variants only require CS weights, and are preferable due to their faster convergence. It is further straightforward to conceive a time-varying implementation of and FROZEN over gossip based protocols or random graphs, see e.g., the related work in [36, 37] on non-accelerated methods. Asynchronous schemes may also be derived following the methodologies studied in [42, 43]. Finally, we note that a rigorous theoretical analysis of and is beyond the scope of this letter. We thus rely on simulations to highlight and verify different aspects of the proposed methods.
IV Numerical Results
In this section, we numerically verify the convergence of the proposed algorithms, and FROZEN, in this letter, and compare them with well-known solutions for distributed optimization. To this aim, we generate strongly-connected digraphs with nodes using nearest-neighbor rules. We use an uniform weighting strategy to generate the row- and column-stochastic weight matrices, i.e., and . We first compare and FROZEN with the following methods over digraphs: ADDOPT/Push-DIGing [25, 26], FROST [28], and [33]. For comparison, we plot the average residual: .
IV-A Strongly-convex case
We first consider a distributed binary classification problem using logistic loss: each agent has access to training samples, , where contains features of the th training data at agent , and is the corresponding binary label. The agents cooperatively minimize , where are the optimization variables to learn the separating hyperplane, with each being
[TABLE]
In our setting, the feature vectors, ’s, are generated from a Gaussian distribution with zero mean. The binary labels are generated from a Bernoulli distribution. We set and . The results are shown in Fig. 1. Although FROZEN is slower than , it is applicable broadcast-based protocols as it only requires row-stochastic weights. The step-size and momentum parameters are manually chosen to obtain the best performance for each algorithm.
IV-B Non strongly-convex case
We next choose the objective functions, ’s, to be smooth, convex but not strongly-convex. In particular, , where ’s are randomly generated, , and is chosen as follows:
[TABLE]
It can be verified that is not strongly-convex as . The results are shown in Fig. 2 where the momentum parameter is chosen as and other parameters are manually optimized.
IV-C Influence of graph sparsity
Finally, we study the influence of graph sparsity with the help of the logistic regression problem discussed earlier. We fix the number of nodes to and randomly generate three nearest-neighbor digraphs, , and , with decreasing sparsity, see Fig. 3 (Top). In Fig. 3 (Bottom), we compare the performance of the proposed methods with centralized Nesterov over the three graphs. It can be verified that and FROZEN approach centralized Nesterov method as the graphs become dense. FROZEN, however, is much slower than because it additionally requires eigenvector learning.
V Conclusions
In this letter, we present accelerated methods for optimization based on Nesterov’s momentum over arbitrary, strongly-connected, graphs. The fundamental algorithm, , uses both row- and column-stochastic weights, simultaneously, to achieve agreement and optimality. We then derive a variant from , termed as FROZEN, that only uses row-stochastic weights and thus is applicable to a larger set of communication protocols, however, at the expense of eigenvector learning, thus resulting into slower convergence. Although a theoretical analysis is beyond the scope of this letter, we provide an extensive set of numerical results to study the behavior of the proposed methods for both convex and strongly-convex cases.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] M. Rabbat and R. Nowak, “Distributed optimization in sensor networks,” in 3rd International Symposium on Information Processing in Sensor Networks , Berkeley, CA, Apr. 2004, pp. 20–27.
- 2[2] J. Chen and A. H. Sayed, “Diffusion adaptation strategies for distributed optimization and learning over networks,” IEEE Trans. on Signal Processing , vol. 60, no. 8, pp. 4289–4305, Aug. 2012.
- 3[3] A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro, “A decentralized second-order method with exact linear convergence rate for consensus optimization,” IEEE Trans. on Signal and Information Processing over Networks , vol. 2, no. 4, pp. 507–522, 2016.
- 4[4] S. Safavi, U. A. Khan, S. Kar, and J. M. F. Moura, “Distributed localization: A linear theory,” Proceedings of the IEEE , vol. 106, pp. 1204–1223, Jul. 2018.
- 5[5] P. A. Forero, A. Cano, and G. B. Giannakis, “Consensus-based distributed support vector machines,” Journal of Machine Learning Research , vol. 11, no. May, pp. 1663–1707, 2010.
- 6[6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundation and Trends in Maching Learning , vol. 3, no. 1, pp. 1–122, Jan. 2011.
- 7[7] H.-T. Wai, Z. Yang, Z. Wang, and M. Hong, “Multi-agent reinforcement learning via double averaging primal-dual optimization,” ar Xiv preprint ar Xiv:1806.00877 , 2018.
- 8[8] X. Lian, C. Zhang, H. Zhang, C. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in Advances in Neural Information Processing Systems , 2017, pp. 5330–5340.
