Distributed Dual Coordinate Ascent in General Tree Networks and Communication Network Effect on Synchronous Machine Learning

Myung Cho; Lifeng Lai; Weiyu Xu

arXiv:1703.04785·cs.DC·February 19, 2021

TL;DR

This paper analyzes the convergence rate of distributed dual coordinate ascent algorithms in general tree networks, considering communication delays, and optimizes the algorithm to improve convergence speed in large-scale distributed machine learning.

Contribution

It generalizes distributed dual coordinate ascent to tree networks, provides convergence rate analysis, and optimizes the algorithm considering communication delays.

Findings

1. Convergence rate can be recursively characterized in tree networks.
2. Optimizing local iterations based on communication delays improves convergence.
3. The algorithm is effective in scenarios with communication constraints.

Equations

\displaystyle\underset{{\bm{w}}\in{\mathbb{R}}^{d}}{\text{minimize}}\;P({\bm{w}})\triangleq\frac{\lambda}{2}\|{\bm{w}}\|^{2}+\frac{1}{m}\sum_{i=1}^{m}\ell_{i}({\bm{w}}^{T}{\bm{x}}_{i}),

\displaystyle\underset{{\bm{\alpha}}\in{\mathbb{R}}^{m}}{\text{maximize}}\;D({\bm{\alpha}})\triangleq-\frac{\lambda}{2}\|{\bm{A}}{\bm{\alpha}}\|^{2}-\frac{1}{m}\sum_{i=1}^{m}\ell_{i}^{*}(-\alpha_{i}),

\displaystyle{\mathbb{E}}[D({\bm{\alpha}}^{\star})-D({\bm{\alpha}}^{(T)})]\leq\big{(}1-(1-\Theta)\frac{1}{K}\frac{\lambda m\gamma}{\rho+\lambda m\gamma}\big{)}^{T}\big{(}D({\bm{\alpha}}^{\star})-D({\bm{\alpha}}^{(0)})\big{)},

\displaystyle\rho\geq\rho_{min}\triangleq\underset{{\bm{\alpha}}\in{\mathbb{R}}^{m}}{\text{maximize}}\;\lambda^{2}m^{2}\frac{\sum_{k=1}^{K}\|{\bm{A}}_{[k]}{\bm{\alpha}}_{[k]}\|^{2}-\|{\bm{A}}{\bm{\alpha}}\|^{2}}{\|{\bm{\alpha}}\|^{2}}\geq 0.

\displaystyle\Theta=(1-s/\tilde{m})^{H},

\displaystyle\epsilon_{Q,k}({\bm{\alpha}})\triangleq\underset{\hat{{\bm{\alpha}}}_{[Q;k]}}{\text{maximize}}\;D({\bm{\alpha}}_{[Q;1]},...,\hat{{\bm{\alpha}}}_{[Q;k]},...,{\bm{\alpha}}_{[Q;K]},{\bm{\alpha}}_{\overline{Q}})-D({\bm{\alpha}}_{[Q;1]},...,{\bm{\alpha}}_{[Q;k]},...,{\bm{\alpha}}_{[Q;K]},{\bm{\alpha}}_{\overline{Q}}).

\displaystyle{\mathbb{E}}[\epsilon_{Q,k}({\bm{\alpha}}_{[Q;1]},...,{\bm{\alpha}}_{[Q;k-1]},{\bm{\alpha}}_{[Q;k]}+\bigtriangleup{\bm{\alpha}}_{[Q;k]},...,{\bm{\alpha}}_{[Q;K]},{\bm{\alpha}}_{\overline{Q}})]\leq\Theta_{i+1}\cdot\epsilon_{Q,k}({\bm{\alpha}}).

\displaystyle\Theta_{p}=\big{(}1-\frac{\lambda m\gamma}{1+\lambda m\gamma}\frac{1}{{m_{B}}}\big{)}^{T_{p}}.

\displaystyle{\mathbb{E}}[D({\bm{\alpha}}_{Q}^{*},{\bm{\alpha}}_{\overline{Q}})-D({\bm{\alpha}}_{Q}^{(T_{i})},{\bm{\alpha}}_{\overline{Q}})]\leq\big{(}1-(1-\Theta_{i+1})\frac{1}{K_{i}}\frac{\lambda m\gamma}{\rho_{i}+\lambda m\gamma}\big{)}^{T_{i}}\big{(}D(\bm{\alpha}_{Q}^{*},\bm{\alpha}_{\overline{Q}})-D({\bm{\alpha}}_{Q}^{(0)},{\bm{\alpha}}_{\overline{Q}})\big{)},

\displaystyle\rho_{i}\geq\rho_{min}\triangleq\underset{{\bm{\alpha}}_{Q}\in{\mathbb{R}}^{|Q|}}{\text{maximize}}\;\lambda^{2}m^{2}\frac{\sum_{k=1}^{K_{i}}\|{\bm{A}}_{[Q;k]}{\bm{\alpha}}_{[Q;k]}\|^{2}-\|{\bm{A}}_{Q}{\bm{\alpha}}_{Q}\|^{2}}{\|{\bm{\alpha}}_{Q}\|^{2}}\geq 0.

\displaystyle\Theta_{i}=\bigg{(}1-(1-\Theta_{i+1})\frac{C_{i}}{K_{i}}\bigg{)}^{T_{i}},

Θ_{0}

Θ_{i}

Θ_{0} \approx

\displaystyle t_{total}=(t_{lp}T_{p}+t_{delay}+t_{cp})\cdot T_{0}.

\displaystyle T_{0}=\frac{t_{total}}{t_{lp}T_{p}+t_{delay}+t_{cp}}.

\displaystyle\bigg{(}1-\big{(}1-[1-\delta]^{T_{p}}\big{)}\frac{C}{K}\bigg{)}^{T_{0}},

\displaystyle\underset{T_{p}\geq 0}{\text{minimize}}\;F(T_{p})\triangleq\bigg{(}1-\big{(}1-\big{[}1-\delta\big{]}^{T_{p}}\big{)}\frac{C}{K}\bigg{)}^{\frac{t_{total}}{t_{lp}T_{p}+t_{delay}+t_{cp}}}.

\displaystyle\ln F(T_{p})=\underbrace{\frac{t_{total}/t_{lp}}{T_{p}+(t_{delay}+t_{cp})/t_{lp}}}_{(A)}\underbrace{\ln\bigg{(}\frac{K-C}{K}+\frac{C}{K}\big{[}1-\delta\big{]}^{T_{p}}\bigg{)}}_{(B)}.

\displaystyle\frac{d\ln F(T_{p})}{dT_{p}}=

\displaystyle\underbrace{\frac{K-C}{K}\big{(}T_{p}+r\big{)}\big{[}1-\delta\big{]}^{T_{p}}\ln(1-\delta)}_{(C)}-\underbrace{\bigg{(}\frac{K-C}{K}+\frac{C}{K}\big{[}1-\delta\big{]}^{T_{p}}\bigg{)}\ln\bigg{(}\frac{K-C}{K}+\frac{C}{K}\big{[}1-\delta\big{]}^{T_{p}}\bigg{)}}_{(D)}=0.

\displaystyle\frac{K-C}{K}\big{(}T_{p}+r\big{)}\big{[}1-\delta\big{]}^{T_{p}}\ln(1-\delta)=\frac{K-C}{K}\ln\bigg{(}\frac{K-C}{K}\bigg{)}.

\displaystyle T_{p}=\frac{1}{\ln(1-\delta)}W\bigg{(}\big{[}1-\delta\big{]}^{r}\ln\bigg{(}\frac{K-C}{K}\bigg{)}\bigg{)}-r.

\displaystyle\underset{{\bm{w}}\in{\mathbb{R}}^{d}}{\text{minimize}}\;\frac{\lambda}{2}\|{\bm{w}}\|^{2}+\frac{1}{m}\|{\bm{A}}^{T}{\bm{w}}-{\bm{y}}\|^{2},

\displaystyle\underset{{\bm{\alpha}}\in{\mathbb{R}}^{m}}{\text{maximize}}\;-\frac{\lambda}{2}||{\bm{A}}{\bm{\alpha}}||^{2}-\lambda^{2}m\sum_{i=1}^{m}\bigg{(}\frac{\alpha_{i}^{2}}{4}-\frac{y_{i}\alpha_{i}}{\lambda m}\bigg{)}.

\displaystyle\bigtriangleup\alpha=-\bigg{(}\frac{||{\bm{x}}_{i}||^{2}}{\lambda m}+\frac{\lambda^{2}m^{2}}{2}\bigg{)}^{-1}\bigg{(}{{\bm{w}}^{(h-1)}}^{T}{\bm{x}}_{i}+\frac{\lambda^{2}m^{2}}{2}\alpha_{i}^{(h-1)}-\lambda my_{i}\bigg{)},

\displaystyle\underset{{\bm{w}}\in{\mathbb{R}}^{d}}{\text{minimize}}\;\frac{\lambda}{2}\|{\bm{w}}\|^{2}+\frac{1}{m}\sum_{i=1}^{m}\max(0,1-{\bm{A}}^{T}{\bm{w}}),

\displaystyle\underset{{\bm{\alpha}}\in{\mathbb{R}}^{m}}{\text{maximize}}\;\sum_{i=1}^{m}\alpha_{i}-\frac{1}{2\lambda}\|{\bm{A}}{\bm{\alpha}}\|^{2}\quad\text{subject to}\;0\leq\alpha_{i}\leq\frac{1}{m},\;\forall i.

\displaystyle\underset{{\bm{\alpha}}_{Q}\in{\mathbb{R}}^{|Q|}}{\text{maximize}}\;-\frac{\lambda}{2}\Big\|\overline{{\bm{w}}}+\frac{1}{\lambda}{\bm{A}}_{Q}{\bm{\alpha}}_{Q}\Big\|^{2}+\sum_{i\in Q}\alpha_{i}+\sum_{i\in\overline{Q}}\alpha_{i}\quad\text{subject to}\;0\leq\alpha_{i}\leq\frac{1}{m},\;\forall i\in Q,

\displaystyle\bigtriangleup\alpha=
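The objective $F(T_{p})$ listed above trades local computation against communication: more local iterations $T_{p}$ improve each round but leave time for fewer rounds $T_{0}$. The following is a minimal sketch that evaluates $\ln F(T_{p})$ on an integer grid and picks the minimizer. It does not use the closed-form Lambert $W$ expression, and all numeric values ($\delta$, $C$, $K$, the timings) and the function names are illustrative assumptions, not values from the paper.

```python
import math

def log_F(T_p, delta, C, K, t_total, t_lp, t_delay, t_cp):
    """ln F(T_p): number of affordable rounds T_0 times the log of the per-round factor."""
    T_0 = t_total / (t_lp * T_p + t_delay + t_cp)
    per_round = 1.0 - (1.0 - (1.0 - delta) ** T_p) * C / K
    return T_0 * math.log(per_round)

def best_local_iterations(delta, C, K, t_total, t_lp, t_delay, t_cp, T_max=5000):
    """Integer T_p in [1, T_max] minimizing ln F (plain grid search, hypothetical helper)."""
    return min(
        range(1, T_max + 1),
        key=lambda T: log_F(T, delta, C, K, t_total, t_lp, t_delay, t_cp),
    )

# Assumed parameters: same workload, two different communication delays.
common = dict(delta=0.01, C=0.5, K=4, t_total=1e4, t_lp=1.0, t_cp=0.0)
fast_link = best_local_iterations(t_delay=1.0, **common)
slow_link = best_local_iterations(t_delay=100.0, **common)
print("T_p with short delay:", fast_link)
print("T_p with long delay: ", slow_link)
```

With these assumed numbers, the longer delay favors more local iterations per round, which matches the qualitative message of the abstract.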

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Graph Neural Networks · Privacy-Preserving Technologies in Data

Full text

Myung Cho, Lifeng Lai, and Weiyu Xu

M. Cho is with the Department of ECE, Penn State Behrend, Erie, PA 16563, USA (E-mail: [email protected]). L. Lai is with the Department of ECE, University of California, Davis, CA 95616, USA (E-mail: [email protected]). W. Xu is with the Department of ECE, University of Iowa, Iowa City, IA 52242, USA (E-mail: [email protected]).

Abstract

Because of the large size of modern datasets and the limited storage of a single computer or server, data are often stored in a distributed manner. Large-scale machine learning must therefore often be performed on distributed datasets through communication networks. In this paper, we study the convergence rate of the distributed dual coordinate ascent for distributed machine learning problems in a general tree-structured network. Since a tree network model can be understood as a generalization of a star network model, our algorithm can be thought of as a generalization of the distributed dual coordinate ascent in a star network model. First, we provide the convergence rate of the distributed dual coordinate ascent over a general tree network in a recursive manner and analyze the network effect on the convergence rate. Second, by considering network communication delays, we optimize the distributed dual coordinate ascent algorithm to maximize its convergence speed. From our analytical result, we can choose the optimal number of local iterations depending on the communication delay severity to achieve the fastest convergence speed. In numerical experiments, we consider machine learning scenarios over communication networks in which local workers cannot directly reach a central node because of communication constraints, and demonstrate the usability of our distributed dual coordinate ascent algorithm in tree networks. Additionally, we show that adapting the number of local and global iterations to network communication delays in the distributed dual coordinate ascent algorithm can improve its convergence speed.

Index Terms:

distributed machine learning, distributed dataset, machine learning over communication networks

I Introduction

In the past decade, machine learning has been driven by huge amounts of data, often simply called big data. In fields including education, finance, transportation, healthcare, engineering, and management, big data is fundamentally changing our lives and societies [1], e.g., through recommender services [2], disease diagnosis and analysis [3], or even signal recovery [4]. However, because of limited storage volume in storage servers and constraints in communication, processing big data is challenging. In particular, big data are very often collected and stored at different locations and at different times, and aggregating distributed data in one central place is expensive, inefficient, and insecure. Machine learning over wireless communication networks is a good example of these challenges: the learning process is performed by multiple decentralized devices that hold local data and communicate over wireless networks without sharing their raw data with others [5, 6]. Therefore, it is natural to consider solving large-scale machine learning problems with distributed data over communication networks in order to obtain valuable information from the distributed data.

Solving large-scale machine learning problems with distributed data over communication networks is challenging because of limited resources and obstacles such as limited communication bandwidth, limited storage volume, limited energy, and privacy and security issues. To handle distributed data with limited resources, researchers have developed and studied various algorithms in [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17] and the references therein. More specifically, synchronous Stochastic Gradient Descent (SGD) [7, 8], synchronous Stochastic Dual Coordinate Ascent (SDCA) [13, 9, 10, 11, 14], asynchronous SGD [12, 15], and asynchronous SDCA [16, 17] for distributed data have been intensively investigated in the literature. Among them, [13] reports that even though the convergence of SGD does not depend on the size of the data, SDCA can outperform SGD when relatively high solution accuracy is needed. Moreover, asynchronous updating schemes in SGD and SDCA can suffer from conflicts between intermediate results.

Motivated by these facts, [9, 10, 11] consider using synchronous SDCA to solve regularized loss minimization problems in a star network. In this scenario, data are distributed over a few local workers in the star network, and each local worker communicates with a central station. The authors in [9, 10, 11] analyze the convergence rate of the distributed SDCA in terms of communication rounds. In particular, the strengths of the distributed optimization framework proposed in [10, 11] include freedom from tuning parameters or learning rates, in contrast to SGD-based methods, and a readily computable duality gap that provides a fair stopping criterion and efficient accuracy certificates.

However, in practice, the local workers may be organized in various network topologies such as a tree, a mesh, or a ring. In wireless communication networks in particular, limited communication power and energy sometimes prevent local workers from communicating directly with a central node. In this situation, the distributed dual coordinate ascent for a star network cannot be used for distributed machine learning. If intermediate nodes are added to relay the communication from local workers to a central node, the distributed dual coordinate ascent for a star network will easily suffer from the increased latency and delay in communication. Therefore, considering communication network topologies in distributed machine learning is important, and taking advantage of the network topology may play a significant role in finding efficient solutions. It is then natural to ask how to design and analyze the distributed dual coordinate ascent over a network with general topologies beyond a star network. Additionally, since delay and latency in communication can affect the convergence speed of a distributed machine learning algorithm, it is essential to investigate how network communication delays affect the design and convergence rate of the distributed dual coordinate ascent algorithms previously introduced in [9, 10, 11] in terms of overall computational time instead of the number of communication rounds.

The authors in [18] analyzed the convergence bound in terms of time by considering communication delays in a network for a consensus optimization problem. Additionally, the works [19, 20, 21, 22, 23] used ADMM techniques to study consensus problems that are separable across workers. We remark that the regularized loss minimization problem considered in [9, 10, 11] differs from the consensus problems considered in [19, 20, 21, 22, 23] in terms of separability. Moreover, in [24, 25, 26], the authors considered a distributed deep neural network model. By introducing auxiliary variables, the authors made the non-convex problem separable, which can lead to a consensus problem over a network. Unlike the works in [24, 25, 26], we consider solving distributed machine learning problems in an augmented manner by taking communication networks into account. Therefore, our work focuses on the effect of communication and network topology on distributed machine learning algorithms, while the works in [24, 25, 26] focus on the alternating or block coordinate descent algorithm itself, without considering network constraints, to solve neural network problems with auxiliary variables.

The contribution of this paper is three-fold. First, we design the distributed dual coordinate ascent for a regularized loss minimization problem in a general tree-structured communication network and analyze the convergence rate of the algorithm over the general tree network. Since a star network is a special case of a general tree network, our distributed dual coordinate ascent algorithm can be thought of as a generalized version of the distributed dual coordinate ascent in a star network. Second, we study the influence of the communication constraints in a network on the convergence rate of the distributed dual coordinate ascent. By considering delays in communication, we optimize the network-constrained dual coordinate ascent to maximize its convergence speed in terms of time, and provide an analytical solution for the optimal number of local iterations depending on the communication delay severity. The analytical solution, which is a function of the ratio between the communication delay and the local processing time, can be used to achieve the fastest convergence speed of the distributed dual coordinate ascent in time. Finally, we demonstrate the usability of our proposed algorithm in machine learning over communication networks in which local workers cannot directly reach a central node.

The rest of the paper is organized as follows. In Section II, we introduce the regularized loss minimization problem with distributed data. Section III describes a review of existing works on the synchronous distributed dual coordinate ascent in a star network. In Section IV, we propose the generalized distributed dual coordinate ascent in tree-structured networks. Section V describes the convergence analysis of the generalized distributed dual coordinate ascent. In Section VI, we study the communication delay factor in the convergence speed of the distributed dual coordinate ascent. In Section VII, we demonstrate the performance of the generalized distributed dual coordinate ascent and the optimal iteration numbers for the fast convergence speed. The proposed algorithm and its convergence rate without a proof were introduced in our previous conference paper [27]. In this journal paper, we provide the full proof of our theorem in Appendix A, the analysis of network topology and communication effect on the algorithm in Sections V and VI respectively, and additional numerical experiments in Section VII.

Notations: We denote the set of real numbers as ${\mathbb{R}}$. We use $[k]$ to denote the index set of the coordinates in the $k$-th coordinate block. For an index set $Q$, $\overline{Q}$ and $|Q|$ represent the complement and the cardinality of $Q$, respectively. We use bold letters to represent vectors and matrices. When an index set is used as a subscript of a vector (resp. matrix), we refer to the partial vector over the index set (resp. the partial matrix with columns over the index set). The superscript $(t)$ denotes the $t$-th iteration. For example, ${\bm{\alpha}}^{(t)}_{[k]}$ represents the partial vector of ${\bm{\alpha}}$ over the $k$-th block coordinate set at the $t$-th iteration. We reserve the superscript $\star$ to denote the optimal solution to an optimization problem.

II Problem formulation

We consider the following regularized loss minimization problem [9, 10, 14, 16, 17, 11]:

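$$\underset{{\bm{w}}\in{\mathbb{R}}^{d}}{\text{minimize}}\;P({\bm{w}})\triangleq\frac{\lambda}{2}\|{\bm{w}}\|^{2}+\frac{1}{m}\sum_{i=1}^{m}\ell_{i}({\bm{w}}^{T}{\bm{x}}_{i}),\qquad(1)$$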

where ${\bm{x}}_{i}\in{\mathbb{R}}^{d}$, $i=1,2,...,m$, are data points, $\ell_{i}(\cdot)$, $i=1,2,...,m$, are loss functions, and $\lambda$ is a tuning parameter for the regularization term. Note that because the regularization term acts on ${\bm{w}}$, which is a global variable, this minimization problem is not separable across distributed nodes, unlike the consensus problems introduced in [19, 20, 21, 22, 23], where the regularization term has the form $\sum_{k=1}^{K}r({\bm{w}}_{k})$. Here $r(\cdot)$ is a regularization function and $K$ is the number of distributed nodes. By considering different loss functions, (1) can be interpreted as various machine learning problems, including regression and classification. For instance, for linear classification, choosing the loss function $\ell_{i}(\cdot)$ to be the hinge loss, i.e., $\ell_{i}({\bm{w}}^{T}{\bm{x}}_{i})=\max(0,1-y_{i}({\bm{w}}^{T}{\bm{x}}_{i}))$, turns (1) with a labeled dataset $\{({\bm{x}}_{i},y_{i})\}_{i=1}^{m}$, where $y_{i}\in{\mathbb{R}}$ is label information, into the linear Support Vector Machine (SVM) classification problem. For regression, we can set $\ell_{i}({\bm{w}}^{T}{\bm{x}}_{i})=({\bm{w}}^{T}{\bm{x}}_{i}-y_{i})^{2}$ with some measurement data $y_{i}$, $i=1,2,...,m$. Throughout the paper, we assume that the data points ${\bm{x}}_{i}$, $i=1,2,...,m$, are normalized in $\ell_{2}$ norm, i.e., $\|{\bm{x}}_{i}\|\leq 1$, $i=1,2,...,m$, and that the dataset $\{({\bm{x}}_{i},y_{i})\}_{i=1}^{m}$ is stored in a distributed manner over a network having $K$ local workers. More specifically, the $k$-th local worker has training data $\{({\bm{x}}_{i},y_{i})\}$, $i\in[k]$, where $[k]$ represents the index set for the training data of the $k$-th local worker. Hence, we have $|\cup_{k=1}^{K}[k]|=m$.

From the primal problem (1), we have the following dual problem by considering the conjugate function, i.e., $\ell_{i}(a)=\sup_{b}ab-\ell_{i}^{*}(b)$ , where $a,b\in{\mathbb{R}}$ :

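$$\underset{{\bm{\alpha}}\in{\mathbb{R}}^{m}}{\text{maximize}}\;D({\bm{\alpha}})\triangleq-\frac{\lambda}{2}\|{\bm{A}}{\bm{\alpha}}\|^{2}-\frac{1}{m}\sum_{i=1}^{m}\ell_{i}^{*}(-\alpha_{i}),\qquad(2)$$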

where $\alpha_{i}$ is the $i$-th element of the dual vector ${\bm{\alpha}}\in{\mathbb{R}}^{m}$, and the data matrix ${\bm{A}}\in{\mathbb{R}}^{d\times m}$, whose $i$-th column is $\frac{1}{\lambda m}{\bm{x}}_{i}$, i.e., ${\bm{A}}_{i}=\frac{1}{\lambda m}{\bm{x}}_{i}$, is introduced for notational convenience. By defining ${\bm{w}}({\bm{\alpha}})\triangleq{\bm{A}}{\bm{\alpha}}$ as in [14, 10], we obtain the duality gap $P({\bm{w}}({\bm{\alpha}}))-D({\bm{\alpha}})$, which serves as a useful and readily computable stopping criterion. It is noteworthy that from the duality principle [28], we have $P({\bm{w}})\geq D({\bm{\alpha}})$ for all ${\bm{w}}$ and ${\bm{\alpha}}$, and thus $P({\bm{w}}({\bm{\alpha}}))\geq D({\bm{\alpha}})$ for all ${\bm{\alpha}}$. If ${\bm{\alpha}}={\bm{\alpha}}^{\star}$, the optimal solution to the dual problem (2), and the loss functions $\ell_{i}(\cdot)$ are convex, we have $P({\bm{w}}({\bm{\alpha}}^{\star}))=D({\bm{\alpha}}^{\star})$ by strong duality. Thus, ${\bm{w}}({\bm{\alpha}}^{\star})$ becomes ${\bm{w}}^{\star}$, the optimal solution to the primal problem (1). On the other hand, if a loss function $\ell_{i}(\cdot)$ is non-convex, the primal problem becomes non-convex, while the dual problem is still a convex problem [28]. Therefore, our algorithm, which tackles the dual problem, can still provide an optimal solution to the dual problem. Unfortunately, in this case, there is no guarantee that the optimal solution to the dual problem yields an optimal solution to the primal problem.
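To make the duality gap concrete, the sketch below evaluates $P({\bm{w}}({\bm{\alpha}}))$, $D({\bm{\alpha}})$, and their difference for the squared loss $\ell_{i}(a)=(a-y_{i})^{2}$. The conjugate used here, $\ell_{i}^{*}(b)=b^{2}/4+y_{i}b$, follows from maximizing $ab-(a-y_{i})^{2}$ over $a$ (attained at $a=y_{i}+b/2$); it is derived for this illustration and is not quoted from the text. The data, the value of $\lambda$, and all function names are assumptions.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def w_of_alpha(alpha, X, lam):
    """w(alpha) = A alpha, where column i of A is x_i / (lam * m)."""
    m, d = len(X), len(X[0])
    return [sum(alpha[i] * X[i][j] for i in range(m)) / (lam * m) for j in range(d)]

def primal(w, X, y, lam):
    """P(w) = lam/2 ||w||^2 + (1/m) sum_i (w^T x_i - y_i)^2."""
    m = len(y)
    return 0.5 * lam * dot(w, w) + sum((dot(w, x) - t) ** 2 for x, t in zip(X, y)) / m

def dual(alpha, X, y, lam):
    """D(alpha) = -lam/2 ||A alpha||^2 - (1/m) sum_i conj_i(-alpha_i),
    with conj_i(-a) = a^2/4 - y_i a for the squared loss."""
    m = len(y)
    w = w_of_alpha(alpha, X, lam)
    return -0.5 * lam * dot(w, w) - sum(a * a / 4.0 - t * a for a, t in zip(alpha, y)) / m

def duality_gap(alpha, X, y, lam):
    return primal(w_of_alpha(alpha, X, lam), X, y, lam) - dual(alpha, X, y, lam)

# Illustrative data with ||x_i|| <= 1 (each of 3 coordinates lies in [-0.5, 0.5]).
rng = random.Random(0)
lam = 0.1
X = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(20)]
y = [rng.uniform(-1.0, 1.0) for _ in range(20)]
alpha = [rng.uniform(-1.0, 1.0) for _ in range(20)]
gap = duality_gap(alpha, X, y, lam)
print("duality gap at a random alpha:", gap)
```

By weak duality the printed gap is nonnegative for any ${\bm{\alpha}}$, which is what makes it usable as a stopping criterion.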

In the following sections, we consider a distributed dual coordinate ascent for the regularized loss minimization problem over distributed data. We first review the previous research on the distributed dual coordinate ascent in a star network.

III Review of the distributed dual coordinate ascent in a star network

The distributed dual coordinate ascent for the regularized loss minimization problem over distributed data in a network has been studied in [9, 10, 16, 11], where a star network topology is considered, as shown in Figure 1. In particular, the authors in [10] introduced a distributed dual coordinate ascent framework called Communication-Efficient Distributed Dual Coordinate Ascent (CoCoA), and later proposed CoCoA+ [11], an enhanced version of CoCoA that adjusts the parameter value in the accumulation of intermediate results to converge faster than CoCoA. Since we are interested in the distributed dual coordinate ascent for various network topologies and their influence on the performance of the distributed algorithm, we provide a high-level review of CoCoA as proposed in [10].

Suppose a star network has $K$ local workers and each local worker has a disjoint part of the dataset $\{({\bm{x}}_{i},y_{i})\}_{i=1}^{m}$. With this problem setting, the authors in [10] introduced the distributed dual coordinate ascent for a star network. Due to the nature of the distributed algorithm, the global variable is updated in the outer iterations, and each worker locally runs inner iterations. In particular, at the $t$-th outer iteration of the algorithm, each worker solves a local dual problem for its dataset via LocalDualMethod($\cdot$), which represents any dual method for solving (2), e.g., Stochastic Dual Coordinate Ascent (SDCA), simply denoted by LocalSDCA($\cdot$), through inner iterations. Then, each local worker sends its intermediate solution to the center node. The center node collects and accumulates the results from all local workers, and then updates the global solution ${\bm{w}}^{(t)}$ at the $t$-th outer iteration and shares it back with the workers. Algorithm 1 describes the detailed steps of the distributed coordinate ascent in a star network. The following theorem characterizes the convergence rate of the algorithm in [10].

Theorem 1 ([10, Theorem 2]).

Suppose that Algorithm 1 is run for $T$ outer iterations on $K$ local computers with the procedure LocalSDCA($\cdot$) having local geometric improvement $\Theta$. Further, assume that the loss functions $\ell_{i}(\cdot)$ are $(1/\gamma)$-smooth. Then, the following geometric convergence rate holds for the global (dual) objective:

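$${\mathbb{E}}[D({\bm{\alpha}}^{\star})-D({\bm{\alpha}}^{(T)})]\leq\bigg(1-(1-\Theta)\frac{1}{K}\frac{\lambda m\gamma}{\rho+\lambda m\gamma}\bigg)^{T}\big(D({\bm{\alpha}}^{\star})-D({\bm{\alpha}}^{(0)})\big),$$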

where $m$ is the size of the whole dataset and $\rho$ is any real number satisfying

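$$\rho\geq\rho_{min}\triangleq\underset{{\bm{\alpha}}\in{\mathbb{R}}^{m}}{\text{maximize}}\;\lambda^{2}m^{2}\frac{\sum_{k=1}^{K}\|{\bm{A}}_{[k]}{\bm{\alpha}}_{[k]}\|^{2}-\|{\bm{A}}{\bm{\alpha}}\|^{2}}{\|{\bm{\alpha}}\|^{2}}\geq 0.$$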

With LocalSDCA($\cdot$), which uses SDCA to solve the dual problem for the given dataset at each worker, the local geometric improvement $\Theta$ can be set to

$$\Theta=(1-s/\tilde{m})^{H},$$

where $\tilde{m}\triangleq\max_{k=1,...,K}m_{k}$ is the size of the largest block of coordinates among the $K$ local workers, $H$ is the number of local (or inner) iterations in LocalSDCA($\cdot$), and $s\in[0,1]$ is a step size of the gradient ascent that determines how far the next solution is taken from the current solution at each iteration. Additionally, by choosing a different parameter value instead of $\frac{1}{K}$ in the summation of the $\bigtriangleup{\bm{w}}_{k}$’s in Algorithm 1, the authors in [11] proposed CoCoA+, which shares the framework of CoCoA in Algorithm 1 and is designed for faster convergence than CoCoA.
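As a rough illustration of how the bound in Theorem 1 can be read, the sketch below plugs numbers into the local improvement $\Theta=(1-s/\tilde{m})^{H}$ and the per-round contraction factor, and then counts the outer iterations needed to shrink the initial dual suboptimality by a factor $\epsilon$. All numeric values, including the choice $\rho=1$, are assumptions made for the example, and the function names are hypothetical.

```python
import math

def local_improvement(s, m_tilde, H):
    """Theta = (1 - s / m_tilde)^H for H inner iterations."""
    return (1.0 - s / m_tilde) ** H

def outer_rate(theta, K, lam, m, gamma, rho):
    """Per-round factor 1 - (1 - Theta) * (1/K) * lam*m*gamma / (rho + lam*m*gamma)."""
    c = lam * m * gamma / (rho + lam * m * gamma)
    return 1.0 - (1.0 - theta) * c / K

def rounds_for_accuracy(rate, eps):
    """Smallest integer T with rate**T <= eps, for 0 < rate < 1 and 0 < eps < 1."""
    return math.ceil(math.log(eps) / math.log(rate))

# Assumed setting: K = 4 workers with 250 points each, one local pass per round.
theta = local_improvement(s=1.0, m_tilde=250, H=250)
rate = outer_rate(theta, K=4, lam=0.01, m=1000, gamma=1.0, rho=1.0)
T = rounds_for_accuracy(rate, eps=1e-3)
print("Theta:", theta, "rate:", rate, "outer iterations:", T)
```

Increasing $H$ lowers $\Theta$ and therefore the per-round factor, but the bound counts communication rounds only and says nothing about the time each round takes.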

CoCoA has been shown to work well for distributed machine learning problems with distributed data in a star network, which is a simple network model. However, the topology of a network is not necessarily a star. In the next section, we study the distributed dual coordinate ascent in a more general, tree-structured network model.
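The star-network procedure reviewed in this section can be sketched as follows for the squared loss $\ell_{i}(a)=(a-y_{i})^{2}$. This is a minimal single-process simulation under stated assumptions, not the reference implementation of [10]: the workers run sequentially, the coordinate step is the closed-form maximizer derived for this particular loss, and the data, block layout, and function names are made up for the example.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def duality_gap(alpha, w, X, y, lam):
    """P(w) - D(alpha) for the squared loss, assuming w = A alpha."""
    m = len(y)
    primal = 0.5 * lam * dot(w, w) + sum((dot(w, x) - t) ** 2 for x, t in zip(X, y)) / m
    dual = -0.5 * lam * dot(w, w) - sum(a * a / 4.0 - t * a for a, t in zip(alpha, y)) / m
    return primal - dual

def local_sdca(block, alpha, w, X, y, lam, H, rng):
    """H SDCA steps on one worker's block; returns (d_alpha, d_w), leaving alpha and w untouched."""
    m, d = len(y), len(w)
    d_alpha = {i: 0.0 for i in block}
    w_loc = list(w)
    for _ in range(H):
        i = rng.choice(block)
        xi = X[i]
        # Exact coordinate maximizer of the dual for the squared loss (assumed loss).
        step = (y[i] - dot(xi, w_loc) - (alpha[i] + d_alpha[i]) / 2.0) / (0.5 + dot(xi, xi) / (lam * m))
        d_alpha[i] += step
        for j in range(d):
            w_loc[j] += step * xi[j] / (lam * m)
    return d_alpha, [w_loc[j] - w[j] for j in range(d)]

def cocoa_star(X, y, blocks, lam, T, H, seed=0):
    """T outer rounds; the center averages the workers' updates with weight 1/K."""
    rng = random.Random(seed)
    m, d, K = len(y), len(X[0]), len(blocks)
    alpha, w = [0.0] * m, [0.0] * d
    for _ in range(T):
        results = [local_sdca(b, alpha, w, X, y, lam, H, rng) for b in blocks]
        for d_alpha, d_w in results:
            for i, v in d_alpha.items():
                alpha[i] += v / K
            for j in range(d):
                w[j] += d_w[j] / K
    return alpha, w

rng = random.Random(1)
m, d, K, lam = 30, 3, 3, 0.1
X = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(m)]
y = [rng.uniform(-1.0, 1.0) for _ in range(m)]
blocks = [list(range(k * 10, (k + 1) * 10)) for k in range(K)]
gap_start = duality_gap([0.0] * m, [0.0] * d, X, y, lam)
alpha, w = cocoa_star(X, y, blocks, lam, T=30, H=20)
gap_end = duality_gap(alpha, w, X, y, lam)
print("duality gap before and after:", gap_start, gap_end)
```

Only the $d$-dimensional updates `d_w` would cross the network in a real deployment; the dual variables stay on the worker that owns them.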

IV Generalized distributed dual coordinate ascent in tree networks

One may think of a connected communication network, e.g., a spanning tree network, as a virtual star network by treating the long relay of links from a central node to each leaf node as a direct virtual link through intermediate nodes. However, since communication delays normally exist in a network and communication is a major burden for distributed algorithms, distributed algorithms in the virtual star network can easily suffer from long communication delays, which significantly slow down their convergence. Therefore, in a connected communication network, it is efficient to perform distributed optimization among local workers close to each other, and then communicate the intermediate results to a central or sub-central node. Based on this idea, we investigate how to design the distributed dual coordinate ascent over a general tree-structured network, and provide its convergence analysis. Since every connected network has a spanning tree, we choose to investigate the distributed algorithm over a tree-structured network, which is also a generalization of a star network.

In Figure 2, we show a two-layer tree network as an example of a general tree-structured network, where the number of layers represents the depth of the tree network. The root node of the tree network represents the central station of the network. Each tree node may have several direct child nodes. For example, the root node has three direct child nodes $S_{1}$, $S_{2}$, and $S_{3}$ in Figure 2. A node without any child node is called a leaf node. Without loss of generality, we assume that only leaf nodes hold the distributed data, which are disjoint column-wise blocks of the data matrix ${\bm{A}}$. Note that ${\bm{A}}_{i}=\frac{1}{\lambda m}{\bm{x}}_{i}$, where ${\bm{A}}_{i}$ is the $i$-th column of ${\bm{A}}$ and ${\bm{x}}_{i}\in{\mathbb{R}}^{d}$ is the $i$-th data point. If a non-leaf node $Q$ has data, we can always create a virtual leaf node $L$ attached to $Q$ and “store” the data in $L$. Thus, without loss of generality, we can assume that the dataset $\{({\bm{x}}_{i},y_{i})\}_{i=1}^{m}$ is distributed only to leaf nodes.

For a tree node $Q$, we can consider the subtree consisting of $Q$ and its direct and indirect child nodes down to the leaf nodes, simply called the subtree $Q$. Figure 3 illustrates the subtree $Q$. We also denote the set of indices of all data points stored in the subtree $Q$ as $Q$, and the set of indices of data points in the $k$-th direct child node of $Q$ as $[Q;k]$. Therefore, $[Q;k]\subset Q$. Then, ${\bm{\alpha}}_{[Q;k]}$ represents the partial vector of ${\bm{\alpha}}\in{\mathbb{R}}^{m}$ corresponding to the data points in the subtree of the $k$-th direct child node of $Q$. Since each node is also used as an index set, we denote the number of data points stored in the subtree $Q$, i.e., the cardinality of $Q$, as $|Q|$. In a tree network, we also assume that a node can communicate only with its direct child nodes or its direct parent node.

We then introduce the generalized distributed dual coordinate ascent, which we call TreeDualMethod, to solve the dual problem (2) with distributed data stored over a general tree-structured network. For simplicity, we consider the tree network in Figure 2, where the number of layers, $p$ , is 2. In a leaf node $W_{ij}$ , TreeDualMethod in Procedure IV is run with a local dataset for $T_{p}$ iterations, and then the intermediate value $\bigtriangleup{\bm{w}}$ is shared with its direct parent node, i.e., the sub-central node $S_{i}$ . In the sub-central node $S_{i}$ , a global variable ${\bm{w}}$ for the $i$ -th cluster is updated and distributed to the local workers $W_{ij}$ . After this process has been run $T_{i}$ times independently in each cluster, the variables $\bigtriangleup{\bm{w}}$ from the clusters are shared with the central node. The central node then updates the global variable ${\bm{w}}$ and shares it with all nodes. The algorithm repeats this process until a stopping criterion is met. Algorithm 2, Algorithm 3 and Procedure IV describe the computational steps of TreeDualMethod for the root node, a general tree node (not root or leaf), and a leaf node respectively.

It is noteworthy that, as in the star network case, only the output $\bigtriangleup{\bm{w}}_{Q}$ of Procedure IV and Algorithm 3 and the output ${\bm{w}}$ of Algorithm 2 are transmitted between nodes, while the outputs ${\bm{\alpha}}$ and $\bigtriangleup{\bm{\alpha}}_{Q}$ are not transmitted through the communication network. Each node generates ${\bm{\alpha}}$ or $\bigtriangleup{\bm{\alpha}}_{Q}$ as an output, but these outputs are used within the same node at the next iteration without transmission to other nodes. Therefore, even for a large dataset, the communication cost is not affected by the number of data points. In particular, when the dimension of ${\bm{\alpha}}\in{\mathbb{R}}^{m}$ is large, i.e., there is a large amount of data, and the dimension of ${\bm{w}}$ is much smaller than $m$ , which is normally the case in big data, our distributed algorithm has a small communication burden.
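The message flow described above can be sketched as a single-process simulation. This is a minimal illustration under stated assumptions, not a transcription of Algorithm 2, Algorithm 3, or Procedure IV: it assumes the squared loss $\ell_{i}(a)=(a-y_{i})^{2}$ with the standard SDCA closed-form coordinate step, CoCoA-style averaging with weight $1/K$ at every non-leaf node, and hypothetical names (`local_sdca`, `tree_dual_method`, the `"idx"`, `"iters"`, and `"children"` keys). Passing the full ${\bm{\alpha}}$ array between function calls is only a simulation convenience; in a real deployment only the $d$-dimensional $\bigtriangleup{\bm{w}}$ would travel over a link.

```python
import numpy as np

def local_sdca(X, y, idx, alpha, w, lam, num_iters, rng):
    """Leaf node: coordinate ascent on the block `idx`, all other alphas fixed."""
    m = len(y)
    d_alpha = np.zeros_like(alpha)
    w_loc = w.copy()
    for _ in range(num_iters):
        i = rng.choice(idx)
        x_i = X[i]
        a_i = alpha[i] + d_alpha[i]
        # Closed-form maximizer of the dual along coordinate i (squared loss,
        # assumed scaling w = X^T alpha / (lam * m)).
        step = (y[i] - x_i @ w_loc - 0.5 * a_i) / (0.5 + x_i @ x_i / (lam * m))
        d_alpha[i] += step
        w_loc += step * x_i / (lam * m)
    return d_alpha, w_loc - w

def tree_dual_method(node, X, y, alpha, w, lam, rng):
    """Return (delta_alpha, delta_w) produced by the subtree rooted at `node`."""
    if "children" not in node:  # leaf node
        return local_sdca(X, y, node["idx"], alpha, w, lam, node["iters"], rng)
    alpha_q, w_q = alpha.copy(), w.copy()
    K = len(node["children"])
    for _ in range(node["iters"]):  # T_i synchronous rounds at this node
        updates = [tree_dual_method(c, X, y, alpha_q, w_q, lam, rng)
                   for c in node["children"]]
        for d_alpha, d_w in updates:  # assumed averaging with weight 1/K
            alpha_q += d_alpha / K
            w_q += d_w / K
    return alpha_q - alpha, w_q - w

def duality_gap(X, y, alpha, w, lam):
    primal = 0.5 * lam * (w @ w) + np.mean((X @ w - y) ** 2)
    dual = -0.5 * lam * (w @ w) - np.mean(alpha ** 2 / 4.0 - alpha * y)
    return primal - dual

# Synthetic data (hypothetical), with each instance scaled so that ||x_i|| <= 1.
rng = np.random.default_rng(0)
m, d, lam = 200, 5, 0.1
X = rng.normal(size=(m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

# Two-layer tree: one root, two sub-central nodes, two leaf workers each.
blocks = np.array_split(np.arange(m), 4)
leaves = [{"idx": b, "iters": 50} for b in blocks]
root = {"iters": 20, "children": [
    {"iters": 2, "children": leaves[:2]},
    {"iters": 2, "children": leaves[2:]},
]}

alpha0, w0 = np.zeros(m), np.zeros(d)
d_alpha, d_w = tree_dual_method(root, X, y, alpha0, w0, lam, rng)
alpha, w = alpha0 + d_alpha, w0 + d_w
gap_before = duality_gap(X, y, alpha0, w0, lam)
gap_after = duality_gap(X, y, alpha, w, lam)
```

Because every leaf maximizes the true dual along its own coordinates and every parent averages its children's updates, the dual objective cannot decrease in this sketch, and the relation ${\bm{w}}=\frac{1}{\lambda m}\sum_{i}\alpha_{i}{\bm{x}}_{i}$ is preserved at every level.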

V Convergence analysis of TreeDualMethod over a tree network

We analyze the convergence rate of the distributed dual coordinate ascent in a general tree-structured network model in this section. In a nutshell, we will show a recursive relation between the convergence rate of the algorithm at a tree node $Q$ and that at the node $Q$ ’s direct child nodes. Hence, the overall convergence rate of the distributed dual coordinate ascent in a general tree-structured network can be understood in a recursive manner, where the number of recursions is dependent on the number of layers of the tree network.

For clarity, let us consider a general tree network model having $p$ layers from the root node to leaf nodes, where the root node is on the layer- $0$ and the leaf nodes are on the layer- $p$ . Suppose a node $Q$ on the $i$ -th layer has $K$ direct child nodes on the $(i+1)$ -th layer, as shown in Figure 3. We use ${\bm{\alpha}}_{[Q;k]}$ to denote the partial dual variable vector corresponding to its $k$ -th direct child node, where $1\leq k\leq K$ . Then, let us define the local suboptimality gap for the $k$ -th direct child node of $Q$ as

[TABLE]

Note that the local suboptimality gap for the $k$ -th child node is defined by fixing ${\bm{\alpha}}_{\overline{Q}}$ and ${\bm{\alpha}}_{[Q;i]}$ , $i\neq k$ , and updating only ${\bm{\alpha}}_{[Q;k]}$ . Thus, the local suboptimality gap for the $k$ -th direct child node of $Q$ represents the maximum improvement in objective value that the $k$ -th direct child node of $Q$ can achieve from the current ${\bm{\alpha}}^{(t)}$ while all other variables $\alpha_{i}$ , $i\notin[Q;k]$ , are held fixed. Then, we introduce the following assumption about the local geometric improvement of TreeDualMethod at the $k$ -th direct child node of $Q$ .

Assumption 1 (Geometric improvement of TreeDualMethod at a direct child node).

For a tree node $Q$ on the $i$ -th layer, we assume that there exists $\Theta_{i+1}\in[0,1)$ such that for any given ${\bm{\alpha}}$ , TreeDualMethod at the $k$ -th direct child node of $Q$ returns an update $\bigtriangleup{\bm{\alpha}}_{[Q;k]}$ satisfying

[TABLE]

Note that Assumption 1 here is introduced for an arbitrary tree node in a general tree network and used as a starting assumption in mathematical induction for recursive convergence analysis, while Assumption 1 of [10] is introduced for an abstract function in the distributed algorithm framework.

For a leaf node, we use LocalSDCA for TreeDualMethod described in Procedure IV as in [10], and provide the following proposition about the convergence bound for a leaf node $B$ even with the input ${\bm{w}}$ also determined by ${\bm{\alpha}}_{\overline{Q}}$ and ${\bm{\alpha}}_{Q\setminus B}$ in Procedure IV.

Proposition 1 ([10, Proposition 1] ).

Let us consider a tree node $Q$ whose direct child node $B$ is a leaf node. Assume that loss functions $\ell_{i}(\cdot)$ are $\nicefrac{{1}}{{\gamma}}$ -smooth. Then for the leaf node $B$ , Assumption 1 holds with

[TABLE]

where $m_{B}$ is the size of data stored at node $B$ , $T_{p}$ is the number of iterations in Procedure IV.

Basically, the geometric improvement condition holds true with LocalSDCA if the $k$ -th direct child node of $Q$ is a leaf child node with $\Theta$ introduced in (4), where $s$ in (4) is $\frac{\lambda m\gamma}{1+\lambda m\gamma}$ in (7).

Additionally, Theorem 2, which is our main result, shows that if the geometric improvement condition holds true for direct child nodes of $Q$ , then the geometric improvement condition also holds true for $Q$ ; thus it leads to a recursive calculation of the convergence rate for the whole tree network.

Theorem 2.

Let us consider a tree node $Q$ on the $i$ -th layer which has $K_{i}$ direct child nodes satisfying the local geometric improvement requirement introduced in Assumption 1, with parameters $\Theta^{1}_{i+1}$ , $\Theta_{i+1}^{2}$ , …, and $\Theta^{K_{i}}_{i+1}$ . We assume that Algorithm 3 (or Algorithm 2) has an input ${\bm{w}}$ and is run for $T_{i}$ iterations. We further assume that loss functions $\ell_{i}(\cdot)$ ’s are $\nicefrac{{1}}{{\gamma}}$ -smooth.

Then, for any input ${\bm{w}}$ to Algorithm 3 (or Algorithm 2), the following geometric convergence rate holds for $Q$ :

[TABLE]

where $\Theta_{i+1}=\max_{k}\Theta^{k}_{i+1}$ , and $\rho_{i}$ is any real number satisfying

[TABLE]

Note that the parameter $\rho_{i}$ is related to the overlapping level among the datasets in the subtree $Q$ . When we have more overlap among local datasets in the subtree, the parameter $\rho_{i}$ can become larger, which will lead to slower convergence rate.

Proposition 1 concerns the local geometric improvement of TreeDualMethod at a leaf node; namely, Assumption 1 holds for leaf nodes. Theorem 2 concerns the local geometric improvement of TreeDualMethod at any non-leaf tree node. Note that $(1-(1-\Theta_{i+1})\frac{1}{K_{i}}\frac{\lambda m\gamma}{\rho_{i}+\lambda m\gamma})^{T_{i}}$ in the bound of Theorem 2 becomes $\Theta_{i}$ for a tree node $Q$ on the $i$ -th layer, and this bound is then interpreted as the local geometric improvement of TreeDualMethod at a direct child node, as seen from the direct parent node of $Q$ , which is a node on the $(i-1)$ -th layer. For the convergence rate of the generalized dual coordinate ascent over the whole tree network, we use mathematical induction, where Proposition 1 is the base case, Assumption 1 is the induction hypothesis, and Theorem 2 is the inductive step of the recursive convergence analysis. Therefore, by combining Theorem 2 with Proposition 1, Assumption 1 holds true for every node in a tree network, and we can recursively obtain the convergence rate of the generalized distributed dual coordinate ascent algorithm for the whole tree network. Figure 4 illustrates the structure of the tree network factor in the convergence rate, shown through $\Theta_{1}$ and $\Theta_{2}$ .

We remark that Theorem 2 is different from Theorem 2 of [10] in three aspects. Firstly, Theorem 2 is applicable to any tree node in a general tree network, beyond the star network discussed in [10]. Secondly, Theorem 2 holds even when the input ${\bm{w}}$ of Algorithm 3 is determined not only by ${\bm{\alpha}}_{Q}$ but also by ${\bm{\alpha}}_{\overline{Q}}$ . Note that ${\bm{w}}={\bm{A}}({\bm{\alpha}}_{Q},{\bm{\alpha}}_{\overline{Q}})={\bm{A}}_{Q}{\bm{\alpha}}_{Q}+{\bm{A}}_{\overline{Q}}{\bm{\alpha}}_{\overline{Q}}$ . Unlike our theorem, in Theorem 2 of [10], due to the star network topology, a local worker has ${\bm{w}}$ as an input from the root node, which is updated with intermediate results obtained from all the local workers. Hence, ${\bm{\alpha}}_{\overline{Q}}$ is not considered in Theorem 2 of [10] and its proof. Our proof of Theorem 2 addresses the challenge that the input ${\bm{w}}$ is also affected by ${\bm{\alpha}}_{\overline{Q}}$ . Therefore, we have to deal with both the updated coordinates ${\bm{\alpha}}_{Q}\in{\mathbb{R}}^{|Q|}$ and the fixed coordinates ${\bm{\alpha}}_{\overline{Q}}\in{\mathbb{R}}^{|\overline{Q}|}$ , where $|Q|+|\overline{Q}|=m$ , while in the proof of Theorem 2 of [10], all the coordinates are updated, i.e., ${\bm{\alpha}}\in{\mathbb{R}}^{m}$ . For readability, we place the proof of Theorem 2 in Appendix A. Finally, unlike [10], we do not consider the different local-dual problem introduced in Eqn. (8) of [10] for local workers, but deal with the original dual problem introduced in (2) with fixed $\overline{{\bm{w}}}\triangleq{\bm{A}}_{\overline{Q}}{\bm{\alpha}}_{\overline{Q}}$ for a general tree node $Q$ . Therefore, our theorem works for any tree node in a general tree network rather than just for one central node, which allows the recursive convergence analysis of the distributed dual coordinate ascent in a general tree network.

By denoting the convergence bound in Theorem 2 as $\Theta_{i}$ , i.e.,

$$\Theta_{i}=\left(1-(1-\Theta_{i+1})\frac{C_{i}}{K_{i}}\right)^{T_{i}},\tag{8}$$

where $C_{i}$ represents $\frac{\lambda m\gamma}{\rho_{i}+\lambda m\gamma}$ , $T_{i}$ is the number of outer iterations at a tree node on the $i$ -th layer, and $K_{i}$ is the number of direct child nodes attached to a tree node on the $i$ -th layer, we can express the convergence bound on the whole tree network, i.e., $\Theta_{0}$ , in terms of the number of layers $p$ , the numbers of child nodes $K_{i}$ , and the constants $C_{i}$ , as follows:

[TABLE]

where $\Theta_{p}$ is introduced in (7). For simplicity, we assume that all tree nodes on the $i$ -th layer have the same number of direct child nodes $K_{i}$ .
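The layer-by-layer recursion $\Theta_{i}=(1-(1-\Theta_{i+1})\frac{C_{i}}{K_{i}})^{T_{i}}$ can be evaluated numerically from the leaves up to the root. The sketch below is only an illustration; the function name `tree_convergence_bound` and its list-based arguments are hypothetical, and the parameter values are those used later for the topology experiment ( $p=3$ , $T_{i}=40$ , $C_{i}=0.9$ , $\Theta_{p}=0.5$ ).

```python
def tree_convergence_bound(theta_leaf, K, C, T):
    """Return Theta_0 for a p-layer tree via the layer-by-layer recursion.

    K[i], C[i], T[i] describe layer i = 0, ..., p-1 (root first);
    theta_leaf is the leaf-level rate Theta_p.
    """
    if not (len(K) == len(C) == len(T)):
        raise ValueError("K, C and T need one entry per non-leaf layer")
    theta = theta_leaf
    # Walk from the layer just above the leaves up to the root.
    for K_i, C_i, T_i in zip(reversed(K), reversed(C), reversed(T)):
        theta = (1.0 - (1.0 - theta) * C_i / K_i) ** T_i
    return theta

p = 3
theta0_k5 = tree_convergence_bound(0.5, K=[5] * p, C=[0.9] * p, T=[40] * p)
theta0_k10 = tree_convergence_bound(0.5, K=[10] * p, C=[0.9] * p, T=[40] * p)
```

Since each step of the recursion is increasing in both $\Theta_{i+1}$ and $K_{i}$ , a larger number of child nodes per tree node yields a larger (slower) bound $\Theta_{0}$ when all other parameters are held fixed.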

If $|T_{i}\cdot\frac{C_{i}\Theta_{i+1}}{K_{i}-C_{i}}|\ll 1$ , for $i=0,1,...,p-1$ , by applying the binomial approximation, we can have

[TABLE]

and approximate (9) as follows:

[TABLE]

In Section VII, we will investigate the gap between (9) and (10) through numerical experiments as well as analyze the network topology’s effect including the number of workers $K_{i}$ and the number of layers $p$ on the convergence bounds over the whole tree network introduced in (9) and (10).

We have discussed how the network topology affects the convergence rate of the distributed dual coordinate ascent, expressed in terms of the number of layers, the number of nodes, and the number of iterations. However, for distributed algorithms, communication in a network can be a bottleneck for convergence. Therefore, it is natural to consider communication delay, which is normally expressed in time, in order to predict or estimate the convergence speed of distributed algorithms. In the next section, we will study how communication delay, which is one of the major network constraints, impacts the convergence of distributed dual coordinate ascent algorithms. By taking communication delays into account, we will optimize the number of local iterations $T_{p}$ in Procedure IV and $T_{i}$ in Algorithm 3 for maximum convergence speed.

VI Impacts of communication delay on the convergence rate of distributed dual coordinate ascent

Earlier works [9, 10, 11] bounded the convergence of distributed dual coordinate ascent algorithms with respect to the number of inner and outer iterations. However, there may be significant communication delays in a distributed network. Thus, the convergence speed of distributed algorithms depends not only on how many iterations have been run, but also on the communication delays incurred in performing these iterations. Intuitively, if the communication delay is close to zero, local workers should perform a small number of local iterations and communicate with the central station at a higher frequency; on the other hand, if the communication delay is large, i.e., communication is costly, local workers should perform more local iterations before communicating with the central station in order to speed up convergence. Therefore, our goal here is to investigate the convergence speed of distributed dual coordinate ascent with respect to the total execution time, including computational time and communication delays, and to optimize the number of local iterations under communication delays so as to achieve the maximum convergence speed. Prior studies [18, 29, 30, 31] examined the impact of communication delays on the convergence rate of algorithms in various distributed optimization problems, including distributed consensus problems. However, for the regularized loss minimization problem considered in this paper, to the best of our knowledge, our paper is the first to analytically study the impact of communication delay on the convergence rate and to find the optimal number of local iterations as a function of the communication delay severity.

For simplicity, let us first consider a star network as shown in Figure 1 and the corresponding Algorithm 1. Since the communication delay is normally given in time, we need to consider both time and the number of iterations in the convergence analysis in order to obtain the optimal number of iterations in practical applications having communication delay and computational time. We denote the round-trip communication delay between a local worker and the central station as $t_{{delay}}$ . We use $t_{lp}$ to denote the computational time for one local iteration at a worker, and use $t_{{cp}}$ to denote the computational time for parameter update at the central station. Figure 5 illustrates the communication delay, and the processing time of each local and central station.

Suppose that each local worker performs $T_{p}$ local iterations before communicating with the central station, and there are $T_{0}$ outer iterations in total. Then, the total experienced time is

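$$t_{{total}}=T_{0}\left(t_{lp}T_{p}+t_{{delay}}+t_{{cp}}\right).\tag{11}$$

Hence, the number of outer iterations $T_{0}$ is given by

$$T_{0}=\frac{t_{{total}}}{t_{lp}T_{p}+t_{{delay}}+t_{{cp}}}.\tag{12}$$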

From (2), for $T_{0}$ outer iterations, the expected gap between the optimal objective value and the current objective value with Algorithm 1 is expressed as

[TABLE]

where $\delta=\frac{s}{\tilde{m}}$ , $C=\nicefrac{{\lambda m\gamma}}{{(\rho+\lambda m\gamma)}}$ , and $K$ is the number of local workers. In order to minimize the gap in objective value (13) for a given total time $t_{{total}}$ , we introduce the following optimization problem over the number of local iterations $T_{p}$ by plugging (12) into (13):

[TABLE]

In order to figure out the optimal number of local iterations, let us find the critical point of the objective function $F(T_{p})$ . By applying logarithm to $F(T_{p})$ , we have

[TABLE]

Equation (15) can be interpreted as the product of two parts: the fraction part $(A)$ and the logarithm part $(B)$ . Note that the fraction part $(A)$ is a decreasing function of $T_{p}$ . For the logarithm part $(B)$ , as $T_{p}$ increases, $(B)$ goes to $\ln((K-C)/K)$ , which is less than zero, due to the condition $0\leq 1-\delta<1$ . At $T_{p}=0$ , $\ln F(T_{p})$ is 0 because $(B)=0$ . As $T_{p}$ goes to infinity, $\ln F(T_{p})$ goes to 0 because $(A)$ goes to 0. Therefore, we can expect at least one critical point at some $T_{p}$ . In order to find the critical point of (15), which is also a critical point of $F(T_{p})$ , we calculate the first order condition as follows:

[TABLE]

By simplifying (16) and denoting $\frac{t_{delay}+t_{cp}}{t_{lp}}$ by $r$ , we have the first order condition over $T_{p}$ as

[TABLE]

When $T_{p}$ is large enough, $(D)$ is approximated by $(\frac{K-C}{K})\ln(\frac{K-C}{K})$ , and we then have

[TABLE]

Note that (18) involves the Lambert W-function [32], which is defined as follows: when $xe^{x}=a$ , the solution $x$ is $W(a)$ , where $W(\cdot)$ is the Lambert W-function. By using the definition of the Lambert W-function, we have the following optimal number of local iterations $T_{p}$ from (18):

[TABLE]

Because the convergence analysis in a tree network is recursive, as introduced in Section V, the optimal number of iterations $T_{i}$ in Algorithm 3 for a node $Q$ can also be obtained from (14) with a slightly different interpretation. In the tree network, the number of local iterations $T_{p}$ in (14) is understood as the number of local iterations $T_{i}$ in Algorithm 3 for the node $Q$ . The computational time for one local iteration at a worker, denoted by $t_{lp}$ , is interpreted as the time needed to receive the updated intermediate results from $Q$ ’s child nodes once. Moreover, $t_{delay}$ and $t_{cp}$ represent the communication delay to $Q$ ’s direct parent node and the processing time at that parent node respectively. Thus, using (14) with this interpretation, the optimal number of local iterations for a general tree node $Q$ can be obtained as in (19).

Since the objective function $F(T_{p})$ in (14) represents the convergence bound in terms of time, it follows that, for a fixed number of local iterations $T_{p}$ , the larger the communication delay severity $r=t_{delay}/t_{lp}$ , the slower the convergence rate. Additionally, if a central node, sub-central nodes, and local workers need to be chosen in a network, then, in light of the recursive convergence analysis and the communication delay between layers, choosing a root node that makes the connected network shallow is better for fast convergence.

In the numerical experiments section, we will further investigate the impact of the communication delay severity $r$ , and other parameters including $C$ , $K$ , and $\delta$ in (19) on the optimal number of local iterations $T_{p}$ .

VII Numerical experiments

In wireless communication networks, local workers are often located out of communication range from the central node due to communication constraints such as limited communication power, long distance, limited bandwidth, and latency requirements. To reflect these constraints, in the numerical experiments we consider machine learning scenarios over communication networks where local workers cannot directly communicate with a central node. Thus, in the distributed dual coordinate ascent for a star network, local workers can only share their local solutions with a central node through multiple intermediate nodes, which can cause heavy communication delay and latency. For comparison, we solve machine learning problems including regression and classification over different communication networks having different delays with the following datasets: the KDD Cup 1998 dataset (https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1998+Data), the binary Covertype dataset (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#covtype.binary) [33], and the wine quality dataset (https://archive.ics.uci.edu/ml/datasets/wine+quality) [34]. In addition, we numerically check the optimal number of local iterations and demonstrate the impact of communication delay on the convergence speed of the distributed dual coordinate ascent by varying the communication delay in networks. We further numerically investigate the effect of network topology on the convergence of the distributed dual coordinate ascent over a tree network.

We compare the convergence of the generalized distributed dual coordinate ascent in tree networks against that in star networks with intermediate nodes. Since the authors in [10, 11] compared the distributed dual coordinate ascent in a star network, so-called CoCoA, with other well known methods including mini-batch SDCA[35], local SGD and mini-batch-SGD[36], we focus on comparing our generalized distributed dual coordinate ascent in tree networks with that in star networks by considering network constraints, especially, communication delay and latency. Additionally, since we are interested in the communication network’s effect on the convergence speed of the synchronous distributed dual coordinate ascent, considering the CoCoA+ [11], which is the updated version of CoCoA, or an asynchronous updating method, is out of the scope of this paper.

VII-A Machine learning over communication networks

We consider both regression and classification problems with the KDD Cup 1998 dataset and the Covertype dataset over communication networks. In the communication networks, we assume that local workers cannot directly reach the central node, and that a large communication delay exists due to the long relay of the communication path. In order to reflect this scenario, we consider various communication delays between the central node and its direct child nodes.

VII-A1 KDD Cup 1998 regression problem

In this numerical experiment, we test our algorithm and analysis on a ridge regression problem with the KDD Cup 1998 dataset, which has $481$ attributes including a label and $95412$ instances. We consider the following specific optimization problem by setting $\ell_{i}({\bm{w}}^{T}{\bm{x}}_{i})=(\frac{1}{\lambda m}{\bm{w}}^{T}{\bm{x}}_{i}-y_{i})^{2}$ :

[TABLE]

where ${\bm{A}}\in{\mathbb{R}}^{d\times m}$ is the feature data matrix whose $i$ -th column is $\frac{1}{\lambda m}{\bm{x}}_{i}$ and ${\bm{y}}\in{\mathbb{R}}^{m}$ is a label vector. Then, the following dual problem is obtained from (20):

[TABLE]

Hence, in a local worker, $\bigtriangleup\alpha$ in Procedure IV is simply calculated as follows:

[TABLE]

where $({\bm{x}}_{i},y_{i})$ is a randomly chosen data point and $\alpha_{i}^{(h-1)}$ is $\alpha_{i}$ value at $(h-1)$ -th iteration.

For the dataset, we take the first $95410$ instances and $404$ numerical-type attributes for our numerical experiments. We first normalize each attribute by its $\ell_{2}$ norm to improve the regression performance, and then normalize each instance by its $\ell_{2}$ norm so that each instance ${\bm{x}}_{i}$ satisfies the condition $||{\bm{x}}_{i}||\leq 1$ . We set the tuning parameter $\lambda$ to $1$ . For the communication networks, we consider a tree network model having ten local workers, two sub-central nodes (each having five local workers), and one central node. The simulated star network has ten local workers and one central node. In both cases, we evenly split the data among the ten local workers; namely, $9541$ instances without overlap are assigned to each local worker.
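The two-step normalization and the even, non-overlapping split can be sketched as follows. This is an illustration under assumptions: the function name `normalize_features_then_instances` is hypothetical, rows are taken to be instances and columns attributes, and a small `eps` guards against all-zero rows or columns, a detail the text does not specify.

```python
import numpy as np

def normalize_features_then_instances(X, eps=1e-12):
    """Scale each attribute (column), then each instance (row), by its l2 norm."""
    X = np.asarray(X, dtype=float)
    col_norms = np.linalg.norm(X, axis=0)
    X = X / np.maximum(col_norms, eps)   # attribute-wise normalization
    row_norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(row_norms, eps)  # afterwards every ||x_i|| <= 1

# Small synthetic stand-in for the data matrix (hypothetical values).
rng = np.random.default_rng(0)
X_norm = normalize_features_then_instances(rng.normal(size=(20, 5)))

# Even, non-overlapping assignment of 95410 instance indices to ten workers.
worker_indices = np.array_split(np.arange(95410), 10)
```

With $95410$ instances and ten workers, each index block has exactly $9541$ entries, matching the split described above.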

We set up a scenario where a communication delay, $t_{delay}$ , exists between the central node and its direct child nodes. Therefore, in a star network, the communication delay exists between the central node and the local workers, while in a tree network the delay exists between the central node and the sub-central nodes. We assume that communication delays between sub-central nodes and local workers are negligible. We set the communication delay $t_{delay}=r\times t_{lp}$ , where $t_{lp}$ is the computational time for one local iteration at a worker and the delay severity $r$ is varied from $1$ to $10^{4}$ . Hence, a large delay severity $r$ means that the communication delay in the network is large compared to the local processing time for one iteration. For the algorithm in the tree network, we set the number of local iterations in local workers and the number of communications between the local workers and the sub-central node to $1000$ and $2$ respectively. For the algorithm in the star network, the number of local iterations at local workers is set to $1000$ . Figure 6 shows the duality gap at the central node versus the operation time, and demonstrates that as the communication delay severity increases, the gap in convergence speed between the tree network and the star network widens, which indicates that the distributed algorithm in a star network suffers more from the communication delay.

VII-A2 Covertype dataset classification problem

We further compare the distributed dual coordinate ascent in a star network and in a tree network on an $\ell_{2}$ -regularized SVM with the standard hinge loss. We assume that a communication delay exists between the central node and its direct child nodes in the communication networks. In this experiment, we use the preprocessed Covertype dataset [37], which is a binary classification dataset having $581012$ instances and $12$ attributes including label information. The $12$ attributes are expressed as $54$ columns of data with $10$ quantitative variables, $4$ binary wilderness areas and $40$ binary soil type variables. In order to satisfy the condition $||{\bm{x}}_{i}||\leq 1$ , we normalize the dataset; the labels satisfy $y_{i}\in\{-1,1\}$ , $i=1,...,m$ . In this simulation, we organize a tree network having one central node, two sub-central nodes, and eight local workers. Each sub-central node has four local workers. The instances of the dataset are evenly divided among the local workers without overlap. For the tree network, the number of communications between the local workers and the sub-central node is set to $10$ . The number of local iterations in both networks is set to $300$ .

For SVM, we consider the soft-margin SVM classification with the hinge loss function, i.e., $\ell_{i}({\bm{w}}^{T}{\bm{x}}_{i})\triangleq\max(0,1-y_{i}(\frac{1}{\lambda m}{\bm{w}}^{T}{\bm{x}}_{i}))$ , as follows:

[TABLE]

where ${\bm{A}}_{i}$ , the $i$ -th column of the matrix ${\bm{A}}$ , is $\frac{1}{\lambda m}y_{i}{\bm{x}}_{i}$ , $\max(\cdot)$ is the element-wise operator, and $\bm{0}\in{\mathbb{R}}^{m}$ and $\bm{1}\in{\mathbb{R}}^{m}$ are the all- $0$ and all- $1$ vectors respectively.

Then, the dual problem of (23) is stated as follows:

[TABLE]

Note here that while deriving the dual problem (24), we have ${\bm{w}}=\frac{1}{\lambda}{\bm{A}}{\bm{\alpha}}$ as the dual-primal variable relation. Then, the local problem for a local worker $Q$ is stated as follows:

[TABLE]

where $\overline{{\bm{w}}}\triangleq\frac{1}{\lambda}{\bm{A}}_{\overline{Q}}{\bm{\alpha}}_{\overline{Q}}={\bm{w}}-\frac{1}{\lambda}{\bm{A}}_{Q}{\bm{\alpha}}_{Q}$ . Then, in Procedure IV for updating $\bigtriangleup\alpha$ , we solve the following optimization problem:

[TABLE]

Here, we update the randomly chosen $i$ -th coordinate of ${\bm{\alpha}}$ , where $i\in Q$ . It is also possible to update the variable ${\bm{\alpha}}_{Q}$ with a block coordinate method. In order to solve the coordinate subproblem above, we first calculate its optimal solution without the box constraint, i.e., $0\leq\alpha^{(h-1)}_{i}+\bigtriangleup\alpha\leq\frac{1}{m}$ , and then project that solution onto the box constraint as follows:

[TABLE]
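The solve-then-project step can be sketched in code. The sketch below deliberately uses the standard SDCA scaling for the hinge loss, which is an assumption and not necessarily the exact scaling of (24): $\alpha_{i}\in[0,1]$ and ${\bm{w}}=\frac{1}{\lambda m}\sum_{i}\alpha_{i}y_{i}{\bm{x}}_{i}$ , so that rescaling $\alpha_{i}$ by $\frac{1}{m}$ would give the box $[0,\frac{1}{m}]$ used above. The names `hinge_sdca_epoch` and `hinge_dual_objective` are hypothetical.

```python
import numpy as np

def hinge_sdca_epoch(X, y, alpha, w, lam, num_steps, rng):
    """Run `num_steps` randomized dual coordinate steps for the hinge loss.

    Assumed scaling: alpha_i in [0, 1] and
    w = (1 / (lam * m)) * sum_i alpha_i * y_i * x_i.
    """
    m = X.shape[0]
    for _ in range(num_steps):
        i = rng.integers(m)
        x_i, y_i = X[i], y[i]
        sq_norm = x_i @ x_i
        if sq_norm == 0.0:
            continue  # an all-zero instance does not change w
        # Maximizer of the one-dimensional dual subproblem without the box.
        unconstrained = alpha[i] + lam * m * (1.0 - y_i * (x_i @ w)) / sq_norm
        # Projection onto the box constraint [0, 1].
        new_alpha_i = min(1.0, max(0.0, unconstrained))
        delta = new_alpha_i - alpha[i]
        alpha[i] = new_alpha_i
        w += delta * y_i * x_i / (lam * m)
    return alpha, w

def hinge_dual_objective(alpha, w, lam):
    return alpha.mean() - 0.5 * lam * (w @ w)

# Synthetic binary classification data (hypothetical), with ||x_i|| = 1.
rng = np.random.default_rng(0)
m, d, lam = 100, 4, 0.1
X = rng.normal(size=(m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.where(X @ rng.normal(size=d) >= 0.0, 1.0, -1.0)

alpha, w = np.zeros(m), np.zeros(d)
dual_values = [hinge_dual_objective(alpha, w, lam)]
for _ in range(5):
    alpha, w = hinge_sdca_epoch(X, y, alpha, w, lam, num_steps=m, rng=rng)
    dual_values.append(hinge_dual_objective(alpha, w, lam))
```

Because the one-dimensional subproblem is a concave quadratic, clipping its unconstrained maximizer to the box gives the constrained maximizer, so the dual objective does not decrease from step to step.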

Figure 7 shows the duality gap versus the operation time of the algorithms. As shown in Figure 7, when there is a large communication delay in a network, it is better to run more local iterations before sharing intermediate results with the central node.

VII-B Impact of communication delay on the convergence speed

In order to see the impact of the communication delay severity $r$ , which is the ratio between the communication delay and the local processing time for one iteration, on the optimal number of local iterations $T_{p}$ , we provide Figure 8, which shows the optimal number of local iterations $T_{p}$ obtained by finding the critical point of (16). In the simulation, we set $(C,K,\delta,t_{total},t_{lp},t_{cp})=(0.5,3,\nicefrac{{1}}{{300}},1,4\times 10^{-5},3\times 10^{-5})$ . We set $t_{delay}=r\times t_{lp}$ , where $r$ is a parameter indicating how severe the communication delay is. Figure 8 (a) shows the objective values of (14) when $T_{p}$ is varied from $1$ to $2000$ . The red line represents the optimal convergence bound at the optimal number of local iterations, i.e., the critical point of (16), for different delay severities. Figure 8 (b) shows the optimal number of local iterations that achieves the fastest convergence rate for different communication delay severities, where $r$ is varied from $1$ to $10^{5}$ . The red dotted line is obtained from the analytical solution in (19) with the aforementioned parameters, while the blue solid line is obtained by numerically evaluating (14) and finding the $T_{p}$ that minimizes the objective value. The simulation results in Figure 8 show that as the delay severity becomes larger, more local iterations are desired for fast convergence of the overall algorithm. It is noteworthy that in Figure 8(b), there is a difference between the numerical results from (14) and the analytical solution in (19). In particular, there is a large gap for small communication delay severity, e.g., $r=1$ . This gap occurs because in the derivation of the analytical solution in (19), we approximate (D) of (17) by assuming that the number of local iterations $T_{p}$ is large enough. Hence, the gap becomes smaller as the communication delay severity increases.
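The numerical search over $T_{p}$ can be sketched as a simple grid search. The sketch assumes that the time-based bound has the form $F(T_{p})=(1-(1-(1-\delta)^{T_{p}})\frac{C}{K})^{T_{0}}$ with $T_{0}=t_{total}/(t_{lp}T_{p}+t_{delay}+t_{cp})$ , which is our reading of the description of parts $(A)$ and $(B)$ in Section VI and not a verbatim copy of (14). The function names are hypothetical; the parameter values are the ones quoted above.

```python
import math

def log_bound(T_p, r, C=0.5, K=3, delta=1 / 300,
              t_total=1.0, t_lp=4e-5, t_cp=3e-5):
    """ln F(T_p) under the assumed form of the time-based convergence bound."""
    t_delay = r * t_lp
    outer_rounds = t_total / (t_lp * T_p + t_delay + t_cp)  # part (A)
    theta_local = (1.0 - delta) ** T_p
    return outer_rounds * math.log(1.0 - (1.0 - theta_local) * C / K)  # (A)*(B)

def best_local_iterations(r, candidates=range(1, 2001)):
    """Grid search for the T_p that minimizes the assumed bound."""
    return min(candidates, key=lambda T_p: log_bound(T_p, r))

best_by_severity = {r: best_local_iterations(r) for r in (1, 100, 10**4)}
```

Under this assumed form, the minimizing $T_{p}$ grows with the delay severity $r$ , in line with the trend reported for Figure 8(b).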

In order to see the impact of the optimal number of local iterations on a practical machine learning problem, we similarly conduct a regression task with the wine quality dataset [34] in a star network. For the number of iterations in local workers, we vary $T_{p}$ from $1000$ to $10000$ , and evaluate the convergence speed in terms of operation time and duality gap. Figures 9 (a) and (b) show the duality gap versus the operation time when the delay severity $r$ is set to $1$ and $10^{5}$ respectively. When $r=1$ , the fastest convergence is obtained at $T_{p}=2000$ , while when $r=10^{5}$ , the fastest convergence is obtained at $T_{p}=10000$ . As expected from Section VI, when the communication delay is severe, it is better to perform more local iterations before sharing the intermediate results with the central node. Conversely, if the communication delay is small, frequently sharing the intermediate results with the central node helps to improve the overall convergence speed. Moreover, we calculate the optimal number of iterations in local workers from the analytical solution (19) to see whether it agrees with the simulation results. The parameters $\delta$ , $K$ , and $C$ are set to $\delta=1/1000$ , $K=4$ , and $C=0.9$ to reflect the network and simulation settings. With those parameter values, we obtain $2117$ for $r=1$ and $6028$ for $r=10^{5}$ from the analytical solution in (19), while in the simulation, $T_{p}=2000$ for $r=1$ and $T_{p}=10000$ for $r=10^{5}$ provide the best convergence speed. Despite this difference between the simulation result and the analytical solution, (19) can still be used as a guideline for choosing the number of local iterations in local workers.

VII-C Network topology’s effect on convergence bound

In this subsection, we numerically investigate the effect of network topology on the convergence bound over the whole tree network, i.e., $\Theta_{0}$ introduced in (9). In order to see the effect of the number of nodes on the convergence bound, we firstly run simulations by varying the number of child nodes, $K_{i}$ . For the simulation, we take into account a tree network having three layers, i.e., $p=3$ . For other parameters, we set $(T_{i},C_{i})$ to $(40,0.9)$ , for all $i=0,1,...,p$ , and $\Theta_{p}$ to 0.5. We consider that all nodes have the same number of child nodes $K$ , i.e., $K_{i}=K$ , for all $i=0,1,...,p-1$ , and vary $K$ from 5 to 10. Figure 10 shows the convergence bound $\Theta_{0}$ by varying the number of child node $K$ . The red solid line and the blue dotted line represent the convergence bound expressed in (9), and its approximation introduced in (10). As the number of nodes is increased, the convergence bound $\Theta_{0}$ is also increased.

We further run simulations to investigate the effect of the number of layers, $p$, on the convergence bound, $\Theta_{0}$. For these simulations, we set $\frac{K_{i}-C_{i}}{K_{i}}=\frac{K-C}{K}$ and $T_{i}=T$ for all $i$. Under this setting, the approximated convergence bound introduced in (10) simplifies to

[TABLE]

For given $K$ and $C$, if $T$ is large enough that $(\frac{K-C}{K})^{T}\frac{CT}{K-C}<1$, then the convergence bound is expressed as

[TABLE]

Note that $\lim_{T\rightarrow\infty}(\frac{K-C}{K})^{T}\frac{CT}{K-C}=0$. If $T$ is small enough that $(\frac{K-C}{K})^{T}\frac{CT}{K-C}>1$, then for the convergence bound we have

[TABLE]

Therefore, for given $K$ and $C$, the dominant term in (9) (or (10)) changes with the number of iterations $T$ as stated in (29) or (30), and the convergence bound $\Theta_{0}$ follows two different trends, as shown in Figure 11. For Figures 11(a) and 11(b), we set the number of iterations $T$ to 5 and 20, respectively, while keeping the other parameters the same. Note that $p=1$ corresponds to the star network, and that when the number of iterations $T$ is large enough, a tree network can achieve a better convergence bound, as shown in Figure 11(b).

VII-D Parameter setting for faster convergence speed

To investigate the optimal number of local iterations that achieves the fastest convergence speed, we generate Figure 12 from (19) by varying each of the parameters $r$, $C$, $K$, and $\delta$. In Figure 12(a), the communication delay severity parameter $r$ is varied with the other parameters fixed to $(C,K,\delta)=(0.5,3,1/300)$. As shown in Figure 12(a) and in Section VII-B, when the communication delay severity $r$ increases, more local iterations before communicating with the central node are desired for a better convergence rate. Additionally, the parameter $\rho$, which is the reciprocal of the parameter $C$ in (19), indicates the level of data overlap; namely, a smaller $\rho$ means less overlapping data among local workers. To check the impact of the data overlap level on the optimal number of local iterations $T_{p}$, we vary $C$ while fixing the other parameters to $(K,\delta,r)=(3,1/300,100)$, and plot the result in Figure 12(b). From Figure 12(b), when local workers have more overlapping data among them, i.e., a larger $\rho$ or a smaller $C$, it is desirable to run more local iterations to obtain a better convergence speed. As $\delta$ decreases, the step size of the algorithm in a local worker decreases correspondingly, and more local iterations are desired. This is understandable because, with a small step size, more iterations are needed to reach an optimal point. From Figure 12(d), as the number of local workers $K$ increases, the optimal number of local iterations $T_{p}$ also increases. Since all parameters other than $K$ are fixed, increasing $K$ corresponds to increasing the total size of the dataset. With a larger total dataset, we conjecture that the greater variance in the intermediate results from local workers calls for more local iterations to reduce that variance.

VIII Conclusion and discussion

In this paper, we study and analyze the distributed dual coordinate ascent in a general tree-structured network, where a central node, sub-central nodes, and local workers are connected over a communication network. Additionally, since communication becomes a bottleneck in distributed network systems, we incorporate the communication delay into the convergence analysis of the distributed dual coordinate ascent and obtain the optimal number of local iterations that achieves the best convergence speed. In the numerical experiments, we demonstrate the usability of our algorithm and analysis in synchronous machine learning scenarios over communication networks where local workers cannot directly reach a central node due to communication constraints. More specifically, the proposed algorithm in a tree-structured network can reduce the communication overhead at the cost of more local computation. Since communication is normally the bottleneck in a distributed process, the distributed algorithm in a tree network can play a significant role in reducing the communication burden of distributed machine learning.

In addition to the work in this paper, the following topics are possible directions for future research.

• Asynchronous updating scheme: Because local workers may differ in performance, it is natural to consider an asynchronous scheme. Thus, the design and analysis of an asynchronous dual coordinate ascent algorithm for general tree network topologies is a possible next direction of research.

• Different network topologies: Since every connected network has a spanning tree, a general tree network topology is studied in this paper. However, in network models organized as a mesh, the additional connections allow the intermediate results from local workers to be shared easily with sub-central nodes and the central node, or even between local workers. Therefore, a distributed algorithm in a mesh network has the potential to converge faster than the algorithm in a tree network. Thus, studying distributed algorithms in mesh networks is of great interest for distributed machine learning.

• Various network constraints: Communication networks can have a variety of constraints, including communication delay, limited communication bandwidth, and limited transmission power. Motivated by these constraints, this paper studies the impact of communication delay on the convergence speed of distributed dual coordinate ascent. It is also interesting to study the other communication constraints in distributed algorithms.

• Training a neural network over distributed datasets: Since distributed data can be stored in any communication network, training a neural network over distributed datasets is also of great interest. By considering a spanning tree of the network, a distributed algorithm framework on a tree-structured network is a possible approach to this problem.

Appendix A Proof of Theorem 2

For this proof, we follow the proof of Theorem 2 of [10] with one additional difference, namely that we handle both the updated coordinates ${\bm{\alpha}}_{Q}$ and the non-updated coordinates ${\bm{\alpha}}_{\overline{Q}}$, and show that for a general tree node $Q$, the convergence analysis introduced in (2) holds.

Proof.

Suppose the tree node $Q$ has $K$ direct child nodes, which we index from $1$ to $K$. The convergence rate of the algorithm at the tree node $Q$ is obtained by considering the updating scheme at the node $Q$ as follows.

[TABLE]

where ${\bm{\alpha}}_{<[k]>}$ is the zero-padding version of ${\bm{\alpha}}_{[k]}$ and $Q=[1:K]=\cup_{k=1}^{K}[k]$ is the index set corresponding to workers connected to the node $Q$ . The optimal value at the node $Q$ is stated as

[TABLE]

where ${\bm{A}}_{Q}$ is the submatrix of ${\bm{A}}$ formed by the columns of ${\bm{A}}$ indexed by $Q$, and ${\bm{A}}_{\overline{Q}}{\bm{\alpha}}_{\overline{Q}}$ is denoted by $\overline{{\bm{w}}}$. From (31), we have

[TABLE]

where the inequality is obtained from Jensen's inequality. Then, we have

[TABLE]

where $\epsilon_{Q,k}(\cdot)$ is defined in (5) and the superscript $\star$ denotes the optimal solution. Then, the expectation of $D\big{(}{\bm{\alpha}}^{(t+1)}_{[1:K]},{\bm{\alpha}}_{\overline{Q}}\big{)}-D\big{(}{\bm{\alpha}}^{(t)}_{[1:K]},{\bm{\alpha}}_{\overline{Q}}\big{)}$ is lower-bounded as follows:

[TABLE]

where the last inequality is obtained from Assumption 1. And $\sum_{k=1}^{K}\epsilon_{Q,k}({\bm{\alpha}}^{(t)}_{[1:K]},{\bm{\alpha}}_{\overline{Q}})$ can be bounded as follows.

[TABLE]

We can lower-bound (A) by upper-bounding (A). For the upper-bound of (A), we have

[TABLE]

where the second inequality follows from ${\bm{A}}_{i}=\frac{1}{\lambda m}{\bm{x}}_{i}$, and the third inequality is obtained from the assumption of scaled input data, i.e., $\|{\bm{x}}_{i}\|\leq 1$. The last inequality holds by introducing $\rho_{min}$, the minimum value of $\rho$ for which it is satisfied, as follows:

[TABLE]

The condition $\rho_{min}\geq 0$ can be shown by considering a feasible solution making $\sum_{k=1}^{K}||{\bm{A}}_{[k]}{\bm{\alpha}}_{[k]}||^{2}-||{\bm{A}}_{[1:K]}{\bm{\alpha}}||^{2}=0$ , e.g., ${\bm{\alpha}}={\bm{e}}_{i}$ , where ${\bm{e}}_{i}$ is a standard unit vector having 1 in the $i$ -th entry and 0 elsewhere.
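The feasible point mentioned above can be checked on a toy instance. The sketch below uses a made-up matrix and a made-up two-block column partition (both are illustrative assumptions, as are the helper names); it only verifies that $\sum_{k}||{\bm{A}}_{[k]}{\bm{\alpha}}_{[k]}||^{2}-||{\bm{A}}{\bm{\alpha}}||^{2}$ vanishes when ${\bm{\alpha}}$ is a standard unit vector, because only one block is then nonzero.

```python
def matvec(cols, alpha):
    """Return sum_j alpha[j] * cols[j], where cols[j] is the j-th column."""
    dim = len(cols[0])
    return [sum(a * col[i] for a, col in zip(alpha, cols)) for i in range(dim)]


def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(x * x for x in v)


def block_gap(cols, blocks, alpha):
    """Compute sum_k ||A_[k] alpha_[k]||^2 - ||A alpha||^2 for a column partition."""
    per_block = 0.0
    for idx in blocks:
        block_cols = [cols[j] for j in idx]
        block_alpha = [alpha[j] for j in idx]
        per_block += sq_norm(matvec(block_cols, block_alpha))
    return per_block - sq_norm(matvec(cols, alpha))


# Illustrative data: four columns in R^2, split between two workers.
cols = [[0.5, 0.0], [0.3, 0.4], [0.0, 1.0], [0.6, 0.8]]
blocks = [[0, 1], [2, 3]]

for i in range(len(cols)):
    e_i = [1.0 if j == i else 0.0 for j in range(len(cols))]
    print(i, block_gap(cols, blocks, e_i))
```

For a general ${\bm{\alpha}}$ the same quantity is typically nonzero, which is what the parameter $\rho$ has to absorb.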

Then, (A), which is $\sum_{k=1}^{K}\epsilon_{Q,k}({\bm{\alpha}}^{(t)}_{[1:K]},{\bm{\alpha}}_{\overline{Q}})$ , is lower-bounded as follows:

[TABLE]

where $\eta$ in the second inequality is introduced for a line search between the optimal solution ${\bm{\alpha}}^{\star}_{[1:K]}$ and ${\bm{\alpha}}^{(t)}_{[1:K]}$, and equality holds when $\hat{{\bm{\alpha}}}_{[1:K]}$ lies on the line segment between ${\bm{\alpha}}^{\star}_{[1:K]}$ and ${\bm{\alpha}}^{(t)}_{[1:K]}$. The third inequality is obtained from the strong concavity of $D({\bm{\alpha}})$. Specifically, we use the well-known fact that if a function $\ell_{i}(a)$ is $\frac{1}{\gamma}$-smooth, then its conjugate function $\ell_{i}^{*}$ is $\gamma$-strongly convex, i.e., for all $u,v\in{\mathbb{R}}$ and $\eta\in[0,1]$ [14]:

[TABLE]
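For reference, the fact cited above is the standard definition of $\gamma$-strong convexity applied to the conjugate function. The display below is a reconstruction in the notation of this proof, following the form commonly used for this step, and is offered as a sketch rather than a verbatim copy of the original equation:

$$
\ell_{i}^{*}\big(\eta u+(1-\eta)v\big)\leq\eta\,\ell_{i}^{*}(u)+(1-\eta)\,\ell_{i}^{*}(v)-\frac{\gamma}{2}\,\eta(1-\eta)(u-v)^{2}.
$$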

From (35), we have the following inequality for $D\big{(}\eta{\bm{\alpha}}^{\star}_{[1:K]}+(1-\eta){\bm{\alpha}}^{(t)}_{[1:K]},{\bm{\alpha}}_{\overline{Q}}\big{)}$ :

[TABLE]

Notice that $\eta\in[0,1]$. Also note that we derive the equations by using ${\bm{A}}({\bm{\alpha}}^{\star}_{[1:K]},{\bm{\alpha}}_{\overline{Q}})$; however, at each node, we do not know ${\bm{A}}_{\overline{Q}}$ but only $\overline{{\bm{w}}}$. Therefore, strictly speaking, $({\bm{A}}_{Q}{\bm{\alpha}}^{\star}_{[1:K]}+\overline{{\bm{w}}})$ is the correct notation for the term ${\bm{A}}({\bm{\alpha}}^{\star}_{[1:K]},{\bm{\alpha}}_{\overline{Q}})$. To show the dual objective function clearly, we nevertheless use ${\bm{A}}({\bm{\alpha}}^{\star}_{[1:K]},{\bm{\alpha}}_{\overline{Q}})$ instead of $({\bm{A}}_{Q}{\bm{\alpha}}^{\star}_{[1:K]}+\overline{{\bm{w}}})$; the derivation goes through with either form.

(A) can be lower-bounded by choosing $\eta=\frac{\lambda m\gamma}{\lambda m\gamma+\rho}\geq 0$ as

[TABLE]
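As a quick sanity check on this choice (a sketch based on the analogous step in the star-network proof of [10], not a reproduction of the omitted display), note that $\eta=\frac{\lambda m\gamma}{\lambda m\gamma+\rho}$ lies in $[0,1]$ whenever $\lambda m\gamma>0$ and $\rho\geq 0$, and that it satisfies the balance identity

$$
(1-\eta)\,\lambda m\gamma=\frac{\rho\,\lambda m\gamma}{\lambda m\gamma+\rho}=\eta\,\rho .
$$

In the star-network argument, this is the relation that makes the quadratic term in the line-search bound vanish; we assume it plays the same role here.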

Therefore, we have

[TABLE]

From (36), we have

[TABLE]

By moving the term $D({\bm{\alpha}}^{\star}_{[1:K]})-D({\bm{\alpha}}^{(t)}_{[1:K]},{\bm{\alpha}}_{\overline{Q}})$ from the LHS to the RHS and multiplying both sides by $-1$, we have

[TABLE]

∎

Appendix B Derivation of the optimal number of local iterations $T_{p}$

To simplify (16), we denote $1-\delta$, $\frac{K-C}{K}$, $\frac{C}{K}$, and $(t_{delay}+t_{cp})/t_{lp}$ by $a$, $b$, $c$, and $r$, respectively. We then have the following first-order condition over $T_{p}$ for given $a$, $b$, $c$, and $r$:

[TABLE]

where $a,b,c\in[0,1)$ and $b+c=1$. When $T_{p}$ is large enough, $b(T_{p}+r)a^{T_{p}}\ln(a)$ is the dominant term of (37); note that $0<a<1$, so $a^{T_{p}}$ becomes small. Therefore, by approximating the term $(b+ca^{T_{p}})\ln(b+ca^{T_{p}})$ by $b\ln(b)$, we have $b(T_{p}+r)a^{T_{p}}\ln(a)=b\ln(b)$. This equation is then restated as follows:

[TABLE]

By the definition of the Lambert W-function $W(\cdot)$, namely that the solution $x$ of $xe^{x}=y$ is $x=W(y)$, we have

[TABLE]

Therefore, for the optimal number of local iterations $T_{p}$ , we have

[TABLE]
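The steps above can be traced explicitly. The following is a sketch reconstructed from the approximation $b(T_{p}+r)a^{T_{p}}\ln(a)=b\ln(b)$ stated in this appendix; the substitution variable $u$ and the branch label $W_{-1}$ are notation introduced here and are not taken from the original displays.

$$
\begin{aligned}
(T_{p}+r)\,a^{T_{p}}\ln(a)&=\ln(b),\\
(T_{p}+r)\ln(a)\,e^{(T_{p}+r)\ln(a)}&=a^{r}\ln(b),\\
u\,e^{u}&=a^{r}\ln(b),\qquad u=(T_{p}+r)\ln(a),\\
T_{p}&=\frac{W\big(a^{r}\ln(b)\big)}{\ln(a)}-r .
\end{aligned}
$$

The second line multiplies both sides by $a^{r}$ and uses $a^{T_{p}+r}=e^{(T_{p}+r)\ln(a)}$. Because $a^{r}\ln(b)$ is negative, $W$ has two real branches on $[-1/e,0)$. Evaluating the lower branch $W_{-1}$ with $\delta=1/1000$, $K=4$, and $C=0.9$ gives $T_{p}\approx 2117$ for $r=1$ and $T_{p}\approx 6028$ for $r=10^{5}$, consistent with the values quoted in Section VII-B.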

Bibliography

[1] M. Chen, S. Mao, and Y. Liu, “Big data: A survey,” Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.
[2] J. P. Verma, B. Patel, and A. Patel, “Big data analysis: recommendation system with hadoop framework,” in Proceedings of IEEE International Conference on Computational Intelligence & Communication Technology, 2015, pp. 92–97.
[3] J. Andreu-Perez, C. Poon, R. D. Merrifield, S. Wong, and G.-Z. Yang, “Big data for health,” IEEE Journal of Biomedical and Health Informatics, vol. 19, no. 4, pp. 1193–1208, 2015.
[4] S. Efromovich, J. Lakey, M. C. Pereyra, and N. Tymes, “Data-driven and optimal denoising of a signal and recovery of its derivative using multiwavelets,” IEEE Transactions on Signal Processing, vol. 52, no. 3, pp. 628–635, 2004.
[5] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network intelligence at the edge,” Proceedings of the IEEE, vol. 107, no. 11, pp. 2204–2239, 2019.
[6] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, “Toward an intelligent edge: Wireless communication meets machine learning,” IEEE Communications Magazine, vol. 58, no. 1, pp. 19–25, 2020.
[7] J. Chen, R. Monga, S. Bengio, and R. Jozefowicz, “Revisiting distributed synchronous SGD,” in Proceedings of the International Conference on Learning Representations Workshop Track, 2016.
[8] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis, “Large-scale matrix factorization with distributed stochastic gradient descent,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2011, pp. 69–77.
