Distributed Stochastic Gradient Method for Non-Convex Problems with Applications in Supervised Learning

Jemin George; Tao Yang; He Bai; Prudhvi Gurram

arXiv:1908.06693·math.OC·August 20, 2019


TL;DR

This paper introduces a distributed stochastic gradient descent algorithm tailored for non-convex optimization problems, demonstrating its effectiveness in collaborative neural network training for digit recognition across networked agents.

Contribution

It presents a novel distributed stochastic gradient method with convergence guarantees for non-convex problems and applies it successfully to distributed supervised learning tasks.

Findings

1. Agents achieve similar performance to centralized training.
2. Distributed training enables recognition without local data for all classes.
3. The algorithm converges under specific step-size conditions.

Abstract

We develop a distributed stochastic gradient descent algorithm for solving non-convex optimization problems under the assumption that the local objective functions are twice continuously differentiable with Lipschitz continuous gradients and Hessians. We provide sufficient conditions on step-sizes that guarantee the asymptotic mean-square convergence of the proposed algorithm. We apply the developed algorithm to a distributed supervised-learning problem, in which a set of networked agents collaboratively train their individual neural nets to recognize handwritten digits in images. Results indicate that all agents report similar performance that is also comparable to the performance of a centrally trained neural net. Numerical results also show that the proposed distributed algorithm allows the individual agents to recognize the digits even though the training data corresponding to all…

Equations

$$R_{i}(\bm{w})=\int_{\mathbb{R}^{d_{x}}\times\mathbb{R}^{d_{y}}}\ell\left(h\left(\bm{x}_{i};\bm{w}\right),\bm{y}_{i}\right)\,dP_{i}\left(\bm{x}_{i},\bm{y}_{i}\right)=\mathbb{E}_{P_{i}}\left[\ell\left(h\left(\bm{x}_{i};\bm{w}\right),\bm{y}_{i}\right)\right].$$

$$R(\bm{w})=\sum_{i=1}^{n}R_{i}(\bm{w})=\sum_{i=1}^{n}\mathbb{E}_{P_{i}}\left[\ell\left(h\left(\bm{x}_{i};\bm{w}\right),\bm{y}_{i}\right)\right].$$

$$\bar{R}_{i}(\bm{w})=\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}\ell\left(h\left(\bm{x}_{i}^{j};\bm{w}\right),\bm{y}_{i}^{j}\right).$$

$$\bar{R}(\bm{w})=\sum_{i=1}^{n}\bar{R}_{i}(\bm{w})=\sum_{i=1}^{n}\left[\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}\ell\left(h\left(\bm{x}_{i}^{j};\bm{w}\right),\bm{y}_{i}^{j}\right)\right].$$

$$\min_{\bm{w}}f(\bm{w})=\min_{\bm{w}}\sum_{i=1}^{n}f_{i}(\bm{w}),$$

$$\bm{w}_{i}(k+1)=\bm{w}_{i}(k)-\beta_{k}\sum_{j=1}^{n}a_{ij}\left(\bm{w}_{i}(k)-\bm{w}_{j}(k)\right)-\alpha_{k}\mathbf{g}_{i}\left(\bm{w}_{i}(k),\bm{\xi}_{i}(k)\right),$$

$$\mathbf{g}_{i}\left(\bm{w}_{i}(k),\bm{\xi}_{i}(k)\right)=\begin{cases}\nabla\ell\left(\bm{w}_{i}(k),\bm{\xi}_{i}^{k}\right),&\text{or}\\ \frac{1}{n_{i}(k)}\sum\limits_{s=1}^{n_{i}(k)}\nabla\ell\left(\bm{w}_{i}(k),\bm{\xi}_{i}^{k,s}\right),&\text{or}\\ H_{i}(k)\frac{1}{n_{i}(k)}\sum\limits_{s=1}^{n_{i}(k)}\nabla\ell\left(\bm{w}_{i}(k),\bm{\xi}_{i}^{k,s}\right),\end{cases}$$

$$\mathbf{w}(k+1)$$

$$\mathbf{g}\left(\mathbf{w}(k),\bm{\xi}(k)\right)\triangleq\begin{bmatrix}\mathbf{g}_{1}\left(\bm{w}_{1}(k),\bm{\xi}_{1}(k)\right)\\ \vdots\\ \mathbf{g}_{n}\left(\bm{w}_{n}(k),\bm{\xi}_{n}(k)\right)\end{bmatrix}\in\mathbb{R}^{nd_{w}}.$$

$$\left\|f_{i}(\bm{w}_{a})-f_{i}(\bm{w}_{b})\right\|_{2}$$

$$\left\|\nabla f_{i}(\bm{w}_{a})-\nabla f_{i}(\bm{w}_{b})\right\|_{2}$$

$$F(\mathbf{w}(k))=\sum_{i=1}^{n}f_{i}\left(\bm{w}_{i}(k)\right).$$

$$\left\|\nabla F(\mathbf{w}_{a})-\nabla F(\mathbf{w}_{b})\right\|_{2}\leq L\left\|\mathbf{w}_{a}-\mathbf{w}_{b}\right\|_{2},$$

$$\left\|\nabla F(\mathbf{w})\right\|_{2}\leq\mu_{F},\quad\forall\,\mathbf{w}\in\mathbb{R}^{nd_{w}},$$

$$F(\mathbf{w}_{b})$$

$$F_{\inf}\leq F(\mathbf{w}),\quad\forall\,\mathbf{w}\in\mathbb{R}^{nd_{w}}$$

$$\alpha_{k}=\frac{a}{(k+1)^{\delta_{2}}}\quad\text{and}\quad\beta_{k}=\frac{b}{(k+1)^{\delta_{1}}},$$

$$\mathbf{x}^{\top}\mathcal{L}\,\mathbf{x}=\tilde{\mathbf{x}}^{\top}\mathcal{L}\,\tilde{\mathbf{x}}\geq\lambda_{2}(\mathcal{L})\left\|\tilde{\mathbf{x}}\right\|_{2}^{2},$$

$$\mathcal{W}_{0}=\left(I_{n}-b\mathcal{L}\right)$$

$$\mathbb{E}_{\xi}\left[\mathbf{w}_{k+1}\right]=\left(\mathcal{W}_{k}\otimes I_{d_{w}}\right)\mathbf{w}_{k}-\alpha_{k}\mathbb{E}\left[\mathbf{g}(\mathbf{w}_{k},\bm{\xi}_{k})\mid\mathcal{F}_{k}\right]\quad\text{a.s.},$$

$$\mathbb{E}_{\xi}\left[\mathbf{g}(\mathbf{w}_{k},\bm{\xi}_{k})\right]=\nabla F(\mathbf{w}_{k}),\quad\text{a.s.}$$

$$\mathbb{E}_{\xi}\left[\mathbf{g}(\mathbf{w}_{k},\bm{\xi}_{k})\right]$$

$$\mathbb{E}_{\xi}\left[\left\|\mathbf{g}(\mathbf{w}_{k},\bm{\xi}_{k})\right\|_{2}^{2}\right]\leq\bar{\mu}_{v_{1}}+\bar{\mu}_{v_{2}}\left\|\nabla F(\mathbf{w}_{k})\right\|_{2}^{2},\quad\text{a.s.}$$

$$\sup_{k\geq 0}\,\mathbb{E}\left[\left\|\mathbf{g}(\mathbf{w}_{k},\bm{\xi}_{k})\right\|_{2}^{2}\right]\leq\mu_{g}.$$

$$\mathbb{E}\left[\left\|\tilde{\mathbf{w}}_{k}\right\|_{2}^{2}\right]=O\left(\frac{1}{(k+1)^{\delta_{2}}}\right).$$

$$\gamma_{k}=\frac{\alpha_{k}}{\beta_{k}}=\frac{a/b}{(k+1)^{\delta_{2}-\delta_{1}}}.$$

$$V(\gamma_{k},\mathbf{w}_{k})=F(\mathbf{w}_{k})+\frac{1}{2\gamma_{k}}\mathbf{w}_{k}^{\top}\left(\mathcal{L}\otimes I_{d_{w}}\right)\mathbf{w}_{k}.$$

$$\nabla V(\gamma_{k},\mathbf{w}_{k})=\nabla F(\mathbf{w}_{k})+\frac{1}{\gamma_{k}}\left(\mathcal{L}\otimes I_{d_{w}}\right)\mathbf{w}_{k}.$$

$$\sum_{k=0}^{\infty}\alpha_{k}\,\mathbb{E}\left[\left\|\nabla V(\gamma_{k},\mathbf{w}_{k})\right\|_{2}^{2}\right]<\infty.$$

$$\sum_{k=0}^{\infty}\mathbb{E}\left[\left\|\mathbf{w}_{k+1}-\mathbf{w}_{k}\right\|_{2}^{2}\right]$$


Code & Models

Repositories

jeming7/DistributedSupervisedLearning

Full text

Distributed Stochastic Gradient Method for Non-Convex Problems with Applications in Supervised Learning

J. George, T. Yang, H. Bai and P. Gurram

J. George is with U.S. Army Research Laboratory, Adelphi, MD 20783, USA. T. Yang is with University of North Texas, Denton, TX 76203, USA. H. Bai is with Oklahoma State University, Stillwater, OK 74078, USA. P. Gurram is with Booz Allen Hamilton & U.S. Army Research Laboratory, Adelphi, MD 20783, USA.

Abstract

We develop a distributed stochastic gradient descent algorithm for solving non-convex optimization problems under the assumption that the local objective functions are twice continuously differentiable with Lipschitz continuous gradients and Hessians. We provide sufficient conditions on step-sizes that guarantee the asymptotic mean-square convergence of the proposed algorithm. We apply the developed algorithm to a distributed supervised-learning problem, in which a set of networked agents collaboratively train their individual neural nets to recognize handwritten digits in images. Results indicate that all agents report similar performance that is also comparable to the performance of a centrally trained neural net. Numerical results also show that the proposed distributed algorithm allows the individual agents to recognize the digits even though the training data corresponding to all the digits is not locally available to each agent.

I Introduction

With the advent of smart devices, there has been an exponential growth in the amount of data collected and stored locally on the individual devices. Applying machine learning to extract value from such massive data to provide data-driven insights, decisions, and predictions has been a hot research topic as well as the focus of numerous businesses such as Google, Facebook, Alibaba, and Yahoo. However, porting these vast amounts of data to a data center to conduct traditional machine learning has raised two main issues: (i) the communication challenge associated with transferring vast amounts of data from a large number of devices to a central location and (ii) the privacy issues associated with sharing raw data. Distributed machine learning techniques based on the server-client architecture [1, 2] have been proposed as solutions to this problem. On one extreme end of this architecture, we have the parameter server approach, where a server or group of servers initiates distributed learning by pushing the current model to a set of client nodes that host the data. Client nodes compute the local gradients or parameter updates and communicate them to the server nodes. Server nodes aggregate these values and update the current model [3, 4]. On the other extreme, we have federated learning, where each client node obtains a local solution to the learning problem and the server node computes a global model by simply averaging the local models [5, 6]. These distributed learning techniques are not truly distributed since they follow a master-slave architecture and do not involve any peer-to-peer communication. Though these techniques are not always robust and are rendered useless if the server fails, they do provide a good business opportunity for companies that own servers and host web services. However, our aim is to develop a fully distributed machine learning architecture enabled by client-to-client interaction.

For large-scale machine learning, stochastic gradient descent (SGD) methods are often preferred over batch gradient methods [7] because (i) in many large-scale problems, there is a good deal of redundancy in data and therefore it is inefficient to use all the data in every optimization iteration, (ii) the computational cost involved in computing the batch gradient is much higher than that of the stochastic gradient, and (iii) stochastic methods are more suitable for online learning where data are arriving sequentially. Since most machine learning problems are non-convex, there is a need for distributed stochastic gradient methods for non-convex problems. Therefore, here we present a distributed stochastic gradient algorithm for non-convex problems and demonstrate its utility for distributed machine learning.

A few early examples of (non-stochastic or deterministic) distributed non-convex optimization algorithms include the Distributed Approximate Dual Subgradient (DADS) Algorithm [8], NonconvEx primal-dual SpliTTing (NESTT) algorithm [9], and the Proximal Primal-Dual Algorithm (Prox-PDA) [10]. More recently, a non-convex version of the accelerated distributed augmented Lagrangians (ADAL) algorithm is presented in [11] and successive convex approximation (SCA)-based algorithms such as iNner cOnVex Approximation (NOVA) and in-Network succEssive conveX approximaTion algorithm (NEXT) are given in [12] and [13], respectively. References [14, 15, 16] provide several distributed alternating direction method of multipliers (ADMM) based non-convex optimization algorithms. Non-convex versions of Decentralized Gradient Descent (DGD) and Proximal Decentralized Gradient Descent (Prox-DGD) are given in [17]. Finally, Zeroth-Order NonconvEx (ZONE) optimization algorithms for mesh network (ZONE-M) and star network (ZONE-S) are presented in [18].

There exist several works on distributed stochastic gradient methods, but mainly for strongly convex optimization problems. These include the stochastic subgradient-push method for distributed optimization over time-varying directed graphs given in [19], distributed stochastic optimization over random networks given in [20], the Stochastic Unbiased Curvature-aided Gradient (SUCAG) method given in [21], and distributed stochastic gradient tracking methods [22]. There are very few works on distributed stochastic gradient methods for non-convex optimization [23, 24]; however, they make very restrictive assumptions on the critical points of the problem.

Contributions of this paper are three-fold:

1. We propose a fully distributed machine learning architecture that does not require any server nodes.
2. We develop a distributed SGD algorithm and provide sufficient conditions on step-sizes such that the algorithm is mean-square convergent.
3. We demonstrate the utility of the proposed SGD algorithm for distributed machine learning.

I-A Notation

Let $\mathbb{R}^{n\times m}$ denote the set of $n\times m$ real matrices. For a vector $\bm{\phi}$ , $\phi_{i}$ is the $i^{\text{th}}$ entry of $\bm{\phi}$ . An $n\times n$ identity matrix is denoted as $I_{n}$ and $\mathbf{1}_{n}$ denotes an $n$ -dimensional vector of all ones. For $p\in[1,\,\infty]$ , the $p$ -norm of a vector $\mathbf{x}$ is denoted as $\left\|\mathbf{x}\right\|_{p}$ . For matrices $A\in\mathbb{R}^{m\times n}$ and $B\in\mathbb{R}^{p\times q}$ , $A\otimes B\in\mathbb{R}^{mp\times nq}$ denotes their Kronecker product.

For a graph $\mathcal{G}\left(\mathcal{V},\mathcal{E}\right)$ of order $n$ , $\mathcal{V}\triangleq\left\{v_{1},\ldots,v_{n}\right\}$ represents the agents or nodes and the communication links between the agents are represented as $\mathcal{E}\triangleq\left\{e_{1},\ldots,e_{\ell}\right\}\subseteq\mathcal{V}\times\mathcal{V}$ . Let $\mathcal{A}=\left[a_{ij}\right]\in\mathbb{R}^{n\times n}$ be the adjacency matrix with entries of $a_{ij}=1$ if $(v_{i},v_{j})\in\mathcal{E}$ and zero otherwise. Define $\Delta=\text{diag}\left(\mathcal{A}\mathbf{1}_{n}\right)$ as the in-degree matrix and $\mathcal{L}=\Delta-\mathcal{A}$ as the graph Laplacian.
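As a concrete illustration of this notation, the following minimal sketch builds the adjacency matrix and the Laplacian $\mathcal{L}=\Delta-\mathcal{A}$ for an undirected, unweighted ring. The helper names `ring_adjacency` and `laplacian` are hypothetical and do not come from the paper.

```python
def ring_adjacency(n):
    """Adjacency matrix of an undirected, unweighted ring on n >= 3 nodes."""
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in ((i - 1) % n, (i + 1) % n):
            A[i][j] = 1
    return A


def laplacian(A):
    """Graph Laplacian L = Delta - A, where Delta = diag(A 1_n)."""
    n = len(A)
    degrees = [sum(row) for row in A]  # entries of the vector A 1_n
    return [[(degrees[i] if i == j else 0) - A[i][j] for j in range(n)]
            for i in range(n)]


A = ring_adjacency(5)
L = laplacian(A)
row_sums = [sum(row) for row in L]  # L 1_n = 0 for any graph Laplacian
```

Each row of `L` sums to zero, which is the property that makes $\mathbf{1}_{n}$ an eigenvector of $\mathcal{L}$ with eigenvalue zero.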

II Distributed Machine Learning

Our problem formulation closely follows the centralized machine learning problem discussed in [7]. Consider a networked set of $n$ agents, each with a set of $m_{i}$ , $i=1,\ldots,n$ , independently drawn input-output samples $\{\bm{x}_{i}^{j},\,\bm{y}_{i}^{j}\}_{j=1}^{j=m_{i}}$ , where $\bm{x}_{i}^{j}\in\mathbb{R}^{d_{x}}$ and $\bm{y}_{i}^{j}\in\mathbb{R}^{d_{y}}$ are the $j$ -th input and output data, respectively, associated with the $i$ -th agent. For example, the input data could be images and the outputs could be labels. Let $h\left(\cdot\,;\,\cdot\right):\mathbb{R}^{d_{x}}\times\mathbb{R}^{d_{w}}\mapsto\mathbb{R}^{d_{y}}$ , denote the prediction function, fully parameterized by the vector $\bm{w}\in\mathbb{R}^{d_{w}}$ . Each agent aims to find the parameter vector that minimizes the losses, $\ell\left(\cdot\,;\,\cdot\right):\mathbb{R}^{d_{y}}\times\mathbb{R}^{d_{y}}\mapsto\mathbb{R}$ , incurred from inaccurate predictions. Thus, the loss function $\ell\left(h\left(\bm{x}_{i};\bm{w}\right),\bm{y}_{i}\right)$ yields the loss incurred by the $i$ -th agent, where $h\left(\bm{x}_{i};\bm{w}\right)$ and $\bm{y}_{i}$ are the predicted and true outputs, respectively for the $i$ -th node.

Assuming the input output space $\mathbb{R}^{d_{x}}\times\mathbb{R}^{d_{y}}$ associated with the $i$ -th agent is endowed with a probability measure $P_{i}~{}:~{}\mathbb{R}^{d_{x}}\times\mathbb{R}^{d_{y}}\mapsto[0,\,1]$ , the objective function an agent wishes to minimize is

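$$R_{i}(\bm{w})=\int_{\mathbb{R}^{d_{x}}\times\mathbb{R}^{d_{y}}}\ell\left(h\left(\bm{x}_{i};\bm{w}\right),\bm{y}_{i}\right)\,dP_{i}\left(\bm{x}_{i},\bm{y}_{i}\right)=\mathbb{E}_{P_{i}}\left[\ell\left(h\left(\bm{x}_{i};\bm{w}\right),\bm{y}_{i}\right)\right]. \tag{1}$$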

Here $R_{i}(\bm{w})$ denotes the expected risk given a parameter vector $\bm{w}$ with respect to the probability distribution $P_{i}$ . The total expected risk across all networked agents is given as

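$$R(\bm{w})=\sum_{i=1}^{n}R_{i}(\bm{w})=\sum_{i=1}^{n}\mathbb{E}_{P_{i}}\left[\ell\left(h\left(\bm{x}_{i};\bm{w}\right),\bm{y}_{i}\right)\right]. \tag{2}$$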

Minimizing the expected risk is desirable but often unattainable since the distributions $P_{i}$ are unknown. Thus, in practice each agent chooses to minimize the empirical risk $\bar{R}_{i}(\bm{w})$ defined as

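$$\bar{R}_{i}(\bm{w})=\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}\ell\left(h\left(\bm{x}_{i}^{j};\bm{w}\right),\bm{y}_{i}^{j}\right). \tag{3}$$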

Here, the assumption is that $m_{i}$ is large enough so that $\bar{R}_{i}(\bm{w})\approx R_{i}(\bm{w})$ . The total empirical risk across all networked agents is

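$$\bar{R}(\bm{w})=\sum_{i=1}^{n}\bar{R}_{i}(\bm{w})=\sum_{i=1}^{n}\left[\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}\ell\left(h\left(\bm{x}_{i}^{j};\bm{w}\right),\bm{y}_{i}^{j}\right)\right]. \tag{4}$$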

In order to simplify the notation, let us represent a sample input-output pair $(\bm{x}_{i},\,\bm{y}_{i})$ by a random seed $\bm{\xi}_{i}$ and let $\bm{\xi}_{i}^{j}$ denote the $j$ -th sample associated with the $i$ -th agent. Define the loss incurred for a given $\left(\bm{w},\bm{\xi}_{i}^{j}\right)$ as $\ell\left(\bm{w},\bm{\xi}_{i}^{j}\right)$ . Now, the distributed learning problem can be posed as an optimization involving the sum of local empirical risks, i.e.,

$$\min_{\bm{w}}f(\bm{w})=\min_{\bm{w}}\sum_{i=1}^{n}f_{i}(\bm{w}), \tag{5}$$

where $f_{i}\left(\bm{w}\right)=\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}\ell\left(\bm{w},\bm{\xi}_{i}^{j}\right)$ .
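The objective above can be sketched in a few lines of code. This is a minimal illustration only: the scalar model $h(x;w)=wx$, the squared-error loss, and the names `local_empirical_risk`, `total_empirical_risk`, and `squared_error` are assumptions made for the example and are not taken from the paper.

```python
def local_empirical_risk(w, samples, loss):
    """f_i(w): average loss over agent i's m_i samples."""
    return sum(loss(w, xi) for xi in samples) / len(samples)


def total_empirical_risk(w, datasets, loss):
    """f(w) = sum_i f_i(w): the sum (not the average) of the local risks."""
    return sum(local_empirical_risk(w, samples, loss) for samples in datasets)


# Hypothetical toy setup: scalar model h(x; w) = w * x with squared-error loss.
def squared_error(w, xi):
    x, y = xi
    return (w * x - y) ** 2


datasets = [
    [(1.0, 2.0), (2.0, 4.0)],  # agent 1, m_1 = 2
    [(3.0, 6.0)],              # agent 2, m_2 = 1
]
risk_at_optimum = total_empirical_risk(2.0, datasets, squared_error)
risk_elsewhere = total_empirical_risk(1.0, datasets, squared_error)
```

Note that each local risk is normalized by its own $m_{i}$ , so agents with fewer samples carry the same weight in $f$ as agents with more.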

III Distributed SGD

Here we propose a distributed stochastic gradient method to solve (5). Let $\bm{w}_{i}(k)\in\mathbb{R}^{d_{w}}$ denote agent $i$ ’s estimate of the optimizer at time instant $k$ . Thus, for an arbitrary initial condition $\bm{w}_{i}(0)$ , the update rule at node $i$ is as follows:

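$$\bm{w}_{i}(k+1)=\bm{w}_{i}(k)-\beta_{k}\sum_{j=1}^{n}a_{ij}\left(\bm{w}_{i}(k)-\bm{w}_{j}(k)\right)-\alpha_{k}\mathbf{g}_{i}\left(\bm{w}_{i}(k),\bm{\xi}_{i}(k)\right), \tag{6}$$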

where $\alpha_{k}$ and $\beta_{k}$ are hyper parameters to be specified, $a_{ij}$ are the entries of the adjacency matrix and $\mathbf{g}_{i}\left(\bm{w}_{i}(k),\bm{\xi}_{i}(k)\right)$ represents either a simple stochastic gradient, mini-batch stochastic gradient or a stochastic quasi-Newton direction, i.e.,

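$$\mathbf{g}_{i}\left(\bm{w}_{i}(k),\bm{\xi}_{i}(k)\right)=\begin{cases}\nabla\ell\left(\bm{w}_{i}(k),\bm{\xi}_{i}^{k}\right),&\text{or}\\ \frac{1}{n_{i}(k)}\sum\limits_{s=1}^{n_{i}(k)}\nabla\ell\left(\bm{w}_{i}(k),\bm{\xi}_{i}^{k,s}\right),&\text{or}\\ H_{i}(k)\frac{1}{n_{i}(k)}\sum\limits_{s=1}^{n_{i}(k)}\nabla\ell\left(\bm{w}_{i}(k),\bm{\xi}_{i}^{k,s}\right),\end{cases}$$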

where $n_{i}(k)$ denotes the mini-batch size, $H_{i}(k)$ is a positive definite scaling matrix, $\bm{\xi}_{i}^{k}$ represents the single random input-output pair sampled at time instant $k$ , and $(\bm{\xi}_{i}^{k,s})$ denotes the $s$ -th input-output pair out of the $n_{i}(k)$ random input-output pairs sampled at time instant $k$ .

Define $\mathbf{w}(k)\triangleq\begin{bmatrix}\bm{w}_{1}^{\top}(k)&\ldots&\bm{w}_{n}^{\top}(k)\end{bmatrix}^{\top}\in\mathbb{R}^{nd_{w}}$ . Now (6) can be written as

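$$\mathbf{w}(k+1)=\left(\mathcal{W}_{k}\otimes I_{d_{w}}\right)\mathbf{w}(k)-\alpha_{k}\,\mathbf{g}\left(\mathbf{w}(k),\bm{\xi}(k)\right),$$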

where $\mathcal{W}_{k}=\left(I_{n}-\beta_{k}\mathcal{L}\right)$ , $\mathcal{L}$ is the network Laplacian and

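$$\mathbf{g}\left(\mathbf{w}(k),\bm{\xi}(k)\right)\triangleq\begin{bmatrix}\mathbf{g}_{1}\left(\bm{w}_{1}(k),\bm{\xi}_{1}(k)\right)\\ \vdots\\ \mathbf{g}_{n}\left(\bm{w}_{n}(k),\bm{\xi}_{n}(k)\right)\end{bmatrix}\in\mathbb{R}^{nd_{w}}.$$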
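The per-agent update in (6) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `dsgd_step` and the toy three-agent network are hypothetical, and the gradient vectors are passed in directly as stand-ins for the stochastic gradients $\mathbf{g}_{i}$ .

```python
def dsgd_step(w, A, grads, alpha, beta):
    """One synchronous application of update (6) to all agents.

    w     : list of n parameter vectors, w[i] = w_i(k)
    A     : n x n adjacency matrix with entries a_ij
    grads : list of n vectors standing in for g_i(w_i(k), xi_i(k))
    """
    n, d = len(w), len(w[0])
    new_w = []
    for i in range(n):
        updated = []
        for c in range(d):
            # Consensus term: sum_j a_ij (w_i - w_j), i.e. row i of (L w).
            disagreement = sum(A[i][j] * (w[i][c] - w[j][c]) for j in range(n))
            updated.append(w[i][c] - beta * disagreement - alpha * grads[i][c])
        new_w.append(updated)
    return new_w


# Hypothetical toy network: three fully connected agents, d_w = 1.
A = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
w0 = [[0.0], [3.0], [6.0]]

# With zero gradients only the consensus term acts.
w1 = dsgd_step(w0, A, [[0.0], [0.0], [0.0]], alpha=0.1, beta=0.2)

# With non-zero gradients each agent also takes a local descent step.
w1_grad = dsgd_step(w0, A, [[1.0], [0.0], [-1.0]], alpha=0.1, beta=0.2)
```

With zero gradients and a symmetric adjacency matrix, the step pulls the agents toward each other while leaving the network-wide sum of the parameters unchanged.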

III-A Assumptions

First, we state the following assumption on the individual objective functions:

Assumption 1.

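Objective functions $f_{i}(\,\cdot\,)$ and their gradients $\nabla f_{i}(\,\cdot\,)$ $:\mathbb{R}^{d_{w}}\mapsto\mathbb{R}^{d_{w}}$ are Lipschitz continuous with Lipschitz constants $L^{0}_{i}>0$ and $L_{i}>0$ , respectively, i.e., $\forall\,\bm{w}_{a},\,\bm{w}_{b}\in\mathbb{R}^{d_{w}},\,i=1,\ldots,n$ , we have

$$\left\|f_{i}(\bm{w}_{a})-f_{i}(\bm{w}_{b})\right\|_{2}\leq L^{0}_{i}\left\|\bm{w}_{a}-\bm{w}_{b}\right\|_{2},\qquad\left\|\nabla f_{i}(\bm{w}_{a})-\nabla f_{i}(\bm{w}_{b})\right\|_{2}\leq L_{i}\left\|\bm{w}_{a}-\bm{w}_{b}\right\|_{2}.$$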

Now we introduce $F(\cdot):\mathbb{R}^{nd_{w}}\mapsto\mathbb{R}$ , an aggregate objective function of local variables

$$F(\mathbf{w}(k))=\sum_{i=1}^{n}f_{i}\left(\bm{w}_{i}(k)\right).$$

Following Assumption 1, the function $F(\cdot)$ is Lipschitz continuous with Lipschitz continuous gradient $\nabla F(\cdot)$ , i.e., $\forall\,\mathbf{w}_{a},\,\mathbf{w}_{b}\in\mathbb{R}^{nd_{w}}$ , we have

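$$\left\|\nabla F(\mathbf{w}_{a})-\nabla F(\mathbf{w}_{b})\right\|_{2}\leq L\left\|\mathbf{w}_{a}-\mathbf{w}_{b}\right\|_{2},$$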

with constant $L=\max\limits_{i}\{L_{i}\}$ and $\nabla F(\,\mathbf{w}\,)\triangleq\begin{bmatrix}\nabla f_{1}(\,\bm{w}_{1}\,)^{\top}&\ldots&\nabla f_{n}(\,\bm{w}_{n}\,)^{\top}\end{bmatrix}^{\top}\in\mathbb{R}^{nd_{w}}$ .

Lemma 1.

Given Assumption 1, we have

$$\left\|\nabla F(\mathbf{w})\right\|_{2}\leq\mu_{F},\quad\forall\,\mathbf{w}\in\mathbb{R}^{nd_{w}},$$

where $\mu_{F}<\infty$ is a positive constant.

**Proof:** See Lemma 3.3 in [25].

Lemma 2.

Given Assumption 1, we have $\forall\,\mathbf{w}_{a},\,\mathbf{w}_{b}\in\mathbb{R}^{nd_{w}}$ ,

[TABLE]

**Proof:** The proof follows from the mean value theorem.

Assumption 2.

The function $F(\cdot)$ is lower bounded by $F_{\inf}$ , i.e.,

$$F_{\inf}\leq F(\mathbf{w}),\quad\forall\,\mathbf{w}\in\mathbb{R}^{nd_{w}}.$$

Without loss of generality, we assume that $F_{\inf}\geq 0$ . Now we make the following assumption regarding $\{\alpha_{k}\}$ and $\{\beta_{k}\}$ :

Assumption 3.

Sequences $\{\alpha_{k}\}$ and $\{\beta_{k}\}$ are selected as

$$\alpha_{k}=\frac{a}{(k+1)^{\delta_{2}}}\quad\text{and}\quad\beta_{k}=\frac{b}{(k+1)^{\delta_{1}}},$$

where $a>0$ , $b>0$ , $0<3\delta_{1}<\delta_{2}\leq 1$ , $\delta_{1}+\delta_{2}>1$ , and $\delta_{2}>1/2$ .

For sequences $\{\alpha_{k}\}$ and $\{\beta_{k}\}$ that satisfy Assumption 3, we have $\sum\limits_{k=1}^{\infty}\,\alpha_{k}=\infty$ , $\sum\limits_{k=1}^{\infty}\,\beta_{k}=\infty$ , $\sum\limits_{k=1}^{\infty}\,\alpha_{k}^{2}<\infty$ and $\sum\limits_{k=1}^{\infty}\,\alpha_{k}\beta_{k}<\infty$ . Thus $\alpha_{k}$ and $\beta_{k}$ are not summable sequences. However, $\alpha_{k}$ is square-summable and $\alpha_{k}\beta_{k}$ is summable.
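The conditions of Assumption 3 are easy to check programmatically. The sketch below is illustrative only; the function names `satisfies_assumption_3` and `step_sizes` and the example values $a=1$ , $b=0.2$ , $\delta_{1}=0.3$ , $\delta_{2}=1$ are assumptions chosen for the example.

```python
def satisfies_assumption_3(a, b, delta1, delta2):
    """Check a > 0, b > 0, 0 < 3*delta1 < delta2 <= 1,
    delta1 + delta2 > 1, and delta2 > 1/2."""
    return (a > 0 and b > 0
            and 0 < 3 * delta1 < delta2 <= 1
            and delta1 + delta2 > 1
            and delta2 > 0.5)


def step_sizes(k, a, b, delta1, delta2):
    """Return (alpha_k, beta_k) for iteration k >= 0."""
    alpha_k = a / (k + 1) ** delta2
    beta_k = b / (k + 1) ** delta1
    return alpha_k, beta_k


# Hypothetical example values.
a, b, delta1, delta2 = 1.0, 0.2, 0.3, 1.0
valid = satisfies_assumption_3(a, b, delta1, delta2)

# Partial sums over the first 1000 iterations: the sum of alpha_k^2 stays
# bounded, while the sum of alpha_k keeps growing (slowly) with the horizon.
alphas = [step_sizes(k, a, b, delta1, delta2)[0] for k in range(1000)]
sum_alpha = sum(alphas)
sum_alpha_sq = sum(x * x for x in alphas)
```

Because $\delta_{2}>\delta_{1}$ , the ratio $\alpha_{k}/\beta_{k}$ decays to zero, so the consensus term eventually dominates the gradient term.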

Assumption 4.

The interaction topology of $n$ networked agents is given as a connected undirected graph $\mathcal{G}\left(\mathcal{V},\mathcal{E}\right)$ .

Lemma 3.

Given Assumption 4, for all $\mathbf{x}\in\mathbb{R}^{n}$ we have

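$$\mathbf{x}^{\top}\mathcal{L}\,\mathbf{x}=\tilde{\mathbf{x}}^{\top}\mathcal{L}\,\tilde{\mathbf{x}}\geq\lambda_{2}(\mathcal{L})\left\|\tilde{\mathbf{x}}\right\|_{2}^{2},$$

where $\tilde{\mathbf{x}}=\left(I_{n}-\frac{1}{n}\mathbf{1}_{n}\mathbf{1}^{\top}_{n}\right)\mathbf{x}$ is the average-consensus error and $\lambda_{2}(\cdot)$ denotes the smallest non-zero eigenvalue.

**Proof:** This lemma follows from the Courant-Fischer Theorem [26].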
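Lemma 3 can be checked numerically on a small example. The sketch below assumes a six-node undirected ring, for which the Laplacian eigenvalues have the known closed form $2-2\cos(2\pi j/n)$ , $j=0,\ldots,n-1$ . The helper names and the test vector are hypothetical.

```python
import math


def ring_laplacian(n):
    """Laplacian of an undirected, unweighted ring on n >= 3 nodes."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = 2.0
        L[i][(i - 1) % n] -= 1.0
        L[i][(i + 1) % n] -= 1.0
    return L


def quadratic_form(L, x):
    """Compute x^T L x."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))


def consensus_error(x):
    """x_tilde = (I - (1/n) 1 1^T) x, i.e. x with its mean removed."""
    mean = sum(x) / len(x)
    return [xi - mean for xi in x]


n = 6
L = ring_laplacian(n)
lambda_2 = 2.0 - 2.0 * math.cos(2.0 * math.pi / n)  # closed form for a ring
x = [1.0, -2.0, 0.5, 4.0, 0.0, 3.0]
x_tilde = consensus_error(x)

lhs = quadratic_form(L, x)              # x^T L x
lhs_tilde = quadratic_form(L, x_tilde)  # equal, since L 1_n = 0
rhs = lambda_2 * sum(v * v for v in x_tilde)
```

The equality `lhs == lhs_tilde` (up to rounding) reflects that the Laplacian annihilates the average component of $\mathbf{x}$ , so only the consensus error contributes to the quadratic form.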

Assumption 5.

Parameter $b$ in sequence $\{\beta_{k}\}$ is selected such that

$$\mathcal{W}_{0}=\left(I_{n}-b\mathcal{L}\right)$$

has a single eigenvalue at $1$ corresponding to the right eigenvector $\mathbf{1}_{n}$ and the remaining $n-1$ eigenvalues of $\mathcal{W}_{0}$ are strictly inside the unit circle.

In other words, $b$ is selected such that $b<1/\sigma_{\max}(\mathcal{L})$ , where $\sigma_{\max}(\cdot)$ denotes the largest singular value. Thus, $b\sigma_{\max}(\mathcal{L})<1$ .
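For a given topology, the eigenvalue condition of Assumption 5 can be verified directly. The sketch below assumes an undirected ring, whose Laplacian eigenvalues are $2-2\cos(2\pi j/n)$ , so that the eigenvalues of $\mathcal{W}_{0}$ are $1-b\lambda_{j}(\mathcal{L})$ . The function names and the values of $b$ are hypothetical.

```python
import math


def ring_laplacian_eigenvalues(n):
    """Eigenvalues of the Laplacian of an undirected ring on n >= 3 nodes."""
    return [2.0 - 2.0 * math.cos(2.0 * math.pi * j / n) for j in range(n)]


def w0_eigenvalues(n, b):
    """Eigenvalues of W_0 = I_n - b L for the ring."""
    return [1.0 - b * lam for lam in ring_laplacian_eigenvalues(n)]


def satisfies_assumption_5(eigs, tol=1e-9):
    """Exactly one eigenvalue at 1, all others strictly inside the unit circle."""
    at_one = [e for e in eigs if abs(e - 1.0) <= tol]
    inside = [e for e in eigs if abs(e) < 1.0 - tol]
    return len(at_one) == 1 and len(at_one) + len(inside) == len(eigs)


n = 10
sigma_max = max(ring_laplacian_eigenvalues(n))  # equals 4 for an even ring
ok_small_b = satisfies_assumption_5(w0_eigenvalues(n, b=0.2))
ok_large_b = satisfies_assumption_5(w0_eigenvalues(n, b=0.6))
```

For the ten-node ring, $\sigma_{\max}(\mathcal{L})=4$ , so the sufficient condition $b<1/\sigma_{\max}(\mathcal{L})$ reads $b<0.25$ ; the value $b=0.6$ pushes an eigenvalue of $\mathcal{W}_{0}$ outside the unit circle.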

Let $\mathbb{E}_{\xi}[\cdot]$ denote the expected value taken with respect to the distribution of the random variable $\bm{\xi}_{k}$ given the filtration $\mathcal{F}_{k}$ generated by the sequence $\{\mathbf{w}_{0},\ldots,\mathbf{w}_{k}\}$ , i.e.,

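$$\mathbb{E}_{\xi}\left[\mathbf{w}_{k+1}\right]=\left(\mathcal{W}_{k}\otimes I_{d_{w}}\right)\mathbf{w}_{k}-\alpha_{k}\mathbb{E}\left[\mathbf{g}(\mathbf{w}_{k},\bm{\xi}_{k})\mid\mathcal{F}_{k}\right]\quad\text{a.s.},$$

where a.s. (almost surely) denotes events that occur with probability one. Now we make the following assumptions regarding the stochastic gradient term $\mathbf{g}(\mathbf{w}(k),\bm{\xi}(k))$ .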

Assumption 6.

Stochastic gradients are unbiased such that

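$$\mathbb{E}_{\xi}\left[\mathbf{g}(\mathbf{w}_{k},\bm{\xi}_{k})\right]=\nabla F(\mathbf{w}_{k}),\quad\text{a.s.}$$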

That is to say

[TABLE]

Assumption 7.

Stochastic gradients have conditionally bounded second moment, i.e., there exist scalars $\bar{\mu}_{v_{1}}\geq 0$ and $\bar{\mu}_{v_{2}}\geq 0$ such that

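$$\mathbb{E}_{\xi}\left[\left\|\mathbf{g}(\mathbf{w}_{k},\bm{\xi}_{k})\right\|_{2}^{2}\right]\leq\bar{\mu}_{v_{1}}+\bar{\mu}_{v_{2}}\left\|\nabla F(\mathbf{w}_{k})\right\|_{2}^{2},\quad\text{a.s.} \tag{23}$$

Assumption 7 is the bounded variance assumption typically made in the SGD literature. Finally, it follows from Assumptions 1, 7 and Lemma 1 that the stochastic gradients are bounded, which is usually just assumed in the literature [27, 7, 23, 17].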

Proposition 1.

There exists a positive constant $\mu_{g}<\infty$ such that

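$$\sup_{k\geq 0}\,\mathbb{E}\left[\left\|\mathbf{g}(\mathbf{w}_{k},\bm{\xi}_{k})\right\|_{2}^{2}\right]\leq\mu_{g}. \tag{24}$$

**Proof:** The proof follows from taking the expectation of (23) and applying the result from Lemma 1.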

IV Convergence Analysis

Our strategy for proving the convergence of the proposed distributed SGD algorithm to a critical point is as follows. First we show that the consensus error among the agents diminishes at the rate of $O\left(\frac{1}{(k+1)^{\delta_{2}}}\right)$ (see Theorem 1). Asymptotic convergence of the algorithm is then proved in Theorem 3. Theorem 4 then establishes that the weighted expected average gradient norm is a summable sequence. Finally, Theorem 5 proves the asymptotic mean-square convergence of the algorithm to a critical point.

Theorem 1.

Consider distributed SGD algorithm (11) under Assumptions [1-7]. Then, there holds:

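$$\mathbb{E}\left[\left\|\tilde{\mathbf{w}}_{k}\right\|_{2}^{2}\right]=O\left(\frac{1}{(k+1)^{\delta_{2}}}\right). \tag{25}$$

**Proof:** See Appendix B.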

Let

$$\gamma_{k}=\frac{\alpha_{k}}{\beta_{k}}=\frac{a/b}{(k+1)^{\delta_{2}-\delta_{1}}}. \tag{26}$$

Now define a non-negative function $V(\gamma_{k},\mathbf{w}_{k})$ as

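$$V(\gamma_{k},\mathbf{w}_{k})=F(\mathbf{w}_{k})+\frac{1}{2\gamma_{k}}\mathbf{w}_{k}^{\top}\left(\mathcal{L}\otimes I_{d_{w}}\right)\mathbf{w}_{k}. \tag{27}$$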

Now taking the gradient with respect to $\mathbf{w}$ yields

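$$\nabla V(\gamma_{k},\mathbf{w}_{k})=\nabla F(\mathbf{w}_{k})+\frac{1}{\gamma_{k}}\left(\mathcal{L}\otimes I_{d_{w}}\right)\mathbf{w}_{k}. \tag{28}$$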
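The relation between $V$ and its gradient can be sanity-checked with finite differences. The sketch below is a toy illustration under stated assumptions: it uses $d_{w}=1$ (so $\mathcal{L}\otimes I_{d_{w}}=\mathcal{L}$ ), a four-node ring, and hypothetical non-convex local objectives $f_{i}(w)=1+\cos(w)+0.1(w-c_{i})^{2}$ that are not from the paper.

```python
import math


def ring_laplacian(n):
    """Laplacian of an undirected, unweighted ring on n >= 3 nodes."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = 2.0
        L[i][(i - 1) % n] -= 1.0
        L[i][(i + 1) % n] -= 1.0
    return L


def F(w, c):
    """Aggregate objective: sum of hypothetical non-negative local objectives."""
    return sum(1.0 + math.cos(wi) + 0.1 * (wi - ci) ** 2 for wi, ci in zip(w, c))


def grad_F(w, c):
    return [-math.sin(wi) + 0.2 * (wi - ci) for wi, ci in zip(w, c)]


def laplacian_times(L, w):
    return [sum(L[i][j] * w[j] for j in range(len(w))) for i in range(len(w))]


def V(gamma, w, c, L):
    """V = F(w) + (1 / (2 gamma)) w^T L w."""
    Lw = laplacian_times(L, w)
    return F(w, c) + sum(wi * lwi for wi, lwi in zip(w, Lw)) / (2.0 * gamma)


def grad_V(gamma, w, c, L):
    """grad V = grad F(w) + (1 / gamma) L w."""
    Lw = laplacian_times(L, w)
    return [g + lwi / gamma for g, lwi in zip(grad_F(w, c), Lw)]


def finite_difference(fun, w, h=1e-6):
    """Central-difference approximation of the gradient of fun at w."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        down = list(w)
        up[i] += h
        down[i] -= h
        grad.append((fun(up) - fun(down)) / (2.0 * h))
    return grad


L = ring_laplacian(4)
c = [0.0, 1.0, -1.0, 2.0]
w = [0.3, -0.7, 1.2, 0.5]
gamma = 0.5
analytic = grad_V(gamma, w, c, L)
numeric = finite_difference(lambda v: V(gamma, v, c, L), w)
max_gap = max(abs(p - q) for p, q in zip(analytic, numeric))
```

The quadratic term penalizes disagreement among agents, and its weight $1/(2\gamma_{k})$ grows with $k$ because $\gamma_{k}\rightarrow 0$ .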

Theorem 2.

Consider distributed SGD algorithm (11) under Assumptions [1-7]. Then, for the gradient $\nabla V(\gamma_{k},\mathbf{w}_{k})$ given in (28), there holds:

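$$\sum_{k=0}^{\infty}\alpha_{k}\,\mathbb{E}\left[\left\|\nabla V(\gamma_{k},\mathbf{w}_{k})\right\|_{2}^{2}\right]<\infty.$$

**Proof:** See Appendix C.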

Theorem 3.

For the distributed SGD algorithm (11) under Assumptions [1-7] we have

[TABLE]

and

[TABLE]

**Proof:** See Appendix D.

Define $\bar{\mathbf{w}}_{k}=\displaystyle\frac{1}{n}\left(\mathbf{1}_{n}\mathbf{1}_{n}^{\top}\otimes I_{d_{w}}\right)\mathbf{w}_{k}$ and $\overline{\nabla F}(\mathbf{w}_{k})=\displaystyle\frac{1}{n}\left(\mathbf{1}_{n}\mathbf{1}_{n}^{\top}\otimes I_{d_{w}}\right)\nabla F(\mathbf{w}_{k})$ .
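The averaging operator $\frac{1}{n}\left(\mathbf{1}_{n}\mathbf{1}_{n}^{\top}\otimes I_{d_{w}}\right)$ replaces every agent's block with the network-wide mean. A minimal sketch, using hypothetical helper names `kron` and `matvec` and a toy stacked vector with $n=3$ and $d_{w}=2$ :

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def matvec(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


n, d_w = 3, 2
# Stacked vector w = [w_1; w_2; w_3] with w_1 = (1, 2), w_2 = (3, 4), w_3 = (5, 9).
w = [1.0, 2.0, 3.0, 4.0, 5.0, 9.0]

J = [[1.0 / n] * n for _ in range(n)]  # (1/n) 1_n 1_n^T
I = [[1.0 if r == c else 0.0 for c in range(d_w)] for r in range(d_w)]

w_bar = matvec(kron(J, I), w)                     # every block equals the mean (3, 5)
w_tilde = [wi - bi for wi, bi in zip(w, w_bar)]   # consensus error w - w_bar
```

The difference `w_tilde` is the average-consensus error $\tilde{\mathbf{w}}_{k}$ whose decay rate is established in Theorem 1.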

Theorem 4.

For the distributed SGD algorithm (11) under Assumptions [1-7] we have

[TABLE]

**Proof:** See Appendix E.

Theorem 4 establishes results about the weighted sum of expected average gradient norm and the key takeaway from this result is that, for the distributed SGD in (11) with appropriate step-sizes, the expected average gradient norms cannot stay bounded away from zero (See Theorem 9 of [7]), i.e.,

[TABLE]

Finally, we present the following result to illustrate that stronger convergence results follow from the continuity assumption on the Hessian, which has not been utilized in our analysis so far.

Assumption 8.

The Hessians $\nabla^{2}f_{i}(\,\cdot\,)$ $:\mathbb{R}^{d_{w}}\mapsto\mathbb{R}^{d_{w}\times d_{w}}$ are Lipschitz continuous with Lipschitz constants $L_{H_{i}}$ , i.e., $\forall\,\bm{w}_{a},\,\bm{w}_{b}\in\mathbb{R}^{d_{w}},\,i=1,\ldots,n$ , we have

[TABLE]

It follows from Assumption 8 that the Hessian $\nabla^{2}F(\cdot)$ is Lipschitz continuous, i.e., $\forall\,\mathbf{w}_{a},\,\mathbf{w}_{b}\in\mathbb{R}^{nd_{w}}$ ,

[TABLE]

with constant $L_{H}=\max\limits_{i}\{L_{H_{i}}\}$ .

Theorem 5.

For the distributed SGD algorithm (11) under Assumptions [1-8] we have

[TABLE]

**Proof:** See Appendix F.

Remark 1.

Similar to the centralized SGD [7], the analysis given here shows the mean-square convergence of the distributed algorithm to a critical point, which includes saddle points. Though SGD has been shown to escape saddle points efficiently [28, 29, 30], an extension of such results to distributed SGD is currently nonexistent and is a topic of future research.

V Application to Distributed Supervised Learning

We apply the proposed algorithm to the distributed training of 10 different neural nets to recognize handwritten digits in images. Specifically, we consider a subset of the MNIST data set (http://yann.lecun.com/exdb/mnist/) containing 5000 images of 10 digits (0-9), of which 2500 are used for training and 2500 are used for testing. Training data are divided among ten agents connected in an undirected unweighted ring topology (see Fig. 1).

Each agent aims to train its own neural network consisting of a single hidden layer of 50 neurons (51 including the bias neuron). Since the images are $20\times 20$ , the input layer consists of 401 neurons (including the one bias neuron) and the output layer consists of 10 neurons, one for each output class, i.e., one for each of the digits 0-9. As shown in Fig. 1, for each agent, the neural net consists of two sets of weights $W^{(1)}\in\mathbb{R}^{50\times 401}$ and $W^{(2)}\in\mathbb{R}^{10\times 51}$ . Here $W^{(1)}$ links the input layer to the hidden layer and $W^{(2)}$ connects the hidden layer to the output layer. We use a logistic sigmoid function for both the hidden unit activation and the output unit activation. Therefore, the input to output mapping for the neural net under consideration takes the form

[TABLE]

where $\mathbf{x}\in\mathbb{R}^{401}$ is a single image (input) and $y_{\kappa}\in[0,\,\,1]$ for ${\kappa}=0,\ldots,9$ , can be interpreted as the conditional probability that the image contains the digit ${\kappa}$ given the input. Finally, the sigmoid function is given as $h(a)=\frac{1}{1+\exp{(-a)}}$ . Let $\mathbf{y}^{*}=\begin{bmatrix}y^{*}_{0},\ldots,y^{*}_{\kappa},\ldots,y^{*}_{9}\end{bmatrix}^{\top}$ denote the true class or label associated with input image $\mathbf{x}$ (in the machine learning community, $\mathbf{y}^{*}$ is known as the target class or label). For example, if the image $\mathbf{x}$ contains the digit $9$ , then $\mathbf{y}^{*}=\begin{bmatrix}\mathbf{0}_{1\times 9}&1\end{bmatrix}^{\top}$ . The conditional distribution of all target classes given inputs can be modeled as (see equation 5.22 of [31])

[TABLE]

Taking the negative logarithm of the corresponding likelihood function yields the following empirical risk function:

[TABLE]

where $y^{*}_{j{\kappa}}$ denotes the ${\kappa}$ -th entry of $\mathbf{y}^{*}_{j}$ and $\mathbf{y}^{*}_{j}$ denotes the target class associated with input image $\mathbf{x}_{j}$ . During training, each agent exchanges the weights $W^{(1)}$ and $W^{(2)}$ with its neighbors as described in the proposed algorithm. Here we conduct the following three experiments: (i) centralized SGD, where a centralized version of the SGD is implemented by a central node having all 2500 training data, (ii) a distributed SGD depicted in Fig. 1 with equally distributed data, where 10 agents distributedly train 10 different neural nets, and (iii) a distributed SGD with class-specific data distributed among the agents. For experiment (ii), each node received 250 training data, randomly sampled from the entire training set, i.e., $m_{i}=250$ for all $i=1,\ldots,10$ . For experiment (iii), data are distributed such that each agent only receives images corresponding to a particular class, i.e., agent $1$ received all the images of $0$ s, agent $2$ received all the images of $1$ s, and so forth. Thus for experiment (iii), we have $m_{1}=257$ , $m_{2}=235$ , $m_{3}=257$ , $m_{4}=244$ , $m_{5}=242$ , $m_{6}=255$ , $m_{7}=244$ , $m_{8}=259$ , $m_{9}=245$ , and $m_{10}=262$ . For all three experiments, we select $\alpha_{k}=\displaystyle\frac{1}{(\varepsilon k+1)}$ , where $\varepsilon=10^{-5}$ . For experiments (ii) and (iii), we select $\beta_{k}=\displaystyle\frac{b}{(\varepsilon k+1)^{1/3}}$ , where $b=0.2525$ . Note that using a scale factor $\varepsilon$ does not affect the theoretical results provided in the previous sections.
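The network described above can be sketched as a plain forward pass with the stated layer sizes. Several details here are assumptions made for illustration and are not confirmed by the text: the bias entry is placed first in the input and hidden vectors, and the per-sample loss is taken to be the independent-Bernoulli cross-entropy that results from the negative logarithm of a product of the form $\prod_{\kappa}y_{\kappa}^{y^{*}_{\kappa}}(1-y_{\kappa})^{1-y^{*}_{\kappa}}$ . The function names are hypothetical.

```python
import math


def sigmoid(a):
    """Numerically stable logistic sigmoid h(a) = 1 / (1 + exp(-a))."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def forward(W1, W2, x):
    """Forward pass: x has 401 entries (bias first, an assumption),
    W1 is 50 x 401, W2 is 10 x 51."""
    hidden = [sigmoid(sum(wv * xv for wv, xv in zip(row, x))) for row in W1]
    hidden_with_bias = [1.0] + hidden  # 51 entries including the bias neuron
    return [sigmoid(sum(wv * hv for wv, hv in zip(row, hidden_with_bias)))
            for row in W2]


def cross_entropy(y, y_star):
    """Assumed per-sample loss: -sum_k [t log y + (1 - t) log(1 - y)].
    Fails if an output saturates to exactly 0 or 1."""
    return -sum(t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
                for p, t in zip(y, y_star))


# Toy check with all-zero weights: every output equals 0.5.
W1 = [[0.0] * 401 for _ in range(50)]
W2 = [[0.0] * 51 for _ in range(10)]
x = [1.0] + [0.5] * 400           # bias entry followed by 400 pixel values
y_star = [0.0] * 9 + [1.0]        # target for the digit 9
y = forward(W1, W2, x)
loss = cross_entropy(y, y_star)   # equals 10 * log(2) for all-zero weights
```

In the distributed setting, each agent would evaluate such a loss on its own samples only, while the weights $W^{(1)}$ and $W^{(2)}$ are what get exchanged with neighbors.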

Given in Fig. 2 are the results obtained from the three experiments. The risks obtained from experiments (i), (ii), and (iii) are given in Figs. 2(a), 2(b), and 2(c), respectively. For all three experiments, the error rate, i.e., % of images misclassified, obtained from running the trained neural net on the testing data of 2500 images are

[TABLE]

Finally, a few misclassification examples are given in Fig. 2(d), where a 7 is misclassified as a 5, 2 as a 4, and so forth. Results given here indicate that regardless of how the data are distributed, the agents are able to train their network and the distributedly trained networks are able to yield similar performance as that of a centrally trained network. More importantly, in experiment (iii), agents were able to recognize all 10 classes even though they only had access to data corresponding to a single class. This result has numerous implications for the machine learning community, specifically for federated multi-task learning under information flow constraints.

VI Conclusion

This paper presented the development of a distributed stochastic gradient descent algorithm for solving non-convex optimization problems. Here we assumed that the local objective functions are Lipschitz continuous and twice continuously differentiable with Lipschitz continuous gradients and Hessians. We provided sufficient conditions on algorithm step-sizes that guarantee asymptotic mean-square convergence of the proposed algorithm to a critical point. We applied the developed algorithm to a distributed supervised-learning problem, in which a set of 10 networked agents collaboratively train their individual neural nets to recognize handwritten digits in images. Results indicate that regardless of how the data are distributed, the agents are able to train their network and the distributedly trained networks are able to yield similar performance as that of a centrally trained network. Numerical results also show that the proposed distributed algorithm allowed individual agents to collaboratively recognize all 10 classes even though they only had access to data corresponding to a single class.

Appendix

VI-A Useful Lemmas

Lemma 4.

Let $\{z_{k}\}$ be a non-negative sequence satisfying

[TABLE]

where $\{r_{1}(k)\}$ and $\{r_{2}(k)\}$ are sequences with

[TABLE]

where $0<a_{1}$ , $0<a_{2}$ , $0\leq\epsilon_{1}<1$ , and $\epsilon_{1}<\epsilon_{2}$ . Then $(k+1)^{\epsilon_{0}}z_{k}\rightarrow 0$ as $k\rightarrow\infty$ for all $0\leq\epsilon_{0}<\epsilon_{2}-\epsilon_{1}$ .

**Proof:** This lemma follows directly from Lemma 4.1 of [32].
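The rate claimed in Lemma 4 can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes the recursion to hold with equality in the form $z_{k+1}=(1-r_{1}(k))z_{k}+r_{2}(k)$ with $r_{1}(k)=a_{1}/(k+1)^{\epsilon_{1}}$ and $r_{2}(k)=a_{2}/(k+1)^{\epsilon_{2}}$, which is the form suggested by how the lemma is applied in the proof of Theorem 1. The function name and all constants are hypothetical choices, not values from the paper.

```python
def scaled_sequence(a1=0.5, a2=1.0, eps1=0.5, eps2=1.5, eps0=0.5,
                    z0=1.0, num_iters=100_000):
    """Iterate z_{k+1} = (1 - r1(k)) z_k + r2(k); return (k+1)^eps0 * z_k."""
    # Conditions assumed on the constants (a1 <= 1 keeps r1(k) in (0, 1]).
    assert 0.0 < a1 <= 1.0 and a2 > 0.0
    assert 0.0 <= eps1 < 1.0 and eps1 < eps2
    assert 0.0 <= eps0 < eps2 - eps1

    z = z0
    scaled = []
    for k in range(num_iters):
        scaled.append((k + 1) ** eps0 * z)   # the quantity the lemma drives to zero
        r1 = a1 / (k + 1) ** eps1            # contraction factor, decays slowly
        r2 = a2 / (k + 1) ** eps2            # perturbation, decays faster
        z = (1.0 - r1) * z + r2
    return scaled


scaled = scaled_sequence()
# The scaled sequence should keep shrinking as k grows.
print(scaled[1_000], scaled[10_000], scaled[-1])
```

With these constants $\epsilon_{2}-\epsilon_{1}=1$, so any $\epsilon_{0}<1$ is admissible; choosing $\epsilon_{0}$ closer to the bound makes the decay of the scaled sequence visibly slower.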

Lemma 5.

Let $\{v_{k}\}$ be a non-negative sequence for which the following relation holds for all $k\geq 0$ :

[TABLE]

where $a_{k}\geq 0$ , $u_{k}\geq 0$ and $w_{k}\geq 0$ with $\sum\limits_{k=0}^{\infty}a_{k}<\infty$ and $\sum\limits_{k=0}^{\infty}w_{k}<\infty$ . Then the sequence $\{v_{k}\}$ will converge to $v\geq 0$ and we further have $\sum\limits_{k=0}^{\infty}u_{k}<\infty$ .

**Proof:** See [33].
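For orientation, the relation in Lemma 5 is presumably of the deterministic Robbins–Siegmund type. The display below is a reconstruction inferred from the stated conditions on $a_{k}$, $u_{k}$, $w_{k}$ and from how the lemma is invoked in the proof of Theorem 2 (with $a_{k}=0$); it is an assumption, not a copy of the paper's equation.

```latex
% Assumed form of the relation in Lemma 5 (reconstruction, not verbatim)
v_{k+1} \le (1 + a_{k})\, v_{k} - u_{k} + w_{k}, \qquad k \ge 0 .
```

Read this way, summability of $a_{k}$ and $w_{k}$ keeps $v_{k}$ from growing without bound, and the accumulated decrease $\sum_{k} u_{k}$ must then be finite.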

Lemma 6.

Let $\gamma_{k}\triangleq\displaystyle\frac{a/b}{(k+1)^{\epsilon}}$ with $0<\epsilon\leq 1$ . Then it holds

[TABLE]

**Proof:** This lemma is a direct consequence of Lemma 10 of [17].

VI-B Proof of Theorem 1

Define the average-consensus error as $\tilde{\mathbf{w}}_{k}=\left(M\otimes I_{d_{w}}\right)\mathbf{w}_{k}$ , where $M=I_{n}-\frac{1}{n}\mathbf{1}_{n}\mathbf{1}_{n}^{\top}$ . Thus from (11) we have

[TABLE]

and $\|\tilde{\mathbf{w}}_{k+1}\|_{2}\leq\|\left(\left(I_{n}-\beta_{k}\mathcal{L}\right)\otimes I_{d_{w}}\right)\tilde{\mathbf{w}}_{k}\|_{2}+\alpha_{k}\|\left(M\otimes I_{d_{w}}\right)\|_{2}\|\mathbf{g}(\mathbf{w}_{k},\bm{\xi}_{k})\|_{2}.$ Since $\mathbf{1}_{nd_{w}}^{\top}\tilde{\mathbf{w}}_{k}=0$ , it follows from Lemma 4.4 of [32] that

[TABLE]

where $\lambda_{2}(\cdot)$ denotes the second smallest eigenvalue. Thus we have

[TABLE]
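The average-consensus error defined at the start of this proof can be illustrated with a few lines of NumPy. This is a minimal sketch under assumptions: the ring graph, the dimension $d_{w}=3$, and the step size `beta` are hypothetical choices for illustration only (the paper's communication graph is not restated here), and the contraction check assumes the bound $\|((I_{n}-\beta\mathcal{L})\otimes I_{d_{w}})\tilde{\mathbf{w}}\|_{2}\leq(1-\beta\lambda_{2}(\mathcal{L}))\|\tilde{\mathbf{w}}\|_{2}$, which holds for this small `beta`.

```python
import numpy as np

n, d_w = 10, 3          # 10 agents as in the experiments; d_w = 3 is illustrative
beta = 0.1              # illustrative consensus step size (assumption)

# Hypothetical communication graph: an undirected ring (not taken from the paper)
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
lap = np.diag(adj.sum(axis=1)) - adj              # graph Laplacian
lam2 = np.linalg.eigvalsh(lap)[1]                 # second smallest eigenvalue

M = np.eye(n) - np.ones((n, n)) / n               # M = I_n - (1/n) 1 1^T
rng = np.random.default_rng(0)
w = rng.standard_normal(n * d_w)                  # stacked agent parameters
w_tilde = np.kron(M, np.eye(d_w)) @ w             # average-consensus error

# Equivalent view: subtract the network average from each agent's block
W = w.reshape(n, d_w)
deviation = (W - W.mean(axis=0)).ravel()

# One consensus step shrinks the error by at least the factor (1 - beta * lam2)
stepped = np.kron(np.eye(n) - beta * lap, np.eye(d_w)) @ w_tilde
lhs = np.linalg.norm(stepped)
rhs = (1.0 - beta * lam2) * np.linalg.norm(w_tilde)

print(np.allclose(w_tilde, deviation), abs(w_tilde.sum()) < 1e-10, lhs <= rhs + 1e-12)
```

The three checks confirm that $\tilde{\mathbf{w}}$ is each agent's deviation from the network average, that its entries sum to zero, and that a consensus step contracts it at a rate governed by $\lambda_{2}(\mathcal{L})$.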

Now we use the following inequality

[TABLE]

for all $x,y\in\mathbb{R}$ and $\theta>0$ . Selecting $\theta=\beta_{k}\lambda_{2}(\mathcal{L})$ yields

[TABLE]
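For readers following the algebra, this step can be sketched as follows. It is a reconstruction under assumptions rather than the paper's displayed equations: it takes the elementary inequality to be the standard weighted bound on $(x+y)^{2}$, uses $\|M\otimes I_{d_{w}}\|_{2}=1$ (since $M$ is an orthogonal projection), and assumes $0<\theta=\beta_{k}\lambda_{2}(\mathcal{L})\leq 1$ together with the contraction $\|((I_{n}-\beta_{k}\mathcal{L})\otimes I_{d_{w}})\tilde{\mathbf{w}}_{k}\|_{2}\leq(1-\beta_{k}\lambda_{2}(\mathcal{L}))\|\tilde{\mathbf{w}}_{k}\|_{2}$.

```latex
% Assumed elementary inequality, valid for all x, y in R and theta > 0
(x+y)^{2} \le (1+\theta)\,x^{2} + \Bigl(1+\tfrac{1}{\theta}\Bigr)\,y^{2} .

% Apply it with x = (1-\theta)\|\tilde{w}_k\|_2, y = \alpha_k \|g(w_k,\xi_k)\|_2,
% and \theta = \beta_k \lambda_2(L); note (1+\theta)(1-\theta)^2 \le 1-\theta^2.
\|\tilde{\mathbf{w}}_{k+1}\|_{2}^{2}
  \le (1+\theta)(1-\theta)^{2}\,\|\tilde{\mathbf{w}}_{k}\|_{2}^{2}
    + \frac{1+\theta}{\theta}\,\alpha_{k}^{2}\,
      \|\mathbf{g}(\mathbf{w}_{k},\bm{\xi}_{k})\|_{2}^{2}
  \le \bigl(1-\beta_{k}^{2}\lambda_{2}(\mathcal{L})^{2}\bigr)\,
      \|\tilde{\mathbf{w}}_{k}\|_{2}^{2}
    + \frac{\alpha_{k}^{2}}{\beta_{k}}\,
      \frac{1+\beta_{k}\lambda_{2}(\mathcal{L})}{\lambda_{2}(\mathcal{L})}\,
      \|\mathbf{g}(\mathbf{w}_{k},\bm{\xi}_{k})\|_{2}^{2} .
```

The two coefficients in the last line match the quantities $r_{1}(k)=\beta_{k}^{2}\lambda_{2}(\mathcal{L})^{2}$ and $r_{2}(k)$ introduced later in this proof, once the expectation is taken and the second moment of the stochastic gradient is bounded by $\mu_{g}$.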

Now taking the expectation yields

[TABLE]

Using Proposition 1, (42) can be written as

[TABLE]

Note $\left(\displaystyle\frac{\left(1+\beta_{k}\lambda_{2}(\mathcal{L})\right)\mu_{g}}{\lambda_{2}(\mathcal{L})}\right)\leq\left(\displaystyle\frac{\left(1+b\lambda_{2}(\mathcal{L})\right)\mu_{g}}{\lambda_{2}(\mathcal{L})}\right)\triangleq\mu_{a},$ for some $\mu_{a}>0$ . Let $r_{1}(k)=\beta^{2}_{k}\lambda_{2}(\mathcal{L})^{2}=\displaystyle\frac{b^{2}\lambda_{2}(\mathcal{L})^{2}}{(k+1)^{2\delta_{1}}}$ and $r_{2}(k)=\frac{\alpha_{k}^{2}}{\beta_{k}}\left(\frac{\left(1+\beta_{k}\lambda_{2}(\mathcal{L})\right)\mu_{g}}{\lambda_{2}(\mathcal{L})}\right)\leq\frac{a^{2}\mu_{a}/b}{(k+1)^{2\delta_{2}-\delta_{1}}}$ . Now (43) can be written in the form of (37) with $\epsilon_{1}=2\delta_{1}$ and $\epsilon_{2}=2\delta_{2}-\delta_{1}$ . Thus it follows from Lemma 4 that

[TABLE]

Thus there exists a constant $0<\mu_{w}<\infty$ such that for all $k\geq 0$

[TABLE]

Now (25) follows from the condition $\delta_{2}>3\delta_{1}$ in Assumption 3.

VI-C Proof of Theorem 2

From (28) we have

[TABLE]

Now based on Assumption 1, for a fixed $\gamma_{k}$ , $\nabla V(\gamma_{k},\,\mathbf{w}\,)$ is Lipschitz continuous in $\mathbf{w}$ . Thus we have

[TABLE]

It follows from Lemma 2 that

[TABLE]

Note that the distributed SGD algorithm in (11) can be rewritten as

[TABLE]

Substituting (47) into (46) and taking the conditional expectation $\mathbb{E}_{\xi}\left[\,\cdot\,\right]$ yields

[TABLE]

Based on Assumption 6, there exists $\mu>0$ such that

[TABLE]

Thus we have

[TABLE]

Let

[TABLE]

Now (49) can be written as

[TABLE]

Based on Assumptions 6 and 7, there exist scalars ${\mu}_{v_{1}}\geq 0$ and ${\mu}_{v_{2}}\geq 0$ such that

[TABLE]

Thus from (51) we have

[TABLE]

Substituting $\nabla V(\gamma_{k},\mathbf{w}_{k})=\nabla F(\mathbf{w}_{k})+\displaystyle\frac{1}{\gamma_{k}}\left(\mathcal{L}\otimes I_{d_{w}}\right)\mathbf{w}_{k}$ and taking the total expectation of (53) yields

[TABLE]

Note that

[TABLE]

Combining (54) and (55) yields

[TABLE]

If we select $\epsilon=\delta_{2}-\delta_{1}$ , it follows directly from Lemma 6 that

[TABLE]

Note that from Lemma 3 we have $\mathbf{w}^{\top}_{k+1}\left(\mathcal{L}\otimes I_{d_{w}}\right)\mathbf{w}_{k+1}=\tilde{\mathbf{w}}^{\top}_{k+1}\left(\mathcal{L}\otimes I_{d_{w}}\right)\tilde{\mathbf{w}}_{k+1}\leq\sigma_{\max}\left(\mathcal{L}\right)\|\tilde{\mathbf{w}}_{k+1}\|_{2}^{2}$ . Thus

[TABLE]

We have established in (44) that for all $k\geq 0$

[TABLE]

Therefore we have

[TABLE]

Let $\mu_{c}=\frac{2b\left(\delta_{2}-\delta_{1}\right)}{a}\sigma_{\max}\left(\mathcal{L}\right)\mu_{w}$ . Now selecting $\delta_{0}=2\delta_{2}-3\delta_{1}-\varepsilon$ , where $0<\varepsilon\ll\delta_{1}$ , yields

[TABLE]

Thus if we select $\delta_{1}$ and $\delta_{2}$ such that $\delta_{2}>2\delta_{1}+\varepsilon$ , then we have

[TABLE]

where $\varepsilon_{1}>0$ and $\delta_{2}-2\delta_{1}-\varepsilon=\varepsilon_{1}$ . Now we can write (56) as

[TABLE]

Since $c_{k}$ is decreasing to zero, for sufficiently large $k$ , we have $c_{k}\mu_{v_{2}}<\mu$ . Therefore $\left(\mu-\frac{1}{2}c_{k}\mu_{v_{2}}\right)>\frac{1}{2}\mu$ for sufficiently large $k$ . Thus we have

[TABLE]

Now (61) can be written in the form of (39) after selecting $a_{k}=0$ ,

[TABLE]

Note that here we have $a_{k}=0$ , $u_{k}\geq 0$ and $w_{k}\geq 0$ with $\sum\limits_{k=0}^{\infty}a_{k}<\infty$ and $\sum\limits_{k=0}^{\infty}w_{k}<\infty$ . Note $c_{k}\alpha_{k}$ is summable because $\alpha_{k}\beta_{k}$ is summable and $\alpha_{k}$ is square-summable. Therefore from Lemma 5 we have $\mathbb{E}\left[V(\gamma_{k},\mathbf{w}_{k})\right]$ is a convergent sequence and $\sum\limits_{k=0}^{\infty}\,\frac{1}{2}\mu\alpha_{k}\mathbb{E}\left[\left\|\nabla V(\gamma_{k},\mathbf{w}_{k})\right\|^{2}_{2}\right]<\infty$ .
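To make the objects in this proof concrete, the following toy simulation runs an update of the form $\mathbf{w}_{k+1}=\mathbf{w}_{k}-\beta_{k}(\mathcal{L}\otimes I_{d_{w}})\mathbf{w}_{k}-\alpha_{k}\mathbf{g}(\mathbf{w}_{k},\bm{\xi}_{k})$ with $\alpha_{k}=a/(k+1)^{\delta_{2}}$ and $\beta_{k}=b/(k+1)^{\delta_{1}}$. This is a sketch under assumptions: the update form is inferred from the inequalities in the proofs, while the ring graph, the scalar non-convex local losses $\log(1+(w-c_{i})^{2})$, the Gaussian gradient noise, and all constants are hypothetical and unrelated to the paper's neural-network experiments.

```python
import numpy as np


def run_distributed_sgd(num_iters=5000, a=0.5, b=0.2, delta1=0.2, delta2=0.9,
                        noise_std=0.1, seed=0):
    """Toy distributed SGD with scalar parameters (d_w = 1) on a ring of 10 agents."""
    rng = np.random.default_rng(seed)
    n = 10

    # Hypothetical communication graph: undirected ring and its Laplacian
    adj = np.zeros((n, n))
    for i in range(n):
        adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
    lap = np.diag(adj.sum(axis=1)) - adj

    centers = np.linspace(-1.0, 1.0, n)   # toy local data: one target per agent
    w = np.linspace(-3.0, 3.0, n)         # initial estimates, far from consensus

    consensus_err = []
    for k in range(num_iters):
        consensus_err.append(np.linalg.norm(w - w.mean()))
        alpha = a / (k + 1) ** delta2     # gradient step size
        beta = b / (k + 1) ** delta1      # consensus step size
        diff = w - centers
        grad = 2.0 * diff / (1.0 + diff ** 2)            # gradient of log(1 + diff^2)
        g = grad + noise_std * rng.standard_normal(n)    # noisy stochastic gradient
        w = w - beta * (lap @ w) - alpha * g
    consensus_err.append(np.linalg.norm(w - w.mean()))
    return w, np.array(consensus_err)


w_final, consensus_err = run_distributed_sgd()
print(consensus_err[0], consensus_err[-1])
```

The exponents are chosen so that $\delta_{2}>3\delta_{1}$, $\alpha_{k}$ is square-summable, and $\alpha_{k}\beta_{k}$ is summable, matching the conditions quoted in the proofs; the consensus error should shrink substantially over the run because the consensus step decays more slowly than the gradient step.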

VI-D Proof of Theorem 3

Note that

[TABLE]

Now, from (52), the tower rule yields

[TABLE]

Now taking the expectation of (64) and substituting (65) yields

[TABLE]

Thus we have

[TABLE]

Now (30) follows from (29) and from noting that $\alpha_{k}$ is square summable. Furthermore, since the terms of every summable sequence converge to zero, we have (31).

VI-E Proof of Theorem 4

Taking the conditional expectation $\mathbb{E}_{\xi}[\cdot]$ of (47) yields

[TABLE]

Thus we have

[TABLE]

Therefore

[TABLE]

Substituting (70) into (29) yields

[TABLE]

Now note that $\bar{\mathbf{w}}_{k+1}-\bar{\mathbf{w}}_{k}=\frac{1}{n}\left(\mathbf{1}_{n}\mathbf{1}_{n}^{\top}\otimes I_{d_{w}}\right)\left(\mathbf{w}_{k+1}-\mathbf{w}_{k}\right)$ . Thus $\mathbb{E}_{\xi}\left[\ \bar{\mathbf{w}}_{k+1}-\bar{\mathbf{w}}_{k}\right]=\frac{1}{n}\left(\mathbf{1}_{n}\mathbf{1}_{n}^{\top}\otimes I_{d_{w}}\right)\mathbb{E}_{\xi}\left[\ \mathbf{w}_{k+1}-\mathbf{w}_{k}\right]$ a.s. and $\left\|\mathbb{E}_{\xi}\left[\ \bar{\mathbf{w}}_{k+1}-\bar{\mathbf{w}}_{k}\right]\right\|_{2}\leq\left\|\mathbb{E}_{\xi}\left[\ \mathbf{w}_{k+1}-\mathbf{w}_{k}\right]\right\|_{2}$ a.s. Therefore it follows from (71) that

[TABLE]

From (47) we have

[TABLE]

Now substituting (73) into (72) yields (32).

VI-F Proof of Theorem 5

Define $G(\mathbf{w}_{k})\triangleq\left\|\overline{\nabla F}(\mathbf{w}_{k})\right\|^{2}_{2}$ . Thus we have

[TABLE]

where $\mathcal{J}\,=\left(\frac{1}{n}\left(\mathbf{1}_{n}\mathbf{1}_{n}^{\top}\otimes I_{d_{w}}\right)\right)$ and $\mathcal{J}^{2}=\mathcal{J}\,$ . Since $F(\cdot)$ is twice continuously differentiable and $\nabla F(\cdot)$ is Lipschitz continuous with constant $L$ , we have $\nabla^{2}F(\mathbf{w})\leq LI_{nd_{w}}$ . Therefore $\forall\,\mathbf{w}_{a},\,\mathbf{w}_{b}\in\mathbb{R}^{nd_{w}}$ ,

[TABLE]

Since $\nabla^{2}F(\mathbf{w}_{a})$ is Lipschitz continuous with constant $L_{H}$ , and $\nabla F(\mathbf{w}_{b})\leq\mu_{F}$ , we have

[TABLE]

where $L_{G}\geq 2L^{2}+2\mu_{F}L_{H}$ . Thus $\nabla G(\mathbf{w})$ is Lipschitz continuous and from Lemma 2 we have

[TABLE]

Now substituting (74) and taking the conditional expectation $\mathbb{E}_{\xi}[\,\cdot\,]$ yields

[TABLE]

Since $\nabla F(\mathbf{w}_{k})^{\top}\mathcal{J}\,=\nabla V(\gamma_{k},\mathbf{w}_{k})^{\top}\mathcal{J}$ , substituting (68) yields

[TABLE]

Now taking the total expectation yields

[TABLE]

From (29) and (30), we know that $\alpha_{k}\mathbb{E}\left[\,\left\|\nabla V(\gamma_{k},\mathbf{w}_{k})\right\|^{2}_{2}\,\right]$ and $\mathbb{E}\left[\,\left\|\mathbf{w}_{k+1}-\mathbf{w}_{k}\right\|_{2}^{2}\,\right]$ are summable. Therefore (75) can be written in the form of (39) and it follows from Lemma 5 that $\mathbb{E}\left[\,G(\mathbf{w}_{k})\,\right]$ converges. Since $\mathbb{E}\left[\,G(\mathbf{w}_{k})\,\right]=\mathbb{E}\left[\,\left\|\overline{\nabla F}(\mathbf{w}_{k})\right\|^{2}_{2}\,\right]$ it follows from Theorem 4 that $\mathbb{E}\left[\,G(\mathbf{w}_{k})\,\right]$ must converge to zero.

Bibliography

[1] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, “Scaling distributed machine learning with the parameter server,” in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014, pp. 583–598.
[2] K. Zhang, S. Alqahtani, and M. Demirbas, “A comparison of distributed machine learning platforms,” in 26th International Conference on Computer Communication and Networks (ICCCN), Jul. 2017, pp. 1–9.
[3] J. Zhang, H. Tu, Y. Ren, J. Wan, L. Zhou, M. Li, and J. Wang, “An adaptive synchronous parallel strategy for distributed machine learning,” IEEE Access, vol. 6, pp. 19222–19230, 2018.
[4] M. Li, D. G. Andersen, A. J. Smola, and K. Yu, “Communication efficient distributed machine learning with the parameter server,” in Advances in Neural Information Processing Systems, 2014, pp. 19–27.
[5] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” in NIPS Workshop on Private Multi-Party Machine Learning, 2016.
[6] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
[7] L. Bottou, F. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.
[8] M. Zhu and S. Martínez, “An approximate dual subgradient algorithm for multi-agent non-convex optimization,” IEEE Transactions on Automatic Control, vol. 58, no. 6, pp. 1534–1539, Jun. 2013.

