TL;DR
This paper introduces a distributed stochastic gradient descent algorithm tailored for non-convex optimization problems, demonstrating its effectiveness in collaborative neural network training for digit recognition across networked agents.
Contribution
It presents a novel distributed stochastic gradient method with convergence guarantees for non-convex problems and applies it successfully to distributed supervised learning tasks.
Findings
Agents achieve similar performance to centralized training
Distributed training enables recognition without local data for all classes
Algorithm converges under specific step-size conditions
Abstract
We develop a distributed stochastic gradient descent algorithm for solving non-convex optimization problems under the assumption that the local objective functions are twice continuously differentiable with Lipschitz continuous gradients and Hessians. We provide sufficient conditions on step-sizes that guarantee the asymptotic mean-square convergence of the proposed algorithm. We apply the developed algorithm to a distributed supervised-learning problem, in which a set of networked agents collaboratively train their individual neural nets to recognize handwritten digits in images. Results indicate that all agents report similar performance that is also comparable to the performance of a centrally trained neural net. Numerical results also show that the proposed distributed algorithm allows the individual agents to recognize the digits even though the training data corresponding to all…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Distributed Stochastic Gradient Method for Non-Convex Problems with Applications in Supervised Learning
J. George, T. Yang, H. Bai and P. Gurram J. George is with U.S. Army Research Laboratory, Adelphi, MD 20783, USA. [email protected]. Yang is with University of North Texas, Denton, TX 76203 USA. [email protected] Bai is with Oklahoma State University, Stillwater, OK 74078, USA. [email protected]. Gurram is with Booz Allen Hamilton & U.S. Army Research Laboratory, Adelphi, MD 20783, USA. [email protected]
Abstract
We develop a distributed stochastic gradient descent algorithm for solving non-convex optimization problems under the assumption that the local objective functions are twice continuously differentiable with Lipschitz continuous gradients and Hessians. We provide sufficient conditions on step-sizes that guarantee the asymptotic mean-square convergence of the proposed algorithm. We apply the developed algorithm to a distributed supervised-learning problem, in which a set of networked agents collaboratively train their individual neural nets to recognize handwritten digits in images. Results indicate that all agents report similar performance that is also comparable to the performance of a centrally trained neural net. Numerical results also show that the proposed distributed algorithm allows the individual agents to recognize the digits even though the training data corresponding to all the digits is not locally available to each agent.
I Introduction
With the advent of smart devices, there has been an exponential growth in the amount of data collected and stored locally on the individual devices. Applying machine learning to extract value from such massive data to provide data-driven insights, decisions, and predictions has been a hot research topic as well as the focus of numerous businesses like Google, Facebook, Alibaba, Yahoo, etc. However, porting these vast amounts of data to a data center to conduct traditional machine learning has raised two main issues: (i) the communication challenge associated with transferring vast amounts of data from a large number of devices to a central location and (ii) the privacy issues associated with sharing raw data. Distributed machine learning techniques based on the server-client architecture [1, 2] have been proposed as solutions to this problem. On one extreme end of this architecture, we have the parameter server approach, where a server or group of servers initiate distributed learning by pushing the current model to a set of client nodes that host the data. Client nodes compute the local gradients or parameter updates and communicate it to the server nodes. Server nodes aggregate these values and update the current model [3, 4]. On the other extreme, we have federated learning, where each client node obtains a local solution to the learning problem and the server node computes a global model by simply averaging the local models [5, 6]. These distributed learning techniques are not truly distributed since they follow a master-slave architecture and do not involve any peer-to-peer communication. Though these techniques are not always robust and they are rendered useless if the server fails, they do provide a good business opportunity for companies that own servers and host web services. However, our aim is to develop a fully distributed machine learning architecture enabled by client-to-client interaction.
For large-scale machine learning, stochastic gradient descent (SGD) methods are often preferred over batch gradient methods [7] because (i) in many large-scale problems, there is a good deal of redundancy in data and therefore it is inefficient to use all the data in every optimization iteration, (ii) the computational cost involved in computing the batch gradient is much higher than that of the stochastic gradient, and (iii) stochastic methods are more suitable for online learning where data are arriving sequentially. Since most machine learning problems are non-convex, there is a need for distributed stochastic gradient methods for non-convex problems. Therefore, here we present a distributed stochastic gradient algorithm for non-convex problems and demonstrate its utility for distributed machine learning.
A few early examples of (non-stochastic or deterministic) distributed non-convex optimization algorithms include the Distributed Approximate Dual Subgradient (DADS) Algorithm [8], NonconvEx primal-dual SpliTTing (NESTT) algorithm [9], and the Proximal Primal-Dual Algorithm (Prox-PDA) [10]. More recently, a non-convex version of the accelerated distributed augmented Lagrangians (ADAL) algorithm is presented in [11] and successive convex approximation (SCA)-based algorithms such as iNner cOnVex Approximation (NOVA) and in-Network succEssive conveX approximaTion algorithm (NEXT) are given in [12] and [13], respectively. References [14, 15, 16] provide several distributed alternating direction method of multipliers (ADMM) based non-convex optimization algorithms. Non-convex versions of Decentralized Gradient Descent (DGD) and Proximal Decentralized Gradient Descent (Prox-DGD) are given in [17]. Finally, Zeroth-Order NonconvEx (ZONE) optimization algorithms for mesh network (ZONE-M) and star network (ZONE-S) are presented in [18].
There exist several works on distributed stochastic gradient methods, but mainly for strongly convex optimization problems. These include the stochastic subgradient-push method for distributed optimization over time-varying directed graphs given in [19], distributed stochastic optimization over random networks given in [20], the Stochastic Unbiased Curvature-aided Gradient (SUCAG) method given in [21], and distributed stochastic gradient tracking methods [22]. There are very few works on distributed stochastic gradient methods for non-convex optimization [23, 24]; however, they make very restrictive assumptions on the critical points of the problem.
Contributions of this paper are three-fold:
We propose a fully distributed machine learning architecture that does not require any server nodes. 2. 2.
We develop a distributed SGD algorithm and provide sufficient conditions on step-sizes such that the algorithm is mean-square convergent. 3. 3.
We demonstrate the utility of the proposed SGD algorithm for distributed machine learning.
I-A Notation
Let denote the set of real matrices. For a vector , is the entry of . An identity matrix is denoted as and denotes an -dimensional vector of all ones. For , the -norm of a vector is denoted as . For matrices and , denotes their Kronecker product.
For a graph of order , represents the agents or nodes and the communication links between the agents are represented as . Let be the adjacency matrix with entries of if and zero otherwise. Define as the in-degree matrix and as the graph Laplacian.
II Distributed Machine Learning
Our problem formulation closely follows the centralized machine learning problem discussed in [7]. Consider a networked set of agents, each with a set of , , independently drawn input-output samples , where and are the -th input and output data, respectively, associated with the -th agent. For example, the input data could be images and the outputs could be labels. Let , denote the prediction function, fully parameterized by the vector . Each agent aims to find the parameter vector that minimizes the losses, , incurred from inaccurate predictions. Thus, the loss function yields the loss incurred by the -th agent, where and are the predicted and true outputs, respectively for the -th node.
Assuming the input output space associated with the -th agent is endowed with a probability measure , the objective function an agent wishes to minimize is
[TABLE]
Here denotes the expected risk given a parameter vector with respect to the probability distribution . The total expected risk across all networked agents is given as
[TABLE]
Minimizing the expected risk is desirable but often unattainable since the distributions are unknown. Thus, in practice each agent chooses to minimize the empirical risk defined as
[TABLE]
Here, the assumption is that is large enough so that . The total empirical risk across all networked agents is
[TABLE]
In order to simplify the notation, let us represent a sample input-output pair by a random seed and let denotes the -th sample associated with the -th agent. Define the loss incurred for a given as . Now, the distributed learning problem can be posed as an optimization involving sum of local empirical risks, i.e.,
[TABLE]
where .
III Distributed SGD
Here we propose a distributed stochastic gradient method to solve (5). Let denote agent ’s estimate of the optimizer at time instant . Thus, for an arbitrary initial condition , the update rule at node is as follows:
[TABLE]
where and are hyper parameters to be specified, are the entries of the adjacency matrix and represents either a simple stochastic gradient, mini-batch stochastic gradient or a stochastic quasi-Newton direction, i.e.,
[TABLE]
where denotes the mini-batch size, is a positive definite scaling matrix, represents the single random input-output pair sampled at time instant , and denotes the -th input-output pair out of the random input-output pairs sampled at time instant .
Define . Now (6) can be written as
[TABLE]
where , is the network Laplacian and
[TABLE]
III-A Assumptions
First, we state the following assumption on the individual objective functions:
Assumption 1**.**
Objective functions and its gradients are Lipschitz continuous with Lipschitz constants and , respectively, i.e., , we have
[TABLE]
Now we introduce , an aggregate objective function of local variables
[TABLE]
Following Assumption 1, the function is Lipschitz continuous with Lipschitz continuous gradient , i.e., , we have
[TABLE]
with constant and .
Lemma 1**.**
Given Assumption 1, we have
[TABLE]
where is a positive constant.
**Proof : **See Lemma 3.3 in [25].
Lemma 2**.**
Given Assumption 1, we have ,
[TABLE]
**Proof : **Proof follows from the mean value theorem.
Assumption 2**.**
The function is lower bounded by , i.e.,
[TABLE]
Without loss of generality, we assume that . Now we make the following assumption regarding and :
Assumption 3**.**
Sequences and are selected as
[TABLE]
where , , , , and .
For sequences and that satisfy Assumption 3, we have , , and . Thus and are not summable sequences. However, is square-summable and is summable.
Assumption 4**.**
The interaction topology of networked agents is given as a connected undirected graph .
Lemma 3**.**
Given Assumption 4, for all we have
[TABLE]
where is the average-consensus error and denotes the smallest non-zero eigenvalue.
**Proof : **This Lemma follows from the Courant-Fischer Theorem [26].
Assumption 5**.**
Parameter in sequence is selected such that
[TABLE]
has a single eigenvalue at corresponding to the right eigenvector and the remaining eigenvalues of are strictly inside the unit circle.
In other words, is selected such that , where denotes the largest singular value. Thus, .
Let denote the expected value taken with respect to the distribution of the random variable given the filtration generated by the sequence , i.e.,
[TABLE]
where a.s. (almost surely) denote events that occur with probability one. Now we make the following assumptions regarding the stochastic gradient term .
Assumption 6**.**
Stochastic gradients are unbiased such that
[TABLE]
That is to say
[TABLE]
Assumption 7**.**
Stochastic gradients have conditionally bounded second moment, i.e., there exist scalars and such that
[TABLE]
Assumption 7 is the bounded variance assumption typically make in SGD literature. Finally, it follows from Assumptions 1, 7 and Lemma 1 that the stochastic gradients are bounded, which is usually just assumed in literature [27, 7, 23, 17].
Proposition 1**.**
There exists a positive constant such that
[TABLE]
**Proof : **Proof follows from taking the expectation of (23) and applying the result from Lemma 1.
IV Convergence Analysis
Our strategy for proving the convergence of the proposed distributed SGD algorithm to a critical point is as follows. First we show that the consensus error among the agents are diminishing at the rate of (see Theorem 1). Asymptotic convergence of the algorithm is then proved in Theorem 3. Theorem 4 then establishes that the weighted expected average gradient norm is a summable sequence. Finally, Theorem 5 proves the asymptotic mean-square convergence of the algorithm to a critical point.
Theorem 1**.**
Consider distributed SGD algorithm (11) under Assumptions [1-7]. Then, there holds:
[TABLE]
**Proof : **See Appendix B.
Let
[TABLE]
Now define a non-negative function as
[TABLE]
Now taking the gradient with respect to yields
[TABLE]
Theorem 2**.**
Consider distributed SGD algorithm (11) under Assumptions [1-7]. Then, for the gradient given in (28), there holds:
[TABLE]
**Proof : **See Appendix C.
Theorem 3**.**
For the distributed SGD algorithm (11) under Assumptions [1-7] we have
[TABLE]
and
[TABLE]
**Proof : **See Appendix D.
Define and .
Theorem 4**.**
For the distributed SGD algorithm (11) under Assumptions [1-7] we have
[TABLE]
**Proof : **See Appendix E.
Theorem 4 establishes results about the weighted sum of expected average gradient norm and the key takeaway from this result is that, for the distributed SGD in (11) with appropriate step-sizes, the expected average gradient norms cannot stay bounded away from zero (See Theorem 9 of [7]), i.e.,
[TABLE]
Finally, we present the following result to illustrate that stronger convergence results follows from the continuity assumption on the Hessian, which has not been utilized in our analysis so far.
Assumption 8**.**
The Hessians are Lipschitz continuous with Lipschitz constants , i.e., , we have
[TABLE]
It follows from Assumption 8 that the Hessian is Lipschitz continuous, i.e., ,
[TABLE]
with constant .
Theorem 5**.**
For the distributed SGD algorithm (11) under Assumptions [1-8] we have
[TABLE]
**Proof : **See Appendix F
Remark 1**.**
Similar to the centralized SGD [7], the analysis given here shows the mean-square convergence of the distributed algorithm to a critical point, which include the saddle points. Though SGD has shown to escape saddle points efficiently [28, 29, 30], extension of such results for distributed SGD is currently nonexistent and is the topic of future research.
V Application to Distributed Supervised Learning
We apply the proposed algorithm for distributedly training 10 different neural nets to recognize handwritten digits in images. Specifically, we consider a subset of the MNIST111http://yann.lecun.com/exdb/mnist/ data set containing 5000 images of 10 digits (0-9), of which 2500 are used for training and 2500 are used for testing. Training data are divided among ten agents connected in an undirected unweighted ring topology (see Fig. 1).
Each agent aims to train its own neural network consisting of a single hidden layer of 50 neurons (51 including the bias neuron). Since the images are , the input layer consists of 401 neurons (including the one bias neuron) and the output later consists of 10 neurons, one for each output class, i.e., one for each digits 0-9. As shown in Fig. 1, for each agent, the neural net consists of two sets of weights and . Here links the input layer to the hidden layer and connects the hidden layer to the output later. We use a logistic sigmoid function for both the hidden unit activation and the output unit activation. Therefore, the input to output mapping for the neural net under consideration takes the form
[TABLE]
where is a single image (input) and for , can be interpreted as the conditional probability that the image contains the digit given the input. Finally, the sigmoid function is given as . Let denote the true class or label associated with input image (in machine learning community, is known as the target class or label). For example, if the image contains the digit , then . The conditional distribution of all target classes given inputs can be modeled as (see equation 5.22 of [31])
[TABLE]
Taking the negative logarithm of the corresponding likelihood function yields the following empirical risk function:
[TABLE]
where denotes the -th entry of and denotes the target class associated with input image . During training, each agent exchanges the weights and with its neighbors as described in the proposed algorithm. Here we conduct the following three experiments: (i) centralized SGD, where a centralized version of the SGD is implemented by a central node having all 2500 training data, (ii) a distributed SGD depicted in Fig. 1 with equally distributed data, where 10 agents distributedly train 10 different neural nets, and (iii) a distributed SGD with class-specific data distributed among the agents. For experiment (ii), each node received 250 training data, randomly sampled from the entire training set, i.e., for all . For experiment (iii), data are distributed such that each agent only receives images corresponding to a particular class, i.e., agent received all the images of [math]s, agent received all the images of s, and so forth. Thus for experiment (iii), we have , , , , , , , , , and . For all three experiments, we select , where . For experiments (ii) and (iii), we select , where . Note that using a scale factor does not affect the theoretical results provided in the previous sections.
Given in Fig. 2 are the results obtained from the three experiments. The risks obtained from experiments (i), (ii), and (iii) are given in Figs. 2(a), 2(b), and 2(c), respectively. For all three experiments, the error rate, i.e., % of images misclassified, obtained from running the trained neural net on the testing data of 2500 images are
[TABLE]
Finally, a few misclassification examples are given in Fig. 2(d), where a 7 is misclassified as a 5, 2 as a 4, and so forth. Results given here indicate that regardless of how the data are distributed, the agents are able to train their network and the distributedly trained networks are able to yield similar performance as that of a centrally trained network. More importantly, in experiment (iii), agents were able to recognize all 10 classes even though they only had access to data corresponding to a single class. This result has numerous implications for the machine learning community, specifically for federated multi-task learning under information flow constraints.
VI Conclusion
This paper presented the development of a distributed stochastic gradient descent algorithm for solving non-convex optimization problems. Here we assumed that the local objective functions are Lipschitz continuous and twice continuously differentiable with Lipschitz continuous gradients and Hessians. We provided sufficient conditions on algorithm step-sizes that guarantee asymptotic mean-square convergence of the proposed algorithm to a critical point. We applied the developed algorithm to a distributed supervised-learning problem, in which a set of 10 networked agents collaboratively train their individual neural nets to recognize handwritten digits in images. Results indicate that regardless of how the data are distributed, the agents are able to train their network and the distributedly trained networks are able to yield similar performance as that of a centrally trained network. Numerical results also show that the proposed distributed algorithm allowed individual agents to collaboratively recognize all 10 classes even though they only had access to data corresponding to a single class.
Appendix
VI-A Useful Lemmas
Lemma 4**.**
Let be a non-negative sequence satisfying
[TABLE]
where and are sequences with
[TABLE]
where , , , and . Then as for all .
**Proof : **This Lemma follows directly from Lemma 4.1 of [32].
Lemma 5**.**
Let be a non-negative sequence for which the following relation hold for all :
[TABLE]
where , and with and . Then the sequence will converge to and we further have .
**Proof : **See [33].
Lemma 6**.**
Let with . Then it holds
[TABLE]
**Proof : **This Lemma is a direct consequence of Lemma 10 of [17].
VI-B Proof of Theorem 1
Define the average-consensus error as , where . Thus from (11) we have
[TABLE]
and Since , it follows from Lemma 4.4 of [32] that
[TABLE]
where denotes the second smallest eigenvalue. Thus we have
[TABLE]
Now we use the following inequality
[TABLE]
for all and . Selecting yields
[TABLE]
Now taking the expectation yields
[TABLE]
Using Proposition 1, (42) can be written as
[TABLE]
Note for some . Let and . Now (43) can be written in the form of (37) with and . Thus it follows from Lemma 4 that
[TABLE]
Thus there exists a constant such that for all
[TABLE]
Now (25) follows from Assumption 3 that .
VI-C Proof of Theorem 2
From (28) we have
[TABLE]
Now based on Assumption 1, for a fixed , is Lipschitz continuous in . Thus we have
[TABLE]
It follows from Lemma 2 that
[TABLE]
Note that the distributed SGD algorithm in (11) can be rewritten as
[TABLE]
Substituting (47) into (46) and taking the conditional expectation yields
[TABLE]
Based on Assumption 6, there exists such that
[TABLE]
Thus we have
[TABLE]
Let
[TABLE]
Now (49) can be written as
[TABLE]
Based on Assumptions 6 and 7, there exists scalars and such that
[TABLE]
Thus from (51) we have
[TABLE]
Substituting and taking the total expectation of (53) yields
[TABLE]
Note that
[TABLE]
Combining (54) and (55) yields
[TABLE]
If we select , it follows directly from Lemma 6 that
[TABLE]
Note that from Lemma 3 we have . Thus
[TABLE]
We have established in (44) that for all
[TABLE]
Therefore we have
[TABLE]
Let . Now selecting , where , yields
[TABLE]
Thus if we select and such that , then we have
[TABLE]
where and . Now we can write (56) as
[TABLE]
Since is decreasing to zero, for sufficiently large , we have . Therefore for sufficiently large . Thus we have
[TABLE]
Now (61) can be written in the form of (39) after selecting ,
[TABLE]
Note that here we have , and with and . Note is summable because is summable and is square-summable. Therefore from Lemma 5 we have is a convergent sequence and .
VI-D Proof of Theorem 3
Note that
[TABLE]
Now form (52), using the tower rule yields
[TABLE]
Now taking the expectation of (64) and substituting (65) yields
[TABLE]
Thus we have
[TABLE]
Now (30) follows from (29) and from noting that is square summable. Furthermore, since every summable sequence is convergent, we have (31).
VI-E Proof of Theorem 4
Taking the conditional expectation of (47) yields
[TABLE]
Thus we have
[TABLE]
Therefore
[TABLE]
Substituting (70) into (29) yields
[TABLE]
Now note that . Thus a.s. and a.s. Therefore it follows from (71) that
[TABLE]
From (47) we have
[TABLE]
Now substituting (73) into (72) yields (32).
VI-F Proof of Theorem 5
Define . Thus we have
[TABLE]
where and . Since is twice continuously differentiable and is Liptschitz continuous with constant , we have . Therefore ,
[TABLE]
Since is Lipschitz continuous with constant , and , we have
[TABLE]
where . Thus is Lipschitz continuous and from Lemma 2 we have
[TABLE]
Now substituting (74) and taking the conditional expectation yields
[TABLE]
Since , substituting (68) yields
[TABLE]
Now taking the total expectation yields
[TABLE]
From (29) and (30), we know that and are summable. Therefore (75) can be written in the form of (39) and it follows from Lemma 5 that converges. Since it follows from Theorem 4 that must converge to zero.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, “Scaling distributed machine learning with the parameter server,” in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI) , 2014, pp. 583 – 598.
- 2[2] K. Zhang, S. Alqahtani, and M. Demirbas, “A comparison of distributed machine learning platforms,” in 26th International Conference on Computer Communication and Networks (ICCCN) , Jul. 2017, pp. 1–9.
- 3[3] J. Zhang, H. Tu, Y. Ren, J. Wan, L. Zhou, M. Li, and J. Wang, “An adaptive synchronous parallel strategy for distributed machine learning,” IEEE Access , vol. 6, pp. 19 222–19 230, 2018.
- 4[4] M. Li, D. G. Andersen, A. J. Smola, and K. Yu, “Communication efficient distributed machine learning with the parameter server,” in Advances in Neural Information Processing Systems , 2014, pp. 19 – 27.
- 5[5] J. Konec̆nú, H. B. Mc Mahan, F. X. Yu, P. Richtarik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” in NIPS Workshop on Private Multi-Party Machine Learning , 2016.
- 6[6] H. B. Mc Mahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) , 2017.
- 7[7] L. Bottou, F. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Review , vol. 60, no. 2, pp. 223–311, 2018.
- 8[8] M. Zhu and S. Martínez, “An approximate dual subgradient algorithm for multi-agent non-convex optimization,” IEEE Transactions on Automatic Control , vol. 58, no. 6, pp. 1534 – 1539, Jun. 2013.
