Compressed Decentralized Proximal Stochastic Gradient Method for Nonconvex Composite Problems with Heterogeneous Data
Yonggui Yan, Jie Chen, Pin-Yu Chen, Xiaodong Cui, Songtao Lu, and Yangyang Xu

TL;DR
This paper introduces a decentralized stochastic gradient method with compression for nonconvex composite problems, effectively handling heterogeneous data and achieving optimal sample complexity for training neural networks.
Contribution
It proposes a novel decentralized proximal stochastic gradient tracking method with compression, improving communication efficiency and handling data heterogeneity in nonconvex optimization.
Findings
Achieves optimal sample complexity for near-stationary points.
Demonstrates better generalization in neural network training.
Handles heterogeneous data effectively with gradient tracking.
| Methods | CMP | NONSMOOTH TERM | GRADIENTS | SMOOTHNESS | (BS, VR, MMT) |
|---|---|---|---|---|---|
| ProxGT-SA | No | Yes | No | is smooth | (, No, No) |
| ProxGT-SR-O | No | Yes | No | mean-squared | (, Yes, No) |
| DEEPSTORM | No | Yes | No | mean-squared | (, Yes, Yes) |
| DProxSGT (this paper) | No | Yes | No | is smooth | (, No, No) |
| ChocoSGD | Yes | No | | is smooth | (, No, No) |
| BEER | Yes | No | No | is smooth | (, No, No) |
| CDProxSGT (this paper) | Yes | Yes | No | is smooth | (, No, No) |
Abstract
We first propose a decentralized proximal stochastic gradient tracking method (DProxSGT) for nonconvex stochastic composite problems, with data heterogeneously distributed on multiple workers in a decentralized connected network. To save communication cost, we then extend DProxSGT to a compressed method by compressing the communicated information. Both methods need only $\mathcal{O}(1)$ samples per worker for each proximal update, which is important to achieve good generalization performance on training deep neural networks. With a smoothness condition on the expected loss function (but not on each sample function), the proposed methods can achieve an optimal sample complexity result to produce a near-stationary point. Numerical experiments on training neural networks demonstrate the significantly better generalization performance of our methods over large-batch training methods and momentum variance-reduction methods, as well as the ability of the gradient tracking scheme to handle heterogeneous data.
Machine Learning, ICML
1 Introduction
In this paper, we consider solving nonconvex stochastic composite problems in a decentralized setting:
[TABLE]
Here, are possibly non-i.i.d data distributions on machines/workers that can be viewed as nodes of a connected graph , and each can only be accessed by the -th worker. We are interested in problems that satisfy the following structural assumption.
**Assumption 1** (Problem structure).
We assume that
- (i) is closed convex and possibly nondifferentiable.
- (ii) Each is $L$-smooth in , i.e., , for any .
- (iii) is lower bounded, i.e., .
Let be the set of nodes of and the set of edges. For each , denote as the neighbors of worker and itself, i.e., . Every worker can only communicate with its neighbors. To solve (1) collaboratively, each worker maintains a copy, denoted as , of the variable . With these notations, (1) can be formulated equivalently to
[TABLE]
Problems with a nonsmooth regularizer, i.e., in the form of (1), appear in many applications such as $\ell_1$-regularized signal recovery (Eldar & Mendelson, 2014; Duchi & Ruan, 2019), online nonnegative matrix factorization (Guan et al., 2012), and training sparse neural networks (Scardapane et al., 2017; Yang et al., 2020). When the data involved in these applications are distributed onto (or collected by workers on) a decentralized network, decentralized algorithms become necessary.
Although decentralized optimization has attracted much research interest in recent years, most existing works focus on strongly convex problems (Scaman et al., 2017; Koloskova et al., 2019b), convex problems (Tsianos et al., 2012; Taheri et al., 2020), or smooth nonconvex problems (Bianchi & Jakubowicz, 2012; Di Lorenzo & Scutari, 2016; Wai et al., 2017; Lian et al., 2017; Zeng & Yin, 2018). Few works have studied nonsmooth nonconvex decentralized stochastic optimization like (2) that we consider; (Chen et al., 2021; Xin et al., 2021a; Mancino-Ball et al., 2022) are among the exceptions. However, they either require many data samples for each update or assume a so-called mean-squared smoothness condition, which is stronger than the smoothness condition in Assumption 1(ii), in order to perform a momentum-based variance-reduction step. Though these methods have convergence (rate) guarantees, they often yield poor generalization performance on training deep neural networks, as demonstrated in (LeCun et al., 2012; Keskar et al., 2016) for large-batch training methods and in our numerical experiments for momentum variance-reduction methods.
On the other hand, many distributed optimization methods (Shamir & Srebro, 2014; Lian et al., 2017; Wang & Joshi, 2018) assume that the data are i.i.d. across the workers. However, this assumption does not hold in many real-world scenarios, for instance, when data privacy requires local data to stay on-premise. Data heterogeneity can significantly degrade the performance of these methods. Though some papers do not assume i.i.d. data, they require certain data similarity, such as bounded stochastic gradients (Koloskova et al., 2019b, a; Taheri et al., 2020) and bounded gradient dissimilarity (Tang et al., 2018a; Assran et al., 2019; Tang et al., 2019a; Vogels et al., 2020).
To address the critical practical issues mentioned above, we propose a decentralized proximal stochastic gradient tracking method that needs only a single or $\mathcal{O}(1)$ data samples (per worker) for each update. With no assumption on data similarity, it can still achieve the optimal convergence rate on solving problems satisfying the conditions in Assumption 1 and yield good generalization performance. In addition, to reduce communication cost, we give a compressed version of the proposed algorithm, by performing compression on the communicated information. The compressed algorithm inherits the benefits of its non-compressed counterpart.
1.1 Our Contributions
Our contributions are three-fold. First, we propose two decentralized algorithms, one without compression (named DProxSGT) and the other with compression (named CDProxSGT), for solving decentralized nonconvex nonsmooth stochastic problems. Different from existing methods, e.g., (Xin et al., 2021a; Wang et al., 2021b; Mancino-Ball et al., 2022), which need a very large batch size and/or perform momentum-based variance reduction to handle the challenge from the nonsmooth term, DProxSGT needs only $\mathcal{O}(1)$ data samples for each update, without performing variance reduction. The use of a small batch and a standard proximal gradient update enables our method to achieve significantly better generalization performance than the existing methods, as we demonstrate on training neural networks. To the best of our knowledge, CDProxSGT is the first decentralized algorithm that applies a compression scheme for solving nonconvex nonsmooth stochastic problems, and it inherits the advantages of the non-compressed method DProxSGT. Even when applied to the special class of smooth nonconvex problems, CDProxSGT can perform significantly better than state-of-the-art methods, in terms of generalization and handling data heterogeneity.
Second, we establish an optimal sample complexity result of DProxSGT, which matches the lower bound result in (Arjevani et al., 2022) in terms of the dependence on a target tolerance $\epsilon$, to produce an $\epsilon$-stationary solution. Due to the coexistence of nonconvexity, nonsmoothness, large stochastic variance (due to the small batch and no use of variance reduction for better generalization), and decentralization, the analysis is highly non-trivial. We employ the tool of the Moreau envelope and construct a decreasing Lyapunov function by carefully controlling the errors introduced by stochasticity and decentralization.
Third, we establish the iteration complexity result of the proposed compressed method CDProxSGT, which is of the same order as that for DProxSGT and thus also optimal in terms of the dependence on a target tolerance. The analysis builds on that of DProxSGT but is more challenging due to the additional compression error and the use of gradient tracking. Nevertheless, we obtain our results by making the same (or even weaker) assumptions as those assumed by state-of-the-art methods (Koloskova et al., 2019a; Zhao et al., 2022).
1.2 Notation
For any vector , we use for the norm. For any matrix , denotes the Frobenius norm and the spectral norm. concatenates all local variables. The superscript $t$ will be used for iteration or communication. denotes a local stochastic gradient of at with a random sample . The column concatenation of is denoted as
[TABLE]
where . Similarly, we denote
[TABLE]
For any , we define
[TABLE]
where is the all-one vector, and is the averaging matrix. Similarly, we define the mean vectors
[TABLE]
We will use for the expectation about the random samples at the th iteration and for the full expectation. denotes the expectation about a stochastic compressor .
2 Related Works
The literature on decentralized optimization has grown rapidly, and an exhaustive review is impossible. Below we review existing works on decentralized algorithms for solving nonconvex problems, with or without using a compression technique. To make the differences between our methods and existing ones easier to see, we compare to a few relevant methods in Table 1.
2.1 Non-compressed Decentralized Methods
For nonconvex decentralized problems with a nonsmooth regularizer, many deterministic decentralized methods have been studied, e.g., (Di Lorenzo & Scutari, 2016; Wai et al., 2017; Zeng & Yin, 2018; Chen et al., 2021; Scutari & Sun, 2019). When only stochastic gradients are available, a majority of existing works focus on smooth cases without a regularizer or a hard constraint, such as (Lian et al., 2017; Assran et al., 2019; Tang et al., 2018b), gradient tracking based methods (Lu et al., 2019; Zhang & You, 2019; Koloskova et al., 2021), and momentum-based variance reduction methods (Xin et al., 2021b; Zhang et al., 2021). Several works such as (Bianchi & Jakubowicz, 2012; Wang et al., 2021b; Xin et al., 2021a; Mancino-Ball et al., 2022) have studied stochastic decentralized methods for problems with a nonsmooth term . However, they either consider some special or require a large batch size. (Bianchi & Jakubowicz, 2012) considers the case where is an indicator function of a compact convex set. Also, it requires bounded stochastic gradients. (Wang et al., 2021b) focuses on problems with a polyhedral , and it requires a large batch size of to produce an (expected) $\epsilon$-stationary point. (Xin et al., 2021a; Mancino-Ball et al., 2022) are the most closely related to our methods. To produce an (expected) $\epsilon$-stationary point, the methods in (Xin et al., 2021a) require a large batch size, either or if variance reduction is applied. The method in (Mancino-Ball et al., 2022) requires only $\mathcal{O}(1)$ samples for each update by taking a momentum-type variance reduction scheme. However, in order to reduce variance, it needs a stronger mean-squared smoothness assumption. In addition, the momentum variance reduction step can often hurt the generalization performance on training complex neural networks, as we will demonstrate in our numerical experiments.
2.2 Compressed Distributed Methods
Communication efficiency is a crucial factor when designing a distributed optimization strategy. The current machine learning paradigm oftentimes resorts to models with a large number of parameters, which indicates a high communication cost when the models or gradients are transferred from workers to the parameter server or among workers. This may incur significant latency in training. Hence, communication-efficient algorithms by model or gradient compression have been actively sought.
Two major groups of compression operators are quantization and sparsification. The quantization approaches include 1-bit SGD (Seide et al., 2014), SignSGD (Bernstein et al., 2018), QSGD (Alistarh et al., 2017), and TernGrad (Wen et al., 2017). The sparsification approaches include Random-$k$ (Stich et al., 2018), Top-$k$ (Aji & Heafield, 2017), Threshold-$v$ (Dutta et al., 2019), and ScaleCom (Chen et al., 2020). Direct compression may slow down convergence, especially when the compression ratio is high. Error compensation (or error feedback) can mitigate this effect by saving the compression error in one communication step and adding it back in the next communication step before another compression (Seide et al., 2014). These compression operators were first designed to compress the gradients in the centralized setting (Tang et al., 2019b; Karimireddy et al., 2019).
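The error-compensation idea described above can be sketched in a few lines of Python. This is a minimal illustration and not code from any of the cited works: the class name `ErrorFeedbackCompressor` and the choice of a top-k sparsifier are assumptions made for the example.

```python
def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec and zero out the rest."""
    keep = sorted(range(len(vec)), key=lambda j: abs(vec[j]), reverse=True)[:k]
    out = [0.0] * len(vec)
    for j in keep:
        out[j] = vec[j]
    return out


class ErrorFeedbackCompressor:
    """Hypothetical helper: compress a vector after adding back the saved error."""

    def __init__(self, dim, k):
        self.k = k
        self.error = [0.0] * dim  # compression error carried to the next step

    def compress(self, grad):
        # Compensate with the error left over from the previous step.
        corrected = [g + e for g, e in zip(grad, self.error)]
        message = top_k(corrected, self.k)
        # Save what was dropped so it can be sent later.
        self.error = [c - m for c, m in zip(corrected, message)]
        return message


compressor = ErrorFeedbackCompressor(dim=3, k=1)
msg1 = compressor.compress([1.0, -3.0, 2.0])
msg2 = compressor.compress([1.0, -3.0, 2.0])
```

In this sketch nothing is lost permanently: the messages sent so far plus the stored error always add up to the sum of the inputs.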
Compression can also be applied in the decentralized setting for smooth problems, i.e., (2) with . (Tang et al., 2019a) applies compression with error compensation to the communication of model parameters in the decentralized setting. Choco-Gossip (Koloskova et al., 2019b) is another communication scheme that mitigates the slowdown caused by compression. It does not compress the model parameters but a residue between the model parameters and their estimates. Choco-SGD uses Choco-Gossip to solve (2). BEER (Zhao et al., 2022) includes gradient tracking and compresses both tracked stochastic gradients and model parameters in each iteration by Choco-Gossip. BEER needs a large batch size of in order to produce an $\epsilon$-stationary solution. DoCoM-SGT (Yau & Wai, 2022) performs updates similar to BEER but with a momentum term for the update of the tracked gradients, and it only needs an $\mathcal{O}(1)$ batch size.
Our proposed CDProxSGT is for solving decentralized problems in the form of (2) with a nonsmooth . To the best of our knowledge, CDProxSGT is the first compressed decentralized method for nonsmooth nonconvex problems without the use of a large batchsize, and it can achieve an optimal sample complexity without the assumption of data similarity or gradient boundedness.
3 Decentralized Algorithms
In this section, we give our decentralized algorithms for solving (2) or equivalently (1). To perform neighbor communications, we introduce a mixing (or gossip) matrix that satisfies the following standard assumption.
**Assumption 2** (Mixing matrix).
We choose a mixing matrix such that
- (i) is doubly stochastic: and ;
- (ii) if and are not neighbors to each other;
- (iii) and .
The condition in (ii) above is enforced so that direct communications can be made only if two nodes (or workers) are immediate (or 1-hop) neighbors of each other. The condition in (iii) can hold if the graph is connected. The assumption is critical to ensure contraction of consensus error.
The value of $\rho$ depends on the graph topology. (Koloskova et al., 2019b) gives three commonly used examples: when uniform weights are used between nodes, and for a fully-connected graph (in which case, our algorithms will reduce to centralized methods), for a 2d torus grid graph where every node has 4 neighbors, and for a ring-structured graph. More examples can be found in (Nedić et al., 2018).
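As a concrete illustration of Assumption 2, the sketch below builds a uniform-weight mixing matrix for a ring of workers and applies one round of neighbor averaging. The helper names `ring_mixing_matrix` and `mix`, and the weight 1/3 shared by each node and its two neighbors, are assumptions made for the example; the specific weights used in the paper are not recoverable from the text here.

```python
def ring_mixing_matrix(n):
    """Uniform-weight mixing matrix for a ring of n >= 3 workers.

    Assumed weights: each node averages itself and its two ring
    neighbors with weight 1/3; all other entries are zero.
    """
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] = 1.0 / 3.0
    return W


def mix(W, values):
    """One round of neighbor communication: worker i gets sum_j W[i][j] * values[j]."""
    n = len(values)
    return [sum(W[i][j] * values[j] for j in range(n)) for i in range(n)]


W = ring_mixing_matrix(5)
x = [1.0, 2.0, 3.0, 4.0, 5.0]  # one scalar per worker
x_mixed = mix(W, x)
```

Because the matrix is doubly stochastic, mixing keeps the network-wide average unchanged while shrinking the disagreement between workers, which is the contraction property the analysis relies on.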
3.1 Non-compressed Method
With the mixing matrix , we propose a decentralized proximal stochastic gradient method with gradient tracking (DProxSGT) for (2). The pseudocode is shown in Algorithm 1. In every iteration $t$, each node first computes a local stochastic gradient by taking a sample from its local data distribution , then performs gradient tracking in (3) and neighbor communications of the tracked gradient in (4), and finally takes a proximal gradient step in (5) and mixes the model parameter with its neighbors in (6).
Note that for simplicity, we take only one random sample in Algorithm 1, but in general, a mini-batch of random samples can be taken, and all theoretical results that we will establish in the next section still hold. We emphasize that we need only $\mathcal{O}(1)$ samples for each update. This is different from ProxGT-SA in (Xin et al., 2021a), which shares a similar update formula with our algorithm but needs a very big batch of samples, as many as , where $\epsilon$ is a target tolerance. Small-batch training can usually generalize better than big-batch training (LeCun et al., 2012; Keskar et al., 2016) on large-scale deep learning models. Throughout the paper, we make the following standard assumption on the stochastic gradients.
**Assumption 3** (Stochastic gradients).
We assume that
- (i) The random samples are independent.
- (ii) There exists a finite number $\sigma$ such that for any and ,
[TABLE]
The gradient tracking step in (3) is critical to handle heterogeneous data (Di Lorenzo & Scutari, 2016; Nedic et al., 2017; Lu et al., 2019; Pu & Nedić, 2020; Sun et al., 2020; Xin et al., 2021a; Song et al., 2021; Mancino-Ball et al., 2022; Zhao et al., 2022; Yau & Wai, 2022; Song et al., 2022). In a deterministic scenario where is used instead of , for each , the tracked gradient can converge to the gradient of the global function at , and thus all local updates move in a direction that minimizes the global objective. When stochastic gradients are used, gradient tracking can play a similar role and make approach the stochastic gradient of the global function. With this nice property of gradient tracking, we can guarantee convergence without the strong assumptions made in existing works, such as bounded gradients (Koloskova et al., 2019b, a; Taheri et al., 2020; Singh et al., 2021) and bounded data similarity over nodes (Lian et al., 2017; Tang et al., 2018a, 2019a; Vogels et al., 2020; Wang et al., 2021a).
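The four steps described in this subsection (gradient tracking, mixing of the tracked gradient, proximal gradient step, mixing of the model) can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the exact update formulas (3)-(6) are not reproduced in the text here, so the sketch uses the standard gradient-tracking form; each worker holds a scalar variable with local loss 0.5 * (x - a_i)^2, the regularizer is mu * |x| (whose proximal map is soft-thresholding), exact local gradients replace stochastic ones so that the run is reproducible, and all function names are hypothetical.

```python
def soft_threshold(v, tau):
    """Proximal map of tau * |x|."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def ring_mixing_matrix(n):
    """Assumed uniform weights 1/3 on a ring (self plus two neighbors)."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] = 1.0 / 3.0
    return W


def mix(W, values):
    n = len(values)
    return [sum(W[i][j] * values[j] for j in range(n)) for i in range(n)]


def dproxsgt_toy(a, mu, eta, W, num_iters):
    """Toy decentralized proximal gradient tracking on heterogeneous scalars a_i."""
    n = len(a)
    x = [0.0] * n        # local model copies
    y = [0.0] * n        # tracked gradients
    g_prev = [0.0] * n   # previous local gradients
    for _ in range(num_iters):
        g = [x[i] - a[i] for i in range(n)]                    # local gradient
        y_half = [y[i] + g[i] - g_prev[i] for i in range(n)]   # gradient tracking
        y = mix(W, y_half)                                     # communicate tracked gradient
        x_half = [soft_threshold(x[i] - eta * y[i], eta * mu)  # proximal gradient step
                  for i in range(n)]
        x = mix(W, x_half)                                     # communicate model
        g_prev = g
    return x


a = [1.0, 2.0, 3.0, 4.0, 5.0]  # each worker sees different data
x_final = dproxsgt_toy(a, mu=0.5, eta=0.1, W=ring_mixing_matrix(5), num_iters=500)
```

Although every worker only sees its own `a_i`, the tracked gradient lets all local copies agree on the minimizer of the global objective, which for this toy problem is the soft-thresholded mean of the `a_i`.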
3.2 Compressed Method
In DProxSGT, each worker needs to communicate both the model parameter and the tracked stochastic gradient with its neighbors at every iteration. Communication has become a bottleneck for distributed training on GPUs. In order to save communication cost, we further propose a compressed version of DProxSGT, named CDProxSGT. The pseudocode is shown in Algorithm 2, where $Q_{\mathbf{x}}$ and $Q_{\mathbf{y}}$ are two compression operators.
In Algorithm 2, each node communicates the non-compressed vectors and with its neighbors in (9) and (12). We write it in this way for ease of reading and analysis. For an efficient and equivalent implementation, we do not communicate and directly but the compressed residues $Q_{\mathbf{y}}\big[\mathbf{y}_{i}^{t-\frac{1}{2}}-\underline{\mathbf{y}}_{i}^{t-1}\big]$ and $Q_{\mathbf{x}}\big[\mathbf{x}_{i}^{t+\frac{1}{2}}-\underline{\mathbf{x}}_{i}^{t}\big]$, explained as follows. Besides , , and , each node also stores and which record and . For the gradient communication, each node initializes , and then at each iteration , after receiving $Q_{\mathbf{y}}\big[\mathbf{y}_{j}^{t-\frac{1}{2}}-\underline{\mathbf{y}}_{j}^{t-1}\big]$ from its neighbors, it updates by (8), and and by
[TABLE]
From the initialization and the updates of and , it always holds that . The model communication can be done efficiently in the same way.
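The bookkeeping described above can be sketched as follows. This is an illustrative reading of the text rather than Algorithm 2 itself: each worker keeps an estimate (`est`) of its own vector and a running weighted sum (`est_mixed`) of its neighbors' estimates, and only compressed residues are exchanged. The zero initialization, the top-k compressor, the ring weights of 1/3, and all names are assumptions made for the example.

```python
def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec and zero out the rest."""
    keep = sorted(range(len(vec)), key=lambda j: abs(vec[j]), reverse=True)[:k]
    out = [0.0] * len(vec)
    for j in keep:
        out[j] = vec[j]
    return out


def ring_mixing_matrix(n):
    """Assumed uniform weights 1/3 on a ring (self plus two neighbors)."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] = 1.0 / 3.0
    return W


def compressed_round(W, targets, est, est_mixed, k):
    """Exchange compressed residues and update both stored quantities in place."""
    n, d = len(targets), len(targets[0])
    # Each worker compresses the residue between its vector and its estimate.
    msgs = [top_k([targets[i][c] - est[i][c] for c in range(d)], k)
            for i in range(n)]
    for i in range(n):
        for c in range(d):
            est[i][c] += msgs[i][c]
            # W[i][j] is zero for non-neighbors, so only neighbor messages count.
            est_mixed[i][c] += sum(W[i][j] * msgs[j][c] for j in range(n))
    return msgs


n, d, k = 5, 4, 2
W = ring_mixing_matrix(n)
targets = [[(i + 1) * (c - 1.5) for c in range(d)] for i in range(n)]
est = [[0.0] * d for _ in range(n)]
est_mixed = [[0.0] * d for _ in range(n)]
for _ in range(2):
    compressed_round(W, targets, est, est_mixed, k)
```

In this sketch the invariant mentioned in the text holds by construction: `est_mixed[i]` always equals the `W`-weighted sum of the neighbors' `est` vectors, so the non-compressed vectors never need to be transmitted.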
The compression operators $Q_{\mathbf{x}}$ and $Q_{\mathbf{y}}$ in Algorithm 2 can be different, but we assume that they both satisfy the following assumption.
**Assumption 4.**
There exists $\alpha$ such that
[TABLE]
for both $Q_{\mathbf{x}}$ and $Q_{\mathbf{y}}$.
The assumption on compression operators is standard and is also made in (Koloskova et al., 2019a, b; Zhao et al., 2022). It is satisfied by sparsification operators, such as Random-$k$ (Stich et al., 2018) and Top-$k$ (Aji & Heafield, 2017). It can also be satisfied by rescaled quantizations. For example, QSGD (Alistarh et al., 2017) compresses by where is uniformly distributed on , is the parameter about compression level. Then with satisfies Assumption 4 with . More examples can be found in (Koloskova et al., 2019b).
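A quick numerical check can make the contraction requirement on compressors concrete. The inequality in Assumption 4 is not reproduced in the text here, so this sketch assumes the form commonly used in the cited literature: the expected squared compression error is at most $\alpha^2$ times the squared norm of the input. For Top-k and Random-k on a length-d vector, that bound is known to hold with $\alpha^2 = 1 - k/d$. All function names are hypothetical.

```python
from itertools import combinations


def sq_norm(vec):
    return sum(v * v for v in vec)


def keep_entries(vec, idx):
    """Zero out every entry of vec whose index is not in idx."""
    return [v if j in idx else 0.0 for j, v in enumerate(vec)]


def top_k(vec, k):
    idx = sorted(range(len(vec)), key=lambda j: abs(vec[j]), reverse=True)[:k]
    return keep_entries(vec, set(idx))


def compression_error(vec, compressed):
    return sq_norm([v - q for v, q in zip(vec, compressed)])


def expected_random_k_error(vec, k):
    """Exact expectation over all equally likely index subsets of size k."""
    subsets = list(combinations(range(len(vec)), k))
    total = sum(compression_error(vec, keep_entries(vec, set(s))) for s in subsets)
    return total / len(subsets)


x = [3.0, -1.0, 0.5, 2.0, -4.0]
d, k = len(x), 2
bound = (1.0 - k / d) * sq_norm(x)          # assumed alpha^2 = 1 - k/d
top_err = compression_error(x, top_k(x, k))
rand_err = expected_random_k_error(x, k)
```

Top-k keeps the largest entries and therefore sits well below the bound on this vector, while Random-k meets it with equality in expectation.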
Below, we make a couple of remarks to discuss the relations between Algorithm 1 and Algorithm 2.
*Remark 1.*
When $Q_{\mathbf{x}}$ and $Q_{\mathbf{y}}$ are both identity operators, i.e., $Q_{\mathbf{x}}[\mathbf{x}]=\mathbf{x}$ and $Q_{\mathbf{y}}[\mathbf{y}]=\mathbf{y}$, in Algorithm 2, CDProxSGT will reduce to DProxSGT. Hence, the latter can be viewed as a special case of the former. However, we will analyze them separately. Although the big-batch training method ProxGT-SA in (Xin et al., 2021a) shares a similar update with the proposed DProxSGT, our analysis is completely different and new, as we need only $\mathcal{O}(1)$ samples in each iteration in order to achieve better generalization performance. The analysis of CDProxSGT will be built on that of DProxSGT by carefully controlling the variance error of stochastic gradients and the consensus error, as well as the additional compression error.
*Remark 2.*
When $Q_{\mathbf{x}}$ and $Q_{\mathbf{y}}$ are identity operators, and for each . Hence, in the compression case, and can be viewed as estimates of and . In addition, in a matrix format, we have from (9) and (12) that
[TABLE]
where When satisfies the conditions (i)-(iii) in Assumption 2, it can be easily shown that and also satisfy all three conditions. Indeed, we have
[TABLE]
Thus we can view and as the results of and after one round of neighbor communication with mixing matrices and , plus the estimation errors and after one round of neighbor communication.
4 Convergence Analysis
In this section, we analyze the convergence of the algorithms proposed in Section 3. Nonconvexity of the problem and stochasticity of the algorithms both complicate the analysis. In addition, the coexistence of the nonsmooth regularizer causes more significant challenges. To address these challenges, we employ the so-called Moreau envelope (Moreau, 1965), which has been commonly used for analyzing methods for nonsmooth weakly-convex problems.
**Definition 1** (Moreau envelope).
Let be an -weakly convex function, i.e., is convex. For , the Moreau envelope of is defined as
[TABLE]
and the unique minimizer is denoted as
[TABLE]
The Moreau envelope has nice properties. The result below can be found in (Davis & Drusvyatskiy, 2019; Nazari et al., 2020; Xu et al., 2022).
**Lemma 2.**
For any function , if it is -weakly convex, then for any , the Moreau envelope is smooth with gradient given by where . Moreover,
[TABLE]
Lemma 2 implies that if is small, then is a near-stationary point of and is close to . Hence, can be used as a valid measure of stationarity violation at for . Based on this observation, we define the $\epsilon$-stationary solution below for the decentralized problem (2).
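A one-dimensional example can make the stationarity measure tangible. The sketch below assumes the standard definition of the Moreau envelope, the minimum over y of phi(y) + (y - x)^2 / (2 * lam), with gradient (x - prox(x)) / lam, and applies it to phi(x) = |x|, whose proximal map is soft-thresholding. The function names are hypothetical, and the example only illustrates the concept rather than the decentralized measure defined next.

```python
def prox_abs(x, lam):
    """Minimizer of |y| + (y - x)^2 / (2 * lam), i.e., soft-thresholding."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_envelope_abs(x, lam):
    """Value of the Moreau envelope of |.| at x."""
    p = prox_abs(x, lam)
    return abs(p) + (x - p) ** 2 / (2.0 * lam)


def moreau_gradient_abs(x, lam):
    """Gradient of the envelope: (x - prox(x)) / lam."""
    return (x - prox_abs(x, lam)) / lam


lam = 0.5
far_value = moreau_envelope_abs(2.0, lam)    # equals |x| - lam / 2 away from zero
near_value = moreau_envelope_abs(0.2, lam)   # equals x^2 / (2 * lam) near zero
far_grad = moreau_gradient_abs(2.0, lam)
near_grad = moreau_gradient_abs(0.2, lam)
at_min_grad = moreau_gradient_abs(0.0, lam)
```

The envelope is a smooth surrogate of the nonsmooth function, and its gradient vanishes exactly at the minimizer x = 0, which is why a small envelope gradient is a meaningful signal of near-stationarity.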
**Definition 3** (Expected $\epsilon$-stationary solution).
Let . A point is called an expected $\epsilon$-stationary solution of (2) if for a constant ,
[TABLE]
In the definition above, before the consensus error term is to balance the two terms. This scaling scheme has also been used in existing works such as (Xin et al., 2021a; Mancino-Ball et al., 2022; Yau & Wai, 2022). From the definition, we see that if is an expected $\epsilon$-stationary solution of (2), then each local solution will be a near-stationary solution of and in addition, these local solutions are all close to each other, namely, they are near consensus.
Below we first state the convergence results of the non-compressed method DProxSGT and then the compressed one CDProxSGT. All the proofs are given in the appendix.
**Theorem 4** (Convergence rate of DProxSGT).
Under Assumptions 1 through 3, let be generated from in Algorithm 1 with . Let $\lambda=\min\big\{\frac{1}{4L},\frac{1}{96\rho L}\big\}$ and $\eta\leq\min\big\{\frac{1}{4L},\frac{(1-\rho^{2})^{4}}{96\rho L}\big\}$. Select from uniformly at random. Then
[TABLE]
where .
By Theorem 4, we obtain a complexity result as follows.
**Corollary 5** (Iteration complexity).
Under the assumptions of Theorem 4, for a given $\epsilon$, take . Then can find an expected $\epsilon$-stationary point of (2) when .
*Remark 3.*
When $\epsilon$ is small enough, will take , and will be dominated by the first term. In this case, DProxSGT can find an expected $\epsilon$-stationary solution of (2) in $O\Big(\frac{\sigma^{2}\left(\phi_{\lambda}(\mathbf{x}^{0})-\phi_{\lambda}^{*}\right)}{\lambda(1-\rho^{2})^{3}\epsilon^{4}}\Big)$ iterations, leading to the same number of stochastic gradient samples and communication rounds. Our sample complexity is optimal in terms of the dependence on $\epsilon$ under the smoothness condition in Assumption 1, as it matches the lower bound in (Arjevani et al., 2022). However, the dependence on may not be optimal because of our possibly loose analysis, as the deterministic method with a single communication per update in (Scutari & Sun, 2019) for nonconvex nonsmooth problems has a dependence on the graph topology.
**Theorem 6** (Convergence rate of CDProxSGT).
Under Assumptions 1 through 4, let be generated from in Algorithm 2 with . Let $\lambda=\min\big\{\frac{1}{4L},\frac{(1-\alpha^{2})^{2}}{9L+41280}\big\}$, and suppose
[TABLE]
Select from uniformly at random. Then
[TABLE]
where .
By Theorem 6, we have the complexity result as follows.
**Corollary 7** (Iteration complexity).
Under the assumptions of Theorem 6, for a given , take
[TABLE]
Then can find an expected -stationary point of (2) when where
[TABLE]
*Remark 4.*
When the given tolerance $\epsilon$ is small enough, will take and will be dominated by the first term. In this case, similar to DProxSGT in Remark 3, CDProxSGT can find an expected $\epsilon$-stationary solution of (2) in $O\Big(\frac{\sigma^{2}\left(\phi_{\lambda}(\mathbf{x}^{0})-\phi_{\lambda}^{*}\right)}{\lambda(1-\widehat{\rho}^{2}_{x})^{2}(1-\widehat{\rho}^{2}_{y})\epsilon^{4}}\Big)$ iterations.
5 Numerical Experiments
In this section, we test the proposed algorithms on training two neural network models, in order to demonstrate their better generalization over momentum variance-reduction methods and large-batch training methods and to demonstrate the success of handling heterogeneous data even when only compressed model parameter and gradient information are communicated among workers. One neural network that we test is LeNet5 (LeCun et al., 1989) on the FashionMNIST dataset (Xiao et al., 2017), and the other is FixupResNet20 (Zhang et al., 2019) on Cifar10 (Krizhevsky et al., 2009).
Our experiments are representative of the practical performance of our methods. Among several closely-related works, (Xin et al., 2021a) includes no experiments, and (Mancino-Ball et al., 2022; Zhao et al., 2022) test only on tabular data and MNIST. (Koloskova et al., 2019a) tests its method on Cifar10 but needs a similar data distribution on all workers for good performance. FashionMNIST has a similar scale to MNIST but poses a more challenging classification task (Xiao et al., 2017). Cifar10 is more complex, and FixupResNet20 has more layers than LeNet5.
All the compared algorithms are implemented in Python with PyTorch and mpi4py (for distributed computing). They run on a Dell workstation with two Quadro RTX 5000 GPUs. We use the 2 GPUs as 5 workers, which communicate over a ring-structured network (so each worker can only communicate with two neighbors). Uniform weight is used, i.e., for each pair of connected workers and . Both FashionMNIST and Cifar10 have 10 classes. We distribute each dataset onto the 5 workers based on the class labels, namely, each worker holds 2 classes of data points, and thus the data are heterogeneous across the workers.
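The label-based split described above can be sketched in plain Python. The text does not say which two classes go to which worker, so assigning consecutive labels (0 and 1 to worker 0, 2 and 3 to worker 1, and so on) is an assumption of this example, as is the function name `partition_by_label`.

```python
def partition_by_label(labels, num_workers, classes_per_worker):
    """Return, for each worker, the indices of the samples it holds.

    Assumed assignment: worker w receives the consecutive class labels
    w * classes_per_worker, ..., (w + 1) * classes_per_worker - 1.
    """
    shards = [[] for _ in range(num_workers)]
    for index, label in enumerate(labels):
        shards[label // classes_per_worker].append(index)
    return shards


# Toy label list standing in for a 10-class dataset.
labels = [i % 10 for i in range(100)]
shards = partition_by_label(labels, num_workers=5, classes_per_worker=2)
classes_on_worker = [sorted({labels[i] for i in shard}) for shard in shards]
```

With this split no two workers share a class, which is the strongly heterogeneous regime the experiments target.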
For all methods, we report their objective values on training data, prediction accuracy on testing data, and consensus errors at each epoch. To save time, the objective values are computed as the average of the losses that are evaluated during the training process (i.e., on the sampled data instead of the whole training data) plus the regularizer per epoch. For the testing accuracy, we first compute the accuracy on the whole testing data for each worker by using its own model parameter and then take the average. The consensus error is simply .
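The two worker-level metrics mentioned above can be sketched as follows. The exact consensus-error formula is not recoverable from the text here, so this example assumes a common choice, the sum of squared distances between each local model and the average model; the function names are hypothetical.

```python
def consensus_error(models):
    """Assumed metric: sum over workers of ||x_i - mean(x)||^2."""
    n, d = len(models), len(models[0])
    mean = [sum(m[c] for m in models) / n for c in range(d)]
    return sum((m[c] - mean[c]) ** 2 for m in models for c in range(d))


def average_accuracy(per_worker_accuracy):
    """Average of the test accuracies that each worker obtains with its own model."""
    return sum(per_worker_accuracy) / len(per_worker_accuracy)


# Two workers with two-parameter models.
err_apart = consensus_error([[1.0, 2.0], [3.0, 4.0]])
err_equal = consensus_error([[1.0, 2.0], [1.0, 2.0]])
mean_acc = average_accuracy([0.8, 0.9, 1.0])
```

The error is zero exactly when all workers hold the same model, so a decreasing value over epochs indicates that the local copies are reaching consensus.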
5.1 Sparse Neural Network Training
In this subsection, we test the non-compressed method DProxSGT and compare it with AllReduce (a centralized method used as a baseline), DEEPSTORM (for which we implement DEEPSTORM v2 in (Mancino-Ball et al., 2022)), and ProxGT-SA (Xin et al., 2021a) on solving (2), where is the loss on the whole training data and serves as a sparse regularizer that encourages a sparse model.
For training LeNet5 on FashionMNIST, we set and run each method to 100 epochs. The learning rate and batchsize are set to and 8 for AllReduce and DProxSGT. DEEPSTORM uses the same and batchsize but with a larger initial batchsize 200, and its momentum parameter is tuned to in order to yield the best performance. ProxGT-SA is a large-batch training method. We set its batchsize to 256 and accordingly apply a larger step size that is the best among .
For training FixupResNet20 on Cifar10, we set and run each method to 500 epochs. The learning rate and batchsize are set to and 64 for AllReduce, DProxSGT, and DEEPSTORM. The initial batchsize is set to 1600 for DEEPSTORM and the momentum parameter set to . ProxGT-SA uses a larger batchsize 512 and a larger stepsize that gives the best performance among .
The results for all methods are plotted in Figure 1. For LeNet5, DProxSGT produces almost the same curves as the centralized training method AllReduce, while on FixupResNet20, DProxSGT even outperforms AllReduce in terms of testing accuracy. This could be because AllReduce aggregates stochastic gradients from all the workers for each update and thus effectively uses a larger batchsize. DEEPSTORM performs as well as our method DProxSGT on training LeNet5. However, it gives lower testing accuracy than DProxSGT and also oscillates significantly more on training the more complex neural network FixupResNet20. This appears to be caused by the momentum variance reduction scheme used in DEEPSTORM. In addition, we see that the large-batch training method ProxGT-SA performs much worse than DProxSGT within the same number of epochs (i.e., data passes), especially on training FixupResNet20.
5.2 Neural Network Training by Compressed Methods
In this subsection, we compare CDProxSGT with two state-of-the-art compressed training methods: Choco-SGD (Koloskova etĀ al., 2019b, a) and BEER (Zhao etĀ al., 2022). As Choco-SGD and BEER are studied only for problems without a regularizer, we set in (2) for the tests. Again, we compare their performance on training LeNet5 and FixupResnet20. The two non-compressed methods AllReduce and DProxSGT are included as baselines. The same compressors are used for CDProxSGT, Choco-SGD, and BEER, when compression is applied.
We run each method to 100 epochs for training LeNet5 on FashionMNIST. The compressors and are set to top- (Aji & Heafield, 2017), i.e., taking the largest elements of an input vector in absolute values and zeroing out all others. We set batchsize to 8 and tune the learning rate to for AllReduce, DProxSGT, CDProxSGT and Choco-SGD, and for CDProxSGT, we set . BEER is a large-batch training method. It uses a larger batchsize 256 and accordingly a larger learning rate , which appears to be the best among .
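The top-k rule described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the implementation used in the experiments; the function name `top_k` and the example vector are our own choices, and ties are broken by position, which is an arbitrary assumption.

```python
def top_k(v, k):
    """Keep the k largest-magnitude entries of v; set all others to zero."""
    k = max(0, min(k, len(v)))
    # Indices sorted by decreasing absolute value (stable, so ties keep order).
    order = sorted(range(len(v)), key=lambda i: -abs(v[i]))
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]


v = [0.5, -3.0, 1.2, 0.1, -0.7]
c = top_k(v, 2)
print(c)  # [0.0, -3.0, 1.2, 0.0, 0.0]
```

Only the k retained values and their positions need to be transmitted, which is where the communication saving comes from. For a vector of length d, the compression error of this rule is at most a (1 - k/d) fraction of the squared norm of the input.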
For training FixupResnet20 on the Cifar10 dataset, we run each method to 500 epochs. We take top- (Aji & Heafield, 2017) as the compressors and and set . For AllReduce, DProxSGT, CDProxSGT and Choco-SGD, we set their batchsize to 64 and tune the learning rate to . For BEER, we use a larger batchsize 512 and a larger learning rate , which is the best among .
The results are shown in Figure 2. For both models, CDProxSGT yields almost the same curves of objective values and testing accuracy as its non-compressed counterpart DProxSGT and the centralized non-compressed method AllReduce. This indicates about 70% saving of communication for the training of LeNet5 and 60% saving for FixupResnet20 without sacrificing the testing accuracy. In comparison, BEER performs significantly worse than the proposed method CDProxSGT within the same number of epochs in terms of all three measures, especially on training the more complex neural network FixupResnet20, which is likely due to the use of a larger batch by BEER. Choco-SGD can produce comparable objective values. However, its testing accuracy is much lower than that produced by our method CDProxSGT. This is likely because Choco-SGD cannot handle the data heterogeneity, while CDProxSGT applies gradient tracking to address it.
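To illustrate why gradient tracking matters under heterogeneous data, the following toy sketch compares plain decentralized gradient descent with a standard gradient-tracking iteration. Everything here is an assumption made for illustration: scalar quadratic losses f_i(x) = 0.5 (x - b_i)^2, exact gradients, no regularizer, no compression, and a four-node ring with hypothetical mixing weights. It is not the DProxSGT or CDProxSGT update itself.

```python
def mix(W, v):
    """One round of neighbor averaging: returns W v for a list-of-lists W."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def ring_matrix(n):
    """Doubly stochastic mixing matrix on a ring (self 1/2, each neighbor 1/4)."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W[i][i] = 0.5
        W[i][(i - 1) % n] += 0.25
        W[i][(i + 1) % n] += 0.25
    return W


def grad(x, b):
    """Local gradients of f_i(x) = 0.5 * (x - b_i)**2, one entry per worker."""
    return [xi - bi for xi, bi in zip(x, b)]


def dgd(W, b, step, iters):
    """Decentralized gradient descent without tracking (constant step size)."""
    x = [0.0] * len(b)
    for _ in range(iters):
        g = grad(x, b)
        x = [m - step * gi for m, gi in zip(mix(W, x), g)]
    return x


def gradient_tracking(W, b, step, iters):
    """Gradient tracking: y_i estimates the network-average gradient."""
    x = [0.0] * len(b)
    g = grad(x, b)
    y = list(g)
    for _ in range(iters):
        x = mix(W, [xi - step * yi for xi, yi in zip(x, y)])
        g_new = grad(x, b)
        y = mix(W, [yi + gn - go for yi, gn, go in zip(y, g_new, g)])
        g = g_new
    return x


b = [0.0, 1.0, 2.0, 9.0]  # heterogeneous local minimizers; the global one is 3
W = ring_matrix(len(b))
x_dgd = dgd(W, b, step=0.1, iters=300)
x_gt = gradient_tracking(W, b, step=0.1, iters=300)
print([round(v, 3) for v in x_dgd])
print([round(v, 3) for v in x_gt])
```

In this toy setting the tracked iterates agree on the global minimizer, whereas the untracked iterates settle at worker-dependent points whose offsets scale with the step size and with how much the local minimizers differ.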
6 Conclusion
We have proposed two decentralized proximal stochastic gradient methods, DProxSGT and CDProxSGT, for nonconvex composite problems with data heterogeneously distributed on the computing nodes of a connected graph. CDProxSGT is an extension of DProxSGT by applying compressions on the communicated model parameter and gradient information. Both methods need only a single or samples for each update, which is important to yield good generalization performance on training deep neural networks. Gradient tracking is used in both methods to address data heterogeneity. An sample complexity and communication complexity are established for both methods to produce an expected -stationary solution. Numerical experiments on training neural networks demonstrate the good generalization performance of the proposed methods and their ability to handle heterogeneous data.
Appendix A Some Key Existing Lemmas
For an -smooth function , it holds for any ,
[TABLE]
From the smoothness of in Assumption 1, it follows that is also -smooth in .
When is -smooth in , we have that is convex. Since is convex, is convex, i.e., is -weakly convex for each . So is . In the following, we give some lemmas about weakly convex functions.
The following result is from Lemma II.1 in (Chen et al., 2021).
Lemma 8.
For any function on , if it is -weakly convex, i.e., is convex, then for any , it holds that
[TABLE]
where for all and .
The first result below is from Lemma II.8 in (Chen et al., 2021), and the nonexpansiveness of the proximal mapping of a closed convex function is well known.
Lemma 9.
For any function on , if it is -weakly convex, i.e., is convex, then the proximal mapping with satisfies
[TABLE]
For a closed convex function , its proximal mapping is nonexpansive, i.e.,
[TABLE]
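As a concrete instance of the nonexpansiveness property, consider the l1 norm scaled by a parameter, whose proximal mapping is entrywise soft-thresholding. The choice of the l1 norm, the threshold 0.5, and the two test points are assumptions made only for this sketch.

```python
def prox_l1(x, lam):
    """Proximal mapping of lam * ||.||_1, i.e. entrywise soft-thresholding."""
    return [max(abs(xi) - lam, 0.0) * (1.0 if xi > 0 else -1.0) for xi in x]


def dist(u, v):
    """Euclidean distance between two vectors given as lists."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


x = [1.5, -0.2, 0.7]
y = [-0.4, 0.3, 2.0]
px, py = prox_l1(x, 0.5), prox_l1(y, 0.5)

# Nonexpansiveness: the images are no farther apart than the inputs.
print(dist(px, py) <= dist(x, y))  # True
```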
Lemma 10.
For in Algorithm 1 and in Algorithm 2, we both have
[TABLE]
Proof.
For DProxSGT in Algorithm 1, taking the average among the workers on (3) to (6) gives
[TABLE]
where follows from Assumption 2. With , we have (16).
Similarly, for CDProxSGT in Algorithm 2, taking the average on (44) to (49) will also give (17) and (16). ∎
In the rest of the analysis, we define the Moreau envelope of for as
[TABLE]
Denote the minimizer as
[TABLE]
In addition, we will use the notation and that are defined by
[TABLE]
where .
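For intuition about the Moreau envelope, the sketch below evaluates it in closed form for the scalar absolute-value function. It assumes the standard definition, namely the minimum over y of phi(y) + (y - x)^2 / (2 lam); the normalization may differ from the one used in (18), and the absolute value is a convex stand-in for the weakly convex objective of the paper. The function names are hypothetical.

```python
def prox_abs(x, lam):
    """Minimizer over y of |y| + (y - x)**2 / (2 * lam) (soft-thresholding)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_abs(x, lam):
    """Moreau envelope of |.| at x, obtained by plugging in the minimizer."""
    y = prox_abs(x, lam)
    return abs(y) + (y - x) ** 2 / (2 * lam)


def moreau_grad_abs(x, lam):
    """Gradient of the envelope, given by (x - prox) / lam."""
    return (x - prox_abs(x, lam)) / lam


lam = 0.5
print(moreau_abs(2.0, lam), moreau_grad_abs(2.0, lam))  # 1.75 1.0
```

The envelope is smooth even though the absolute value is not, and its gradient vanishes exactly where the proximal point coincides with x. This is why the gradient norm of the envelope serves as a stationarity measure for nonsmooth composite problems.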
Appendix B Convergence Analysis for DProxSGT
In this section, we analyze the convergence rate of DProxSGT in Algorithm 1. For better readability, we use the matrix form of Algorithm 1. By the notation introduced in Section 1.2, we can write (3)-(6) in the more compact matrix form:
[TABLE]
Below, we first bound in Lemma 11. Then we give the bounds of the consensus error and and after one step in Lemmas 12, 13, and 14. Finally, we prove Theorem 4 by constructing a Lyapunov function that involves , , and .
Lemma 11.
Let . Then
[TABLE]
Proof.
By the definition of in (18), we have , i.e.,
[TABLE]
Thus we have . Then by (5), the convexity of , and Lemma 9,
[TABLE]
where the second inequality holds by . The second term in the right hand side of (24) can be bounded by
[TABLE]
where the second equality holds by the unbiasedness of stochastic gradients, and the second inequality holds also by the independence between 's. In the last inequality, we use the bound of the variance of stochastic gradients, and the -smooth assumption. Taking the full expectation over the above inequality and summing for all give
[TABLE]
To have the inequality above, we have used
[TABLE]
where the last equality holds by from the definition of .
About the third term in the right hand side of (24), we have
[TABLE]
where $\sum_{i=1}^{n}\big\langle\bar{\widehat{\mathbf{x}}}^{t},\mathbf{y}_{i}^{t}-\bar{\mathbf{y}}^{t}\big\rangle=0$ and is used in the second equality, is used in the first inequality, and and (26) are used in the last inequality.
Now we can bound the summation of (24) by using (25) and (27):
[TABLE]
With , we have and (23) follows from the inequality above.
∎
Lemma 12.
The consensus error of satisfies the following inequality
[TABLE]
Proof.
With the updates (5) and (6), we have
[TABLE]
where we have used in the third equality, in the second inequality, and Lemma 9 in the third inequality, and is used in the last inequality. ∎
Lemma 13.
Let and . The consensus error of satisfies
[TABLE]
Proof.
By the updates (3) and (4), we have
[TABLE]
where we have used , and . For the second term on the right hand side of (30), we have
[TABLE]
For the third term on the right hand side of (30), we have
[TABLE]
where the second equality holds by , (3) and (4), the third equality holds because does not depend on 's, and the second inequality holds because and . Plugging (31) and (32) into (30), we have
[TABLE]
where we have used . For the second term in the right hand side of (33), we have
[TABLE]
where in the first inequality we have used from , and in the second inequality we have used and .
Taking expectation over both sides of (34) and using (23), we have
[TABLE]
Plugging the inequality above into (33) gives
[TABLE]
By and , we have and , and further (29). ∎
Lemma 14.
Let . It holds
[TABLE]
Proof.
By the definition in (18), the update in (6), the -weak convexity of , and the convexity of , we have
[TABLE]
where in the last inequality we use , from Lemma 9, and . For the first term on the right hand side of (36), with , we have
[TABLE]
where we have used and . For the second term on the right hand side of (36), with Lemma 9 and (5), we have
[TABLE]
With (37) and (38), summing up (36) from to gives
[TABLE]
Now taking the expectation on the above inequality and using (23), we have
[TABLE]
Combining like terms in the inequality above gives (35). ∎
With Lemmas 12, 13 and 14, we are ready to prove Theorem 4. We build the following Lyapunov function:
[TABLE]
where will be determined later.
Proof of Theorem 4.
Denote
[TABLE]
Then Lemmas 12, 13 and 14 imply , where
[TABLE]
For any , we have
[TABLE]
Take
[TABLE]
We have Note . Thus
[TABLE]
With and , we have and . Thus
[TABLE]
Hence, summing up (39) for gives
[TABLE]
From , we have
[TABLE]
From Assumption 1, is lower bounded and thus is also lower bounded, i.e., there is a constant satisfying . Thus
[TABLE]
With (41), (42), and the nonnegativity of and , we have
[TABLE]
By the convexity of the Frobenius norm and (43), we obtain from (40) that
[TABLE]
Note from Lemma 2, we finish the proof. ∎
Appendix C Convergence Analysis for CDProxSGT
In this section, we analyze the convergence rate of CDProxSGT. Similar to the analysis of DProxSGT, we establish a Lyapunov function that involves consensus errors and the Moreau envelope. But due to the compression, compression errors and will occur. Hence, we will also include the two compression errors in our Lyapunov function.
Again, we can equivalently write a matrix form of the updates (7)-(12) in Algorithm 2 as follows:
[TABLE]
When we apply the compressor to the column-concatenated matrix in (45) and (48), it means applying the compressor to each column separately, i.e., .
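The column-wise convention can be sketched as follows. We assume here that each column of the matrix holds one worker's local vector and use top-1 sparsification as the compressor; the helper names and the example matrix are hypothetical and chosen only for illustration.

```python
def top_k(v, k):
    """Keep the k largest-magnitude entries of v; set all others to zero."""
    order = sorted(range(len(v)), key=lambda i: -abs(v[i]))
    keep = set(order[:max(0, k)])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]


def compress_columns(X, compressor):
    """Apply `compressor` to each column of the row-major matrix X separately."""
    cols = [compressor(list(col)) for col in zip(*X)]
    # Transpose back so the result has the same row-major layout as X.
    return [list(row) for row in zip(*cols)]


# Two workers (columns), each holding a vector of length three.
X = [[1.0, -0.2],
     [-4.0, 0.9],
     [0.5, 3.0]]
Q = compress_columns(X, lambda col: top_k(col, 1))
print(Q)  # [[0.0, 0.0], [-4.0, 0.0], [0.0, 3.0]]
```

Because each column is compressed on its own, every worker can compress its vector locally before communicating, and no worker needs to see the others' entries.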
Below we first analyze the progress by the half-step updates of and from to in Lemmas 15 and 16. Then we bound the one-step consensus error and compression error for in Lemma 17 and for in Lemma 18. The bound of after one-step update is given in Lemma 19. Finally, we prove Theorem 6 by building a Lyapunov function that involves all the five terms.
Lemma 15.
It holds that
[TABLE]
Proof.
[TABLE]
where the first inequality holds by Assumption 4, can be any positive number, and the last inequality holds by (31) which still holds for CDProxSGT. Taking in (52) gives (50). Letting in (52), we obtain and , and thus (51) follows. ∎
Lemma 16.
Let . Then
[TABLE]
Further, if , then
[TABLE]
Proof.
The proof of (53) is the same as that of Lemma 11 because (10) and (16) are the same as (5) and (16).
For , we have from (11) that
[TABLE]
where can be any positive number. Taking in (57) gives (54). Taking in (57) and plugging (53) give (55).
About , similar to (34), we have from (14) that
[TABLE]
where in the first inequality could be any positive number, in the second inequality we use (54), and in the last inequality we take and thus with , it holds , , . Then plugging (53) into the inequality above, we obtain (56). ∎
Lemma 17.
Let and . Then the consensus error and compression error of can be bounded by
[TABLE]
Proof.
First, let us consider the consensus error of . With the update (14), we have
[TABLE]
where is any positive number, and is used. The first term in the right hand side of (60) can be processed similarly as the non-compressed version in Lemma 12 by replacing by , namely,
[TABLE]
Plugging (61) and (54) into (60) gives
[TABLE]
Let and . Then and
[TABLE]
Thus (58) holds.
Now let us consider the compression error of . By (12), we have
[TABLE]
where we have used in the equality, and in the inequality, and can be any positive number. For the second term in the right hand side of (62), we have
[TABLE]
where we have used , , and Lemma 9. Now plugging (55) and (63) into (62) gives
[TABLE]
With and , (59) holds because , , and
[TABLE]
∎
Lemma 18.
Let , , , . Then the consensus error and compression error of can be bounded by
[TABLE]
Proof.
First, let us consider the consensus of . Similar to (60), we have from the update (13) that
[TABLE]
where can be any positive number. Similarly as (30)-(33) in the proof of Lemma 13, we have the bound for the first term on the right hand side of (68) by replacing with , namely,
[TABLE]
Plug (69) and (50) back to (68), and take . We have
[TABLE]
where the first inequality holds by and , the second inequality holds by and , and the third inequality holds by (56). By and from , we can now obtain (66).
Next, let us consider the compression error of . Similar to (62), we have by (9) that
[TABLE]
where is any positive number. For $\mathbb{E}\big[\|\mathbf{Y}^{t+\frac{1}{2}}_{\perp}\|^{2}\big]$, we have from (7) that
[TABLE]
where we have used (31). Plug (51) and (71) back to (70) to have
[TABLE]
With and , like (64) and (65), we have , and . Thus
[TABLE]
where the second inequality holds by (56). By , we have (67) and complete the proof. ∎
Lemma 19.
Let and . It holds
[TABLE]
Proof.
Similar to (36), we have
[TABLE]
The same as (37) and (38), for the first two terms in the right hand side of (73), we have
[TABLE]
For the last two terms on the right hand side of (73), we have
[TABLE]
where (76) holds by Lemma 9 and , and (77) holds by (54).
Sum up (73) for and take . Then with (74), (75), (76) and (77), we have
[TABLE]
where the second inequality holds by , and the third inequality holds by (53) with . Noticing
[TABLE]
we obtain (72) and complete the proof. ∎
With Lemmas 17, 18 and 19, we are ready to prove Theorem 6. We will use the Lyapunov function:
[TABLE]
where are determined later.
Proof of Theorem 6
Denote
[TABLE]
Then Lemmas 17, 18 and 19 imply with
[TABLE]
Then for any , it holds
[TABLE]
Let and . Take
[TABLE]
We have
[TABLE]
By and , we have ,
[TABLE]
and
[TABLE]
Hence we have
[TABLE]
Thus summing up (78) for gives
[TABLE]
From , , , , , , we have
[TABLE]
Note (42) still holds here. With (80), (81), (42), and the nonnegativity of , , , , we have
[TABLE]
where we have used from Assumption 4.
By the convexity of the Frobenius norm and (82), we obtain from (79) that
[TABLE]
With from Lemma 2, we complete the proof. ∎
Appendix D Additional Details on FixupResNet20
FixupResNet20 (Zhang et al., 2019) is amended from the popular ResNet20 (He et al., 2016) by deleting the BatchNorm layers (Ioffe & Szegedy, 2015). The BatchNorm layers use the mean and variance of some hidden layers based on the data fed into the models. In our experiment, the data on nodes are heterogeneous. If the models include BatchNorm layers, then even if all nodes have the same model parameters after training, their testing performance on the whole data would differ across nodes because the mean and variance of the hidden layers are produced on the heterogeneous data. Thus we use FixupResNet20 instead of ResNet20.
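The effect described above can be seen in a one-feature toy example. The activation values below are hypothetical, and the plain mean and variance stand in for the running statistics that a real BatchNorm layer would accumulate during training; this is a sketch of the mechanism and not a reproduction of the experiments.

```python
from statistics import fmean, pvariance


def batchnorm_eval(x, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Inference-time BatchNorm for one scalar feature with stored statistics."""
    return gamma * (x - mean) / (var + eps) ** 0.5 + beta


# Hypothetical hidden activations seen by two nodes holding different classes.
acts_node_a = [0.1, 0.2, 0.0, 0.3]
acts_node_b = [2.0, 2.4, 1.9, 2.1]

stats_a = (fmean(acts_node_a), pvariance(acts_node_a))
stats_b = (fmean(acts_node_b), pvariance(acts_node_b))

# Same test activation, same gamma and beta, different stored statistics.
out_a = batchnorm_eval(1.0, *stats_a)
out_b = batchnorm_eval(1.0, *stats_b)
print(out_a > 0 > out_b)  # True
```

Although both nodes share the learnable parameters, the same test activation is normalized to values of opposite sign, so the downstream layers see different inputs on different nodes.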
References
- Aji, A. F. and Heafield, K. Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021, 2017.
- Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709–1720, 2017.
- Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds for non-convex stochastic optimization. Mathematical Programming, pp. 1–50, 2022.
- Assran, M., Loizou, N., Ballas, N., and Rabbat, M. Stochastic gradient push for distributed deep learning. In International Conference on Machine Learning, pp. 344–353. PMLR, 2019.
- Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A. signSGD: Compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434, 2018.
- Bianchi, P. and Jakubowicz, J. Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization. IEEE Transactions on Automatic Control, 58(2):391–405, 2012.
- Chen, C.-Y., Ni, J., Lu, S., Cui, X., Chen, P.-Y., Sun, X., Wang, N., Venkataramani, S., Srinivasan, V. V., Zhang, W., et al. ScaleCom: Scalable sparsified gradient compression for communication-efficient distributed training. Advances in Neural Information Processing Systems, 33, 2020.
- Chen, S., Garcia, A., and Shahrampour, S. On distributed nonconvex optimization: Projected subgradient method for weakly convex problems in networks. IEEE Transactions on Automatic Control, 67(2):662–675, 2021.
