Max-Diversity Distributed Learning: Theory and Algorithms
Yong Liu, Jian Li, Weiping Wang

TL;DR
This paper introduces a new distributed learning algorithm called MDD that leverages maximum diversity among local estimates to improve risk bounds, with theoretical backing and empirical validation.
Contribution
The paper provides a novel theoretical insight linking diversity to risk bounds and proposes an effective max-diversity distributed learning algorithm (MDD).
Findings
MDD outperforms existing divide-and-conquer methods.
Larger diversity in local estimates leads to tighter risk bounds.
MDD demonstrates sound theoretical properties and empirical effectiveness.
Abstract
We study the risk performance of distributed learning for regularized empirical risk minimization with a fast convergence rate, substantially improving the error analysis of existing divide-and-conquer based distributed learning. An interesting theoretical finding is that the larger the diversity of each local estimate is, the tighter the risk bound is. This theoretical analysis motivates us to devise an effective max-diversity distributed learning algorithm (MDD). Experimental results show that MDD can outperform the existing divide-and-conquer methods at the cost of slightly more running time. Theoretical analysis and empirical results demonstrate that our proposed MDD is sound and effective.
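Table I: Root mean square error of all methods. The suffix -5 or -10 gives the number of worker nodes.
| Method | madelon | space_ga | cpusmall | phishing | cadata | a8a | a9a | codrna | YearPred |
|---|---|---|---|---|---|---|---|---|---|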
| RR | 0.971 | 2.585 | 45.150 | 0.247 | 1.932 | 0.671 | 0.673 | 0.841 | 12.233 |
| DRR-5 | 0.989 | 2.814 | 53.114 | 0.262 | 2.659 | 0.681 | 0.680 | 0.855 | 14.216 |
| DRR-10 | 1.408 | 2.983 | 55.557 | 0.273 | 2.839 | 0.725 | 0.696 | 0.863 | 15.780 |
| MDD-LS-5 | 0.977 | 2.677 | 46.184 | 0.257 | 2.114 | 0.677 | 0.673 | 0.847 | 12.303 |
| MDD-LS-10 | 1.021 | 2.750 | 47.956 | 0.268 | 2.352 | 0.703 | 0.685 | 0.854 | 14.158 |
| KRR | 0.959 | 1.458 | 43.993 | 0.167 | 1.504 | 0.659 | 0.630 | 0.651 | / |
| KDRR-5 | 1.142 | 2.389 | 44.228 | 0.419 | 1.598 | 0.873 | 0.666 | 0.674 | 5.397 |
| KDRR-10 | 1.374 | 2.531 | 46.233 | 0.422 | 1.824 | 0.906 | 0.893 | 0.707 | 5.631 |
| MDD-RKHS-5 | 0.992 | 2.030 | 44.015 | 0.214 | 1.554 | 0.745 | 0.604 | 0.672 | 5.350 |
| MDD-RKHS-10 | 1.192 | 2.326 | 45.120 | 0.239 | 1.780 | 0.673 | 0.649 | 0.683 | 5.534 |
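Table II: Running time of all methods. The suffix -5 or -10 gives the number of worker nodes.
| Method | madelon | space_ga | cpusmall | phishing | cadata | a8a | a9a | codrna | YearPred |
|---|---|---|---|---|---|---|---|---|---|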
| RR | 2.069 | 0.280 | 1.218 | 1.526 | 0.490 | 2.544 | 2.957 | 1.866 | 10.433 |
| DRR-5 | 0.849 | 0.094 | 0.463 | 0.625 | 0.363 | 0.773 | 0.881 | 0.736 | 3.709 |
| DRR-10 | 0.623 | 0.073 | 0.298 | 0.350 | 0.214 | 0.401 | 0.503 | 0.435 | 2.645 |
| MDD-LS-5 | 0.875 | 0.115 | 0.587 | 0.664 | 0.427 | 0.878 | 1.167 | 0.876 | 4.774 |
| MDD-LS-10 | 0.656 | 0.084 | 0.315 | 0.395 | 0.269 | 0.551 | 0.628 | 0.452 | 3.156 |
| KRR | 3.450 | 1.508 | 9.801 | 12.08 | 76.99 | 15.33 | 16.103 | 137.6 | / |
| KDRR-5 | 1.487 | 0.295 | 3.374 | 1.451 | 5.524 | 6.021 | 5.913 | 40.22 | 86.754 |
| KDRR-10 | 0.983 | 0.183 | 1.863 | 0.689 | 2.302 | 3.670 | 3.544 | 23.64 | 46.197 |
| MDD-RKHS-5 | 1.692 | 0.331 | 5.637 | 1.901 | 7.854 | 8.628 | 7.454 | 53.09 | 103.20 |
| MDD-RKHS-10 | 1.041 | 0.206 | 2.324 | 0.884 | 3.783 | 4.125 | 4.679 | 31.23 | 56.312 |
This work is supported in part by the National Key Research and Development Program of China (No. 2016YFB1000604), the Science and Technology Project of Beijing (No. Z181100002718004), the National Natural Science Foundation of China (No. 6173396, No. 61673293, No. 61602467), and the Excellent Talent Introduction of Institute of Information Engineering of CAS (Y7Z0111107). Y. Liu, J. Li and W.P. Wang are with the Institute of Information Engineering, Chinese Academy of Sciences (e-mail: [email protected]).
Index Terms:
Distributed Learning, Empirical Risk Minimization
I Introduction
In the era of big data, the rapid expansion of computing capacities in automatic data generation and acquisition brings data of unprecedented size and complexity, and raises a series of scientific challenges such as storage bottlenecks and algorithmic scalability [1, 2, 3]. Distributed learning based on a divide-and-conquer approach has triggered a great deal of recent research activity in various areas such as optimization [4], data mining [5], and machine learning [6]. This learning strategy breaks up a big problem into manageable pieces, runs learning algorithms on each piece on individual machines or processors, and then puts the individual solutions together to get a final global output. In this way, distributed learning is a feasible technique to conquer big data challenges.
This paper aims at error analysis of the distributed learning for (regularization) empirical risk minimization. Given
[TABLE]
drawn identically and independently from a fixed, but unknown probability distribution on , the (regularization) empirical risk minimization can be stated as
[TABLE]
where is a loss function, is a regularizer, and is a hypothesis space. This learning algorithm has been well studied in learning theory, see e.g. [7, 8, 9, 10, 11]. The distributed learning algorithm studied in this paper starts with partitioning the data set into disjoint subsets , . Then it assigns each data subset to one machine or processor to produce a local estimator :
[TABLE]
The final global estimator is synthesized by
[TABLE]
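The partition, local-fit, and averaging steps described above can be sketched for one concrete case. The following is a minimal illustration, assuming a linear model with square loss and a squared-norm regularizer (ridge regression) and a plain average of the local estimates; the function names, the random partition, and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: minimize (1/n)||Xw - y||^2 + lam * ||w||^2."""
    n, d = X.shape
    A = X.T @ X / n + lam * np.eye(d)
    b = X.T @ y / n
    return np.linalg.solve(A, b)

def divide_and_conquer_ridge(X, y, m, lam, seed=0):
    """Split the sample into m disjoint subsets, fit each one, average the fits."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(X.shape[0]), m)
    local = np.stack([ridge_fit(X[idx], y[idx], lam) for idx in parts])
    return local.mean(axis=0), local

# Toy data (an assumption for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true + 0.1 * rng.normal(size=600)

w_bar, local = divide_and_conquer_ridge(X, y, m=5, lam=1e-3)
w_global = ridge_fit(X, y, lam=1e-3)   # single-machine estimate for comparison
```

Here `local` holds one estimate per subset and `w_bar` is their average, so the gap between `w_bar` and `w_global` is the quantity that the error analysis of divide-and-conquer learning is concerned with.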
Theoretical foundations of distributed learning form a hot topic in machine learning and have been explored recently in the framework of learning theory [4, 2, 3, 12]. Under local strong convexity, smoothness and a reasonable set of other conditions, [4] showed that the mean-squared error decays as
[TABLE]
where is the optimal hypothesis in the hypothesis space. Under some eigenfunction assumption, the error analysis for distributed regularized least squares in reproducing kernel Hilbert space (RKHS) was established in [2]: if is not too large,
[TABLE]
where , is the eigenvalue of a Mercer kernel function. Without any eigenfunction assumption, an improved bound was derived for some [3]:
[TABLE]
There are two main contributions in this paper. First, under strong convexity, smoothness, and a reasonable set of other conditions, we derive a risk bound of fast rate:
[TABLE]
where
[TABLE]
is the diversity between all partition-based estimates, and . When the minimal risk is small, i.e., , the rate is improved to
[TABLE]
Thus, if , the order of is faster than Note that if is -Lipschitz continuous over , the order of is
[TABLE]
Thus, the order of in [4, 2, 3] is at most $\mathcal{O}\big(\frac{1}{\sqrt{N}}\big)$, which is much slower than that of our bound. Our second contribution is to develop a novel max-diversity distributed learning algorithm. From Equation (2), we know that the larger the diversity is, the tighter the risk bound is. This interesting theoretical finding motivates us to devise a max-diversity distributed learning algorithm (MDD):
[TABLE]
where
[TABLE]
The last term of (3) is to make large. Experimental results on many datasets show that our proposed MDD is sound and efficient.
The notion of diversity is widely used in ensemble learning to improve performance. To the best of our knowledge, however, this is the first time that theoretical results with respect to diversity are given for a distributed setting.
The rest of the paper is organized as follows. In Section II, we derive a risk bound for distributed learning with a fast convergence rate. In Section III, we propose two novel algorithms based on the max-diversity of each local estimate in the linear space and RKHS. In Section IV, we empirically analyze the performance of our proposed algorithms. We conclude in Section V. All proofs are given in Section VI.
II Error Analysis of Distributed Learning
In this section, we will derive a sharper risk bound under some common assumptions.
II-A Assumptions
In the following, we use to denote the norm induced by inner product of the Hilbert space . Let the expected risk and be
[TABLE]
Assumption 1.
The risk is an -strongly convex function, that is ,
[TABLE]
or (another equivalent definition) ,
[TABLE]
Assumption 2.
The empirical risk is a convex function.
Assumption 3.
The loss function is -smooth with respect to the first variable , that is ,
[TABLE]
Assumption 4.
The regularizer is a -smooth function, that is ,
[TABLE]
Assumption 5.
The function is -Lipschitz continuous with respect to the first variable , that is ,
[TABLE]
Assumptions 1, 2, 3, 4 and 5 allow us to model some popular losses, such as square loss and logistic loss, and some regularizer, such as .
Assumption 6.
We assume that the gradient at is upper bounded by , that is
[TABLE]
Assumption 6 is also a common assumption, which is used in [13, 4].
II-B Faster Rate of Distributed Learning
Let be the -net of with minimal cardinality, and the covering number of
Theorem 1.
For any , , under Assumptions 1, 2, 3, 4, 5 and 6, and when
[TABLE]
with probability at least , we have
[TABLE]
where , and .
From the above theorem, an interesting finding is that the larger the diversity of each local estimate is, the tighter the risk bound is. Furthermore, one can also see that when is small enough,
[TABLE]
will become non-dominating. To be specific, we have the following corollary:
Corollary 1.
By setting in Theorem 1, when , with high probability, we have
[TABLE]
If the minimal risk is small, i.e., , the rate can reach
[TABLE]
To the best of our knowledge, this is the first -type of distributed risk bound for (regularization) empirical risk minimization.
Next, we consider two popular hypothesis spaces: the linear space and the reproducing kernel Hilbert space (RKHS).
II-C Linear Space
The linear hypothesis space we considered is defined as
[TABLE]
From [14], the covering number of the linear hypothesis space can be bounded by
[TABLE]
Thus, if we set , from Corollary 1, we have
[TABLE]
When the minimal risk is small, i.e., , the rate is improved to
[TABLE]
Therefore, if , the order of the risk bound can be even faster than
II-D Reproducing Kernel Hilbert Space (RKHS)
The reproducing kernel Hilbert space associated with the kernel is defined to be the closure of the linear span of the set of functions with the inner product satisfying
[TABLE]
The hypothesis space of the reproducing kernel Hilbert space we considered in this paper is
[TABLE]
From [15], if the kernel function is the popular Gaussian kernel over :
[TABLE]
then for ,
[TABLE]
From Corollary 1, if we set , and assume , we have
[TABLE]
Therefore, if , the order is faster than .
II-E Comparison with Related Work
In this subsection, we compare our bound with the related work [4, 2, 3]. Under smoothness, strong convexity, and some other assumptions, a distributed risk bound is given in [4]:
[TABLE]
Under some eigenfunction assumption, the error analysis for distributed regularized least squares was established in [2],
[TABLE]
By removing the eigenfunction assumptions with a novel integral operator method of [2], a new bound was derived [3]:
[TABLE]
Note that, if is -Lipschitz continuous over , that is
[TABLE]
we can obtain that
[TABLE]
Thus, the order of [2, 3, 4] of is at most
According to Subsections II-C and II-D, if is not very large and is small, the order of this paper can be even faster than , which is much faster than that of in the related work [4, 2, 3].
III Max-Diversity Distributed Learning (MDD)
In this section, we propose two novel algorithms for the linear space and RKHS. From Corollary 1, we know that
[TABLE]
Thus, to obtain a tighter bound, the diversity of each local estimate , should be larger.
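To make the notion of diversity between local estimates concrete, the sketch below measures it as a scaled sum of pairwise squared distances. The normalization constant is an assumption made for illustration and may differ from the one in the paper's definition. The code also checks a standard identity: with the scaling 1/(2m^2) over all ordered pairs, the pairwise form equals the average squared distance of the local estimates to their mean, so larger diversity means the local estimates are more spread out around the averaged estimator.

```python
import numpy as np

def pairwise_diversity(local):
    """Sum of squared distances over all ordered pairs (i, j), scaled by 1/(2 m^2).

    The scaling is an assumption chosen so that the identity below holds.
    """
    m = local.shape[0]
    diffs = local[:, None, :] - local[None, :, :]
    return float((diffs ** 2).sum() / (2 * m * m))

def spread_around_mean(local):
    """Average squared distance of each local estimate to the averaged estimate."""
    centered = local - local.mean(axis=0)
    return float((centered ** 2).sum(axis=1).mean())

rng = np.random.default_rng(0)
local = rng.normal(size=(5, 3))   # 5 hypothetical local estimates in R^3

div_pairs = pairwise_diversity(local)
div_mean = spread_around_mean(local)
```

Identical local estimates give zero diversity under either form, which is the case in which a diversity term contributes nothing to the bound.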
III-A Linear Hypothesis Space
When is a linear hypothesis space, we consider the following optimization problem:
[TABLE]
where . Note that, if given , has the following closed-form solution:
[TABLE]
where , , , . Next, we give an iterative algorithm to solve the optimization problem (11). In each iteration, we should compute , which needs if given , which is computationally intensive. Fortunately, from Lemma 4 (see the supplementary material), can be computed by
[TABLE]
where , which only needs .
The max-diversity distributed learning algorithm for the linear space is given in Algorithm 1. Compared with the traditional divide-and-conquer method, our MDD for the linear space only needs to add in each iteration for each worker node.
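The cost argument above, that a per-iteration solve can be replaced by a cheaper update, rests on a general fact that can be sketched independently of the exact update rule. The example below is an illustration under assumptions, not the paper's Algorithm 1: it supposes each worker repeatedly solves a linear system whose matrix `A` stays fixed while only the right-hand side changes between iterations. Inverting `A` once costs O(d^3), after which every iteration is a matrix-vector product costing O(d^2). All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 8, 0.1
X = rng.normal(size=(n, d))

# Fixed across iterations: regularized second-moment matrix of the local data.
A = X.T @ X / n + lam * np.eye(d)
A_inv = np.linalg.inv(A)            # one-time O(d^3) cost

def update(rhs):
    """Solve A w = rhs with the cached inverse: one O(d^2) product per call."""
    return A_inv @ rhs

# Right-hand sides standing in for the quantities that change per iteration.
rhs_list = [rng.normal(size=d) for _ in range(3)]
cached = [update(r) for r in rhs_list]
direct = [np.linalg.solve(A, r) for r in rhs_list]   # O(d^3) each, for reference
```

For ill-conditioned matrices a cached Cholesky factorization is usually preferred over an explicit inverse; the per-iteration cost is of the same order.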
III-B Reproducing Kernel Hilbert Space
When is a reproducing kernel Hilbert space, that is , we consider the following optimization problem:
[TABLE]
where $\mathbf{K}_{\mathcal{S}_i}=\Big[K(\mathbf{x}_{t_j},\mathbf{x}_{t_{j'}})\Big]_{j,j'=1}^{n}$, , $\mathbf{K}_{\mathcal{S}_i,\mathcal{S}_j}=\Big[K(\mathbf{x}_{t_j},\mathbf{x}_{t_k})\Big]_{j,k=1}^{n}$, . Note that can be written as
[TABLE]
where and .
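The two kinds of kernel matrices that appear here, the Gram matrix within one subset and the cross-kernel matrix between two subsets, can be computed as in the following sketch. It assumes a Gaussian kernel in the common form exp(-||x - x'||^2 / (2 sigma^2)); the bandwidth parameterization used in the paper may differ, and the function and variable names are hypothetical.

```python
import numpy as np

def gaussian_kernel(XA, XB, sigma):
    """Return K with K[j, k] = exp(-||XA[j] - XB[k]||^2 / (2 * sigma^2))."""
    sq = (XA ** 2).sum(axis=1)[:, None] + (XB ** 2).sum(axis=1)[None, :] - 2.0 * XA @ XB.T
    sq = np.maximum(sq, 0.0)        # guard against tiny negative round-off
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X_i = rng.normal(size=(6, 3))       # inputs held by worker i (toy data)
X_j = rng.normal(size=(4, 3))       # inputs held by worker j (toy data)

K_ii = gaussian_kernel(X_i, X_i, sigma=1.0)   # Gram matrix on subset i
K_ij = gaussian_kernel(X_i, X_j, sigma=1.0)   # cross-kernel matrix between i and j
```

The cross-kernel matrix is what lets one worker evaluate another worker's kernel expansion on its own inputs, which is needed whenever local estimates in an RKHS are compared with each other.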
Similar to the linear case, we need to compute in each iteration. From Lemma 4 (see the supplementary material), we know that
[TABLE]
The max-diversity distributed learning algorithm for RKHS is given in Algorithm 2. Compared with the traditional divide-and-conquer method, our MDD for RKHS only needs to add in each iteration for each local machine.
Remark 1.
This paper is motivated by ensemble learning, but it should be emphasized that its theoretical proofs and algorithm design are not taken from ensemble learning.
III-C Complexity
Linear space: At the very beginning, we need to compute the , to compute for each worker node. In each iteration, worker nodes cost to compute and the server node costs to compute . So, the sequential computation complexity is , where is the number of iterations. Moreover, the total communication complexity is .
RKHS: At the very beginning, we need to compute the and to compute . In each iteration, worker nodes cost to compute and the server node costs to compute . So, the sequential computation complexity is , where is the number of iterations. Moreover, the total communication complexity is .
Divide-and-conquer approach: The sequential complexities of linear space and RKHS are and , respectively. Meanwhile, the communication complexities are and .
Global approach: The total complexities of linear space and RKHS are and , respectively.
IV Experiments
In this section, we compare our MDD methods with the global method and the divide-and-conquer method in both the linear hypothesis space and RKHS. Specifically, we compare six approaches: global Ridge Regression (RR) [16], divide-and-conquer Ridge Regression (DRR), and our MDD-LS (Algorithm 1) in the linear hypothesis space; and global Kernel Ridge Regression (KRR) [17], divide-and-conquer Kernel Ridge Regression (KDRR) [2], and our MDD-RKHS (Algorithm 2) in the reproducing kernel Hilbert space. We implemented the divide-and-conquer and MDD methods on the distributed machine learning platform PARAMETER SERVER [18] and ran the experiments on this framework.
We experiment on 10 publicly available datasets from LIBSVM data (available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). We run all methods on a computer node with 32 cores (2.40GHz) and 64 GB memory. While global methods use only a single CPU core, distributed methods use all cores to simulate a parallel environment. For RKHS methods, we use the popular Gaussian kernels
[TABLE]
as candidate kernels, and choose the best kernel from by 5-fold cross-validation. The regularization parameter in all methods and in MDD methods are determined by 5-fold cross-validation on the training data. For each data set, we run all methods 30 times with random partitions of the data into non-overlapping 70% training data and 30% testing data. All statements of statistical significance in the remainder refer to a 95% level of significance under the t-test.
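The evaluation protocol described above (30 random splits into 70% training and 30% testing data, scored by root mean square error) can be sketched as follows. This is a hedged illustration: the `fit` and `predict` callables, the toy data, and the use of ordinary least squares as a stand-in model are assumptions, and cross-validation of hyperparameters is omitted.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between targets and predictions."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def repeated_holdout(X, y, fit, predict, repeats=30, train_frac=0.7, seed=0):
    """Score a method on `repeats` random non-overlapping train/test splits."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_train = int(round(train_frac * n))
    scores = []
    for _ in range(repeats):
        perm = rng.permutation(n)
        train, test = perm[:n_train], perm[n_train:]
        model = fit(X[train], y[train])
        scores.append(rmse(y[test], predict(model, X[test])))
    return np.array(scores)

# Toy usage: ordinary least squares stands in for any of the compared methods.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)
fit = lambda X_train, y_train: np.linalg.lstsq(X_train, y_train, rcond=None)[0]
predict = lambda w, X_test: X_test @ w
scores = repeated_holdout(X, y, fit, predict)
```

Keeping the per-split scores rather than only their mean is what makes a significance test between two methods possible afterwards.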
The root mean square error of all methods is reported in Table I. We run the distributed methods with two numbers of worker nodes, 5 and 10. Table I can be summarized as follows:
- 1) Our MDD-LS and MDD-RKHS exhibit better prediction accuracy than DRR and KDRR on almost all data sets. This demonstrates the advantage of MDD methods in generalization performance.
- 2) Our MDD-LS and MDD-RKHS give results comparable with the global methods on most data sets.
- 3) Kernel methods usually achieve better results than linear methods.
- 4) Some data sets are sensitive to data partition, showing a large gap between the results of global methods and distributed methods, such as space_ga and phishing for RKHS, while others are not.
- 5) Increasing the number of worker nodes leads to a higher root mean square error.
The running time is reported in Table II, which can be summarized as follows:
- 1) Global methods take more time than distributed methods on all data sets.
- 2) Kernel methods always take more time than linear methods because of their higher computational complexity.
- 3) Distributed methods give a large speedup on some data sets.
- 4) The running time of distributed methods decreases almost linearly as the number of worker nodes increases.
- 5) Compared with global methods, our MDD methods are more computationally efficient, while being only slightly slower than divide-and-conquer methods.
The above results show that MDD methods need a bit more training time but narrow the performance gap between global methods and traditional distributed methods, which is consistent with our theoretical analysis.
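The trade-off summarized above can be read directly off the reported numbers. The sketch below takes the rows for RR, DRR-5 and MDD-LS-5 from Table I and the rows for DRR-5 and MDD-LS-5 from Table II, then computes, per data set, the fraction of the accuracy gap between DRR-5 and RR that MDD-LS-5 closes, together with the running-time ratio of MDD-LS-5 to DRR-5. The "gap closed" measure is an illustrative summary chosen here, not a quantity defined in the paper.

```python
import numpy as np

datasets = ["madelon", "space_ga", "cpusmall", "phishing", "cadata",
            "a8a", "a9a", "codrna", "YearPred"]

# Root mean square error (Table I).
rmse_rr   = np.array([0.971, 2.585, 45.150, 0.247, 1.932, 0.671, 0.673, 0.841, 12.233])
rmse_drr5 = np.array([0.989, 2.814, 53.114, 0.262, 2.659, 0.681, 0.680, 0.855, 14.216])
rmse_mdd5 = np.array([0.977, 2.677, 46.184, 0.257, 2.114, 0.677, 0.673, 0.847, 12.303])

# Running time (Table II).
time_drr5 = np.array([0.849, 0.094, 0.463, 0.625, 0.363, 0.773, 0.881, 0.736, 3.709])
time_mdd5 = np.array([0.875, 0.115, 0.587, 0.664, 0.427, 0.878, 1.167, 0.876, 4.774])

gap_drr = rmse_drr5 - rmse_rr            # accuracy lost by plain divide-and-conquer
gap_mdd = rmse_mdd5 - rmse_rr            # accuracy lost by MDD-LS
gap_closed = 1.0 - gap_mdd / gap_drr     # share of the DRR gap removed by MDD-LS
time_ratio = time_mdd5 / time_drr5       # extra running time paid by MDD-LS

for name, closed, ratio in zip(datasets, gap_closed, time_ratio):
    print(f"{name:10s} gap closed: {closed:6.1%}   time ratio: {ratio:4.2f}")
```

The same computation applies to the 10-worker rows and to the kernel methods by swapping in the corresponding table rows.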
V Conclusion
In this paper, we studied the generalization performance of distributed learning and derived a generalization error bound that is much sharper than existing bounds for divide-and-conquer based distributed learning. Then, we designed two algorithms with statistical guarantees and fast convergence rates for the linear space and RKHS: MDD-LS and MDD-RKHS. As the theoretical analysis and empirical results show, our MDD is highly competitive with the existing divide-and-conquer methods in terms of both practical performance and computational cost. Based on the max-diversity of each local estimate, our analysis can be used as a solid basis for the design of new distributed learning algorithms.
VI Proof
VI-A The Key Idea
From the -strong convexity of in equation (5), we can obtain that
[TABLE]
Therefore, we have
[TABLE]
Next, we estimate , which is built upon the following inequality from (4):
[TABLE]
By the convexity of and the optimality condition of [19], we have
[TABLE]
Substituting (15) into (14), we have
[TABLE]
VI-B Proof of Theorem 1
To prove Theorem 1, we first give the following two lemmas (their proofs are given in the last part of this section).
Lemma 1.
Under Assumptions 3 and 7, with probability at least , for any , we have
[TABLE]
Lemma 2.
Under Assumption 3, with probability at least , we have
[TABLE]
where .
Proof of Theorem 1.
From the property of -net, we know that there exists a point such that
[TABLE]
According to Assumptions 3 and 4, we know that and are both -smooth. Thus, we have
[TABLE]
Substituting (20) and (19) into (17), with probability at least , we have
[TABLE]
Note that
[TABLE]
Therefore, we can obtain that
[TABLE]
Substituting the above inequality into (21), we can obtain that
[TABLE]
Thus, with , we have
[TABLE]
Combining (13) and (22), with , we have
[TABLE]
∎
VI-C Proof of Lemma 1
Lemma 3 ([10]).
Let be a Hilbert space and let be a random variable with values in . Assume almost surely. Denote . Let be independent draws of . For any , with confidence ,
[TABLE]
Proof.
According to Assumption 3 and 7, we know that is -smooth, so we have
[TABLE]
Because is -smooth and convex, by (2.1.7) of [20], , we have
[TABLE]
Taking expectation over both sides, we have
[TABLE]
where the last inequality follows from the optimality condition of , i.e.,
[TABLE]
Following Lemma 3, with probability at least , we have
[TABLE]
We obtain Lemma 1 by taking the union bound over all . ∎
VI-D Appendix: Proof of Lemma 2
Proof.
Since is -smooth and nonnegative, from Lemma 4 of [21], we have
[TABLE]
and thus
[TABLE]
From the Assumption, we have , . Let and . Then, according to Lemma 3, with probability at least , we have
[TABLE]
∎
VI-E Proof of Lemma 4
Lemma 4.
For all , if is a symmetric matrix and , , then we have
[TABLE]
where .
Proof.
Since is a symmetric matrix, we have
[TABLE]
Therefore, we can obtain that . ∎
References
- [1] Z.-H. Zhou, N. V. Chawla, Y. Jin, and G. J. Williams, “Big data opportunities and challenges: Discussions from data analytics perspectives [discussion forum],” IEEE Computational Intelligence Magazine, vol. 9, no. 4, pp. 62–74, 2014.
- [2] Y. Zhang, J. Duchi, and M. Wainwright, “Divide and conquer kernel ridge regression,” in Proceedings of Conference on Learning Theory (COLT 2013), 2013, pp. 592–617.
- [3] S.-B. Lin, X. Guo, and D.-X. Zhou, “Distributed learning with regularized least squares,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 3202–3232, 2017.
- [4] Y. Zhang, M. J. Wainwright, and J. C. Duchi, “Communication-efficient algorithms for statistical optimization,” in Advances in Neural Information Processing Systems, 2012, pp. 1502–1510.
- [5] X. Wu, X. Zhu, G.-Q. Wu, and W. Ding, “Data mining with big data,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, pp. 97–107, 2014.
- [6] D. Gillick, A. Faria, and J. De Nero, “Mapreduce: Distributed computing for machine learning,” Berkeley, Dec, vol. 18, 2006.
- [7] E. D. Vito, A. Caponnetto, and L. Rosasco, “Model selection for regularized least-squares algorithm in learning theory,” Foundations of Computational Mathematics, vol. 5, no. 1, pp. 59–85, 2005.
- [8] A. Caponnetto and E. D. Vito, “Optimal rates for the regularized least-squares algorithm,” Foundations of Computational Mathematics, vol. 7, no. 3, pp. 331–368, 2007.
