Max-Diversity Distributed Learning: Theory and Algorithms

Yong Liu; Jian Li; Weiping Wang

arXiv:1812.07738·cs.LG·January 21, 2019

TL;DR

This paper introduces a new distributed learning algorithm called MDD that leverages maximum diversity among local estimates to improve risk bounds, with theoretical backing and empirical validation.

Contribution

The paper provides a novel theoretical insight linking diversity to risk bounds and proposes an effective max-diversity distributed learning algorithm (MDD).

Findings

01 MDD outperforms existing divide-and-conquer methods.

02 Larger diversity in local estimates leads to tighter risk bounds.

03 MDD demonstrates sound theoretical properties and empirical effectiveness.

Abstract

We study the risk performance of distributed learning for the regularization empirical risk minimization with fast convergence rate, substantially improving the error analysis of the existing divide-and-conquer based distributed learning. An interesting theoretical finding is that the larger the diversity of each local estimate is, the tighter the risk bound is. This theoretical analysis motivates us to devise an effective max-diversity distributed learning algorithm (MDD). Experimental results show that MDD can outperform the existing divide-and-conquer methods but with a bit more time. Theoretical analysis and empirical results demonstrate that our proposed MDD is sound and effective.

Tables

Table I: Comparison of average root mean square error of our MDD-LS and MDD-RKHS with RR, DRR, KRR, and KDRR. We bold the numbers of the best method and underline the numbers of the other methods which are not significantly worse than the best one.

Method	madelon	space_ga	cpusmall	phishing	cadata	a8a	a9a	codrna	YearPred
RR	0.971	2.585	45.150	0.247	1.932	0.671	0.673	0.841	12.233
DRR-5	0.989	2.814	53.114	0.262	2.659	0.681	0.680	0.855	14.216
DRR-10	1.408	2.983	55.557	0.273	2.839	0.725	0.696	0.863	15.780
MDD-LS-5	0.977	2.677	46.184	0.257	2.114	0.677	0.673	0.847	12.303
MDD-LS-10	1.021	2.750	47.956	0.268	2.352	0.703	0.685	0.854	14.158
KRR	0.959	1.458	43.993	0.167	1.504	0.659	0.630	0.651	/
KDRR-5	1.142	2.389	44.228	0.419	1.598	0.873	0.666	0.674	5.397
KDRR-10	1.374	2.531	46.233	0.422	1.824	0.906	0.893	0.707	5.631
MDD-RKHS-5	0.992	2.030	44.015	0.214	1.554	0.745	0.604	0.672	5.350
MDD-RKHS-10	1.192	2.326	45.120	0.239	1.780	0.673	0.649	0.683	5.534

Table II: Comparison of run time (in seconds) of our proposed MDD-LS and MDD-RKHS with other methods.

Method	madelon	space_ga	cpusmall	phishing	cadata	a8a	a9a	codrna	YearPred
RR	2.069	0.280	1.218	1.526	0.490	2.544	2.957	1.866	10.433
DRR-5	0.849	0.094	0.463	0.625	0.363	0.773	0.881	0.736	3.709
DRR-10	0.623	0.073	0.298	0.350	0.214	0.401	0.503	0.435	2.645
MDD-LS-5	0.875	0.115	0.587	0.664	0.427	0.878	1.167	0.876	4.774
MDD-LS-10	0.656	0.084	0.315	0.395	0.269	0.551	0.628	0.452	3.156
KRR	3.450	1.508	9.801	12.08	76.99	15.33	16.103	137.6	/
KDRR-5	1.487	0.295	3.374	1.451	5.524	6.021	5.913	40.22	86.754
KDRR-10	0.983	0.183	1.863	0.689	2.302	3.670	3.544	23.64	46.197
MDD-RKHS-5	1.692	0.331	5.637	1.901	7.854	8.628	7.454	53.09	103.20
MDD-RKHS-10	1.041	0.206	2.324	0.884	3.783	4.125	4.679	31.23	56.312

Equations

$$\mathcal{S}=\{z_{i}=(\mathbf{x}_{i},y_{i})\}_{i=1}^{N}\in(\mathcal{Z}=\mathcal{X}\times\mathcal{Y})^{N},$$

$$\hat{f}=\operatorname*{arg\,min}_{f\in\mathcal{H}}\hat{R}(f):=\frac{1}{N}\sum_{j=1}^{N}\ell(f,z_{j})+r(f)$$

$$\hat{f}_{i}=\operatorname*{arg\,min}_{f\in\mathcal{H}}\hat{R}_{i}(f):=\frac{1}{|\mathcal{S}_{i}|}\sum_{z_{j}\in\mathcal{S}_{i}}\ell(f,z_{j})+r(f).$$

$$\bar{f}=\frac{1}{m}\sum_{i=1}^{m}\hat{f}_{i}.$$

$$\mathbb{E}\left[\|\bar{f}-f^{\ast}\|^{2}\right]=\mathcal{O}\left(\frac{1}{N}+\frac{1}{n^{2}}\right),$$

$$\mathbb{E}\left[\|\bar{f}-f^{\ast}\|^{2}\right]=\mathcal{O}\left(\|f_{\ast}\|_{\mathcal{H}}^{2}+\frac{\gamma(\lambda)}{N}\right),$$

$$\mathbb{E}\left[\|\bar{f}-f^{\ast}\|\right]=\mathcal{O}\left(\left(\frac{\gamma(\lambda)}{N}\right)^{\frac{1}{2}(1-\frac{1}{p})}\left(\frac{1}{N}\right)^{\frac{1}{2p}}\right).$$

$$R(\bar{f})-R(f_{\ast})=\mathcal{O}\left(\frac{H_{\ast}}{n}+\frac{1}{n^{2}}-\Delta_{\bar{f}}\right),$$

$$\Delta_{\bar{f}}=\mathcal{O}\left(\frac{1}{m^{2}}\sum_{i,j=1,i\not=j}^{m}\|\hat{f}_{i}-\hat{f}_{j}\|^{2}\right)$$

$$R(\bar{f})-R(f_{\ast})=\mathcal{O}\left(\frac{1}{n^{2}}-\Delta_{\bar{f}}\right).$$

$$R(\bar{f})-R(f^{\ast})=\mathcal{O}\left(L\sqrt{\mathbb{E}\left[\|\bar{f}-f^{\ast}\|^{2}\right]}\right).$$

$$\hat{f}_{i}=\operatorname*{arg\,min}_{f\in\mathcal{H}}\frac{1}{|\mathcal{S}_{i}|}\sum_{z_{j}\in\mathcal{S}_{i}}\ell(f,z_{j})+r(f)-\gamma\|f-\bar{f}_{\backslash i}\|_{\mathcal{H}},$$

$$\bar{f}_{\backslash i}=\frac{1}{m-1}\sum_{j=1,j\not=i}^{m}\hat{f}_{j}.$$

$$R(f)=\mathbb{E}_{z}[\ell(f,z)]+r(f)\quad\text{and}\quad f_{\ast}=\operatorname*{arg\,min}_{f\in\mathcal{H}}R(f).$$

$$\langle\nabla R(f),f-f^{\prime}\rangle_{\mathcal{H}}+\frac{\eta}{2}\|f-f^{\prime}\|_{\mathcal{H}}$$

$$R(tf+(1-t)f^{\prime})\leq tR(f)+(1-t)R(f^{\prime})-\frac{1}{2}\eta t(1-t)\|f-f^{\prime}\|_{\mathcal{H}}^{2}.$$

$$\|\nabla\ell(f,\cdot)-\nabla\ell(f^{\prime},\cdot)\|_{\mathcal{H}}\leq\tau\|f-f^{\prime}\|_{\mathcal{H}}.$$

$$\|\nabla r(f)-\nabla r(f^{\prime})\|_{\mathcal{H}}\leq\tau^{\prime}\|f-f^{\prime}\|_{\mathcal{H}}.$$

$$\|\nu(f,\cdot)-\nu(f^{\prime},\cdot)\|_{\mathcal{H}}\leq L\|f-f^{\prime}\|_{\mathcal{H}}.$$

$$\|\nabla\ell(f^{\ast},\cdot)\|_{\mathcal{H}}\leq M.$$

$$m\leq\frac{N\eta}{4\tilde{\tau}\log C(\mathcal{H},\epsilon)},$$

$$\begin{aligned}R(\bar{f})-R(f_{\ast})\leq{}&\frac{16\tilde{\tau}\log(4m/\delta)}{n^{2}\eta}+\frac{128\tau H_{\ast}\log(4m/\delta)}{n\eta}\\&+\frac{32\tilde{\tau}^{2}\epsilon^{2}}{\eta}+\frac{64\tilde{\tau}L\epsilon\log C(\mathcal{H},\epsilon)}{n\eta}\\&+\frac{64\tilde{\tau}\epsilon^{2}\log^{2}C(\mathcal{H},\epsilon)}{n^{2}\eta}-\Delta_{\bar{f}},\end{aligned}$$

$$\frac{32\tilde{\tau}^{2}\epsilon^{2}}{\eta}+\frac{64\tilde{\tau}L\epsilon\log C(\mathcal{H},\epsilon)}{n\eta}+\frac{64\tilde{\tau}\epsilon^{2}\log^{2}C(\mathcal{H},\epsilon)}{n^{2}\eta}$$

$$R(\bar{f})-R(f_{\ast})=\mathcal{O}\left(\frac{H_{\ast}\log(m)}{n}+\frac{\log(\mathcal{N}(\mathcal{H},\frac{1}{n}))}{n^{2}}-\Delta_{\bar{f}}\right).$$

$$\mathcal{O}\left(\frac{\log(m)}{n^{2}}+\frac{\log(\mathcal{N}(\mathcal{H},\frac{1}{n}))}{n^{2}}-\Delta_{\bar{f}}\right).$$

$$\mathcal{H}=\left\{f=\mathbf{w}^{\mathrm{T}}\mathbf{x}\,\Big|\,\mathbf{w}\in\mathbb{R}^{d},\|\mathbf{w}\|_{2}\leq B\right\}.$$

$$\log(C(\mathcal{H},\epsilon))\leq d\log\left(\frac{6B}{\epsilon}\right).$$

$$R(\bar{f})-R(f_{\ast})$$

$$\mathcal{O}\left(\frac{d\log(mn)}{n^{2}}-\Delta_{\bar{f}}\right)=\mathcal{O}\left(\frac{d\log N}{n^{2}}-\Delta_{\bar{f}}\right).$$

$$\langle K(\mathbf{x},\cdot),f\rangle_{K}=f(\mathbf{x}),\quad\forall\mathbf{x}\in\mathcal{X},\ f\in\mathcal{H}_{K}.$$

Taxonomy

Topics: Distributed systems and fault tolerance · Optimization and Search Problems · Cooperative Communication and Network Coding

Full text

This work is supported in part by the National Key Research and Development Program of China (No.2016YFB1000604), the Science and Technology Project of Beijing (No.Z181100002718004), the National Natural Science Foundation of China (No.6173396, No.61673293, No.61602467), and the Excellent Talent Introduction of Institute of Information Engineering of CAS (Y7Z0111107). Y. Liu, J. Li and W.P. Wang are with the Institute of Information Engineering, Chinese Academy of Sciences (e-mail: [email protected]).

Abstract

We study the risk performance of distributed learning for the regularization empirical risk minimization with fast convergence rate, substantially improving the error analysis of the existing divide-and-conquer based distributed learning. An interesting theoretical finding is that the larger the diversity of each local estimate is, the tighter the risk bound is. This theoretical analysis motivates us to devise an effective max-diversity distributed learning algorithm (MDD). Experimental results show that our proposed method can outperform the existing divide-and-conquer methods but with a bit more time. Theoretical analysis and empirical results demonstrate that our MDD is sound and effective.

Index Terms:

Distributed Learning, Empirical Risk Minimization

I Introduction

In the era of big data, the rapid expansion of computing capacities in automatic data generation and acquisition brings data of unprecedented size and complexity, and raises a series of scientific challenges such as storage bottlenecks and algorithmic scalability [1, 2, 3]. Distributed learning based on a divide-and-conquer approach has prompted a great deal of recent research in various areas such as optimization [4], data mining [5], and machine learning [6]. This learning strategy breaks up a big problem into manageable pieces, runs learning algorithms on each piece on individual machines or processors, and then puts the individual solutions together to get a final global output. In this way, distributed learning is a feasible technique to conquer big data challenges.

This paper aims at error analysis of the distributed learning for (regularization) empirical risk minimization. Given

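$$\mathcal{S}=\{z_{i}=(\mathbf{x}_{i},y_{i})\}_{i=1}^{N}\in(\mathcal{Z}=\mathcal{X}\times\mathcal{Y})^{N},$$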

drawn identically and independently from a fixed, but unknown probability distribution $\mathbb{P}$ on $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$ , the (regularization) empirical risk minimization can be stated as

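$$\hat{f}=\operatorname*{arg\,min}_{f\in\mathcal{H}}\hat{R}(f):=\frac{1}{N}\sum_{j=1}^{N}\ell(f,z_{j})+r(f),$$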

where $\ell(f,z)$ is a loss function, $r(f)$ is a regularizer, and $\mathcal{H}$ is a hypothesis space. This learning algorithm has been well studied in learning theory, see e.g. [7, 8, 9, 10, 11]. The distributed learning algorithm studied in this paper starts with partitioning the data set $\mathcal{S}$ into $m$ disjoint subsets $\{\mathcal{S}_{i}\}_{i=1}^{m}$ , $|\mathcal{S}_{i}|=\frac{N}{m}=:n$ . Then it assigns each data subset $\mathcal{S}_{i}$ to one machine or processor to produce a local estimator $\hat{f}_{i}$ :

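$$\hat{f}_{i}=\operatorname*{arg\,min}_{f\in\mathcal{H}}\hat{R}_{i}(f):=\frac{1}{|\mathcal{S}_{i}|}\sum_{z_{j}\in\mathcal{S}_{i}}\ell(f,z_{j})+r(f).$$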

The final global estimator $\bar{f}$ is synthesized by

$$\bar{f}=\frac{1}{m}\sum_{i=1}^{m}\hat{f}_{i}.$$
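The partition, local-fit, and averaging steps described above can be sketched for ridge regression as follows. This is a minimal illustration, not the authors' implementation: the square loss with $r(f)=\lambda\|\mathbf{w}\|^{2}$, the helper names `ridge_fit` and `divide_and_conquer_ridge`, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/n)||Xw - y||^2 + lam*||w||^2 in closed form."""
    n, d = X.shape
    A = X.T @ X / n + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ y / n)

def divide_and_conquer_ridge(X, y, m, lam, seed=0):
    """Split the sample into m disjoint parts, fit each part, average the weights."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(y)), m)
    local = np.stack([ridge_fit(X[idx], y[idx], lam) for idx in parts])
    return local.mean(axis=0), local

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true + 0.1 * rng.normal(size=2000)

w_bar, w_local = divide_and_conquer_ridge(X, y, m=10, lam=1e-3)
w_global = ridge_fit(X, y, lam=1e-3)   # single-machine estimate for comparison
```

Here `w_local` holds the $m$ local estimates and `w_bar` their average; comparing `w_bar` with `w_global` gives a feel for how much accuracy the partitioning costs.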

Theoretical foundations of distributed learning form a hot topic in machine learning and have been explored recently in the framework of learning theory [4, 2, 3, 12]. Under local strong convexity, smoothness and a reasonable set of other conditions, [4] showed that the mean-squared error decays as

$$\mathbb{E}\left[\|\bar{f}-f^{\ast}\|^{2}\right]=\mathcal{O}\left(\frac{1}{N}+\frac{1}{n^{2}}\right),$$

where $f^{\ast}$ is the optimal hypothesis in the hypothesis space. Under some eigenfunction assumption, the error analysis for distributed regularized least squares in reproducing kernel Hilbert space (RKHS) was established in [2]: if $m$ is not too large,

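$$\mathbb{E}\left[\|\bar{f}-f^{\ast}\|^{2}\right]=\mathcal{O}\left(\|f_{\ast}\|_{\mathcal{H}}^{2}+\frac{\gamma(\lambda)}{N}\right),$$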

where $\gamma(\lambda)=\sum_{j=1}^{\infty}\frac{\mu_{j}}{\lambda+\mu_{j}}$ , $\mu_{j}$ is the eigenvalue of a Mercer kernel function. Without any eigenfunction assumption, an improved bound was derived for some $1\leq p\leq\infty$ [3]:

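$$\mathbb{E}\left[\|\bar{f}-f^{\ast}\|\right]=\mathcal{O}\left(\left(\frac{\gamma(\lambda)}{N}\right)^{\frac{1}{2}(1-\frac{1}{p})}\left(\frac{1}{N}\right)^{\frac{1}{2p}}\right).$$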

There are two main contributions in this paper. First, under strong convexity, smoothness, and a reasonable set of other conditions, we derive a risk bound of fast rate:

$$R(\bar{f})-R(f_{\ast})=\mathcal{O}\left(\frac{H_{\ast}}{n}+\frac{1}{n^{2}}-\Delta_{\bar{f}}\right),$$

where

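$$\Delta_{\bar{f}}=\mathcal{O}\left(\frac{1}{m^{2}}\sum_{i,j=1,i\not=j}^{m}\|\hat{f}_{i}-\hat{f}_{j}\|^{2}\right)$$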

is the diversity between all partition-based estimates, $R(f)=\mathbb{E}_{z}[\ell(f,z)]+r(f),$ and $H_{\ast}=\mathbb{E}_{z}\left[\ell(f_{\ast},z)\right]$ . When the minimal risk is small, i.e., $H_{\ast}=\mathcal{O}\left(\frac{1}{n}\right)$ , the rate is improved to

$$R(\bar{f})-R(f_{\ast})=\mathcal{O}\left(\frac{1}{n^{2}}-\Delta_{\bar{f}}\right).$$

Thus, if $m\leq\sqrt{N}$ , the order of $R(\bar{f})-R(f_{\ast})$ is faster than $\mathcal{O}\left(\frac{1}{N}-\Delta_{\bar{f}}\right).$ Note that if $\ell(f,z)+r(f)$ is $L$ -Lipschitz continuous over $f$ , the order of $R(\bar{f})-R(f^{\ast})$ is

$$R(\bar{f})-R(f^{\ast})=\mathcal{O}\left(L\sqrt{\mathbb{E}\left[\|\bar{f}-f^{\ast}\|^{2}\right]}\right).$$

Thus, the order of $R(\bar{f})-R(f^{\ast})$ in [4, 2, 3] is at most $\mathcal{O}\big(\frac{1}{\sqrt{N}}\big)$ , which is much slower than that of our bound. Our second contribution is to develop a novel max-diversity distributed learning algorithm. From Equation (2), we know that the larger the diversity $\Delta_{\bar{f}}$ is, the tighter the risk bound is. This interesting theoretical finding motivates us to devise a max-diversity distributed learning algorithm (MDD):

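$$\hat{f}_{i}=\operatorname*{arg\,min}_{f\in\mathcal{H}}\frac{1}{|\mathcal{S}_{i}|}\sum_{z_{j}\in\mathcal{S}_{i}}\ell(f,z_{j})+r(f)-\gamma\|f-\bar{f}_{\backslash i}\|_{\mathcal{H}},$$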

where

$$\bar{f}_{\backslash i}=\frac{1}{m-1}\sum_{j=1,j\not=i}^{m}\hat{f}_{j}.$$

The last term of (3) is to make $\Delta_{\bar{f}}$ large. Experimental results on many datasets show that our proposed MDD is sound and efficient.

The notion of diversity is widely used in ensemble learning to improve performance. But to the best of our knowledge, this is the first time that theoretical results w.r.t. diversity are given for a distributed setting.

The rest of the paper is organized as follows. In Section II, we derive a risk bound of distributed learning with fast convergence rate. In Section III, we propose two novel algorithms based on the max-diversity of each local estimate in linear space and RKHS. In Section IV, we empirically analyze the performance of our proposed algorithms. We conclude in Section V. All the proofs are given in Section VI.

II Error Analysis of Distributed Learning

In this section, we will derive a sharper risk bound under some common assumptions.

II-A Assumptions

In the following, we use $\|\cdot\|_{\mathcal{H}}$ to denote the norm induced by inner product of the Hilbert space $\mathcal{H}$ . Let the expected risk $R(f)$ and $f_{\ast}$ be

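$$R(f)=\mathbb{E}_{z}[\ell(f,z)]+r(f)\quad\text{and}\quad f_{\ast}=\operatorname*{arg\,min}_{f\in\mathcal{H}}R(f).$$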

Assumption 1.

The risk $R(f)$ is an $\eta$ -strongly convex function, that is $\forall f,f^{\prime}\in\mathcal{H}$ ,

[TABLE]

or (another equivalent definition) $\forall f,f^{\prime}\in\mathcal{H},t\in[0,1]$ ,

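$$R(tf+(1-t)f^{\prime})\leq tR(f)+(1-t)R(f^{\prime})-\frac{1}{2}\eta t(1-t)\|f-f^{\prime}\|_{\mathcal{H}}^{2}.$$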

Assumption 2.

The empirical risk $\hat{R}(f)$ is a convex function.

Assumption 3.

The loss function $\ell(f,z)$ is $\tau$ -smooth with respect to the first variable $f$ , that is $\forall f,f^{\prime}\in\mathcal{H}$ ,

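$$\|\nabla\ell(f,\cdot)-\nabla\ell(f^{\prime},\cdot)\|_{\mathcal{H}}\leq\tau\|f-f^{\prime}\|_{\mathcal{H}}.$$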

Assumption 4.

The regularizer $r(f)$ is a $\tau^{\prime}$ -smooth function, that is $\forall f,f^{\prime}\in\mathcal{H}$ ,

$$\|\nabla r(f)-\nabla r(f^{\prime})\|_{\mathcal{H}}\leq\tau^{\prime}\|f-f^{\prime}\|_{\mathcal{H}}.$$

Assumption 5.

The function $\nu(f,z)=\ell(f,z)+r(f)$ is $L$ -Lipschitz continuous with respect to the first variable $f$ , that is $\forall f,f^{\prime}\in\mathcal{H}$ ,

$$\|\nu(f,\cdot)-\nu(f^{\prime},\cdot)\|_{\mathcal{H}}\leq L\|f-f^{\prime}\|_{\mathcal{H}}.$$

Assumptions 1, 2, 3, 4 and 5 allow us to model some popular losses, such as square loss and logistic loss, and some regularizers, such as $r(f)=\lambda\|f\|_{\mathcal{H}}^{2}$ .

Assumption 6.

We assume that the gradient at $f_{\ast}$ is upper bounded by $M$ , that is

$$\|\nabla\ell(f^{\ast},\cdot)\|_{\mathcal{H}}\leq M.$$

Assumption 6 is also a common assumption, which is used in [13, 4].

II-B Faster Rate of Distributed Learning

Let $\mathcal{N}(\mathcal{H},\epsilon)$ be the $\epsilon$ -net of $\mathcal{H}$ with minimal cardinality, and let $C(\mathcal{H},\epsilon)=|\mathcal{N}(\mathcal{H},\epsilon)|$ denote the covering number.

Theorem 1.

For any $0<\delta<1$ , $\epsilon\geq 0$ , under Assumptions 1, 2, 3, 4, 5 and 6, and when

$$m\leq\frac{N\eta}{4\tilde{\tau}\log C(\mathcal{H},\epsilon)},$$

with probability at least $1-\delta$ , we have

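$$\begin{aligned}R(\bar{f})-R(f_{\ast})\leq{}&\frac{16\tilde{\tau}\log(4m/\delta)}{n^{2}\eta}+\frac{128\tau H_{\ast}\log(4m/\delta)}{n\eta}\\&+\frac{32\tilde{\tau}^{2}\epsilon^{2}}{\eta}+\frac{64\tilde{\tau}L\epsilon\log C(\mathcal{H},\epsilon)}{n\eta}\\&+\frac{64\tilde{\tau}\epsilon^{2}\log^{2}C(\mathcal{H},\epsilon)}{n^{2}\eta}-\Delta_{\bar{f}},\end{aligned}$$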

where $\Delta_{\bar{f}}=\frac{\eta}{4m^{2}}\sum_{i,j=1,i\not=j}^{m}\|\hat{f}_{i}-\hat{f}_{j}\|_{\mathcal{H}}^{2}$ , $H_{\ast}=\mathbb{E}_{z}\left[\ell(f_{\ast},z)\right]$ and $\tilde{\tau}=\tau+\tau^{\prime}$ .
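For finite-dimensional estimates, the diversity term $\Delta_{\bar{f}}$ defined above is cheap to evaluate. The sketch below assumes each local estimate is a weight vector stored as a row of a matrix `W` (a hypothetical layout chosen for the example) and also uses the standard identity $\sum_{i\neq j}\|\mathbf{w}_{i}-\mathbf{w}_{j}\|^{2}=2m\sum_{i}\|\mathbf{w}_{i}-\bar{\mathbf{w}}\|^{2}$, which shows that $\Delta_{\bar{f}}$ measures the spread of the local estimates around their average.

```python
import numpy as np

def diversity(W, eta):
    """Delta = eta/(4 m^2) * sum_{i != j} ||w_i - w_j||^2 for the rows w_i of W."""
    m = W.shape[0]
    diffs = W[:, None, :] - W[None, :, :]   # shape (m, m, d); the i == j entries are zero
    return eta / (4 * m**2) * np.sum(diffs**2)

def diversity_centered(W, eta):
    """Same quantity written as eta/(2m) * sum_i ||w_i - w_bar||^2."""
    m = W.shape[0]
    return eta / (2 * m) * np.sum((W - W.mean(axis=0))**2)

W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
delta = diversity(W, eta=2.0)
```

The centered form needs only $\mathcal{O}(md)$ operations instead of $\mathcal{O}(m^{2}d)$, which may matter when the number of workers is large.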

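From the above theorem, an interesting finding is that the larger the diversity of the local estimates is, the tighter the risk bound is. Furthermore, one can also see that when $\epsilon$ is small enough,

$$\frac{32\tilde{\tau}^{2}\epsilon^{2}}{\eta}+\frac{64\tilde{\tau}L\epsilon\log C(\mathcal{H},\epsilon)}{n\eta}+\frac{64\tilde{\tau}\epsilon^{2}\log^{2}C(\mathcal{H},\epsilon)}{n^{2}\eta}$$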

will become non-dominating. To be specific, we have the following corollary:

Corollary 1.

By setting $\epsilon=\frac{1}{n}$ in Theorem 1, when $m\leq\frac{N\eta}{4\tilde{\tau}\log C(\mathcal{H},1/n)}$ , with high probability, we have

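$$R(\bar{f})-R(f_{\ast})=\mathcal{O}\left(\frac{H_{\ast}\log(m)}{n}+\frac{\log(\mathcal{N}(\mathcal{H},\frac{1}{n}))}{n^{2}}-\Delta_{\bar{f}}\right).$$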

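If the minimal risk $H_{\ast}$ is small, i.e., $H_{\ast}=\mathcal{O}(\frac{1}{n})$ , the rate can reach

$$\mathcal{O}\left(\frac{\log(m)}{n^{2}}+\frac{\log(\mathcal{N}(\mathcal{H},\frac{1}{n}))}{n^{2}}-\Delta_{\bar{f}}\right).$$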

To the best of our knowledge, this is the first $\tilde{\mathcal{O}}\left(\frac{1}{n^{2}}\right)$ -type of distributed risk bound for (regularization) empirical risk minimization.
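As a worked check of why the $\epsilon$-dependent terms of Theorem 1 stop dominating at $\epsilon=\frac{1}{n}$, the substitution can be carried out term by term. This is a sketch rather than part of the original proof; the last inequality uses the condition $m\leq\frac{N\eta}{4\tilde{\tau}\log C(\mathcal{H},1/n)}$ of Corollary 1, rewritten with $n=N/m$ as $\log C(\mathcal{H},1/n)\leq\frac{n\eta}{4\tilde{\tau}}$.

```latex
\begin{align*}
\left.\frac{32\tilde{\tau}^{2}\epsilon^{2}}{\eta}\right|_{\epsilon=1/n}
  &= \frac{32\tilde{\tau}^{2}}{n^{2}\eta},\\
\left.\frac{64\tilde{\tau}L\epsilon\log C(\mathcal{H},\epsilon)}{n\eta}\right|_{\epsilon=1/n}
  &= \frac{64\tilde{\tau}L\log C(\mathcal{H},1/n)}{n^{2}\eta},\\
\left.\frac{64\tilde{\tau}\epsilon^{2}\log^{2}C(\mathcal{H},\epsilon)}{n^{2}\eta}\right|_{\epsilon=1/n}
  &= \frac{64\tilde{\tau}\log^{2}C(\mathcal{H},1/n)}{n^{4}\eta}
  \leq \frac{16\log C(\mathcal{H},1/n)}{n^{3}}.
\end{align*}
```

Treating $\tilde{\tau}$, $L$ and $\eta$ as constants and assuming $\log C(\mathcal{H},1/n)\geq 1$, all three terms are of order at most $\frac{\log C(\mathcal{H},1/n)}{n^{2}}$, which matches the second term of the corollary.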

Next, we consider two popular hypothesis spaces: the linear space and the reproducing kernel Hilbert space (RKHS).

II-C Linear Space

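The linear hypothesis space we consider is defined as

$$\mathcal{H}=\left\{f=\mathbf{w}^{\mathrm{T}}\mathbf{x}\,\Big|\,\mathbf{w}\in\mathbb{R}^{d},\|\mathbf{w}\|_{2}\leq B\right\}.$$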

From [14], the covering number of the linear hypothesis space can be bounded by

$$\log(C(\mathcal{H},\epsilon))\leq d\log\left(\frac{6B}{\epsilon}\right).$$

Thus, if we set $\epsilon=\frac{1}{n}$ , from Corollary 1, we have

[TABLE]

When the minimal risk is small, i.e., $H_{\ast}=\mathcal{O}\left(\frac{d}{n}\right)$ , the rate is improved to

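$$\mathcal{O}\left(\frac{d\log(mn)}{n^{2}}-\Delta_{\bar{f}}\right)=\mathcal{O}\left(\frac{d\log N}{n^{2}}-\Delta_{\bar{f}}\right).$$

Therefore, if $m\leq\sqrt{\frac{N}{d\log N}}$ , the order of the risk bound can be even faster than $\mathcal{O}\left(\frac{1}{N}\right).$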
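The condition on $m$ follows from $n=N/m$: the term $\frac{d\log N}{n^{2}}=\frac{m^{2}d\log N}{N^{2}}$ is at most $\frac{1}{N}$ exactly when $m^{2}\leq\frac{N}{d\log N}$. The short script below checks this arithmetic for one illustrative choice of $N$ and $d$ (both values, and the helper names, are assumptions for the example).

```python
import math

def log_covering_bound(d, B, eps):
    """Upper bound d * log(6B / eps) on log C(H, eps) for the linear class."""
    return d * math.log(6.0 * B / eps)

def max_workers(N, d):
    """Largest integer m with m <= sqrt(N / (d log N))."""
    return math.floor(math.sqrt(N / (d * math.log(N))))

def linear_rate(N, m, d):
    """The d log N / n^2 term with n = N / m (constants dropped)."""
    n = N / m
    return d * math.log(N) / n**2

N, d = 1_000_000, 10
m = max_workers(N, d)
rate = linear_rate(N, m, d)
```

With these values the admissible number of workers is modest compared with $N$, which illustrates that the faster-than-$\frac{1}{N}$ regime requires each worker to keep a large share of the data.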

II-D Reproducing Kernel Hilbert Space (RKHS)

The reproducing kernel Hilbert space $\mathcal{H}_{K}$ associated with the kernel $K$ is defined to be the closure of the linear span of the set of functions $\left\{K(\mathbf{x},\cdot):\mathbf{x}\in\mathcal{X}\right\}$ with the inner product satisfying

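$$\langle K(\mathbf{x},\cdot),f\rangle_{K}=f(\mathbf{x}),\quad\forall\mathbf{x}\in\mathcal{X},\ f\in\mathcal{H}_{K}.$$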

The hypothesis space of the reproducing kernel Hilbert space we considered in this paper is

[TABLE]

From [15], if the kernel function $K$ is the popular Gaussian kernel over $[0,1]^{d}$ :

[TABLE]

then for $0\leq\epsilon\leq\frac{B}{2}$ ,

[TABLE]

From Corollary 1, if we set $\epsilon=\frac{1}{n}$ and assume $H_{\ast}=\mathcal{O}\left(\frac{1}{n}\right)$ , we have

[TABLE]

Therefore, if $m\leq\min\left\{\sqrt{\frac{N}{d\log N}},\sqrt{\frac{N}{\log^{d}n}}\right\}$ , the order is faster than $\mathcal{O}\left(\frac{1}{N}\right)$ .

II-E Comparison with Related Work

In this subsection, we compare our bound with the related work [4, 2, 3]. Under smoothness, strong convexity, and some other assumptions, a distributed risk bound is given in [4]:

[TABLE]

Under some eigenfunction assumption, the error analysis for distributed regularized least squares was established in [2]:

[TABLE]

By removing the eigenfunction assumptions of [2] with a novel integral operator method, a new bound was derived in [3]:

[TABLE]

Note that, if $\nu(f,z)$ is $L$ -Lipschitz continuous over $f$ , that is

[TABLE]

we can obtain that

[TABLE]

Thus, the order of $R(\bar{f})-R(f_{\ast})$ in [2, 3, 4] is at most $\mathcal{O}\left(\frac{1}{\sqrt{N}}\right).$

According to Subsections II-C and II-D, if $m$ is not very large and $H_{\ast}$ is small, the order of our bound can be even faster than $\mathcal{O}\left(\frac{1}{N}\right)$ , which is much faster than those in the related work [4, 2, 3].

III Max-Diversity Distributed Learning (MDD)

In this section, we propose two novel algorithms, one for linear space and one for RKHS. From Corollary 1, we know that

[TABLE]

Thus, to obtain a tighter bound, the diversity of the local estimates $\hat{f}_{i},i=1,\ldots,m$ , should be larger.

III-A Linear Hypothesis Space

When $\mathcal{H}$ is a linear hypothesis space, we consider the following optimization problem:

[TABLE]

where $\bar{\mathbf{w}}_{\backslash i}=\frac{1}{m-1}\sum_{j=1,j\not=i}^{m}\hat{\mathbf{w}}_{j}$ . Note that, given $\bar{\mathbf{w}}_{\backslash i}$ , $\hat{\mathbf{w}}_{i}$ has the following closed-form solution:

[TABLE]

where $\mathbf{X}_{\mathcal{S}_{i}}=(\mathbf{x}_{t_{1}},\mathbf{x}_{t_{2}},\ldots,\mathbf{x}_{t_{n}})$ , $\mathbf{y}_{\mathcal{S}_{i}}=(y_{t_{1}},y_{t_{2}},\ldots,y_{t_{n}})^{\mathrm{T}}$ , $z_{t_{j}}\in\mathcal{S}_{i}$ , $j=1,\ldots,n$ . Next, we give an iterative algorithm to solve the optimization problem (11). In each iteration, we must compute $\mathbf{A}_{i}^{-1}\bar{\mathbf{w}}_{\backslash i}$ , which costs $\mathcal{O}\left(d^{2}\right)$ even when $\mathbf{A}_{i}^{-1}$ is given and is therefore computationally intensive. Fortunately, by Lemma 4 (see the supplementary material), $\mathbf{A}_{i}^{-1}\bar{\mathbf{w}}_{\backslash i}$ can be computed by

[TABLE]

where $a./\mathbf{c}=(a/c_{1},\ldots a/c_{d})^{\mathrm{T}}$ , which only needs $\mathcal{O}(d)$ .

The Max-Diversity Distributed Learning algorithm for linear space is given in Algorithm 1. Compared with the traditional divide-and-conquer method, our MDD for linear space only adds $\mathcal{O}(d)$ work per iteration on each worker node.
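Since Algorithm 1 itself is not reproduced here, the following is only a rough sketch of how an alternating max-diversity update could look for the square loss. It rests on several assumptions that are not taken from the paper: the local objective is taken to be $\frac{1}{n}\|\mathbf{X}\mathbf{w}-\mathbf{y}\|^{2}+\lambda\|\mathbf{w}\|^{2}-\gamma\|\mathbf{w}-\bar{\mathbf{w}}_{\backslash i}\|^{2}$ (squared norm, with $\lambda>\gamma$ so that it stays strongly convex), all workers are updated simultaneously, each update solves a $d\times d$ linear system directly instead of using the $\mathcal{O}(d)$ shortcut, and the function names are hypothetical.

```python
import numpy as np

def mdd_local_update(X, y, w_rest, lam, gamma):
    """Minimize (1/n)||Xw - y||^2 + lam||w||^2 - gamma||w - w_rest||^2 (assumed form).

    Setting the gradient to zero gives
    (X^T X / n + (lam - gamma) I) w = X^T y / n - gamma * w_rest.
    """
    n, d = X.shape
    A = X.T @ X / n + (lam - gamma) * np.eye(d)
    return np.linalg.solve(A, X.T @ y / n - gamma * w_rest)

def mdd_ls(parts, lam, gamma, T=5):
    """parts: list of (X_i, y_i). Returns (average weights, local weights)."""
    m = len(parts)
    d = parts[0][0].shape[1]
    # Start from plain local ridge estimates (gamma = 0).
    W = np.stack([mdd_local_update(X, y, np.zeros(d), lam, 0.0) for X, y in parts])
    for _ in range(T):
        # Leave-one-out averages: w_rest_i = (sum_j w_j - w_i) / (m - 1).
        W_rest = (W.sum(axis=0) - W) / (m - 1)
        W = np.stack([mdd_local_update(X, y, W_rest[i], lam, gamma)
                      for i, (X, y) in enumerate(parts)])
    return W.mean(axis=0), W

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5, 3.0])
X = rng.normal(size=(600, 4))
y = X @ w_true + 0.1 * rng.normal(size=600)
parts = [(X[i::3], y[i::3]) for i in range(3)]   # 3 disjoint subsets
w_mdd, W_mdd = mdd_ls(parts, lam=1e-2, gamma=1e-3)
```

With `gamma=0.0` the loop leaves the local ridge estimates unchanged, so the sketch reduces to the divide-and-conquer baseline; a positive `gamma` pushes each local estimate away from the average of the others.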

III-B Reproducing Kernel Hilbert Space

When $\mathcal{H}$ is a reproducing kernel Hilbert space, that is $f(\mathbf{x})=\sum_{j=1}^{n}w_{j}K(\mathbf{x}_{j},\mathbf{x})$ , we consider the following optimization problem:

[TABLE]

where $\mathbf{K}_{\mathcal{S}_{i}}=\Big{[}K(\mathbf{x}_{t_{j}},\mathbf{x}_{t_{j^{\prime}}})\Big{]}_{j,j^{\prime}=1}^{n}$ , $z_{t_{j}},z_{t_{j^{\prime}}}\in\mathcal{S}_{i}$ , $\mathbf{K}_{\mathcal{S}_{i},\mathcal{S}_{j}}=\Big{[}K(\mathbf{x}_{t_{j}},\mathbf{x}_{t_{k}})\Big{]}_{j,k=1}^{n}$ , $z_{t_{j}}\in\mathcal{S}_{i},z_{t_{k}}\in\mathcal{S}_{j}$ . Note that $\hat{\mathbf{w}}_{i}$ can be written as

[TABLE]

where $\mathbf{g}_{j}=\mathbf{K}_{\mathcal{S}_{i},\mathcal{S}_{j}}\hat{\mathbf{w}}_{j}$ and $\bar{\mathbf{g}}_{\backslash i}=\frac{1}{m-1}\sum_{j=1,j\not=i}^{m}\hat{\mathbf{g}}_{j}$ .

As in the linear space, we need to compute $\mathbf{A}_{i}^{-1}\bar{\mathbf{g}}_{\backslash i}$ in each iteration. From Lemma 4 (see the supplementary material), we know that

[TABLE]

The Max-Diversity Distributed Learning algorithm for RKHS is given in Algorithm 2. Compared with the traditional divide-and-conquer method, our MDD for RKHS only adds $\mathcal{O}(n)$ work per iteration on each local machine.

Remark 1.

This paper was inspired by ensemble learning, but it should be emphasized that its theoretical proofs and algorithm design are not taken from ensemble learning.

III-C Complexity

Linear space: At the very beginning, we need $\mathcal{O}\left(nd^{2}\right)$ to compute $\mathbf{A}_{i}$ and $\mathcal{O}(d^{3})$ to compute $\mathbf{A}_{i}^{-1}$ on each worker node. In each iteration, worker nodes cost $\mathcal{O}(d)$ to compute $\mathbf{d}^{t}_{i}$ and the server node costs $O(md)$ to compute $\bar{\mathbf{w}}^{t}_{\backslash i}$ . So, the sequential computation complexity is $\mathcal{O}\left(nd^{2}+d^{3}+Tmd\right)$ , where $T$ is the number of iterations. Moreover, the total communication complexity is $O(Td)$ .

RKHS: At the very beginning, we need $\mathcal{O}\left(n^{2}d\right)$ to compute $\mathbf{A}_{i}$ and $\mathcal{O}(n^{3})$ to compute $\mathbf{A}_{i}^{-1}$ . In each iteration, worker nodes cost $\mathcal{O}(n)$ to compute $\mathbf{d}^{t}_{i}$ and the server node costs $O(mn)$ to compute $\bar{\mathbf{g}}^{t}_{\backslash i}$ . So, the sequential computation complexity is $\mathcal{O}\left(n^{2}d+n^{3}+Tmn\right)$ , where $T$ is the number of iterations. Moreover, the total communication complexity is $O(Tn)$ .

Divide-and-conquer approach: The sequential complexities of linear space and RKHS are $\mathcal{O}\left(nd^{2}+d^{3}\right)$ and $\mathcal{O}\left(n^{2}d+n^{3}\right)$ , respectively. Meanwhile, the communication complexities are $O(d)$ and $O(n)$ .

Global approach: The total complexities of linear space and RKHS are $\mathcal{O}\left(Nd^{2}+d^{3}\right)$ and $\mathcal{O}\left(N^{2}d+N^{3}\right)$ , respectively.
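To get a feel for these orders, the leading terms can be tabulated for concrete problem sizes. The helper functions below simply evaluate the expressions stated above with all constants dropped; the function names and the example values of $N$, $m$, $d$ and $T$ are assumptions for illustration.

```python
def linear_costs(N, m, d, T):
    """Leading-order operation counts for the linear-space methods (constants dropped)."""
    n = N // m
    return {
        "global": N * d**2 + d**3,
        "divide_and_conquer": n * d**2 + d**3,
        "mdd": n * d**2 + d**3 + T * m * d,
    }

def rkhs_costs(N, m, d, T):
    """Leading-order operation counts for the RKHS methods (constants dropped)."""
    n = N // m
    return {
        "global": N**2 * d + N**3,
        "divide_and_conquer": n**2 * d + n**3,
        "mdd": n**2 * d + n**3 + T * m * n,
    }

lin = linear_costs(N=100_000, m=10, d=50, T=20)
ker = rkhs_costs(N=100_000, m=10, d=50, T=20)
```

In both cases the extra $Tmd$ or $Tmn$ term of MDD is small next to the one-off cost of forming and inverting $\mathbf{A}_{i}$, which is consistent with the modest running-time overhead reported in the experiments.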

IV Experiments

In this section, we compare our MDD methods with the global method and the divide-and-conquer method in both the linear and the RKHS hypothesis spaces. Specifically, we compare six approaches: global Ridge Regression (RR) [16], divide-and-conquer Ridge Regression (DRR) and our MDD-LS (Algorithm 1) in the linear hypothesis space, and global Kernel Ridge Regression (KRR) [17], divide-and-conquer Kernel Ridge Regression (KDRR) [2] and our MDD-RKHS (Algorithm 2) in the reproducing kernel Hilbert space. We implemented the divide-and-conquer methods and the MDD methods on the recent distributed machine learning platform PARAMETER SERVER [18] and ran the experiments on this framework.

We experiment on 10 publicly available datasets from LIBSVM data (available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). We run all methods on a computer node with 32 cores (2.40GHz) and 64 GB memory. While global methods only use a single CPU core, distributed methods use all cores to simulate a parallel environment. For RKHS methods, we use the popular Gaussian kernels
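For readers unfamiliar with the kernel baseline, a divide-and-conquer kernel ridge regression in the spirit of KDRR can be sketched as below: each subset gets its own KRR model and the predictions are averaged. This is not the experimental code; the Gaussian kernel scaling $\exp(-\|\mathbf{x}-\mathbf{x}^{\prime}\|^{2}/(2\sigma^{2}))$, the $\lambda n$ regularization convention, the helper names and the toy data are all assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Assumed form exp(-||a - b||^2 / (2 sigma^2)); the paper's scaling may differ."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))

def krr_fit(X, y, lam, sigma):
    """Dual coefficients of kernel ridge regression: (K + lam * n * I)^{-1} y."""
    n = len(y)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def kdrr_predict(parts, X_test, lam, sigma):
    """Average the predictions of KRR models fitted on disjoint subsets."""
    preds = [gaussian_kernel(X_test, Xi, sigma) @ krr_fit(Xi, yi, lam, sigma)
             for Xi, yi in parts]
    return np.mean(preds, axis=0)

# Toy one-dimensional regression problem, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.05 * rng.normal(size=400)
parts = [(X[i::4], y[i::4]) for i in range(4)]
X_test = np.linspace(0.1, 0.9, 50)[:, None]
y_hat = kdrr_predict(parts, X_test, lam=1e-4, sigma=0.2)
rmse = np.sqrt(np.mean((y_hat - np.sin(2 * np.pi * X_test[:, 0]))**2))
```

Each local solve costs $\mathcal{O}(n^{3})$ rather than $\mathcal{O}(N^{3})$, which is the source of the speedup of the distributed kernel methods over global KRR.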

[TABLE]

as candidate kernels, and choose the best kernel from $\sigma\in\{2^{i},i=-10,-9,\dots,10\}$ by 5-fold cross-validation. The regularization parameter $\lambda\in\{10^{i},i=-6,-5,\dots,3\}$ in all methods and $\gamma\in\{10^{i},i=-6,-5,\dots,3\}$ in MDD methods are determined by 5-fold cross-validation on the training data. For each data set, we run all methods 30 times, each time with a random, non-overlapping split into 70% training data and 30% testing data. All statements of statistical significance in the remainder refer to a 95% level of significance under a $t$ -test.
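The protocol above can be written down compactly. The sketch below only builds the parameter grids and the index splits; the helper names, the sample size of 1000 and the use of NumPy's random generator are assumptions, and no model is trained.

```python
import numpy as np

sigmas = [2.0**i for i in range(-10, 11)]    # kernel widths, 21 candidates
lambdas = [10.0**i for i in range(-6, 4)]    # regularization parameter, 10 candidates
gammas = [10.0**i for i in range(-6, 4)]     # diversity weight, MDD methods only

def train_test_split_70_30(n_samples, rng):
    """Random, non-overlapping 70%/30% split of the indices 0..n_samples-1."""
    perm = rng.permutation(n_samples)
    cut = int(round(0.7 * n_samples))
    return perm[:cut], perm[cut:]

def five_fold_indices(train_idx, rng):
    """Yield (fit, validate) index pairs for 5-fold cross-validation."""
    folds = np.array_split(rng.permutation(train_idx), 5)
    for k in range(5):
        fit = np.concatenate([folds[j] for j in range(5) if j != k])
        yield fit, folds[k]

rng = np.random.default_rng(0)
splits = [train_test_split_70_30(1000, rng) for _ in range(30)]   # 30 repetitions
folds = list(five_fold_indices(splits[0][0], rng))
```

Hyperparameters would then be chosen on the `(fit, validate)` pairs of each training split, and the reported error is the test RMSE averaged over the 30 repetitions.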

The root mean square error of all methods is reported in Table I. For simplicity, we run the distributed methods with two numbers of worker nodes, 5 and 10. Table I can be summarized as follows:

1) Our MDD-LS and MDD-RKHS exhibit better prediction accuracy than DRR and KDRR over almost all data sets. This demonstrates the advantage of MDD methods in generalization performance.

2) Our MDD-LS and MDD-RKHS give results comparable with those of the global methods on most data sets.

3) Kernel methods usually achieve better results than linear methods.

4) Some data sets are sensitive to the data partition and show a large gap between global and distributed methods, such as space_ga and phishing for RKHS, while others are not.

5) Increasing the number of worker nodes generally leads to a higher root mean square error.

The running time is reported in Table II, which can be summarized as follows:

1) Global methods cost more time than distributed methods on all data sets.

2) Kernel methods always spend more time than linear methods because of their higher computational complexity.

3) Distributed methods lead to great speedups on some data sets.

4) The running time of distributed methods decreases almost linearly as the number of worker nodes increases.

5) Compared with global methods, our MDD methods have higher computational efficiency, while remaining only slightly slower than divide-and-conquer methods.

The above results show that MDD methods need a bit more training time but narrow the performance gap between global methods and traditional distributed methods, which is consistent with our theoretical analysis.
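These statements can be checked directly against the reported numbers. The snippet below copies the RR, DRR-5 and MDD-LS-5 rows of Tables I and II and computes the relative RMSE gap to the global method and the running-time overhead of MDD-LS over DRR; the variable and function names are illustrative.

```python
datasets = ["madelon", "space_ga", "cpusmall", "phishing", "cadata",
            "a8a", "a9a", "codrna", "YearPred"]
rmse = {   # rows of Table I
    "RR":       [0.971, 2.585, 45.150, 0.247, 1.932, 0.671, 0.673, 0.841, 12.233],
    "DRR-5":    [0.989, 2.814, 53.114, 0.262, 2.659, 0.681, 0.680, 0.855, 14.216],
    "MDD-LS-5": [0.977, 2.677, 46.184, 0.257, 2.114, 0.677, 0.673, 0.847, 12.303],
}
seconds = {   # rows of Table II
    "RR":       [2.069, 0.280, 1.218, 1.526, 0.490, 2.544, 2.957, 1.866, 10.433],
    "DRR-5":    [0.849, 0.094, 0.463, 0.625, 0.363, 0.773, 0.881, 0.736, 3.709],
    "MDD-LS-5": [0.875, 0.115, 0.587, 0.664, 0.427, 0.878, 1.167, 0.876, 4.774],
}

def relative_gap(method):
    """Per-dataset RMSE increase over global RR, as a fraction of the RR value."""
    return [(a - b) / b for a, b in zip(rmse[method], rmse["RR"])]

gap_drr = dict(zip(datasets, relative_gap("DRR-5")))
gap_mdd = dict(zip(datasets, relative_gap("MDD-LS-5")))
overhead = [t_mdd / t_drr
            for t_mdd, t_drr in zip(seconds["MDD-LS-5"], seconds["DRR-5"])]
```

On every data set in these rows the MDD-LS-5 gap is smaller than the DRR-5 gap, while MDD-LS-5 takes longer than DRR-5 but less time than global RR.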

V Conclusion

In this paper, we studied the generalization performance of distributed learning, and derived a sharper generalization error bound, which is much sharper than existing generalization bounds of divide-and-conquer based distributed learning. Then, we designed two algorithms with statistical guarantees and fast convergence rates for linear space and RKHS: MDD-LS and MDD-RKHS. As we see from theoretical analysis and empirical results, our MDD is highly competitive with the existing divide-and-conquer methods, in terms of both practical performance and computational cost. Based on max-diversity of each local estimate, our analysis can be used as a solid basis for the design of new distributed learning algorithms.

VI Proof

VI-A The Key Idea

From the $\eta$ -strong convexity of $R(f)$ in equation (5), we obtain

[TABLE]

Therefore, we have

[TABLE]
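The mechanism behind this step can be sanity-checked numerically: for an $\eta$-strongly convex risk, the risk of the averaged estimate lies below the average of the local risks by at least $\Delta_{\bar{f}}=\frac{\eta}{4m^{2}}\sum_{i\neq j}\|\hat{f}_{i}-\hat{f}_{j}\|^{2}$. The check below uses a toy quadratic risk and random vectors in place of real local estimates; both are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 8
M = rng.normal(size=(d, d))
Q = M @ M.T + 0.5 * np.eye(d)            # Hessian of the toy risk
eta = np.linalg.eigvalsh(Q).min()         # strong-convexity constant of the toy risk

def risk(w):
    """Toy eta-strongly convex risk R(w) = 0.5 * w^T Q w."""
    return 0.5 * w @ Q @ w

W = rng.normal(size=(m, d))               # stand-ins for the local estimates
w_bar = W.mean(axis=0)

diffs = W[:, None, :] - W[None, :, :]
delta = eta / (4 * m**2) * np.sum(diffs**2)

avg_risk = np.mean([risk(w) for w in W])
gap = avg_risk - risk(w_bar)              # strong convexity implies gap >= delta
```

For this quadratic risk the gap equals $\frac{1}{2m}\sum_{i}(\mathbf{w}_{i}-\bar{\mathbf{w}})^{\mathrm{T}}\mathbf{Q}(\mathbf{w}_{i}-\bar{\mathbf{w}})$, so the inequality is tight only when all deviations lie along the eigenvector of $\mathbf{Q}$ with the smallest eigenvalue.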

Next, we estimate $R(\hat{f}_{i})-R(f_{\ast})$ , building upon the following inequality from (4):

[TABLE]

By the convexity of $\hat{R}_{i}(\cdot)$ and the optimality condition of $\hat{f}_{i}$ [19], we have

[TABLE]

Substituting (15) into (14), we have

[TABLE]

VI-B Proof of Theorem 1

To prove Theorem 1, we first give the following two lemmas (the proofs are given at the last part of this section).

Lemma 1.

Under Assumptions 3 and 4, with probability at least $1-\delta$ , for any $f\in\mathcal{N}(\mathcal{H},\epsilon)$ , we have

[TABLE]

Lemma 2.

Under Assumption 3, with probability at least $1-\delta$ , we have

[TABLE]

where $H_{\ast}=\mathbb{E}_{z}\left[\ell(f_{\ast},z)\right]$ .

Proof of Theorem 1.

From the property of $\epsilon$ -net, we know that there exists a point $\tilde{f}\in\mathcal{N}(\mathcal{H},\epsilon)$ such that

[TABLE]

According to Assumptions 3 and 4, we know that $R(f)$ and $\hat{R}(f)$ are both $(\tau+\tau^{\prime})$ -smooth. Thus, we have

[TABLE]

Substituting (20) and (19) into (17), with probability at least $1-2\delta$ , we have

[TABLE]

Note that

[TABLE]

Therefore, we can obtain that

[TABLE]

Substituting the above inequality into (21), we obtain

[TABLE]

Thus, with probability at least $1-2\delta$ , we have

[TABLE]

Combining (13) and (22), with probability at least $1-\delta$ , we have

[TABLE]

∎

VI-C Proof of Lemma 1

Lemma 3 ([10]).

Let $\mathcal{H}$ be a Hilbert space and let $\xi$ be a random variable with values in $\mathcal{H}$ . Assume $\|\xi\|\leq M<\infty$ almost surely. Denote $\sigma^{2}(\xi)=\mathbb{E}[\|\xi\|^{2}]$ . Let $\{\xi_{i}\}_{i=1}^{n}$ be $n$ independent draws of $\xi$ . For any $0<\delta<1$ , with confidence $1-\delta$ ,

[TABLE]

Proof.

According to Assumptions 3 and 4, we know that $\nu(f,\cdot)=\nu(f,z)=\ell(f,z)+r(f)$ is $(\tau+\tau^{\prime})$ -smooth, so we have

[TABLE]

Because $\nu(f,\cdot)$ is $(\tau+\tau^{\prime})$ -smooth and convex, by (2.1.7) of [20], $\forall z\in\mathcal{Z}$ , we have

[TABLE]

Taking expectation over both sides, we have

[TABLE]

where the last inequality follows from the optimality condition of $f_{\ast}$ , i.e.,

[TABLE]

Following Lemma 3, with probability at least $1-\delta$ , we have

[TABLE]

We obtain Lemma 1 by taking the union bound over all $f\in\mathcal{N}(\mathcal{H},\epsilon)$ . ∎

VI-D Appendix: Proof of Lemma 2

Proof.

Since $\ell(f,\cdot)$ is $\tau$ -smooth and nonnegative, from Lemma 4 of [21], we have

[TABLE]

and thus

[TABLE]

From Assumption 6, we have $\|\nabla\ell(f_{\ast},z)\|\leq M$ , $\forall z\in\mathcal{Z}$ . Let $H(f)=R(f)-r(f)$ and $\hat{H}(f)=\hat{R}(f)-r(f)$ . Then, according to Lemma 3, with probability at least $1-\delta$ , we have

[TABLE]

∎

VI-E Proof of Lemma 4

Lemma 4.

For all $l\geq 1$ , if $\mathbf{A}\in\mathbb{R}^{l\times l}$ is a symmetric matrix and $\mathbf{b},\mathbf{d}\in\mathbb{R}^{l}$ , $\mathbf{c}=\mathbf{A}^{-1}\mathbf{b}\in\mathbb{R}^{l}$ , then we have

[TABLE]

where $a./\mathbf{c}=(a/c_{1},\ldots a/c_{l})^{\mathrm{T}}$ .

Proof.

Since $\mathbf{A}$ is a symmetric matrix, we have

[TABLE]

Therefore, we can obtain that $\mathbf{A}^{-1}\mathbf{d}=(\mathbf{d}^{\mathrm{T}}\mathbf{c})./\mathbf{b}$ . ∎

Bibliography

[1] Z.-H. Zhou, N. V. Chawla, Y. Jin, and G. J. Williams, “Big data opportunities and challenges: Discussions from data analytics perspectives [discussion forum],” IEEE Computational Intelligence Magazine, vol. 9, no. 4, pp. 62–74, 2014.
[2] Y. Zhang, J. Duchi, and M. Wainwright, “Divide and conquer kernel ridge regression,” in Proceedings of Conference on Learning Theory (COLT 2013), 2013, pp. 592–617.
[3] S.-B. Lin, X. Guo, and D.-X. Zhou, “Distributed learning with regularized least squares,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 3202–3232, 2017.
[4] Y. Zhang, M. J. Wainwright, and J. C. Duchi, “Communication-efficient algorithms for statistical optimization,” in Advances in Neural Information Processing Systems, 2012, pp. 1502–1510.
[5] X. Wu, X. Zhu, G.-Q. Wu, and W. Ding, “Data mining with big data,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, pp. 97–107, 2014.
[6] D. Gillick, A. Faria, and J. De Nero, “Mapreduce: Distributed computing for machine learning,” Berkeley, Dec, vol. 18, 2006.
[7] E. D. Vito, A. Caponnetto, and L. Rosasco, “Model selection for regularized least-squares algorithm in learning theory,” Foundations of Computational Mathematics, vol. 5, no. 1, pp. 59–85, 2005.
[8] A. Caponnetto and E. D. Vito, “Optimal rates for the regularized least-squares algorithm,” Foundations of Computational Mathematics, vol. 7, no. 3, pp. 331–368, 2007.
