Towards Sharp Analysis for Distributed Learning with Random Features
Jian Li, Yong Liu, Weiping Wang

TL;DR
This paper advances the theoretical understanding of distributed learning with random features by extending optimal rates to non-attainable cases, reducing feature requirements, and improving partition scalability, supported by experiments.
Contribution
It introduces refined analysis techniques for non-attainable cases, data-dependent feature generation, and enhanced partitioning strategies in distributed learning with random features.
Findings
Extended optimal rates to non-attainable cases
Reduced number of random features needed
Improved scalability with additional unlabeled data
Abstract
In recent studies, the generalization properties for distributed learning and random features assumed the existence of the target concept over the hypothesis space. However, this strict condition is not applicable to the more common non-attainable case. In this paper, using refined proof techniques, we first extend the optimal rates for distributed learning with random features to the non-attainable case. Then, we reduce the number of required random features via data-dependent generating strategy, and improve the allowed number of partitions with additional unlabeled data. Theoretical analysis shows these techniques remarkably reduce computational cost while preserving the optimal generalization accuracy under standard assumptions. Finally, we conduct several experiments on both simulated and real-world datasets, and the empirical results validate our theoretical findings.
Table 1: Statistical information of the datasets and hyperparameter settings. Datasets: EEG, EEG∗, covtype, SUSY, HIGGS; columns include the training and testing sets.
Jian Li
Institute of Information Engineering, Chinese Academy of Sciences
Yong Liu (corresponding author)
Gaoling School of Artificial Intelligence, Renmin University of China
Weiping Wang
Institute of Information Engineering, Chinese Academy of Sciences
1 Introduction
A fundamental problem in machine learning is to achieve tradeoffs between statistical properties and computational costs [1, 2], and this challenge is more severe in kernel methods. Despite their excellent theoretical guarantees, kernel methods do not scale well to large-scale settings because of their high time and memory complexities, typically at least quadratic in the number of examples. To break these scalability bottlenecks, researchers have developed a wide range of practical algorithms, including distributed learning, which produces a global model after training on disjoint subsets on individual machines with the necessary communications [3, 4]; Nyström approximation [5, 6, 7] and random Fourier features [8, 9], which alleviate the memory bottleneck; and stochastic methods [10], which improve training efficiency.
From the theoretical perspective, many researchers have studied the statistical properties of these large-scale approaches together with kernel ridge regression (KRR) [6, 11, 4]. Using integral operator techniques [12] and the effective dimension to control the capacity of the RKHS [13], the generalization bounds achieve the optimal learning rates. Recent statistical learning studies on KRR combined with large-scale approaches demonstrate that these approaches not only obtain great computational gains but also retain the optimal theoretical properties, for example KRR with divide-and-conquer [14, 15] and with random projections, including Nyström approximation [6] and random features [9, 16, 17, 18]. Since the communication cost of combining local kernel estimators in the RKHS is high, it is more practical to combine linear estimators in the feature space, e.g. federated learning [19]. Therefore, the generalization analysis for the combination of distributed learning and random features is important in distributed learning.
The existing works on DKRR [14, 4, 15] and random features [9, 20, 18] mainly focus on the attainable case, in which the true regression belongs to the hypothesis space, and ignore the non-attainable case, in which the true regression lies outside the hypothesis space. Since it is hard to select a suitable kernel that guarantees the target function belongs to the kernel space, the non-attainable case is more common in practice. Therefore, statistical guarantees for the non-attainable case are of practical and theoretical interest in statistical learning theory. The optimal rates for DKRR have been extended to a part of the non-attainable case via sharp analysis of the distributed error [10] and multiple communications [21, 22], but these techniques are hard to use to improve the results for random features. Meanwhile, some recent studies extended capacity-independent optimality to the non-attainable case, including distributed learning [23], random features [24] and Nyström approximation [25], but capacity-independent results are suboptimal when the capacity of the RKHS is small. Capacity-dependent optimality for the combination of distributed learning and random features in the non-attainable case is still an open problem.
In this paper, we aim to extend the capacity-dependent optimal guarantees to the non-attainable case and to improve the computational efficiency with more partitions and fewer random features. First, using a refined estimation of the operators' similarity, we refine the optimal generalization error bound so that it allows many more partitions and covers a part of the non-attainable case. Then, generating random features in a data-dependent manner, we relax the restriction on the dimension of random features, so that fewer random features are sufficient to reach the optimal rates. By using additional unlabeled data to reduce the label-independent error terms, we further enlarge the number of partitions and improve the applicable scope in the non-attainable case. Finally, we validate our theoretical findings with extensive experiments. The full proofs are deferred to the appendix.
1.1 Our Contributions
We highlight our contributions as follows:
- On the algorithmic front: much higher computational efficiency. This work presents the largest number of partitions and the smallest dimension of random features known to date, substantially improving the computational efficiency.
  - More partitions. To achieve the optimal learning rate, the traditional distributed KRR methods [4, 14] impose a strict constraint on the number of partitions , which heavily limits the computational efficiency. In this paper, using a novel estimation of the key quantity, we first relax the restriction to . Then, introducing a few additional unlabeled examples, we improve the number of partitions to for the first time.
  - Fewer random features. By generating random features in a data-dependent manner rather than in a data-independent manner, we reduce the requirement on the number of random features from to , where is the number of random features and indicates the bigger one.
- On the theoretical front: covering the non-attainable case. The conventional optimal learning properties for KRR [13, 9, 14] only pertain to the attainable case , assuming the true regression belongs to the hypothesis space where the problems can not be too difficult. However, this condition is too idealized, and the non-attainable case assuming deserves more attention. In this paper, we first restate the classic results in the attainable case . Then, by relaxing the restriction on the number of partitions, we extend the optimal theoretical guarantees to the non-attainable case with the constraints and . Note that we prove KRR with random features applies to all non-attainable cases .
- Extensive experimental validation. To validate our theoretical findings, we conduct extensive experiments on simulated data and real-world data. We first construct simulated experiments under different difficulties to validate the learning rate and training time. Then, we perform a comparison on a small real-world dataset to verify the effectiveness of data-dependent random features (with a novel approximate leverage score function) and additional unlabeled examples. Finally, we compare the proposed DKRR-RF with related work in terms of the performance on three real-world datasets.
- Technical challenges.
  - More partitions with additional unlabeled examples. In the error decomposition, only the sample variance is label-dependent. The other terms are label-independent, and thus we employ additional unlabeled examples to reduce the estimation of the label-independent error terms. We further improve the applicable scope in the non-attainable case to .
  - Random features error in all non-attainable cases. Using an appropriate decomposition at the operator level for the random features error, we prove KRR with random features pertains to both the attainable and non-attainable cases .
Overall, by overcoming several technical hurdles, we present the optimal theoretical guarantees for the combination of DKRR and RF. With more partitions and fewer random features, the theoretical results not only obtain significant computational gains but also preserve the optimal learning properties in both the attainable and non-attainable case . Indeed, KRR [13], DKRR [14], and KRR-RF [9] are special cases of this paper. Thus, the techniques presented here pave the way for studying the statistical guarantees of other types of kernel approaches (even neural networks) that can apply to the non-attainable case.
2 Distributed Learning with Random Features
In a standard framework of supervised learning, there is a probability space with a fixed but unknown distribution , where is the input space and is the output space. The training set is sampled i.i.d. from with respect to . The primary objective is to fit the target regression on . The Reproducing Kernel Hilbert Space (RKHS) induced by a Mercer kernel is defined as the completion of the linear span of with respect to the inner product . In the view of feature mappings, an underlying nonlinear feature mapping associated with the kernel is , so it holds .
2.1 Kernel Ridge Regression (KRR)
With an RKHS norm term, kernel ridge regression (KRR) is one of the popular empirical approaches to conducting a nonparametric regression. KRR can be stated as
[TABLE]
Using the representer theorem, the nonlinear regression problem (1) admits a closed-form solution with
[TABLE]
where and is the kernel matrix with . Although KRR enjoys optimal statistical properties [12, 13], it is infeasible for large-scale settings because of the memory needed to store the kernel matrix and the time needed to solve the linear system (2).
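The closed-form KRR solve and its cost can be illustrated with a minimal NumPy sketch. The Gaussian kernel, the scaling `K + lam * N * I` of the linear system, and the names `gaussian_kernel`, `krr_fit` and `krr_predict` are assumptions made for illustration; they are not taken from the paper's displays (1) and (2).

```python
import numpy as np


def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B (illustrative choice)."""
    sq_dist = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))


def krr_fit(X, y, lam, sigma=1.0):
    """Solve (K + lam * N * I) alpha = y for the dual coefficients (assumed scaling)."""
    N = X.shape[0]
    K = gaussian_kernel(X, X, sigma)  # N x N matrix: memory quadratic in N
    return np.linalg.solve(K + lam * N * np.eye(N), y)  # dense solve: time cubic in N


def krr_predict(X_train, alpha, X_test, sigma=1.0):
    """Evaluate f(x) = sum_i alpha_i K(x_i, x) at the test points."""
    return gaussian_kernel(X_test, X_train, sigma) @ alpha


# Toy data: a noisy sine curve on [0, 1].
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 1))
y = np.sin(2.0 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=200)
alpha = krr_fit(X, y, lam=1e-3, sigma=0.2)
train_mse = float(np.mean((krr_predict(X, alpha, X, sigma=0.2) - y) ** 2))
```

The dense `N x N` kernel matrix and the direct solve are exactly the two steps that become prohibitive as the sample size grows, which motivates the partitioning and random-feature approximations of the next subsection.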
2.2 Distributed KRR with Random Features (DKRR-RF)
Assume that the kernel has an integral representation
[TABLE]
where is a probability space and . We define analogous operators for the constructed kernel to approximate the primal kernel in (3) with its corresponding random features via Monte Carlo sampling
[TABLE]
where are sampled w.r.t .
Let the training set be randomly partitioned into disjoint subsets with . The local estimator on the subset is defined as
[TABLE]
where the estimator is . It admits a closed-form solution
[TABLE]
where . Note that for -th subset , it holds and The average of local estimators (6) yields a global estimator
[TABLE]
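The pipeline described above (shared random features, a ridge estimator on each partition, and an average of the local estimators) can be sketched as follows. This is a hedged illustration rather than the authors' implementation: random Fourier features for a Gaussian kernel, the size-weighted average, the normalization `Z^T Z / n + lam * I`, and the names `rff_map`, `local_ridge` and `dkrr_rf_fit` are all assumptions.

```python
import numpy as np


def rff_map(X, W, b):
    """Random Fourier features sqrt(2/M) * cos(XW + b) (assumed Gaussian kernel)."""
    M = W.shape[1]
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)


def local_ridge(Z, y, lam):
    """Ridge regression in feature space on one partition."""
    n, M = Z.shape
    return np.linalg.solve(Z.T @ Z / n + lam * np.eye(M), Z.T @ y / n)


def dkrr_rf_fit(X, y, m, M, lam, sigma=1.0, seed=0):
    """Fit m local estimators on disjoint partitions and average them."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # The same random features are shared by all partitions so that the
    # local weight vectors live in the same feature space and can be averaged.
    W = rng.normal(scale=1.0 / sigma, size=(d, M))
    b = rng.uniform(0.0, 2.0 * np.pi, size=M)
    parts = np.array_split(rng.permutation(X.shape[0]), m)
    w_locals = np.stack([local_ridge(rff_map(X[idx], W, b), y[idx], lam) for idx in parts])
    sizes = np.array([len(idx) for idx in parts], dtype=float)
    w_bar = np.average(w_locals, axis=0, weights=sizes)  # size-weighted average
    return W, b, w_bar


# Toy data: a noisy sine curve on [0, 1], split over 4 partitions.
rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 1))
y = np.sin(2.0 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=2000)
W, b, w_bar = dkrr_rf_fit(X, y, m=4, M=100, lam=1e-4, sigma=0.2)
X_test = np.linspace(0.0, 1.0, 200)[:, None]
f_test = np.sin(2.0 * np.pi * X_test[:, 0])
test_mse = float(np.mean((rff_map(X_test, W, b) @ w_bar - f_test) ** 2))
```

Each machine only solves an `M x M` system and communicates an `M`-dimensional vector, which is why combining linear estimators in the feature space is cheaper than combining kernel estimators in the RKHS.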
3 Theoretical Assessment
In this section, we present the theoretical analysis on the generalization performance of kernel ridge regression with divide-and-conquer and random features.
The generalization ability of a regression predictor is measured in terms of the expected risk
[TABLE]
In this case, the target regression minimizes the expected risk over all measurable functions . The generalization ability of a KRR estimator is measured by the excess risk, i.e. , where is the square integral Hilbert space with respect to the marginal distribution on the input space .
3.1 Assumptions
We first introduce two standard assumptions, which are also used in statistical learning theory [12, 13, 9].
Assumption 1 (Random features are continuous and bounded).
Assume that is continuous and there is a , such that .
Assumption 2 (Moment assumption).
Assume there exists and , such that for all with ,
[TABLE]
According to Assumption 1, the kernel is bounded by . The moment assumption on the output holds when is bounded, sub-gaussian or sub-exponential. Assumptions 1 and 2 are standard in the generalization analysis of KRR, always leading to the learning rate [12] in general cases.
Definition 1 (Integral operators).
, the integral operators are defined by the kernel and the random features , respectively
[TABLE]
Definition 2 (Effective dimension).
The effective dimension of the RKHS induced by the kernel is defined as
[TABLE]
The effective dimension is used to measure the complexity of the RKHS , and its empirical counterpart is also called the degrees of freedom [26]. Similarly, we define the effective dimension for the random feature mapping to measure the size of the approximate RKHS , which is induced by the finite-dimensional random features .
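As a rough illustration of the empirical counterpart mentioned above, the sketch below computes the degrees of freedom from a kernel matrix. It assumes the standard form Tr(K_n (K_n + lam * n * I)^{-1}), evaluated through the eigenvalues of K_n / n; the function name `effective_dimension` and the Gaussian kernel are illustrative choices, not part of the paper.

```python
import numpy as np


def effective_dimension(K, lam):
    """Empirical effective dimension: sum_i mu_i / (mu_i + lam), mu_i eigenvalues of K / n."""
    n = K.shape[0]
    mu = np.clip(np.linalg.eigvalsh(K / n), 0.0, None)  # clip tiny negative round-off
    return float(np.sum(mu / (mu + lam)))


# Gaussian kernel matrix on a grid of 100 points in [0, 1].
x = np.linspace(0.0, 1.0, 100)[:, None]
K = np.exp(-((x - x.T) ** 2) / (2.0 * 0.2**2))
dims = [effective_dimension(K, lam) for lam in (1e-1, 1e-2, 1e-3)]
```

The quantity grows as the regularization parameter shrinks, and it grows more slowly when the eigenvalues decay faster, which is the behaviour the capacity assumption below quantifies.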
Assumption 3 (Capacity assumption).
Assume there exists and , such that for any
[TABLE]
Assumption 4 (Regularity assumption).
Assume there exists , , and , such that
[TABLE]
where is the target regression, and the operator denotes the -th power of the integral operator , thus it is also a positive trace class operator.
Assumption 3 holds when the eigenvalues of the integral operator have a polynomial decay [9, 20]. Thus, faster convergence rates are derived when the eigenvalues decay faster, i.e. when approaches [math], while corresponds to the capacity-independent case. Assumption 4 (source condition) controls the regularity of the target function . The larger is, the more regular the regression function and the easier the learning problem. Both assumptions are widely used in the optimal theory for KRR [13, 9, 14].
3.2 General Results with Fast Rates
One can prove the optimal generalization guarantees for DKRR-RF by combining the theories in KRR-DC [4] and KRR-RF [9]. The attainable case requires the existence of , such that almost surely [27], which is widely used in KRR and its variants including distributed KRR and random features based KRR [13, 9, 14].
Theorem 1.
Under Assumptions 1, 2, 3 and 4, if , and then
[TABLE]
are enough to guarantee, with a high probability, that
[TABLE]
The learning rate stated in Theorem 1 is optimal in a minimax sense for KRR approaches [13]. Distributed KRR methods have obtained the same optimal error bounds under a stronger condition on the number of partitions, such as KRR-DC [4, 15] with . In particular, for the general case , the number of local processors becomes a constant that is independent of the sample size . The time complexity of DKRR-RF is and the space complexity is ; we report the computational complexities of Theorem 1 in Figure 1.
Remark 1.
The general results in Theorem 1 have three fatal drawbacks: 1) the above bound is only suitable for the attainable case and fails to apply to the non-attainable case induced by more complicated problems; 2) random features generated via Monte Carlo sampling are data-independent, which requires many more features than data-dependent feature generation; 3) the constraint on the number of partitions is too strict, leading to a constant number of partitions when is close to .
3.3 Refined Results in the Non-attainable Case
Theorem 2.
Under Assumptions 1, 2, 3 and 4, if , , and then the number of partitions corresponding to
[TABLE]
and the number of random features satisfying
[TABLE]
are enough to guarantee, with a high probability, that
[TABLE]
Compared to Theorem 1, Theorem 2 allows more partitions and extends the optimal learning guarantees to the non-attainable case where the true regression does not lie in the RKHS . Thus, it achieves significant improvements in both computational efficiency and statistical guarantees. With the same optimal learning rates, Theorem 2 relaxes the restriction on from to , which allows more partitions and relaxes the constraints from to . When , the number of random features increases as approaches zero, because moves far away from when is near zero. When , we obtain the same order of the number of random features as KRR-RF [9], which is continuous to at the critical points . Compared to Figure 1, Figure 2 illustrates that Theorem 2 not only enlarges the applicable cases but also improves the computational efficiency.
Remark 2.
Theorem 2 extends the optimal generalization theories from only the attainable case to the non-attainable case , which includes a part of the difficult problems . However, there are also many cases satisfying in the non-attainable case , where the optimal learning guarantees in Theorem 2 are no longer valid. Inspired by [28], we employ additional unlabeled samples to relax the restriction in Section 3.5.
3.4 Fewer Features with Data-dependent Sampling
Assumption 5 (Compatibility assumption).
Define the maximum effective dimension as
[TABLE]
Assume there exists and , such that
[TABLE]
Using the definition of , we characterize the lower bounds for :
[TABLE]
Compared to the (average) effective dimension used in Assumption 3, the maximum effective dimension offers a finer-grained estimate for the capacity of the RKHS [29, 9, 30], which often leads to sharper estimates for the related quantities. Using the compatibility assumption, we relax the constraints on the dimension of random features and the number of partitions by generating features in a data-dependent manner, as shown in [30, 31, 20].
Theorem 3.
Under the same assumptions as Theorem 2 together with Assumption 5, if , , and then the number of partitions satisfying
[TABLE]
and the number of random features satisfying
[TABLE]
are sufficient to guarantee, with a high probability, that
[TABLE]
The learning rates of the above theorem are optimal, the same as in Theorem 2. While achieving the same optimal learning rates, Theorem 3 reduces the computational costs by using fewer random features. The number of required random features is reduced from $\mathcal{O}\big(N^{\frac{1}{2r+\gamma}}\big)$ to $\mathcal{O}\big(N^{\frac{\alpha}{2r+\gamma}}\big)$ when and from $\mathcal{O}\big(N^{\frac{(2r-1)\gamma+1}{2r+\gamma}}\big)$ to $\mathcal{O}\big(N^{\frac{(2r-1)\gamma+1+2(r-1)(1-\alpha)}{2r+\gamma}}\big)$ when , where the term . We report the applicable area and computational complexities of Theorem 3 in Figure 3. It shows that the use of data-dependent sampling significantly reduces both the time and space complexities. The settings near the border line no longer have the same computational complexities as exact KRR.
Remark 3.
From Theorem 1 in [20], we find that the requirement on the data-dependent random features is bounded as , where . The condition is the same as Theorem 3 in the non-attainable case and milder than Theorem 3 in the attainable case . However, the theoretical analysis provided in [20] only pertains to the general case and obtains error bounds with the convergence rate .
Remark 4.
According to the definition of , the sampling probability of random features is independent of data, which leads to a pessimistic estimate of . However, generating random features in a data-dependent manner relaxes the estimate of closer to . A theoretical example of data-dependent random features was given in Example 2 [9], which guarantees (such that ) by constructing random features generated in a data-dependent way. In practice, leverage sampling algorithms were proposed to obtain data-dependent random features [20], where is close to . To intuitively illustrate the improvement of data-dependent random features, we boldly assume by generating data-dependent random features.
3.5 More Partitions with Unlabeled Data
In this part, we introduce the additional unlabeled samples to relax this restriction further. We consider the merged dataset on the -th processor, with
[TABLE]
Let and . We define semi-supervised kernel ridge regression with divide-and-conquer and random features by
[TABLE]
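One way to read the semi-supervised local estimator is sketched below, in feature space. It assumes the merged labels are the rescaled labels `(n_star / n) * y` on labeled points and zero on unlabeled points, a construction used in the distributed semi-supervised KRR literature; whether this matches the paper's exact definition is an assumption, and the name `semi_supervised_local_ridge` is hypothetical.

```python
import numpy as np


def semi_supervised_local_ridge(Z_lab, y_lab, Z_unl, lam):
    """Local ridge estimator on a merged partition of labeled and unlabeled features."""
    n = Z_lab.shape[0]
    n_star = n + Z_unl.shape[0]
    Z_all = np.vstack([Z_lab, Z_unl])
    # Assumed merged labels: rescaled on labeled points, zero on unlabeled ones.
    y_star = np.concatenate([(n_star / n) * y_lab, np.zeros(Z_unl.shape[0])])
    M = Z_all.shape[1]
    cov_all = Z_all.T @ Z_all / n_star  # label-free term: uses all inputs
    return np.linalg.solve(cov_all + lam * np.eye(M), Z_all.T @ y_star / n_star)


# Toy check on random features: 50 labeled and 150 unlabeled points.
rng = np.random.default_rng(0)
Z_lab = rng.normal(size=(50, 8))
Z_unl = rng.normal(size=(150, 8))
y_lab = rng.normal(size=50)
w = semi_supervised_local_ridge(Z_lab, y_lab, Z_unl, lam=1e-2)
```

Under this assumed construction the label-dependent term reduces to `Z_lab.T @ y_lab / n`, so the unlabeled inputs only sharpen the empirical covariance, which is the label-free quantity that limits the number of partitions.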
Theorem 4.
Under the same assumptions as Theorem 3, if and then the total number of samples corresponding to
[TABLE]
the number of local processors satisfying
[TABLE]
and the number of random features satisfying
[TABLE]
are sufficient to guarantee, with a high probability, that
[TABLE]
To the best of our knowledge, we prove for the first time that the number of partitions can achieve , while the existing constraints on in the existing work [10, 22] are . Hence, many more partitions are allowed in distributed KRR methods. The relaxation of the condition on the partition number not only leads to better computational efficiency but also covers more difficult problems, where the suitable problems are enlarged from the situation to the situation . Figure 4 reveals the advantages of DKRR-RF with unlabeled data. Theorem 4 provides not only the largest applicable area but also the highest computational efficiency, owing to more partitions.
Remark 5.
From the error decomposition, there are two error terms related to the number of partitions : sample variance and empirical error. Sample variance depends on the number of labeled samples , while empirical error is input-dependent but output-independent; thus, it is related to the number of total samples . Meanwhile, the similarity between empirical and expected covariance operators is also label-free, and thus it is related to the total sample size rather than . To achieve the optimal learning rates, we consider the constraints on both the required labeled samples and the total samples . Considering both conditions for supervised learning and semi-supervised learning , we then obtain two constraints on the number of partitions and consolidate them together.
4 Compared with Related Work
The existing optimal learning guarantees of KRR [13], KRR-DC [14, 15] and KRR-RF [9, 22] only apply to the attainable case . In this paper, we apply the optimal generalization error bounds to the non-attainable case with some restrictions, including in Theorem 2 and in Theorem 4. Using refined estimation, we extend the random features error to the non-attainable case.
4.1 Applicable Area from to
The key to obtaining the optimal learning rates with the integral-operator approach is to bound the identity as a constant, where and are the expected and empirical covariance operators defined in Definition 4. Conventional distributed KRR analyses [4, 28] estimate the operator difference after a first-order (or second-order) decomposition
[TABLE]
To bound the identity as a constant, the local sample size should be large enough . This holds for KRR-DC and only applies to . However, this paper directly estimates the identity as a whole (rather than in parts after decomposition) based on concentration inequalities for self-adjoint operators and obtains
[TABLE]
To bound the identity as a constant, the local sample size only needs , which is smaller than that of [14] with . Therefore, our estimation of in Theorem 2 is tighter than that in [14]. To bound the identity as a constant, we then have , which is the key to obtaining more partitions and extending the optimal learning guarantees to the non-attainable case .
4.2 Applicable Area from to
Only sample variance is dependent on the labeled samples, while other error terms involving the estimate of are label-free. Thus, there are two restrictions on the number of partitions : sample variance (label-dependent) and the estimate of (label-free).
As shown in the proof of Theorem 3, the global sample variance (label-dependent) can be estimated as
[TABLE]
To achieve the optimal learning rates , the number of partitions should satisfy . Then, we utilize additional unlabeled samples to relax the condition on the estimate of . Using Assumption 5, one can further relax the condition of due to
[TABLE]
To guarantee that the key quantity is a constant, we have . We then consider the dominant constraints:
- The case . It holds , thus the number of partitions is .
- The case . It holds and we make use of additional unlabeled examples to guarantee .
4.3 Random Features Error in the Non-attainable Case
Using an appropriate decomposition at the operator level, we derive the random features error for both the attainable and non-attainable cases, where the dimension of random features should satisfy for the non-attainable case . The extension from the attainable case to the non-attainable case is non-trivial, because the non-attainable case requires refined estimates of the operators' similarity.
The operatorial definitions of the intermediate estimators , and in Lemma 1 involve the true regression , where (under Assumption 4) is related to the range of . Thus, we estimate the last three error terms (empirical error , random features error and approximation error ) that involve , and for the non-attainable case. Meanwhile, because the empirical error satisfies and the approximation error naturally applies to the non-attainable case, only the random features error needs to be specifically estimated for the non-attainable case.
5 Experiments
To validate the theoretical findings, we conduct experiments on both simulated data and real-world data. In the numerical experiments, we study the computational and statistical tradeoffs of DKRR-RF, KRR-DC, KRR-RF, and KRR. In the real-world experiments, we first explore the effectiveness of data-dependent random features and additional unlabeled samples on a small real-world dataset. Then, we compare the statistical performance of DKRR-RF, KRR-DC, and KRR-RF on three large-scale real-world datasets w.r.t. the number of random features and the number of partitions .
5.1 Numerical Experiments (for Theorem 2)
In this section, to validate our theoretical findings, we perform experiments on simulated data. From Theorem 2, we find that the learning rates become slower as the ratio increases, which is
[TABLE]
As the ratio increases, the hardness of the problem increases. Thus, given a fixed , a smaller leads to a slower convergence rate of the generalization error bounds. As decreases from to near zero, the learning rates are in the range . Inspired by the numerical experiments in [9, 32], we introduce the spline kernel of order ; see [33] (Eq. 2.1.7) for more details
[TABLE]
More importantly, the spline kernels naturally construct random features for any
[TABLE]
Using the following settings, we perform experiments on both easy and difficult problems:
1. Input distribution: and is the uniform distribution.
2. Output distribution: the target function with a variance .
3. Kernel and random features: . According to (3) and (13), with sampled i.i.d. from the uniform distribution. The random features of the spline kernel are
[TABLE]
Then, the conditions used in Theorem 2 are satisfied [9], including Assumptions 3 and 4 with and no unlabeled data. As shown in Figure 5 (a), a smaller ratio leads to a smoother curve, which corresponds to an easier problem. We explore regression problems of different difficulties through different settings of and .
According to the target regression and a variance , the training data is generated with various sample sizes and samples for testing. To study the difference between the simulated excess risk and the theoretical excess risk, we repeat the data generation and the training times and estimate the averaged excess risk on the testing data. In each run, we perform DKRR-RF (7), KRR-DC [14], KRR-RF [9] and KRR [13] and evaluate both the statistical performance (mean squared error, MSE) and the computational cost (training time). Meanwhile, according to Theorem 2, we set , and , where is an estimation of the constant .
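Comparing an empirical learning rate with a theoretical one can be sketched as a log-log least-squares fit. The helper names `empirical_rate` and `theoretical_rate` are hypothetical, the error values below are synthetic rather than results from the paper, and the exponent 2r / (2r + gamma) is the standard minimax form under source and capacity conditions, assumed here to be the one the theorems refer to.

```python
import numpy as np


def empirical_rate(sample_sizes, errors):
    """Estimate a in error ~ C * N^(-a) from the slope of log(error) against log(N)."""
    slope, _intercept = np.polyfit(np.log(sample_sizes), np.log(errors), deg=1)
    return float(-slope)


def theoretical_rate(r, gamma):
    """Assumed minimax exponent 2r / (2r + gamma)."""
    return 2.0 * r / (2.0 * r + gamma)


# Synthetic errors that decay exactly like N^(-0.5), for illustration only.
Ns = np.array([1000.0, 2000.0, 4000.0, 8000.0])
errs = 3.0 * Ns ** (-0.5)
rate = empirical_rate(Ns, errs)
```

In practice the errors would be the averaged excess risks over the repeated runs, and the fitted slope would be compared with `theoretical_rate(r, gamma)` for the chosen problem difficulty.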
5.1.1 Easy Problem
The easy problem with the learning rate is given by setting , where the target function is and leads to a smooth curve in Figure 5 (b). Figure 5 (b) illustrates that the problem is easy, and a small number of training samples () is enough to fit the target curve perfectly.
The left of Figure 6 shows the empirical learning rate of DKRR-RF is very close to the theoretical rate for the benign case . From the middle of Figure 6, we find that the empirical MSE of KRR, KRR-DC, KRR-RF, and DKRR-RF are the same and extremely small, where the target problem is easy and even the approximate methods achieve the same optimal learning performance as the exact KRR.
5.1.2 General Problem
The general problem with the learning rate is given by setting , which is regarded as the worst case for the attainable setting and is well studied in [9, 22]. The target function is . Figure 5 (c) shows that the curve becomes sharp and thus the problem is of medium difficulty. With only a few samples the fit is noisy, and more samples are needed to achieve a perfect fit.
The left of Figure 7 demonstrates the empirical error of DKRR-RF converges at near the expected rate . The comparison of MSE in the middle of Figure 7 shows the errors of DKRR-RF are mainly due to more partitions rather than random features. The empirical performance for the general problem in Figure 7 (b) is much worse than the easy problem in Figure 6. The gap of test errors between distributed methods and centralized methods is negligible. The right of Figure 7 shows the training time of DKRR-RF is higher than KRR-DC when more random features are used.
5.1.3 Difficult Problem
The difficult problem with the learning rate is given by setting , which is almost impossible to learn. According to Figure 5 (d), we find that provides the difficult problem where the curve steepens rapidly near [math] or . Even a large number of training samples is unable to fit the curve perfectly.
From the left of Figure 8, we can see that the learning rate is near , so the target function is hard to learn. The middle and right of Figure 8 illustrate that the errors and training times are similar for KRR, KRR-RF, KRR-DC, and DKRR-RF. Compared with the test errors of the easy problem in Figure 6 (b) and the general problem in Figure 7 (b), the MSE of the difficult problem in Figure 8 (b) is much higher and the performance gap between distributed learning and centralized learning is significant. Therefore, distributed learning approaches are not suitable for difficult problems when approaches zero, which coincides with the theoretical findings in Theorem 2.
5.2 Influence of data-dependent random features (for Theorem 3)
Inspired by the leverage weighted random Fourier features [34, 20, 35], we propose leverage weighted random features (not just for shift-invariant kernels). Based on (4), the data-dependent random features are defined as
[TABLE]
where . Using the ideal matrix to replace the kernel matrix, we obtain the following leverage score function
[TABLE]
where \mathbf{z}_{\boldsymbol{w}_{i}}(\boldsymbol{X})=1/\sqrt{L}\,\big[\psi(\boldsymbol{x}_{1},\boldsymbol{w}_{i}),\cdots,\psi(\boldsymbol{x}_{n},\boldsymbol{w}_{i})\big]^{\top} and . Removing the data-independent terms, there holds
[TABLE]
and
[TABLE]
The time complexity of generating data-dependent random features is on a global machine, while it can be further reduced by computing in local machines and as a part of data preprocessing. Then, we re-sample features from using the multinomial distribution given by . Then, using (14), we compute the data-dependent random features.
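The resampling procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the exact score in the equations above is not reproduced here, so the sketch assumes the common approximate ridge leverage score, the diagonal of G (G + n*lam*I)^{-1} with G = Z^T Z, and assumes random Fourier features for a Gaussian kernel. The function names, the pool size `L`, and the importance weights `1/sqrt(L * q_i)` are illustrative assumptions.

```python
import numpy as np

def rff_pool(X, L, sigma, rng):
    """Draw a pool of L plain random Fourier features (Gaussian kernel assumed)."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, L))   # frequencies w_i
    b = rng.uniform(0.0, 2.0 * np.pi, size=L)        # phases b_i
    Psi = np.sqrt(2.0) * np.cos(X @ W + b)           # psi(x_j, w_i), shape (n, L)
    return W, b, Psi

def approx_leverage_probs(Psi, lam):
    """Assumed approximate leverage scores of the pooled features, normalized."""
    n, L = Psi.shape
    Z = Psi / np.sqrt(L)                             # columns play the role of z_{w_i}(X)
    G = Z.T @ Z                                      # L x L matrix
    scores = np.diag(G @ np.linalg.inv(G + n * lam * np.eye(L)))
    scores = np.clip(scores, 1e-12, None)            # guard against rounding below zero
    return scores / scores.sum()

def resample_features(W, b, probs, M, rng):
    """Multinomial resampling of M features from the pool."""
    L = len(probs)
    idx = rng.choice(L, size=M, replace=True, p=probs)
    weights = 1.0 / np.sqrt(L * probs[idx])          # importance weights (assumption)
    return W[:, idx], b[idx], weights

def leverage_features(X, W_sel, b_sel, weights):
    """Reweighted feature map built from the resampled frequencies."""
    M = len(b_sel)
    return np.sqrt(2.0 / M) * np.cos(X @ W_sel + b_sel) * weights
```

The score computation costs one L x L inverse on top of forming Z^T Z, which is why it is natural to treat it as a preprocessing step.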
To validate Theorem 3, we compare the empirical performance of the following methods:
- Leverage RF with : the proposed approximate leverage weighted random features (14) without distributed learning, similar to [35].
- Leverage RF with : the proposed approximate leverage weighted random features (14) with partitions, a.k.a. DKRR-RF in Theorem 3.
- Plain RF with : the exact random features with Monte Carlo sampling (4), which is KRR-RF given in [9].
- Plain RF with : the exact random features with Monte Carlo sampling (4) with partitions, namely DKRR-RF defined in Theorem 2.
For different values of , we run the different random feature generating algorithms on the EEG dataset to evaluate the test accuracies, time costs for generating random features, and time costs for training. Figure 9 reports the mean and one standard deviation of test accuracies, time costs for generating random features, and time costs for training versus different settings of . We use the Gaussian kernel in experiments and the corresponding probability density function . The kernel parameter and the regularization parameter are tuned via -fold cross-validation over grids of and . The statistical information of the dataset and hyperparameter settings are reported in Table 1. From Figure 9, we find that:
- 1) Centralized learning with data-dependent random features (blue) achieves the best accuracy but also incurs the highest computational cost for training the model on a centralized machine, while the generating times for centralized learning and distributed learning are the same.
- 2) With a slight increase in random feature generating time in Figure 9 (b), the two data-dependent approaches (blue and red) bring significant improvements in classification accuracy as the number of random features increases in Figure 9 (a), which validates the effectiveness of data-dependent random features.
- 3) The use of data-dependent feature generating approaches does not sacrifice much computational efficiency, as shown in Figure 9 (b). Meanwhile, in Figure 9 (c), distributed learning dramatically improves training efficiency.
- 4) At the cost of slightly more generating time and similar training time, data-dependent DKRR-RF (red) markedly surpasses data-independent DKRR-RF (purple), which reveals the superiority of Theorem 3 over Theorem 2.
Overall, data-dependent DKRR-RF achieves a good tradeoff on accuracy and efficiency, which coincides with the theoretical findings in Theorem 3.
5.3 Influence of Unlabeled Data (For Theorem 4)
To validate Theorem 4, we split the EEG dataset into three parts: examples as the labeled training data, examples as the unlabeled training data, and examples as the test data, as summarized in Table 1. There are two compared methods:
- Semi-supervised DKRR-RF (defined in Theorem 4) is constructed as (10), where the unlabeled examples are marked as zeros.
- Supervised DKRR-RF (defined in Theorem 3) only uses the labeled samples.
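The zero-labelling of unlabeled examples can be illustrated with a small sketch. Equation (10) is not reproduced here, so the rescaling of the labeled targets by (total size / labeled size) is an assumption borrowed from the usual semi-supervised divide-and-conquer construction; the function names are hypothetical. Under that assumption, the label term of the ridge solution is unchanged by the padding, while the covariance term is estimated from all inputs.

```python
import numpy as np

def semi_supervised_labels(y_labeled, n_unlabeled):
    """Pseudo-labels for the pooled sample: unlabeled points are marked as zero."""
    n_lab = len(y_labeled)
    n_tot = n_lab + n_unlabeled
    # Assumption: labeled targets are rescaled by n_tot / n_lab so that the
    # averaged label term Z^T y / n is the same as in the supervised case.
    return np.concatenate([(n_tot / n_lab) * y_labeled, np.zeros(n_unlabeled)])

def ridge_rf(Z, y, lam):
    """Ridge regression in feature space: (Z^T Z / n + lam I)^{-1} Z^T y / n."""
    n, M = Z.shape
    return np.linalg.solve(Z.T @ Z / n + lam * np.eye(M), Z.T @ y / n)
```

With features `Z_all` stacked as labeled rows followed by unlabeled rows, `ridge_rf(Z_all, semi_supervised_labels(y, n_unlabeled), lam)` gives the semi-supervised weights on one partition; only the covariance estimate differs from the supervised fit.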
In the following experiments, we fix the labeled sample size , the number of random features and the number of partitions . In Figure 10, we run the compared methods across trials to plot the mean and one standard deviation of test accuracies, time costs for random feature generation, and time costs for training under varying unlabeled sample size . Figure 10 illustrates that:
- 1) The use of additional unlabeled samples improves the empirical performance of DKRR-RF. In other words, we can increase the number of partitions without losing accuracy by using additional unlabeled samples. This is consistent with the theoretical finding in Theorem 4 that additional unlabeled examples can relax the restriction on .
- 2) Increasing the number of unlabeled examples aggravates the computational burden. It is worth balancing the accuracy gains against the computational costs for different sizes of unlabeled data. Generating time is larger than training time, so generating time dominates the total time cost.
In Figure 11, we fix the unlabeled sample size as and run the semi-supervised and supervised methods with different numbers of partitions . Figure 11 reports the mean and one standard deviation of test accuracies and training times under different partitions, which shows:
- 1) Both test accuracies decrease as the number of partitions increases, and the accuracy gap between semi-supervised DKRR-RF and supervised DKRR-RF grows.
- 2) Both training times drop as the number of partitions increases, but there are no further significant computational gains when the number of partitions is greater than . The generating times of both semi-supervised DKRR-RF and supervised DKRR-RF are almost constant and play a dominant role. Semi-supervised DKRR-RF costs much more time for generating data-dependent random features than supervised DKRR-RF.
- 3) Semi-supervised DKRR-RF (Theorem 4) always provides better empirical performance than supervised DKRR-RF (Theorem 3), while their training times are similar when .
Therefore, semi-supervised DKRR-RF achieves a good balance between the test accuracy and the training time at .
5.4 Large-scale Real Data
As listed in Table 1, we study the empirical performance of the DKRR-RF algorithm on three large-scale binary classification datasets: covtype (https://archive.ics.uci.edu/ml/datasets/covertype), SUSY (https://archive.ics.uci.edu/ml/datasets/susy), and HIGGS (https://archive.ics.uci.edu/ml/datasets/higgs). For the sake of comparison, we randomly sample data points as the training data and data points as the test data. We use random Fourier features [8] to approximate the Gaussian kernel . Random Fourier features are in the form , where is drawn from the corresponding Gaussian distribution and is drawn from the uniform distribution . In the following experiments, we tune parameters and via -fold cross-validation over grids of and respectively for each dataset, and report average errors over 10 repetitions.
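As a hedged illustration of the feature map described above, the sketch below uses the standard random Fourier feature construction for the Gaussian kernel. The specific parameterization is an assumption, since the formulas are not reproduced here: K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), frequencies drawn from N(0, I / sigma^2), and phases drawn from the uniform distribution on [0, 2 pi].

```python
import numpy as np

def random_fourier_features(X, M, sigma, rng):
    """Map X (n, d) to M features sqrt(2/M) * cos(w^T x + b)."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, M))   # Gaussian frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=M)        # uniform phases
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)

def gaussian_kernel(X, Y, sigma):
    """Exact Gaussian kernel matrix between rows of X and rows of Y."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / (2.0 * sigma**2))

# Inner products of the features approximate the kernel as M grows.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
Z = random_fourier_features(X, M=20000, sigma=2.0, rng=rng)
approx_error = np.max(np.abs(Z @ Z.T - gaussian_kernel(X, X, sigma=2.0)))
```

The Monte Carlo error of each kernel entry decays roughly like 1/sqrt(M), which is the trade-off the thresholds on the number of random features reflect.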
The difficulties of those tasks are unknown ( and are unknown), so for each dataset we evaluate the classification errors in two cases:
- 1) For fixed and different , we compare the empirical performance of DKRR-RF and KRR-DC [14].
- 2) For fixed and different , we compare the empirical performance of DKRR-RF and KRR-RF [9].
To explore how the number of random features affects the classification accuracy, we fix the number of partitions as and vary the number of RFs. As shown in the left plots of Figures 12, 13, 14, when the number of RFs is small, the classification errors of DKRR-RF decrease dramatically as the number of RFs increases. However, once the number of features exceeds a certain threshold, the classification errors of DKRR-RF level off near the classification error of KRR-DC. The threshold is near for covtype, for SUSY, and for HIGGS. According to Theorem 4, the number of random features is around , thus smaller thresholds represent smaller and lead to higher computational efficiency.
To study the influence of partitions on the accuracy, we fix the number of RFs as and increase the number of partitions . As demonstrated in the right plots of Figures 12, 13, 14, when the number of partitions is below a certain threshold, DKRR-RF provides classification accuracy close to that of KRR-RF. Beyond that threshold, errors increase quickly as the number of partitions grows. The threshold is near for covtype, for SUSY and for HIGGS. According to Theorem 4, the number of partitions is near , thus a larger admissible number of partitions corresponds to a larger and lower computational costs.
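The one-shot divide-and-conquer scheme behind these experiments can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: each partition solves ridge regression on an already computed random feature matrix `Z`, and the global estimator is the plain average of the local weight vectors. The function names and the even split via `np.array_split` are illustrative choices.

```python
import numpy as np

def local_ridge(Z, y, lam):
    """Local estimator on one partition: (Z^T Z / n + lam I)^{-1} Z^T y / n."""
    n, M = Z.shape
    return np.linalg.solve(Z.T @ Z / n + lam * np.eye(M), Z.T @ y / n)

def dkrr_rf_fit(Z, y, m, lam):
    """Average the m local ridge solutions computed on disjoint partitions."""
    parts = np.array_split(np.arange(Z.shape[0]), m)
    return np.mean([local_ridge(Z[p], y[p], lam) for p in parts], axis=0)

def predict(Z_new, w):
    """Prediction is linear in the random features."""
    return Z_new @ w
```

With `m = 1` the routine reduces to centralized ridge regression on random features, so sweeping `m` on a fixed `Z` reproduces the kind of partition study described above on toy data.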
Indeed, when and are set to the corresponding thresholds for each dataset, DKRR-RF achieves the optimal tradeoff between empirical performance and computational efficiency, for example for covtype, for SUSY, and for HIGGS. We expect that the optimal learning rates can still be achieved when the number of partitions is smaller than the corresponding threshold and the number of random features is larger than the corresponding threshold. Nevertheless, more partitions or fewer random features break the optimal statistical properties of DKRR-RF, and thus the performance drops very fast. Comparing the thresholds for and on those tasks, we find the following relationships on trainability: SUSY covtype HIGGS, which means that a good accuracy-efficiency tradeoff for DKRR-RF is easier to obtain on the SUSY dataset.
6 Conclusion
This paper explores the generalization performance of kernel ridge regression with two commonly used efficient large-scale techniques: divide-and-conquer and random features. We first present a general result with the optimal learning rates under standard assumptions. We then refine the theoretical results with more partitions and applicability in the non-attainable case. Further, we reduce the number of random features by generating features in a data-dependent manner. Finally, we present the theoretical results that substantially relax the constraint on the number of partitions with extra unlabeled data, which apply to both the attainable case and non-attainable case. The proposed optimal theoretical guarantees are state-of-the-art in the theoretical analysis for KRR approaches. With extensive experiments on both simulated and real-world data, we validate our theoretical findings with experimental results.
This paper can be extended in several ways: (a) combining with gradient algorithms such as multi-pass SGD [36, 10] and preconditioned conjugate gradient [34] to further reduce the time complexity; (b) using asynchronous distributed methods or a few rounds of communication [21, 22] instead of the one-shot approach to alleviate the saturation phenomenon when .
Acknowledgment
This work was supported in part by the Excellent Talents Program of Institute of Information Engineering, CAS, the Special Research Assistant Project of CAS, the Beijing Outstanding Young Scientist Program (No. BJJWZYJH012019100020098), and National Natural Science Foundation of China (No. 62076234, No. 62106257).
Appendix A Proofs
We denote by the operator norm; specifically, the norm is used to represent the norm in the estimates of error terms. Moreover, we denote by the operator , where is a bounded self-adjoint linear operator and the identity operator, so for example , , and , where the operators are defined in Definition 3 and Definition 4. The estimates of error bounds are based on the local estimators (18) and (19) rather than global ones, so they depend on the number of local samples , where .
A.1 Definitions of Linear Operators
Since KRR has closed-form solutions, we represent the intermediate estimators in error decomposition by the redirection operators and their adjoint operators. In this part, we first provide useful linear operators associated with kernel (Definition 3) and with random features (Definition 4), respectively. To bound the excess risk , we present the closed-form solutions of the estimators used in Lemma 2 based on those operators, which can be estimated by the difference between integral operator and random features based covariance operator .
To clearly state the relationships among estimators in Lemma 2, we introduce linear operators (both expected and empirical) associated with the RKHS induced by the kernel and the feature space induced by the random features .
Definition 3 (Operators with kernel ).
For any , and , we have
- •
.
- •
\widehat{S}_{\mathcal{H}}:\mathcal{H}\to\mathbb{R}^{n},\quad\widehat{S}_{\mathcal{H}}\beta=\frac{1}{\sqrt{n}}\big(\langle\beta,\phi(\boldsymbol{x}_{i})\rangle\big)_{i=1}^{n}\in\mathbb{R}^{n}.
- •
.
- •
.
- •
.
- •
\mathbf{K}:\mathbb{R}^{n}\to\mathbb{R}^{n},\quad\mathbf{K}=\widehat{S}_{\mathcal{H}}\widehat{S}_{\mathcal{H}}^{*},\quad\text{such that}~~\mathbf{K}=\frac{1}{n}\big(K(\boldsymbol{x}_{i},\boldsymbol{x}_{j})\big)_{i,j=1}^{n}.
- •
.
- •
.
Here, we denote the inclusion operator and the sampling operator, while are their adjoint operators. Note that is the covariance operator given by , and the integral operator given by . The kernel matrix and the covariance matrix are the empirical counterparts of the integral operator and the covariance operator , respectively. The singular value decomposition shows that and have the same eigenvalues, and the corresponding eigenvectors are closely related [37]. A similar relationship holds for the kernel matrix and the covariance matrix . Those kernel-related operators are widely used in the proofs of optimal learning theory for standard KRR. Using Assumption 1, the integral operator and the covariance operator are positive trace class operators (and hence compact) and bounded by . For any function , the estimator is obtained by . Thus, the RKHS norm can be related to the -norm by [38]:
[TABLE]
Definition 4 (Operators with random features).
For any , , and , we have
- •
.
- •
\widehat{S}_{M}:\mathbb{R}^{M}\to\mathbb{R}^{n},\quad\widehat{S}_{M}\beta=\frac{1}{\sqrt{n}}\big(\langle\beta,\phi_{M}(\boldsymbol{x}_{i})\rangle\big)_{i=1}^{n}\in\mathbb{R}^{n}.
- •
.
- •
.
- •
.
- •
\mathbf{K}_{M}:\mathbb{R}^{n}\to\mathbb{R}^{n},\quad\mathbf{K}_{M}=\widehat{S}_{M}\widehat{S}_{M}^{*},\quad\text{such that}~~\mathbf{K}_{M}=\frac{1}{n}\big(K_{M}(\boldsymbol{x}_{i},\boldsymbol{x}_{j})\big)_{i,j=1}^{n}.
- •
.
- •
.
Similarly, we also define the redirection operators and their adjoint operators . Using the random features , the random features based operators and are close to the kernel based operators . and are the integral operator and the covariance operator defined by random features on and , respectively. The kernel matrix is given by the kernel associated with random features . Random features use Monte Carlo sampling to approximate the kernel with features , thus the operators are the expectation counterparts of in terms of the kernel probability density .
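In the finite-dimensional random feature setting, the shared spectrum of the kernel matrix and the covariance matrix is easy to check numerically. The sketch below is only an illustration: it represents the empirical sampling operator by the matrix S = Phi / sqrt(n), following the displayed definitions of \widehat{S}_{M} and \mathbf{K}_{M}, and uses cosine features with arbitrary toy sizes as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, M = 30, 2, 8
X = rng.normal(size=(n, d))
W = rng.normal(size=(d, M))
b = rng.uniform(0.0, 2.0 * np.pi, size=M)
Phi = np.sqrt(2.0 / M) * np.cos(X @ W + b)   # rows are phi_M(x_i)

S = Phi / np.sqrt(n)     # matrix of the empirical sampling operator, R^M -> R^n
K_M = S @ S.T            # n x n kernel matrix with entries K_M(x_i, x_j) / n
C_M = S.T @ S            # M x M empirical covariance matrix

eig_K = np.sort(np.linalg.eigvalsh(K_M))[::-1]   # descending order
eig_C = np.sort(np.linalg.eigvalsh(C_M))[::-1]
```

The M eigenvalues of `C_M` coincide with the M largest eigenvalues of `K_M`, and the remaining n - M eigenvalues of `K_M` are zero, which is why bounds can be moved freely between the two matrices.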
In Figure 15, we discuss the relationships among operators given in Definition 3 and Definition 4, where are self-adjoint integral operators on and are self-adjoint integral operators on RKHS and , respectively. Operators are the corresponding empirical counterparts of . To estimate the error terms in Lemma 2, we utilize those operators to represent the estimators with closed-form solutions. As shown in Figure 15, we measure the excess risk of DKRR-RF by estimating the difference between the covariance matrix and the integral operator following the approximation chain .
Remark 6.
Under Assumption 1, the integral operator is trace class [13] and are finite dimensional. Moreover, we have that , and . Finally, are self-adjoint and positive operators, with spectrum in .
To represent the noise-free estimator (15), we define a sampling operator :
[TABLE]
A.2 Error decomposition
To decompose the excess risk of DKRR-RF clearly, we provide some intermediate estimators , , and , where
[TABLE]
The local estimator in (15) is the noise-free version of , making up a global noise-free estimator . The estimator in (16) is a data-free version (expected version on ) of , which is still an approximation estimator by random features mapping . The last one in (17) is the expected version of primal KRR with implicit feature mappings associated with the kernel by . Using these estimators, we provide the following decomposition of the excess risk in equality form to further analyze the components of the error.
Lemma 1.
Using operators defined in Definition 3 and Definition 4, the intermediate estimators (6), (15), (16) and (17) admit the following closed-form solutions:
[TABLE]
Here, represents the normalized labels of empirical samples.
Proof.
The objective of the local estimator is given in (5). Using the representer theorem , we set the derivative of the objective to zero and get the solution (6) for the local estimator, which is
[TABLE]
According to the definitions of operators in Definition 4, . Using the definitions of and , we obtain
[TABLE]
According to the objective of (15), we replace the noisy labels with noise-free ones . Meanwhile, using instead of , it holds
[TABLE]
Let the derivative of (16) be zero, and we can get with
[TABLE]
Thus, it holds
[TABLE]
Similarly, using operators related to the kernel in Definition 3, we set the derivative of (17) to zero and obtain with
[TABLE]
Thus, it holds
[TABLE]
∎
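The first closed form can be sanity-checked numerically. The displayed solutions are not reproduced here, so the sketch below works under stated assumptions: the local objective is regularized least squares, (1/n) * ||Phi w - y||^2 + lam * ||w||^2, the sampling operator is the matrix S = Phi / sqrt(n), and the normalized labels are y / sqrt(n). The function names are hypothetical.

```python
import numpy as np

def closed_form_weights(Phi, y, lam):
    """Operator form: (C + lam I)^{-1} S^T y_hat with C = S^T S."""
    n, M = Phi.shape
    S = Phi / np.sqrt(n)
    y_hat = y / np.sqrt(n)              # normalized labels
    return np.linalg.solve(S.T @ S + lam * np.eye(M), S.T @ y_hat)

def dual_form_weights(Phi, y, lam):
    """Equivalent form through the kernel matrix: S^T (K + lam I)^{-1} y_hat."""
    n = Phi.shape[0]
    S = Phi / np.sqrt(n)
    K = S @ S.T
    return S.T @ np.linalg.solve(K + lam * np.eye(n), y / np.sqrt(n))

def objective_gradient(Phi, y, lam, w):
    """Gradient of (1/n) * ||Phi w - y||^2 + lam * ||w||^2 (assumed objective)."""
    n = Phi.shape[0]
    return 2.0 * Phi.T @ (Phi @ w - y) / n + 2.0 * lam * w
```

The gradient vanishes at `closed_form_weights(Phi, y, lam)`, and the primal and dual forms agree, which is the push-through identity (S^T S + lam I)^{-1} S^T = S^T (S S^T + lam I)^{-1}.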
We denote for any . According to Assumption 1 and the fact that is a finite measure, we have that almost surely. is the linear combination of , such that almost surely. Since and using definitions of the operators, it holds that and it is natural to estimate the differences of them in the -norm.
Remark 7.
In some KRR theory literature, an estimator and its weight are denoted by the same symbol. The excess risk measures the difference between the RKHS weight and the estimator (which may not belong to the hypothesis space induced by the kernel ), which confuses the error decomposition of the excess risk. In this paper, to cover the situation where is outside the hypothesis space, we measure the difference between rather than . Meanwhile, for the sake of clarity, we define the estimators and estimate the excess risk in the -norm.
Proposition 1.
For any , we use to denote the local samples and their labels on , where and represent the samples and labels on all partitions. For the estimators , and , there holds
[TABLE]
Here, denotes the conditional expectation with respect to given on the -th partition .
Proof.
Using the operator based solutions of the local estimator of DKRR-RF (18), and the local noise-free estimator (19), we have
[TABLE]
Taking the expectation over in terms of , it holds
[TABLE]
that proves the identity (22). ∎
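The identity rests on the fact that the ridge solution is linear in the labels, so averaging over zero-mean label noise with the inputs held fixed recovers the noise-free estimator. A small Monte Carlo sketch of this, with hypothetical toy sizes and a stand-in regression function (both assumptions made only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, M, lam = 50, 5, 0.1
Phi = rng.normal(size=(n, M))        # fixed inputs on one partition
f_clean = np.sin(Phi[:, 0])          # stand-in for the noise-free labels

# Linear map from labels to weights: w(y) = A @ y
A = np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(M), Phi.T / n)

w_clean = A @ f_clean                             # noise-free estimator
eps = rng.normal(scale=0.5, size=(20000, n))      # zero-mean label noise draws
w_noisy = (f_clean + eps) @ A.T                   # one noisy estimator per draw
w_noisy_mean = w_noisy.mean(axis=0)               # empirical conditional mean
max_gap = np.max(np.abs(w_noisy_mean - w_clean))
```

The gap shrinks at the Monte Carlo rate as the number of noise draws grows, while each individual deviation `w_noisy[k] - w_clean` equals `A @ eps[k]` exactly.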
Taking the expectation over the conditional distribution , we first prove the equivalence between the local estimators. We then establish the equivalence relationship between and . Next, we derive relationships between global estimators and local ones to prove the error decomposition in Lemma 2. It is easy to connect the excess risk with the discrepancy between the estimator and the target regression [12]:
[TABLE]
Here, \|f\|_{\rho}=\|f\|_{L^{2}_{\rho_{X}}}=\big(\int_{\mathcal{X}}|f(\boldsymbol{x})|^{2}d\rho_{X}\big)^{1/2}, and is the induced marginal measure on the input space .
Lemma 2.
For any , let and be defined as the above, we have
[TABLE]
Proof.
Using the noise-free estimator as an intermediate term, we have
[TABLE]
Taking the conditional expectation with respect to given on both sides, using (22) in Proposition 1 which indicates
[TABLE]
we thus have
[TABLE]
Using the fact , we have
[TABLE]
Following the proof of Proposition 5 in [28], we establish the relationship between global and local empirical error
[TABLE]
Substituting (30), (31) and (32) into (23), we get the desired result
[TABLE]
∎
Sample variance (25) is caused by noise on labels , which is output-dependent. Distributed error (26) reflects errors from distributed learning. Empirical error (27) represents the gap between expected learning and empirical learning. Note that empirical error focuses on noise-free data, and thus it can be reduced by additional unlabeled data, resulting in Theorem 4. Independent of the sample, random features error (28) is caused by the discrepancy between the kernel approximated by random features and the kernel, while approximation error (29) reflects the bias of the algorithm. Data-dependent features can reduce the random features error (28), which motivates Theorem 3.
The global sample variance is reduced to of the local one, illustrating that distributed learning reduces the sample error relative to any local estimator. Moreover, the empirical error is output-independent and can be reduced by using unlabeled data. Thus, with the same optimal error convergence rate, we improve the number of partitions by introducing more unlabeled examples in Theorem 4. Sample variance relies on both samples and labels, while random features error and approximation error are independent of the data, so additional unlabeled data do not influence the other errors.
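The variance reduction from averaging can be illustrated with a small simulation. The precise factor and norm used in the decomposition are not reproduced here; the sketch only checks the elementary fact that the average of m independent zero-mean local deviations has variance equal to the mean local variance divided by m. It measures variance in weight space rather than in the function norm, and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n_loc, M, lam, noise_sd = 4, 60, 5, 0.1, 0.5

# One noise-to-weights map per partition: w_j - E[w_j | X_j] = A_j @ eps_j
maps = []
for _ in range(m):
    Phi = rng.normal(size=(n_loc, M))
    C = Phi.T @ Phi / n_loc
    maps.append(np.linalg.solve(C + lam * np.eye(M), Phi.T / n_loc))

draws = 10000
eps = rng.normal(scale=noise_sd, size=(draws, m, n_loc))   # independent label noise
local_dev = np.stack([eps[:, j, :] @ maps[j].T for j in range(m)], axis=1)
global_dev = local_dev.mean(axis=1)                        # averaged estimator

local_var = (local_dev ** 2).sum(axis=2).mean(axis=0)      # one value per partition
global_var = (global_dev ** 2).sum(axis=1).mean()
ratio = global_var / local_var.mean()                      # close to 1 / m
```

The other error terms are bias-like and do not average out in this way, which is why the number of partitions cannot be increased without limit.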
A.3 Estimate Error Terms
To analyze the excess risk, in this part we estimate four error terms and . The estimates of sample variance and empirical error are related to the key quantity . To relax the restriction on the number of partitions, we provide a sharper upper bound for the critical quantity as a constant based on Bernstein’s inequality. The estimate of random features error is also associated with a critical quantity , where we estimate this term separately. The sample variance is related to the local labeled sample size , while the key quantity is related to the local total sample size . Those two parts lead to two constraints on the number of partitions. Random features error is related to the dimension of random features and independent of sample size.
A.3.1 Estimates for Sample Variance
Lemma 3.
For the sample variance (25) in the error decomposition, the following holds
[TABLE]
For any , when , the following holds with probability at least
[TABLE]
Proof.
Let and be defined as (18) and (19), we have
[TABLE]
The last step is due to Cauchy–Schwarz inequality. Note that
[TABLE]
where . Thus, we have , and it holds that
[TABLE]
The term can be rewritten as
[TABLE]
Combining (33), (34) and (35), one can prove
[TABLE]
From Lemma 10, we know that with high probability if . Substituting Lemma 4, Lemma 5 and Lemma 10 into the bound above, if , it holds with probability at least
[TABLE]
∎
Lemma 4 (Lemma 6 in [9]).
For , under Assumptions 1 and 2, the following holds with probability at least
[TABLE]
Using Bernstein’s inequality (Proposition 2), we prove the following lemma.
Lemma 5.
For , under Assumptions 1 and 2, with probability at least , we have
[TABLE]
Proof.
Let on in the Hilbert space . We see that
[TABLE]
Thus, the error term to bound can be stated as
[TABLE]
The right-hand side of the above identity can be bounded by Bernstein’s inequality (Proposition 2), thus we need to estimate and first.
Note that Assumption 2 with the setting implies that the regression function is bounded almost surely [10]
[TABLE]
With the inequality
[TABLE]
we thus have
[TABLE]
Note that
[TABLE]
Substituting (38) and (39) into (37), by Bernstein’s inequality (Proposition 2), one can prove that with probability at least
[TABLE]
∎
A.3.2 Estimates for Empirical Error
Lemma 6.
For the empirical error (27) in error decomposition, the following holds
[TABLE]
Under Assumptions 1 and 5, for and , when the number of local examples satisfies and the dimension of random features satisfies
[TABLE]
the following holds with probability at least
[TABLE]
Proof.
Using the definition of , we have
[TABLE]
Under definitions in (19) and (20), using the above identity and for positive operators , we have
[TABLE]
To obtain the key term , we introduce additional terms in the last step of the above identity. Note that, the following inequalities hold , , and . Thus, one can obtain that
[TABLE]
When , by Lemma 10, it holds with the probability
[TABLE]
Using Lemma 7 and Lemma 8, we have with probability at least
[TABLE]
∎
A.3.3 Estimates for Random Features Error
The next lemma bounds the distance between the Tikhonov solution with RF and the Tikhonov solution without RF, reflecting the approximation ability of random features.
Lemma 7.
Under Assumptions 1 and 5, for , when
[TABLE]
the following holds with a probability at least
[TABLE]
Proof.
According to the operator representations of (20) and (21)
[TABLE]
Using the identity and , we have
[TABLE]
Applying Assumption 3, there exists and , so we have
[TABLE]
Note that , , and if due to Lemma 12, we thus have with probability at least
[TABLE]
Then, we estimate the bound in two cases and .
- •
When , there exists
[TABLE]
The last step is due to for any .
Note that using Lemma 12, thus for , it holds with probability at least
[TABLE]
- •
When , there exists
[TABLE]
with and .
Using Proposition 4 with and , one can obtain that
[TABLE]
Thus, applying the above inequality to (40), we have
[TABLE]
To obtain , we need the mixed term to be bounded by
[TABLE]
From Lemma 11, with the condition , it holds
[TABLE]
where and . Similarly, Lemma 13 can be stated as
[TABLE]
where , and .
Note that, according to Minkowski’s inequality, we have
[TABLE]
Therefore, substituting (44), (45) and (46) into (43), there holds
[TABLE]
To make the mixed term bounded by , we consider the following condition
[TABLE]
and obtain the bound of mixed term
[TABLE]
The third step is due to and
[TABLE]
due to to guarantee bounded effective dimension in Proposition 10 [9]. The last step is due to since and since .
Thus, with the condition and we have with probability at least
[TABLE]
Combining the results in (41) and the bound above, we prove the lemma. ∎
A.3.4 Estimates for Approximation Error
The last term we need to estimate is approximation error whose proof is standard [12, 13, 9].
Lemma 8.
Under Assumptions 1 and 4, the following holds for any and ,
[TABLE]
Proof.
Under Assumption 4, there exists such that with . The identity is valid for and the bounded self-adjoint positive operator and by the definition of (21), we have
[TABLE]
Note that and , while according to Assumption 4. The proof is completed. ∎
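Bounds of this type usually reduce to a scalar inequality on the spectrum. The exact statement of the lemma is not reproduced here, so the following is only a hedged numerical check of the standard fact, assuming the source condition has the usual form with smoothness exponent r in [0, 1]: for every eigenvalue s > 0, lam * s**r / (s + lam) <= lam**r. This follows from s**r * lam**(1 - r) <= r*s + (1 - r)*lam <= s + lam.

```python
import numpy as np

def filter_residual_sup(lam, r, sigmas):
    """Largest value of lam * s**r / (s + lam) over the given eigenvalues."""
    return np.max(lam * sigmas**r / (sigmas + lam))

sigmas = np.logspace(-8, 0, 2000)          # eigenvalues in (0, 1]
lams = (1e-4, 1e-2, 1e-1)
rs = (0.0, 0.25, 0.5, 1.0)

# Ratio of the spectral supremum to lam**r; the inequality says it is at most 1.
ratios = {(lam, r): filter_residual_sup(lam, r, sigmas) / lam**r
          for lam in lams for r in rs}
worst_ratio = max(ratios.values())
```

For r above 1 the same ratio can exceed 1 as lam shrinks, which is the saturation effect of Tikhonov regularization mentioned in the conclusion.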
A.4 Proofs of Main Results
Theorem 5 (General excess risk bound).
Let and be defined by (7). Under Assumptions 1, 2, 3, 4 and 5, when , the number of local processors satisfies
[TABLE]
and the dimension of random features satisfies
[TABLE]
then the following holds with a probability at least ,
[TABLE]
where , and is a constant independent of that
[TABLE]
Proof.
From Lemma 2, there holds the upper bound for excess risk
[TABLE]
In the following, we use Lemma 3, Lemma 6 and Lemma 7 to bound error terms. Therefore, we need to take into account the conditions in those lemmas. There are constraints on the number of local examples and the dimension of random features :
[TABLE]
Here, we merge the constraints on because it is difficult to know which range the regularity belongs to. Meanwhile, depends on the number of partitions , where . Due to the constraints on the number of samples and , we use Assumption 5 to obtain the restriction on the number of partitions
[TABLE]
- •
When , using Assumption 5 that , to ensure , we should have
[TABLE]
Thus, it holds
[TABLE]
- •
When , using Assumption 3 and Assumption 5, we should have
[TABLE]
where and .
To ensure M\geq 16\kappa^{2}\log(2/\delta)\big[(\mathcal{N}_{\infty}(\lambda)+1)\vee\lambda^{1-2r}\mathcal{N}(\lambda)^{2r-1}(\mathcal{N}_{\infty}(\lambda)+1)^{2-2r}\big], using the above inequality it holds for
[TABLE]
due to the fact .
By Lemma 3, and , it holds for the global sample variance
[TABLE]
The last step is due to the inequality . From Assumption 3, we have . Note that we can obtain by Proposition 10 of [9] and . Using and the worst case , it holds
[TABLE]
where
According to Lemma 6, there holds for the empirical error
[TABLE]
Using Lemma 7, for random features error, it holds
[TABLE]
Using Lemma 8, for approximation error, it holds
[TABLE]
Substituting the above sample variance bound and the inequalities (54), (55), (56) into Lemma 2, we then get the final result
[TABLE]
where . Note that the proof uses inequalities that hold with high probability , including Lemmas 4, 5, 10, 12, 13, and thus the final result holds with probability at least . ∎
Proof of Theorem 1.
The result in Theorem 1 is a trivial extension of Theorem 2 in [9] and Corollary 1 in [14]. Considering only the attainable case , this theorem can be proved by combining the proofs in [14] and [9].
Following the error decomposition and proof process in the proof of Theorem 3, one can easily prove Theorem 1. However, the main difference is how to bound the term by a constant. Using Proposition 1 and the second-order decomposition of operator difference in [14], one can obtain the following identities
[TABLE]
Applying , , the facts and , it holds
[TABLE]
With confidence at least , the following holds for (see [13, 4])
[TABLE]
To guarantee that the term is a constant, it requires
[TABLE]
to make sure that
[TABLE]
Using Assumption 3 and , one obtains the same condition as in [13, 4, 14]. However, in Lemma 9 and Lemma 10, we directly apply a relaxed condition via Bernstein’s inequality to guarantee that the term is a constant.
To prove Theorem 1, we just need to use the condition to replace the condition in the proofs of Lemma 3 and Lemma 6. Then, following the proof of Theorem 5 for , we prove the result with due to ∎
Proof of Theorem 2.
Considering the worst case of Assumption 5, which is equivalent to making no assumption on , there always exists . Applying Theorem 5 with and , we prove the result. ∎
Proof of Theorem 3.
Theorem 5 is the detailed version of Theorem 3. ∎
Theorem 6 (Improved Bounds with Additional Unlabeled Samples).
Let and be defined by (10). Under Assumptions 1, 2, 3, 4 and 5, when , the total number of samples satisfies
[TABLE]
the number of local processors satisfies
[TABLE]
and the dimension of random features satisfies
[TABLE]
then the following holds with a probability at least ,
[TABLE]
where , and is a constant independent of that
[TABLE]
Proof.
From Lemma 2, there holds the upper bound for excess risk
[TABLE]
Using the above equality and Lemma 6, we find that empirical error is data-dependent but output-independent. Meanwhile, the sample variance (Lemma 3) is dependent on the number of labeled samples , while other terms (including ) can be related to total sample size .
Based on the sample variance, we first estimate the number of required labeled samples . Using Lemma 3 and the sample variance bound in the proof of Theorem 5, we have
[TABLE]
To guarantee the optimal learning rate, we need mN^{\frac{1-4r-2\gamma}{2r+\gamma}}\leq\mathcal{O}\big(N^{\frac{-2r}{2r+\gamma}}\big), and thus
[TABLE]
We then consider the additional unlabeled samples to reduce the empirical error, where the local samples are label-free and the constraint is related to the total sample size from Lemma 10:
[TABLE]
Let , then the restriction on the dimension of random features is the same as in Theorem 5. But the restriction on the number of partitions is changed to
[TABLE]
From the constraint (57) due to sample variance, we know that the number of partitions cannot be bigger than and plays the leading role. Thus, combining (57) and (58), one can obtain
[TABLE]
We consider the following two conditions for
- •
The case . It holds , thus the constraint on the number of partitions is .
- •
The case . It holds and we make use of additional unlabeled examples to guarantee .
Therefore, using unlabeled examples, the number of partitions always achieves .
Considering the following constraints on the number of partitions and the dimension of random features :
[TABLE]
We first estimate the output-dependent error term: sample variance. Using , and the sample variance bound in the proof of Theorem 5, the global sample variance is bounded by
[TABLE]
where .
We then bound the label-free terms in Lemma 2 with . Using Lemma 6, Lemma 7 and Lemma 8, it holds
[TABLE]
Combining the above sample variance bound and (60) with Lemma 2, one can prove the desired result. ∎
Proof of Theorem 4.
Theorem 6 is the detailed version of Theorem 4. ∎
**Corollary 1.**
Under the same assumptions as Theorem 3, if , and , then and the number of random features satisfying
[TABLE]
are sufficient to guarantee, with high probability, that
[TABLE]
The above error bound is a special case of Theorem 3 using only one partition , namely KRR-RF. Compared to the theoretical results in [9], which only apply to the attainable case , Corollary 1 pertains to both the attainable and non-attainable cases , covering problems of every difficulty level. Meanwhile, the requirements on the number of random features are reasonable and lead to higher computational efficiency.
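To make the single-partition estimator concrete, the following is a minimal sketch of ridge regression on random features, not the authors' implementation. It assumes a Gaussian kernel approximated by random Fourier features [8] with $\psi(\boldsymbol{x},\omega)=\sqrt{2}\cos(\omega^{\top}\boldsymbol{x}+b)$; the function names, the bandwidth `sigma`, the regularization `lam`, and the toy data are all illustrative assumptions.

```python
import numpy as np

def random_fourier_features(X, W, b):
    """Map X (n, d) to phi_M(X) (n, M), with phi_M(x) = M^{-1/2} [psi(x, w_1), ..., psi(x, w_M)]."""
    M = W.shape[1]
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)

def fit_krr_rf(X, y, W, b, lam):
    """Solve the ridge problem in feature space: (Phi^T Phi / n + lam I) w = Phi^T y / n."""
    Phi = random_fourier_features(X, W, b)
    n, M = Phi.shape
    A = Phi.T @ Phi / n + lam * np.eye(M)
    return np.linalg.solve(A, Phi.T @ y / n)

def predict_krr_rf(X, W, b, w):
    """Evaluate the fitted estimator at new inputs."""
    return random_fourier_features(X, W, b) @ w

# Toy problem (hypothetical sizes and hyperparameters).
rng = np.random.default_rng(0)
n, d, M, sigma, lam = 500, 1, 100, 0.5, 1e-4
X = rng.uniform(-1.0, 1.0, size=(n, d))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(n)

# Frequencies for the Gaussian kernel exp(-||x - x'||^2 / (2 sigma^2)).
W = rng.standard_normal((d, M)) / sigma
b = rng.uniform(0.0, 2.0 * np.pi, size=M)

w = fit_krr_rf(X, y, W, b, lam)
X_test = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
mse = np.mean((predict_krr_rf(X_test, W, b, w) - np.sin(3.0 * X_test[:, 0])) ** 2)
print(f"test MSE against the noiseless target: {mse:.5f}")
```

The linear system has size M by M rather than n by n, which is where the computational saving of random features comes from when M is much smaller than n.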
**Corollary 2.**
Under the same assumptions as Theorem 4, if and , then the total number of samples corresponding to
[TABLE]
and the number of local processors satisfying
[TABLE]
are sufficient to guarantee, with high probability, that
[TABLE]
The above corollary is a special case of DKRR-RF with the induced kernel rather than random features, i.e., KRR-DC. The existing theoretical results on KRR-DC are still restricted to , while we relax the condition to for the first time, which admits higher computational complexities and covers more complicated problems in the non-attainable case. Using the condition , it is worthwhile to devise more efficient distributed KRR methods together with Nyström subsampling, random projections, stochastic optimization, and other techniques in the future.
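The divide-and-conquer scheme behind KRR-DC [3] can be sketched as follows: split the samples into disjoint partitions, solve exact kernel ridge regression on each, and average the local predictors. This is a minimal illustration under assumed choices (Gaussian kernel, equal-sized partitions, hypothetical function names and hyperparameters), not the authors' code.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gram matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def fit_local_krr(X, y, sigma, lam):
    """Local KRR on one partition: solve (K + n lam I) alpha = y."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def krr_dc_predict(X, y, X_test, m, sigma, lam, rng):
    """Average the predictions of m local KRR estimators trained on disjoint partitions."""
    idx = rng.permutation(X.shape[0])
    preds = []
    for part in np.array_split(idx, m):
        alpha = fit_local_krr(X[part], y[part], sigma, lam)
        preds.append(gaussian_kernel(X_test, X[part], sigma) @ alpha)
    # Plain mean; appropriate here because the partitions have equal size.
    return np.mean(preds, axis=0)

# Toy problem (hypothetical sizes and hyperparameters).
rng = np.random.default_rng(0)
N, m, sigma, lam = 600, 4, 0.5, 1e-3
X = rng.uniform(-1.0, 1.0, size=(N, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(N)
X_test = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)

y_hat = krr_dc_predict(X, y, X_test, m, sigma, lam, rng)
mse = np.mean((y_hat - np.sin(3.0 * X_test[:, 0])) ** 2)
print(f"KRR-DC test MSE with m={m}: {mse:.5f}")
```

Each local solve costs on the order of (N/m)^3 instead of N^3, which is why the admissible range of m matters for the overall cost.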
A.5 Probabilistic Inequalities
**Proposition 2 (Lemma 2 in [12]).**
Let be a separable Hilbert space and be a sequence of i.i.d. random variables in . Assume the bound is and the variance is for any . For any , with confidence ,
[TABLE]
This Bernstein inequality is the key to relating an empirical random vector to its expected counterpart, and it is used to prove Lemma 9 and Lemma 4. The version for random vectors was provided in [12, 9] and was later extended to the random operator case in Lemma 24 of [10].
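A quick numerical illustration of this kind of concentration is sketched below. The displayed inequality did not survive extraction, so the sketch assumes the commonly stated form for Hilbert-space-valued variables: with norm bound B, variance proxy sigma^2 and n samples, the deviation of the empirical mean is at most 2 B log(2/delta) / n + sqrt(2 sigma^2 log(2/delta) / n) with confidence 1 - delta. That form, the function name, and the toy distribution are assumptions, not taken from the text.

```python
import numpy as np

def bernstein_bound(n, B, sigma2, delta):
    """Assumed Bernstein-type deviation bound for the mean of n bounded random vectors."""
    log_term = np.log(2.0 / delta)
    return 2.0 * B * log_term / n + np.sqrt(2.0 * sigma2 * log_term / n)

rng = np.random.default_rng(0)
n, d, delta = 2000, 5, 0.05

# Vectors uniform on the unit sphere in R^d: norm exactly 1 (B = 1),
# mean zero, and E||xi||^2 = 1 (sigma^2 = 1).
xi = rng.standard_normal((n, d))
xi /= np.linalg.norm(xi, axis=1, keepdims=True)

deviation = np.linalg.norm(xi.mean(axis=0))  # || (1/n) sum_i xi_i - E xi ||
bound = bernstein_bound(n, B=1.0, sigma2=1.0, delta=delta)
print(f"deviation = {deviation:.4f}, bound = {bound:.4f}")
```

The second term of the bound dominates for large n, which gives the usual n^{-1/2} rate used throughout the lemmas of this section.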
**Proposition 3 (Lemma E.2 of [39]).**
For any self-adjoint and positive semi-definite operators and , if there exists such that the following inequality holds
[TABLE]
then
[TABLE]
The above inequality [39] was used to establish the connection between and . In this paper, the two terms and often appear on the left-hand side of the error estimates, where we use Proposition 3 to bound both of them by constants.
**Proposition 4 (Proposition 9 in [9]).**
Let be two separable Hilbert spaces and be bounded linear operators, with and be positive semidefinite. The following holds
[TABLE]
**Lemma 9.**
Given $\phi_{M}(\boldsymbol{x})=M^{-1/2}\big[\psi(\boldsymbol{x},\omega_{1}),\cdots,\psi(\boldsymbol{x},\omega_{M})\big]^{\top}$, let the random vectors $\big[\phi_{M}(\boldsymbol{x}_{1}),\cdots,\phi_{M}(\boldsymbol{x}_{n})\big]$ with be on a separable Hilbert space such that and are trace class. Then for any , with probability at least , the following holds
[TABLE]
Proof.
Let and
[TABLE]
thus we have
[TABLE]
The left of the desired inequality becomes
[TABLE]
Note that
[TABLE]
To use Bernstein’s inequality (Proposition 2), we need to bound and as follows
[TABLE]
Substituting the above two bounds into Bernstein’s inequality (61), we prove the result. ∎
**Lemma 10.**
When the number of local samples , then for any , there holds, with confidence ,
[TABLE]
Proof.
From Lemma 9, we set and obtain that
[TABLE]
From Proposition 3 and the above inequality, it follows that
[TABLE]
∎
**Lemma 11.**
Let with , be random vectors on a separable Hilbert space such that and are trace class. Then for any , with probability at least , the following holds
[TABLE]
Proof.
Let and
[TABLE]
thus we have
[TABLE]
The left of the desired inequality becomes
[TABLE]
Note that
[TABLE]
To use Bernstein’s inequality (Proposition 2), we need to bound and . Note that
[TABLE]
Substituting the above two bounds into Bernstein’s inequality (61), we prove the result. ∎
**Lemma 12.**
When the dimension of random features , then for any , there holds, with confidence ,
[TABLE]
Proof.
From Lemma 11, we set and obtain that
[TABLE]
From Proposition 3 and the above inequality, it follows that
[TABLE]
∎
**Lemma 13.**
Let with , be random vectors on a separable Hilbert space such that and are trace class. Then for any , with probability at least , the following holds
[TABLE]
Proof.
Let and
[TABLE]
thus we have
[TABLE]
The left of the desired inequality becomes
[TABLE]
Note that
[TABLE]
To use Bernstein’s inequality (Proposition 2), we need to bound and . Note that
[TABLE]
The last step is due to . Substituting the above two identities and into Bernstein’s inequality (61), we prove the result. ∎
References
- [1] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 21 (NIPS), pages 161–168, 2008.
- [2] Jian Li, Yong Liu, Rong Yin, Hua Zhang, Lizhong Ding, and Weiping Wang. Multi-class learning: From theory to algorithm. In Advances in Neural Information Processing Systems 31, pages 1591–1600, 2018.
- [3] Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. Journal of Machine Learning Research, 16(1):3299–3340, 2015.
- [4] Shao-Bo Lin, Xin Guo, and Ding-Xuan Zhou. Distributed learning with regularized least squares. Journal of Machine Learning Research, 18(1):3202–3232, 2017.
- [5] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 14 (NIPS), pages 682–688, 2001.
- [6] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems 28 (NIPS), pages 1657–1665, 2015.
- [7] Jian Li, Yong Liu, Rong Yin, and Weiping Wang. Approximate manifold regularization: Scalable algorithm and generalization analysis. In IJCAI, pages 2887–2893, 2019.
- [8] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 21 (NIPS), pages 1177–1184, 2007.
