Towards Sharp Analysis for Distributed Learning with Random Features

Jian Li; Yong Liu; Weiping Wang

arXiv:1906.03155·cs.LG·August 30, 2023

TL;DR

This paper advances the theoretical understanding of distributed learning with random features by extending optimal rates to non-attainable cases, reducing feature requirements, and improving partition scalability, supported by experiments.

Contribution

It introduces refined analysis techniques for non-attainable cases, data-dependent feature generation, and enhanced partitioning strategies in distributed learning with random features.

Findings

01. Extended optimal rates to non-attainable cases

02. Reduced number of random features needed

03. Improved scalability with additional unlabeled data

Abstract

In recent studies, the generalization properties for distributed learning and random features assumed the existence of the target concept over the hypothesis space. However, this strict condition is not applicable to the more common non-attainable case. In this paper, using refined proof techniques, we first extend the optimal rates for distributed learning with random features to the non-attainable case. Then, we reduce the number of required random features via data-dependent generating strategy, and improve the allowed number of partitions with additional unlabeled data. Theoretical analysis shows these techniques remarkably reduce computational cost while preserving the optimal generalization accuracy under standard assumptions. Finally, we conduct several experiments on both simulated and real-world datasets, and the empirical results validate our theoretical findings.

Tables

Table 1: Dataset statistics. EEG$^{*}$ indicates that 3,000 examples are used as labeled examples, 7,000 examples as unlabeled ones, and the other 4,980 examples as test data.

datasets	$d$	# training	# testing	$\sigma$	$\lambda$
EEG	$14$	$7,490$	$7,490$	$1$	$2^{-4}$
EEG$^{*}$	$14$	$3,000$ ($7,000^{*}$)	$4,980$	$1$	$2^{-4}$
covtype	$54$	$250,000$	$50,000$	$2$	$2^{-6}$
SUSY	$18$	$250,000$	$50,000$	$2^{2.5}$	$2$
HIGGS	$28$	$250,000$	$50,000$	$2^{3}$	$2^{-6}$

Equations

$\widehat{f}_{\lambda}:=\operatornamewithlimits{arg\,min}_{f\in\mathcal{H}}\Big\{\frac{1}{N}\sum_{i=1}^{N}\big(f(\boldsymbol{x}_{i})-y_{i}\big)^{2}+\lambda\|f\|_{\mathcal{H}}^{2}\Big\}.$

$\widehat{\boldsymbol{\alpha}}=(\mathbf{K}_{N}+\lambda NI)^{-1}\mathbf{y}_{N},$

$K(\boldsymbol{x},\boldsymbol{x}^{\prime})=\int_{\Omega}\psi(\boldsymbol{x},\omega)\psi(\boldsymbol{x}^{\prime},\omega)p(\omega)d\omega,\quad\forall\boldsymbol{x},\boldsymbol{x}^{\prime}\in\mathcal{X},$

$\phi_{M}(\boldsymbol{x})=\frac{1}{\sqrt{M}}\big(\psi(\boldsymbol{x},\omega_{1}),\cdots,\psi(\boldsymbol{x},\omega_{M})\big)^{\top},$

$\widehat{\boldsymbol{w}}_{j}=\operatornamewithlimits{arg\,min}_{\boldsymbol{w}\in\mathbb{R}^{M}}\Big\{\frac{1}{n}\sum_{i=1}^{n}\big(f(\boldsymbol{x}_{i})-y_{i}\big)^{2}+\lambda\|f\|^{2}\Big\},$

$\widehat{\boldsymbol{w}}_{j}=\big[\Phi_{M}^{\top}\Phi_{M}+\lambda I\big]^{-1}\Phi_{M}^{\top}\widehat{y}_{n},$

$\widehat{f}_{D,\lambda}^{M}(\boldsymbol{x})=\frac{1}{m}\sum_{j=1}^{m}\widehat{f}_{D_{j},\lambda}^{M}(\boldsymbol{x}).$

$\mathcal{E}(f)=\int_{\mathcal{X}\times\mathcal{Y}}(f(\boldsymbol{x})-y)^{2}d\rho(\boldsymbol{x},y).$

$\int_{\mathbb{R}}|y|^{p}d\rho(y|\boldsymbol{x})\leq\frac{1}{2}p!B^{p-2}\sigma^{2}.$

$(Lg)(\cdot)$

$(L_{M}g)(\cdot)$

$\mathcal{N}(\lambda)$

$\mathcal{N}_{M}(\lambda)$

$\mathcal{N}(\lambda)\leq Q^{2}\lambda^{-\gamma}.$

$f_{\rho}=L^{r}g,$

$1\lesssim m\lesssim N^{\frac{2r-1}{2r+\gamma}},\quad M\gtrsim N^{\frac{(2r-1)\gamma+1}{2r+\gamma}},$

$\mathbb{E}~\mathcal{E}(\widehat{f}_{D,\lambda}^{M})-\mathcal{E}(f_{\mathcal{H}})=\mathcal{O}\Big(N^{-\frac{2r}{2r+\gamma}}\Big).$

$1\lesssim m\lesssim N^{\frac{2r+\gamma-1}{2r+\gamma}}$

$M$

$\mathbb{E}~\mathcal{E}(\widehat{f}_{D,\lambda}^{M})-\mathcal{E}(f_{\rho})=\mathcal{O}\Big(N^{-\frac{2r}{2r+\gamma}}\Big).$

$\mathcal{N}_{\infty}(\lambda)=\sup_{\omega\in\Omega}\|(L+\lambda I)^{-1/2}\psi(\cdot,\omega)\|_{\rho_{X}}^{2},\quad\lambda>0.$

$\mathcal{N}_{\infty}(\lambda)\leq F\lambda^{-\alpha}.$

$\mathcal{N}(\lambda)\leq\sup_{\omega\in\Omega}\|(L+\lambda I)^{-1/2}\psi(\cdot,\omega)\|_{\rho_{X}}^{2}=\mathcal{N}_{\infty}(\lambda).$

$1\lesssim m$

$M$

$\mathbb{E}~\mathcal{E}(\widehat{f}_{D^{*},\lambda}^{M})-\mathcal{E}(f_{\rho})=\mathcal{O}\Big(N^{-\frac{2r}{2r+\gamma}}\Big).$

$y_{i}^{*}=\begin{cases}\frac{|D_{j}^{*}|}{|D_{j}|}y_{i},&\text{if }(\boldsymbol{x}_{i},y_{i})\in D_{j},\\ 0,&\text{otherwise}.\end{cases}$

$\widehat{f}_{D^{*},\lambda}^{M}=\frac{1}{m}\sum_{j=1}^{m}\widehat{f}_{D_{j}^{*},\lambda}^{M}.$

$N^{*}\gtrsim N\,N^{\frac{\gamma+\alpha-1}{2r+\gamma}}\vee N,$

$1\lesssim m\lesssim N^{\frac{2r+2\gamma-1}{2r+\gamma}}$

Code & Models

Repositories

superlj666/Distributed-Learning-with-Random-Features

Taxonomy

Topics: Distributed Sensor Networks and Detection Algorithms · Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques

Full text

Towards Sharp Analysis for Distributed Learning with Random Features

Jian Li

Institute of Information Engineering, Chinese Academy of Sciences

Yong Liu† (corresponding author)

Gaoling School of Artificial Intelligence, Renmin University of China

Weiping Wang

Institute of Information Engineering, Chinese Academy of Sciences

1 Introduction

A fundamental problem in machine learning is to balance statistical performance against computational cost [1, 2], and this challenge is especially severe for kernel methods. Despite their excellent theoretical guarantees, kernel methods do not scale well to large datasets because of their high time and memory complexities, which are typically at least quadratic in the number of examples. To break these scalability bottlenecks, researchers have developed a wide range of practical algorithms: distributed learning, which produces a global model after training on disjoint subsets on individual machines with the necessary communication [3, 4]; Nyström approximation [5, 6, 7] and random Fourier features [8, 9], which alleviate the memory bottleneck; and stochastic methods [10], which improve training efficiency.

From the theoretical perspective, many researchers have studied the statistical properties of these large-scale approaches combined with kernel ridge regression (KRR) [6, 11, 4]. Using integral operator techniques [12] and the effective dimension to control the capacity of the RKHS [13], the resulting generalization bounds achieve the optimal learning rates. Recent statistical learning studies show that KRR combined with large-scale approaches can obtain great computational gains while retaining optimal theoretical properties, for example KRR with divide-and-conquer [14, 15] and KRR with random projections, including Nyström approximation [6] and random features [9, 16, 17, 18]. Since combining local kernel estimators in the RKHS incurs a high communication cost, it is more practical to combine linear estimators in the feature space, as in federated learning [19]. Therefore, the generalization analysis of the combination of distributed learning and random features is of considerable importance.

Existing work on distributed KRR (DKRR) [14, 4, 15] and random features [9, 20, 18] mainly focuses on the attainable case, in which the true regression belongs to the hypothesis space, and ignores the non-attainable case, in which the true regression lies outside the hypothesis space. Since it is hard to select a kernel that guarantees the target function belongs to the kernel space, the non-attainable case is more common in practice. Therefore, statistical guarantees for the non-attainable case are of both practical and theoretical interest in statistical learning theory. The optimal rates for DKRR have been extended to a part of the non-attainable case via sharp analysis of the distributed error [10] and multiple communications [21, 22], but these techniques are hard to apply to random features. Meanwhile, some recent studies have extended capacity-independent optimality to the non-attainable case, including distributed learning [23], random features [24] and Nyström approximation [25], but capacity-independent results are suboptimal when the capacity of the RKHS is small. Capacity-dependent optimality for the combination of distributed learning and random features in the non-attainable case is still an open problem.

In this paper, we aim to extend the capacity-dependent optimal guarantees to the non-attainable case and to improve computational efficiency with more partitions and fewer random features. First, using a refined estimate of the similarity between operators, we sharpen the optimal generalization error bound so that it allows many more partitions and covers a part of the non-attainable case. Then, by generating random features in a data-dependent manner, we relax the restriction on the dimension of random features, so that fewer random features suffice to reach the optimal rates. By using additional unlabeled data to reduce the label-independent error terms, we further enlarge the admissible number of partitions and widen the applicable scope in the non-attainable case. Finally, we validate our theoretical findings with extensive experiments. The full proofs are deferred to the appendix.

1.1 Our Contributions

We highlight our contributions as follows:

• On the algorithmic front: much higher computational efficiency. This work attains the largest admissible number of partitions and the smallest number of random features known to date, which substantially improves computational efficiency.

– More partitions. To achieve the optimal learning rate, traditional distributed KRR methods [4, 14] impose a strict constraint on the number of partitions, $m\lesssim N^{\frac{2r-1}{2r+\gamma}}$ , which heavily limits computational efficiency. In this paper, using a novel estimate of the key quantity, we first relax the restriction to $m\lesssim N^{\frac{2r+\gamma-1}{2r+\gamma}}$ . Then, by introducing a few additional unlabeled examples, we improve the number of partitions to $m\lesssim N^{\frac{2r+2\gamma-1}{2r+\gamma}}$ for the first time.

– Fewer random features. By generating random features in a data-dependent rather than data-independent manner, we reduce the requirement on the number of random features from $M\gtrsim N^{\frac{(2r-1)\gamma+1}{2r+\gamma}}\quad\forall r\in[1/2,1]$ to $M\gtrsim N^{\frac{2r+\gamma-1}{2r+\gamma}}\vee N^{\frac{\gamma}{2r+\gamma}}\quad\forall r\in(0,1]$ , where $M$ is the number of random features and $\vee$ denotes the maximum of the two quantities.

• On the theoretical front: covering the non-attainable case. The conventional optimality results for KRR [13, 9, 14] only pertain to the attainable case $r\in[1/2,1]$ , which assumes that the true regression belongs to the hypothesis space, $f_{\rho}\in\mathcal{H}$ , so that the problem cannot be too difficult. However, the condition $f_{\rho}\in\mathcal{H}$ is too idealized, and the non-attainable case $r\in(0,1/2)$ , in which $f_{\rho}\notin\mathcal{H}$ , deserves more attention. In this paper, we first restate the classic results in the attainable case $r\in[1/2,1]$ . Then, by relaxing the restriction on the number of partitions, we extend the optimal theoretical guarantees to the non-attainable case under the constraints $2r+\gamma\geq 1$ and $2r+2\gamma\geq 1$ . Note that we prove that KRR with random features applies to all non-attainable cases $r\in(0,1/2)$ .

• Extensive experimental validation. To validate our theoretical findings, we conduct extensive experiments on simulated and real-world data. We first construct simulated experiments of varying difficulty to validate the learning rate and training time. Then, we run a comparison on a small real-world dataset to verify the effectiveness of data-dependent random features (with a novel approximate leverage score function) and of additional unlabeled examples. Finally, we compare the proposed DKRR-RF with related work in terms of performance on three real-world datasets.

• Technical challenges.

– More partitions with additional unlabeled examples. In the error decomposition, only the sample variance is label-dependent, while the other terms are label-independent. We therefore employ additional unlabeled examples to reduce the label-independent error terms, which further widens the applicable scope in the non-attainable case to $m\lesssim N^{\frac{2r+2\gamma-1}{2r+\gamma}}$ .

– Random features error in all non-attainable cases. Using an appropriate operator-level decomposition of the random features error, we prove that KRR with random features covers both the attainable and non-attainable cases $r\in(0,1]$ .

Overall, by overcoming several technical hurdles, we present optimal theoretical guarantees for the combination of DKRR and RF. With more partitions and fewer random features, the theoretical results not only obtain significant computational gains but also preserve the optimal learning properties in both the attainable and non-attainable cases $r\in(0,1]$ . Indeed, KRR [13], DKRR [14], and KRR-RF [9] are special cases of the setting studied in this paper. Thus, the techniques presented here pave the way for studying statistical guarantees of other types of kernel approaches (and even neural networks) in the non-attainable case.

2 Distributed Learning with Random Features

In the standard framework of supervised learning, there is a probability space $\mathcal{X}\times\mathcal{Y}$ with a fixed but unknown distribution $\rho$ , where $\mathcal{X}=\mathbb{R}^{d}$ is the input space and $\mathcal{Y}=\mathbb{R}$ is the output space. The training set $D=\{(\boldsymbol{x}_{i},y_{i})\}_{i=1}^{N}$ is sampled i.i.d. from $\mathcal{X}\times\mathcal{Y}$ with respect to $\rho$ . The primary objective is to fit the target regression $f_{\rho}$ on $\mathcal{X}\times\mathcal{Y}$ . The reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ induced by a Mercer kernel $K$ is defined as the completion of the linear span of $\{K(\boldsymbol{x},\cdot),\boldsymbol{x}\in\mathcal{X}\}$ with respect to the inner product $\langle K(\boldsymbol{x},\cdot),K(\boldsymbol{x}^{\prime},\cdot)\rangle_{\mathcal{H}}=K(\boldsymbol{x},\boldsymbol{x}^{\prime})$ . In terms of feature mappings, the nonlinear feature mapping $\phi:\mathcal{X}\to\mathcal{H}$ associated with the kernel $K$ is $\phi(\boldsymbol{x}):=K(\boldsymbol{x},\cdot)$ , so that $f(\boldsymbol{x})=\langle f,\phi(\boldsymbol{x})\rangle_{\mathcal{H}}$ for every $f\in\mathcal{H}$ .

2.1 Kernel Ridge Regression (KRR)

With an RKHS norm term, kernel ridge regression (KRR) is one of the popular empirical approaches to conducting a nonparametric regression. KRR can be stated as

$$\widehat{f}_{\lambda}:=\operatornamewithlimits{arg\,min}_{f\in\mathcal{H}}\Big\{\frac{1}{N}\sum_{i=1}^{N}\big(f(\boldsymbol{x}_{i})-y_{i}\big)^{2}+\lambda\|f\|_{\mathcal{H}}^{2}\Big\}.\tag{1}$$

By the representer theorem, the nonlinear regression problem (1) admits the closed-form solution $\widehat{f}_{\lambda}(\boldsymbol{x})=\sum_{i=1}^{N}\widehat{\alpha}_{i}K(\boldsymbol{x}_{i},\boldsymbol{x})$ with

$$\widehat{\boldsymbol{\alpha}}=(\mathbf{K}_{N}+\lambda NI)^{-1}\mathbf{y}_{N},\tag{2}$$

where $\lambda>0,\mathbf{y}_{N}=[y_{1},\cdots,y_{N}]^{T}$ and $\mathbf{K}_{N}$ is the $N\times N$ kernel matrix with $\mathbf{K}_{N}(i,j)=K(\boldsymbol{x}_{i},\boldsymbol{x}_{j})$ . Although KRR enjoys optimal statistical properties [12, 13], it is infeasible in large-scale settings because it needs $\mathcal{O}(N^{2})$ memory to store the kernel matrix and $\mathcal{O}(N^{3})$ time to solve the linear system (2).
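As a minimal sketch of the closed-form KRR solution, the snippet below solves the linear system $(\mathbf{K}_{N}+\lambda NI)\boldsymbol{\alpha}=\mathbf{y}_{N}$ with NumPy. The Gaussian kernel, the bandwidth `sigma`, the toy data and the helper names (`gaussian_kernel`, `krr_fit`, `krr_predict`) are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix with entries exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dists = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

def krr_fit(X, y, lam, sigma=1.0):
    """Solve (K_N + lam * N * I) alpha = y_N for the KRR coefficients."""
    N = X.shape[0]
    K = gaussian_kernel(X, X, sigma)                      # O(N^2) memory
    return np.linalg.solve(K + lam * N * np.eye(N), y)    # O(N^3) time

def krr_predict(X_train, alpha, X_new, sigma=1.0):
    """Evaluate f(x) = sum_i alpha_i K(x_i, x) at the rows of X_new."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

# Toy regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(200)

alpha = krr_fit(X, y, lam=1e-3)
train_mse = np.mean((krr_predict(X, alpha, X) - y) ** 2)
```

The dense $N\times N$ solve is exactly the step that becomes prohibitive for large $N$, which motivates the partitioning and random-feature approximations discussed next.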

2.2 Distributed KRR with Random Features (DKRR-RF)

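Assume that the kernel $K$ has an integral representation

$$K(\boldsymbol{x},\boldsymbol{x}^{\prime})=\int_{\Omega}\psi(\boldsymbol{x},\omega)\psi(\boldsymbol{x}^{\prime},\omega)p(\omega)d\omega,\quad\forall\boldsymbol{x},\boldsymbol{x}^{\prime}\in\mathcal{X},\tag{3}$$

where $(\Omega,\pi)$ is a probability space and $\psi:\mathcal{X}\times\Omega\to\mathbb{R}$ . We approximate the primal kernel $K(\boldsymbol{x},\boldsymbol{x}^{\prime})$ in (3) by the constructed kernel $K_{M}(\boldsymbol{x},\boldsymbol{x}^{\prime})=\phi_{M}(\boldsymbol{x})^{\top}\phi_{M}(\boldsymbol{x}^{\prime})$ , built from the corresponding random features obtained via Monte Carlo sampling,

$$\phi_{M}(\boldsymbol{x})=\frac{1}{\sqrt{M}}\big(\psi(\boldsymbol{x},\omega_{1}),\cdots,\psi(\boldsymbol{x},\omega_{M})\big)^{\top},\tag{4}$$

where $\omega_{1},\cdots,\omega_{M}\in\Omega$ are sampled with respect to $p(\omega)$ .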

Let the training set $D$ be randomly partitioned into $m$ disjoint subsets $\{D_{j}\}_{j=1}^{m}$ with $|D_{1}|=\cdots=|D_{m}|=n$ . The local estimator $\widehat{\boldsymbol{w}}_{j}$ on the subset $D_{j}$ is defined as

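$$\widehat{\boldsymbol{w}}_{j}=\operatornamewithlimits{arg\,min}_{\boldsymbol{w}\in\mathbb{R}^{M}}\Big\{\frac{1}{n}\sum_{i=1}^{n}\big(f(\boldsymbol{x}_{i})-y_{i}\big)^{2}+\lambda\|f\|^{2}\Big\},\tag{5}$$

where the estimator is $f(\boldsymbol{x})=\langle\boldsymbol{w},\phi_{M}(\boldsymbol{x})\rangle$ . It admits the closed-form solution

$$\widehat{\boldsymbol{w}}_{j}=\big[\Phi_{M}^{\top}\Phi_{M}+\lambda I\big]^{-1}\Phi_{M}^{\top}\widehat{y}_{n},\tag{6}$$

where $\lambda>0$ . Here, for the examples $(\boldsymbol{x}_{1},y_{1}),\cdots,(\boldsymbol{x}_{n},y_{n})$ of the $j$ -th subset $D_{j}$ , we set $\Phi_{M}=\frac{1}{\sqrt{n}}[\phi_{M}(\boldsymbol{x}_{1}),\cdots,\phi_{M}(\boldsymbol{x}_{n})]^{\top}\in\mathbb{R}^{n\times M}$ and $\widehat{y}_{n}=\frac{1}{\sqrt{n}}(y_{1},\cdots,y_{n})^{\top}.$ Averaging the local estimators (6) yields the global estimator

$$\widehat{f}_{D,\lambda}^{M}(\boldsymbol{x})=\frac{1}{m}\sum_{j=1}^{m}\widehat{f}_{D_{j},\lambda}^{M}(\boldsymbol{x}).\tag{7}$$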
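The pipeline of this subsection (shared random features, a local ridge solve on each subset, then averaging) can be sketched as follows. This is a single-process illustration under stated assumptions, not the authors' implementation: it assumes a Gaussian kernel approximated by random Fourier features $\psi(\boldsymbol{x},\omega)=\sqrt{2}\cos(\boldsymbol{w}^{\top}\boldsymbol{x}+b)$ with $\boldsymbol{w}\sim\mathcal{N}(0,I/\sigma^{2})$ and $b\sim U[0,2\pi]$, it assumes all partitions share the same $\omega_{1},\cdots,\omega_{M}$, and the names `rff_map`, `local_ridge` and `dkrr_rf_fit` are hypothetical.

```python
import numpy as np

def rff_map(X, W, b):
    """phi_M(x) = sqrt(2 / M) * cos(W^T x + b): random Fourier features (Gaussian kernel assumed)."""
    M = W.shape[1]
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)

def local_ridge(Phi, y, lam):
    """Local closed form; the 1/sqrt(n) scalings of Phi_M and y_n appear as the two 1/n factors."""
    n, M = Phi.shape
    return np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(M), Phi.T @ y / n)

def dkrr_rf_fit(X, y, m, M, lam, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Assumption: every partition uses the same omega_1, ..., omega_M.
    W = rng.standard_normal((X.shape[1], M)) / sigma
    b = rng.uniform(0.0, 2.0 * np.pi, size=M)
    # Random split of the index set into m disjoint subsets.
    parts = np.array_split(rng.permutation(X.shape[0]), m)
    w_locals = [local_ridge(rff_map(X[idx], W, b), y[idx], lam) for idx in parts]
    # f is linear in w, so averaging the weights averages the local estimators.
    return W, b, np.mean(w_locals, axis=0)

# Toy regression data (illustrative only).
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(2000, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(2000)

W, b, w_bar = dkrr_rf_fit(X, y, m=4, M=100, lam=1e-3)
mse = np.mean((rff_map(X, W, b) @ w_bar - y) ** 2)
```

Each local solve costs $\mathcal{O}(nM^{2}+M^{3})$ time, and only the $M$-dimensional weight vectors need to be communicated for the average.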

3 Theoretical Assessment

In this section, we present the theoretical analysis on the generalization performance of kernel ridge regression with divide-and-conquer and random features.

The generalization ability of a regression predictor $f:\mathcal{X}\to\mathbb{R}$ is measured in terms of the expected risk

$$\mathcal{E}(f)=\int_{\mathcal{X}\times\mathcal{Y}}(f(\boldsymbol{x})-y)^{2}d\rho(\boldsymbol{x},y).$$

In this case, the target regression $f_{\rho}=\operatornamewithlimits{arg\,min}_{f}\mathcal{E}(f)$ minimizes the expected risk over all measurable functions $f:\mathcal{X}\to\mathbb{R}$ . The generalization ability of a KRR estimator $f\in{L^{2}_{\rho_{X}}}$ is measured by the excess risk, i.e. $\mathcal{E}(f)-\mathcal{E}(f_{\rho})$ , where $L^{2}_{\rho_{X}}=\{f:\mathcal{X}\to\mathbb{R}~|~\|f\|_{\rho}^{2}=\int_{X}|f(\boldsymbol{x})|^{2}d\rho_{X}<\infty\}$ is the Hilbert space of square-integrable functions with respect to the marginal distribution $\rho_{X}$ on the input space $\mathcal{X}$ .

3.1 Assumptions

We first introduce two standard assumptions, which are also used in statistical learning theory [12, 13, 9].

Assumption 1 (Random features are continuous and bounded).

Assume that $\psi$ is continuous and there is a $\kappa\in[1,\infty)$ , such that $|\psi(\boldsymbol{x},\omega)|\leq\kappa,\forall\boldsymbol{x}\in\mathcal{X},\omega\in\Omega$ .

Assumption 2 (Moment assumption).

Assume there exist $B>0$ and $\sigma>0$ such that, for all $p\geq 2$ with $p\in\mathbb{N}$ ,

$$\int_{\mathbb{R}}|y|^{p}d\rho(y|\boldsymbol{x})\leq\frac{1}{2}p!B^{p-2}\sigma^{2}.$$

According to Assumption 1, the kernel $K$ is bounded by $K(\boldsymbol{x},\boldsymbol{x})\leq\kappa^{2}$ . The moment assumption on the output $y$ holds when $y$ is bounded, sub-Gaussian or sub-exponential. Assumptions 1 and 2 are standard in the generalization analysis of KRR and lead to the learning rate $\mathcal{O}(1/\sqrt{N})$ [12] in general cases.

Definition 1 (Integral operators).

$\forall~{}g\in{L^{2}_{\rho_{X}}}(X,\rho_{X})$ , the integral operators $L,L_{M}$ are defined by the kernel $K$ and the random features $\phi_{M}$ , respectively

[TABLE]

Definition 2 (Effective dimension).

The effective dimension of the RKHS $\mathcal{H}$ induced by the kernel $K$ is defined as

[TABLE]

The effective dimension $\mathcal{N}(\lambda)$ is used to measure the complexity of RKHS $\mathcal{H}$ , and its empirical counterpart is also called degree of freedom [26]. Similarly, we define the effective dimension $\mathcal{N}_{M}(\lambda)$ for the random features mapping $\phi_{M}$ to measure the size of the approximate RKHS $\mathcal{H}_{M}$ , which is induced by finite dimensional random features $\phi_{M}:\mathcal{X}\to\mathbb{R}^{M}$ .

Assumption 3 (Capacity assumption).

Assume there exist $Q>0$ and $\gamma\in[0,1]$ such that, for any $\lambda>0$ ,

$$\mathcal{N}(\lambda)\leq Q^{2}\lambda^{-\gamma}.$$

Assumption 4 (Regularity assumption).

Assume there exist $R>0$ , $r>0$ , and $g\in{L^{2}_{\rho_{X}}}$ such that

$$f_{\rho}=L^{r}g,$$

where $f_{\rho}$ is the target regression, $\|g\|_{\rho}\leq R$ and the operator $L^{r}$ denotes the $r$ -th power of the integral operator $L:{L^{2}_{\rho_{X}}}\to{L^{2}_{\rho_{X}}}$ , thus it is also a positive trace class operator.

Assumption 3 holds when the eigenvalues of the integral operator have a polynomial decay $i^{-1/\gamma},~\forall i>1$ [9, 20]. Thus, faster convergence rates are obtained when the eigenvalues decay faster, i.e. when $\gamma$ approaches $0$ , while $\gamma=1$ corresponds to the capacity-independent case. Assumption 4 (the source condition) controls the regularity of the target function $f_{\rho}$ : the larger $r$ is, the more regular the regression function is and the easier the learning problem becomes. Both assumptions are widely used in the optimality theory for KRR [13, 9, 14].

3.2 General Results with Fast Rates

One can prove optimal generalization guarantees for DKRR-RF by combining the theories of KRR-DC [4] and KRR-RF [9]. The attainable case $r\in[1/2,1]$ requires the existence of $f_{\mathcal{H}}=\operatornamewithlimits{arg\,min}_{f\in\mathcal{H}}\mathcal{E}(f)$ such that $f_{\rho}=f_{\mathcal{H}}$ almost surely [27], an assumption that is widely used in KRR and its variants, including distributed KRR and random-features-based KRR [13, 9, 14].

Theorem 1.

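Under Assumptions 1, 2, 3 and 4, if $r\in[1/2,1]$ , $\gamma\in[0,1]$ , and $\lambda=N^{-\frac{1}{2r+\gamma}},$ then

$$1\lesssim m\lesssim N^{\frac{2r-1}{2r+\gamma}},\qquad M\gtrsim N^{\frac{(2r-1)\gamma+1}{2r+\gamma}},$$

are enough to guarantee, with high probability, that

$$\mathbb{E}~\mathcal{E}(\widehat{f}_{D,\lambda}^{M})-\mathcal{E}(f_{\mathcal{H}})=\mathcal{O}\Big(N^{-\frac{2r}{2r+\gamma}}\Big).$$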

The learning rate $\mathcal{O}\left(N^{-\frac{2r}{2r+\gamma}}\right)$ in Theorem 1 is optimal in the minimax sense for KRR approaches [13]. Distributed KRR methods, such as KRR-DC [4, 15], obtain the same optimal error bounds under a strong condition on the number of partitions, $m\lesssim N^{\frac{2r-1}{2r+\gamma}}$ . In particular, for the general case $r=1/2$ , the number of local processors $m=\mathcal{O}(1)$ is a constant independent of the sample size $N$ . The time complexity of DKRR-RF is $\mathcal{O}(NM^{2}/m)$ and the space complexity is $\mathcal{O}(NM/m)$ ; we report the computational complexities of Theorem 1 in Figure 1.
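To make the orders in Theorem 1 concrete, the following sketch evaluates them for given $N$, $r$ and $\gamma$. It is an illustration only: every constant hidden by $\lesssim$, $\gtrsim$ and $\mathcal{O}(\cdot)$ is set to 1, and the function name `theorem1_requirements` is hypothetical.

```python
def theorem1_requirements(N, r, gamma):
    """Orders from Theorem 1, with all constants hidden by the asymptotic notation set to 1."""
    if not (0.5 <= r <= 1.0 and 0.0 <= gamma <= 1.0):
        raise ValueError("Theorem 1 assumes r in [1/2, 1] and gamma in [0, 1]")
    s = 2.0 * r + gamma
    m_max = N ** ((2.0 * r - 1.0) / s)                    # largest admissible number of partitions
    M_min = N ** (((2.0 * r - 1.0) * gamma + 1.0) / s)    # smallest admissible number of features
    return {
        "lambda": N ** (-1.0 / s),
        "m_max": m_max,
        "M_min": M_min,
        "rate": N ** (-2.0 * r / s),
        "time": N * M_min**2 / m_max,    # O(N M^2 / m)
        "space": N * M_min / m_max,      # O(N M / m)
    }

general = theorem1_requirements(10**6, r=0.5, gamma=1.0)
smooth = theorem1_requirements(10**6, r=1.0, gamma=1.0)
```

For $N=10^{6}$ in the general case $(r=1/2,\gamma=1)$ the bound gives $m\lesssim N^{0}=1$ and $M\gtrsim N^{1/2}=10^{3}$, so no parallel speed-up is available, whereas for $(r=1,\gamma=1)$ it gives $m\lesssim N^{1/3}=10^{2}$ and $M\gtrsim N^{2/3}=10^{4}$.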

Remark 1.

The general results in Theorem 1 have three major drawbacks: 1) the bound only holds in the attainable case $r\in[1/2,1]$ and fails to apply to the non-attainable case $r\in(0,1/2)$ , which arises in more difficult problems; 2) random features generated via Monte Carlo sampling are data-independent, which requires many more features than data-dependent generation; 3) the constraint on the number of partitions $m\lesssim N^{\frac{2r-1}{2r+\gamma}}$ is too strict, leading to a constant number of partitions when $r$ is close to $1/2$ .

3.3 Refined Results in the Non-attainable Case

Theorem 2.

Under Assumptions 1, 2, 3 and 4, if $r\in(0,1]$ , $\gamma\in[0,1]$ , $2r+\gamma\geq 1$ and $\lambda=N^{-\frac{1}{2r+\gamma}},$ then the number of partitions $m$ satisfying

$$1\lesssim m\lesssim N^{\frac{2r+\gamma-1}{2r+\gamma}}$$

and the number of random features $M$ satisfying

[TABLE]

are enough to guarantee, with a high probability, that

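$$\mathbb{E}~\mathcal{E}(\widehat{f}_{D,\lambda}^{M})-\mathcal{E}(f_{\rho})=\mathcal{O}\Big(N^{-\frac{2r}{2r+\gamma}}\Big).$$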

Compared to Theorem 1, Theorem 2 allows more partitions and extends the optimal learning guarantees to the non-attainable case $r\in(0,1/2)$ , where the true regression does not lie in the RKHS $\mathcal{H}$ . Thus, it achieves significant improvements in both computational efficiency and statistical guarantees. With the same optimal learning rates, Theorem 2 relaxes the restriction on $m$ from $m\lesssim N^{\frac{2r-1}{2r+\gamma}}$ to $m\lesssim N^{\frac{2r+\gamma-1}{2r+\gamma}}$ , which allows more partitions and relaxes the constraint from $r\geq 1/2$ to $2r+\gamma\geq 1$ . When $r\in(0,1/2)$ , the required number of random features $M\gtrsim N^{\frac{1}{2r+\gamma}}$ increases as $r$ approaches zero, because $f_{\rho}$ moves further away from $\mathcal{H}$ . When $r\in[1/2,1]$ , we obtain the same order for the number of random features, $M\gtrsim N^{\frac{(2r-1)\gamma+1}{2r+\gamma}}$ , as KRR-RF [9], which coincides with $M\gtrsim N^{\frac{1}{2r+\gamma}}$ at the critical point $r=1/2$ . Compared to Figure 1, Figure 2 illustrates that Theorem 2 not only enlarges the applicable region but also improves computational efficiency.

Remark 2.

Theorem 2 extends the optimal generalization theory from only the attainable case $r\in[1/2,1]$ to the case $2r+\gamma\geq 1$ , which includes a part of the difficult problems $r\in(0,1/2)$ . However, there are also many cases with $2r+\gamma<1$ in the non-attainable regime $r\in(0,1/2)$ , where the optimal learning guarantees of Theorem 2 are no longer valid. Inspired by [28], we employ additional unlabeled samples to relax the restriction $2r+\gamma\geq 1$ in Section 3.5.

3.4 Fewer Features with Data-dependent Sampling

Assumption 5 (Compatibility assumption).

Define the maximum effective dimension as

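$$\mathcal{N}_{\infty}(\lambda)=\sup_{\omega\in\Omega}\|(L+\lambda I)^{-1/2}\psi(\cdot,\omega)\|_{\rho_{X}}^{2},\quad\lambda>0.$$

Assume there exist $\alpha\in[0,1]$ and $F>0$ such that

$$\mathcal{N}_{\infty}(\lambda)\leq F\lambda^{-\alpha}.$$

Using the definition of $\mathcal{N}(\lambda)$ , we obtain a lower bound for $\mathcal{N}_{\infty}(\lambda)$ :

$$\mathcal{N}(\lambda)\leq\sup_{\omega\in\Omega}\|(L+\lambda I)^{-1/2}\psi(\cdot,\omega)\|_{\rho_{X}}^{2}=\mathcal{N}_{\infty}(\lambda).$$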

Compared to the (average) effective dimension used in Assumption 3, the maximum effective dimension offers a finer-grained estimate of the capacity of the RKHS [29, 9, 30], which often leads to sharper estimates of the related quantities. Using the compatibility assumption, we relax the constraints on the dimension of random features and the number of partitions by generating features in a data-dependent manner, as in [30, 31, 20].

Theorem 3.

Under the same assumptions as Theorem 2 together with Assumption 5, if $r\in(0,1]$ , $\gamma\in[0,1]$ , $2r+\gamma\geq 1$ and $\lambda=N^{-\frac{1}{2r+\gamma}},$ then the number of partitions $m$ satisfying

[TABLE]

and the number of random features $M$ satisfying

[TABLE]

are sufficient to guarantee, with high probability, that

[TABLE]

The learning rates of the above theorem are optimal, the same as in Theorem 2. While achieving the same optimal learning rates, Theorem 3 reduces the computational cost by using fewer random features. The number of required random features is reduced from $\mathcal{O}\big(N^{\frac{1}{2r+\gamma}}\big)$ to $\mathcal{O}\big(N^{\frac{\alpha}{2r+\gamma}}\big)$ when $r\in(0,1/2)$ , and from $\mathcal{O}\big(N^{\frac{(2r-1)\gamma+1}{2r+\gamma}}\big)$ to $\mathcal{O}\big(N^{\frac{(2r-1)\gamma+1+2(r-1)(1-\alpha)}{2r+\gamma}}\big)$ when $r\in[1/2,1]$ , where the term $2(r-1)(1-\alpha)\leq 0$ . We report the applicable area and computational complexities of Theorem 3 in Figure 3, which shows that data-dependent sampling significantly reduces both the time and space complexities. In particular, the situations near the border line $2r+\gamma=1$ no longer have the same computational complexities as exact KRR.

Remark 3.

From Theorem 1 in [20], the required number of data-dependent random features is bounded as $M\gtrsim d_{\tilde{l}}:=\sup_{\boldsymbol{w}\in\Omega}l_{\lambda}(\boldsymbol{w})/q(\boldsymbol{w})$ , where $d_{\tilde{l}}\propto\mathcal{N}_{\infty}(\lambda)\leq FN^{\frac{\alpha}{2r+\gamma}}$ . This condition is the same as that of Theorem 3 in the non-attainable case $r\in(0,1/2)$ and milder than that of Theorem 3 in the attainable case $r\in[1/2,1]$ . However, the theoretical analysis in [20] only pertains to the general case $(r=1/2,\gamma=1)$ and yields error bounds with the convergence rate $\mathcal{O}(1/\sqrt{N})$ .

Remark 4.

According to the definition of $\mathcal{N}_{\infty}(\lambda)$ , the sampling probability of random features $\pi(\omega)$ is independent of the data, which leads to a pessimistic estimate of $\alpha$ . Generating random features in a data-dependent manner, however, brings $\alpha$ closer to $\gamma$ . A theoretical example of data-dependent random features is given in Example 2 of [9], which guarantees $\mathcal{N}_{\infty}(\lambda)=\mathcal{N}(\lambda)$ (so that $\alpha=\gamma$ ) by constructing random features in a data-dependent way. In practice, leverage sampling algorithms have been proposed to obtain data-dependent random features [20], for which $\alpha$ is close to $\gamma$ . To illustrate the improvement of data-dependent random features intuitively, we make the bold assumption $\alpha=\gamma$ when generating data-dependent random features.

3.5 More Partitions with Unlabeled Data

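In this part, we introduce additional unlabeled samples $\widetilde{D}_{j}$ to further relax the restriction on the number of partitions. On the $j$ -th processor, we consider the merged dataset $D_{j}^{*}=D_{j}\cup\widetilde{D}_{j}$ with

$$y_{i}^{*}=\begin{cases}\frac{|D_{j}^{*}|}{|D_{j}|}y_{i},&\text{if }(\boldsymbol{x}_{i},y_{i})\in D_{j},\\ 0,&\text{otherwise}.\end{cases}$$

Let $D^{*}=\bigcup_{j=1}^{m}D_{j}^{*},|D^{*}|=N^{*}$ and $|D_{1}^{*}|=\cdots=|D_{m}^{*}|=n^{*}$ . We define semi-supervised kernel ridge regression with divide-and-conquer and random features by

$$\widehat{f}_{D^{*},\lambda}^{M}=\frac{1}{m}\sum_{j=1}^{m}\widehat{f}_{D_{j}^{*},\lambda}^{M}.$$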
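The label construction on the merged subset $D_{j}^{*}$ (labeled outputs rescaled by $|D_{j}^{*}|/|D_{j}|$, unlabeled outputs set to zero) can be sketched as below. The point of the rescaling is that the label-dependent term $\frac{1}{n^{*}}\sum_{i}\phi_{M}(\boldsymbol{x}_{i})y_{i}^{*}$ equals its labeled-only counterpart $\frac{1}{n}\sum_{i}\phi_{M}(\boldsymbol{x}_{i})y_{i}$, while the covariance term is estimated from all $n^{*}$ inputs. The random matrices standing in for the feature maps and the helper names `pad_labels` and `local_ridge` are assumptions for illustration.

```python
import numpy as np

def pad_labels(y_labeled, n_unlabeled):
    """Labels y* on D_j^*: labeled entries scaled by |D_j^*| / |D_j|, unlabeled entries set to 0."""
    n = y_labeled.shape[0]
    n_star = n + n_unlabeled
    return np.concatenate([(n_star / n) * y_labeled, np.zeros(n_unlabeled)])

def local_ridge(Phi, y, lam):
    """Ridge solution with the 1/n-normalised covariance and cross-covariance terms."""
    n, M = Phi.shape
    return np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(M), Phi.T @ y / n)

rng = np.random.default_rng(0)
n, n_unlabeled, M = 50, 150, 10
Phi_labeled = rng.standard_normal((n, M))               # stand-in for phi_M on the labeled inputs
Phi_unlabeled = rng.standard_normal((n_unlabeled, M))   # stand-in for phi_M on the unlabeled inputs
y = rng.standard_normal(n)

Phi_star = np.vstack([Phi_labeled, Phi_unlabeled])
y_star = pad_labels(y, n_unlabeled)
w_star = local_ridge(Phi_star, y_star, lam=0.1)

# The label-dependent term is unchanged by the padding; only the covariance uses all inputs.
cross_labeled = Phi_labeled.T @ y / n
cross_star = Phi_star.T @ y_star / (n + n_unlabeled)
```

In this sketch `cross_labeled` and `cross_star` agree up to floating-point error, so the unlabeled inputs only affect the regularised covariance matrix that is inverted.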

Theorem 4.

Under the same assumptions as Theorem 3, if $r\in(0,1]$ , $\gamma\in[0,1]$ , $2r+2\gamma\geq 1$ and $\lambda=N^{-\frac{1}{2r+\gamma}},$ then the total number of samples satisfying

$$N^{*}\gtrsim N\,N^{\frac{\gamma+\alpha-1}{2r+\gamma}}\vee N,$$

the number of local processors satisfying

$$1\lesssim m\lesssim N^{\frac{2r+2\gamma-1}{2r+\gamma}}$$

and the number of random features $M$ satisfying

[TABLE]

are sufficient to guarantee, with a high probability, that

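$$\mathbb{E}~\mathcal{E}(\widehat{f}_{D^{*},\lambda}^{M})-\mathcal{E}(f_{\rho})=\mathcal{O}\Big(N^{-\frac{2r}{2r+\gamma}}\Big).$$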

To the best of our knowledge, we prove for the first time that the number of partitions can reach $m\lesssim N^{\frac{2r+2\gamma-1}{2r+\gamma}}$ , whereas the constraint on $m$ in existing work [10, 22] is $m\lesssim N^{\frac{2r+\gamma-1}{2r+\gamma}}$ . Thus, many more partitions are allowed in distributed KRR methods. Relaxing the condition on the partition number $m$ not only leads to better computational efficiency but also covers more difficult problems: the applicable region is enlarged from $2r+\gamma\geq 1$ to $2r+2\gamma\geq 1$ . Figure 4 reveals the advantages of DKRR-RF with unlabeled data. Theorem 4 provides not only the largest applicable area $2r+2\gamma\geq 1$ but also the highest computational efficiency, owing to the larger number of partitions.
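The partition bounds quoted so far can be compared numerically. The sketch below returns the exponent $a$ in $m\lesssim N^{a}$ for Theorems 1, 2 and 4, or `None` where the stated condition on $(r,\gamma)$ fails. It ignores hidden constants and the additional requirements of Theorem 4 (Assumption 5 and the condition on $N^{*}$), and the function name `partition_exponents` is hypothetical.

```python
def partition_exponents(r, gamma):
    """Exponent a in m <~ N^a for each result; None where the stated condition fails."""
    s = 2.0 * r + gamma
    return {
        "theorem_1": (2.0 * r - 1.0) / s if r >= 0.5 else None,          # needs r >= 1/2
        "theorem_2": (s - 1.0) / s if s >= 1.0 else None,                # needs 2r + gamma >= 1
        "theorem_4": (s + gamma - 1.0) / s if s + gamma >= 1.0 else None,  # needs 2r + 2 gamma >= 1
    }

general = partition_exponents(r=0.5, gamma=1.0)
hard = partition_exponents(r=0.25, gamma=0.4)
```

For the general case $(r=1/2,\gamma=1)$ the exponents are $0$, $1/2$ and $1$ respectively. For the harder problem $(r=1/4,\gamma=0.4)$, where $2r+\gamma=0.9<1$, only the bound with unlabeled data applies, with exponent $1/3$.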

Remark 5.

From the error decomposition, there are two error terms related to the number of partitions $m$ : sample variance and empirical error. Sample variance depends on the number of labeled samples $n$ , while empirical error is input-dependent but output-independent; thus, it is related to the number of total samples $n^{*}$ . Meanwhile, the similarity between empirical and expected covariance operators $\|\widehat{C}_{M,\lambda}^{-1/2}C_{M,\lambda}^{1/2}\|$ is also label-free, and thus it is related to the total sample size $n^{*}$ rather than $n$ . To achieve the optimal learning rates, we consider the constraints on both the required labeled samples $n$ and the total samples $n^{*}$ . Considering both conditions for supervised learning $m=N/n$ and semi-supervised learning $m=N^{*}/n^{*}$ , we then obtain two constraints on the number of partitions $m$ and consolidate them together.

4 Compared with Related Work

The existing optimal learning guarantees of KRR [13], KRR-DC [14, 15] and KRR-RF [9, 22] only apply to the attainable case $r\in[1/2,1]$ . In this paper, we apply the optimal generalization error bounds to the non-attainable case $r\in(0,1/2)$ with some restrictions, including $2r+\gamma\geq 1$ in Theorem 2 and $2r+2\gamma\geq 1$ in Theorem 4. Using refined estimation, we extend the random features error to the non-attainable case.

4.1 Applicable Area from $r\in[1/2,1]$ to $2r+\gamma\geq 1$

The key to obtaining the optimal learning rates with integral-operator approach is to bound the identity $\|(\widehat{C}_{M}+\lambda I)^{-1/2}(C_{M}+\lambda I)^{1/2}\|$ as a constant, where $C_{M}$ and $\widehat{C}_{M}$ are the expected and empirical covariance operators defined in Definition 4. In conventional distributed KRR [4, 28], they estimated the operator difference after first order (or second order) decomposition

[TABLE]

To bound the identity as a constant, the local sample size should be large enough, $n\geq\frac{\mathcal{N}(\lambda)}{\lambda}$ . It then holds that $m\lesssim N^{\frac{2r-1}{2r+\gamma}}$ for KRR-DC, which only applies to $r\geq 1/2$ . In contrast, this paper directly estimates the identity as a whole (rather than in parts after decomposition) based on concentration inequalities for self-adjoint operators and obtains

[TABLE]

To bound the identity as a constant, the local sample size only needs to satisfy $n\geq\frac{1}{\lambda}$ , which is smaller than the requirement in [14] by a factor of $\mathcal{N}(\lambda)$ . Therefore, our estimation of $\|(C_{M}+\lambda I)^{-1/2}(\widehat{C}_{M}+\lambda I)^{1/2}\|$ in Theorem 2 is $\sqrt{\mathcal{N}(\lambda)}$ tighter than that in [14]. To bound the identity as a constant, we then have $m\lesssim N^{\frac{2r+\gamma-1}{2r+\gamma}}$ , which is the key to obtaining more partitions and extends the optimal learning guarantees to the non-attainable case $2r+\gamma\geq 1$ .
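The quantity being bounded can be checked numerically on small matrices. The sketch below is an illustration rather than part of the proof: it computes $\|(\widehat{C}_{M}+\lambda I)^{-1/2}(C_{M}+\lambda I)^{1/2}\|$ for finite-dimensional symmetric positive semi-definite matrices, using the fact that its square equals the largest eigenvalue of $(\widehat{C}_{M}+\lambda I)^{-1}(C_{M}+\lambda I)$ . The function name `similarity_norm` is hypothetical.

```python
import numpy as np


def similarity_norm(C_hat, C, lam):
    """Operator norm of (C_hat + lam*I)^{-1/2} (C + lam*I)^{1/2}.

    C_hat and C are symmetric positive semi-definite (M x M) matrices
    and lam > 0. The squared norm is the top eigenvalue of
    (C_hat + lam*I)^{-1} (C + lam*I), which is similar to a symmetric
    positive definite matrix and hence has real positive eigenvalues.
    """
    dim = C.shape[0]
    C_hat_lam = C_hat + lam * np.eye(dim)
    C_lam = C + lam * np.eye(dim)
    eigvals = np.linalg.eigvals(np.linalg.solve(C_hat_lam, C_lam))
    return float(np.sqrt(eigvals.real.max()))


# If the empirical operator equals the expected one, the norm is 1.
C = np.diag([1.0, 0.5, 0.1])
print(similarity_norm(C, C, lam=0.01))
```

When the empirical covariance is a poor estimate (too few local samples relative to $1/\lambda$ ), this value grows, which is what the sample-size condition rules out.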

4.2 Applicable Area from $2r+\gamma\geq 1$ to $2r+2\gamma\geq 1$

Only sample variance is dependent on the labeled samples, while other error terms involving the estimate of $\|(C+\lambda I)^{-1/2}(\widehat{C}_{M}+\lambda I)^{1/2}\|$ are label-free. Thus, there are two restrictions on the number of partitions $m$ : sample variance (label-dependent) and the estimate of $\|(C+\lambda I)^{-1/2}(\widehat{C}_{M}+\lambda I)^{1/2}\|$ (label-free).

As shown in the proof of Theorem 3, the global sample variance (label-dependent) can be estimated

[TABLE]

To achieve the optimal learning rates $\mathcal{O}(N^{\frac{-2r}{2r+\gamma}})$ , the number of partitions should satisfy $m\lesssim N^{\frac{2r+2\gamma-1}{2r+\gamma}}$ . Then, we utilize additional unlabeled samples to relax the condition on the estimate of $\|(C+\lambda I)^{-1/2}(\widehat{C}_{M}+\lambda I)^{1/2}\|$ . Using Assumption 5, one can further relax the condition of $m$ due to

[TABLE]

To guarantee the key quantity $\|(C+\lambda I)^{-1/2}(\widehat{C}_{M}+\lambda I)^{1/2}\|$ be a constant, we have $m\lesssim\lambda^{\alpha}N^{*}=\mathcal{O}(N^{*}N^{\frac{-\alpha}{2r+\gamma}})$ . We then consider the dominant constraints:

•

The case $\alpha<1-\gamma$ . It holds $2r+2\gamma-1<2r+\gamma-\alpha$ , thus the number of partitions satisfies $m\lesssim N^{\frac{2r+2\gamma-1}{2r+\gamma}}$ .

•

The case $\alpha\geq 1-\gamma$ . It holds $\gamma+\alpha-1\geq 0$ and we make use of additional unlabeled examples $N^{*}\gtrsim NN^{\frac{\gamma+\alpha-1}{2r+\gamma}}$ to guarantee $m\lesssim N^{\frac{2r+\gamma-\alpha}{2r+\gamma}}\leq N^{\frac{2r+2\gamma-1}{2r+\gamma}}$ .
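The two cases above can be sketched numerically. On one reading of the argument, the label-dependent constraint gives $m\lesssim N^{\frac{2r+2\gamma-1}{2r+\gamma}}$ and the label-free constraint gives $m\lesssim\lambda^{\alpha}N^{*}$ with $\lambda=N^{-\frac{1}{2r+\gamma}}$ ; the admissible $m$ is the smaller of the two. The function name `unlabeled_requirement` is hypothetical, and all constants are set to one.

```python
def unlabeled_requirement(N, r, gamma, alpha):
    """Return (N_star, m_max) with constants ignored.

    N_star: total sample size needed so that the label-free constraint
            m <~ lambda**alpha * N_star does not dominate.
    m_max:  the resulting upper bound on the number of partitions.
    """
    denom = 2.0 * r + gamma
    variance_bound = N ** ((2.0 * r + 2.0 * gamma - 1.0) / denom)
    if alpha < 1.0 - gamma:
        n_star = float(N)  # labeled data alone already suffice
    else:
        n_star = N * N ** ((gamma + alpha - 1.0) / denom)
    label_free_bound = n_star * N ** (-alpha / denom)
    return n_star, min(variance_bound, label_free_bound)


n_star, m_max = unlabeled_requirement(N=10_000, r=0.5, gamma=0.5, alpha=1.0)
print(n_star, m_max)
```

With $\alpha=1$ , $r=\gamma=1/2$ and $N=10^{4}$ , the sketch asks for roughly $N^{4/3}$ total samples to allow about $N^{2/3}$ partitions.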

4.3 Random Features Error in the Non-attainable Case

Using appropriate decomposition on operatorial level, we derive the random features error for both attainable and non-attainable case, where the dimension of random features should satisfy $M\gtrsim N^{\frac{\gamma}{2r+\gamma}}$ for the non-attainable case $r\in(0,1/2)$ . The extension from the attainable case to the non-attainable case is non-trivial, where the non-attainable case requires refined estimations for operators similarity.

The operatorial definitions of the intermediate estimators $\widetilde{f}_{D_{j},\lambda}^{M}$ , $f_{\lambda}^{M}$ and $f_{\lambda}$ in Lemma 1 involve the true regression $f_{\rho}$ , where $f_{\rho}=L^{r}g$ (under Assumption 4) depends on the range of $r$ . Therefore, we estimate the last three error terms (empirical error $\|\widetilde{f}_{D_{j},\lambda}^{M}-{f}_{\lambda}^{M}\|$ , random features error $\|{f}_{\lambda}^{M}-{f}_{\lambda}\|$ and approximation error $\|{f}_{\lambda}-f_{\rho}\|$ ), which involve $\widetilde{f}_{D_{j},\lambda}^{M}$ , $f_{\lambda}^{M}$ and $f_{\lambda}$ , for the non-attainable case. Meanwhile, because the empirical error satisfies $\|\widetilde{f}_{D_{j},\lambda}^{M}-{f}_{\lambda}^{M}\|\leq(\sqrt{2}+2)\left(\|f_{\lambda}^{M}-f_{\lambda}\|+\|f_{\lambda}-f_{\rho}\|\right)$ and the approximation error $\|f_{\lambda}-f_{\rho}\|$ naturally applies to the non-attainable case, only the random features error $\|{f}_{\lambda}^{M}-{f}_{\lambda}\|$ needs to be estimated specifically for the non-attainable case.

5 Experiments

To validate the theoretical findings, we conduct experiments on both simulated data and real-world data. In the numerical experiments, we study the computational and statistical tradeoffs of DKRR-RF, KRR-DC, KRR-RF, and KRR. In real-world experiments, we first explore the effectiveness of data-dependent random features and additional unlabeled samples on a small real-world dataset. Then, we compare the statistical performance of DKRR-RF, KRR-DC, and KRR-RF on three large-scale real-world datasets w.r.t. the number of random features $M$ and the number of partitions $m$ .

5.1 Numerical Experiments (for Theorem 2)

In this section, to validate our theoretical findings, we perform experiments on simulated data. From Theorem 2, we find that the learning rates become slower as the ratio $\frac{\gamma}{r}$ increases, which is

[TABLE]

As the ratio $\gamma/r$ increases, the hardness of the problem increases. Thus, given a fixed $\gamma$ , a smaller $r$ leads to a slower convergence rate of the generalization error bounds. As $r$ decreases from $1$ to near zero, the learning rates are in the range $N^{(0,\frac{-2}{2+\gamma}]}$ . Inspired by the numerical experiments in [9, 32], we introduce the spline kernel of order $q\geq 2$ ; see [33] (Eq. 2.1.7) for more details

[TABLE]

More importantly, the spline kernels naturally construct random features for any $q,q^{\prime}\in\mathbb{R}$

[TABLE]

Using the following settings, we perform experiments on both easy and difficult problems

1. Input distribution: $\mathcal{X}=[0,1]$ and $\rho_{X}$ is the uniform distribution.

2. Output distribution: the target function $f_{*}(\boldsymbol{x})=\Lambda_{\frac{r}{\gamma}+\frac{1}{2}}(\boldsymbol{x},0)$ with a variance $\epsilon^{2}$ .

3. Kernel and random features: $K(\boldsymbol{x},\boldsymbol{x}^{\prime})=\Lambda_{\frac{1}{\gamma}}(\boldsymbol{x},\boldsymbol{x}^{\prime})$ . According to (3) and (13), $\psi(\boldsymbol{x},\omega_{i})=\Lambda_{\frac{1}{2\gamma}}(\boldsymbol{x},\omega_{i})$ with $\omega_{i}$ sampled i.i.d. from the uniform distribution $U[0,1].$ The random features of the spline kernel are

[TABLE]

Then, the conditions used in Theorem 2 are satisfied [9], including Assumptions 3 and 4 with $\alpha=1$ and no unlabeled data. As shown in Figure 5 (a), a smaller ratio $\gamma/r$ leads to a smoother curve, which corresponds to an easier problem. We explore regression problems of different difficulties by varying the settings of $r$ and $\gamma$ .
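The simulation setting can be sketched in a few lines. The code below assumes the Fourier-series form $\Lambda_{q}(x,x^{\prime})=1+2\sum_{k\geq 1}\cos(2\pi k(x-x^{\prime}))/k^{q}$ , which is consistent with $\Lambda_{\infty}(\boldsymbol{x},0)=1+2\cos(2\pi\boldsymbol{x})$ used for the easy problem but is not copied from the elided display equations; the series is truncated at `n_terms`, and the function names are hypothetical. It checks that averaging $\psi(x,\omega)\psi(x^{\prime},\omega)$ with $\psi=\Lambda_{\frac{1}{2\gamma}}$ over $\omega\in[0,1]$ recovers $\Lambda_{\frac{1}{\gamma}}(x,x^{\prime})$ .

```python
import numpy as np


def spline_kernel(x, z, q, n_terms=50):
    """Truncated series 1 + 2 * sum_{k=1}^{n_terms} cos(2*pi*k*(x - z)) / k**q."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    k = np.arange(1, n_terms + 1, dtype=float)
    series = np.cos(2.0 * np.pi * np.multiply.outer(diff, k)) / k**q
    return 1.0 + 2.0 * series.sum(axis=-1)


def rf_kernel_estimate(x, z, gamma, omegas, n_terms=50):
    """Average of psi(x, w) * psi(z, w) over omegas, psi = Lambda_{1/(2*gamma)}."""
    q_half = 1.0 / (2.0 * gamma)
    psi_x = spline_kernel(x, omegas, q_half, n_terms)
    psi_z = spline_kernel(z, omegas, q_half, n_terms)
    return float(np.mean(psi_x * psi_z))


gamma = 0.5
# A uniform grid with more than 2 * n_terms points integrates the truncated
# series exactly; i.i.d. draws from U[0, 1] would give a Monte Carlo estimate.
omega_grid = np.arange(256) / 256.0
approx = rf_kernel_estimate(0.3, 0.7, gamma, omega_grid)
exact = float(spline_kernel(0.3, 0.7, 1.0 / gamma))
print(approx, exact)
```

Replacing `omega_grid` with `np.random.default_rng(0).uniform(size=M)` turns the exact quadrature into the random features approximation with $M$ features.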

According to the target regression $f_{*}(\boldsymbol{x})=\Lambda_{\frac{r}{\gamma}+\frac{1}{2}}(\boldsymbol{x},0)$ and a variance $\epsilon^{2}=0.01$ , the training data is generated with various sample size $N\in\{1000,2000,\cdots,10000\}$ and $10000$ samples for testing. To study the difference between the simulated excess risk and the theoretical excess risk, we repeat the data generating and the training $10$ times and estimate the averaged excess risk on the testing data. On each training, we perform DKRR-RF $\widehat{f}_{D,\lambda}^{M}$ (7), KRR-DC [14], KRR-RF [9] and KRR [13] by evaluating both statistical performance (mean square error, MSE) and computational costs (training time). Meanwhile, according to Theorem 2, we set $\lambda=N^{-\frac{1}{2r+\gamma}}$ , $M=\widetilde{C}N^{\frac{(2r-1)\gamma+1}{2r+\gamma}}\vee N^{\frac{1}{2r+\gamma}}$ and $m=N^{\frac{2r+\gamma-1}{2r+\gamma}}/\widetilde{C}$ , where $\widetilde{C}$ is an estimation of the constant $32\kappa^{2}\log(2/\delta)$ .
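The parameter choices above can be written as a small helper. This is a sketch under two assumptions: the expression for $M$ is parsed as $\max\big(\widetilde{C}N^{\frac{(2r-1)\gamma+1}{2r+\gamma}},N^{\frac{1}{2r+\gamma}}\big)$ , and the default value of the constant `C` (standing in for $\widetilde{C}$ ) is a placeholder rather than a value from the experiments. The function name `theorem2_settings` is hypothetical.

```python
def theorem2_settings(N, r, gamma, C=1.0):
    """Return (lam, M, m) following the settings quoted from Theorem 2.

    lam = N^{-1/(2r+gamma)}
    M   = max(C * N^{((2r-1)*gamma+1)/(2r+gamma)}, N^{1/(2r+gamma)})
    m   = N^{(2r+gamma-1)/(2r+gamma)} / C
    M and m are rounded to the nearest integer, and m is at least 1.
    """
    denom = 2.0 * r + gamma
    if denom <= 0:
        raise ValueError("2r + gamma must be positive")
    lam = N ** (-1.0 / denom)
    M = max(C * N ** (((2.0 * r - 1.0) * gamma + 1.0) / denom),
            N ** (1.0 / denom))
    m = N ** ((2.0 * r + gamma - 1.0) / denom) / C
    return lam, int(round(M)), max(1, int(round(m)))


# General problem (r = 1/2, gamma = 1) with N = 10000 training samples.
print(theorem2_settings(10_000, r=0.5, gamma=1.0))
```

For the general problem with $N=10^{4}$ and `C = 1`, this gives $\lambda=10^{-2}$ and on the order of $\sqrt{N}$ random features and partitions.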

5.1.1 Easy Problem

The easy problem with learning rate $\mathcal{O}(N^{-1})$ is given by setting $(r=1,\gamma=0)$ , where the target function is $f_{*}(\boldsymbol{x})=\Lambda_{\infty}(\boldsymbol{x},0)=1+2\cos(2\pi\boldsymbol{x})$ , which yields the smooth curve in Figure 5 (b). Figure 5 (b) illustrates that the problem is easy: a small number of training samples ( $N=100$ ) is enough to fit the target curve perfectly.

The left of Figure 6 shows the empirical learning rate of DKRR-RF $\mathcal{O}(N^{-0.99})$ is very close to the theoretical rate $\mathcal{O}(N^{-1})$ for the benign case $(r=1,\gamma=0)$ . From the middle of Figure 6, we find that the empirical MSE of KRR, KRR-DC, KRR-RF, and DKRR-RF are the same and extremely small, where the target problem is easy and even the approximate methods achieve the same optimal learning performance as the exact KRR.

5.1.2 General Problem

The general problem with learning rate $\mathcal{O}(N^{-1/2})$ is given by setting $(r=1/2,\gamma=1)$ , which is regarded as the worst case for the attainable case $r\in[1/2,1]$ and is well studied in [9, 22]. The target function is $f_{*}(\boldsymbol{x})=\Lambda_{1}(\boldsymbol{x},0).$ Figure 5 (c) shows that the curve becomes sharp and thus the problem is of medium difficulty. A few samples $(N=100)$ yield a noisy fit, and more samples $(N=1000)$ are needed to achieve a perfect fit.

The left of Figure 7 demonstrates the empirical error of DKRR-RF converges at $\mathcal{O}(N^{-0.41})$ near the expected rate $\mathcal{O}(N^{-0.5})$ . The comparison of MSE in the middle of Figure 7 shows the errors of DKRR-RF are mainly due to more partitions rather than random features. The empirical performance for the general problem in Figure 7 (b) is much worse than the easy problem in Figure 6. The gap of test errors between distributed methods and centralized methods is negligible. The right of Figure 7 shows the training time of DKRR-RF is higher than KRR-DC when more random features are used.

5.1.3 Difficult Problem

The difficult problem with the learning rate $\mathcal{O}(1)$ is given by setting $(r=0,\gamma=1)$ , which is almost unlearnable. According to Figure 5 (d), $(r=0,\gamma=1)$ yields a difficult problem where the curve steepens rapidly near $0$ or $1$ . Even a large number $(N=1000)$ of training samples is unable to fit the curve perfectly.

From the left of Figure 8, we can see that the learning rate is near $\mathcal{O}(1)$ , so the target function is hard to learn. The middle and right of Figure 8 illustrate that errors and training time are similar for KRR, KRR-RF, KRR-DC, and DKRR-RF. Compared with the test errors of the easy problem in Figure 6 (b) and the general problem in Figure 7 (b), the MSE of the difficult problem in Figure 8 (b) is much higher and the performance gap between distributed learning and centralized learning is significant. Therefore, distributed learning approaches are not suitable for difficult problems when $r$ approaches zero, which coincides with the theoretical findings in Theorem 2.
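The theoretical rates quoted for the three simulated problems follow from the exponent $\frac{-2r}{2r+\gamma}$ , and an empirical rate such as $\mathcal{O}(N^{-0.99})$ can be read off as the slope of a log-log fit. The sketch below shows both steps; the function names are hypothetical and the data in the usage example are synthetic, not the results reported in the figures.

```python
import numpy as np


def theoretical_exponent(r, gamma):
    """Exponent e of the optimal rate O(N^e), e = -2r / (2r + gamma)."""
    return -2.0 * r / (2.0 * r + gamma)


def empirical_exponent(sample_sizes, errors):
    """Least-squares slope of log(error) against log(N)."""
    slope, _intercept = np.polyfit(np.log(sample_sizes), np.log(errors), 1)
    return float(slope)


# Easy, general, and difficult settings used in the simulations.
for r, gamma in [(1.0, 0.0), (0.5, 1.0), (0.0, 1.0)]:
    print((r, gamma), theoretical_exponent(r, gamma))

# Synthetic errors decaying exactly like N^{-1/2}, for illustration only.
Ns = np.arange(1000, 10001, 1000)
print(empirical_exponent(Ns, 3.0 * Ns ** -0.5))
```

The three printed exponents are $-1$ , $-1/2$ and $0$ , matching the rates $\mathcal{O}(N^{-1})$ , $\mathcal{O}(N^{-1/2})$ and $\mathcal{O}(1)$ stated for the easy, general, and difficult problems.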

5.2 Influence of data-dependent random features (for Theorem 3)

Inspired by the leverage weighted random Fourier features [34, 20, 35], we propose leverage weighted random features (not just for shift-invariant kernels). Based on (4), the data-dependent random features are defined as

[TABLE]

where $q(\boldsymbol{w}_{i})=\tilde{l}_{\lambda}(\boldsymbol{w}_{i})/D_{\mathbf{K}}^{\lambda}$ . Using the ideal matrix $\boldsymbol{y}\boldsymbol{y}^{\top}$ to replace the kernel matrix, we obtain the following leverage score function

[TABLE]

where $\mathbf{z}_{\boldsymbol{w}_{i}}({\boldsymbol{X}})=1/\sqrt{L}\big{[}\psi(\boldsymbol{x}_{1},\boldsymbol{w}_{i}),\cdots,\psi(\boldsymbol{x}_{n},\boldsymbol{w}_{i})\big{]}^{\top}$ and $L\geq M$ . Removing the data-independent terms, there holds

[TABLE]

and

[TABLE]

The time complexity of generating data-dependent random features is $\mathcal{O}(NM^{2})$ on a global machine, and it can be further reduced by computing on local machines as part of data preprocessing. We then re-sample $M$ features from $\{\boldsymbol{w}_{i}\}_{i=1}^{L}$ using the multinomial distribution given by $p(\boldsymbol{w}_{i})/q(\boldsymbol{w}_{i})=[\boldsymbol{y}^{\top}\mathbf{z}_{\boldsymbol{w}_{i}}({\boldsymbol{X}})]^{2}/\sum_{i=1}^{M}[\boldsymbol{y}^{\top}\mathbf{z}_{\boldsymbol{w}_{i}}({\boldsymbol{X}})]^{2}$ . Finally, using (14), we compute the data-dependent random features.
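The re-sampling step can be sketched as follows. The code only covers the score computation and the multinomial draw; it does not reproduce the reweighting in (14). It assumes cosine features $\psi(\boldsymbol{x},\boldsymbol{w})=\cos(\boldsymbol{w}^{\top}\boldsymbol{x}+b)$ as an example of $\psi$ , and it normalizes the scores over the whole pool of $L$ candidates so that they form a probability vector, which is an assumption about the intended normalization. The function name `resample_features` is hypothetical.

```python
import numpy as np


def resample_features(X, y, W, b, M, rng):
    """Draw M of the L candidate features with data-dependent probabilities.

    X: (n, d) inputs, y: (n,) labels,
    W: (L, d) candidate frequencies, b: (L,) candidate phases.
    Returns the selected indices and the sampling probabilities.
    """
    L = W.shape[0]
    # Column i is z_{w_i}(X) = [psi(x_1, w_i), ..., psi(x_n, w_i)]^T / sqrt(L).
    Z = np.cos(X @ W.T + b) / np.sqrt(L)
    scores = (y @ Z) ** 2          # [y^T z_{w_i}(X)]^2 for each candidate
    probs = scores / scores.sum()  # normalized over the L candidates
    idx = rng.choice(L, size=M, replace=True, p=probs)
    return idx, probs


rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = rng.standard_normal(50)
W = rng.standard_normal((40, 3))
b = rng.uniform(0.0, 2.0 * np.pi, size=40)
idx, probs = resample_features(X, y, W, b, M=10, rng=rng)
print(idx)
```

Computing the scores costs one matrix product of size $n\times L$ , so on each local machine it can be folded into preprocessing as the text suggests.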

To validate Theorem 3, we compare the empirical performance of the following methods:

•

Leverage RF with $m=1$ : the proposed approximate leverage weighted random features (14) without distributed learning, similar to [35].

•

Leverage RF with $m=10$ : the proposed approximate leverage weighted random features (14) with $10$ partitions, a.k.a. DKRR-RF in Theorem 3.

•

Plain RF with $m=1$ : the exact random features with Monte Carlo sampling (4), which is KRR-RF given in [9].

•

Plain RF with $m=10$ : the exact random features with Monte Carlo sampling (4) with $10$ partitions, namely DKRR-RF defined in Theorem 2.
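For readers who want to reproduce the comparison, a minimal sketch of the divide-and-conquer estimator on a fixed feature map is given below. It assumes the usual closed form of ridge regression on each partition, $(\widehat{C}_{M}+\lambda I)^{-1}\widehat{S}_{M}^{*}\widehat{\boldsymbol{y}}$ with the $1/n$ normalization, followed by a plain average of the local solutions; the same feature map is shared by all partitions. The function names are hypothetical, and the feature matrix `Z` may come from either plain or leverage weighted random features.

```python
import numpy as np


def local_ridge(Z, y, lam):
    """Solve (Z^T Z / n + lam * I) w = Z^T y / n on one partition."""
    n, M = Z.shape
    A = Z.T @ Z / n + lam * np.eye(M)
    return np.linalg.solve(A, Z.T @ y / n)


def dkrr_rf_fit(Z, y, m, lam):
    """Average the local ridge solutions over m (nearly) equal partitions."""
    parts = np.array_split(np.arange(Z.shape[0]), m)
    weights = [local_ridge(Z[idx], y[idx], lam) for idx in parts]
    return np.mean(weights, axis=0)


rng = np.random.default_rng(0)
Z = rng.standard_normal((400, 5))   # rows play the role of phi_M(x_i)
w_true = np.arange(1.0, 6.0)
y = Z @ w_true                      # noise-free targets for illustration
print(dkrr_rf_fit(Z, y, m=10, lam=1e-8))
```

Setting `m=1` recovers centralized learning on the same features, which corresponds to the $m=1$ baselines in the list above.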

For different values of $M$ , we run the different random feature generating algorithms on the EEG dataset over $10$ trials to evaluate the test accuracies, the time costs for generating random features, and the time costs for training. Figure 9 reports the mean and one standard deviation of these three quantities versus different settings of $M=d\times 2^{\{0,\cdots,7\}}$ . We use the Gaussian kernel $K(\boldsymbol{x},\boldsymbol{x}^{\prime})=\exp(-{\|\boldsymbol{x}-\boldsymbol{x}^{\prime}\|^{2}}/{2\sigma^{2}})$ in the experiments, with the corresponding probability density function $p(\boldsymbol{w})=\mathcal{N}(0,1/\sigma^{2})$ . The kernel parameter $\sigma$ and the regularization parameter $\lambda$ are tuned via $5$ -fold cross-validation over grids of $2^{\{-5,-4.5,\cdots,5\}}$ and $2^{\{-7,-6,\cdots,3\}}$ , respectively. The statistical information of the dataset and the hyperparameter settings are reported in Table 1. From Figure 9, we find that:

1)

Centralized learning with data-dependent random features (blue) achieves the best accuracy but also leads to the highest computational costs for training the model on a centralized machine, while the generating times for centralized learning and distributed learning are the same.

2)

With a slight increase in random feature generating time in Figure 9 (b), the two data-dependent approaches (blue and red) bring significant improvements in classification accuracy as the number of random features increases in Figure 9 (a), which validates the effectiveness of data-dependent random features.

3)

The use of data-dependent features generating approaches does not sacrifice too much computational efficiency as shown in Figure 9 (b). Meanwhile, in Figure 9 (c), distributed learning dramatically improves training efficiency.

4)

At the cost of slightly more generating time and similar training time, data-dependent DKRR-RF (red) markedly surpasses data-independent DKRR-RF (purple), which reveals the advantage of Theorem 3 over Theorem 2.

Overall, data-dependent DKRR-RF achieves a good tradeoff on accuracy and efficiency, which coincides with the theoretical findings in Theorem 3.

5.3 Influence of Unlabeled Data (For Theorem 4)

To validate Theorem 4, we split the EEG dataset into three parts: $3000$ examples as the labeled training data, $7000$ examples as the unlabeled training data, and $4980$ examples as the test data, as listed in Table 1. We compare two methods:

•

Semi-supervised DKRR-RF (defined in Theorem 4) is constructed as (10), where the unlabeled examples are marked as zeros.

•

Supervised DKRR-RF (defined in Theorem 3) only uses the labeled samples.

In the following experiments, we fix the labeled sample size $N=3000$ , the number of random features $M=1000$ and the number of partitions $m=10$ . In Figure 10, we run the compared methods across $10$ trials and plot the mean and one standard deviation of test accuracies, time costs for random feature generating, and time costs for training under varying unlabeled sample sizes $N^{*}-N\in\{0,1000,\cdots,7000\}$ . Figure 10 illustrates that:

1)

The use of additional unlabeled samples improves the empirical performance of DKRR-RF. In other words, we can increase the number of partitions without losing accuracy by using additional unlabeled samples. It is consistent with the theoretical findings in Theorem 4 that additional unlabeled examples can relax the restriction on $m$ .

2)

Increasing the number of unlabeled examples aggravates the computational burden, so it is worth balancing the accuracy gains against the computational costs for different sizes of unlabeled data. Generating time is larger than training time and thus dominates the total time cost.

In Figure 11, we fix the unlabeled sample size as $N^{*}-N=7000$ and run the semi-supervised and supervised methods with different numbers of partitions $m\in\{10,20,\cdots,100\}$ . Figure 11 reports the mean and one standard deviation of test accuracies and training times under different partitions, which shows:

1)

Both test accuracies decrease as the number of partitions $m$ increases, and the accuracy gap between semi-supervised DKRR-RF and supervised DKRR-RF widens.

2)

The training times of both methods drop as the number of partitions increases, but there are no further significant computational gains once the number of partitions exceeds $50$ . The generating times of semi-supervised DKRR-RF and supervised DKRR-RF are almost constant and play a dominant role. Semi-supervised DKRR-RF takes much more time to generate data-dependent random features than supervised DKRR-RF.

3)

Semi-supervised DKRR-RF (Theorem 4) always provides better empirical performance than supervised DKRR-RF (Theorem 3), while their training times are similar when $m\geq 50$ .

Therefore, semi-supervised DKRR-RF achieves a good balance between the test accuracy and the training time at $m=50$ .

5.4 Large-scale Real Data

As listed in Table 1, we study the empirical performance of the DKRR-RF algorithm on three large-scale binary classification datasets: covtype (https://archive.ics.uci.edu/ml/datasets/covertype), SUSY (https://archive.ics.uci.edu/ml/datasets/susy) and HIGGS (https://archive.ics.uci.edu/ml/datasets/higgs). For the sake of comparison, we randomly sample $N=2.5\times 10^{5}$ data points as the training data and $5\times 10^{4}$ data points as the test data. We use random Fourier features [8] to approximate the Gaussian kernel $K(\boldsymbol{x},\boldsymbol{x}^{\prime})=\exp(-{\|\boldsymbol{x}-\boldsymbol{x}^{\prime}\|^{2}}/{2\sigma^{2}})$ . Random Fourier features are of the form $\psi(\boldsymbol{x},\omega)=\cos(\omega^{T}\boldsymbol{x}+b)$ , where $\omega$ is drawn from the corresponding Gaussian distribution and $b$ is drawn from the uniform distribution on $[0,2\pi]$ . In the following experiments, we tune the parameters $\sigma$ and $\lambda$ via $5$ -fold cross-validation over grids of $2^{\{-5,-4.5,\cdots,5\}}$ and $2^{\{-7,-6,\cdots,3\}}$ respectively for each dataset, and report average errors over 10 repetitions.
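A minimal sketch of this feature map is shown below. It assumes the standard normalization $\sqrt{2/M}$ , under which the inner product of two feature vectors is an unbiased estimate of the Gaussian kernel when $\omega\sim\mathcal{N}(0,\sigma^{-2}I)$ and $b\sim U[0,2\pi]$ ; the normalization used in the actual experiments is not stated in the text. The function names are hypothetical.

```python
import numpy as np


def sample_rff_params(d, M, sigma, rng):
    """Draw M frequencies from N(0, I / sigma^2) and M phases from U[0, 2*pi]."""
    W = rng.normal(0.0, 1.0 / sigma, size=(M, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=M)
    return W, b


def rff_map(X, W, b):
    """Map rows of X to sqrt(2 / M) * cos(W x + b)."""
    M = W.shape[0]
    return np.sqrt(2.0 / M) * np.cos(X @ W.T + b)


rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
sigma = 1.5
W, b = sample_rff_params(d=3, M=50_000, sigma=sigma, rng=rng)
Phi = rff_map(X, W, b)

# Compare with the exact Gaussian kernel matrix on the same points.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_exact = np.exp(-sq_dists / (2.0 * sigma**2))
print(np.abs(Phi @ Phi.T - K_exact).max())
```

The approximation error shrinks at the Monte Carlo rate $1/\sqrt{M}$ , which is why the classification error in the experiments flattens once $M$ passes a dataset-dependent threshold.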

The difficulties of those tasks are unknown ( $r$ and $\gamma$ are unknown), such that for each dataset, we evaluate the classification errors in two cases:

1)

For fixed $m=20$ and different $M\in[10,1500]$ , we compare the empirical performance of DKRR-RF and KRR-DC [14].

2)

For fixed $M=500$ and different $m\in[1,500]$ , we compare the empirical performance of DKRR-RF and KRR-RF [9].

To explore how the number of random features affects the classification accuracy, we fix the number of partitions as $20$ and vary the number of RFs. As shown in left plots of Figures 12, 13, 14, when the number of RFs is small, classification errors of DKRR-RF decrease dramatically as the number of RFs increases. However, when the number of features is more than certain thresholds, classification errors of DKRR-RF converge at some rate near the classification error of KRR-DC. The certain threshold is near $300$ for covtype, $100$ for SUSY, and $600$ for HIGGS. According to Theorem 4, the number of random features is around $M\gtrsim N^{\frac{1}{2r+\gamma}}$ , thus smaller thresholds represent smaller $1/(2r+\gamma)$ and lead to higher computational efficiency.

To study the influence of partitions on the accuracy, we fix the number of RFs as $M=500$ and increase the number of partitions $m$ . As demonstrated in the right plots of Figures 12, 13, 14, when the number of partitions is below a certain threshold, DKRR-RF provides classification accuracy close to that of KRR-RF. Beyond the threshold, errors increase quickly as the number of partitions increases. The threshold is near $m=50$ for covtype, $m=60$ for SUSY and $m=40$ for HIGGS. According to Theorem 4, the number of partitions is near $\mathcal{O}(NN^{\frac{\gamma-1}{2r+\gamma}})$ , thus a larger number of partitions $m$ corresponds to a larger $\frac{\gamma-1}{2r+\gamma}$ and lower computational costs.

Indeed, when $M$ and $m$ are set to the corresponding thresholds for each dataset, DKRR-RF achieves the best tradeoff between empirical performance and computational efficiency, for example $(M=300,m=50)$ for covtype, $(M=100,m=60)$ for SUSY, and $(M=600,m=40)$ for HIGGS. We expect that the optimal learning rates can still be achieved when the number of partitions is smaller than the corresponding threshold and the number of random features is larger than the corresponding threshold. In contrast, more partitions or fewer random features break the optimal statistical properties of DKRR-RF, and thus the performance drops very fast. Comparing the thresholds for $M$ and $m$ on those tasks, we find the following ordering of trainability: SUSY $>$ covtype $>$ HIGGS, which means that a good accuracy-efficiency tradeoff for DKRR-RF is easiest to obtain on SUSY.

6 Conclusion

This paper explores the generalization performance of kernel ridge regression with two commonly used efficient large-scale techniques: divide-and-conquer and random features. We first present a general result with the optimal learning rates under standard assumptions. We then refine the theoretical results with more partitions and applicability in the non-attainable case. Further, we reduce the number of random features by generating features in a data-dependent manner. Finally, we present the theoretical results that substantially relax the constraint on the number of partitions with extra unlabeled data, which apply to both the attainable case and non-attainable case. The proposed optimal theoretical guarantees are state-of-the-art in the theoretical analysis for KRR approaches. With extensive experiments on both simulated and real-world data, we validate our theoretical findings with experimental results.

This paper can be extended in several ways: (a) combining with gradient algorithms such as multi-pass SGD [36, 10] and preconditioned conjugate gradient [34] to further reduce the time complexity; (b) using asynchronous distributed methods or a few rounds of communication [21, 22] instead of the one-shot approach to alleviate the saturation phenomenon when $r\geq 1$ .

Acknowledgment

This work was supported in part by the Excellent Talents Program of Institute of Information Engineering, CAS, the Special Research Assistant Project of CAS, the Beijing Outstanding Young Scientist Program (No. BJJWZYJH012019100020098), and National Natural Science Foundation of China (No. 62076234, No. 62106257).

Appendix A Proofs

We denote by $\lVert{\cdot}\rVert$ the operator norm; when applied to functions in the estimates of error terms, $\|\cdot\|$ denotes the ${L^{2}_{\rho_{X}}}$ norm $\|\cdot\|_{\rho}$ . Moreover, we denote by $Q_{\lambda}$ the operator $Q+\lambda I$ , where $Q$ is a bounded self-adjoint linear operator, $\lambda\in\mathbb{R}$ and $I$ is the identity operator, so for example $\widehat{C}_{M,\lambda}:=(\widehat{C}_{M}+\lambda I)$ , $C_{M,\lambda}:=(C_{M}+\lambda I)$ , $L_{M,\lambda}:=(L_{M}+\lambda I)$ and $L_{\lambda}:=(L+\lambda I)$ , where the operators $\widehat{C}_{M},C_{M},L_{M},L$ are defined in Definition 3 and Definition 4. The estimates of error bounds are based on the local estimators (18) and (19) rather than global ones, so they are associated with the number of local samples $n$ , where $n=N/m$ .

A.1 Definitions of Linear Operators

Since KRR has closed-form solutions, we represent the intermediate estimators $\widehat{f}_{D,\lambda}^{M},\widetilde{f}_{D,\lambda}^{M},{f}_{\lambda}^{M},{f}_{\lambda}$ in error decomposition by the redirection operators and their adjoint operators. In this part, we first provide useful linear operators associated with kernel $K$ (Definition 3) and with random features $\phi_{M}$ (Definition 4), respectively. To bound the excess risk $\mathcal{E}(\widehat{f}_{D,\lambda}^{M})-\mathcal{E}(f_{\rho})$ , we present the closed-form solutions of the estimators used in Lemma 2 based on those operators, which can be estimated by the difference between integral operator $L$ and random features based covariance operator $\widehat{C}_{M}$ .

To clearly state the relationships among estimators in Lemma 2, we introduce linear operators (both expected and empirical) associated with the RKHS $\mathcal{H}$ induced by the kernel $K$ and the feature space $\mathbb{R}^{M}$ induced by the random features $\phi_{M}$ .

Definition 3 (Operators with kernel $K$ ).

For any $g\in{L^{2}_{\rho_{X}}}$ , $\beta\in\mathcal{H}$ and $\alpha\in\mathbb{R}^{n}$ , we have

•

$S_{\mathcal{H}}:\mathcal{H}\to{L^{2}_{\rho_{X}}},\quad(S_{\mathcal{H}}\beta)(\cdot)=\langle\beta,\phi(\cdot)\rangle$ .

•

$\widehat{S}_{\mathcal{H}}:\mathcal{H}\to\mathbb{R}^{n},\quad\widehat{S}_{\mathcal{H}}\beta=\frac{1}{\sqrt{n}}\big{(}\langle\beta,\phi(\boldsymbol{x}_{i})\rangle\big{)}_{i=1}^{n}\in\mathbb{R}^{n}$ .

•

$S_{\mathcal{H}}^{*}:{L^{2}_{\rho_{X}}}\to\mathcal{H},\quad S_{\mathcal{H}}^{*}g=\int_{X}\phi(\boldsymbol{x})g(\boldsymbol{x})d{\rho_{X}}(\boldsymbol{x})$ .

•

$\widehat{S}_{\mathcal{H}}^{*}:\mathbb{R}^{n}\to\mathcal{H},\quad\widehat{S}_{\mathcal{H}}^{*}\alpha=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\phi(\boldsymbol{x}_{i})\alpha_{i}\in\mathcal{H}$ .

•

$L:{L^{2}_{\rho_{X}}}\to{L^{2}_{\rho_{X}}},\,L=S_{\mathcal{H}}S_{\mathcal{H}}^{*},\,\text{such that}~{}~{}(Lg)(\cdot)=\int_{X}K(\cdot,\boldsymbol{x})g(\boldsymbol{x})d{\rho_{X}}(\boldsymbol{x})$ .

•

$\mathbf{K}:\mathbb{R}^{n}\to\mathbb{R}^{n},\quad\mathbf{K}=\widehat{S}_{\mathcal{H}}\widehat{S}_{\mathcal{H}}^{*},\quad\text{such that}~{}~{}\mathbf{K}=\frac{1}{n}\big{(}K(\boldsymbol{x}_{i},\boldsymbol{x}_{j})\big{)}_{i,j=1}^{n}$ .

•

$C:\mathcal{H}\to\mathcal{H},\quad C=S_{\mathcal{H}}^{*}S_{\mathcal{H}},\quad\text{such that}~{}~{}C\beta=\int_{X}\langle\beta,\phi(\boldsymbol{x})\rangle\phi(\boldsymbol{x})d{\rho_{X}}(\boldsymbol{x})$ .

•

$\widehat{C}:\mathcal{H}\to\mathcal{H},\quad\widehat{C}=\widehat{S}_{\mathcal{H}}^{*}\widehat{S}_{\mathcal{H}},\quad\text{such that}~{}~{}\widehat{C}\beta=\frac{1}{n}\sum_{i=1}^{n}\langle\beta,\phi(\boldsymbol{x}_{i})\rangle\phi(\boldsymbol{x}_{i})$ .

Here, we denote $S_{\mathcal{H}}$ the inclusion operator and $\widehat{S}_{\mathcal{H}}$ the sampling operator, while $S_{\mathcal{H}}^{*},\widehat{S}_{\mathcal{H}}^{*}$ are their adjoint operators. Note that $C:\mathcal{H}\to\mathcal{H}$ is the covariance operator given by $S_{\mathcal{H}}^{*}S_{\mathcal{H}}$ , and the integral operator $L:{L^{2}_{\rho_{X}}}\to{L^{2}_{\rho_{X}}}$ given by $S_{\mathcal{H}}S_{\mathcal{H}}^{*}$ . The kernel matrix $\mathbf{K}$ and the covariance matrix $\widehat{C}$ are the empirical counterparts of the integral operator $L$ and the covariance operator $C$ , respectively. Using Singular Value Decomposition shows that $L$ and $C$ have the same eigenvalues, and the corresponding eigenvectors are closely related [37]. A similar relationship holds for the kernel matrix $\mathbf{K}$ and the covariance matrix $\widehat{C}$ . Those kernels-related operators are widely used in the proof of optimal learning theory for standard KRR. Using Assumption 1, the integral operator $L$ and the covariance operator $C$ are positive trace class operators (and hence compact) and bounded by $\|L\|=\|C\|\leq\kappa^{2}.$ For any function $f\in\mathcal{H}$ , the estimator $f\in{L^{2}_{\rho_{X}}}$ is obtained by $f(\cdot)=\langle f,\phi(\cdot)\rangle_{\mathcal{H}}=S_{\mathcal{H}}f$ . Thus, the RKHS norm can be related to the ${L^{2}_{\rho_{X}}}$ -norm by $C^{1/2}$ [38]:

[TABLE]

Definition 4 (Operators with random features).

For any $g\in{L^{2}_{\rho_{X}}}$ , $\beta\in\mathbb{R}^{M}$ , $\alpha\in\mathbb{R}^{n}$ and $K_{M}(\boldsymbol{x},\boldsymbol{x}^{\prime})=\langle\phi_{M}(\boldsymbol{x}),\phi_{M}(\boldsymbol{x}^{\prime})\rangle$ , we have

•

$S_{M}:\mathbb{R}^{M}\to{L^{2}_{\rho_{X}}},\quad(S_{M}\beta)(\cdot)=\langle\phi_{M}(\cdot),\beta\rangle$ .

•

$\widehat{S}_{M}:\mathbb{R}^{M}\to\mathbb{R}^{n},\quad\widehat{S}_{M}\beta=\frac{1}{\sqrt{n}}\big{(}\langle\beta,\phi_{M}(\boldsymbol{x}_{i})\rangle\big{)}_{i=1}^{n}\in\mathbb{R}^{n}$ .

•

$S_{M}^{*}:{L^{2}_{\rho_{X}}}\to\mathbb{R}^{M},\quad S_{M}^{*}g=\int_{X}\phi_{M}(\boldsymbol{x})g(\boldsymbol{x})d{\rho_{X}}(\boldsymbol{x})$ .

•

$\widehat{S}_{M}^{*}:\mathbb{R}^{n}\to\mathbb{R}^{M},\quad\widehat{S}_{M}^{*}\alpha=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\phi_{M}(\boldsymbol{x}_{i})\alpha_{i}\in\mathbb{R}^{M}$ .

•

$L_{M}:{L^{2}_{\rho_{X}}}\to{L^{2}_{\rho_{X}}},\,L_{M}=S_{M}S_{M}^{*},\,\text{such that}~{}(L_{M}g)(\cdot)=\int_{X}K_{M}(\cdot,\boldsymbol{x})g(\boldsymbol{x})d{\rho_{X}}(\boldsymbol{x})$ .

•

$\mathbf{K}_{M}:\mathbb{R}^{n}\to\mathbb{R}^{n},\,\mathbf{K}_{M}=\widehat{S}_{M}\widehat{S}_{M}^{*},\,\text{such that}~{}\mathbf{K}_{M}=\frac{1}{n}\big{(}K_{M}(\boldsymbol{x}_{i},\boldsymbol{x}_{j})\big{)}_{i,j=1}^{n}$ .

•

$C_{M}:\mathbb{R}^{M}\to\mathbb{R}^{M},\,C_{M}=S_{M}^{*}S_{M},\,\text{such that}~{}C_{M}\beta=\int_{X}\langle\beta,\phi_{M}(\boldsymbol{x})\rangle\phi_{M}(\boldsymbol{x})d{\rho_{X}}(\boldsymbol{x})$ .

•

$\widehat{C}_{M}:\mathbb{R}^{M}\to\mathbb{R}^{M},\,\widehat{C}_{M}=\widehat{S}_{M}^{*}\widehat{S}_{M},\,\text{such that}~{}\widehat{C}_{M}\beta=\frac{1}{n}\sum_{i=1}^{n}\langle\beta,\phi_{M}(\boldsymbol{x}_{i})\rangle\phi_{M}(\boldsymbol{x}_{i})$ .

Similarly, we also define the redirection operators $S_{M},\widehat{S}_{M}$ and their adjoint operators $S_{M}^{*},\widehat{S}_{M}^{*}$ . Using the random features $\phi_{M}$ , the random-feature-based operators $S_{M},\widehat{S}_{M},S_{M}^{*}$ and $\widehat{S}_{M}^{*}$ are close to the kernel-based operators $S_{\mathcal{H}},\widehat{S}_{\mathcal{H}},S_{\mathcal{H}}^{*},\widehat{S}_{\mathcal{H}}^{*}$ . $L_{M}$ and $C_{M}$ are the integral operator and the covariance operator defined by random features on ${L^{2}_{\rho_{X}}}$ and $\mathbb{R}^{M}$ , respectively. The kernel matrix $\mathbf{K}_{M}$ is given by the kernel $K_{M}(\boldsymbol{x},\boldsymbol{x}^{\prime})=\langle\phi_{M}(\boldsymbol{x}),\phi_{M}(\boldsymbol{x}^{\prime})\rangle$ associated with random features $\phi_{M}:\mathcal{X}\to\mathbb{R}^{M}$ . Random features use Monte Carlo sampling to approximate the kernel with $M$ features, $K(\boldsymbol{x},\boldsymbol{x}^{\prime})\approx\langle\phi_{M}(\boldsymbol{x}),\phi_{M}(\boldsymbol{x}^{\prime})\rangle$ ; thus the operators $L,\mathbf{K},C,\widehat{C}$ are the expectation counterparts of $L_{M},\mathbf{K}_{M},C_{M},\widehat{C}_{M}$ with respect to the kernel probability density $\pi$ .
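The empirical operators of Definition 4 are ordinary matrices once the features are evaluated on a sample, which the following minimal sketch makes concrete. It assumes random Fourier features $\psi(\boldsymbol{x},\omega)=\sqrt{2}\cos(\omega^{\top}\boldsymbol{x}+b)$ for a Gaussian kernel; the sizes `n`, `d`, `M` and all variable names are illustrative choices, not taken from the paper. The sketch checks that $\mathbf{K}_{M}=\widehat{S}_{M}\widehat{S}_{M}^{*}$ and $\widehat{C}_{M}=\widehat{S}_{M}^{*}\widehat{S}_{M}$ share their nonzero eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, M = 50, 3, 20                      # illustrative sizes (assumptions)
X = rng.normal(size=(n, d))

# Random Fourier features (assumed form): psi(x, w) = sqrt(2) * cos(w.x + b)
W = rng.normal(size=(d, M))
b = rng.uniform(0.0, 2.0 * np.pi, size=M)
Phi = np.sqrt(2.0 / M) * np.cos(X @ W + b)   # row i is phi_M(x_i)

S_hat = Phi / np.sqrt(n)                 # matrix of the sampling operator R^M -> R^n
K_M = S_hat @ S_hat.T                    # n x n, entries K_M(x_i, x_j) / n
C_hat_M = S_hat.T @ S_hat                # M x M empirical covariance

# Nonzero spectra coincide: compare the top M eigenvalues of K_M with those of C_hat_M.
eig_K = np.sort(np.linalg.eigvalsh(K_M))[::-1][:M]
eig_C = np.sort(np.linalg.eigvalsh(C_hat_M))[::-1]
print(np.max(np.abs(eig_K - eig_C)))
```

Since $|\psi|\leq\sqrt{2}$ in this sketch, every eigenvalue of $\widehat{C}_{M}$ is at most $2$, which plays the role of $\kappa^{2}$ here.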

In Figure 15, we illustrate the relationships among the operators given in Definition 3 and Definition 4, where $L,L_{M}$ are self-adjoint integral operators on ${L^{2}_{\rho_{X}}}$ and $C,C_{M}$ are self-adjoint covariance operators on the RKHS $\mathcal{H}$ and $\mathbb{R}^{M}$ , respectively. Operators $\widehat{C},\widehat{C}_{M},\mathbf{K},\mathbf{K}_{M}$ are the corresponding empirical counterparts of $C,C_{M},L,L_{M}$ . To estimate the error terms in Lemma 2, we utilize those operators to represent the estimators $\widehat{f}_{D,\lambda}^{M},\widetilde{f}_{D,\lambda}^{M},f_{\lambda}^{M},f_{\lambda},f_{\rho}$ with closed-form solutions. As shown in Figure 15, we measure the excess risk of DKRR-RF by estimating the difference between the covariance matrix $\widehat{C}_{M}$ and the integral operator $L$ following the approximation chain $\widehat{C}_{M}\to C_{M}\to L_{M}\to L$ .

Remark 6.

Under Assumption 1, the integral operator $L$ is trace class [13] and $L_{M},C_{M},S_{M},\widehat{C}_{M},\widehat{S}_{M}$ are finite dimensional. Moreover, we have that $L_{M}=S_{M}S_{M}^{*}$ , $C_{M}=S_{M}^{*}S_{M}$ and $\widehat{C}_{M}=\widehat{S}_{M}^{*}\widehat{S}_{M}$ . Finally, $L,L_{M},C_{M},\widehat{C}_{M}$ are self-adjoint and positive operators, with spectrum contained in $[0,\kappa^{2}]$ .

To represent the noise-free estimator (15), we define a sampling operator $\bar{S}_{M}^{*}:{L^{2}_{\rho_{X}}}\to\mathbb{R}^{M}$ :

[TABLE]

A.2 Error decomposition

To decompose the excess risk of DKRR-RF $\mathcal{E}(\widehat{f}_{D,\lambda}^{M})-\mathcal{E}(f_{\rho})$ clearly, we provide some intermediate estimators $\widetilde{f}_{D_{j},\lambda}^{M}(\boldsymbol{x})=\langle\widetilde{w},\phi_{M}(\boldsymbol{x})\rangle$ , ${f}_{\lambda}^{M}(\boldsymbol{x})=\langle u,\phi_{M}(\boldsymbol{x})\rangle$ , and ${f}_{\lambda}(\boldsymbol{x})=\langle v,\phi(\boldsymbol{x})\rangle$ , where

[TABLE]

The local estimator $\widetilde{f}_{D_{j},\lambda}^{M}$ in (15) is the noise-free version of $\widehat{f}_{D,\lambda}^{M}$ , making up a global noise-free estimator $\widetilde{f}_{D,\lambda}^{M}=\frac{1}{m}\sum_{j=1}^{m}\widetilde{f}_{D_{j},\lambda}^{M}$ . The estimator ${f}_{\lambda}^{M}$ in (16) is a data-free version (expected version on $\rho$ ) of $\widehat{f}_{D,\lambda}^{M}$ , which is still an approximate estimator based on the random features mapping $\phi_{M}:\mathcal{X}\to\mathbb{R}^{M}$ . The last one, ${f}_{\lambda}$ in (17), is the expected version of primal KRR with the implicit feature mapping $\phi:\mathcal{X}\to\mathcal{H}$ associated with the kernel $K$ by $K(\boldsymbol{x},\boldsymbol{x}^{\prime})=\langle\phi(\boldsymbol{x}),\phi(\boldsymbol{x}^{\prime})\rangle$ . Using these estimators, we provide the following decomposition of the excess risk in equality form to further analyze the components of the error.

Lemma 1.

Using operators defined in Definition 3 and Definition 4, the intermediate estimators $\widehat{f}_{D_{j},\lambda}^{M}$ (6), $\widetilde{f}_{D_{j},\lambda}^{M}$ (15), $f_{\lambda}^{M}$ (16) and $f_{\lambda}$ (17) admit the following closed-form solutions:

[TABLE]

Here, $\widehat{y}_{n}=\frac{1}{\sqrt{n}}[y_{1},\cdots,y_{n}]^{T}$ represents the normalized labels of empirical samples.

Proof.

The objective of the local estimator is given in (5). Using the representation $\widehat{f}_{D_{j},\lambda}^{M}(\boldsymbol{x})=\langle\widehat{\boldsymbol{w}}_{j},\phi_{M}(\boldsymbol{x})\rangle$ , we set the derivative of the objective to zero and get the solution (6) for the local estimator, which is

[TABLE]

According to the definitions of operators in Definition 4, $\widehat{\boldsymbol{w}}_{j}=(\widehat{S}_{M}^{*}\widehat{S}_{M}+\lambda I)^{-1}\widehat{S}_{M}^{*}\widehat{y}_{n}$ . Using the definitions of $S_{M}$ and $\widehat{C}_{M}$ , we obtain

[TABLE]

According to the objective of $\widetilde{f}_{D_{j},\lambda}^{M}$ (15), we replace the noisy labels $\{y_{1},y_{2},\cdots,y_{n}\}$ with noise-free ones $\{f_{\rho}(\boldsymbol{x}_{1}),f_{\rho}(\boldsymbol{x}_{2}),\cdots,f_{\rho}(\boldsymbol{x}_{n})\}$ . Meanwhile, using $\bar{S}_{M}^{*}f_{\rho}=\frac{1}{n}\sum_{i=1}^{n}\phi_{M}(\boldsymbol{x}_{i})f_{\rho}(\boldsymbol{x}_{i})$ instead of $\widehat{S}_{M}^{*}\widehat{y}_{n}=\frac{1}{n}\sum_{i=1}^{n}\phi_{M}(\boldsymbol{x}_{i})y_{i}$ , it holds

[TABLE]

Let the derivative of (16) be zero, and we can get $f^{M}_{\lambda}(\boldsymbol{x})=\langle u,\phi_{M}(\boldsymbol{x})\rangle$ with

[TABLE]

Thus, it holds

[TABLE]

Similarly, using the operators related to the kernel $K$ in Definition 3, we set the derivative of (17) to zero and obtain $f_{\lambda}(\boldsymbol{x})=\langle v,\phi(\boldsymbol{x})\rangle$ with

[TABLE]

Thus, it holds

[TABLE]

∎
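As a numerical illustration of the closed form $\widehat{\boldsymbol{w}}_{j}=(\widehat{S}_{M}^{*}\widehat{S}_{M}+\lambda I)^{-1}\widehat{S}_{M}^{*}\widehat{y}_{n}$ , the sketch below compares it with the usual ridge regression normal equations and with the dual form through $\mathbf{K}_{M}$ . It assumes that the local objective (5) is the standard regularized least-squares problem $\frac{1}{n}\sum_{i}(\langle w,\phi_{M}(\boldsymbol{x}_{i})\rangle-y_{i})^{2}+\lambda\|w\|^{2}$ ; the feature matrix `Phi` is a random stand-in, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, M, lam = 40, 8, 0.1                   # illustrative sizes (assumptions)
Phi = rng.normal(size=(n, M))            # stand-in for the rows phi_M(x_i)
y = rng.normal(size=n)

S_hat = Phi / np.sqrt(n)                 # sampling operator with the 1/sqrt(n) scaling
y_hat = y / np.sqrt(n)                   # normalized labels
C_hat = S_hat.T @ S_hat

# Operator form of the local weights.
w_op = np.linalg.solve(C_hat + lam * np.eye(M), S_hat.T @ y_hat)
# Textbook ridge regression normal equations.
w_ridge = np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(M), Phi.T @ y)
# Dual form through the (normalized) kernel matrix K_M = S_hat S_hat^*.
w_dual = S_hat.T @ np.linalg.solve(S_hat @ S_hat.T + lam * np.eye(n), y_hat)

# Gradient of the assumed objective at w_op; it should vanish.
grad = 2.0 / n * Phi.T @ (Phi @ w_op - y) + 2.0 * lam * w_op
print(np.linalg.norm(grad))
```

The primal solve costs $O(M^{3})$ and the dual solve $O(n^{3})$ , which is why the random-feature form is preferred when $M\ll n$ .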

We denote $\psi_{\omega}:=\psi(\cdot,\omega)$ for any $\omega\in\Omega$ . According to Assumption 1 and the fact that $\rho$ is a finite measure, we have that $\psi_{\omega}\in{L^{2}_{\rho_{X}}}$ almost surely. $\widehat{f}_{D_{j},\lambda}^{M}$ is the linear combination of $\psi_{\omega_{1}},\cdots,\psi_{\omega_{M}}$ , such that $\widehat{f}_{D_{j},\lambda}^{M}\in{L^{2}_{\rho_{X}}}$ almost surely. Since $\widehat{f}_{D_{j},\lambda}^{M},f_{\rho}\in{L^{2}_{\rho_{X}}},\widehat{y}_{n}\in\mathbb{R}^{n}$ and using definitions of the operators, it holds that $\widetilde{f}_{D_{j},\lambda}^{M},f_{\lambda}^{M},f_{\lambda}\in{L^{2}_{\rho_{X}}}$ and it is natural to estimate the differences of them in the ${L^{2}_{\rho_{X}}}$ -norm.

Remark 7.

In some of the KRR theory literature, an estimator and its weight are denoted by the same symbol. The excess risk then measures the difference between the RKHS weight $f\in\mathcal{H}$ and the target function $f_{\rho}\in{L^{2}_{\rho_{X}}}$ (which may not belong to the hypothesis space induced by the kernel $K$ ), which obscures the error decomposition of the excess risk. In this paper, to cover the situation where $f_{\rho}$ lies outside the hypothesis space, we measure $\|f-f_{\rho}\|^{2}_{\rho},~{}\forall f\in{L^{2}_{\rho_{X}}}$ rather than $\|f-\operatornamewithlimits{arg\,min}_{f\in\mathcal{H}}\mathcal{E}(f)\|^{2}_{\mathcal{H}},~{}\forall f\in\mathcal{H}$ . Meanwhile, for the sake of clarity, we define the estimators $\{\widehat{f}_{D,\lambda}^{M},\widetilde{f}_{D,\lambda}^{M},{f}_{\lambda}^{M},f_{\lambda}\}\subset{L^{2}_{\rho_{X}}}$ and estimate the excess risk in the ${L^{2}_{\rho_{X}}}$ -norm.

Proposition 1.

For any $j\in[m]$ , we use $(\mathbf{X}_{j},\mathbf{y}_{j})$ to denote the local samples and their labels on $D_{j}$ , where $\overline{\mathbf{X}}=\{\mathbf{X}_{1},\cdots,\mathbf{X}_{m}\}$ and $\bar{\mathbf{y}}=\{\mathbf{y}_{1},\cdots,\mathbf{y}_{m}\}$ represent the samples and labels on all partitions. For the estimators $\widehat{f}_{D_{j},\lambda}^{M}$ , and $\widetilde{f}_{D_{j},\lambda}^{M}$ , there holds

[TABLE]

Here, $\mathbb{E}_{\mathbf{y}_{j}}$ denotes the conditional expectation with respect to $\mathbf{y}_{j}$ given $\mathbf{X}_{j}$ on the $j$ -th partition $D_{j}$ .

Proof.

Using the operator based solutions of the local estimator of DKRR-RF $\widehat{f}_{D_{j},\lambda}^{M}$ (18), and the local noise-free estimator $\widetilde{f}_{D_{j},\lambda}^{M}$ (19), we have

[TABLE]

Taking the expectation over $\widehat{f}_{D_{j},\lambda}^{M}$ in terms of $\mathbf{y}_{j}$ , it holds

[TABLE]

which proves the identity (22). ∎
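The mechanism behind Proposition 1 is that the local weights depend linearly on the labels, so zero-mean label noise averages out. The sketch below illustrates this with a symmetric noise pair $\pm\varepsilon$ , for which the average of the two noisy solutions equals the noise-free solution exactly. The target `f_rho`, the features `Phi` and the helper `local_weights` are hypothetical stand-ins introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, M, lam = 30, 6, 0.05                  # illustrative sizes (assumptions)
Phi = rng.normal(size=(n, M))            # stand-in for the rows phi_M(x_i)
f_rho = np.sin(Phi[:, 0])                # hypothetical noise-free targets f_rho(x_i)
eps = rng.normal(size=n)                 # one draw of zero-mean label noise

def local_weights(targets):
    """Regularized least-squares weights; linear in `targets`."""
    A = Phi.T @ Phi / n + lam * np.eye(M)
    return np.linalg.solve(A, Phi.T @ targets / n)

w_tilde = local_weights(f_rho)           # noise-free estimator
w_plus = local_weights(f_rho + eps)      # noisy estimator, noise +eps
w_minus = local_weights(f_rho - eps)     # noisy estimator, noise -eps
w_mean = 0.5 * (w_plus + w_minus)        # average over the symmetric pair
print(np.max(np.abs(w_mean - w_tilde)))
```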

Taking the expectation over the conditional distribution $\rho(y|\boldsymbol{x})$ , Proposition 1 establishes the equivalence in expectation between the local estimators $\widehat{f}_{D_{j},\lambda}^{M}$ and $\widetilde{f}_{D_{j},\lambda}^{M}$ . Next, we derive relationships between global estimators and local ones to prove the error decomposition in Lemma 2. The excess risk of any $f\in{L^{2}_{\rho_{X}}}$ is connected to its discrepancy from the target regression function $f_{\rho}$ [12] by

[TABLE]

Here, $\|f\|_{\rho}=\|f\|_{L^{2}_{\rho_{X}}}=\big{(}\int_{\mathcal{X}}|f(\boldsymbol{x})|^{2}d\rho_{X}\big{)}^{1/2}$ , $\rho_{X}(\cdot)$ is the induced marginal measure on the input space $\mathcal{X}$ .
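The identity between the excess risk and the squared ${L^{2}_{\rho_{X}}}$ -distance can be checked exactly on a toy distribution. The sketch below assumes the squared loss, a three-point input space with marginal `p`, and labels $y=f_{\rho}(\boldsymbol{x})\pm s(\boldsymbol{x})$ with equal probability; all numbers are made up for illustration.

```python
import numpy as np

# Toy distribution (assumption): three inputs, labels y = f_rho(x) +/- s(x) w.p. 1/2 each.
p = np.array([0.2, 0.5, 0.3])            # marginal rho_X
f_rho = np.array([1.0, -0.5, 2.0])       # regression function E[y | x]
s = np.array([0.3, 1.0, 0.1])            # noise magnitude at each input

def risk(f):
    """Expected squared loss E[(f(x) - y)^2], computed exactly."""
    up = (f - (f_rho + s)) ** 2
    down = (f - (f_rho - s)) ** 2
    return float(np.sum(p * 0.5 * (up + down)))

f = np.array([0.0, 1.0, 1.5])            # an arbitrary candidate function
excess = risk(f) - risk(f_rho)
l2_dist_sq = float(np.sum(p * (f - f_rho) ** 2))
print(excess, l2_dist_sq)
```

The noise contributes the same term $\sum_{x}p(x)s(x)^{2}$ to both risks, so it cancels in the difference.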

Lemma 2.

For any $j\in[m]$ , let $\widehat{f}_{D,\lambda}^{M},\widetilde{f}_{D,\lambda}^{M},{f}_{\lambda}^{M}$ and ${f}_{\lambda}$ be defined as the above, we have

[TABLE]

Proof.

Using the noise-free estimator $\widetilde{f}_{D,\lambda}^{M}$ as an intermediate term, we have

[TABLE]

Taking the conditional expectation with respect to $\bar{\mathbf{y}}$ given $\overline{\mathbf{X}}$ on both sides, using (22) in Proposition 1 which indicates

[TABLE]

we thus have

[TABLE]

Using the fact $(a+b+c)^{2}\leq 3a^{2}+3b^{2}+3c^{2},~{}\forall a,b,c>0$ , we have

[TABLE]

Following the proof of Proposition 5 in [28], we establish the relationship between global and local empirical error

[TABLE]

Substituting (30), (31) and (32) into (23), we get the desired result

[TABLE]

∎

Sample variance (25) is caused by noise on the labels $y$ , and is therefore output-dependent. Distributed error (26) reflects errors from distributed learning. Empirical error (27) represents the gap between expected learning and empirical learning. Note that empirical error concerns noise-free data, and thus it can be reduced by additional unlabeled data, resulting in Theorem 4. Independent of the sample, random features error (28) is caused by the discrepancy between the kernel approximated by random features and the exact kernel, while approximation error (29) reflects the bias of the algorithm. Data-dependent features can reduce random features error (28), which motivates Theorem 3.

The global sample variance is reduced to $1/m$ of the local one, illustrating that distributed learning can reduce the sample error relative to any local estimator. Moreover, the empirical error is output-independent and can be reduced by using unlabeled data. Thus, with the same optimal error convergence rate, we improve the number of partitions $m$ by introducing more unlabeled examples in Theorem 4. Sample variance relies on both samples and labels, while random features error and approximation error are independent of the data, so additional unlabeled data do not influence the other errors.
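The $1/m$ reduction of the sample variance is the usual effect of averaging independent zero-mean errors. As a minimal sketch, assume each of the $m$ local estimators carries an independent error taking the values $\pm 1$ with equal probability (a toy model, not the paper's noise model); enumerating all $2^{m}$ outcomes gives the variances exactly.

```python
import itertools
import numpy as np

m = 4                                    # number of partitions (illustrative)
# All 2^m equally likely sign patterns of the local errors.
patterns = np.array(list(itertools.product([-1.0, 1.0], repeat=m)))

local_var = float(np.mean(patterns[:, 0] ** 2))            # variance of one local error
global_var = float(np.mean(patterns.mean(axis=1) ** 2))    # variance of the averaged error
print(local_var, global_var)
```

The cross terms vanish because the local errors are independent with zero mean, which is the role played by the conditional expectation in Proposition 1.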

A.3 Estimate Error Terms

To analyze the excess risk, in this part, we estimate four error terms $\|\widehat{f}_{D_{j},\lambda}^{M}-\widetilde{f}_{D_{j},\lambda}^{M}\|_{\rho}^{2},\|\widetilde{f}_{D_{j},\lambda}^{M}-{f}_{\lambda}^{M}\|_{\rho}^{2},\|{f}_{\lambda}^{M}-{f}_{\lambda}\|_{\rho}^{2}$ and $\|{f}_{\lambda}-f_{\rho}\|_{\rho}^{2}$ . The estimates of the sample variance $\|\widehat{f}_{D_{j},\lambda}^{M}-\widetilde{f}_{D_{j},\lambda}^{M}\|_{\rho}^{2}$ and the empirical error $\|\widetilde{f}_{D_{j},\lambda}^{M}-{f}_{\lambda}^{M}\|_{\rho}^{2}$ are related to the key quantity $\|C_{M,\lambda}^{1/2}\widehat{C}_{M,\lambda}^{-1/2}\|$ . To relax the restriction on the number of partitions, we provide a sharper upper bound for this critical quantity, as a constant, based on Bernstein’s inequality. The estimate of the random features error is also associated with a critical quantity $\|L_{\lambda}^{-1/2}(L-L_{M})L_{\lambda}^{-(1-r)}\|$ , which we estimate separately. The sample variance is related to the local labeled sample size $n$ , while the key quantity $\|C_{M,\lambda}^{1/2}\widehat{C}_{M,\lambda}^{-1/2}\|$ is related to the local total sample size $n^{*}$ . Those two parts lead to two constraints on the number of partitions. Random features error is related to the dimension of random features and is independent of the sample size.

A.3.1 Estimates for Sample Variance $\|\widehat{f}_{D_{j},\lambda}^{M}-\widetilde{f}_{D_{j},\lambda}^{M}\|$

Lemma 3.

For the sample variance (25) in the error decomposition, the following holds

[TABLE]

For any $\delta\in(0,1/3]$ , when $n\geq 16(\mathcal{N}_{\infty}(\lambda)+1)\log(2/\delta)$ , the following holds with probability at least $1-3\,\delta$

[TABLE]

Proof.

Let $\widehat{f}_{D_{j},\lambda}^{M}$ and $\widetilde{f}_{D_{j},\lambda}^{M}$ be defined as (18) and (19), we have

[TABLE]

The last step is due to Cauchy–Schwarz inequality. Note that

[TABLE]

where $\|S_{M}C_{M,\lambda}^{-1/2}\|\leq\|C_{M,\lambda}^{-1/2}S_{M}^{*}S_{M}C_{M,\lambda}^{-1/2}\|^{1/2}\leq 1$ . Thus, we have $\|S_{M}\widehat{C}_{M,\lambda}^{-1/2}\|\leq\|C_{M,\lambda}^{1/2}\widehat{C}_{M,\lambda}^{-1/2}\|$ , and it holds that

[TABLE]

The term $\|C_{M,\lambda}^{-1/2}(\widehat{S}_{M}^{*}\widehat{y}_{n}-\bar{S}_{M}^{*}f_{\rho})\|$ can be rewritten as

[TABLE]

Combining (33), (34) and (35), one can prove

[TABLE]

From Lemma 10, we know that with high probability $\|C_{M,\lambda}^{1/2}\widehat{C}_{M,\lambda}^{-1/2}\|^{2}\leq 2$ if $n\geq 16(\mathcal{N}_{\infty}(\lambda)+1)\log(2/\delta)$ . Substituting Lemma 4, Lemma 5 and Lemma 10 into the bound obtained by combining (33), (34) and (35), if $n\geq 16(\mathcal{N}_{\infty}(\lambda)+1)\log(2/\delta)$ , it holds with probability at least $1-3\delta$

[TABLE]

∎

Lemma 4 (Lemma 6 in [9]).

For $\delta\in(0,1]$ , under Assumptions 1 and 2, the following holds with probability at least $1-\delta$

[TABLE]

Using Bernstein’s inequality (Proposition 2), we prove the following lemma.

Lemma 5.

For $\delta\in(0,1],$ under Assumptions 1, 2 with the probability at least $1-\delta$ , we have

[TABLE]

Proof.

Let $\xi_{i}=C_{M,\lambda}^{-1/2}\phi_{M}(\boldsymbol{x}_{i})f_{\rho}(\boldsymbol{x}_{i})$ on $\mathcal{X}$ in the Hilbert space $\mathcal{H}_{M}$ . We see that

[TABLE]

Thus, the error term to bound can be stated as

[TABLE]

The right-hand side of the above identity can be bounded by Bernstein’s inequality (Proposition 2), thus we need to estimate $\|\xi\|$ and $\mathbb{E}(\|\xi_{i}-\mathbb{E}(\xi_{i})\|^{2})$ first.

Note that Assumption 2 with $p=2$ gives $\int_{\mathcal{Y}}y^{2}d\rho(y|\boldsymbol{x})\leq\sigma^{2}$ , which implies that the regression function is bounded almost surely [10]

[TABLE]

With the inequality

[TABLE]

we thus have

[TABLE]

Note that

[TABLE]

Substituting (38) and (39) into (37), by Bernstein’s inequality (Proposition 2), one can prove that with probability at least $1-\delta$

[TABLE]

∎

A.3.2 Estimates for Empirical Error $\|\widetilde{f}_{D_{j},\lambda}^{M}-{f}_{\lambda}^{M}\|$

Lemma 6.

For the empirical error (27) in error decomposition, the following holds

[TABLE]

Under Assumptions 1 and 5, for $\delta\in(0,1/3]$ and $\lambda>0$ , when the number of local examples satisfies $n\geq 16(\mathcal{N}_{\infty}(\lambda)+1)\log(2/\delta)$ and the dimension of random features satisfies

[TABLE]

the following holds with probability at least $1-3\delta$

[TABLE]

Proof.

Using the definition of $f^{M}_{\lambda}$ , we have

[TABLE]

Under definitions in (19) and (20), using the above identity and $A^{-1}-B^{-1}=A^{-1}(B-A)B^{-1}$ for positive operators $A,B$ , we have

[TABLE]

To obtain the key term $\|C_{M,\lambda}^{1/2}\widehat{C}_{M,\lambda}^{-1/2}\|$ , we introduce additional terms in the last step of the above identity. Note that the following inequalities hold: $\|S_{M}C_{M,\lambda}^{-1/2}\|=\|C_{M,\lambda}^{-1/2}C_{M}C_{M,\lambda}^{-1/2}\|^{1/2}\leq 1$ , $\|\widehat{C}_{M,\lambda}^{-1/2}\bar{S}_{M}^{*}\|=\|\widehat{C}_{M,\lambda}^{-1/2}\widehat{C}_{M}\widehat{C}_{M,\lambda}^{-1/2}\|^{1/2}\leq 1$ , and $\|C_{M,\lambda}^{-1/2}S_{M}^{*}\|=\|C_{M,\lambda}^{-1/2}C_{M}C_{M,\lambda}^{-1/2}\|^{1/2}\leq 1$ . Thus, one can obtain that

[TABLE]

When $n\geq 16(\mathcal{N}_{\infty}(\lambda)+1)\log(2/\delta)$ , by Lemma 10, it holds with probability at least $1-\delta$

[TABLE]

Using Lemma 7 and Lemma 8, we have with probability at least $1-3\delta$

[TABLE]

∎
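The proof of Lemma 6 relies on the resolvent-type identity $A^{-1}-B^{-1}=A^{-1}(B-A)B^{-1}$ for positive operators. A quick numerical check on random positive definite matrices is sketched below; the matrices `A` and `B` merely play the roles of $\widehat{C}_{M,\lambda}$ and $C_{M,\lambda}$ and are not computed from any data.

```python
import numpy as np

rng = np.random.default_rng(3)
M, lam = 5, 0.1                          # illustrative sizes (assumptions)
G = rng.normal(size=(M, M))
H = rng.normal(size=(M, M))
A = G @ G.T + lam * np.eye(M)            # plays the role of \hat{C}_{M,lambda}
B = H @ H.T + lam * np.eye(M)            # plays the role of C_{M,lambda}

A_inv = np.linalg.inv(A)
B_inv = np.linalg.inv(B)
lhs = A_inv - B_inv
rhs = A_inv @ (B - A) @ B_inv            # A^{-1} (B - A) B^{-1}
print(np.max(np.abs(lhs - rhs)))
```

The identity follows from $A^{-1}(B-A)B^{-1}=A^{-1}BB^{-1}-A^{-1}AB^{-1}$ and does not require $A$ and $B$ to commute.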

A.3.3 Estimates for Random Features Error $\|f_{\lambda}^{M}-f_{\lambda}\|$

The next lemma bounds the distance between the Tikhonov solution with RF and the Tikhonov solution without RF, reflecting the approximation ability of random features.

Lemma 7.

Under Assumptions 1 and 5, for $\delta\in(0,1/2],\lambda>0$ , when

[TABLE]

the following holds with a probability at least $1-2\delta$

[TABLE]

Proof.

According to the operator representations of $f_{\lambda}^{M}$ (20) and $f_{\lambda}$ (21)

[TABLE]

Using the identity $A(A+\lambda I)^{-1}=I-\lambda(A+\lambda I)^{-1}$ and $A^{-1}-B^{-1}=A^{-1}(B-A)B^{-1}$ , we have

[TABLE]

By Assumption 4, there exists $g\in{L^{2}_{\rho_{X}}}$ such that $f_{\rho}=L^{r}g$ , so we have

[TABLE]

Note that $\|\sqrt{\lambda}L_{M,\lambda}^{-1/2}\|\leq 1$ , $\|L_{\lambda}^{-r}L^{r}\|\leq 1$ , $\|g\|\leq R$ and $\|L_{M,\lambda}^{-1/2}L_{\lambda}^{1/2}\|\leq\sqrt{2}$ if $M\geq 16(\mathcal{N}_{\infty}(\lambda)+1)\log(2/\delta)$ due to Lemma 12, we thus have with probability at least $1-\delta$

[TABLE]

Then, we estimate the bound in two cases $r\in(0,1/2)$ and $r\in[1/2,1]$ .

•

When $r\in(0,1/2)$ , there exists

[TABLE]

The last step is due to $\|\lambda^{1/2-r}L_{\lambda}^{-(1/2-r)}\|\leq 1$ for any $0<r<1/2$ .

Note that $\|L_{\lambda}^{-1/2}(L-L_{M})L_{\lambda}^{-1/2}\|\leq 1/2$ using Lemma 12, thus for $r\in(0,1/2)$ , it holds with probability at least $1-\delta$

[TABLE]

•

When $r\in[1/2,1]$ , there exists

[TABLE]

with $\varsigma=2-2r$ and $0\leq\varsigma\leq 1$ .

Using Proposition 4 with $X=L_{\lambda}^{-1/2}(L-L_{M})$ and $A=L_{\lambda}^{-1/2}$ , one can obtain that

[TABLE]

Thus, applying the above inequality to (40), we have

[TABLE]

To obtain $\|f_{\lambda}^{M}-f_{\lambda}\|\leq R\lambda^{r}$ we need the mixed term to be bounded by

[TABLE]

From Lemma 11, with the condition $M\geq 16\kappa^{2}(\mathcal{N}_{\infty}(\lambda)+1)\log(2/\delta)$ , it holds

[TABLE]

where $a=2\kappa^{2}\log(2/\delta)$ and $b=\mathcal{N}_{\infty}(\lambda)+1$ . Similarly, Lemma 13 can be stated as

[TABLE]

where $a=2\kappa^{2}\log(2/\delta)$ , $b=\mathcal{N}_{\infty}(\lambda)+1$ and $c=\mathcal{N}(\lambda)$ .

Note that, according to Minkowski’s inequality, we have

[TABLE]

Therefore, substituting (44), (45) and (46) into (43), there holds

[TABLE]

To make the mixed term bounded by $\lambda^{r-1/2}$ , we consider the following condition

[TABLE]

and obtain the bound of mixed term

[TABLE]

The third step is due to $b=\mathcal{N}_{\infty}(\lambda)+1\leq 2\kappa^{2}\lambda^{-1}$ and

[TABLE]

which holds because $0\leq\lambda\leq\|L\|$ guarantees a bounded effective dimension $\mathcal{N}_{M}(\lambda)$ by Proposition 10 of [9]. The last step is due to $\kappa^{4r^{2}-6r+2}\leq 1$ , since $4r^{2}-6r+2\leq 0$ for $r\in[1/2,1]$ (taking $\kappa\geq 1$ ), and $2^{6r-4r^{2}-1/2}\geq 2\sqrt{2}$ , since $6r-4r^{2}-1/2\geq 1.5$ for $r\in[1/2,1]$ .

Thus, with the condition $M\geq 16\kappa^{2}(\mathcal{N}_{\infty}(\lambda)+1)\log(2/\delta)$ and $M\geq 16\kappa^{2}(\mathcal{N}_{\infty}(\lambda)+1)^{2-2r}(\mathcal{N}(\lambda)/\lambda)^{2r-1}\ \log(2/\delta),$ we have with probability at least $1-2\delta$

[TABLE]

Combining the results for the two cases $r\in(0,1/2)$ and $r\in[1/2,1]$ , we prove the lemma. ∎
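The two exponent claims used in the last step of the case $r\in[1/2,1]$ are elementary and can be verified on a grid. The sketch below uses only the two polynomials quoted in the proof; the grid resolution is an arbitrary choice.

```python
import math

# Grid over the regularity range r in [1/2, 1].
rs = [0.5 + k / 1000 for k in range(501)]

kappa_exp = [4 * r * r - 6 * r + 2 for r in rs]     # exponent of kappa, claimed <= 0
two_exp = [6 * r - 4 * r * r - 0.5 for r in rs]     # exponent of 2, claimed >= 1.5

max_kappa_exp = max(kappa_exp)   # attained at the endpoints r = 1/2 and r = 1
min_two_exp = min(two_exp)       # attained at the endpoints r = 1/2 and r = 1
print(max_kappa_exp, min_two_exp, 2 ** 1.5, 2 * math.sqrt(2))
```

Both polynomials factor nicely: $4r^{2}-6r+2=2(2r-1)(r-1)$ , which is nonpositive exactly on $[1/2,1]$ , and $6r-4r^{2}-1/2=3/2-2(2r-1)(r-1)$ .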

A.3.4 Estimates for Approximation Error $\|{f}_{\lambda}-f_{\rho}\|$

The last term we need to estimate is approximation error $\|{f}_{\lambda}-f_{\rho}\|,$ whose proof is standard [12, 13, 9].

Lemma 8.

Under Assumptions 1 and 4, the following holds for any $\lambda>0$ and $r>0$ ,

[TABLE]

Proof.

Under Assumption 4, there exists $g\in{L^{2}_{\rho_{X}}}$ such that $f_{\rho}=L^{r}g$ with $\|g\|\leq R$ . The identity $A(A+\lambda I)^{-1}=I-\lambda(A+\lambda I)^{-1}$ is valid for $\lambda>0$ and any bounded self-adjoint positive operator $A$ , and by the definition of $f_{\lambda}$ (21), we have

[TABLE]

Note that $\lVert{\lambda^{1-r}L_{\lambda}^{-(1-r)}}\rVert\leq 1$ and $\lVert{L_{\lambda}^{-r}{L}^{r}}\rVert\leq 1$ , while $R:=\lVert{g}\rVert_{L^{2}_{\rho_{X}}}$ according to Assumption 4. The proof is completed. ∎
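The argument above can be illustrated on a diagonal operator, where every quantity is explicit. The sketch assumes $f_{\lambda}=L(L+\lambda I)^{-1}f_{\rho}$ , consistent with the identity used in the proof, a hypothetical spectrum $\sigma_{k}=k^{-2}$ , and a random source element $g$ ; it then compares $\|f_{\lambda}-f_{\rho}\|$ with $\lambda^{r}\|g\|$ for several $r\in(0,1]$ .

```python
import numpy as np

eigs = 1.0 / np.arange(1, 201) ** 2      # hypothetical spectrum of L (so ||L|| = 1)
rng = np.random.default_rng(4)
g = rng.normal(size=eigs.size)           # source element with f_rho = L^r g
R = float(np.linalg.norm(g))

def approx_error(r, lam):
    """||f_lambda - f_rho|| with f_lambda = L (L + lam I)^{-1} f_rho (assumed form)."""
    f_rho = eigs ** r * g
    f_lam = eigs / (eigs + lam) * f_rho
    return float(np.linalg.norm(f_lam - f_rho))

# (r, lambda, actual error, bound R * lambda^r) for a few settings.
checks = [(r, lam, approx_error(r, lam), R * lam ** r)
          for r in (0.25, 0.5, 1.0) for lam in (1e-3, 1e-2, 1e-1)]
for r, lam, err, bound in checks:
    print(r, lam, err, bound)
```

Coordinate-wise, the error factor is $\lambda\sigma^{r}/(\sigma+\lambda)\leq\lambda^{r}$ for $r\in[0,1]$ , which is the scalar version of the two operator norm bounds used in the proof.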

A.4 Proofs of Main Results

Theorem 5 (General excess risk bound).

Let $\delta\in(0,1/5]$ and $\widehat{f}_{D,\lambda}^{M}$ be defined by (7). Under Assumptions 1, 2, 3, 4 and 5, when $\lambda=N^{-\frac{1}{2r+\gamma}}$ , the number of local processors satisfies

[TABLE]

and the dimension of random features satisfies

[TABLE]

then the following holds with a probability at least $1-5\delta$ ,

[TABLE]

where $\tilde{F}=\max(F,\kappa^{2})$ , $\tilde{Q}^{2}=\max(Q^{2},1)$ and $c_{2}$ is a constant independent of $m,n,N^{*}$ such that

[TABLE]

Proof.

From Lemma 2, there holds the upper bound for excess risk

[TABLE]

In the following, we use Lemma 3, Lemma 6 and Lemma 7 to bound error terms. Therefore, we need to take into account the conditions in those lemmas. There are constraints on the number of local examples $n$ and the dimension of random features $M$ :

[TABLE]

Here, we merge the constraints on $M$ because it is difficult to know in advance which range the regularity $r$ belongs to. Meanwhile, $n$ depends on the number of partitions $m$ , where $n=N/m$ . Due to the constraint on the number of samples $n\geq 32\mathcal{N}_{\infty}(\lambda)\log(2/\delta)\geq 16(\mathcal{N}_{\infty}(\lambda)+1)\log(2/\delta)$ and $\lambda=N^{-\frac{1}{2r+\gamma}}$ , we use Assumption 5 to obtain the restriction on the number of partitions

[TABLE]

•

When $r\in(0,1/2)$ , using Assumption 5 that $\mathcal{N}_{\infty}(\lambda)\leq F\lambda^{-\alpha}$ , to ensure $M\geq 16(\mathcal{N}_{\infty}(\lambda)+1)\log(2/\delta)$ , we should have

[TABLE]

Thus, it holds

[TABLE]

•

When $r\in[1/2,1]$ , using Assumption 3 and Assumption 5, we should have

[TABLE]

where $\tilde{F}=\max(F,\kappa^{2})\geq 1$ and $\tilde{Q}^{2}=\max(Q^{2},1)\geq 1$ .

To ensure $M\geq 16\kappa^{2}\log(2/\delta)\big{[}(\mathcal{N}_{\infty}(\lambda)+1)\vee\lambda^{1-2r}\mathcal{N}(\lambda)^{2r-1}(\mathcal{N}_{\infty}(\lambda)+1)^{2-2r}\big{]}$ , using the above inequality it holds for $r\in[1/2,1]$

[TABLE]

due to the fact $32\kappa^{2}\log(2/\delta)\leq 32\kappa^{2}\tilde{F}\tilde{Q}^{2}\log(2/\delta)$ .

By Lemma 3, $\mathcal{N}(\lambda)\leq Q^{2}\lambda^{-\gamma}$ and $\lambda=N^{-\frac{1}{2r+\gamma}}$ , it holds for the global sample variance

[TABLE]

The last step is due to the inequality $(a+b)^{2}\leq 2a^{2}+2b^{2}.$ From Assumption 3, we have $\mathcal{N}(\lambda)\leq Q^{2}\lambda^{-\gamma}$ . Note that we can obtain $\mathcal{N}_{M}(\lambda)\leq 2.55\mathcal{N}(\lambda)\leq 2.55Q^{2}\lambda^{-\gamma}$ by Proposition 10 of [9] and $\lambda\leq\|L\|$ . Using $m\leq\frac{1}{32F\log(2/\delta)}N^{\frac{2r+\gamma-\alpha}{2r+\gamma}}$ and the worst case $\alpha=1$ , it holds

[TABLE]

where $c_{1}=128\left(\frac{(B+\sigma)^{2}}{32F\log(2/\delta)}+2.55\sigma^{2}Q^{2}\right)\log^{2}\frac{2}{\delta}.$

According to Lemma 6, there holds for the empirical error

[TABLE]

Using Lemma 7, for random features error, it holds

[TABLE]

Using Lemma 8, for approximation error, it holds

[TABLE]

Substituting the above global sample variance bound together with (54), (55) and (56) into Lemma 2, we then get the final result

[TABLE]

where $c_{2}=c_{1}+495R^{2}$ . Note that the proof uses five inequalities, each holding with probability at least $1-\delta$ (Lemmas 4, 5, 10, 12 and 13), and thus, by a union bound, the final result holds with probability at least $1-5\delta$ . ∎

Proof of Theorem 1.

The result in Theorem 1 is a trivial extension of Theorem 2 in [9] and Corollary 1 in [14]. Considering only the attainable case $r\in[1/2,1]$ , this theorem can be proved by combining the proofs in [14] and [9].

Following the error decomposition and proof process in the proof of Theorem 3, one can easily prove Theorem 1. However, the main difference is how to bound the term $\|\widehat{C}_{M,\lambda}^{-1/2}C_{M,\lambda}^{1/2}\|$ as a constant. Using Proposition 1 and the second-order decomposition of operator difference in [14], one can obtain the following identities

[TABLE]

Applying $A=\widehat{C}_{M,\lambda}$ , $B=C_{M,\lambda}$ , the facts $\|\widehat{C}_{M,\lambda}^{-1}\|\leq 1/\lambda$ and $\|C_{M,\lambda}^{-1/2}\|\leq 1/\sqrt{\lambda}$ , it holds

[TABLE]

For $\delta\in(0,1)$ , the following bound, which can be found in [13, 4], holds with confidence at least $1-\delta$

[TABLE]

To guarantee the term $\|\widehat{C}_{M,\lambda}^{-1/2}C_{M,\lambda}^{1/2}\|$ be a constant, it requires

[TABLE]

to make sure that

[TABLE]

Using Assumption 3 and $\lambda=N^{-\frac{1}{2r+\gamma}}$ , one can obtain the condition $n\gtrsim N^{\frac{\gamma+1}{2r+\gamma}}$ , the same as in [13, 4, 14]. However, in Lemma 9 and Lemma 10, we directly apply a relaxed condition $n\gtrsim N^{\frac{\alpha}{2r+\gamma}}$ by Bernstein’s inequality to guarantee that the term $\|\widehat{C}_{M,\lambda}^{-1/2}C_{M,\lambda}^{1/2}\|$ is bounded by a constant.

To prove Theorem 1, we just need to use the condition $n\gtrsim N^{\frac{\gamma+1}{2r+\gamma}}$ to replace the condition $n\gtrsim N^{\frac{\alpha}{2r+\gamma}}$ in the proof of Lemma 3 and Lemma 6. Then, following the proof of Theorem 5 for $r\in[1/2,1]$ , we prove the result with $\alpha=1$ due to $\mathcal{N}_{\infty}(\lambda)\leq\kappa^{2}\lambda^{-1}.$ ∎

Proof of Theorem 2.

Considering the worst case of Assumption 5 is equivalent to making no assumption on $\mathcal{N}_{\infty}(\lambda)$ , since $\mathcal{N}_{\infty}(\lambda)\leq\kappa^{2}\lambda^{-1}$ always holds. Applying Theorem 5 with $\tilde{F}=\kappa^{2}$ and $\alpha=1$ , we prove the result. ∎

Proof of Theorem 3.

Theorem 5 is the detailed version of Theorem 3. ∎

Theorem 6 (Improved Bounds with Additional Unlabeled Samples).

Let $\delta\in(0,1]$ and $\widehat{f}_{D_{j}^{*},\lambda}^{M}$ be defined by (10). Under Assumptions 1, 2, 3, 4 and 5, when $\lambda=N^{-\frac{1}{2r+\gamma}}$ , the total number of samples satisfies

[TABLE]

the number of local processors satisfies

[TABLE]

and the dimension of random features satisfies

[TABLE]

then the following holds with a probability at least $1-\delta$ ,

[TABLE]

where $\tilde{F}=\max(F,\kappa^{2})$ , $\tilde{Q}^{2}=\max(Q^{2},1)$ and $c_{2}$ is a constant independent of $m,n,N^{*}$ such that

[TABLE]

Proof.

From Lemma 2, there holds the upper bound for excess risk

[TABLE]

Using the above equality and Lemma 6, we find that empirical error is data-dependent but output-independent. Meanwhile, the sample variance (Lemma 3) is dependent on the number of labeled samples $n=N/m$ , while other terms (including $\|\widehat{C}_{M,\lambda}^{-1/2}C_{M,\lambda}^{1/2}\|$ ) can be related to total sample size $n^{*}=N^{*}/m$ .

Based on the sample variance, we first estimate the number of required labeled samples $n$ . Using Lemma 3 and the global sample variance bound from the proof of Theorem 5, we have

[TABLE]

To guarantee the optimal learning rate, we need $mN^{\frac{1-4r-2\gamma}{2r+\gamma}}\leq\mathcal{O}\big{(}N^{\frac{-2r}{2r+\gamma}}\big{)}$ , and thus

[TABLE]

We then consider the additional unlabeled samples to reduce the empirical error, where the local samples are label-free and the constraint is related to the total sample size from Lemma 10:

[TABLE]

Let $\lambda=N^{-\frac{1}{2r+\gamma}}$ , then the restriction on the dimension of random features $M$ is the same as in Theorem 5. But the restriction on the number of partitions $m$ is changed to

[TABLE]

From the constraint (57) due to the sample variance, we know that the number of partitions $m$ cannot be larger than $\mathcal{O}(N^{\frac{2r+2\gamma-1}{2r+\gamma}})$ , and this constraint plays the leading role. Thus, combining (57) and (58), one can obtain

[TABLE]

We consider the following two conditions for $\alpha$

•

The case $\alpha<1-\gamma$ . It holds $2r+2\gamma-1<2r+\gamma-\alpha$ , thus the constraint on the number of partitions is $m\lesssim N^{\frac{2r+2\gamma-1}{2r+\gamma}}$ .

•

The case $\alpha\geq 1-\gamma$ . It holds $\gamma+\alpha-1\geq 0$ and we make use of additional unlabeled examples $N^{*}\gtrsim NN^{\frac{\gamma+\alpha-1}{2r+\gamma}}$ to guarantee $m\lesssim N^{\frac{2r+\gamma-\alpha}{2r+\gamma}}\leq N^{\frac{2r+2\gamma-1}{2r+\gamma}}$ .

Therefore, using unlabeled examples, the number of partitions always achieves $m\lesssim N^{\frac{2r+2\gamma-1}{2r+\gamma}}$ .
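The case analysis above only compares exponents of $N$ , so it can be tabulated directly. The helper `partition_exponents` below is a hypothetical convenience function, not part of the paper; it returns the exponent of the labeled-only constraint $m\lesssim N^{\frac{2r+\gamma-\alpha}{2r+\gamma}}$ , the exponent of the sample variance constraint $m\lesssim N^{\frac{2r+2\gamma-1}{2r+\gamma}}$ , and the exponent $e$ in the requirement $N^{*}\gtrsim N\cdot N^{e}$ on the total sample size, ignoring constants and logarithmic factors.

```python
def partition_exponents(r, gamma, alpha):
    """Exponents of N in the partition constraints (constants and logs ignored).

    Hypothetical helper: returns (labeled_only, variance_limit, extra), where
    extra = e means N* must be of order at least N * N**e.
    """
    denom = 2 * r + gamma
    labeled_only = (2 * r + gamma - alpha) / denom      # constraint without unlabeled data
    variance_limit = (2 * r + 2 * gamma - 1) / denom    # constraint from sample variance
    extra = max(gamma + alpha - 1, 0.0) / denom         # zero when alpha < 1 - gamma
    return labeled_only, variance_limit, extra

# Case alpha >= 1 - gamma: unlabeled data are needed to reach the variance limit.
case_hard = partition_exponents(r=0.5, gamma=0.5, alpha=1.0)
# Case alpha < 1 - gamma: the variance limit is already the binding constraint.
case_easy = partition_exponents(r=0.5, gamma=0.25, alpha=0.5)
print(case_hard, case_easy)
```

In the first setting the admissible exponent of $m$ grows from $1/3$ to $2/3$ , at the price of a total sample size of order $N^{4/3}$ .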

Considering the following constraints on the number of partitions $m$ and the dimension of random features $M$ :

[TABLE]

We first estimate the output-dependent error term: sample variance. Using $\lambda=N^{-\frac{1}{2r+\gamma}}$ , $m\leq N^{\frac{2r+2\gamma-1}{2r+\gamma}}$ and the global sample variance bound from the proof of Theorem 5, the global sample variance is bounded by

[TABLE]

where $c_{3}=128\left(\kappa^{2}(B+\sigma)^{2}+2.55\sigma^{2}Q^{2}\right)\log^{2}\frac{2}{\delta}$ .

We then bound the label-free terms in Lemma 2 with $\lambda=N^{-\frac{1}{2r+\gamma}}$ . Using Lemma 6, Lemma 7 and Lemma 8, it holds

[TABLE]

Combining the above global sample variance bound and (60) with Lemma 2, one can prove the desired result. ∎

Proof of Theorem 4.

Theorem 6 is the detailed version of Theorem 4. ∎

Corollary 1.

Under the same assumptions of Theorem 3, if $r\in(0,1]$ , $\gamma\in[0,1]$ and $\lambda=N^{-\frac{1}{2r+\gamma}}$ , then $m=1$ and the number of random features $M$ satisfying

[TABLE]

are sufficient to guarantee, with a high probability, that

[TABLE]

The above error bound is a special case of Theorem 3 using only one partition $m=1$ , namely KRR-RF. Compared to the theoretical results in [9], which only apply to the attainable case $r\in[1/2,1]$ , Corollary 1 pertains to both the attainable and non-attainable cases $r\in(0,1]$ , covering all difficult problems. Meanwhile, the requirements on the number of random features are reasonable and lead to higher computational efficiency.

Corollary 2.

Under the same assumptions of Theorem 4, if $r\in(0,1],\gamma\in[0,1],2r+2\gamma\geq 1$ and $\lambda=N^{-\frac{1}{2r+\gamma}},$ then the total number of samples corresponding to

[TABLE]

and the number of local processors satisfying

[TABLE]

are sufficient to guarantee, with a high probability, that

[TABLE]

The above corollary is a special case of DKRR-RF with the induced kernel rather than random features, i.e., KRR-DC. The existing theoretical results on KRR-DC are still restricted to $m\lesssim N^{\frac{2r+\gamma-1}{2r+\gamma}}$ , while we improve the condition to $m\lesssim N^{\frac{2r+2\gamma-1}{2r+\gamma}}$ for the first time, which allows more partitions and covers more complicated problems in the non-attainable cases. Using the condition $m\lesssim N^{\frac{2r+2\gamma-1}{2r+\gamma}}$ , it is worthwhile to devise more efficient distributed KRR methods together with Nyström subsampling, random projections, stochastic optimization, and other techniques in the future.

A.5 Probabilistic Inequalities

Proposition 2 (Lemma 2 in [12]).

Let $\mathcal{L}$ be a separable Hilbert space and $\{\xi_{1},\cdots,\xi_{n}\}$ be a sequence of i.i.d. random variables in $\mathcal{L}$ . Assume the bound $\|\xi_{i}\|\leq\widetilde{M}<\infty$ and let the variance be $\tilde{\sigma}^{2}=\mathbb{E}(\|\xi_{i}-\mathbb{E}(\xi_{i})\|^{2})$ for any $i\in[n]$ . For any $\delta\in(0,1)$ , with confidence $1-\delta$ ,

[TABLE]

The above Bernstein’s inequality is the key to analyzing the relationship between the empirical random vector and its expected counterpart, which is used to prove Lemma 9 and Lemma 4. The above Bernstein’s inequality for random vectors was provided in [12, 9] and later was extended to the random operator case in Lemma 24 of [10].

Proposition 3 (Lemma E.2 of [39]).

For any self-adjoint and positive semi-definite operators $A$ and $B$ , if there exists $0<\eta<1$ such that the following inequality holds

[TABLE]

then

[TABLE]

The above inequality [39] establishes the connection between $\|(A+\lambda I)^{-1/2}(B-A)(A+\lambda I)^{-1/2}\|$ and $\|(A+\lambda I)^{1/2}(B+\lambda I)^{-1/2}\|$. In this paper, the two terms $\|C_{M,\lambda}^{-1/2}(C_{M}-\widehat{C}_{M})C_{M,\lambda}^{-1/2}\|$ and $\|C_{M,\lambda}^{1/2}\widehat{C}_{M,\lambda}^{-1/2}\|$ frequently appear in the estimates of the error terms, and we use Proposition 3 to bound both of them by constants.
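The relationship between the two norms can be checked numerically on random matrices. The sketch below assumes the proposition has the form "if $\|(A+\lambda I)^{-1/2}(B-A)(A+\lambda I)^{-1/2}\|\leq 1-\eta$, then $\|(A+\lambda I)^{1/2}(B+\lambda I)^{-1/2}\|\leq \eta^{-1/2}$"; writing $t$ for the first norm, this amounts to checking that the second norm is at most $(1-t)^{-1/2}$ whenever $t<1$. The assumed form, the finite-dimensional matrices, and all function names are illustrative choices.

```python
import numpy as np


def psd_power(S, p):
    """S**p for a symmetric positive definite matrix S via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * vals**p) @ vecs.T


def random_psd(rng, dim, rank):
    """Random positive semi-definite matrix G G^T / rank."""
    G = rng.standard_normal((dim, rank))
    return G @ G.T / rank


def check_proposition(dim=8, rank=40, lam=0.1, mix=0.2, seed=0):
    """Return (t, lhs) with
    t   = ||(A + lam I)^{-1/2} (B - A) (A + lam I)^{-1/2}||,
    lhs = ||(A + lam I)^{1/2} (B + lam I)^{-1/2}||.
    B is a convex combination of A and another PSD matrix, so B is PSD.
    """
    rng = np.random.default_rng(seed)
    A = random_psd(rng, dim, rank)
    B = (1.0 - mix) * A + mix * random_psd(rng, dim, rank)
    I = np.eye(dim)
    A_inv_half = psd_power(A + lam * I, -0.5)
    t = np.linalg.norm(A_inv_half @ (B - A) @ A_inv_half, 2)
    lhs = np.linalg.norm(
        psd_power(A + lam * I, 0.5) @ psd_power(B + lam * I, -0.5), 2
    )
    return float(t), float(lhs)


if __name__ == "__main__":
    t, lhs = check_proposition()
    if t < 1:
        print(f"t = {t:.3f}, lhs = {lhs:.3f}, bound = {(1 - t) ** -0.5:.3f}")
    else:
        print(f"t = {t:.3f} >= 1, so the condition of the proposition fails")
```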

Proposition 4 (Proposition 9 in [9]).

Let $\mathcal{H},\mathcal{K}$ be two separable Hilbert spaces and $X,A$ be bounded linear operators, with $X:\mathcal{H}\to\mathcal{K}$ and $A:\mathcal{H}\to\mathcal{H}$ positive semidefinite. Then the following holds

[TABLE]
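A quick numerical check of an inequality of this type is sketched below. It assumes the proposition is the interpolation bound $\|XA^{\sigma}\|\leq\|X\|^{1-\sigma}\|XA\|^{\sigma}$ for $\sigma\in[0,1]$; this reading, the use of finite-dimensional random matrices in place of operators, and the function names are assumptions of the sketch.

```python
import numpy as np


def psd_power(S, p):
    """S**p for a symmetric positive semi-definite matrix S (p > 0)."""
    vals, vecs = np.linalg.eigh(S)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative round-off
    return (vecs * vals**p) @ vecs.T


def interpolation_sides(X, A, sigma):
    """Return (||X A^sigma||, ||X||^(1-sigma) * ||X A||^sigma) in spectral norm."""
    lhs = np.linalg.norm(X @ psd_power(A, sigma), 2)
    rhs = np.linalg.norm(X, 2) ** (1.0 - sigma) * np.linalg.norm(X @ A, 2) ** sigma
    return float(lhs), float(rhs)


def worst_gap(dim_h=6, dim_k=4, seed=0):
    """Largest value of lhs - rhs over a grid of sigma in [0.1, 1]."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((dim_k, dim_h))  # X maps H (dim_h) to K (dim_k)
    G = rng.standard_normal((dim_h, dim_h))
    A = G @ G.T                              # positive semi-definite on H
    gaps = []
    for sigma in np.linspace(0.1, 1.0, 10):
        lhs, rhs = interpolation_sides(X, A, sigma)
        gaps.append(lhs - rhs)
    return max(gaps)


if __name__ == "__main__":
    print(f"max(lhs - rhs) over the grid: {worst_gap():.2e} (expected <= 0 up to round-off)")
```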

Lemma 9.

Given $\phi_{M}(\boldsymbol{x})=M^{-1/2}\big[\psi(\boldsymbol{x},\omega_{1}),\cdots,\psi(\boldsymbol{x},\omega_{M})\big]^{\top}$, let the i.i.d. random vectors $\bigl[\phi_{M}(\boldsymbol{x}_{1}),\cdots,\phi_{M}(\boldsymbol{x}_{n})\bigr]$ with $n\geq 1$ take values in a separable Hilbert space $\mathcal{H}_{M}$ such that $C_{M}=\mathbb{E}_{\rho_{X}}[\phi_{M}(\boldsymbol{x})\otimes\phi_{M}(\boldsymbol{x})]$ and $\widehat{C}_{M}=\frac{1}{n}\sum_{i=1}^{n}\phi_{M}(\boldsymbol{x}_{i})\otimes\phi_{M}(\boldsymbol{x}_{i})$ are trace class. Then for any $\delta\in(0,1)$, with probability at least $1-\delta$, the following holds

[TABLE]

Proof.

Let $C_{M,\lambda}^{-1/2}=(C_{M}+\lambda I)^{-1/2}$ and

[TABLE]

thus we have

[TABLE]

The left-hand side of the desired inequality becomes

[TABLE]

Note that

[TABLE]

To use Bernstein’s inequality (Proposition 2), we need to bound $\|\xi\|$ and $\mathbb{E}\|\xi\|^{2}$ as follows

[TABLE]

Substituting the above two bounds into Bernstein’s inequality (61), we prove the result. ∎

Lemma 10.

If the number of local samples satisfies $n\geq 16(\mathcal{N}_{\infty}(\lambda)+1)\log(2/\delta)$, then for any $\delta\in(0,1)$, with confidence $1-\delta$, the following holds

[TABLE]

Proof.

From Lemma 9, setting $n\geq 16(\mathcal{N}_{\infty}(\lambda)+1)\log(2/\delta)$, we obtain that

[TABLE]

By Proposition 3 and the above inequality, we obtain

[TABLE]

∎
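The concentration of $\widehat{C}_{M}$ around $C_{M}$ can be observed empirically. The sketch below uses random Fourier features $\psi(\boldsymbol{x},\omega)=\sqrt{2}\cos(\omega^{\top}\boldsymbol{x}+b)$ with Gaussian inputs, approximates the population operator $C_{M}$ by a large Monte Carlo sample, and reports $\|C_{M,\lambda}^{-1/2}(C_{M}-\widehat{C}_{M})C_{M,\lambda}^{-1/2}\|$ for growing $n$. The feature map, the input distribution, the Monte Carlo proxy for $C_{M}$, and all function names are assumptions made for illustration; `min_local_samples` simply evaluates the sample-size condition of Lemma 10 for a user-supplied value of $\mathcal{N}_{\infty}(\lambda)$.

```python
import numpy as np


def rff_map(X, W, b):
    """phi_M(x) = M^{-1/2} [psi(x, w_1), ..., psi(x, w_M)],
    with the assumed feature psi(x, w) = sqrt(2) * cos(w.x + b)."""
    M = W.shape[1]
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)


def second_moment(Phi):
    """(1/n) * sum_i phi(x_i) (x) phi(x_i) as an M x M matrix."""
    return Phi.T @ Phi / Phi.shape[0]


def normalized_deviation(C, C_hat, lam):
    """|| (C + lam I)^{-1/2} (C - C_hat) (C + lam I)^{-1/2} || (spectral norm)."""
    vals, vecs = np.linalg.eigh(C + lam * np.eye(C.shape[0]))
    inv_half = (vecs / np.sqrt(vals)) @ vecs.T
    return float(np.linalg.norm(inv_half @ (C - C_hat) @ inv_half, 2))


def min_local_samples(n_inf, delta):
    """Smallest integer n with n >= 16 * (N_inf(lambda) + 1) * log(2 / delta)."""
    return int(np.ceil(16.0 * (n_inf + 1.0) * np.log(2.0 / delta)))


def deviation_curve(sample_sizes, d=3, M=30, lam=1e-2, seed=0):
    """Normalized deviation for each n; C_M is a Monte Carlo proxy (assumption)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, M))
    b = rng.uniform(0.0, 2.0 * np.pi, size=M)
    C = second_moment(rff_map(rng.standard_normal((100_000, d)), W, b))
    devs = []
    for n in sample_sizes:
        C_hat = second_moment(rff_map(rng.standard_normal((n, d)), W, b))
        devs.append(normalized_deviation(C, C_hat, lam))
    return devs


if __name__ == "__main__":
    sizes = [50, 500, 5000]
    for n, dev in zip(sizes, deviation_curve(sizes)):
        print(f"n = {n:5d}: normalized deviation = {dev:.3f}")
    print("sample-size condition for N_inf = 9, delta = 0.1:",
          min_local_samples(9.0, 0.1))
```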

Lemma 11.

Let $\psi_{\omega_{1}},\cdots,\psi_{\omega_{M}}$ with $M\geq 1$ be i.i.d. random vectors on a separable Hilbert space $\mathcal{H}_{M}$ such that $L=\mathbb{E}_{\omega}[\psi_{\omega}\otimes\psi_{\omega}]$ and $L_{M}=\frac{1}{M}\sum_{i=1}^{M}[\psi_{\omega_{i}}\otimes\psi_{\omega_{i}}]$ are trace class. Then for any $\delta\in(0,1)$, with probability at least $1-\delta$, the following holds

[TABLE]

Proof.

Let $L_{\lambda}^{-1/2}=(L+\lambda I)^{-1/2}$ and

[TABLE]

thus we have

[TABLE]

The left-hand side of the desired inequality becomes

[TABLE]

Note that

[TABLE]

To use Bernstein’s inequality (Proposition 2), we need to bound $\|\xi\|$ and $\mathbb{E}\|\xi\|^{2}$ . Note that

[TABLE]

Substituting the above two bounds into Bernstein’s inequality (61), we prove the result. ∎

Lemma 12.

If the number of random features satisfies $M\geq 16(\mathcal{N}_{\infty}(\lambda)+1)\log(2/\delta)$, then for any $\delta\in(0,1)$, with confidence $1-\delta$, the following holds

[TABLE]

Proof.

From Lemma 11, setting $M\geq 16(\mathcal{N}_{\infty}(\lambda)+1)\log(2/\delta)$, we obtain that

[TABLE]

By Proposition 3 and the above inequality, we obtain

[TABLE]

∎

Lemma 13.

Let $\psi_{\omega_{1}},\cdots,\psi_{\omega_{M}}$ with $M\geq 1$ be i.i.d. random vectors on a separable Hilbert space $\mathcal{H}_{M}$ such that $L=\mathbb{E}_{\omega}[\psi_{\omega}\otimes\psi_{\omega}]$ and $L_{M}=\frac{1}{M}\sum_{i=1}^{M}[\psi_{\omega_{i}}\otimes\psi_{\omega_{i}}]$ are trace class. Then for any $\delta\in(0,1)$, with probability at least $1-\delta$, the following holds

[TABLE]

Proof.

Let $L_{\lambda}^{-1/2}=(L+\lambda I)^{-1/2}$ and

[TABLE]

thus we have

[TABLE]

The left-hand side of the desired inequality becomes

[TABLE]

Note that

[TABLE]

To use Bernstein’s inequality (Proposition 2), we need to bound $\|\xi\|$ and $\mathbb{E}\|\xi\|^{2}$ . Note that

[TABLE]

The last step is due to $\mathcal{N}(\lambda)\leq\mathcal{N}_{\infty}(\lambda)$. Substituting the above two bounds and $\kappa\geq 1$ into Bernstein’s inequality (61), we prove the result. ∎

Bibliography


[1] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 21 (NIPS), pages 161–168, 2008.
[2] Jian Li, Yong Liu, Rong Yin, Hua Zhang, Lizhong Ding, and Weiping Wang. Multi-class learning: From theory to algorithm. In Advances in Neural Information Processing Systems 31, pages 1591–1600, 2018.
[3] Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. Journal of Machine Learning Research, 16(1):3299–3340, 2015.
[4] Shao-Bo Lin, Xin Guo, and Ding-Xuan Zhou. Distributed learning with regularized least squares. Journal of Machine Learning Research, 18(1):3202–3232, 2017.
[5] Christopher KI Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 14 (NIPS), pages 682–688, 2001.
[6] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems 28 (NIPS), pages 1657–1665, 2015.
[7] Jian Li, Yong Liu, Rong Yin, and Weiping Wang. Approximate manifold regularization: Scalable algorithm and generalization analysis. In IJCAI, pages 2887–2893, 2019.
[8] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 21 (NIPS), pages 1177–1184, 2007.

