Spectrally-truncated kernel ridge regression and its free lunch

Arash A. Amini

arXiv:1906.06276·stat.ML·October 15, 2019

TL;DR

This paper analyzes spectrally-truncated kernel ridge regression, revealing that truncation can outperform full KRR in minimax risk for infinite-dimensional RKHS, and explores the trade-offs between spectral truncation and regularization.

Contribution

It provides an exact risk expression for truncated KRR and demonstrates that spectral truncation can improve performance beyond full KRR in certain regimes.

Findings

1. Spectral truncation can outperform full KRR in minimax risk.
2. There exists a threshold on the number of eigenvalues retained for improved performance.
3. Implicit regularization from truncation complements Hilbert norm regularization.

Equations

$$y_{i}=f^{*}(x_{i})+w_{i},\quad i=1,\dots,n,\qquad \mathbb{E}\,w=0,\quad \operatorname{cov}(w)=\sigma^{2}I_{n}$$

$$f^{*}\in\mathbb{B}_{\mathcal{H}}:=\{f\in\mathcal{H}:\ \|f\|_{\mathcal{H}}\leq 1\}$$

$$\widetilde{f}_{n,\lambda}:=\operatorname*{argmin}_{f\in\mathcal{H}}\ \frac{1}{n}\sum_{i=1}^{n}\big(y_{i}-f(x_{i})\big)^{2}+\lambda\|f\|_{\mathcal{H}}^{2}$$

$$\min_{\omega\,\in\,\mathbb{R}^{n}}\;\frac{1}{n}\|y-\sqrt{n}K\omega\|^{2}+\lambda\omega^{T}K\omega,\quad\text{where}\quad K=\frac{1}{n}\big(\mathbb{K}(x_{i},x_{j})\big)\in\mathbb{R}^{n\times n}$$

$$f_{\omega}:=\frac{1}{\sqrt{n}}\sum_{j=1}^{n}\omega_{j}\,\mathbb{K}(\cdot,x_{j})$$

$$\|f_{\omega}\|_{\mathcal{H}}^{2}=\omega^{T}K\omega,\qquad f_{\omega}(x_{i})=\sqrt{n}\,(K\omega)_{i}$$

$$\mathcal{L}_{X}:=\operatorname{span}\{\mathbb{K}(\cdot,x_{i}):\ i\in[n]\}=\{f_{\omega}:\ \omega\in\mathbb{R}^{n}\}$$

$$\begin{aligned}\{f\in\mathcal{H}:\ f(x_{i})=f^{*}(x_{i}),\ \forall i\}&=\{f^{*}+g:\ g\perp_{\mathcal{H}}\mathcal{L}_{X}\}\\&=\{f_{\omega^{*}}+g:\ g\in\mathcal{L}_{X}^{\perp}\}=f_{\omega^{*}}+\mathcal{L}_{X}^{\perp}\end{aligned}$$

$$\|f-g\|_{n}=\Big[\frac{1}{n}\sum_{i=1}^{n}\big(f(x_{i})-g(x_{i})\big)^{2}\Big]^{1/2}$$

$$\|f-f^{*}\|_{n}=\|f-f_{\omega^{*}}\|_{n},\quad\forall f\in\mathcal{H}$$

$$\widetilde{K}=\widetilde{K}_{r}:=U\begin{pmatrix}D_{r}&0\\0&0\end{pmatrix}U^{T}=U_{r}D_{r}U_{r}^{T}$$

$$\widetilde{\omega}\in\operatorname*{argmin}_{\omega\in\mathbb{R}^{n}}\ \frac{1}{n}\|y-\sqrt{n}\widetilde{K}\omega\|^{2}+\lambda\omega^{T}\widetilde{K}\omega\quad\text{such that}\quad\widetilde{K}\widetilde{\omega}=K\widetilde{\omega}$$

$$\operatorname{MSE}(\widetilde{f},f^{*})=\mathbb{E}\|\widetilde{f}-f^{*}\|_{n}^{2}$$

$$H_{r}(\lambda):=\max_{1\leq i\leq r}h(\lambda;\mu_{i})$$

$$\sup_{f^{*}\,\in\,\mathbb{B}_{\mathcal{H}}}\operatorname{MSE}(\widetilde{f}_{r,\lambda},f^{*})\;=\;\max\big\{H_{r}(\lambda),\;\mu_{r+1}\big\}+\frac{\sigma^{2}}{n}\sum_{i=1}^{r}\Big(\frac{\mu_{i}}{\mu_{i}+\lambda}\Big)^{2}$$

$$\text{WAE}_{r,\lambda}\leq\max\Big\{\frac{\lambda}{4},\;\mu_{r+1}\Big\}$$

$$\sup_{f^{*}\,\in\,\mathbb{B}_{\mathcal{H}}}\|f^{*}-\bar{f}\|_{n}^{2}+\lambda\|\bar{f}\|_{\mathcal{H}}^{2}\;=\;\max\Big\{\max_{1\leq i\leq n}\frac{\lambda\mu_{i}}{\mu_{i}+\lambda},\;\mu_{r+1}\Big\}$$

$$r(\lambda):=\min\{r\in[n]:\ \mu_{r+1}\leq H_{n}(\lambda)\}$$

$$\lambda_{n}:=\operatorname*{argmin}_{\lambda>0}\ \sup_{f^{*}\in\mathbb{B}_{\mathcal{H}}}\operatorname{MSE}(\widetilde{f}_{n,\lambda},f^{*}),\quad\text{and}\quad r_{n}:=r(\lambda_{n})$$

$$\sup_{f^{*}\in\mathbb{B}_{\mathcal{H}}}\operatorname{MSE}(\widetilde{f}_{r,\lambda},f^{*})\;\leq\;\sup_{f^{*}\in\mathbb{B}_{\mathcal{H}}}\operatorname{MSE}(\widetilde{f}_{n,\lambda},f^{*})$$

$$\min_{\lambda>0}\ \sup_{f^{*}\in\mathbb{B}_{\mathcal{H}}}\operatorname{MSE}(\widetilde{f}_{r,\lambda},f^{*})\;\leq\;\min_{\lambda>0}\ \sup_{f^{*}\in\mathbb{B}_{\mathcal{H}}}\operatorname{MSE}(\widetilde{f}_{n,\lambda},f^{*})$$

$$\lambda_{r}:=\operatorname*{argmin}_{\lambda>0}\ \sup_{f^{*}\in\mathbb{B}_{\mathcal{H}}}\operatorname{MSE}(\widetilde{f}_{r,\lambda},f^{*})$$

$$\lambda_{r}\geq\max\Big\{\frac{\mu_{r}}{\sqrt{\mu_{r}/\mu_{r+1}}-1},\ \frac{\sigma^{2}}{n}\big(1+B_{r}\big)\Big\}$$

$$R_{r}(\delta)=\Big(\frac{\sigma^{2}}{n}\sum_{i=1}^{r}\min\{\mu_{i},\delta^{2}\}\Big)^{1/2}$$

$$\sup_{f^{*}\,\in\,\mathbb{B}_{\mathcal{H}}}\operatorname{MSE}(\widetilde{f}_{r,\lambda},f^{*})\;\leq\;\frac{1}{4}\lambda+\Big(\frac{R_{r}(\delta)}{\delta}\Big)^{2}$$

$$\text{RMSE}=\sqrt{\operatorname{MSE}}\;\leq\;\frac{\delta}{2}+\frac{R_{r}(\delta)}{\delta}\;\leq\;\frac{\delta}{2}+\frac{R_{n}(\delta)}{\delta}$$

$$\|f_{\omega}-f_{\omega^{*}}\|_{n}=\|f_{\omega-\omega^{*}}\|_{n}=\|K(\omega-\omega^{*})\|=\|u-u^{*}\|$$

$$\min_{u\,\in\,\operatorname{ran}(K)}\ \frac{1}{n}\|y-\sqrt{n}\,u\|^{2}+\lambda u^{T}K^{+}u$$

$$\widetilde{K}\widetilde{\omega}=\Psi_{\lambda}\widetilde{y},\quad\text{where}\quad\Psi_{\lambda}=(\widetilde{K}+\lambda I)^{-1}\widetilde{K}$$

Full text

Spectrally-truncated kernel ridge regression and its free lunch

Arash A. Amini

Department of Statistics

University of California

Los Angeles

Abstract

Kernel ridge regression (KRR) is a well-known and popular nonparametric regression approach with many desirable properties, including minimax rate-optimality in estimating functions that belong to common reproducing kernel Hilbert spaces (RKHS). The approach, however, is computationally intensive for large data sets, due to the need to operate on a dense $n\times n$ kernel matrix, where $n$ is the sample size. Recently, various approximation schemes for solving KRR have been considered, and some analyzed. Some approaches such as Nyström approximation and sketching have been shown to preserve the rate optimality of KRR. In this paper, we consider the simplest approximation, namely, spectrally truncating the kernel matrix to its largest $r<n$ eigenvalues. We derive an exact expression for the maximum risk of this truncated KRR, over the unit ball of the RKHS. This result can be used to study the exact trade-off between the level of spectral truncation and the regularization parameter. We show that, as long as the RKHS is infinite-dimensional, there is a threshold on $r$ above which the spectrally-truncated KRR surprisingly outperforms the full KRR in terms of the minimax risk, where the minimum is taken over the regularization parameter. This strengthens the existing results on approximation schemes by showing that not only does one not lose in terms of rates, but truncation can in fact improve the performance, for all finite samples (above the threshold). Moreover, we show that the implicit regularization achieved by spectral truncation is not a substitute for Hilbert norm regularization. Both are needed to achieve the best performance.

Keywords: kernel methods; ridge regression; spectral truncation; nonparametric regression; minimax estimation.

1 Introduction

The general nonparametric regression problem can be stated as

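$$y_{i}=f^{*}(x_{i})+w_{i},\quad i=1,\dots,n,\qquad \mathbb{E}\,w=0,\quad \operatorname{cov}(w)=\sigma^{2}I_{n}, \tag{1}$$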

where $w=(w_{i})\in\mathbb{R}^{n}$ is a noise vector and $f^{*}:\mathcal{X}\to\mathbb{R}$ is the function of interest to be approximated from the noisy observations $\{y_{i}\}$ . Here, $\mathcal{X}$ is the space to which the covariates $\{x_{i}\}$ belong. We consider the fixed design regression where the covariates are assumed to be deterministic. The problem has a long history in statistics and machine learning [1, 2]. In this paper, we assume that $f^{*}$ belongs to a reproducing kernel Hilbert space (RKHS), denoted as $\mathcal{H}$ [3]. Such spaces are characterized by the existence of a reproducing kernel, that is, a positive semidefinite function $\mathbb{K}:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ that uniquely determines the underlying function space $\mathcal{H}$ . RKHSs are very versatile modeling tools and include, for example, Sobolev spaces of smooth functions whose norms are measures of function roughness [4]. Throughout, we think of these Sobolev spaces as the concrete examples of $\mathcal{H}$ . By assuming an upper bound on the Hilbert norm of $f^{*}$ , we can encode a prior belief that the true data generating function $f^{*}$ has a certain degree of smoothness. Without loss of generality, we assume that $f^{*}$ belongs to the unit ball of the RKHS, that is,

$$f^{*}\in\mathbb{B}_{\mathcal{H}}:=\{f\in\mathcal{H}:\ \|f\|_{\mathcal{H}}\leq 1\}. \tag{2}$$

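A natural estimator is then kernel ridge regression (KRR), defined as the solution of the following optimization problem:

$$\widetilde{f}_{n,\lambda}:=\operatorname*{argmin}_{f\in\mathcal{H}}\ \frac{1}{n}\sum_{i=1}^{n}\big(y_{i}-f(x_{i})\big)^{2}+\lambda\|f\|_{\mathcal{H}}^{2}, \tag{3}$$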

where $\lambda>0$ is a regularization parameter. It is well-known that this problem can be reduced to a finite-dimensional problem, by an application of the so-called representer theorem [5]:

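$$\min_{\omega\,\in\,\mathbb{R}^{n}}\;\frac{1}{n}\|y-\sqrt{n}K\omega\|^{2}+\lambda\omega^{T}K\omega,\quad\text{where}\quad K=\frac{1}{n}\big(\mathbb{K}(x_{i},x_{j})\big)\in\mathbb{R}^{n\times n} \tag{4}$$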

is the (normalized empirical) kernel matrix. Although (4) has a closed form solution, it involves inverting an $n\times n$ dense matrix, with time complexity $O(n^{3})$ , which is prohibitive in practice.
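As a minimal sketch of the closed-form solution mentioned above: with the normalization $K=\frac{1}{n}(\mathbb{K}(x_{i},x_{j}))$, setting the gradient of the finite-dimensional objective to zero gives $\omega=(K+\lambda I)^{-1}y/\sqrt{n}$. The function name `krr_fit`, the Gaussian kernel, and the synthetic data below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def krr_fit(K, y, lam):
    """Solve min_w (1/n)||y - sqrt(n) K w||^2 + lam * w^T K w.

    K is the normalized kernel matrix (1/n) * (kernel(x_i, x_j)),
    assumed symmetric positive semidefinite; lam > 0.
    """
    n = K.shape[0]
    # Stationarity: K[(K + lam I) w - y/sqrt(n)] = 0,
    # which holds for w = (K + lam I)^{-1} y / sqrt(n).
    w = np.linalg.solve(K + lam * np.eye(n), y / np.sqrt(n))
    fitted = np.sqrt(n) * (K @ w)  # f_w(x_i) = sqrt(n) (K w)_i
    return w, fitted

# Hypothetical data: Gaussian kernel on an equi-spaced 1-D grid.
rng = np.random.default_rng(0)
n, lam = 50, 1e-3
x = np.linspace(-1.0, 1.0, n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.5 ** 2)) / n
y = np.sin(3 * x) + 0.1 * rng.standard_normal(n)
w, fitted = krr_fit(K, y, lam)
```

The dense solve is the $O(n^{3})$ step that the approximation schemes discussed next aim to avoid.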

Various approximation schemes have been proposed to mitigate the computational costs, including (i) approximating the kernel matrix or (ii) directly approximating the optimization problem (4). Examples of the former are the Nyström approximation, column sampling and their variants [6, 7, 8, 9, 10]. An example of the latter is sketching [11, 12] where one restricts $\omega$ to the subspace $\operatorname{ran}(S):=\{S\alpha\mid\alpha\in\mathbb{R}^{r}\}$ , for some random matrix $S\in\mathbb{R}^{n\times r}$ . It is in fact known that Nyström can be considered a special case of sketching with random standard basis vectors [12]. Sketching, with sufficiently large $r$ , has been shown in [12] to achieve minimax optimal rates over Sobolev spaces, under mild conditions on the sketching matrix $S$ . Similarly, the Nyström approximation has been analyzed in [13, 14, 15, 11, 16] and [17], the latter showing minimax rate optimality. In addition to the above, (iii) divide and conquer approaches have been proposed [18], where one solves the problem over subsamples and then aggregates by averaging, with some rate optimality guarantees. Other notable approaches to scaling include (iv) approximating translation-invariant kernel functions via Monte Carlo averages of tensor products of randomized feature maps [19, 20] and (v) applying stochastic gradient in the function space [21]. Memory efficiency in kernel approximation is considered in [22].

In this paper, we consider the most direct kernel approximation, namely, replacing $K$ by its best rank $r$ approximation (in Frobenius norm). This amounts to truncating the eigenvalue decomposition of $K$ to its top $r$ eigenvalues. We refer to the resulting KRR approximation as the spectrally-truncated KRR (ST-KRR). Although somewhat slower than the Nyström approximation and fast forms of sketching, ST-KRR can be considered an ideal rank- $r$ spectral approximation. By analyzing it, one can also gain insights about approximate SVD truncation approaches such as Nyström or sketching. Practically, ST-KRR is a very viable solution for moderate-size problems. See Appendix A for a discussion of the time complexity of various schemes.

We derive an exact expression for the maximum (empirical) mean-squared error (MSE) of ST-KRR, uniformly over the unit ball of the RKHS. This expression is solely in terms of the eigenvalues $\{\mu_{i}\}$ of the kernel matrix $K$ , the regularization parameter $\lambda$ , the truncation level $r$ , and the noise level $\sigma^{2}$ . Thus if one has access to $\{\mu_{i}\}$ and the noise level (or estimates of them), one can plot the exact regularization curve (maximum MSE versus $\lambda$ ) for a given truncation level $r$ and sample size $n$ , and determine the optimal value of $\lambda$ . We also note that since the empirical eigenvalues $\{\mu_{i}\}$ quickly approach those of the integral operator associated with $\mathbb{K}$ , as $n\to\infty$ [23], one can use these idealized eigenvalues instead of $\{\mu_{i}\}$ to get an excellent approximation of these regularization curves.

We then show that there is an optimal threshold on $r$ , the truncation level, which we denote as $r_{n}$ , such that for all $r\geq r_{n}$ , the minimax risk of the $r$ -truncated KRR, with the minimum taken over the regularization parameter, is strictly smaller than that of the full KRR whenever $\mu_{r+1}>0$ . For infinite-dimensional RKHSs, we always have $\mu_{r_{n}+1}>0$ , hence truncating at level $r_{n}$ is guaranteed to strictly improve performance. The slower the decay of the eigenvalues, the larger this gap in performance.

This result shows that although the spectral truncation is mainly used as a computational device, it also has a statistical regularization effect. The next question is whether the regularization provided by the spectral truncation renders Hilbert norm regularization (via $\lambda$ ) unnecessary. We answer this question in the negative by showing that for any truncation level $r$ , the optimal maximum risk is achieved for a positive $\lambda$ . Together, these results show that the “ $r$ -truncated $\lambda$ -regularized KRR” defines a new class of estimators whose performance cannot be achieved (in finite sample) with either regularization alone.

We also show how the exact expression for the maximum MSE can be used to easily establish a slightly weaker bound for ST-KRR, similar to those derived in [12] for sketching. We discuss the link between the statistical dimension considered in [12] and the optimal truncation level $r_{n}$, and show how the same rate-optimality guarantees hold for ST-KRR. Rate-optimality also follows from the fact that ST-KRR, with proper $r$, strictly dominates full KRR and the latter is rate-optimal. However, we carry out these calculations to make the comparison easier.

Finally, we illustrate the results with some numerical simulations showing some further surprises. For example, the Gaussian kernel has a much faster eigendecay rate than a Sobolev-1 kernel (exponential versus polynomial decay). Hence, the optimal truncation level $r_{n}$ asymptotically grows much slower for the Gaussian kernel. However, for finite samples, depending on the choice of the Gaussian bandwidth, the exact optimal truncation level, computed numerically, can be larger than that of Sobolev-1.

2 Preliminaries

Let us start with some observations regarding the original KRR problem in (3). For $\omega\in\mathbb{R}^{n}$ , consider the kernel mapping

$$f_{\omega}:=\frac{1}{\sqrt{n}}\sum_{j=1}^{n}\omega_{j}\,\mathbb{K}(\cdot,x_{j}). \tag{5}$$

Note that $\omega\mapsto f_{\omega}$ is a linear map from $\mathbb{R}^{n}\to\mathcal{H}$ . This map is the link between the solutions of the two optimization problems (3) and (4): For any optimal solution $\omega$ of (4), $f_{\omega}$ will be an optimal solution of (3). The link is easy to establish by observing the following two identities:

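$$\|f_{\omega}\|_{\mathcal{H}}^{2}=\omega^{T}K\omega,\qquad f_{\omega}(x_{i})=\sqrt{n}\,(K\omega)_{i}, \tag{6}$$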

the first of which uses the reproducing property of the kernel: $\langle f,\mathbb{K}(\,\cdot\,,x)\rangle_{\mathcal{H}}=f(x)$ . We will frequently use this property in the sequel. The proof of the equivalence follows from an argument similar to our discussion of the identifiability below.

2.1 Identifiability

Let us first observe that $f^{*}$ in (1) is not (statistically) identifiable. That is, there are multiple functions $f^{*}$ (in fact, infinitely many if $\mathcal{H}$ is infinite-dimensional) for which the vector $(y_{i})$ has the exact same distribution. To see this, let

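$$\mathcal{L}_{X}:=\operatorname{span}\{\mathbb{K}(\cdot,x_{i}):\ i\in[n]\}=\{f_{\omega}:\ \omega\in\mathbb{R}^{n}\}, \tag{7}$$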

and let $f_{\omega^{*}}$ be the projection of $f^{*}$ onto $\mathcal{L}_{X}$ . (It is always possible to choose at least one such $\omega^{*}$ by the definition of projection and since $\mathcal{L}_{X}$ is a closed subspace of $\mathcal{H}$ .) Given observations $(y_{i})$ , we can only hope to recover the following equivalence class:

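$$\begin{aligned}\{f\in\mathcal{H}:\ f(x_{i})=f^{*}(x_{i}),\ \forall i\}&=\{f^{*}+g:\ g\perp_{\mathcal{H}}\mathcal{L}_{X}\}\\&=\{f_{\omega^{*}}+g:\ g\in\mathcal{L}_{X}^{\perp}\}=f_{\omega^{*}}+\mathcal{L}_{X}^{\perp},\end{aligned}$$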

where the last line follows since $f^{*}-f_{\omega^{*}}\in\mathcal{L}_{X}^{\perp}$ by the property of orthogonal projection (and can be absorbed into $g$ ).

We will use $f_{\omega^{*}}$ as the representative of the (identifiable) equivalence class of $f^{*}$ . We are interested in measuring functional deviations (e.g., the error in our estimate relative to the true function) in the empirical $\ell_{2}$ norm:

$$\|f-g\|_{n}=\Big[\frac{1}{n}\sum_{i=1}^{n}\big(f(x_{i})-g(x_{i})\big)^{2}\Big]^{1/2}.$$

The use of this norm is common in the literature of nonparametric regression [24, 25]. It is interesting to note that $\|f^{*}-f_{\omega^{*}}\|_{n}=0$ ,

$$\|f-f^{*}\|_{n}=\|f-f_{\omega^{*}}\|_{n},\quad\forall f\in\mathcal{H}, \tag{8}$$

and $\|f_{\omega^{*}}\|_{\mathcal{H}}\leq\|f^{*}\|_{\mathcal{H}}$ , since projections are contractive. Thus, recalling (2), $f_{\omega^{*}}$ also belongs to the Hilbert unit ball: $f_{\omega^{*}}\in\operatorname{\mathbb{B}}_{\mathcal{H}}$ . It is in fact easy to see that $f_{\omega^{*}}$ has the least Hilbert norm among the members in the equivalence class (i.e., the smoothest version). Thus, without loss of generality, we can identify $f^{*}$ with $f_{\omega^{*}}$ . Equivalently, we can assume from the start that $f^{*}$ is of the form $f_{\omega^{*}}$ for some $\omega^{*}\in\mathbb{R}^{n}$ . Note that the “no loss of generality” statement holds as long as we are working with the empirical $\ell_{2}$ norm, due to (8).

3 Main results

Let $K=UDU^{T}$ be the eigenvalue decomposition (EVD) of the empirical kernel matrix defined in (4). Here, $U\in\mathbb{R}^{n\times n}$ is an orthogonal matrix and $D=\operatorname{diag}(\mu_{i})_{i=1}^{n}$ where $\mu_{1}\geq\mu_{2}\geq\dots\geq\mu_{n}\geq 0$ are the eigenvalues of $K$ . We assume for simplicity that $\mu_{n}>0$ , that is, the exact kernel matrix is invertible. Consider the rank $r$ approximation of $K$ , obtained by keeping the top $r$ eigenvalues and truncating the rest to zero, that is,

$$\widetilde{K}=\widetilde{K}_{r}:=U\begin{pmatrix}D_{r}&0\\0&0\end{pmatrix}U^{T}=U_{r}D_{r}U_{r}^{T}.$$

Here, $D_{r}=\operatorname{diag}(\mu_{1},\dots,\mu_{r})$ and $U_{r}\in\mathbb{R}^{n\times r}$ collects the first $r$ columns of $U$ . The idea is to solve (4) with $K$ replaced with $\widetilde{K}$ , to obtain $\widetilde{\omega}$ . We then form our functional estimate $\widetilde{f}$ by using the (exact) kernel mapping (5).

Definition 1.

An $r$ -truncated $\lambda$ -regularized KRR estimator with input $y\in\mathbb{R}^{n}$ , is a function $\widetilde{f}:=f_{\widetilde{\omega}}=\frac{1}{\sqrt{n}}\sum_{j}\widetilde{\omega}_{j}\mathbb{K}(\cdot,x_{j})$ where

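$$\widetilde{\omega}\in\operatorname*{argmin}_{\omega\in\mathbb{R}^{n}}\ \frac{1}{n}\|y-\sqrt{n}\widetilde{K}\omega\|^{2}+\lambda\omega^{T}\widetilde{K}\omega \tag{9}$$

$$\text{such that}\quad\widetilde{K}\widetilde{\omega}=K\widetilde{\omega}. \tag{10}$$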

A minimizer in (9), without the additional condition $\widetilde{K}\widetilde{\omega}=K\widetilde{\omega}$ , is not unique due to the rank deficiency of $\widetilde{K}$ . Thus, we can ask for it to satisfy additional constraints. The equality condition in (10), which can be stated as $\widetilde{\omega}\in\ker(\widetilde{K}-K)$ can always be satisfied. It is enough to choose $\widetilde{\omega}$ to be the unique minimizer in $\operatorname{ran}(\widetilde{K})=\operatorname{ran}(U_{r})$ , that is, $\widetilde{\omega}=U_{r}\alpha$ for some $\alpha\in\mathbb{R}^{r}$ . This is how the estimator is often implemented in practice.
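The implementation route described above, restricting $\widetilde{\omega}$ to $\operatorname{ran}(U_{r})$, can be sketched as follows. In the basis $U_{r}$ the objective decouples, which gives $\alpha=(D_{r}+\lambda I)^{-1}U_{r}^{T}y/\sqrt{n}$. The function name `st_krr_fit` and the demo data are hypothetical, and a full eigendecomposition is used purely for clarity.

```python
import numpy as np

def st_krr_fit(K, y, r, lam):
    """r-truncated, lam-regularized KRR with omega restricted to ran(U_r).

    K is the normalized kernel matrix (1/n) * (kernel(x_i, x_j)).
    Returns omega = U_r @ alpha and the fitted values at the design points.
    """
    n = K.shape[0]
    mu, U = np.linalg.eigh(K)            # eigenvalues in ascending order
    top = np.argsort(mu)[::-1][:r]       # indices of the r largest eigenvalues
    mu_r, U_r = mu[top], U[:, top]
    # alpha = (D_r + lam I)^{-1} U_r^T y / sqrt(n), computed coordinate-wise.
    alpha = (U_r.T @ y) / (np.sqrt(n) * (mu_r + lam))
    omega = U_r @ alpha
    # omega lies in ran(U_r), so K @ omega coincides with K_r @ omega.
    fitted = np.sqrt(n) * (K @ omega)
    return omega, fitted

# Hypothetical data: Gaussian kernel on an equi-spaced 1-D grid.
rng = np.random.default_rng(0)
n, lam = 60, 1e-3
x = np.linspace(-1.0, 1.0, n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.5 ** 2)) / n
y = np.sin(3 * x) + 0.1 * rng.standard_normal(n)
omega, fitted = st_krr_fit(K, y, r=5, lam=lam)
```

In practice one would likely compute only the top $r$ eigenpairs with a partial eigensolver instead of calling a full `eigh`; that choice is orthogonal to the estimator itself.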

We are interested in the deviation of $\widetilde{f}$ from the true function $f^{*}$ in the empirical $\ell_{2}$ norm. More precisely, we are interested in the mean-squared error as the statistical risk:

$$\operatorname{MSE}(\widetilde{f},f^{*})=\mathbb{E}\|\widetilde{f}-f^{*}\|_{n}^{2}.$$

Our main result is an expression for the worst-case risk of $\widetilde{f}$ over the unit ball of the RKHS:

Theorem 1.

Let $\widetilde{f}=\widetilde{f}_{r,\lambda}$ be an $r$ -truncated $\lambda$ -regularized KRR estimator (Definition 1) applied to input $y$ generated from model (1). Let

$$H_{r}(\lambda):=\max_{1\leq i\leq r}h(\lambda;\mu_{i}),$$

where $h(\lambda;x)=\lambda^{2}x/(x+\lambda)^{2}$ . Then, for all $r=1,2,\dots,n$ and $\lambda>0$ ,

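$$\sup_{f^{*}\,\in\,\mathbb{B}_{\mathcal{H}}}\operatorname{MSE}(\widetilde{f}_{r,\lambda},f^{*})\;=\;\max\big\{H_{r}(\lambda),\;\mu_{r+1}\big\}+\frac{\sigma^{2}}{n}\sum_{i=1}^{r}\Big(\frac{\mu_{i}}{\mu_{i}+\lambda}\Big)^{2}, \tag{11}$$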

with $\mu_{n+1}:=0$ .

The first term in (11) is the worst-case approximation error (WAE) and the second term the estimation error (EE). The approximation error (AE) is the risk (relative to $f^{*}$ ) of $\bar{f}$ which is obtained by passing the noiseless observations $(f^{*}(x_{i}))$ , instead of $y$ , through the estimation procedure. The AE is the deterministic part of the risk and is given by $\|\bar{f}-f^{*}\|_{n}^{2}$ . The estimation error is the stochastic part of the risk and is given by $\mathbb{E}\|\widetilde{f}-\bar{f}\|_{n}^{2}$ .
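Since the maximum-risk expression of Theorem 1 depends only on the eigenvalues, $\lambda$, $r$, $\sigma^{2}$ and $n$, it is straightforward to evaluate numerically. The sketch below computes $\max\{H_{r}(\lambda),\mu_{r+1}\}+\frac{\sigma^{2}}{n}\sum_{i\leq r}(\mu_{i}/(\mu_{i}+\lambda))^{2}$ with $h(\lambda;x)=\lambda^{2}x/(x+\lambda)^{2}$. The function name `max_risk` and the three toy eigenvalues are assumptions made for illustration.

```python
import numpy as np

def max_risk(mu, r, lam, sigma2):
    """Worst-case MSE over the unit ball for r-truncated KRR (Theorem 1).

    mu: eigenvalues of the normalized kernel matrix, in decreasing order.
    """
    mu = np.asarray(mu, dtype=float)
    n = mu.size
    top = mu[:r]
    h = lam ** 2 * top / (top + lam) ** 2               # h(lam; mu_i), i <= r
    mu_next = mu[r] if r < n else 0.0                   # mu_{r+1}, with mu_{n+1} := 0
    wae = max(h.max(), mu_next)                         # worst-case approximation error
    ee = sigma2 / n * np.sum((top / (top + lam)) ** 2)  # estimation error
    return wae + ee

# Toy spectrum (hypothetical): n = 3, keep r = 2 eigenvalues.
value = max_risk([1.0, 0.5, 0.25], r=2, lam=0.5, sigma2=3.0)
```

Evaluating `max_risk` over a grid of $\lambda$ values gives the regularization curve for a given truncation level.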

The function $x\mapsto h(\lambda;x)$ attains its maximum of $\lambda/4$ , over $[0,\infty)$ , at $x=\lambda$ . Thus, as long as $\lambda\in[\mu_{r},\mu_{1}]$ , the bound $H_{r}(\lambda)\leq\lambda/4$ is good. In general,

$$\text{WAE}_{r,\lambda}\leq\max\Big\{\frac{\lambda}{4},\;\mu_{r+1}\Big\}. \tag{12}$$
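The maximum value $\lambda/4$ of $x\mapsto h(\lambda;x)$ quoted above can be checked by differentiating in $x$. A short worked derivation:

```latex
\begin{aligned}
h(\lambda;x) &= \frac{\lambda^{2}x}{(x+\lambda)^{2}}, \\
\frac{\partial h}{\partial x}(\lambda;x)
  &= \frac{\lambda^{2}(x+\lambda)^{2}-2\lambda^{2}x(x+\lambda)}{(x+\lambda)^{4}}
   = \frac{\lambda^{2}(\lambda-x)}{(x+\lambda)^{3}}, \\
\frac{\partial h}{\partial x}(\lambda;x)=0 &\iff x=\lambda,
  \qquad h(\lambda;\lambda)=\frac{\lambda^{3}}{4\lambda^{2}}=\frac{\lambda}{4}.
\end{aligned}
```

The derivative is positive for $x<\lambda$ and negative for $x>\lambda$, so $x=\lambda$ is the global maximizer on $[0,\infty)$, and $h(\lambda;\cdot)$ is increasing on $[0,\lambda]$.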

We note that since the KRR estimates are linear in $y$ , Theorem 1 easily gives the maximum MSE expression over the Hilbert ball of arbitrary radius $R$ , by replacing $\sigma^{2}$ in (11) with $\sigma^{2}/R^{2}$ and multiplying the entire right-hand side by $R^{2}$ .

We also have a precise result on the regularized risk of the approximating function:

Proposition 1.

Let $\bar{f}=\bar{f}_{r,\lambda}$ be obtained by passing the noiseless observations $(f^{*}(x_{i}))$ , instead of $y$ , through the estimation procedure in Definition 1. Then,

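$$\sup_{f^{*}\,\in\,\mathbb{B}_{\mathcal{H}}}\|f^{*}-\bar{f}\|_{n}^{2}+\lambda\|\bar{f}\|_{\mathcal{H}}^{2}\;=\;\max\Big\{\max_{1\leq i\leq n}\frac{\lambda\mu_{i}}{\mu_{i}+\lambda},\;\mu_{r+1}\Big\}.$$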

3.1 Maximum-risk inadmissibility

Let us now consider how the maximum risk of the truncated KRR compares with the full version. For every $\lambda>0$, define

$$r(\lambda):=\min\{r\in[n]:\ \mu_{r+1}\leq H_{n}(\lambda)\}.$$

In addition, recalling that $\widetilde{f}_{n,\lambda}$ is the full KRR estimator, let

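$$\lambda_{n}:=\operatorname*{argmin}_{\lambda>0}\ \sup_{f^{*}\in\mathbb{B}_{\mathcal{H}}}\operatorname{MSE}(\widetilde{f}_{n,\lambda},f^{*}),\quad\text{and}\quad r_{n}:=r(\lambda_{n}). \tag{14}$$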

That is, $\lambda_{n}$ is the regularization parameter that achieves the minimal maximum-risk for the full KRR. We have the following corollary of Theorem 1:

Corollary 1.

For every $\lambda>0$ , and every $r\in[n]$ with $r\geq r(\lambda)$ ,

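$$\sup_{f^{*}\in\mathbb{B}_{\mathcal{H}}}\operatorname{MSE}(\widetilde{f}_{r,\lambda},f^{*})\;\leq\;\sup_{f^{*}\in\mathbb{B}_{\mathcal{H}}}\operatorname{MSE}(\widetilde{f}_{n,\lambda},f^{*}).$$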

In particular, for every $r\geq r_{n}$ ,

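$$\min_{\lambda>0}\ \sup_{f^{*}\in\mathbb{B}_{\mathcal{H}}}\operatorname{MSE}(\widetilde{f}_{r,\lambda},f^{*})\;\leq\;\min_{\lambda>0}\ \sup_{f^{*}\in\mathbb{B}_{\mathcal{H}}}\operatorname{MSE}(\widetilde{f}_{n,\lambda},f^{*}).$$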

Both inequalities are strict whenever $\mu_{r+1}>0$ .

Corollary 1 shows that $\lambda$ -optimized $\widetilde{f}_{r_{n},\lambda}$ strictly improves on optimized full KRR whenever $\mu_{r_{n}+1}>0$ , in a sense rendering the full KRR inadmissible, as far as the maximum risk over $\operatorname{\mathbb{B}}_{\mathcal{H}}$ is concerned. Note that we are not claiming inadmissibility in the classical sense which requires one estimator to improve on another for all $f^{*}\in\operatorname{\mathbb{B}}_{\mathcal{H}}$ . In general, the slower the decay of $\{\mu_{i}\}$ , the more significant the improvement gained by truncation. Note that (14) allows one to set the precise truncation level including the exact constants if one has access to the eigenvalues of the kernel matrix. In practice, for large $n$ , the eigenvalues of the associated kernel integral operator (if available) can act as excellent surrogates for $\{\mu_{i}\}$ [23].
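To illustrate how the truncation level could be set from the eigenvalues, the following sketch approximates $\lambda_{n}$ on a grid, computes $r_{n}=r(\lambda_{n})$ with $r(\lambda)=\min\{r\in[n]:\mu_{r+1}\leq H_{n}(\lambda)\}$, and compares the two regularization curves. The spectrum $\mu_{i}=i^{-2}$, the noise level, the grid, and all function names are assumptions chosen only for the demonstration.

```python
import numpy as np

def h(lam, x):
    return lam ** 2 * x / (x + lam) ** 2

def max_risk(mu, r, lam, sigma2):
    """Maximum MSE of r-truncated KRR from Theorem 1 (mu sorted decreasingly)."""
    n = mu.size
    top = mu[:r]
    mu_next = mu[r] if r < n else 0.0            # mu_{n+1} := 0
    wae = max(h(lam, top).max(), mu_next)
    return wae + sigma2 / n * np.sum((top / (top + lam)) ** 2)

def truncation_threshold(mu, lam):
    """r(lam) = min{ r in [n] : mu_{r+1} <= H_n(lam) }."""
    H_n = h(lam, mu).max()
    mu_shift = np.append(mu[1:], 0.0)            # mu_shift[r-1] = mu_{r+1}
    return int(np.argmax(mu_shift <= H_n)) + 1   # first index where the condition holds

# Hypothetical spectrum with polynomial decay.
n, sigma2 = 200, 1.0
mu = np.arange(1, n + 1, dtype=float) ** -2.0
lams = np.logspace(-6, 0, 400)
full = np.array([max_risk(mu, n, lam, sigma2) for lam in lams])
lam_n = lams[full.argmin()]                      # grid approximation of lambda_n
r_n = truncation_threshold(mu, lam_n)
trunc = np.array([max_risk(mu, r_n, lam, sigma2) for lam in lams])
print(r_n, full.min(), trunc.min())
```

Because $\lambda_{n}$ is only located on a grid here, the resulting `r_n` is an approximation of the threshold in Corollary 1; a finer grid or a scalar minimizer could be substituted.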

3.2 Do we need both regularizations?

Although the spectral truncation is used as a computational device, intuitively, it also has an implicit regularization effect. This is confirmed more rigorously by Corollary 1 where truncation is shown to lead to a smaller optimal worst-case MSE. The intuition is also supported by the link between the (full) KRR and Tikhonov regularization. In both cases, one forms $(K+\lambda I_{n})^{-1}$ which can be considered as a form of “spectral filtering”. Eigenvalue truncation followed by taking the pseudo-inverse can be considered as another form of such filtering. A common conception is that these two approaches are performing essentially the same task, hence one of them is enough to achieve the desired regularization effect. More specifically, one can ask the following: Is Hilbert norm regularization, or $\lambda$ -regularization, really needed in the presence of spectral truncation? Theorem 1 allows us to settle this question. For a given truncation level $r$ , let

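$$\lambda_{r}:=\operatorname*{argmin}_{\lambda>0}\ \sup_{f^{*}\in\mathbb{B}_{\mathcal{H}}}\operatorname{MSE}(\widetilde{f}_{r,\lambda},f^{*})$$

be the optimal regularization parameter for the $r$-truncated $\lambda$-regularized KRR estimator.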

Corollary 2.

For every $r<n$ , we have

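$$\lambda_{r}\geq\max\Big\{\frac{\mu_{r}}{\sqrt{\mu_{r}/\mu_{r+1}}-1},\ \frac{\sigma^{2}}{n}\big(1+B_{r}\big)\Big\},$$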

where $B_{r}:=\min_{j}\sum_{i:\,i>j}(\mu_{j}/\mu_{i})+\sum_{i:\,i<j}(\mu_{i}^{2}/\mu_{j}^{2})$ with $i$ and $j$ running in $\{1,\dots,r\}$ .

Corollary 2 shows that for any truncation level $r$ , the optimal choice of $\lambda$ is always positive, hence $\lambda$ -regularization further improves the performance. The effect is more pronounced when $\mu_{r}$ is close to $\mu_{r+1}$ or, in general, when the spectrum decays slowly (hence $\mu_{i}\approx\mu_{j}$ for most $i,j\in[r]$ ). The effect is also more significant for higher effective noise levels $\sigma^{2}/n$ .

3.3 Gaussian complexity and rates

Less precise bounds, albeit good enough to capture the correct asymptotic rate as $n\to\infty$ , can be obtained in terms of the Gaussian complexity of the unit ball of the RKHS. These types of results have been obtained for the Sketched-KRR. To make a comparison easier, let us show how such bounds can be obtained from Theorem 1.

Let us define the $r$ -truncated complexity (of the empirical Hilbert ball) as

$$R_{r}(\delta)=\Big(\frac{\sigma^{2}}{n}\sum_{i=1}^{r}\min\{\mu_{i},\delta^{2}\}\Big)^{1/2}. \tag{18}$$

For the case $r=n$ , this matches the definition of the kernel complexity in [12], which we refer to for the related background. In particular, (18) is a tight upper bound on the Gaussian complexity of the intersection of $\operatorname{\mathbb{B}}_{\mathcal{H}}$ and $\{f:\;\|f\|_{n}\leq\delta\}$ [25, Chapter 13]. We have:

Corollary 3 (Looser bound).

Under the setup of Theorem 1, for $\lambda\geq\max\{\delta^{2},4\mu_{r+1}\}$ ,

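$$\sup_{f^{*}\,\in\,\mathbb{B}_{\mathcal{H}}}\operatorname{MSE}(\widetilde{f}_{r,\lambda},f^{*})\;\leq\;\frac{1}{4}\lambda+\Big(\frac{R_{r}(\delta)}{\delta}\Big)^{2}. \tag{19}$$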

If $\lambda\geq\mu_{1}$ , one can replace the first term with $\mu_{1}\lambda^{2}/(\lambda+\mu_{1})^{2}$ for a better bound.

Choosing $\lambda=\delta^{2}\geq 4\mu_{r+1}$ , we obtain

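$$\text{RMSE}=\sqrt{\operatorname{MSE}}\;\leq\;\frac{\delta}{2}+\frac{R_{r}(\delta)}{\delta}\;\leq\;\frac{\delta}{2}+\frac{R_{n}(\delta)}{\delta}.$$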

The latter upper bound is what one would get for the full KRR. Matching the two terms in that bound, we choose $\delta_{n}$ such that $\delta_{n}^{2}=2R_{n}(\delta_{n})$, which gives the well-known critical radius for the KRR problem [25]. It is known that $\delta_{n}$ gives the optimal rate of convergence for estimating functions in $\operatorname{\mathbb{B}}_{\mathcal{H}}$, i.e., its rate of decay matches that of the minimax risk [12]. The above argument shows that as long as $r$ is taken large enough so that $4\mu_{r+1}\leq\delta_{n}^{2}$, the $r$-truncated KRR achieves (at least) the same rate as the full KRR. For sketching, the same conclusion is established in [12], where the smallest $r$ satisfying $\mu_{r}\leq\delta_{n}^{2}$ is referred to as the statistical dimension of the kernel.
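The critical radius defined by $\delta_{n}^{2}=2R_{n}(\delta_{n})$, with $R_{n}(\delta)=\big(\frac{\sigma^{2}}{n}\sum_{i}\min\{\mu_{i},\delta^{2}\}\big)^{1/2}$, has no closed form in general but is easy to find numerically. The sketch below uses bisection, relying on the fact that $\delta/2-R_{n}(\delta)/\delta$ is increasing in $\delta$. The spectrum $\mu_{i}=i^{-2}$, the noise level, and the function names are illustrative assumptions.

```python
import numpy as np

def kernel_complexity(mu, delta, sigma2):
    """R_n(delta) = sqrt((sigma^2 / n) * sum_i min(mu_i, delta^2))."""
    return np.sqrt(sigma2 / mu.size * np.sum(np.minimum(mu, delta ** 2)))

def critical_radius(mu, sigma2, tol=1e-12):
    """Solve delta^2 = 2 R_n(delta) by bisection.

    g(delta) = delta/2 - R_n(delta)/delta is increasing, so the root is unique.
    """
    def g(d):
        return d / 2 - kernel_complexity(mu, d, sigma2) / d

    lo, hi = 1e-12, 1.0
    while g(hi) < 0:               # expand until the bracket contains the root
        hi *= 2
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical spectrum with polynomial decay.
n, sigma2 = 200, 1.0
mu = np.arange(1, n + 1, dtype=float) ** -2.0
delta_n = critical_radius(mu, sigma2)
# Smallest r with 4 * mu_{r+1} <= delta_n^2 (mu_{n+1} := 0).
mu_shift = np.append(mu[1:], 0.0)
r_rate = int(np.argmax(4 * mu_shift <= delta_n ** 2)) + 1
print(delta_n, r_rate)
```

The value `r_rate` is the smallest truncation level satisfying the sufficient condition $4\mu_{r+1}\leq\delta_{n}^{2}$ from the rate argument; it need not coincide with the exact threshold $r_{n}$.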

For Sobolev-$\alpha$ kernels, with eigendecay $\mu_{i}\asymp i^{-2\alpha}$, we obtain $\operatorname{MSE}\lesssim\delta_{n}^{2}\asymp(\sigma^{2}/n)^{\frac{2\alpha}{2\alpha+1}}$. Interestingly, in this case, the estimate based on the weaker bound (19) and the exact bound (11) give the same rate (cf. Appendix C). This is expected since the given rate is known to be minimax optimal for Sobolev spaces. The same goes for the Gaussian kernel, for which $\mu_{j}\asymp e^{-cj\log j}$ and the rate is $\gamma\log(1/\gamma)$ for $\gamma=\sigma^{2}/n$.

Order-wise, $\delta_{n}^{2}$ will be the same as $\lambda_{n}$ defined in (14), that is $\lambda_{n}\asymp\delta_{n}^{2}$ , whenever $\delta_{n}^{2}$ matches the optimal rate. Hence, often $\mu_{1}>\lambda_{n}\gg\mu_{n}$ for large $n$ and the argument leading to (12) suggests that in this case $H_{n}(\lambda_{n})\approx\lambda_{n}/4$ . Then, $r_{n}\approx\min\big{\{}r\in[n]:\;\mu_{r+1}\leq\frac{\lambda_{n}}{4}\big{\}}.$

For Sobolev-$\alpha$ kernels, this suggests a truncation level $r_{n}\gtrsim(n/\sigma^{2})^{\frac{1}{2\alpha+1}}$, which gives moderate savings for high smoothness levels $\alpha$. Similarly, for the Gaussian kernel, it is not hard to see that truncating to $r_{n}\gtrsim\log(n/\sigma^{2})$ is enough to get the same rate as the full KRR, which is a substantial saving.

4 Simulations

We now present some numerical experiments to corroborate the theory. We consider a Gaussian kernel $\mathbb{K}(s,t)=e^{-(s-t)^{2}/2b^{2}}$ of bandwidth $b=0.1$ on $[-1,1]$, as well as the Sobolev-1 kernel $\mathbb{K}(s,t)=\min(s,t)$ on $[0,1]$. We take the covariates $\{x_{i}\}$ to be $n=200$ equi-spaced points in each interval. The top row of Fig. 1 shows the plot of the theoretical maximum MSE as given by Theorem 1 for the two kernels, for both the full KRR ($r=n$) and the optimally truncated version ($r=r_{n}$). We have used $\sigma=2$ in (11). As predicted by Theorem 1, the minimum achievable maximum MSE is smaller for the truncated KRR.

To compute the optimal truncation, we have evaluated the regularization curve of the full KRR first, obtained the minimizer $\lambda_{n}$ and then used (14) to compute the optimal truncation level $r_{n}$ . For the setup of the simulation, we get $r_{n}=10$ for the Gaussian and $r_{n}=3$ for the Sobolev-1. It is interesting to note that although in terms of rates, $r_{n}$ for the Gaussian should be asymptotically much smaller than that of Sobolev-1, in finite samples, the truncation level for the Gaussian could be bigger as can be seen here. This is due to the unspecified, potentially large, constants in the rates (that depend on the bandwidth $b$ as well). Also, notice how surprisingly small $r_{n}$ is relative to $n$ in both cases.

The bottom row of Fig. 1 shows the empirical MSE obtained for a typical random $f^{*}\in\operatorname{\mathbb{B}}_{\mathcal{H}}$ , by computing the KRR estimates for observation $y$ and comparing with $f^{*}$ . The random true function is generated as $f^{*}=f_{\omega^{*}}$ where $\omega^{*}\sim N(0,I_{n})$ and further normalized so that $(\omega^{*})^{T}K\omega^{*}=1$ . We have generated $n=200$ observations from (1) with $\sigma=2$ . The plots were obtained using 1000 replications. The truncation levels are those calculated based on the maximum MSE formula (11). The plots show that for a typical application, the truncated KRR also dominates the full KRR.
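The random-function experiment can be sketched as follows for the Gaussian kernel. Because the estimator is linear in $y$, the MSE for a fixed $f^{*}$ splits into squared bias plus variance, so this sketch evaluates it in closed form instead of averaging over replications; that substitution, the random seed, the chosen $(r,\lambda)$ pairs, and the helper names are assumptions of the sketch and not the paper's procedure.

```python
import numpy as np

n, sigma, b = 200, 2.0, 0.1
x = np.linspace(-1.0, 1.0, n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * b ** 2)) / n  # normalized kernel matrix
mu, U = np.linalg.eigh(K)
mu, U = np.clip(mu[::-1], 0.0, None), U[:, ::-1]  # decreasing order, clip round-off negatives

# Random f* = f_{omega*}, normalized so that omega*^T K omega* = 1.
rng = np.random.default_rng(0)
c = U.T @ rng.standard_normal(n)      # coordinates of omega* in the eigenbasis
c /= np.sqrt(np.sum(mu * c ** 2))
v = mu * c                            # coordinates of u* = K omega*

def shrinkage(r, lam):
    s = np.zeros(n)
    s[:r] = mu[:r] / (mu[:r] + lam)   # spectral filter of the r-truncated estimator
    return s

def exact_mse(r, lam):
    """E||f_tilde - f*||_n^2 for this f*: squared bias plus variance."""
    s = shrinkage(r, lam)
    return np.sum(((1.0 - s) * v) ** 2) + sigma ** 2 / n * np.sum(s ** 2)

def max_risk(r, lam):
    """Worst-case MSE over the unit ball (Theorem 1)."""
    h = lam ** 2 * mu[:r] / (mu[:r] + lam) ** 2
    mu_next = mu[r] if r < n else 0.0
    return max(h.max(), mu_next) + sigma ** 2 / n * np.sum(shrinkage(r, lam) ** 2)

for r, lam in [(10, 0.05), (n, 0.05)]:
    print(r, exact_mse(r, lam), max_risk(r, lam))
```

For any $f^{*}$ in the unit ball, the exact MSE printed in the second column should not exceed the worst-case value in the third.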

5 Proof of the main result

Here we give the proof of Theorem 1 and Corollaries 1 and 2. The remaining proofs can be found in Appendix B.

From the discussion in Section 2.1, both the KRR estimate and the true function belong to $\mathcal{L}_{X}$ given in (7). It is then useful to have an expression for the empirical $\ell_{2}$ error of functions belonging to this space. First, we observe that $\|f_{\omega}\|_{n}^{2}=\frac{1}{n}\sum_{i=1}^{n}[f_{\omega}(x_{i})]^{2}=\|K\omega\|^{2}$. Now, take any $\omega,\omega^{*}\in\mathbb{R}^{n}$, and let $u=K\omega$ and $u^{*}=K\omega^{*}$. Then, we have

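$$\|f_{\omega}-f_{\omega^{*}}\|_{n}^{2}=\|f_{\omega-\omega^{*}}\|_{n}^{2}=\|K(\omega-\omega^{*})\|^{2}=\|u-u^{*}\|^{2},\tag{20}$$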

where the first equality is by the linearity of $\omega\mapsto f_{\omega}$. For any function $f_{\omega}\in\mathcal{L}_{X}$, we call $u=K\omega$ the $u$-space representation of $f_{\omega}$. Identity (20) shows that it is often easier to work in the $u$-space, since the $u$-transform turns empirical $\ell_{2}$ norms on functions into the usual $\ell_{2}$ norms on vectors. In other words, the map $f_{\omega}\mapsto u$ is a Hilbert space isometry from $(\mathcal{L}_{X},\|\cdot\|_{n})$ to $(\mathbb{R}^{n},\|\cdot\|)$. In the $u$-space, the KRR optimization problem can be equivalently stated as:

[TABLE]

where $K^{+}$ is the pseudo inverse of $K$ , and $\operatorname{ran}(K)$ its range. More precisely:

Lemma 1.

For any $K\in\mathbb{R}^{n\times n}$ , problems (4) and (21) are equivalent in the following sense:

1. For any minimizer $\bar{\omega}$ of (4), $K\bar{\omega}$ is a minimizer of (21), and

2. for any minimizer $\bar{u}$ of (21), any $\bar{\omega}\in\{\omega:\;K\omega=\bar{u}\}$ is a minimizer of (4).

It is often the case that the kernel matrix itself is invertible, in which case $K^{+}=K^{-1}$ , $\operatorname{ran}(K)=\mathbb{R}^{n}$ and problem (21) simplifies. However, the equivalence in Lemma 1 holds even if we replace $K$ with an approximation which is rank deficient. This observation will be useful in the sequel.
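The key identity behind Lemma 1, $F(\omega)=G(K\omega)$, can be checked numerically on a rank-deficient matrix. The sketch below assumes the objectives take the forms $F(\omega)=\|\widetilde{y}-K\omega\|^{2}+\lambda\omega^{T}K\omega$ and $G(u)=\|\widetilde{y}-u\|^{2}+\lambda u^{T}K^{+}u$, which is how (4) and (21) appear to be used in this section; the random low-rank $K$ and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, rank, lam = 8, 3, 0.5
A = rng.standard_normal((n, rank))
K = A @ A.T                      # symmetric PSD with rank 3 < n
# Explicit cutoff so round-off singular values are treated as zero.
K_pinv = np.linalg.pinv(K, rcond=1e-10, hermitian=True)
y_tilde = rng.standard_normal(n)

def F(omega):
    """Assumed form of the omega-space objective (4)."""
    return np.sum((y_tilde - K @ omega) ** 2) + lam * omega @ K @ omega

def G(u):
    """Assumed form of the u-space objective (21), for u in ran(K)."""
    return np.sum((y_tilde - u) ** 2) + lam * u @ K_pinv @ u

omega = rng.standard_normal(n)
gap = abs(F(omega) - G(K @ omega))   # relies on K K^+ K = K
print(f"|F(omega) - G(K omega)| = {gap:.2e}")
```

The penalty terms agree because $(K\omega)^{T}K^{+}(K\omega)=\omega^{T}KK^{+}K\omega=\omega^{T}K\omega$, which does not require $K$ to be invertible.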

Proof of Theorem 1.

Take $\widetilde{\omega}$ to be as in Definition 1 and let $\widetilde{y}=y/\sqrt{n}$ . Since $\widetilde{\omega}$ is the minimizer of $F(\omega;y)=\|\widetilde{y}-\widetilde{K}\omega\|^{2}+\lambda\omega^{T}\widetilde{K}\omega,$ we have $\nabla F(\widetilde{\omega};y)=0$ or $\widetilde{K}(\widetilde{K}\widetilde{\omega}-\widetilde{y})+\lambda\widetilde{K}\widetilde{\omega}=0$ . Hence, $(\widetilde{K}+\lambda I)\widetilde{K}\widetilde{\omega}=\widetilde{K}\widetilde{y}$ or

[TABLE]

Let $w=(w_{i})\in\mathbb{R}^{n}$ be the noise vector in (1) and $\widetilde{w}=w/\sqrt{n}$ . We also let

[TABLE]

Then, we can write model (1) as $\widetilde{y}=u^{*}+\widetilde{w}$ , where $\widetilde{w}$ is zero mean with $\operatorname{cov}(\widetilde{w})=\sigma^{2}I_{n}/n$ . From (20), we have $\|\widetilde{f}-f^{*}\|_{n}^{2}=\|K(\widetilde{\omega}-\omega^{*})\|^{2}$ , and

[TABLE]

where the first equality uses assumption (10). It follows that

[TABLE]

where the first term is the approximation error (AE) and the second term, the estimation error (EE). Let us write $\widetilde{D}=\operatorname{diag}(\mu_{1},\dots,\mu_{r},0,\dots,0)\in\mathbb{R}^{n\times n}$ so that $\widetilde{K}=U\widetilde{D}U^{T}$ . We define

[TABLE]

and note that $\Gamma_{\lambda}$ is diagonal. Let $v^{*}=U^{T}u^{*}$ and $\widehat{w}=U^{T}\widetilde{w}$ . Then, since $\ell_{2}$ norm is unitarily invariant, we have

[TABLE]

Controlling the estimation error: We have

[TABLE]

using $\operatorname{cov}(\widehat{w})=U^{T}\operatorname{cov}(\widetilde{w})U=(\sigma^{2}/n)U^{T}U=\sigma^{2}I_{n}/n$ since $U$ is an orthogonal matrix. Then,

[TABLE]

establishing the EE part of the result.
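As a sanity check, the estimation error formula can be compared with a direct trace computation. This is a minimal sketch on a random positive semidefinite matrix, assuming $\Psi_{\lambda}=(\widetilde{K}+\lambda I)^{-1}\widetilde{K}$ so that the estimation error equals $\mathbb{E}\|\Psi_{\lambda}\widetilde{w}\|^{2}=\frac{\sigma^{2}}{n}\|\Psi_{\lambda}\|_{F}^{2}$; the test matrix and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, lam, sigma = 30, 6, 0.05, 2.0
A = rng.standard_normal((n, n))
K = A @ A.T / n ** 2                         # random symmetric PSD matrix
mu, U = np.linalg.eigh(K)
mu, U = mu[::-1], U[:, ::-1]                 # eigenvalues in descending order
K_trunc = (U[:, :r] * mu[:r]) @ U[:, :r].T   # rank-r spectral truncation

# Psi = (K_trunc + lam I)^{-1} K_trunc maps y_tilde to the fitted values.
Psi = np.linalg.solve(K_trunc + lam * np.eye(n), K_trunc)

# E||Psi w_tilde||^2 = (sigma^2 / n) * ||Psi||_F^2 when cov(w_tilde) = sigma^2 I / n.
ee_trace = sigma ** 2 / n * np.sum(Psi ** 2)
ee_formula = sigma ** 2 / n * np.sum((mu[:r] / (mu[:r] + lam)) ** 2)
print(ee_trace, ee_formula)
```

The two quantities agree up to floating-point error because $\Psi_{\lambda}=U\Gamma_{\lambda}U^{T}$ and the Frobenius norm is unitarily invariant.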

Controlling the approximation error: Recall that we are interested in the worst-case approximation error (WAE) over the unit ball of the Hilbert space, i.e., over $f^{*}\in\operatorname{\mathbb{B}}_{\mathcal{H}}$ . Also, recall that without loss of generality, we can take $f^{*}=f_{\omega^{*}}$ . Hence,

[TABLE]

where the second equality is from (6), and the latter two are by definitions of $u^{*}$ and $v^{*}=U^{T}u^{*}$ . We obtain

[TABLE]

A further change of variable $v^{*}=D^{1/2}v$ gives

[TABLE]

where $\|\cdot\|$, applied to matrices, is the $\ell_{2}\to\ell_{2}$ operator norm. Note that $\Gamma_{\lambda}$ is a diagonal matrix with diagonal elements $\mu_{i}/(\mu_{i}+\lambda)$ for $i=1,\dots,r$, followed by $n-r$ zeros. It follows that $(I-\Gamma_{\lambda})D^{1/2}$ is diagonal with diagonal elements:

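$$\big[(I-\Gamma_{\lambda})D^{1/2}\big]_{ii}=\begin{cases}\dfrac{\lambda\sqrt{\mu_{i}}}{\mu_{i}+\lambda}, & i=1,\dots,r,\\[4pt] \sqrt{\mu_{i}}, & i=r+1,\dots,n.\end{cases}$$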

Since $\{\mu_{i}\}$ is a non-increasing sequence, we obtain

[TABLE]

which is the desired result. ∎
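The last step of the proof can be illustrated on a toy spectrum. The sketch below uses made-up eigenvalues (not from the paper) and checks that the squared operator norm of the diagonal matrix $(I-\Gamma_{\lambda})D^{1/2}$ equals the larger of $\max_{i\leq r}\lambda^{2}\mu_{i}/(\mu_{i}+\lambda)^{2}$ and $\mu_{r+1}$.

```python
import numpy as np

mu = np.array([1.0, 0.5, 0.2, 0.05, 0.01])   # toy non-increasing spectrum
r, lam = 3, 0.1

gamma = np.zeros_like(mu)                    # diagonal of Gamma_lambda
gamma[:r] = mu[:r] / (mu[:r] + lam)
M = np.diag((1.0 - gamma) * np.sqrt(mu))     # (I - Gamma_lambda) D^{1/2}

op_norm_sq = np.linalg.norm(M, 2) ** 2       # squared l2 -> l2 operator norm
h = lam ** 2 * mu[:r] / (mu[:r] + lam) ** 2  # retained directions
closed_form = max(h.max(), mu[r])            # mu[r] is mu_{r+1} (1-based)
print(op_norm_sq, closed_form)
```

In this toy example the maximum is attained by the first discarded eigenvalue $\mu_{r+1}$, which corresponds to the constant branch of the worst-case approximation error for small $\lambda$.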

Proof of Corollary 1.

Let $\text{EE}_{r,\lambda}:=\frac{\sigma^{2}}{n}\sum_{i=1}^{r}[\mu_{i}/(\mu_{i}+\lambda)]^{2}$ be the estimation error of $\widetilde{f}_{r,\lambda}$ as in (11). Note that as long as $\mu_{r+1}>0$ , we have $\text{EE}_{r,\lambda}<\text{EE}_{r+1,\lambda}\leq\text{EE}_{n,\lambda}$ . It remains to show that the WAE of the truncated KRR is less than that of full KRR. We have for $r\geq r(\lambda)$ ,

[TABLE]

This proves (15). For the second assertion, it is enough to apply (15) with $\lambda=\lambda_{n}$ , noting that in this case, the RHS will be the minimax risk of the full KRR and the LHS is further lower bounded by the minimax risk of the truncated KRR. ∎

Proof of Corollary 2.

Let us write $\text{WAE}_{r}(\lambda)$ and $E_{r}(\lambda)$ for the worst-case approximation and estimation errors, respectively, as a function of $\lambda$ . Let $M_{r}(\lambda)$ be the worst-case MSE, so that $M_{r}(\lambda)=\text{WAE}_{r}(\lambda)+E_{r}(\lambda)$ . The $\text{WAE}_{r}(\cdot)$ starts off with the constant branch $\text{WAE}_{r}(\lambda)=\mu_{r+1}$ for small values of $\lambda$ . Let $h_{i}(\lambda):=h(\lambda;\mu_{i})$ . The constant branch starts at $\lambda=0$ and extends to $\lambda=\lambda^{(1)}$ where $h_{r}(\lambda^{(1)})=\mu_{r+1}$ . Some algebra gives $\lambda^{(1)}=\mu_{r}/(\sqrt{\mu_{r}/\mu_{r+1}}-1)$ . For $\lambda\in[0,\lambda^{(1)}]$ , we have $M_{r}^{\prime}(\lambda)=E_{r}^{\prime}(\lambda)<0$ showing that the minimizer of $M_{r}$ is $\geq\lambda^{(1)}$ .
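For completeness, the algebra leading to $\lambda^{(1)}$ can be spelled out. This assumes $h(\lambda;\mu)=\lambda^{2}\mu/(\mu+\lambda)^{2}$, the form implied by the diagonal elements of $(I-\Gamma_{\lambda})D^{1/2}$ in the proof of Theorem 1 (the definition of $h$ is not restated in this section), together with $\mu_{r}>\mu_{r+1}>0$ and $\lambda>0$:

$$
\frac{\lambda^{2}\mu_{r}}{(\mu_{r}+\lambda)^{2}}=\mu_{r+1}
\;\iff\;
\frac{\mu_{r}+\lambda}{\lambda}=\sqrt{\frac{\mu_{r}}{\mu_{r+1}}}
\;\iff\;
\frac{\mu_{r}}{\lambda}=\sqrt{\frac{\mu_{r}}{\mu_{r+1}}}-1
\;\iff\;
\lambda=\frac{\mu_{r}}{\sqrt{\mu_{r}/\mu_{r+1}}-1}.
$$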

The next branch of WAE starts at $\lambda^{(1)}$ and ends at $\lambda^{(2)}$ which solves $h_{r}(\lambda^{(2)})=h_{r-1}(\lambda^{(2)})$ . The knots $\lambda^{(i)}$ determining subsequent branches are determined similarly: $h_{r-i+2}(\lambda^{(i)})=h_{r-i+1}(\lambda^{(i)})$ for $i=2,3,\dots,r$ and $\lambda^{(r+1)}=\infty$ . We have $\text{WAE}_{r}(\lambda)=h_{r-i+1}(\lambda)$ for $\lambda\in I_{i}:=[\lambda^{(i)},\lambda^{(i+1)})$ for $i=1,\dots,r$ . See Fig. 2.

Fix $i_{*}\in[r]$ and let $j_{*}=r-i_{*}+1$. Then for $\lambda\in\text{int}(I_{i_{*}})$

[TABLE]

where $i$ ranges over $[r]\setminus\{j_{*}\}$. Note that

[TABLE]

The first inequality holds since $\lambda\mapsto(\mu_{j_{*}}+\lambda)/(\mu_{i}+\lambda)$ is increasing on $[0,\infty)$ if $\mu_{j_{*}}<\mu_{i}$, hence lower-bounded by its value at $\lambda=0$, and is decreasing on $[0,\infty)$ if $\mu_{j_{*}}>\mu_{i}$, hence lower-bounded by its limit as $\lambda\to\infty$. Then,

[TABLE]

It follows that $M^{\prime}_{r}(\lambda)<0$ as long as $\lambda<\sigma^{2}(1+B_{r})/n$ no matter which interval $I_{i}$ contains $\lambda$ . This shows that the minimizer of $M_{r}$ has to be $\geq\sigma^{2}(1+B_{r})/n$ completing the proof. ∎

Acknowledgement

We thank Chad Hazlett and Linfan Zhang for helpful discussions and Zahra Razaee for comments on the manuscript.

Appendix A Time complexity comparison

The ST-KRR and approximate versions, such as Nyström and sketching, all have time complexity of $O(nr^{2}+r^{3})=O(nr^{2})$ for computing the $r$-truncated KRR estimate, once the pieces required for approximating the kernel matrix (e.g., $KS$ and $S^{T}KS$ in the case of sketching, $U_{r}$ and $D_{r}$ in the case of ST-KRR, and so on) are computed. Computing these pieces is where these methods differ. For sketching, this step could have complexity as large as $O(n^{2}r)$ for dense sketches (the cost of forming the product $KS$ with a dense $n\times r$ matrix $S$), $O(n^{2}\log r)$ for randomized Fourier and Hadamard sketches, to as low as $O(nr)$ for the Nyström.

For the ST-KRR, this step involves computing the top-$r$ eigenpairs of the symmetric matrix $K$. The Lanczos algorithm is the standard method for this task, although a complexity analysis is hard to find in the literature. The results of [26] suggest that it has average-case complexity $O(n^{2}(r+\log n))$. More precisely, [26] shows that on average $k=O(\log n/\sqrt{\varepsilon})$ Lanczos iterations are enough to compute the top eigenvalue to within relative error $\varepsilon$, hence an overall average-case complexity of $O(kN)$, where $N$ is the number of nonzero entries of $K$.
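In practice, the top-$r$ eigenpairs can be obtained from an off-the-shelf Lanczos-type solver. The sketch below is one possible way to do this and is not taken from the paper: it uses SciPy's `eigsh` (ARPACK's implicitly restarted Lanczos method) when SciPy is available and falls back to a full dense eigendecomposition otherwise. The helper names `top_r_eigpairs` and `st_krr_fit`, and the fitted-value formula $U_{r}\,\mathrm{diag}(\mu_{i}/(\mu_{i}+\lambda))\,U_{r}^{T}\widetilde{y}$, are illustrative and follow the $u$-space expressions of Section 5.

```python
import numpy as np

try:
    from scipy.sparse.linalg import eigsh  # ARPACK, implicitly restarted Lanczos
except ImportError:                        # fall back to dense eigh below
    eigsh = None

def top_r_eigpairs(K, r):
    """Top-r eigenpairs of a symmetric matrix K, eigenvalues descending."""
    n = K.shape[0]
    if eigsh is not None and r < n:        # eigsh requires k < n
        vals, vecs = eigsh(K, k=r, which="LA")
    else:
        vals, vecs = np.linalg.eigh(K)     # O(n^3) dense fallback
        vals, vecs = vals[-r:], vecs[:, -r:]
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

def st_krr_fit(vals, vecs, y_tilde, lam):
    """Fitted values of r-truncated KRR in u-space, in O(nr) time."""
    return vecs @ ((vals / (vals + lam)) * (vecs.T @ y_tilde))

x = np.linspace(0.0, 1.0, 200)
K = np.minimum(x[:, None], x[None, :]) / 200   # assumed 1/n normalization
vals, vecs = top_r_eigpairs(K, r=3)
print(vals)
```

Once `vals` and `vecs` are available, `st_krr_fit` only needs two thin matrix-vector products per right-hand side.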

Appendix B Remaining proofs

Proof of Proposition 1.

We will use the same notation as in the proof of Theorem 1. By the same argument as in that proof, we have $\|\bar{f}-f^{*}\|_{n}^{2}=\|(I-\Gamma_{\lambda})v^{*}\|^{2}$ where $\Gamma_{\lambda}$ is defined in (24) and $v^{*}=U^{T}u^{*}$ for $u^{*}$ given in (23). Let $\bar{\omega}$ be the solution of (9) for the input $(f^{*}(x_{i}))$ (instead of $y$ ) so that $\bar{f}=f_{\bar{\omega}}$ . Using the optimality condition in the proof of Theorem 1,

[TABLE]

where we have used (10) and (22), with $\widetilde{y}=u^{*}$ (i.e., $\widetilde{w}=0$ ). We can write

[TABLE]

using $\Psi_{\lambda}=U\Gamma_{\lambda}U^{T}$ , $K=UDU^{T}$ and $v^{*}=U^{T}u^{*}$ . Recall from (26) that $f^{*}\in\operatorname{\mathbb{B}}_{\mathcal{H}}$ is equivalent to $(v^{*})^{T}D^{-1}v^{*}\leq 1$ . It follows that

[TABLE]

where the third equality is using the change of variable $v^{*}=D^{1/2}v$ as in the proof of Theorem 1, and the last line follows since all the matrices are diagonal and hence commute. The result now follows by combining (25) and (27), after some algebra. ∎

Proof of Corollary 3.

For any $a,b>0$ , we have $\frac{1}{2}(a\wedge b)\;\leq\;(a^{-1}+b^{-1})^{-1}\;\leq\;a\wedge b$ , where $a\wedge b:=\min\{a,b\}$ . Hence, the estimation error in (11) is bounded as

[TABLE]

This upper bound is within a factor of $4$ of the estimation error. Using $\mu_{i}/(\lambda+\mu_{i})\leq 1$ to shave off the power by one, we obtain the weaker bound:

[TABLE]
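To see where the factor of $4$ comes from, one can apply the sandwich inequality with $a=\mu_{i}$ and $b=\lambda$. A short worked version, under the assumption that the first bound is expressed through $\min\{\mu_{i}/\lambda,1\}$:

$$
\frac{\mu_{i}}{\mu_{i}+\lambda}=\frac{1}{\lambda}\big(\mu_{i}^{-1}+\lambda^{-1}\big)^{-1}
\quad\Longrightarrow\quad
\frac{1}{2}\min\Big\{\frac{\mu_{i}}{\lambda},1\Big\}\;\leq\;\frac{\mu_{i}}{\mu_{i}+\lambda}\;\leq\;\min\Big\{\frac{\mu_{i}}{\lambda},1\Big\}.
$$

Squaring and summing over $i=1,\dots,r$ gives

$$
\frac{1}{4}\sum_{i=1}^{r}\min\Big\{\frac{\mu_{i}}{\lambda},1\Big\}^{2}\;\leq\;\sum_{i=1}^{r}\Big[\frac{\mu_{i}}{\mu_{i}+\lambda}\Big]^{2}\;\leq\;\sum_{i=1}^{r}\min\Big\{\frac{\mu_{i}}{\lambda},1\Big\}^{2}.
$$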

Recalling definition (18), we conclude that if $\lambda\geq\delta^{2}$ ,

[TABLE]

Combining with the WAE bound (12), we obtain the desired result. ∎

Proof of Lemma 1.

Let $F(\omega)$ and $G(u)$ be the objective functions in (4) and (21), respectively. We have $F(\omega)=G(K\omega)$ for any $\omega\in\mathbb{R}^{n}$ , which follows from the identity $KK^{+}K=K$ . Now, assume that $\bar{\omega}$ is a minimizer of $F$ , and let $\bar{u}:=K\bar{\omega}$ . Pick any $u\in\operatorname{ran}(K)$ ; there exists $\omega$ such that $u=K\omega$ , and we have $G(\bar{u})=F(\bar{\omega})\leq F(\omega)=G(u)$ . The other direction follows similarly. ∎

Appendix C Rate calculations

Here we compute the error rate predicted by the strong and weak bounds and show that they are the same. Let $\gamma:=\sigma^{2}/n$ . Assume the polynomial eigendecay of the Sobolev- $\alpha$ kernel, i.e., $\mu_{i}\asymp i^{-2\alpha}$ . Taking $k$ to be the smallest integer satisfying $k^{-2\alpha}\lesssim\delta^{2}$ , we have

[TABLE]

where the first inequality uses an integral approximation to the sum and the second uses the definition of $k$. Setting $\delta^{2}\asymp R_{n}(\delta)$ we have $\delta^{2}\asymp\gamma k\asymp\gamma(\delta^{2})^{-\frac{1}{2\alpha}}$, that is, $(\delta^{2})^{\frac{2\alpha+1}{2\alpha}}\asymp\gamma$, hence the critical radius $\delta_{n}^{2}\asymp\gamma^{\frac{2\alpha}{2\alpha+1}}$.

Now consider the strong bound. As discussed in the text, $\text{WAE}_{n,\lambda}\asymp\lambda$ . Also, as the proof of Corollary 3 shows, we have

[TABLE]

Letting $k$ be defined as the smallest integer such that $\mu_{k}\lesssim\lambda$ , we get $k^{-2\alpha}\lesssim\lambda$ as before. Then, the maximum MSE is bounded as

[TABLE]

Since $k^{-4\alpha+1}\lesssim k\lambda^{2}$ , by the definition of $k$ , we obtain $\operatorname{MSE}\lesssim\lambda+\gamma k\lesssim\lambda+\gamma\lambda^{-\frac{1}{2\alpha}}$ . Equating the two terms we obtain $\operatorname{MSE}\asymp\lambda_{n}\asymp\gamma^{\frac{2\alpha}{2\alpha+1}}$ as before.
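The balancing step can be written out explicitly. With $\gamma=\sigma^{2}/n$, equating the two terms of the bound $\lambda+\gamma\lambda^{-\frac{1}{2\alpha}}$ gives

$$
\lambda\asymp\gamma\lambda^{-\frac{1}{2\alpha}}
\;\iff\;
\lambda^{\frac{2\alpha+1}{2\alpha}}\asymp\gamma
\;\iff\;
\lambda\asymp\gamma^{\frac{2\alpha}{2\alpha+1}}=\Big(\frac{\sigma^{2}}{n}\Big)^{\frac{2\alpha}{2\alpha+1}}.
$$

For example, for the Sobolev-1 kernel ($\alpha=1$) and fixed $\sigma$, this is the familiar $n^{-2/3}$ rate.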

For the Gaussian kernel, with $\mu_{j}\asymp e^{-cj\log j}$ , it is not hard to verify that with $e^{-ck}\lesssim\lambda$ , we get $\operatorname{MSE}\lesssim\lambda+\gamma k\lesssim\lambda+\gamma\log(1/\lambda)$ . Minimizing the bound over $\lambda$ , we obtain $\lambda\asymp\gamma$ , hence $\operatorname{MSE}\lesssim\gamma\log(1/\gamma)$ .

Bibliography

[1] Larry Wasserman “All of nonparametric statistics” Springer Science & Business Media, 2006
[2] Alexandre B Tsybakov “Introduction to Nonparametric Estimation” Springer, New York, NY, 2009
[3] Vern I Paulsen and Mrinal Raghupathi “An introduction to the theory of reproducing kernel Hilbert spaces” Cambridge University Press, 2016
[4] Grace Wahba “Spline models for observational data” SIAM, 1990
[5] George Kimeldorf and Grace Wahba “Some results on Tchebycheffian spline functions” In Journal of Mathematical Analysis and Applications 33.1 Elsevier, 1971, pp. 82–95
[6] Christopher KI Williams and Matthias Seeger “Using the Nyström method to speed up kernel machines” In Advances in Neural Information Processing Systems, 2001, pp. 682–688
[7] Kai Zhang, Ivor W Tsang and James T Kwok “Improved Nyström low-rank approximation and error analysis” In Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 1232–1239 ACM
[8] Sanjiv Kumar, Mehryar Mohri and Ameet Talwalkar “Ensemble Nyström method” In Advances in Neural Information Processing Systems, 2009, pp. 1060–1068
