Spectrally-truncated kernel ridge regression and its free lunch
Arash A. Amini

TL;DR
This paper analyzes spectrally-truncated kernel ridge regression, revealing that truncation can outperform full KRR in minimax risk for infinite-dimensional RKHS, and explores the trade-offs between spectral truncation and regularization.
Contribution
It provides an exact risk expression for truncated KRR and demonstrates that spectral truncation can improve performance beyond full KRR in certain regimes.
Findings
Spectral truncation can outperform full KRR in minimax risk.
There exists a threshold on the number of eigenvalues retained for improved performance.
Implicit regularization from truncation complements Hilbert norm regularization.
Department of Statistics
University of California
Los Angeles
Abstract
Kernel ridge regression (KRR) is a well-known and popular nonparametric regression approach with many desirable properties, including minimax rate-optimality in estimating functions that belong to common reproducing kernel Hilbert spaces (RKHS). The approach, however, is computationally intensive for large data sets, due to the need to operate on a dense $n \times n$ kernel matrix, where $n$ is the sample size. Recently, various approximation schemes for solving KRR have been considered, and some analyzed. Some approaches such as Nyström approximation and sketching have been shown to preserve the rate optimality of KRR. In this paper, we consider the simplest approximation, namely, spectrally truncating the kernel matrix to its $r$ largest eigenvalues. We derive an exact expression for the maximum risk of this truncated KRR, over the unit ball of the RKHS. This result can be used to study the exact trade-off between the level of spectral truncation and the regularization parameter. We show that, as long as the RKHS is infinite-dimensional, there is a threshold on $r$ above which the spectrally-truncated KRR surprisingly outperforms the full KRR in terms of the minimax risk, where the minimum is taken over the regularization parameter. This strengthens the existing results on approximation schemes by showing that not only does one not lose in terms of the rates, but truncation can in fact improve the performance for all finite samples (above the threshold). Moreover, we show that the implicit regularization achieved by spectral truncation is not a substitute for Hilbert norm regularization: both are needed to achieve the best performance.
Keywords: kernel methods; ridge regression; spectral truncation; nonparametric regression; minimax estimation.
1 Introduction
The general nonparametric regression problem can be stated as
$$y_i = f^*(x_i) + w_i, \quad i = 1,\dots,n, \qquad \mathbb{E}[w] = 0, \quad \operatorname{cov}(w) = \sigma^2 I_n, \tag{1}$$
where $w = (w_i) \in \mathbb{R}^n$ is a noise vector and $f^* : \mathcal{X} \to \mathbb{R}$ is the function of interest to be approximated from the noisy observations $y = (y_i)$. Here, $\mathcal{X}$ is the space to which the covariates $x_1,\dots,x_n$ belong. We consider the fixed design regression where the covariates are assumed to be deterministic. The problem has a long history in statistics and machine learning [1, 2]. In this paper, we assume that $f^*$ belongs to a reproducing kernel Hilbert space (RKHS), denoted as $\mathcal{H}$ [3]. Such spaces are characterized by the existence of a reproducing kernel, that is, a positive semidefinite function $\mathbb{K} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ that uniquely determines the underlying function space $\mathcal{H}$. RKHSs are very versatile modeling tools and include, for example, Sobolev spaces of smooth functions whose norms are measures of function roughness [4]. Throughout, we think of these Sobolev spaces as the concrete examples of $\mathcal{H}$. By assuming an upper bound on the Hilbert norm of $f^*$, we can encode a prior belief that the true data generating function has a certain degree of smoothness. Without loss of generality, we assume that $f^*$ belongs to the unit ball of the RKHS, that is,
$$\|f^*\|_{\mathcal{H}} \le 1. \tag{2}$$
A natural estimator is then the kernel ridge regression (KRR), defined as the solution of the following optimization problem:
$$\min_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^n \big(y_i - f(x_i)\big)^2 + \lambda \|f\|_{\mathcal{H}}^2, \tag{3}$$
where $\lambda > 0$ is a regularization parameter. It is well-known that this problem can be reduced to a finite-dimensional problem, by an application of the so-called representer theorem [5]:
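$$\min_{\omega \in \mathbb{R}^n} \; \frac{1}{n}\big\|y - \sqrt{n}\,K\omega\big\|^2 + \lambda\,\omega^T K \omega, \tag{4}$$
where $K = \frac{1}{n}\big(\mathbb{K}(x_i, x_j)\big)_{i,j=1}^n \in \mathbb{R}^{n \times n}$, with $\mathbb{K}$ the reproducing kernel, is the (normalized empirical) kernel matrix. Although (4) has a closed-form solution, it involves inverting an $n \times n$ dense matrix, with time complexity $O(n^3)$, which is prohibitive for large $n$.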
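To make the cost concrete, here is a minimal sketch of full KRR. It assumes the standard closed form for the objective $\frac{1}{n}\sum_i (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2$: the coefficient vector solves $(G + n\lambda I)\alpha = y$, where $G$ is the unnormalized Gram matrix. The Gaussian kernel, the bandwidth, the synthetic data, and the helper names `gaussian_gram` and `krr_fit` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gaussian_gram(x, bandwidth):
    """Unnormalized Gram matrix G[i, j] = exp(-(x_i - x_j)^2 / (2 * bandwidth^2))."""
    diff = x[:, None] - x[None, :]
    return np.exp(-diff**2 / (2.0 * bandwidth**2))

def krr_fit(G, y, lam):
    """Full KRR for (1/n) * sum (y_i - f(x_i))^2 + lam * ||f||_H^2.

    Returns the coefficients alpha (f = sum_j alpha_j k(., x_j)) and the
    fitted values G @ alpha. The dense solve costs O(n^3) time.
    """
    n = G.shape[0]
    alpha = np.linalg.solve(G + n * lam * np.eye(n), y)
    return alpha, G @ alpha

# Synthetic example (all settings are hypothetical).
rng = np.random.default_rng(0)
n = 200
x = np.linspace(0.0, 1.0, n)
truth = np.sin(2 * np.pi * x)
y = truth + 0.3 * rng.standard_normal(n)
G = gaussian_gram(x, bandwidth=0.1)
alpha, fitted = krr_fit(G, y, lam=1e-3)
```

The single call to `np.linalg.solve` on an $n \times n$ dense system is the step that the approximation schemes discussed next try to avoid.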
Various approximation schemes have been proposed to mitigate the computational costs, including (i) approximating the kernel matrix or (ii) directly approximating the optimization problem (4). Examples of the former are the Nyström approximation, column sampling and their variants [6, 7, 8, 9, 10]. An example of the latter is sketching [11, 12] where one restricts to the subspace , for some random matrix . It is in fact known that Nyström can be considered a special case of sketching with random standard basis vectors [12]. Sketching, with sufficiently large , has been shown in [12] to achieve minimax optimal rates over Sobolev spaces, under mild conditions on the sketching matrix . Similarly, the Nyström approximation has been analyzed in [13, 14, 15, 11, 16] and [17], the latter showing minimax rate optimality. In addition to the above, (iii) divide and conquer approaches have been proposed [18], where one solves the problem over subsamples and then aggregates by averaging, with some rate optimality guarantees. Other notable approaches to scaling include (iv) approximating translation-invariant kernel functions via Monte Carlo averages of tensor products of randomized feature maps [19, 20] and (v) applying stochastic gradient in the function space [21]. Memory efficiency in kernel approximation is considered in [22].
In this paper, we consider the most direct kernel approximation, namely, replacing the kernel matrix $K$ by its best rank-$r$ approximation (in Frobenius norm). This amounts to truncating the eigenvalue decomposition of $K$ to its top $r$ eigenvalues. We refer to the resulting KRR approximation as the spectrally-truncated KRR (ST-KRR). Although somewhat slower than the Nyström approximation and fast forms of sketching, ST-KRR can be considered an ideal rank-$r$ spectral approximation. By analyzing it, one can also gain insights about approximate SVD truncation approaches such as Nyström or sketching. Practically, ST-KRR is a very viable solution for moderate-size problems. See Appendix A for a discussion of the time complexity of various schemes.
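The truncation step can be sketched as follows. The helper name `truncate_spectrum` and the random test matrix are hypothetical; the check at the end uses the Eckart-Young fact that the Frobenius error of the best rank-$r$ approximation equals the $\ell_2$ norm of the discarded eigenvalues.

```python
import numpy as np

def truncate_spectrum(K, r):
    """Best rank-r approximation of a symmetric PSD matrix K via its EVD.

    Returns (K_r, mu, U) with the eigenvalues mu sorted in decreasing order.
    """
    mu, U = np.linalg.eigh(K)          # eigh returns ascending eigenvalues
    order = np.argsort(mu)[::-1]
    mu, U = mu[order], U[:, order]
    U_r = U[:, :r]
    K_r = (U_r * mu[:r]) @ U_r.T       # U_r diag(mu_1, ..., mu_r) U_r^T
    return K_r, mu, U

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 50))
K = A @ A.T / 50                       # a generic PSD test matrix
r = 5
K_r, mu, U = truncate_spectrum(K, r)

# Frobenius error versus the l2 norm of the dropped eigenvalues.
frob_err = np.linalg.norm(K - K_r, "fro")
tail = np.sqrt(np.sum(mu[r:] ** 2))
```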
We derive an exact expression for the maximum (empirical) mean-squared error (MSE) of ST-KRR, uniformly over the unit ball of the RKHS. This expression is solely in terms of the eigenvalues of the kernel matrix $K$, the regularization parameter $\lambda$, the truncation level $r$, and the noise level $\sigma$. Thus, given these eigenvalues and the noise level (or estimates of them), one can plot the exact regularization curve (maximum MSE versus $\lambda$) for a given truncation level $r$ and sample size $n$, and determine the optimal value of $\lambda$. We also note that since the empirical eigenvalues quickly approach those of the integral operator associated with the kernel as $n \to \infty$ [23], one can use these idealized eigenvalues instead of the empirical ones to get an excellent approximation of these regularization curves.
We then show that there is an optimal threshold on , the truncation level, which we denote as , such that for all , the minimax risk of the -truncated KRR, with the minimum taken over the regularization parameter, is strictly smaller than that of the full KRR whenever . For infinite-dimensional RKHSs, we always have , hence truncating at level is guaranteed to strictly improve performance. The slower the decay of the eigenvalues, the larger this gap in performance.
This result shows that although spectral truncation is mainly used as a computational device, it also has a statistical regularization effect. The next question is whether the regularization provided by the spectral truncation renders Hilbert norm regularization (via $\lambda$) unnecessary. We answer this question in the negative by showing that for any truncation level $r$, the optimal maximum risk is achieved for a positive $\lambda$. Together, these results show that the “$r$-truncated $\lambda$-regularized KRR” defines a new class of estimators whose performance cannot be achieved (in finite samples) with either regularization alone.
We also show how the exact expression for the maximum MSE can be used to easily establish a slightly weaker bound for ST-KRR, similar to those derived in [12] for sketching. We discuss the link between the statistical dimension considered in [12] and the optimal truncation level, and show how the same rate-optimality guarantees hold for ST-KRR. Rate-optimality also follows from the fact that ST-KRR, with a proper choice of the truncation level, strictly dominates the full KRR, and the latter is rate-optimal. However, we carry out these calculations to make the comparison easier.
Finally, we illustrate the results with some numerical simulations showing some further surprises. For example, the Gaussian kernel has a much faster eigendecay rate than a Sobolev-1 kernel (exponential versus polynomial decay). Hence, the optimal truncation level asymptotically grows much slower for the Gaussian kernel. However, for finite samples, depending on the choice of the Gaussian bandwidth, the exact optimal truncation level, computed numerically, can be larger than that of Sobolev-1.
2 Preliminaries
Let us start with some observations regarding the original KRR problem in (3). For , consider the kernel mapping
[TABLE]
Note that is a linear map from . This map is the link between the solutions of the two optimization problems (3) and (4): For any optimal solution of (4), will be an optimal solution of (3). The link is easy to establish by observing the following two identities:
[TABLE]
the first of which uses the reproducing property of the kernel: . We will frequently use this property in the sequel. The proof of the equivalence follows from an argument similar to our discussion of the identifiability below.
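For concreteness, here is a short derivation of the two identities. It assumes a specific normalization that should be checked against (4) and (5): the kernel mapping is taken to be $S(\omega) = \frac{1}{\sqrt{n}} \sum_j \omega_j \mathbb{K}(\cdot, x_j)$ and the kernel matrix $K = \frac{1}{n}(\mathbb{K}(x_i, x_j))$, with $\mathbb{K}$ denoting the kernel function. Other normalizations change only the constants.

```latex
\begin{align*}
\|S(\omega)\|_{\mathcal H}^2
  &= \frac{1}{n} \sum_{i,j} \omega_i \omega_j
     \big\langle \mathbb{K}(\cdot, x_i), \mathbb{K}(\cdot, x_j) \big\rangle_{\mathcal H}
   = \frac{1}{n} \sum_{i,j} \omega_i \omega_j \, \mathbb{K}(x_i, x_j)
   = \omega^T K \omega, \\
S(\omega)(x_i)
  &= \frac{1}{\sqrt{n}} \sum_{j} \omega_j \, \mathbb{K}(x_i, x_j)
   = \sqrt{n}\, (K \omega)_i .
\end{align*}
```

The middle step of the first line is where the reproducing property enters; the second line is a direct evaluation.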
2.1 Identifiability
Let us first observe that in (1) is not (statistically) identifiable. That is, there are multiple functions (in fact, infinitely many if is infinite-dimensional) for which the vector has the exact same distribution. To see this, let
[TABLE]
and let be the projection of onto . (It is always possible to choose at least one such by the definition of projection and since is a closed subspace of .) Given observations , we can only hope to recover the following equivalence class:
[TABLE]
where the last line follows since by the property of orthogonal projection (and can be absorbed into ).
We will use as the representative of the (identifiable) equivalence class of . We are interested in measuring functional deviations (e.g., the error in our estimate relative to the true function) in the empirical norm:
[TABLE]
The use of this norm is common in the literature of nonparametric regression [24, 25]. It is interesting to note that ,
[TABLE]
and , since projections are contractive. Thus, recalling (2), also belongs to the Hilbert unit ball: . It is in fact easy to see that has the least Hilbert norm among the members in the equivalence class (i.e., the smoothest version). Thus, without loss of generality, we can identify with . Equivalently, we can assume from the start that is of the form for some . Note that the “no loss of generality” statement holds as long as we are working with the empirical norm, due to (8).
3 Main results
Let be the eigenvalue decomposition (EVD) of the empirical kernel matrix defined in (4). Here, is an orthogonal matrix and where are the eigenvalues of . We assume for simplicity that , that is, the exact kernel matrix is invertible. Consider the rank approximation of , obtained by keeping the top eigenvalues and truncating the rest to zero, that is,
[TABLE]
Here, and collects the first columns of . The idea is to solve (4) with replaced with , to obtain . We then form our functional estimate by using the (exact) kernel mapping (5).
Definition 1.
An -truncated -regularized KRR estimator with input , is a function where
[TABLE]
A minimizer in (9), without the additional condition , is not unique due to the rank deficiency of . Thus, we can ask for it to satisfy additional constraints. The equality condition in (10), which can be stated as can always be satisfied. It is enough to choose to be the unique minimizer in , that is, for some . This is how the estimator is often implemented in practice.
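A minimal sketch of this estimator follows, under conventions that are assumptions of the sketch rather than quotations from the paper: $K = G/n$ for the unnormalized Gram matrix $G$, the truncated objective is $\frac{1}{n}\|y - \sqrt{n} K_r \omega\|^2 + \lambda \omega^T K_r \omega$, and $\omega$ is restricted to the range of $K_r$. Writing $\omega = U_r \beta$ then gives $\beta = (D_r + \lambda I)^{-1} U_r^T y / \sqrt{n}$. The function name `st_krr_fitted` and the data are hypothetical.

```python
import numpy as np

def st_krr_fitted(G, y, r, lam):
    """r-truncated, lam-regularized KRR: returns (omega, fitted values at x_i).

    Assumed conventions: K = G / n with EVD U diag(mu) U^T (mu decreasing),
    omega = U_r beta lies in range(K_r), and the exact kernel map evaluated
    at the design points equals sqrt(n) * K @ omega.
    """
    n = G.shape[0]
    mu, U = np.linalg.eigh(G / n)
    mu, U = mu[::-1], U[:, ::-1]                   # decreasing order
    U_r, mu_r = U[:, :r], mu[:r]
    beta = (U_r.T @ y) / (np.sqrt(n) * (mu_r + lam))
    omega = U_r @ beta
    fitted = np.sqrt(n) * (U_r @ (mu_r * beta))    # = sqrt(n) * K @ omega
    return omega, fitted

rng = np.random.default_rng(2)
n = 100
x = np.linspace(0.0, 1.0, n)
G = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.2**2))
y = np.cos(3 * x) + 0.2 * rng.standard_normal(n)
lam = 1e-2

_, fit_full = st_krr_fitted(G, y, r=n, lam=lam)    # no truncation
_, fit_trunc = st_krr_fitted(G, y, r=8, lam=lam)   # keep 8 eigenpairs
reference = G @ np.linalg.solve(G + n * lam * np.eye(n), y)
```

With `r = n` the fitted values should coincide with the usual full-KRR solution `reference`, which gives a quick sanity check of the normalization.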
We are interested in the deviation of from the true function in the empirical norm. More precisely, we are interested in the mean-squared error as the statistical risk:
[TABLE]
Our main result is an expression for the worst-case risk of over the unit ball of the RKHS:
Theorem 1.
Let be an -truncated -regularized KRR estimator (Definition 1) applied to input generated from model (1). Let
[TABLE]
where . Then, for all and ,
[TABLE]
with .
The first term in (11) is the worst-case approximation error (WAE) and the second term the estimation error (EE). The approximation error (AE) is the risk (relative to ) of which is obtained by passing the noiseless observations , instead of , through the estimation procedure. The AE is the deterministic part of the risk and is given by . The estimation error is the stochastic part of the risk and is given by .
The function attains its maximum of , over , at . Thus, as long as , the bound is good. In general,
[TABLE]
We note that since the KRR estimates are linear in , Theorem 1 easily gives the maximum MSE expression over the Hilbert ball of arbitrary radius , by replacing in (11) with and multiplying the entire right-hand side by .
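The following sketch evaluates a worst-case risk of the form suggested by the proof in Section 5: a worst-case approximation error $\max\{\max_{i \le r} \lambda^2 \mu_i / (\mu_i + \lambda)^2,\ \mu_{r+1}\}$ plus an estimation error $\frac{\sigma^2}{n} \sum_{i \le r} \mu_i^2 / (\mu_i + \lambda)^2$. This closed form, the function name `max_risk`, and the synthetic spectrum are assumptions of the sketch and should be checked against (11). The script compares the closed form with a direct matrix computation.

```python
import numpy as np

def max_risk(mu, r, lam, sigma2, n):
    """Assumed worst-case MSE of r-truncated, lam-regularized KRR (r >= 1).

    mu: eigenvalues of the normalized kernel matrix, sorted decreasingly.
      WAE = max( max_{i<=r} lam^2 mu_i / (mu_i + lam)^2 , mu_{r+1} ), mu_{n+1} := 0
      EE  = (sigma2 / n) * sum_{i<=r} (mu_i / (mu_i + lam))^2
    """
    head = mu[:r]
    tail = mu[r] if r < len(mu) else 0.0
    wae = max(np.max(lam**2 * head / (head + lam) ** 2), tail)
    ee = sigma2 / n * np.sum((head / (head + lam)) ** 2)
    return wae + ee

rng = np.random.default_rng(3)
n, r, lam, sigma2 = 40, 6, 0.05, 1.0
mu = 1.0 / np.arange(1, n + 1) ** 2                  # synthetic polynomial eigendecay
U, _ = np.linalg.qr(rng.standard_normal((n, n)))     # random orthogonal eigenbasis
K_half = (U * np.sqrt(mu)) @ U.T
gamma = (U[:, :r] * (mu[:r] / (mu[:r] + lam))) @ U[:, :r].T   # smoother matrix

# Direct computation: squared operator norm for the bias, Frobenius norm for the variance.
wae_direct = np.linalg.norm((np.eye(n) - gamma) @ K_half, 2) ** 2
ee_direct = sigma2 / n * np.linalg.norm(gamma, "fro") ** 2
risk_direct = wae_direct + ee_direct
risk_formula = max_risk(mu, r, lam, sigma2, n)
```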
We also have a precise result on the regularized risk of the approximating function:
Proposition 1.
Let be obtained by passing the noiseless observations , instead of , through the estimation procedure in Definition 1. Then,
[TABLE]
3.1 Maximum-risk inadmissibility
Let us now consider how the maximum risk of the truncated KRR compares with the full version. For every , define
[TABLE]
In addition, recalling that is the full KRR estimator, let
[TABLE]
That is, is the regularization parameter that achieves the minimal maximum-risk for the full KRR. We have the following corollary of Theorem 1:
Corollary 1.
For every , and every with ,
[TABLE]
In particular, for every ,
[TABLE]
Both inequalities are strict whenever .
Corollary 1 shows that -optimized strictly improves on optimized full KRR whenever , in a sense rendering the full KRR inadmissible, as far as the maximum risk over is concerned. Note that we are not claiming inadmissibility in the classical sense which requires one estimator to improve on another for all . In general, the slower the decay of , the more significant the improvement gained by truncation. Note that (14) allows one to set the precise truncation level including the exact constants if one has access to the eigenvalues of the kernel matrix. In practice, for large , the eigenvalues of the associated kernel integral operator (if available) can act as excellent surrogates for [23].
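As a rough numerical illustration of Corollary 1, the sketch below picks a truncation level from a known spectrum. It assumes that the rule in (14) takes the form "smallest $r$ with $\mu_{r+1}$ no larger than the worst-case approximation error of the full KRR at its optimal $\lambda$", and it uses an assumed closed form for the two risk terms. The spectrum $\mu_i = i^{-2}$, the grid of $\lambda$ values, and the name `risk_terms` are hypothetical.

```python
import numpy as np

def risk_terms(mu, r, lam, sigma2, n):
    """Assumed (WAE, EE) of r-truncated, lam-regularized KRR; mu sorted decreasingly."""
    head = mu[:r]
    tail = mu[r] if r < len(mu) else 0.0
    wae = max(np.max(lam**2 * head / (head + lam) ** 2), tail)
    ee = sigma2 / n * np.sum((head / (head + lam)) ** 2)
    return wae, ee

n, sigma2 = 200, 1.0
mu = 1.0 / np.arange(1, n + 1) ** 2          # synthetic polynomial eigendecay
lams = np.logspace(-6, 1, 400)

# Regularization curve of the full KRR and its minimizer on the grid.
full = np.array([sum(risk_terms(mu, n, lam, sigma2, n)) for lam in lams])
lam_n = lams[np.argmin(full)]
h_full, _ = risk_terms(mu, n, lam_n, sigma2, n)

# Smallest r (1-based) with mu_{r+1} <= h_full; mu[r] is mu_{r+1} in 0-based indexing.
r_n = next(r for r in range(1, n + 1) if (mu[r] if r < n else 0.0) <= h_full)

# Regularization curve of the truncated KRR at level r_n.
trunc = np.array([sum(risk_terms(mu, r_n, lam, sigma2, n)) for lam in lams])
```

Comparing `trunc.min()` with `full.min()` shows the gain from truncation for this particular spectrum; the size of the gap depends on how slowly the eigenvalues decay.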
3.2 Do we need both regularizations?
Although the spectral truncation is used as a computational device, intuitively, it also has an implicit regularization effect. This is confirmed more rigorously by Corollary 1 where truncation is shown to lead to a smaller optimal worst-case MSE. The intuition is also supported by the link between the (full) KRR and Tikhonov regularization. In both cases, one forms which can be considered as a form of “spectral filtering”. Eigenvalue truncation followed by taking the pseudo-inverse can be considered as another form of such filtering. A common conception is that these two approaches are performing essentially the same task, hence one of them is enough to achieve the desired regularization effect. More specifically, one can ask the following: Is Hilbert norm regularization, or -regularization, really needed in the presence of spectral truncation? Theorem 1 allows us to settle this question. For a given truncation level , let
[TABLE]
be the optimal threshold for the -truncated -regularized KRR estimator.
Corollary 2.
For every , we have
[TABLE]
where with and running in .
Corollary 2 shows that for any truncation level , the optimal choice of is always positive, hence -regularization further improves the performance. The effect is more pronounced when is close to or, in general, when the spectrum decays slowly (hence for most ). The effect is also more significant for higher effective noise levels .
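The claim can be probed numerically. The sketch below compares, for several truncation levels, the limiting risk as $\lambda \to 0$ with the best risk over a grid of positive $\lambda$. It relies on an assumed closed form for the worst-case risk (see the docstring), whose $\lambda \to 0$ limit is $\mu_{r+1} + \sigma^2 r / n$ when all retained eigenvalues are positive. The spectrum, the grid, and the function name are hypothetical.

```python
import numpy as np

def max_risk(mu, r, lam, sigma2, n):
    """Assumed worst-case MSE: WAE + EE, with mu sorted decreasingly and r >= 1."""
    head = mu[:r]
    tail = mu[r] if r < len(mu) else 0.0
    wae = max(np.max(lam**2 * head / (head + lam) ** 2), tail)
    ee = sigma2 / n * np.sum((head / (head + lam)) ** 2)
    return wae + ee

n, sigma2 = 200, 1.0
mu = 1.0 / np.arange(1, n + 1) ** 2
lams = np.logspace(-8, 1, 500)

results = {}
for r in (5, 20, 100, n):
    tail = mu[r] if r < n else 0.0
    risk_at_zero = tail + sigma2 * r / n          # limit of the risk as lam -> 0
    curve = np.array([max_risk(mu, r, lam, sigma2, n) for lam in lams])
    results[r] = (risk_at_zero, curve.min(), lams[np.argmin(curve)])
```

Each entry of `results` holds the unregularized risk, the best regularized risk, and the grid value of $\lambda$ that attains it.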
3.3 Gaussian complexity and rates
Less precise bounds, albeit good enough to capture the correct asymptotic rate as , can be obtained in terms of the Gaussian complexity of the unit ball of the RKHS. These types of results have been obtained for the Sketched-KRR. To make a comparison easier, let us show how such bounds can be obtained from Theorem 1.
Let us define the -truncated complexity (of the empirical Hilbert ball) as
[TABLE]
For the case , this matches the definition of the kernel complexity in [12], which we refer to for the related background. In particular, (18) is a tight upper bound on the Gaussian complexity of the intersection of and [25, Chapter 13]. We have:
Corollary 3 (Looser bound).
Under the setup of Theorem 1, for ,
[TABLE]
If , one can replace the first term with for a better bound.
Choosing , we obtain
[TABLE]
The latter upper bound is what one would get for the full KRR. Matching the two terms in that bound, we choose such that which gives the well-known critical radius for the KRR problem [25]. It is known that gives the optimal rate of convergence for estimating functions in , i.e., its rate of decay matches that of the minimax risk [12]. The above argument shows that as long as is taken large enough so that , the -truncated KRR achieves (at least) the same rate as the full KRR. For the sketching, the same conclusion is established in [12], where the smallest satisfying is referred to as the statistical dimension of the kernel.
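The critical radius and the statistical dimension can be computed numerically from the eigenvalues. The sketch below uses the convention common in the sketching literature, which is an assumption here and may be parametrized differently from (18): the kernel complexity is $\sqrt{\frac{1}{n} \sum_i \min(\delta^2, \mu_i)}$, the critical radius is the smallest $\delta$ at which this is at most $\delta^2 / \sigma$, and the statistical dimension is the first index whose eigenvalue falls below the squared critical radius. Function names and the spectrum are hypothetical.

```python
import numpy as np

def kernel_complexity(mu, delta, r=None):
    """sqrt((1/n) * sum_{i<=r} min(delta^2, mu_i)); r=None uses all n eigenvalues."""
    n = len(mu)
    head = mu if r is None else mu[:r]
    return np.sqrt(np.sum(np.minimum(delta**2, head)) / n)

def critical_radius(mu, sigma, lo=1e-8, hi=10.0, iters=200):
    """Smallest delta with kernel_complexity(mu, delta) <= delta^2 / sigma.

    complexity(delta) / delta is non-increasing and delta / sigma is increasing,
    so the feasible set is an interval and bisection (on a log scale) applies.
    Assumes lo is infeasible and hi is feasible.
    """
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if kernel_complexity(mu, mid) <= mid**2 / sigma:
            hi = mid
        else:
            lo = mid
    return hi

n, sigma = 500, 1.0
mu = 1.0 / np.arange(1, n + 1) ** 2               # synthetic polynomial eigendecay
delta_n = critical_radius(mu, sigma)

# Statistical dimension: first (1-based) index with mu_j <= delta_n^2.
below = np.nonzero(mu <= delta_n**2)[0]
stat_dim = int(below[0]) + 1 if below.size else n
```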
For Sobolev- kernels, with eigendecay , we obtain . Interestingly, in this case, the estimate based on the weaker bound (19) and the exact bound (11) give the same rate (cf. Appendix C). This is expected since the given rate is known to be minimax optimal for Sobolev spaces. The same goes for the Gaussian kernel for which and the rate is for .
Order-wise, will be the same as defined in (14), that is , whenever matches the optimal rate. Hence, often for large and the argument leading to (12) suggests that in this case . Then, $r_{n} \approx \min\big\{ r \in [n] :\; \mu_{r+1} \leq \frac{\lambda_{n}}{4} \big\}$.
For Sobolev- kernels, this suggests truncation level which gives moderate savings for high smoothness levels . Similarly, for the Gaussian kernel, it is not hard to see that truncating to is enough to get the same rate as the full KRR, which is a substantial saving.
4 Simulations
We now present some numerical experiments to corroborate the theory. We consider a Gaussian kernel of bandwidth on , as well as the Sobolev-1 kernel on . We take the covariates to be equi-spaced points in each interval. The top row of Fig. 1 shows the plot of the theoretical maximum MSE as given by Theorem 1 for the two kernels, for both the full KRR (), and the optimally truncated version (). We have used in (11). As predicted by Theorem 1, the minimum achievable maximum MSE is smaller for the truncated KRR.
To compute the optimal truncation, we have evaluated the regularization curve of the full KRR first, obtained the minimizer and then used (14) to compute the optimal truncation level . For the setup of the simulation, we get for the Gaussian and for the Sobolev-1. It is interesting to note that although in terms of rates, for the Gaussian should be asymptotically much smaller than that of Sobolev-1, in finite samples, the truncation level for the Gaussian could be bigger as can be seen here. This is due to the unspecified, potentially large, constants in the rates (that depend on the bandwidth as well). Also, notice how surprisingly small is relative to in both cases.
The bottom row of Fig. 1 shows the empirical MSE obtained for a typical random , by computing the KRR estimates for observation and comparing with . The random true function is generated as where and further normalized so that . We have generated observations from (1) with . The plots were obtained using 1000 replications. The truncation levels are those calculated based on the maximum MSE formula (11). The plots show that for a typical application, the truncated KRR also dominates the full KRR.
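A simulation in this spirit can be sketched as follows. All numerical settings (sample size, bandwidth, noise level, number of replications, and the pair $(r, \lambda)$) are hypothetical placeholders, not the values used for Fig. 1, and the worst-case risk is computed from an assumed closed form. The random $f^*$ is built from a random coefficient vector and rescaled to unit Hilbert norm, under the assumed normalization $K = G/n$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma, bandwidth, reps = 100, 0.5, 0.1, 500     # hypothetical settings
x = np.linspace(0.0, 1.0, n)                       # equi-spaced design
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * bandwidth**2)) / n
mu, U = np.linalg.eigh(K)
mu, U = np.clip(mu[::-1], 0.0, None), U[:, ::-1]   # decreasing order, clip round-off

def smoother(r, lam):
    """Matrix mapping y to ST-KRR fitted values: U_r diag(mu / (mu + lam)) U_r^T."""
    return (U[:, :r] * (mu[:r] / (mu[:r] + lam))) @ U[:, :r].T

def max_risk(r, lam):
    """Assumed worst-case MSE (WAE + EE) for comparison with the empirical MSE."""
    tail = mu[r] if r < n else 0.0
    wae = max(np.max(lam**2 * mu[:r] / (mu[:r] + lam) ** 2), tail)
    return wae + sigma**2 / n * np.sum((mu[:r] / (mu[:r] + lam)) ** 2)

# Random f* with ||f*||_H^2 = omega^T K omega = 1; values at the design points.
omega = rng.standard_normal(n)
omega /= np.sqrt(omega @ K @ omega)
f_star = np.sqrt(n) * (K @ omega)

def empirical_mse(r, lam):
    """Monte Carlo estimate of the MSE in the empirical norm."""
    Y = f_star + sigma * rng.standard_normal((reps, n))
    fitted = Y @ smoother(r, lam)                  # the smoother is symmetric
    return np.mean(np.sum((fitted - f_star) ** 2, axis=1) / n)

lam, r = 1e-2, 10                                  # hypothetical choices
mse_full, mse_trunc = empirical_mse(n, lam), empirical_mse(r, lam)
bound_full, bound_trunc = max_risk(n, lam), max_risk(r, lam)
```

For a typical $f^*$ the empirical MSE sits below the worst-case value, since the latter is attained only by the least favorable function in the unit ball.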
5 Proof of the main result
Here we give the proof of Theorem 1 and Corollaries 1 and 2. The remaining proofs can be found in Appendix B.
From the discussion in Section 2.1, both the KRR estimate and the true function belong to given in (7). It is then useful to have an expression for the empirical error of functions belonging to this space. First, we observe that . Now, take any , and let and . Then, we have
[TABLE]
where the first equality is by the linearity of . For any function , we call the -space representation of . Identity (20) shows that it is often easier to work in the -space since the -transform turns empirical norms on functions into the usual norms on vectors. In other words, the map , is a Hilbert space isometry from to . In the -space, the KRR optimization problem can be equivalently stated as:
[TABLE]
where is the pseudo inverse of , and its range. More precisely:
Lemma 1.
For any , problems (4) and (21) are equivalent in the following sense:
1. For any minimizer of (4), is a minimizer of (21), and
2. for any minimizer of (21), any is a minimizer of (4).
It is often the case that the kernel matrix itself is invertible, in which case , and problem (21) simplifies. However, the equivalence in Lemma 1 holds even if we replace with an approximation which is rank deficient. This observation will be useful in the sequel.
Proof of Theorem 1.
Take to be as in Definition 1 and let . Since is the minimizer of we have or . Hence, or
[TABLE]
Let be the noise vector in (1) and . We also let
[TABLE]
Then, we can write model (1) as , where is zero mean with . From (20), we have , and
[TABLE]
where the first equality uses assumption (10). It follows that
[TABLE]
where the first term is the approximation error (AE) and the second term, the estimation error (EE). Let us write so that . We define
[TABLE]
and note that is diagonal. Let and . Then, since norm is unitarily invariant, we have
[TABLE]
Controlling the estimation error: We have
[TABLE]
using since is an orthogonal matrix. Then,
[TABLE]
establishing the EE part of the result.
Controlling the approximation error: Recall that we are interested in the worst-case approximation error (WAE) over the unit ball of the Hilbert space, i.e., over . Also, recall that without loss of generality, we can take . Hence,
[TABLE]
where the second equality is from (6), and the latter two are by definitions of and . We obtain
[TABLE]
A further change of variable gives
[TABLE]
where , applied to matrices, is the operator norm. Note that is a diagonal matrix with diagonal elements, for followed by zeros. It follows that is diagonal with diagonal elements:
[TABLE]
Since is a non-increasing sequence, we obtain
[TABLE]
which is the desired result. ∎
Proof of Corollary 1.
Let be the estimation error of as in (11). Note that as long as , we have . It remains to show that the WAE of the truncated KRR is less than that of full KRR. We have for ,
[TABLE]
This proves (15). For the second assertion, it is enough to apply (15) with , noting that in this case, the RHS will be the minimax risk of the full KRR and the LHS is further lower bounded by the minimax risk of the truncated KRR. ∎
Proof of Corollary 2.
Let us write and for the worst-case approximation and estimation errors, respectively, as a function of . Let be the worst-case MSE, so that . The WAE starts off with the constant branch for small values of . Let . The constant branch starts at and extends to where . Some algebra gives . For , we have showing that the minimizer of is .
The next branch of WAE starts at and ends at which solves . The knots determining subsequent branches are determined similarly: for and . We have for for . See Fig. 2.
Fix and let . Then for
[TABLE]
where ranges over . Note that
[TABLE]
The first inequality is since is increasing in if , hence lower-bounded by its value at , and is decreasing on if , hence lower-bounded by its value as . Then,
[TABLE]
It follows that as long as no matter which interval contains . This shows that the minimizer of has to be completing the proof. ∎
Acknowledgement
We thank Chad Hazlett and Linfan Zhang for helpful discussions and Zahra Razaee for comments on the manuscript.
Appendix A Time complexity comparison
The ST-KRR and approximate versions, such as Nyström and sketching, all have time complexity of for computing the -truncated KRR estimate, once the pieces required for approximating the kernel matrix (e.g., and in the case of sketching, and in the case of ST-KRR and so on) are computed. Computing these pieces is where these methods differ. For sketching, this step could have complexity as large as for dense sketches, for randomized Fourier and Hadamard sketches, to as low as for the Nyström.
For the ST-KRR, this step involves computing the top- eigenpairs of the symmetric matrix , for which the Lanczos algorithm is the standard and for which a complexity analysis is hard to find in the literature. However, results of [26] suggest that it has average-case complexity . More precisely, [26] show that on average Lanczos iterations are enough to compute the top eigenvalue to within relative error , hence an overall average-case complexity where is the number of nonzero entries of matrix .
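For illustration, the top-$r$ eigenpairs can be approximated with a few matrix-block products. The sketch below uses plain subspace (block power) iteration with a Rayleigh-Ritz step, which is a simpler stand-in and not the Lanczos algorithm discussed above; in practice one would more likely call a Lanczos-type routine such as `scipy.sparse.linalg.eigsh`. The function name, oversampling, iteration count, and test kernel are hypothetical, and convergence depends on the gap between the retained and discarded eigenvalues.

```python
import numpy as np

def top_eigenpairs(K, r, oversample=10, iters=30, seed=0):
    """Approximate top-r eigenpairs of a symmetric PSD matrix by subspace iteration."""
    n = K.shape[0]
    k = min(n, r + oversample)
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    for _ in range(iters):
        Q, _ = np.linalg.qr(K @ Q)          # each step costs O(n^2 k) for dense K
    T = Q.T @ K @ Q                          # Rayleigh-Ritz projection (k x k)
    vals, vecs = np.linalg.eigh(T)
    order = np.argsort(vals)[::-1][:r]
    return vals[order], Q @ vecs[:, order]

n, r = 300, 8
x = np.linspace(0.0, 1.0, n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.1**2)) / n   # Gaussian kernel
vals, vecs = top_eigenpairs(K, r)
exact = np.linalg.eigh(K)[0][::-1][:r]       # dense O(n^3) reference
```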
Appendix B Remaining proofs
Proof of Proposition 1.
We will use the same notation as in the proof of Theorem 1. By the same argument as in that proof, we have where is defined in (24) and for given in (23). Let be the solution of (9) for the input (instead of ) so that . Using the optimality condition in the proof of Theorem 1,
[TABLE]
where we have used (10) and (22), with (i.e., ). We can write
[TABLE]
using , and . Recall from (26) that is equivalent to . It follows that
[TABLE]
where the third equality is using the change of variable as in the proof of Theorem 1, and the last line follows since all the matrices are diagonal and hence commute. The result now follows by combining (25) and (27), after some algebra. ∎
Proof of Corollary 3.
For any , we have , where . Hence, the estimation error in (11) is bounded as
[TABLE]
This upper bound is within a factor of of the estimation error. Using to shave off the power by one, we obtain the weaker bound:
[TABLE]
Recalling definition (18), we conclude that if ,
[TABLE]
Combining with the WAE bound (12), we obtain the desired result. ∎
Proof of Lemma 1.
Let and be the objective functions in (4) and (21), respectively. We have for any , which follows from the identity . Now, assume that is a minimizer of , and let . Pick any ; there exists such that , and we have . The other direction follows similarly. ∎
Appendix C Rate calculations
Here we compute the error rate predicted by the strong and weak bounds and show that they are the same. Let . Assume the polynomial eigendecay of the Sobolev- kernel, i.e., . Taking to be the smallest integer satisfying , we have
[TABLE]
where the first inequality uses an integral approximation to the sum and the second uses the definition of . Setting we have , hence the critical radius .
Now consider the strong bound. As discussed in the text, . Also, as the proof of Corollary 3 shows, we have
[TABLE]
Letting be defined as the smallest integer such that , we get as before. Then, the maximum MSE is bounded as
[TABLE]
Since , by the definition of , we obtain . Equating the two terms we obtain as before.
For the Gaussian kernel, with , it is not hard to verify that with , we get . Minimizing the bound over , we obtain , hence .
References
[1] Larry Wasserman “All of nonparametric statistics” Springer Science & Business Media, 2006
[2] Alexandre B Tsybakov “Introduction to Nonparametric Estimation” Springer, New York, NY, 2009
[3] Vern I Paulsen and Mrinal Raghupathi “An introduction to the theory of reproducing kernel Hilbert spaces” Cambridge University Press, 2016
[4] Grace Wahba “Spline models for observational data” SIAM, 1990
[5] George Kimeldorf and Grace Wahba “Some results on Tchebycheffian spline functions” In Journal of Mathematical Analysis and Applications 33.1, Elsevier, 1971, pp. 82–95
[6] Christopher KI Williams and Matthias Seeger “Using the Nyström method to speed up kernel machines” In Advances in Neural Information Processing Systems, 2001, pp. 682–688
[7] Kai Zhang, Ivor W Tsang and James T Kwok “Improved Nyström low-rank approximation and error analysis” In Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 1232–1239
[8] Sanjiv Kumar, Mehryar Mohri and Ameet Talwalkar “Ensemble Nyström method” In Advances in Neural Information Processing Systems, 2009, pp. 1060–1068
