Kernel quadrature with DPPs
Ayoub Belhadji, R\'emi Bardenet, Pierre Chainais

TL;DR
This paper introduces a novel quadrature method for functions in RKHS using determinantal point processes, providing theoretical error bounds and demonstrating superior empirical performance over existing methods.
Contribution
It establishes a new kernel quadrature approach with DPPs, linking kernels and spectra to achieve tighter error bounds and improved sampling efficiency.
Findings
DPP-based quadrature outperforms existing methods in experiments.
Theoretical bounds relate quadrature error to the spectrum of the RKHS kernel.
Numerical results suggest DPPs can achieve faster convergence rates.
Abstract
We study quadrature rules for functions from an RKHS, using nodes sampled from a determinantal point process (DPP). DPPs are parametrized by a kernel, and we use a truncated and saturated version of the RKHS kernel. This link between the two kernels, along with DPP machinery, leads to relatively tight bounds on the quadrature error, that depends on the spectrum of the RKHS kernel. Finally, we experimentally compare DPPs to existing kernel-based quadratures such as herding, Bayesian quadrature, or leverage score sampling. Numerical results confirm the interest of DPPs, and even suggest faster rates than our bounds in particular cases.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGaussian Processes and Bayesian Inference · Bayesian Methods and Mixture Models · Markov Chains and Monte Carlo Methods
Kernel quadrature with DPPs
Ayoub Belhadji, Rémi Bardenet, Pierre Chainais
Univ. Lille, CNRS, Centrale Lille, UMR 9189 - CRIStAL, Villeneuve d’Ascq, France
{ayoub.belhadji, remi.bardenet, pierre.chainais}@univ-lille.fr
Abstract
We study quadrature rules for functions from an RKHS, using nodes sampled from a determinantal point process (DPP). DPPs are parametrized by a kernel, and we use a truncated and saturated version of the RKHS kernel. This link between the two kernels, along with DPP machinery, leads to relatively tight bounds on the quadrature error, that depends on the spectrum of the RKHS kernel. Finally, we experimentally compare DPPs to existing kernel-based quadratures such as herding, Bayesian quadrature, or leverage score sampling. Numerical results confirm the interest of DPPs, and even suggest faster rates than our bounds in particular cases.
1 Introduction
Numerical integration [15] is an important tool for Bayesian methods [54] and model-based machine learning [45]. Formally, numerical integration consists in approximating
[TABLE]
where is a topological space, is a Borel probability measure on , is a square integrable function, and is a function belonging to a space to be precised. In the quadrature formula (1), the points are called the quadrature nodes, and the corresponding weights.
The accuracy of a quadrature rule is assessed by the quadrature error, i.e., the absolute difference between the left-hand side and the right-hand side of (1). Classical Monte Carlo algorithms, like importance sampling or Markov chain Monte Carlo [55], pick up the nodes as either independent samples or a sample from a Markov chain on , and all achieve a root mean square quadrature error in . Quasi-Monte Carlo quadrature [16] is based on deterministic, low-discrepancy sequences of nodes, and typical error rates for are . Recently, kernels have been used to derive quadrature rules such as herding [2, 10], Bayesian quadrature [26, 49], sophisticated control variates [37, 47], and leverage-score quadrature [1] under the assumption that lies in a RKHS. The main theoretical advantage is that the resulting error rates are faster than classical Monte Carlo and adapt to the smoothness of .
In this paper, we propose a new quadrature rule for functions in a given RKHS. Our nearest scientific neighbour is [1], but instead of sampling nodes independently, we leverage dependence and use a repulsive distribution called a projection determinantal point process (DPP), while the weights are obtained through a simple quadratic optimization problem. DPPs were originally introduced by [38] as probabilistic models for beams of fermions in quantum optics. Since then, DPPs have been thoroughly studied in random matrix theory [28], and have more recently been adopted in machine learning [34] and Monte Carlo methods [3].
In practice, a projection DPP is defined through a reference measure and a repulsion kernel . In our approach, the repulsion kernel is a modification of the underlying RKHS kernel. This ensures that sampling is tractable, and, as we shall see, that the expected value of the quadrature error is controlled by the decay of the eigenvalues of the integration operator associated to the measure . Note that quadratures based on projection DPPs have already been studied in the literature: implicitly in [29, Corollary 2.3] in the simple case where and is the uniform measure, and in [3] for and more general measures. In the latter case, the quadrature error is asymptotically of order [3], with essentially . In the current paper, we leverage the smoothness of the integrand to improve the convergence rate of the quadrature in general spaces .
This article is organized as follows. Section 2 reviews kernel-based quadrature. In Section 3, we recall some basic properties of projection DPPs. Section 4 is devoted to the exposition of our main result, along with a sketch of proof. We give precise pointers to the supplementary material for missing details. Finally, in Section 5 we illustrate our result and compare to related work using numerical simulations, for the uniform measure in and , and the Gaussian measure on .
Notation.
Let be a topological space equipped with a Borel measure and assume that the support of is . Let be the Hilbert space of square integrable, real-valued functions defined on , with the usual inner product denoted by , and the associated norm by . Let be a symmetric and continuous function such that, for any finite set of points in , the matrix of pairwise kernel evaluations is positive semi-definite. Denote by the associated reproducing kernel Hilbert space (RKHS) of real-valued functions [5]. We assume that is integrable with respect to the measure so that . Define the integral operator
[TABLE]
By construction, is self-adjoint, positive semi-definite, and trace-class [56]. For , denote by the -th eigenfunction of , normalized so that and the corresponding eigenvalue. The integrability of the diagonal implies that is compactly embedded in , that is, the identity map is compact; moreover, since is of full support in , is injective [61]. This implies a Mercer-type decomposition of ,
[TABLE]
where and the convergence is point-wise [62]. Moreover, for , we write . Since is injective [62], is an orthonormal basis of . Unless explicitly stated, we assume that is dense in , so that is an orthonormal basis of . For more intuition, under these assumptions, if and only if converges.
2 Related work on kernel-based quadrature
When the integrand belongs to the RKHS of kernel [12], the quadrature error reads [57]
[TABLE]
where is the so-called mean element [17, 44]. A tight approximation of the mean element by a linear combination of functions thus guarantees low quadrature error. The approaches described in this section differ by their choice of nodes and weights.
2.1 Bayesian quadrature and the design of nodes
Bayesian Quadrature initially [35] considered a fixed set of nodes and put a Gaussian process prior on the integrand . Then, the weights were chosen to minimize the posterior variance of the integral of . If the kernel of the Gaussian process is chosen to be , this amounts to minimizing the RHS of (2). The case of the Gaussian reference measure was later investigated in detail [49], while parametric integrands were considered in [43]. Rates of convergence were provided in [9] for specific kernels on compact spaces, under a fill-in condition [65] that encapsulates that the nodes must progressively fill up the (compact) space.
Finding the weights that optimize the RHS of (2) for a fixed set of nodes is a relatively simple task, see later Section 4.1, the cost of which can even be reduced using symmetries of the set of nodes [27, 32]. Jointly optimizing on the nodes and weights, however, is only possible in specific cases [7, 30]. In general, this corresponds to a non-convex problem with many local minima [23, 48]. While [52] proposed to sample i.i.d. nodes from the reference measure , greedy minimization approaches have also been proposed [26, 48]. In particular, kernel herding [10] corresponds to uniform weights and greedily minimizing the RHS in (2). This leads to a fast rate in , but only when the integrand is in a finite-dimensional RKHS. Kernel herding and similar forms of sequential Bayesian quadrature are actually linked to the Frank-Wolfe algorithm [2, 8, 26]. Beside the difficulty of proving fast convergence rates, these greedy approaches still require heuristics in practice.
2.2 Leverage-score quadrature
In [1], the author proposed to sample the nodes i.i.d. from some proposal distribution , and then pick weights in (1) that solve the optimization problem
[TABLE]
for some regularization parameter . Proposition 1 gives a bound on the resulting approximation error of the mean element for a specific choice of proposal pdf, namely the leverage scores
[TABLE]
Proposition 1** (Proposition 2 in [1]).**
Let , and . Assume that , then
[TABLE]
In other words, Proposition 1 gives a uniform control on the approximation error by the subspace spanned by the for belonging to the unit ball of , where the are sampled i.i.d. from . The required number of nodes is equal to for a given approximation error . However, for fixed , the approximation error in Proposition 1 does not go to zero when increases. One theoretical workaround is to make decrease with . However, the coupling of and through makes it very intricate to derive a convergence rate from Proposition 1. Moreover, the optimal density is in general only available as the limit (6), which makes sampling and evaluation difficult. Finally, we note that Proposition 1 highlights the fundamental role played by the spectral decomposition of the operator in designing and analyzing kernel quadrature rules.
3 Projection determinantal point processes
Let and an orthonormal family of , and assume for simplicity that and that has density with respect to the Lebesgue measure. Define the repulsion kernel
[TABLE]
not to be mistaken for the RKHS kernel . One can show [25, Lemma 21] that
[TABLE]
is a probability density over . When have distribution (9), the set is said to be a projection DPP111In the finite case, more common in ML, projection DPPs are also called elementary DPPs [34]. with reference measure and kernel . Note that the kernel is a positive definite kernel so that the determinant in (9) is non-negative. Equation (9) is key to understanding DPPs. First, loosely speaking, the probability of seeing a point of in an infinitesimal volume around is . Note that when and are the family of orthonormal polynomials with respect to , this marginal probability is related to the optimal proposal in Section 2.2; see Appendix E.2. Second, the probability of simultaneously seeing a point of in an infinitesimal volume around and one around is
[TABLE]
The probability of co-occurrence is thus always smaller than that of a Poisson process with the same intensity. In this sense, a projection DPP with symmetric kernel is a repulsive distribution, and encodes its repulsiveness.
One advantage of DPPs is that they can be sampled exactly. Because of the orthonormality of , one can write the chain rule for (9); see [25]. Sampling each conditional in turn, using e.g. rejection sampling [55], then yields an exact sampling algorithm. Rejection sampling aside, the cost of this algorithm is cubic in without further assumptions on the kernel. Simplifying assumptions can take many forms. In particular, when , and is a Gaussian, gamma [19], or beta [33] pdf, and are the orthonormal polynomials with respect to , the corresponding DPP can be sampled by tridiagonalizing a matrix with independent entries, which takes the cost to and bypasses the need for rejection sampling. For further information on DPPs see [28, 59].
4 Kernel quadrature with projection DPPs
We follow in the footsteps of [1], see Section 2.2, but using a projection DPP rather than independent sampling to obtain the nodes. In a nutshell, we consider nodes that are drawn from the projection DPP with reference measure and repulsion kernel
[TABLE]
where we recall that are the normalized eigenfunctions of the integral operator . The weights are obtained by solving the optimization problem
[TABLE]
where
[TABLE]
is the reconstruction operator222The reconstruction operator depends on the nodes , although our notation doesn’t reflect it for simplicity.. In Section 4.1 we prove that (11) almost surely has a unique solution and state our main result, an upper bound on the expected approximation error under the proposed Projection DPP. Section 4.2 gives a sketch of the proof of this bound.
4.1 Main result
Assuming that nodes are known, we first need to solve the optimization problem (11) that relates to problem (5) without regularization (). Let , then
[TABLE]
where . The right-hand side of (13) is quadratic in , so that the optimization problem (11) admits a unique solution if and only if is invertible. In this case, the solution is given by . A sufficient condition for the invertibility of is given in the following proposition.
Proposition 2**.**
Assume that the matrix is invertible, then is invertible.
The proof of Proposition 2 is given in Appendix D.1. Since the pdf (9) of the projection DPP with kernel (10) is proportional to , the following corollary immediately follows.
Corollary 1**.**
Let be a projection DPP with reference measure and kernel (10). Then is a.s. invertible, so that (11) has unique solution a.s.
We now give our main result that uses nodes drawn from a well-chosen projection DPP.
Theorem 1**.**
Let be a projection DPP with reference measure and kernel (10). Let be the unique solution to (11) and define . Assume that and define , then
[TABLE]
In particular, if , then the right-hand side of (14) is . For example, take , the uniform measure on , and the -Sobolev space, then [5]. Now, if , the expected worst case quadrature error is bounded by . Another example is the case of the Gaussian measure on , with the Gaussian kernel. In this case with and [53] so that .
We have assumed that is dense in but Theorem 1 is valid also when is finite-dimensional. In this case, denote . Then, for , and , so that (14) implies
[TABLE]
This compares favourably with herding, for instance, which comes with a rate in for the quadrature based on herding with uniform weights [2, 10].
The constant in is the norm of the coefficients of projection of onto in . For example, for , if and if . In the worst case, . Thus, we can obtain a uniform bound for in the spirit of Proposition 1, but with a supplementary factor in the upper bound in (14).
4.2 Bounding the approximation error under the DPP
In this section, we give the skeleton of the proof of Theorem 1, referring to the appendices for technical details. The proof is in two steps. First, we give an upper bound for the approximation error that involves the maximal principal angle between the functional subspaces of
[TABLE]
DPPs allow closed form expressions for the expectation of trigonometric functions of such angles; see [4] and Appendix E.1 for the geometric intuition behind the proof. The second step thus consists in developing the expectation of the bound under the DPP.
4.2.1 Bounding the approximation error using principal angles
Let be such that . By Proposition 2, is non singular and . The optimal approximation error writes
[TABLE]
where is the orthogonal projection onto with the dual333For ,. is an operator from to that can be identified with . of .
In other words, (16) equates the approximation error to , where is the orthogonal projection onto . Now we have the following lemma.
Lemma 1**.**
Assume that then and
[TABLE]
Now, to upper bound the right-hand side of (17), we note that is the product of two terms: is a decreasing function of while is the interpolation error of the eigenfunction , measured in the norm. We can bound the latter interpolation error uniformly in using the geometric notion of maximal principal angle between and . This maximal principal angle is defined through its cosine
[TABLE]
Similarly, we can define the principal angles for between the subspaces and . These angles quantify the relative position of the two subspaces. See Appendix C.3 for more details about principal angles. Now, we have the following lemma.
Lemma 2**.**
Let such that . Then
[TABLE]
To sum up, we have so far bounded the approximation error by the geometric quantity in the right-hand side of (19). Where projection DPPs shine is in taking expectations of such geometric quantities.
4.2.2 Taking the expectation under the DPP
The analysis in Section 4.2.1 is valid whenever . As seen in Corollary 1, this condition is satisfied almost surely when is drawn from the projection DPP of Theorem 1. Furthermore, the expectation of the right-hand side of (19) can be written in terms of the eigenvalues of the kernel .
Proposition 3**.**
Let be a projection DPP with reference measure and kernel (10). Then,
[TABLE]
The bound of Proposition 3, once reported in Lemma 2 and Lemma 1, already yields Theorem 1 in the special case where . This seems a very restrictive condition, but next Proposition 4 shows that we can always reduce the analysis to that case. In fact, let the kernel be defined by
[TABLE]
and let be the corresponding RKHS. Then one has the following inequality.
Proposition 4**.**
Let and the orthogonal projection onto in . Then,
[TABLE]
Simply put, capping the first eigenvalues of yields a new kernel that captures the interaction between the terms and such that we only have to deal with the term . Combining Proposition 3 with Proposition 4 applied to the kernel yields Theorem 1.
4.3 Discussion
We have arbitrarily introduced a product in the right-hand side of (19), which is a rather loose majorization. Our motivation is that the expected value of this symmetric quantity is tractable under the DPP. Getting rid of the product could make the bound much tighter. Intuitively, taking the upper bound in (20) to the power results in a term in for the RKHS . Improving the bound in (20) would require a de-symmetrization by comparing the maximum of the to their geometric mean. An easier route than de-symmetrization could be to replace the product in (19) by a sum, but this is beyond the scope of this article.
In comparison with [1], we emphasize that the dependence of our bound on the eigenvalues of the kernel , via , is explicit. This is in contrast with Proposition 1 that depends on the eigenvalues of through the degree of freedom so that the necessary number of samples diverges when . On the contrary, our quadrature requires a finite number of points for . It would be interesting to extend the analysis of our quadrature in the regime .
5 Numerical simulations
5.1 The periodic Sobolev space and the Korobov space
Let be the uniform measure on , and let the RKHS kernel be [5]
[TABLE]
so that is the Sobolev space of order on . Note that can be expressed in closed form using Bernoulli polynomials [64]. We take in (1), so that the mean element . We compare the following algorithms: the quadrature rule DPPKQ we propose in Theorem 1, the quadrature rule DPPUQ based on the same projection DPP but with uniform weights, implicitly studied in [29], the kernel quadrature rule (5) of [1], which we denote LVSQ for leverage score quadrature, with regularization parameter (note that the optimal proposal is ), herding with uniform weights [2, 10], sequential Bayesian quadrature (SBQ) [26] with regularization to avoid numerical instability, and Bayesian quadrature on the uniform grid (UGBQ). We take . Figures 1(a) and 1(b) show log-log plots of the worst case quadrature error w.r.t. , averaged over 50 samples for each point, for .
We observe that the approximation errors of all first four quadratures converge to [math] with different rates. Both UGBQ and DPPKQ converge to [math] with a rate of , which indicates that our bound in Theorem 1 is not tight in the Sobolev case. Meanwhile, the rate of DPPUQ is across the three values of : it does not adapt to the regularity of the integrands. This corresponds to the CLT proven in [29]. LVSQ without regularization converges to [math] slightly slower than . Augmenting further slows down convergence. Herding converges at an empirical rate of , which is faster than the rate predicted by the theoretical analysis in [2, 10]. SBQ is the only one that seems to plateau for , although it consistently has the best performance for low . Overall, in the Sobolev case, DPPKQ and UGBQ have the best convergence rate. UGBQ – known to be optimal in this case [7] – has a better constant.
Now, for a multidimensional example, consider the “Korobov" kernel defined on by
[TABLE]
We still take in (1) so that . We compare our DPPKQ, LVSQ without regularization (), the kernel quadrature based on the uniform grid UGBQ, the kernel quadrature SGBQ based on the sparse grid from [58], the kernel quadrature based on the Halton sequence HaltonBQ [22]. We take and . The results are shown in Figure 1(c). This time, UGBQ suffers from the dimension with a rate in , while DPPKQ, HaltonBQ and LVSQ all perform similarly well. They scale as , which is a tight upper bound on , see [1] and Appendix B. SGBQ seems to lag slightly behind with a rate [24, 58].
5.2 The Gaussian kernel
We now consider to be the Gaussian measure on along with the RKHS kernel , and again . Figure 1(d) compares the empirical performance of DPPKQ to the theoretical bound of Theorem 1, herding, crude Monte Carlo with i.i.d. sampling from , and sequential Bayesian Quadrature, where we again average over samples. We take and . Note that, this time, only the -axis is on the log scale for better display, and that LVSQ is not plotted since we don’t know how to sample from in (6) in this case. We observe that the approximation error of DPPKQ converges to [math] as , while the discussion below Theorem 1 let us expect a slightly slower . Herding improves slightly upon Monte Carlo that converges as . Similarly to Sobolev spaces, the convergence of sequential Bayesian quadrature plateaus even if it has the smallest error for small . We also conclude that DPPKQ is a close runner-up to SBQ and definitely takes the lead for large enough .
6 Conclusion
In this article, we proposed a quadrature rule for functions living in a RKHS. The nodes are drawn from a DPP tailored to the RKHS kernel, while the weights are the solution to a tractable, non-regularized optimization problem. We proved that the expected value of the squared worst case error is bounded by a quantity that depends on the eigenvalues of the integral operator associated to the RKHS kernel, thus preserving the natural feel and the generality of the bounds for kernel quadrature [1]. Key intermediate quantities further have clear geometric interpretations in the ambient RKHS. Experimental comparisons suggest that DPP quadrature favourably compares with existing kernel-based quadratures. In specific cases where an optimal quadrature is known, such as the uniform grid for 1D periodic Sobolev spaces, DPPKQ seems to have the optimal convergence rate. However, our generic error bound does not reflect this optimality in the Sobolev case, and must thus be sharpened.
We have discussed room for improvement in our proofs. Further work should also address exact sampling algorithms, which do not exist yet when the spectral decomposition of the integral operator is not known. Approximate algorithms would also suffice, as long as the error bound is preserved.
Acknowledgments
We acknowledge support from ANR grant BoB (ANR-16-CE23-0003) and région Hauts-de-France. We also thank Adrien Hardy and the reviewers for their detailed and insightful comments.
Appendix A Implementation details
In this section, we give details on the repulsion kernels in each example of the main paper, and explain how we sampled from the corresponding DPPs. In short, we relied on matrix models for univariate cases, and vanilla DPP sampling [25] for multivariate settings.
A.1 The one-dimensional periodic Sobolev space
Consider the kernel defined by
[TABLE]
The Mercer decomposition of associated to the uniform measure on writes
[TABLE]
The corresponding repulsion kernel is
[TABLE]
if is even and
[TABLE]
if not. The projection DPP with kernel and reference measure can be sampled through a matrix model. Indeed this DPP is also the distribution of the arguments (normalized by ) of the eigenvalues of a random unitary matrix drawn from the Haar measure on [66]. Sampling such matrices can be done, e.g., using the QR decomposition of a matrix with i.i.d. unit complex Gaussians as coefficients [41].
A.2 The one-dimensional Gaussian kernel
Let and the reference measure be defined by
[TABLE]
For notational convenience, we further let
[TABLE]
and
[TABLE]
Now, the Mercer decomposition of reads [51]
[TABLE]
where
[TABLE]
and is the -th Hermite polynomial (i.e., orthonormal polynomials for the pdf of a unit Gaussian). Now, denote
[TABLE]
and the measure
[TABLE]
The rescaled polynomials are orthonormal with respect to the measure . Moreover, for ,
[TABLE]
Thus, for , we have
[TABLE]
In other words, the projection DPP associated to the orthonormal family and the reference measure is equivalent to the projection DPP associated to the orthonormal family and the reference measure . The latter DPP is known to be the distribution of the eigenvalues of a symmetrized matrix with i.i.d. Gaussian entries [39], which is easily implemented.
A.3 The case of a tensor product of RKHSs
We consider the case where writes as a tensor product of RKHSs, with the associated kernel
[TABLE]
with .
A.3.1 The multivariate integral operator
The integral operator becomes
[TABLE]
In the main paper, we considered for instance the Korobov space , defined as the tensor product of unidimensional periodic Sobolev spaces. Note that an element of is such that
[TABLE]
This implies that is included in the multidimensional Sobolev space, which corresponds to the same requirement, but only for multi-indices such that . Another example, featured in this supplementary material, is the multidimensional Gaussian space associated to the Gaussian kernel on and the multidimensional Gaussian measure. In this case, the kernel can be written as the tensor product of the Gaussian kernels on :
[TABLE]
In general, the eigenpairs of the integral operator are the tensor products of the eigenpairs of the integral operators corresponding to the spaces and measures . In other words, for ,
[TABLE]
A.3.2 Fixing an order on multi-indices
The definition of the projection DPP and its kernel now require that we fix an order on multi-indices. We choose an order that keeps eigenvalues decreasing, as in the univariate case where . Whenever the univariate eigenvalues take the form with , such as in the Korobov case, it holds
[TABLE]
Now, if the eigenvalues takes the form , with , as in the Gaussian case,
[TABLE]
In the multivariate Korobov and the Gaussian cases, we thus define in this work as (44) or (47), respectively.
Now, for , let be the -th multi-index according to . The repulsion kernel is defined as
[TABLE]
We sampled from the corresponding DPP using the generic sampling algorithm in [25], using the uniform and Gaussian distributions as proposal in the successive rejection sampling steps for the Korobov and Gaussian cases, respectively.
Appendix B Supplementary simulations
In this section, we give more plots of the convergence of the quadrature error. Before that, we experimentally assess whether the upper bounds given in [1] are sharp. The author proved upper bounds for in cases where the univariate eigenvalues decrease polynomially or geometrically in . In particular, for the Korobov spaces of dimension and regularity , we have
[TABLE]
For the Gaussian RKHS in dimension , it holds
[TABLE]
where and are constants depending on the scale parameters of the kernel and the measure . In our experiments, we compare the errors of various quadratures to the two rates (49) and (50). We mean these rates to be proxies for plotting , where refers to the order introduced in Section A.3. Figure 2 shows that in the Korobov case, the rate (49) is indeed close to the corresponding eigenvalue for large values of . The value of could be larger than for and small values of . As for the Gaussian case, Figure 2 shows that the rate (50) is also close to the corresponding eigenvalue for all values of .
B.1 The multi Fourier ensemble and Korobov RKHS
We consider the case of Korobov spaces with and and compare the quadrature error of the same algorithms as in 5.1. The results are compiled in Figure 3. The numerical simulations confirm the dependencies of the theoretical bounds of the different algorithms to the dimension and the regularity . In particular, UGBQ have better performance for high values of and low values of while its asymptotic behaviour is still the same . Moreover, the empirical rate of SGBQ is similar to its theoretical rate [24, 58]. Finally, the rate is confirmed also for the algorithms DPPKQ, LVSQ and HaltonBQ.
B.2 The multi Gaussian ensemble
We consider the case of Gaussian spaces with . The kernel and the reference measure are the tensor product of respectively the same kernel and the same measure used in Section 5.2. We compare DPPKQ and Bayesian quadrature based on the tensor product of Gauss-Hermite nodes noted GHBQ. Note that a variant of this algorithm was proposed in [31]: the quadrature nodes are the tensor product of the Gauss-Hermite nodes however the weights were calculated differently. The authors proved under an assumption on the stability of the weights (that was verified empirically) that the rate of convergence is , where is a constant that quantify the stability of the weights, and are constants that depend simultaneously on the the stability of the weights and length scales of the kernel and the measure. The results are compiled in Figure 4.
The numerical simulations shows that the empirical rate of DPPKQ is that is slightly better than its theoretical rate . Moreover, we observe that the empirical rate of DPPKQ is better than the empirical rate of HGBQ.
Appendix C Mercer’s theorem, leverage scores, and principal angles
For the sake of completeness, this section gathers some known results, which will be used to prove our own. We will need a general version of Mercer’s theorem, as usual for kernel methods, see Section C.1. On a more technical ground, we will also need formulas for leverage score changes under rank 1 updates, see Section C.2. Finally, Section C.3 covers principal angles between subspaces of a Hilbert space, which bridge the gap between pairs of Hilbert subspaces and determinants, and facilitate taking expectations in Theorem 1.
C.1 Mercer decomposition in non-compact subspaces
In this section we recall Mercer’s theorem and its extensions to non-compact spaces. Let be a measurable space and a measure on . Assume is a positive definite kernel on . Whenever it is well-defined, we consider the operator
[TABLE]
Theorem 2**.**
Assume that is a compact space and is a finite Borel measure on . Then, there exists an orthonormal basis of consisting of eigenfunctions of , and the corresponding eigenvalues are non-negative. The eigenfunctions corresponding to non-vanishing eigenvalues can be taken to be continuous, and the kernel writes
[TABLE]
where the convergence is absolute and uniform.
Theorem 2 was first proven when and is the Lebesgue measure in [40]. A modern proof can be found in [36], while the proof in the general case can be found in [13]. Note, however, that the compactness assumption in Theorem 2 excludes kernels such as the Gaussian or the Laplace kernels. Hence, extensions to non-compact spaces are usually required in ML. In [63], the author extended Theorem 2 to , with the s compact and . One can also extend Mercer’s theorem under a compact embedding assumption [62]: the RKHS associated to is said to be compactly embedded in if the application
[TABLE]
is compact. A sufficient condition for this assumption is the integrability of the diagonal (Lemma 2.3, [62]):
[TABLE]
Note that this condition is not necessary (Example 2.9, [62]). Now, under the compact embedding assumption, the pointwise convergence of the Mercer decomposition to the kernel is equivalent to the injectivity of the embedding (Theorem 3.1, [62]).
C.2 Leverage score changes under rank 1 updates
In this section we prove a lemma inspired from Lemma 5 in [11]. This lemma concerns the changes of leverage scores under rank 1 updates.
We start by recalling the definition of leverage scores, which play an important role in randomized linear algebra [18]. Let , . Let be a matrix of full rank. For , denote the -th column of the matrix . Now, the -th leverage score of the matrix is defined by
[TABLE]
while the cross-leverage score between the -th column and the -th column is defined by
[TABLE]
It holds [18]
[TABLE]
and we have the following result.
Lemma 3**.**
Let , . Let of full rank and and . Let a diagonal matrix such that and for . Then
[TABLE]
and
[TABLE]
The proof of this lemma is similar to Lemma 5 in [11]. We recall the proof for completeness.
Proof.
(Adapted from [11]) The Sherman-Morrison formula applied to and the vector yields
[TABLE]
By definition of
[TABLE]
Now let . By definition of
[TABLE]
∎
C.3 Principal angles between subspaces in Hilbert spaces
We recall in this section the definition of principal angles between subspaces in Hilbert spaces and connect them to the determinant of the Gramian matrix of their orthonormal bases.
Proposition 5**.**
Let be a Hilbert space. Let and be two finite-dimensional subspaces of with . Denote and the orthogonal projections of onto these two subspaces. There exist two orthonormal bases for and denoted and , and a set of angles such that
[TABLE]
and for
[TABLE]
and
[TABLE]
and
[TABLE]
In particular
[TABLE]
We refer to [21] for the proof in the finite-dimensional case and [14] for the general case. The following result shows that the principal angles are somewhat independent of the choice of orthonormal bases. It can be found in [6, 42] for the finite dimensional case. We give here the proof for the general case, for the sake of completeness.
Corollary 2**.**
Let be any orthonormal basis of and be any orthonormal basis of , and let and . Then the eigenvalues of are the . In particular, .
Proof.
Let , , be the bases of Proposition 5. Let be such that
[TABLE]
Similarly, there exists a matrix such that
[TABLE]
This implies that
[TABLE]
where . Then
[TABLE]
Thus the eigenvalues of are the eigenvalues of . By Proposition 5, the diagonal elements of are
[TABLE]
We finish the proof by showing that the anti-diagonal elements satisfy
[TABLE]
By (65),
[TABLE]
Then
[TABLE]
Thus
[TABLE]
Finally, is a diagonal matrix and the eigenvalues of are the . ∎
Appendix D Proofs of our results
Section D.1 contains the proof of Proposition 2. In the main paper, we use it under the form of Corollary 1 to ensure that is almost surely invertible when is a projection DPP with reference measure and kernel (10). This allows computing the quadrature weights.
The rest of Section D deals with Theorem 1, our upper bound on the approximation error of DPP-based kernel quadrature. The proof is rather long, but can be decomposed in four steps, which we now introduce for ease of reading.
First, we prove Lemma 1, which separates the search for an upper bound into examining the contribution of the three terms in (17); this is Section D.2. The first two terms of (17) only depend on the function in (1), and we leave them be. The third term is more geometric, and relates to the approximation error of the space spanned by by the (random) subspace .
Second, in Section D.3, we bound this geometric term for a fixed DPP realization . We pay attention to obtain a bound that will later yield a tractable expectation under that DPP. This is done in Proposition 4, which in turn requires two intermediate results, Lemma 4 and Proposition 6.
Third, we take the expectation of the bound in Proposition 4 under the proposed DPP. This is done in Proposition 3, which is proven thanks to Proposition 2, Lemmas 2, 5 & 6. This is Section D.4.
Fourth, Theorem 1 is obtained in Section D.5, using the results of the previous steps, and an argument to reduce the proof to RKHSs with flat initial spectrum.
D.1 Proof of Proposition 2
Proof.
Recall the Mercer decomposition of :
[TABLE]
where the convergence is point-wise on . Define for the -th truncated kernel
[TABLE]
By (77)
[TABLE]
Let such that , and define
[TABLE]
By the continuity of the function and by (79)
[TABLE]
Thus to prove that , it is enough to prove that the is larger than a positive real number for large enough. We write
[TABLE]
with and is a diagonal matrix containing the first eigenvalues . The Cauchy-Binet identity yields
[TABLE]
Therefore,
[TABLE]
so that is a.s. invertible. ∎
D.2 Proof of Lemma 1
Proof.
First, we prove that
[TABLE]
Recall that
[TABLE]
and that we assumed in Section 1 that is dense in , so that is an orthonormal basis of and the eigenvalues are strictly positive. Now let and be defined by
[TABLE]
[TABLE]
Observe that . Now, for ,
[TABLE]
As a consequence,
[TABLE]
Now we turn to proving (17) from the main text. Define first the operators , and by
[TABLE]
[TABLE]
[TABLE]
[TABLE]
Note that and
[TABLE]
Using (94), there exists such that and . Now, the approximation error writes
[TABLE]
The operator is an orthogonal projection and so that by (99)
[TABLE]
Now, recall that the is orthonormal. Moreover for , is an eigenfunction of and the corresponding eigenvalue is . Thus
[TABLE]
Then
[TABLE]
Remarking that concludes the proof of (17) and therefore Lemma 1. ∎
D.3 Proof of Proposition 4
Proposition 4 gives an upper bound to the term that appears in Lemma 1. We first prove a technical result, Lemma 4, and then combine it with Proposition 6 to finish the proof. We conclude with the proof of Proposition 6.
D.3.1 A preliminary lemma
Let . Recall that and denote , see section 4.2.2. In the following, we define
[TABLE]
Lemma 4 below shows that each term of the form measures the squared norm of the projection of on . The same holds for and the projection of onto .
Indeed, since . Thus it is sufficient to prove that . This boils down to showing that is the matrix of the inner product restricted to .
Lemma 4**.**
For , let the vectors of the evaluations of and on the elements of respectively. Then
[TABLE]
We give the proof of (108); the proof of (109) follows the same lines.
Proof.
Let us write
[TABLE]
where the are the elements of the vector . Then
[TABLE]
Since is orthonormal,
[TABLE]
Using Mercer’s theorem, see (79),
[TABLE]
Combining (112) and (113) along with the definition of the vector yields
[TABLE]
∎
D.3.2 End of the proof of Proposition 4
Proof.
By Lemma 4, the inequality (22) in Proposition 4 is equivalent to
[TABLE]
As an intermediate remark, note that in the special case , by construction
[TABLE]
where is the Loewner order, the partial order defined by the convex cone of positive semi-definite matrices. Thus
[TABLE]
Noting that and that
[TABLE]
yields (115) for :
[TABLE]
For , the proof is much more subtle. Indeed, a naive application of the inequality (117) would lead to the following inequality
[TABLE]
Since , , we get
[TABLE]
and hence the unsatisfactory inequality
[TABLE]
We can prove a better inequality by applying a sequence of rank-one updates to the kernel to build intermediate kernels that lead to inequalities sharp enough to prove (115) for . Then inequality (115) will result as a corollary of Proposition 6 below. To this aim, we define RKHS , , that interpolate between and . For , define the kernel by
[TABLE]
and let the RKHS corresponding to the kernel . For , define . Similar to previous notations, we define as well
[TABLE]
Now we have the following useful proposition.
Proposition 6**.**
For , we have
[TABLE]
and
[TABLE]
For ease of reading, we first show that inequality (115) and therefore Proposition 4 is easily deduced from this Proposition 6 and then give its proof.
Let such that . We first remark that and use times inequality (126) of Proposition 6:
[TABLE]
Then we use (125) that is connected to the rank-one update from the kernel to so that
[TABLE]
Then we apply (126) to the r.h.s. again times to finally get:
[TABLE]
since and . This concludes the proof of the desired inequality (115) and therefore of Proposition 4. ∎
D.3.3 Proof of Proposition 6
Proof.
(Proposition 6) Let , and such that . Let defined by
[TABLE]
For define
[TABLE]
Let the diagonal matrix defined by
[TABLE]
Then one has the simple relation
[TABLE]
which prepares the use of Lemma 3 in Section C.2. By definition of the -th leverage score of the matrix , see (54) in Section C.2,
[TABLE]
Define similarly . Thanks to (57) of Lemma 3 and (133) and for
[TABLE]
where . Thus
[TABLE]
Then
[TABLE]
since and \tau_{n}\Big{(}\bm{A}_{n-1}\Big{)}\in[0,1] thanks to (56). This proves that for such that ,
[TABLE]
Now,
[TABLE]
Moreover the application is continuous in . This proves the inequality (125) of Proposition 6. To prove the inequality (126), we start by using (58):
[TABLE]
which implies that
[TABLE]
Then for ,
[TABLE]
As above, we conclude the proof by considering the limit
[TABLE]
This proves inequality (126) and concludes the proof of Proposition 6. ∎
D.4 Proof of Proposition 3
In this section, is the realization of the DPP of Theorem 1. Let and , and . Moreover, let and .
We first prove two lemmas that are necessary to prove Proposition 3.
D.4.1 Two preliminary lemmas
Lemma 5**.**
Let such that . Then,
[TABLE]
Proof.
The condition yields by Proposition 2 that is non singular. Thus . Let an orthonormal basis of with respect to . Using Corollary 2, and the fact that is an orthonormal basis of according to ,
[TABLE]
Now, write for ,
[TABLE]
Thus
[TABLE]
Then
[TABLE]
where
[TABLE]
Thus
[TABLE]
Now, let the columns of the matrix . is an orthonormal basis of with respect to , then by (147)
[TABLE]
Therefore
[TABLE]
Thus
[TABLE]
Combining (146), (152) and (155) concludes the proof of Lemma 5:
[TABLE]
∎
Lemma 6**.**
[TABLE]
Proof.
Let . From (79)
[TABLE]
Moreover,
[TABLE]
Now, for such that , is an orthonormal family of , then by [25] Lemma 21:
[TABLE]
Thus
[TABLE]
Now, implies that . In fact, for let the -th symmetric polynomial. By Maclaurin’s inequality [60], and for any vector
[TABLE]
Thus
[TABLE]
This inequality is independent of the dimension thus it can be extended for with . Therefore
[TABLE]
Furthermore,
[TABLE]
Then by monotone convergence theorem, is mesurable and
[TABLE]
∎
D.4.2 End of the proof of Proposition 3
Proof.
Remember that
[TABLE]
Then by Lemma 5 and the fact that
[TABLE]
Then, taking the expectation with respect to resulting from a DPP of kernel ,
[TABLE]
Now, by Lemma 6
[TABLE]
Therefore,
[TABLE]
∎
D.5 Proof of Theorem 1
Proof.
Thanks to Proposition 4 and Lemma 2 (for and )
[TABLE]
Then Proposition 3 applied to with kernel yields
[TABLE]
Every subset such that can be written as with and , and this decomposition is unique. Then
[TABLE]
Therefore
[TABLE]
where for , is the -th symmetric polynomial with the convention that .
Finally, thanks to (163) above
[TABLE]
As a consequence, by writing ,
[TABLE]
which can be plugged in Lemma 1 to conclude the proof. ∎
Appendix E The intuitions behind the algorithm
The algorithm presented in this article is based on several intuitions. In this section, we summarize these intuitions.
E.1 The geometric intuition
Recall that the quadrature problem in a RKHS boils down to a problem of interpolation of the mean element by a mixture of , where such that . A promising algorithm would thus be to select the nodes so as to minimize the projection of onto . Upper bounding the approximation error is not easy in general. One the one side, we propose to replace by its projection onto the first eigenfunctions of . Then it is easy to prove that
[TABLE]
On the other side, if we find a quadrature rule such that is small, then we can guarantee an overall approximation error that is not too much worse than the PCA error (179). After introducing an auxiliary RKHS with kernel , we express this second term using the principal angles between the subspaces and (see section 4.2.2). This yields a bound on the interpolation error
[TABLE]
The first term in the right hand side of (180) is , which corresponds to the approximation error observed in numerical simulations. The second term depends on the largest principal angle between the subspaces and , see Figure 5. This term can in turn be bounded by the symmetrized quantity
[TABLE]
which has a tractable expectation under the projection DPP that we consider in this paper. As an illustration of (180), Figure 6 compares the quality of approximation of a mean element using kernel interpolation based on two configurations of nodes: the first configuration (top) is well spread and the second configuration (bottom) is not. Observe that the largest principal angle for the first configuration is around , so that ; while it is around for the second configuration so that . Now observe that the first design of nodes gives the best reconstruction. This observation is consistent with (180).
E.2 The inclusion probability of DPPs and the Christoffel functions
The optimal distribution , see section 2.2, can be linked to the so-called Christoffel functions [50]. These functions are rooted in the literature on orthogonal polynomials [46]. To make it simpler, we introduce them in dimension . They are defined by
[TABLE]
Christoffel functions have a more explicit form [46] that can be used for pointwise evaluation
[TABLE]
where are the orthonormal polynomials with respect to . To establish a connection with , the authors of [50] defined regularized Christoffel functions for some kernel :
[TABLE]
The authors derived an asymptotic equivalent of the function in the regime under some assumptions on the kernel. Furthermore, they proved that is tied to by the following relationship (Lemma 5, [50]):
[TABLE]
On the other hand, assume that the are the family of orthonormal polynomials with respect to . Let and a random subset of drawn from the Projection DPP , then
[TABLE]
In other words, the inclusion probability of the corresponding projection DPP is related to the inverse of the Christoffel function as defined in (183). Figure 7 illustrates the evaluations of the inclusion probability of the projection DPP in the case of RKHS defined by the Gaussian kernel along with the Gaussian measure in the real line. Recall that in this case the eigenfunctions are given by
[TABLE]
The theoretical analysis of the "bumps" of the functions was carried out in [20]. More precisely, the authors studied the approximations of those bumps by Gaussians centred on the Hermite polynomials roots, see Figure 7 (b). We observe a similar behaviour for the multidimensional Gaussian case as illustrated in Figure 8: the inclusion probability of the projection DPP have has local maxima around the tensor products of the Hermite polynomials roots. In other words, the quadratures based on nodes sampled according to a projection DPP are probabilistic relaxations of classical quadratures based on roots of orthogonal polynomials that can be defined even if is not the square of an integer (the cases in Figure 8).
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Bach [2017] F. Bach. On the equivalence between kernel quadrature rules and random feature expansions. The Journal of Machine Learning Research , 18(1):714–751, 2017.
- 2Bach et al. [2012] F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Coference on International Conference on Machine Learning , ICML’12, pages 1355–1362, 2012.
- 3Bardenet and Hardy [2016] R. Bardenet and A. Hardy. Monte Carlo with determinantal point processes. ar Xiv:1605.00361 , May 2016.
- 4Belhadji et al. [2018] A. Belhadji, R. Bardenet, and P. Chainais. A determinantal point process for column subset selection. ar Xiv:1812.09771 , 2018.
- 5Berlinet and Thomas-Agnan [2011] A. Berlinet and Ch. Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics . Springer Science & Business Media, 2011.
- 6Björck and Golub [1973] A. Björck and G. H. Golub. Numerical methods for computing angles between linear subspaces. Mathematics of computation , 27(123):579–594, 1973.
- 7Bojanov [1981] B.D. Bojanov. Uniqueness of the optimal nodes of quadrature formulae. Mathematics of computation , 36(154):525–546, 1981.
- 8Briol et al. [2015] F.X. Briol, C. Oates, M. Girolami, and M.A. Osborne. Frank-Wolfe Bayesian quadrature: Probabilistic integration with theoretical guarantees. In Advances in Neural Information Processing Systems , pages 1162–1170, 2015.
