Kernel quadrature with DPPs

Ayoub Belhadji; Rémi Bardenet; Pierre Chainais

arXiv:1906.07832 · stat.ML · January 3, 2020


TL;DR

This paper introduces a novel quadrature method for functions in RKHS using determinantal point processes, providing theoretical error bounds and demonstrating superior empirical performance over existing methods.

Contribution

It establishes a new kernel quadrature approach with DPPs, linking kernels and spectra to achieve tighter error bounds and improved sampling efficiency.

Findings

01

DPP-based quadrature outperforms existing methods in experiments.

02

Theoretical bounds relate quadrature error to the spectrum of the RKHS kernel.

03

Numerical results suggest DPPs can achieve faster convergence rates.

Abstract

We study quadrature rules for functions from an RKHS, using nodes sampled from a determinantal point process (DPP). DPPs are parametrized by a kernel, and we use a truncated and saturated version of the RKHS kernel. This link between the two kernels, along with DPP machinery, leads to relatively tight bounds on the quadrature error, which depend on the spectrum of the RKHS kernel. Finally, we experimentally compare DPPs to existing kernel-based quadratures such as herding, Bayesian quadrature, or leverage score sampling. Numerical results confirm the interest of DPPs, and even suggest faster rates than our bounds in particular cases.

Equations

\int_{\mathcal{X}}f(x)g(x)\,\mathrm{d}\omega(x)\approx\sum_{j\in[N]}w_{j}f(x_{j}),

\bm{\Sigma}f(\cdot)=\int_{\mathcal{X}}k(\cdot,y)f(y)\,\mathrm{d}\omega(y),\quad f\in\mathbb{L}_{2}(\mathrm{d}\omega).

k(x,y)=\sum_{m\in\mathbb{N}^{*}}\sigma_{m}e_{m}(x)e_{m}(y),

\int_{\mathcal{X}}f(x)g(x)\,\mathrm{d}\omega(x)-\sum_{j\in[N]}w_{j}f(x_{j})\leq\|f\|_{\mathcal{F}}\,\Big\|\mu_{g}-\sum_{j\in[N]}w_{j}k(x_{j},.)\Big\|_{\mathcal{F}},

\min_{w\in\mathbb{R}^{N}}\Big\|\mu_{g}-\sum_{j\in[N]}\frac{w_{j}}{q(x_{j})^{1/2}}k(x_{j},.)\Big\|_{\mathcal{F}}^{2}+\lambda N\|w\|_{2}^{2},

q_{\lambda}^{*}(x)\propto\langle k(x,.),\bm{\Sigma}^{-1/2}(\bm{\Sigma}+\lambda\bm{I}_{\mathbb{L}_{2}(\mathrm{d}\omega)})^{-1}\bm{\Sigma}^{-1/2}k(x,.)\rangle_{\mathbb{L}_{2}(\mathrm{d}\omega)}=\sum_{m\in\mathbb{N}}\frac{\sigma_{m}}{\sigma_{m}+\lambda}e_{m}(x)^{2}.

\mathbb{P}\bigg(\sup_{\|g\|_{\mathrm{d}\omega}\leq 1}\inf_{\|\bm{w}\|^{2}\leq\frac{4}{N}}\Big\|\mu_{g}-\sum_{j\in[N]}\frac{w_{j}}{q_{\lambda}(x_{j})^{1/2}}k(x_{j},.)\Big\|_{\mathcal{F}}^{2}\leq 4\lambda\bigg)\geq 1-\delta.

\mathfrak{K}(x,y)=\sum_{n\in[N]}\psi_{n}(x)\psi_{n}(y),

\frac{1}{N!}\operatorname{Det}\big(\mathfrak{K}(x_{i},x_{j})\big)_{i,j\in[N]}\prod_{i\in[N]}\omega(x_{i})

\Big[\mathfrak{K}(x_{1},x_{1})\mathfrak{K}(x_{2},x_{2})-\mathfrak{K}(x_{1},x_{2})^{2}\Big]\omega(x_{1})\omega(x_{2})\,\mathrm{d}x_{1}\mathrm{d}x_{2}\leq\Big[\mathfrak{K}(x_{1},x_{1})\omega(x_{1})\mathrm{d}x_{1}\Big]\Big[\mathfrak{K}(x_{2},x_{2})\omega(x_{2})\mathrm{d}x_{2}\Big].

\mathfrak{K}(x,y)=\sum_{n\in[N]}e_{n}(x)e_{n}(y),

\min_{\bm{w}\in\mathbb{R}^{N}}\|\mu_{g}-\bm{\Phi}\bm{w}\|_{\mathcal{F}}^{2},

\bm{\Phi}:(w_{j})_{j\in[N]}\mapsto\sum_{j\in[N]}w_{j}k(x_{j},.)

\|\mu_{g}-\bm{\Phi}\bm{w}\|_{\mathcal{F}}^{2}=\|\mu_{g}\|_{\mathcal{F}}^{2}-2\bm{w}^{\intercal}\mu_{g}(x_{j})_{j\in[N]}+\bm{w}^{\intercal}\bm{K}(\bm{x})\bm{w},

\mathbb{E}_{\mathrm{DPP}}\|\mu_{g}-\bm{\Phi}\hat{\bm{w}}\|_{\mathcal{F}}^{2}\leq 2\sigma_{N+1}+2\|g\|_{\mathrm{d}\omega,1}^{2}\bigg(Nr_{N}+\sum_{\ell=2}^{N}\frac{\sigma_{1}}{\ell!^{2}}\Big(\frac{Nr_{N}}{\sigma_{1}}\Big)^{\ell}\bigg).

\|\mu_{g}-\bm{\Phi}\hat{\bm{w}}\|_{\mathcal{F}}^{2}=0\quad\text{a.s.}

\mathcal{E}_{N}^{\mathcal{F}}=\operatorname{Span}(e_{n}^{\mathcal{F}})_{n\in[N]}\quad\text{and}\quad\mathcal{T}(\bm{x})=\operatorname{Span}(k(x_{j},.))_{j\in[N]}.

\|\mu_{g}-\bm{\Phi}\hat{\bm{w}}\|_{\mathcal{F}}^{2}=\|\mu_{g}-\bm{\Pi}_{\mathcal{T}(\bm{x})}\mu_{g}\|_{\mathcal{F}}^{2},

\|\bm{\Pi}_{\mathcal{T}(\bm{x})^{\perp}}\mu_{g}\|^{2}_{\mathcal{F}}\leq 2\bigg(\sigma_{N+1}+\|g\|_{\mathrm{d}\omega,1}^{2}\max_{n\in[N]}\sigma_{n}\|\bm{\Pi}_{\mathcal{T}(\bm{x})^{\perp}}e_{n}^{\mathcal{F}}\|_{\mathcal{F}}^{2}\bigg).

\cos^{2}\theta_{N}(\mathcal{T}(\bm{x}),\mathcal{E}_{N}^{\mathcal{F}})=\inf_{\substack{u\in\mathcal{T}(\bm{x}),\,v\in\mathcal{E}_{N}^{\mathcal{F}}\\ \|u\|_{\mathcal{F}}=1,\,\|v\|_{\mathcal{F}}=1}}\langle u,v\rangle_{\mathcal{F}}.

\max_{n\in[N]}\|\bm{\Pi}_{\mathcal{T}(\bm{x})^{\perp}}e_{n}^{\mathcal{F}}\|_{\mathcal{F}}^{2}\leq\frac{1}{\cos^{2}\theta_{N}(\mathcal{T}(\bm{x}),\mathcal{E}_{N}^{\mathcal{F}})}-1\leq\prod_{n\in[N]}\frac{1}{\cos^{2}\theta_{n}(\mathcal{T}(\bm{x}),\mathcal{E}_{N}^{\mathcal{F}})}-1.

\mathbb{E}_{\mathrm{DPP}}\prod_{n\in[N]}\frac{1}{\cos^{2}\theta_{n}(\mathcal{T}(\bm{x}),\mathcal{E}^{\mathcal{F}}_{N})}=\sum_{\substack{T\subset\mathbb{N}^{*}\\ |T|=N}}\frac{\prod_{t\in T}\sigma_{t}}{\prod_{n\in[N]}\sigma_{n}}.

\tilde{k}(x,y)=\sum_{n\in[N]}\sigma_{1}e_{n}(x)e_{n}(y)+\sum_{n\geq N+1}\sigma_{n}e_{n}(x)e_{n}(y)=\sum_{n\in\mathbb{N}^{*}}\tilde{\sigma}_{n}e_{n}(x)e_{n}(y),

\forall n\in[N],\quad\sigma_{n}\|\bm{\Pi}_{\mathcal{T}(\bm{x})^{\perp}}e_{n}^{\mathcal{F}}\|_{\mathcal{F}}^{2}\leq\sigma_{1}\|\bm{\Pi}_{\tilde{\mathcal{T}}(\bm{x})^{\perp}}e_{n}^{\tilde{\mathcal{F}}}\|_{\tilde{\mathcal{F}}}^{2}.

k_{s}(x,y)=1+\sum_{m\in\mathbb{N}^{*}}\frac{1}{m^{2s}}\cos(2\pi m(x-y)),

\forall x,y\in[0,1]^{d},\quad k_{s,d}(x,y)=\prod_{i\in[d]}k_{s}(x_{i},y_{i}).

k_{s}(x,y)=\sum_{k\in\mathbb{Z}}\frac{1}{\max(1,|k|)^{2s}}e^{2\pi\mathrm{i}kx}e^{-2\pi\mathrm{i}ky}.

\mathfrak{K}(x,y)=e^{\pi\mathrm{i}N(x-y)}\sum_{m=-N/2}^{N/2}e^{2\pi\mathrm{i}mx}e^{-2\pi\mathrm{i}my}=\sum_{m=0}^{N}e^{2\pi\mathrm{i}mx}e^{-2\pi\mathrm{i}my},


Code & Models

Repositories

AyoubBelhadji/DPPKQ (official)


Taxonomy

Topics: Gaussian Processes and Bayesian Inference · Bayesian Methods and Mixture Models · Markov Chains and Monte Carlo Methods

Full text

Kernel quadrature with DPPs

Ayoub Belhadji, Rémi Bardenet, Pierre Chainais

Univ. Lille, CNRS, Centrale Lille, UMR 9189 - CRIStAL, Villeneuve d’Ascq, France

{ayoub.belhadji, remi.bardenet, pierre.chainais}@univ-lille.fr


1 Introduction

Numerical integration [15] is an important tool for Bayesian methods [54] and model-based machine learning [45]. Formally, numerical integration consists in approximating

$$\int_{\mathcal{X}}f(x)g(x)\,\mathrm{d}\omega(x)\approx\sum_{j\in[N]}w_{j}f(x_{j}),\tag{1}$$

where $\mathcal{X}$ is a topological space, $\mathrm{d}\omega$ is a Borel probability measure on $\mathcal{X}$, $g$ is a square integrable function, and $f$ is a function belonging to a space to be specified. In the quadrature formula (1), the $N$ points $x_{1},\dots,x_{N}\in\mathcal{X}$ are called the quadrature nodes, and $w_{1},\dots,w_{N}$ the corresponding weights.

The accuracy of a quadrature rule is assessed by the quadrature error, i.e., the absolute difference between the left-hand side and the right-hand side of (1). Classical Monte Carlo algorithms, like importance sampling or Markov chain Monte Carlo [55], pick the nodes as either independent samples or a sample from a Markov chain on $\mathcal{X}$, and all achieve a root mean square quadrature error in $\mathcal{O}(1/\sqrt{N})$. Quasi-Monte Carlo quadrature [16] is based on deterministic, low-discrepancy sequences of nodes, and typical error rates for $\mathcal{X}=\mathbb{R}^{d}$ are $\mathcal{O}(\log^{d}N/N)$. Recently, kernels have been used to derive quadrature rules such as herding [2, 10], Bayesian quadrature [26, 49], sophisticated control variates [37, 47], and leverage-score quadrature [1], under the assumption that $f$ lies in an RKHS. The main theoretical advantage is that the resulting error rates are faster than classical Monte Carlo and adapt to the smoothness of $f$.

In this paper, we propose a new quadrature rule for functions in a given RKHS. Our nearest scientific neighbour is [1], but instead of sampling nodes independently, we leverage dependence and use a repulsive distribution called a projection determinantal point process (DPP), while the weights are obtained through a simple quadratic optimization problem. DPPs were originally introduced by [38] as probabilistic models for beams of fermions in quantum optics. Since then, DPPs have been thoroughly studied in random matrix theory [28], and have more recently been adopted in machine learning [34] and Monte Carlo methods [3].

In practice, a projection DPP is defined through a reference measure $\mathrm{d}\omega$ and a repulsion kernel $\operatorname*{\mathfrak{K}}$ . In our approach, the repulsion kernel is a modification of the underlying RKHS kernel. This ensures that sampling is tractable, and, as we shall see, that the expected value of the quadrature error is controlled by the decay of the eigenvalues of the integration operator associated to the measure $\mathrm{d}\omega$ . Note that quadratures based on projection DPPs have already been studied in the literature: implicitly in [29, Corollary 2.3] in the simple case where $\mathcal{X}=[0,1]$ and $\mathrm{d}\omega$ is the uniform measure, and in [3] for $[0,1]^{d}$ and more general measures. In the latter case, the quadrature error is asymptotically of order $N^{-1/2-1/2d}$ [3], with $f$ essentially $\mathcal{C}^{1}$ . In the current paper, we leverage the smoothness of the integrand to improve the convergence rate of the quadrature in general spaces $\mathcal{X}$ .

This article is organized as follows. Section 2 reviews kernel-based quadrature. In Section 3, we recall some basic properties of projection DPPs. Section 4 is devoted to the exposition of our main result, along with a sketch of proof. We give precise pointers to the supplementary material for missing details. Finally, in Section 5 we illustrate our result and compare to related work using numerical simulations, for the uniform measure in $d=1$ and $2$ , and the Gaussian measure on $\mathbb{R}$ .

Notation.

Let $\mathcal{X}$ be a topological space equipped with a Borel measure $\mathrm{d}\omega$ and assume that the support of $\mathrm{d}\omega$ is $\mathcal{X}$ . Let $\mathbb{L}_{2}(\mathrm{d}\omega)$ be the Hilbert space of square integrable, real-valued functions defined on $\mathcal{X}$ , with the usual inner product denoted by $\langle\cdot,\cdot\rangle_{\mathrm{d}\omega}$ , and the associated norm by $\|.\|_{\mathrm{d}\omega}$ . Let $k:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}_{+}$ be a symmetric and continuous function such that, for any finite set of points in $\mathcal{X}$ , the matrix of pairwise kernel evaluations is positive semi-definite. Denote by $\mathcal{F}$ the associated reproducing kernel Hilbert space (RKHS) of real-valued functions [5]. We assume that $x\mapsto k(x,x)$ is integrable with respect to the measure $\mathrm{d}\omega$ so that $\mathcal{F}\subset\mathbb{L}_{2}(\mathrm{d}\omega)$ . Define the integral operator

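$$\bm{\Sigma}f(\cdot)=\int_{\mathcal{X}}k(\cdot,y)f(y)\,\mathrm{d}\omega(y),\quad f\in\mathbb{L}_{2}(\mathrm{d}\omega).$$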

By construction, $\bm{\Sigma}$ is self-adjoint, positive semi-definite, and trace-class [56]. For $m\in\mathbb{N}^{*}$, denote by $e_{m}$ the $m$-th eigenfunction of $\bm{\Sigma}$, normalized so that $\|e_{m}\|_{\mathrm{d}\omega}=1$, and by $\sigma_{m}$ the corresponding eigenvalue. The integrability of the diagonal $x\mapsto k(x,x)$ implies that $\mathcal{F}$ is compactly embedded in $\mathbb{L}_{2}(\mathrm{d}\omega)$, that is, the identity map $I_{\mathcal{F}}:\mathcal{F}\longrightarrow\mathbb{L}_{2}(\mathrm{d}\omega)$ is compact; moreover, since $\mathrm{d}\omega$ is of full support in $\mathcal{X}$, $I_{\mathcal{F}}$ is injective [61]. This implies a Mercer-type decomposition of $k$,

$$k(x,y)=\sum_{m\in\mathbb{N}^{*}}\sigma_{m}e_{m}(x)e_{m}(y),$$

where $\mathbb{N}^{*}=\mathbb{N}\smallsetminus\{0\}$ and the convergence is point-wise [62]. Moreover, for $m\in\mathbb{N}^{*}$ , we write $e_{m}^{\mathcal{F}}=\sqrt{\sigma_{m}}e_{m}$ . Since $I_{\mathcal{F}}$ is injective [62], $(e_{m}^{\mathcal{F}})_{m\in\mathbb{N}^{*}}$ is an orthonormal basis of $\mathcal{F}$ . Unless explicitly stated, we assume that $\mathcal{F}$ is dense in $\mathbb{L}_{2}(\mathrm{d}\omega)$ , so that $(e_{m})_{m\in\mathbb{N}^{*}}$ is an orthonormal basis of $\mathbb{L}_{2}(\mathrm{d}\omega)$ . For more intuition, under these assumptions, $f\in\mathcal{F}$ if and only if $\sum_{m}\sigma_{m}^{-1}\langle f,e_{m}\rangle_{\mathbb{L}_{2}(\mathrm{d}\omega)}^{2}$ converges.

2 Related work on kernel-based quadrature

When the integrand $f$ belongs to the RKHS $\mathcal{F}$ of kernel $k$ [12], the quadrature error reads [57]

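$$\int_{\mathcal{X}}f(x)g(x)\,\mathrm{d}\omega(x)-\sum_{j\in[N]}w_{j}f(x_{j})\leq\|f\|_{\mathcal{F}}\,\Big\|\mu_{g}-\sum_{j\in[N]}w_{j}k(x_{j},.)\Big\|_{\mathcal{F}},\tag{2}$$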

where $\mu_{g}=\int_{\mathcal{X}}g(x)k(x,.)\mathrm{d}\omega(x)$ is the so-called mean element [17, 44]. A tight approximation of the mean element by a linear combination of functions $k(x_{j},.)$ thus guarantees low quadrature error. The approaches described in this section differ by their choice of nodes and weights.

2.1 Bayesian quadrature and the design of nodes

Bayesian quadrature [35] initially considered a fixed set of nodes and put a Gaussian process prior on the integrand $f$. Then, the weights were chosen to minimize the posterior variance of the integral of $f$. If the kernel of the Gaussian process is chosen to be $k$, this amounts to minimizing the RHS of (2). The case of the Gaussian reference measure was later investigated in detail [49], while parametric integrands were considered in [43]. Rates of convergence were provided in [9] for specific kernels on compact spaces, under a fill-in condition [65], which expresses that the nodes must progressively fill up the (compact) space.

Finding the weights that optimize the RHS of (2) for a fixed set of nodes is a relatively simple task (see Section 4.1), the cost of which can even be reduced using symmetries of the set of nodes [27, 32]. Jointly optimizing over the nodes and weights, however, is only possible in specific cases [7, 30]. In general, this corresponds to a non-convex problem with many local minima [23, 48]. While [52] proposed to sample i.i.d. nodes from the reference measure $\mathrm{d}\omega$, greedy minimization approaches have also been proposed [26, 48]. In particular, kernel herding [10] corresponds to uniform weights and greedily minimizing the RHS in (2). This leads to a fast rate in $\mathcal{O}(1/N)$, but only when the integrand is in a finite-dimensional RKHS. Kernel herding and similar forms of sequential Bayesian quadrature are actually linked to the Frank-Wolfe algorithm [2, 8, 26]. Besides the difficulty of proving fast convergence rates, these greedy approaches still require heuristics in practice.

2.2 Leverage-score quadrature

In [1], the author proposed to sample the nodes $(x_{j})$ i.i.d. from some proposal distribution $q$ , and then pick weights $\hat{\bm{w}}$ in (1) that solve the optimization problem

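$$\min_{w\in\mathbb{R}^{N}}\Big\|\mu_{g}-\sum_{j\in[N]}\frac{w_{j}}{q(x_{j})^{1/2}}k(x_{j},.)\Big\|_{\mathcal{F}}^{2}+\lambda N\|w\|_{2}^{2},\tag{5}$$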

for some regularization parameter $\lambda>0$ . Proposition 1 gives a bound on the resulting approximation error of the mean element for a specific choice of proposal pdf, namely the leverage scores

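$$q_{\lambda}^{*}(x)\propto\langle k(x,.),\bm{\Sigma}^{-1/2}(\bm{\Sigma}+\lambda\bm{I}_{\mathbb{L}_{2}(\mathrm{d}\omega)})^{-1}\bm{\Sigma}^{-1/2}k(x,.)\rangle_{\mathbb{L}_{2}(\mathrm{d}\omega)}=\sum_{m\in\mathbb{N}}\frac{\sigma_{m}}{\sigma_{m}+\lambda}e_{m}(x)^{2}.\tag{6}$$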

Proposition 1 (Proposition 2 in [1]).

Let $\delta\in[0,1]$ , and $d_{\lambda}=\operatorname{Tr}\bm{\Sigma}(\bm{\Sigma}+\lambda\bm{I})^{-1}$ . Assume that $N\geq 5d_{\lambda}\log(16d_{\lambda}/\delta)$ , then

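$$\mathbb{P}\bigg(\sup_{\|g\|_{\mathrm{d}\omega}\leq 1}\inf_{\|\bm{w}\|^{2}\leq\frac{4}{N}}\Big\|\mu_{g}-\sum_{j\in[N]}\frac{w_{j}}{q_{\lambda}(x_{j})^{1/2}}k(x_{j},.)\Big\|_{\mathcal{F}}^{2}\leq 4\lambda\bigg)\geq 1-\delta.$$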

In other words, Proposition 1 gives a uniform control on the approximation error of $\mu_{g}$ by the subspace spanned by the $k(x_{j},.)$ for $g$ belonging to the unit ball of $\mathbb{L}_{2}(\mathrm{d}\omega)$, where the $(x_{j})$ are sampled i.i.d. from $q_{\lambda}^{*}$. The required number of nodes is $\mathcal{O}(d_{\lambda}\log d_{\lambda})$ for a given approximation error $\lambda$. However, for fixed $\lambda$, the approximation error in Proposition 1 does not go to zero when $N$ increases. One theoretical workaround is to make $\lambda=\lambda(N)$ decrease with $N$. However, the coupling of $N$ and $\lambda$ through $d_{\lambda}$ makes it very intricate to derive a convergence rate from Proposition 1. Moreover, the optimal density $q_{\lambda}^{*}$ is in general only available as the limit (6), which makes sampling and evaluation difficult. Finally, we note that Proposition 1 highlights the fundamental role played by the spectral decomposition of the operator $\bm{\Sigma}$ in designing and analyzing kernel quadrature rules.
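To make the coupling between $N$ and $\lambda$ concrete, here is a minimal sketch that evaluates $d_{\lambda}=\sum_{m}\sigma_{m}/(\sigma_{m}+\lambda)$ and the sample-size condition of Proposition 1. It assumes a geometric spectrum $\sigma_{m}=\beta\alpha^{m}$ chosen for illustration; the function names and the truncation tolerance `tol` are hypothetical and not taken from [1].

```python
import math

def degrees_of_freedom(alpha, beta, lam, tol=1e-16):
    """d_lambda = sum_{m>=1} sigma_m / (sigma_m + lam) with sigma_m = beta * alpha**m.

    The series is truncated once a term falls below `tol` (illustrative choice).
    """
    total, m = 0.0, 1
    while True:
        sigma = beta * alpha ** m
        term = sigma / (sigma + lam)
        total += term
        if term < tol:
            return total
        m += 1

def required_nodes(alpha, beta, lam, delta):
    """Smallest integer N satisfying N >= 5 d_lambda log(16 d_lambda / delta)."""
    d = degrees_of_freedom(alpha, beta, lam)
    return math.ceil(5.0 * d * math.log(16.0 * d / delta))

# The required number of nodes grows as the target error lambda shrinks.
for lam in (1e-1, 1e-2, 1e-4):
    print(lam, required_nodes(0.5, 1.0, lam, delta=0.1))
```

Under these assumptions, the printed sample sizes increase as $\lambda$ decreases, which illustrates why a rate in $N$ alone is hard to read off Proposition 1.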

3 Projection determinantal point processes

Let $N\in\mathbb{N}^{*}$ and let $(\psi_{n})_{n\in[N]}$ be an orthonormal family of $\mathbb{L}_{2}(\mathrm{d}\omega)$, and assume for simplicity that $\mathcal{X}\subset\mathbb{R}^{d}$ and that $\mathrm{d}\omega$ has density $\omega$ with respect to the Lebesgue measure. Define the repulsion kernel

$$\operatorname*{\mathfrak{K}}(x,y)=\sum_{n\in[N]}\psi_{n}(x)\psi_{n}(y),$$

not to be mistaken for the RKHS kernel $k$ . One can show [25, Lemma 21] that

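$$\frac{1}{N!}\operatorname{Det}\big(\operatorname*{\mathfrak{K}}(x_{i},x_{j})\big)_{i,j\in[N]}\prod_{i\in[N]}\omega(x_{i})\tag{9}$$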

is a probability density over $\mathcal{X}^{N}$. When $x_{1},\dots,x_{N}$ have distribution (9), the set $\bm{x}=\{x_{1},\dots,x_{N}\}$ is said to be a projection DPP with reference measure $\mathrm{d}\omega$ and kernel $\operatorname*{\mathfrak{K}}$ (in the finite case, more common in ML, projection DPPs are also called elementary DPPs [34]). Note that the kernel $\operatorname*{\mathfrak{K}}$ is a positive definite kernel so that the determinant in (9) is non-negative. Equation (9) is key to understanding DPPs. First, loosely speaking, the probability of seeing a point of $\bm{x}$ in an infinitesimal volume around $x_{1}$ is $\operatorname*{\mathfrak{K}}(x_{1},x_{1})\omega(x_{1})\mathrm{d}x_{1}$. Note that when $d=1$ and $(\psi_{n})$ are the family of orthonormal polynomials with respect to $\mathrm{d}\omega$, this marginal probability is related to the optimal proposal $q_{\lambda}^{*}$ in Section 2.2; see Appendix E.2. Second, the probability of simultaneously seeing a point of $\bm{x}$ in an infinitesimal volume around $x_{1}$ and one around $x_{2}$ is

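$$\Big[\operatorname*{\mathfrak{K}}(x_{1},x_{1})\operatorname*{\mathfrak{K}}(x_{2},x_{2})-\operatorname*{\mathfrak{K}}(x_{1},x_{2})^{2}\Big]\omega(x_{1})\omega(x_{2})\,\mathrm{d}x_{1}\mathrm{d}x_{2}\leq\Big[\operatorname*{\mathfrak{K}}(x_{1},x_{1})\omega(x_{1})\mathrm{d}x_{1}\Big]\Big[\operatorname*{\mathfrak{K}}(x_{2},x_{2})\omega(x_{2})\mathrm{d}x_{2}\Big].$$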

The probability of co-occurrence is thus always smaller than that of a Poisson process with the same intensity. In this sense, a projection DPP with symmetric kernel is a repulsive distribution, and $\operatorname*{\mathfrak{K}}$ encodes its repulsiveness.

One advantage of DPPs is that they can be sampled exactly. Because of the orthonormality of $(\psi_{n})$ , one can write the chain rule for (9); see [25]. Sampling each conditional in turn, using e.g. rejection sampling [55], then yields an exact sampling algorithm. Rejection sampling aside, the cost of this algorithm is cubic in $N$ without further assumptions on the kernel. Simplifying assumptions can take many forms. In particular, when $d=1$ , and $\omega$ is a Gaussian, gamma [19], or beta [33] pdf, and $(\psi_{n})$ are the orthonormal polynomials with respect to $\omega$ , the corresponding DPP can be sampled by tridiagonalizing a matrix with independent entries, which takes the cost to $\mathcal{O}(N^{2})$ and bypasses the need for rejection sampling. For further information on DPPs see [28, 59].
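The chain-rule sampler described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it takes $\mathrm{d}\omega$ uniform on $[0,1]$, uses the real Fourier family as $(\psi_{n})$, and draws each conditional by rejection sampling from a uniform proposal. The helper names `fourier_features` and `sample_projection_dpp` are hypothetical. Each conditional density is proportional to the squared norm of the part of the feature vector $(\psi_{n}(x))_{n}$ that is orthogonal to the feature vectors of the nodes already drawn.

```python
import numpy as np

def fourier_features(x, N):
    """Evaluate the first N real Fourier functions (orthonormal on [0, 1]) at x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    feats = np.empty((x.size, N))
    for n in range(N):
        m = (n + 1) // 2  # frequencies 0, 1, 1, 2, 2, ...
        if n == 0:
            feats[:, n] = 1.0
        elif n % 2 == 1:
            feats[:, n] = np.sqrt(2.0) * np.cos(2.0 * np.pi * m * x)
        else:
            feats[:, n] = np.sqrt(2.0) * np.sin(2.0 * np.pi * m * x)
    return feats

def sample_projection_dpp(N, rng):
    """Draw N nodes one at a time, each by rejection sampling on [0, 1]."""
    basis = np.zeros((0, N))  # orthonormal rows spanning the accepted feature vectors
    nodes = []
    bound = N + 1.0           # sum_n psi_n(x)^2 <= N + 1 for this Fourier family
    while len(nodes) < N:
        x = rng.uniform()
        phi = fourier_features(x, N)[0]
        residual = phi - basis.T @ (basis @ phi)  # component orthogonal to previous nodes
        if rng.uniform() * bound < residual @ residual:
            nodes.append(x)
            basis = np.vstack([basis, residual / np.linalg.norm(residual)])
    return np.array(nodes)

rng = np.random.default_rng(0)
nodes = sample_projection_dpp(5, rng)
```

The Gram-Schmidt update keeps the cost of each proposal linear in the number of accepted nodes; the acceptance rate of the uniform proposal drops for the last nodes, which is the price of not tailoring the proposal.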

4 Kernel quadrature with projection DPPs

We follow in the footsteps of [1], see Section 2.2, but using a projection DPP rather than independent sampling to obtain the nodes. In a nutshell, we consider nodes $(x_{j})_{j\in[N]}$ that are drawn from the projection DPP with reference measure $\mathrm{d}\omega$ and repulsion kernel

$$\operatorname*{\mathfrak{K}}(x,y)=\sum_{n\in[N]}e_{n}(x)e_{n}(y),\tag{10}$$

where we recall that $(e_{n})$ are the normalized eigenfunctions of the integral operator $\bm{\Sigma}$ . The weights $\bm{w}$ are obtained by solving the optimization problem

$$\min_{\bm{w}\in\mathbb{R}^{N}}\|\mu_{g}-\bm{\Phi}\bm{w}\|_{\mathcal{F}}^{2},\tag{11}$$

where

$$\bm{\Phi}:(w_{j})_{j\in[N]}\mapsto\sum_{j\in[N]}w_{j}k(x_{j},.)$$

is the reconstruction operator (it depends on the nodes $x_{j}$, although our notation does not reflect it, for simplicity). In Section 4.1 we prove that (11) almost surely has a unique solution $\hat{\bm{w}}$ and state our main result, an upper bound on the expected approximation error $\|\mu_{g}-\bm{\Phi}\hat{\bm{w}}\|^{2}_{\mathcal{F}}$ under the proposed projection DPP. Section 4.2 gives a sketch of the proof of this bound.

4.1 Main result

Assuming that nodes $(x_{j})_{j\in[N]}$ are known, we first need to solve the optimization problem (11) that relates to problem (5) without regularization ( $\lambda=0$ ). Let $\bm{x}=(x_{1},\dots,x_{N})\in\mathcal{X}^{N}$ , then

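$$\|\mu_{g}-\bm{\Phi}\bm{w}\|_{\mathcal{F}}^{2}=\|\mu_{g}\|_{\mathcal{F}}^{2}-2\bm{w}^{\intercal}\mu_{g}(x_{j})_{j\in[N]}+\bm{w}^{\intercal}\bm{K}(\bm{x})\bm{w},\tag{13}$$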

where $\bm{K}(\bm{x})=(k(x_{i},x_{j}))_{i,j\in[N]}$ . The right-hand side of (13) is quadratic in $\bm{w}$ , so that the optimization problem (11) admits a unique solution $\hat{\bm{w}}$ if and only if $\bm{K}(\bm{x})$ is invertible. In this case, the solution is given by $\hat{\bm{w}}=\bm{K}(\bm{x})^{-1}\mu_{g}(x_{j})_{j\in[N]}$ . A sufficient condition for the invertibility of $\bm{K}(\bm{x})$ is given in the following proposition.
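As a small illustration of the closed-form weights $\hat{\bm{w}}=\bm{K}(\bm{x})^{-1}\mu_{g}(x_{j})_{j\in[N]}$, the sketch below solves the linear system and evaluates the quadratic in $\bm{w}$ given above. It assumes the periodic Sobolev kernel with $s=1$, written through the Bernoulli polynomial $B_{2}$, uniform $\mathrm{d}\omega$ on $[0,1]$ and $g\equiv 1$ (so that $\mu_{g}\equiv 1$ and $\|\mu_{g}\|_{\mathcal{F}}^{2}=1$); the nodes are i.i.d. uniform purely for demonstration, and all function names are hypothetical.

```python
import numpy as np

def sobolev_kernel_s1(x, y):
    """k_1(x, y) = 1 + sum_{m>=1} cos(2 pi m (x - y)) / m^2 = 1 + pi^2 B_2({x - y})."""
    t = np.mod(np.subtract.outer(x, y), 1.0)
    return 1.0 + np.pi ** 2 * (t * t - t + 1.0 / 6.0)

def optimal_weights(nodes, kernel, mean_element_at_nodes):
    """Solve K(x) w = (mu_g(x_j))_j, the normal equations of the quadratic in w."""
    K = kernel(nodes, nodes)
    return np.linalg.solve(K, mean_element_at_nodes), K

def squared_error(weights, K, mean_element_at_nodes, mean_element_sq_norm):
    """||mu_g||_F^2 - 2 w^T mu_g(x) + w^T K(x) w."""
    return (mean_element_sq_norm
            - 2.0 * weights @ mean_element_at_nodes
            + weights @ K @ weights)

rng = np.random.default_rng(0)
nodes = rng.uniform(size=10)   # demonstration nodes, not a DPP sample
mu = np.ones(10)               # g = 1 gives mu_g = 1 for this kernel
w, K = optimal_weights(nodes, sobolev_kernel_s1, mu)
err = squared_error(w, K, mu, 1.0)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a standard numerical choice; for ill-conditioned kernel matrices a small regularization would be needed, which is outside this sketch.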

Proposition 2.

Assume that the matrix $\bm{E}(\bm{x})=(e_{i}(x_{j}))_{i,j\in[N]}$ is invertible. Then $\bm{K}(\bm{x})$ is invertible.

The proof of Proposition 2 is given in Appendix D.1. Since the pdf (9) of the projection DPP with kernel (10) is proportional to $\operatorname{Det}^{2}\bm{E}(\bm{x})$ , the following corollary immediately follows.

Corollary 1.

Let $\bm{x}=\{x_{1},\dots,x_{N}\}$ be a projection DPP with reference measure $\mathrm{d}\omega$ and kernel (10). Then $\bm{K}(\bm{x})$ is a.s. invertible, so that (11) has a unique solution $\hat{\bm{w}}=\bm{K}(\bm{x})^{-1}\mu_{g}(x_{j})_{j\in[N]}$ a.s.

We now give our main result that uses nodes $(x_{j})_{j\in[N]}$ drawn from a well-chosen projection DPP.

Theorem 1.

Let $\bm{x}=\{x_{1},\dots,x_{N}\}$ be a projection DPP with reference measure $\mathrm{d}\omega$ and kernel (10). Let $\hat{\bm{w}}$ be the unique solution to (11) and define $\displaystyle\|g\|_{\mathrm{d}\omega,1}=\sum\limits_{n\in[N]}|\langle e_{n},g\rangle_{\mathrm{d}\omega}|$. Assume that $\|g\|_{\mathrm{d}\omega}\leq 1$ and define $r_{N}=\sum\limits_{m\geq N+1}\sigma_{m}$. Then

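$$\mathbb{E}_{\mathrm{DPP}}\|\mu_{g}-\bm{\Phi}\hat{\bm{w}}\|_{\mathcal{F}}^{2}\leq 2\sigma_{N+1}+2\|g\|_{\mathrm{d}\omega,1}^{2}\bigg(Nr_{N}+\sum_{\ell=2}^{N}\frac{\sigma_{1}}{\ell!^{2}}\Big(\frac{Nr_{N}}{\sigma_{1}}\Big)^{\ell}\bigg).\tag{14}$$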

In particular, if $Nr_{N}=o(1)$ , then the right-hand side of (14) is $Nr_{N}+o(Nr_{N})$ . For example, take $\mathcal{X}=[0,1]$ , $\mathrm{d}\omega$ the uniform measure on $\mathcal{X}$ , and $\mathcal{F}$ the $s$ -Sobolev space, then $\sigma_{m}=m^{-2s}$ [5]. Now, if $s>1$ , the expected worst case quadrature error is bounded by $Nr_{N}=\mathcal{O}(N^{2-2s})=o(1)$ . Another example is the case of the Gaussian measure on $\mathcal{X}=\mathbb{R}$ , with the Gaussian kernel. In this case $\sigma_{m}=\beta\alpha^{m}$ with $0<\alpha<1$ and $\beta>0$ [53] so that $Nr_{N}=N\frac{\beta}{1-\alpha}\alpha^{N+1}=o(1)$ .
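The behaviour of the bound for a geometric spectrum can be checked numerically. The sketch below evaluates the right-hand side of (14) for $\sigma_{m}=\beta\alpha^{m}$, using the closed form $r_{N}=\beta\alpha^{N+1}/(1-\alpha)$ quoted above. It reads the factor $\ell!^{2}$ as $(\ell!)^{2}$, which is an assumption about the notation, and the function names are hypothetical.

```python
def tail_sum_geometric(alpha, beta, N):
    """r_N = sum_{m >= N+1} beta * alpha**m = beta * alpha**(N+1) / (1 - alpha)."""
    return beta * alpha ** (N + 1) / (1.0 - alpha)

def theorem1_bound(alpha, beta, N, g_l1_norm):
    """Right-hand side of (14) for sigma_m = beta * alpha**m (l!^2 read as (l!)^2)."""
    sigma_1 = beta * alpha
    sigma_next = beta * alpha ** (N + 1)          # sigma_{N+1}
    ratio = N * tail_sum_geometric(alpha, beta, N) / sigma_1
    term = sigma_1 * ratio                         # l = 1 term, equal to N * r_N
    total = term
    for l in range(2, N + 1):
        term *= ratio / (l * l)                    # sigma_1 * ratio**l / (l!)**2
        total += term
    return 2.0 * sigma_next + 2.0 * g_l1_norm ** 2 * total

for N in (5, 10, 20, 40):
    print(N, theorem1_bound(0.5, 1.0, N, g_l1_norm=1.0))
```

For large $N$ the sum over $\ell\geq 2$ is negligible and the bound is dominated by the term proportional to $Nr_{N}$, consistent with the discussion above.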

We have assumed that $\mathcal{F}$ is dense in $\mathbb{L}_{2}(\mathrm{d}\omega)$ but Theorem 1 is valid also when $\mathcal{F}$ is finite-dimensional. In this case, denote $N_{0}=\dim\mathcal{F}$ . Then, for $n>N_{0}$ , $\sigma_{n}=0$ and $r_{N_{0}}=0$ , so that (14) implies

$$\|\mu_{g}-\bm{\Phi}\hat{\bm{w}}\|_{\mathcal{F}}^{2}=0\quad\text{a.s.}$$

This compares favourably with herding with uniform weights, for instance, which comes with a rate in $\mathcal{O}(\frac{1}{N})$ [2, 10].

The constant $\|g\|_{\mathrm{d}\omega,1}$ in (14) is the $\ell_{1}$ norm of the coefficients of projection of $g$ onto $\operatorname{\mathrm{Span}}(e_{n})_{n\in[N]}$ in $\mathbb{L}_{2}(\mathrm{d}\omega)$. For example, for $g=e_{n}$, $\|g\|_{\mathrm{d}\omega,1}=1$ if $n\in[N]$ and $\|g\|_{\mathrm{d}\omega,1}=0$ if $n\geq N+1$. In the worst case, $\|g\|_{\mathrm{d}\omega,1}\leq\sqrt{N}\|g\|_{\mathrm{d}\omega}\leq\sqrt{N}$. Thus, we can obtain a uniform bound for $\|g\|_{\mathrm{d}\omega}\leq 1$ in the spirit of Proposition 1, but with a supplementary factor $N$ in the upper bound in (14).

4.2 Bounding the approximation error under the DPP

In this section, we give the skeleton of the proof of Theorem 1, referring to the appendices for technical details. The proof is in two steps. First, we give an upper bound for the approximation error $\|\mu_{g}-\bm{\Phi}\hat{\bm{w}}\|^{2}_{\mathcal{F}}$ that involves the maximal principal angle between the functional subspaces of $\mathcal{F}$

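$$\mathcal{E}_{N}^{\mathcal{F}}=\operatorname{\mathrm{Span}}(e_{n}^{\mathcal{F}})_{n\in[N]}\quad\text{and}\quad\mathcal{T}(\bm{x})=\operatorname{\mathrm{Span}}(k(x_{j},.))_{j\in[N]}.$$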

DPPs allow closed form expressions for the expectation of trigonometric functions of such angles; see [4] and Appendix E.1 for the geometric intuition behind the proof. The second step thus consists in developing the expectation of the bound under the DPP.

4.2.1 Bounding the approximation error using principal angles

Let $\bm{x}=(x_{1},\dots,x_{N})\in\mathcal{X}^{N}$ be such that $\operatorname{Det}\bm{E}(\bm{x})\neq 0$ . By Proposition 2, $\bm{K}(\bm{x})$ is non singular and $\dim\mathcal{T}(\bm{x})=N$ . The optimal approximation error writes

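$$\|\mu_{g}-\bm{\Phi}\hat{\bm{w}}\|_{\mathcal{F}}^{2}=\|\mu_{g}-\bm{\Pi}_{\mathcal{T}(\bm{x})}\mu_{g}\|_{\mathcal{F}}^{2},\tag{16}$$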

where $\bm{\Pi}_{\mathcal{T}(\bm{x})}=\bm{\Phi}(\bm{\Phi}^{*}\bm{\Phi})^{-1}\bm{\Phi}^{*}$ is the orthogonal projection onto $\mathcal{T}(\bm{x})$ with $\bm{\Phi}^{*}$ the dual of $\bm{\Phi}$. For $\mu\in\mathcal{F}$, $\bm{\Phi}^{*}\mu=(\mu(x_{j}))_{j\in[N]}$, and $\bm{\Phi}^{*}\bm{\Phi}$ is an operator from $\mathbb{R}^{N}$ to $\mathbb{R}^{N}$ that can be identified with $\bm{K}(\bm{x})$.

In other words, (16) equates the approximation error to $\|\bm{\Pi}_{\mathcal{T}(\bm{x})^{\perp}}\mu_{g}\|^{2}_{\mathcal{F}}$ , where $\bm{\Pi}_{\mathcal{T}(\bm{x})^{\perp}}$ is the orthogonal projection onto $\mathcal{T}(\bm{x})^{\perp}$ . Now we have the following lemma.

Lemma 1.

Assume that $\|g\|_{\mathrm{d}\omega}\leq 1$ then $\|\bm{\Sigma}^{-1/2}\mu_{g}\|_{\mathcal{F}}\leq 1$ and

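$$\|\bm{\Pi}_{\mathcal{T}(\bm{x})^{\perp}}\mu_{g}\|^{2}_{\mathcal{F}}\leq 2\bigg(\sigma_{N+1}+\|g\|_{\mathrm{d}\omega,1}^{2}\max_{n\in[N]}\sigma_{n}\|\bm{\Pi}_{\mathcal{T}(\bm{x})^{\perp}}e_{n}^{\mathcal{F}}\|_{\mathcal{F}}^{2}\bigg).\tag{17}$$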

Now, to upper bound the right-hand side of (17), we note that $\sigma_{n}\|\bm{\Pi}_{\mathcal{T}(\bm{x})^{\perp}}e_{n}^{\mathcal{F}}\|_{\mathcal{F}}^{2}$ is the product of two terms: $\sigma_{n}$ is a decreasing function of $n$ while $\|\bm{\Pi}_{\mathcal{T}(\bm{x})^{\perp}}e_{n}^{\mathcal{F}}\|_{\mathcal{F}}^{2}$ is the interpolation error of the eigenfunction $e_{n}^{\mathcal{F}}$ , measured in the $\|.\|_{\mathcal{F}}$ norm. We can bound the latter interpolation error uniformly in $n\in[N]$ using the geometric notion of maximal principal angle between $\mathcal{T}(\bm{x})$ and $\mathcal{E}^{\mathcal{F}}_{N}=\operatorname{\mathrm{Span}}(e_{n}^{\mathcal{F}})_{n\in[N]}$ . This maximal principal angle is defined through its cosine

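$$\cos^{2}\theta_{N}(\mathcal{T}(\bm{x}),\mathcal{E}_{N}^{\mathcal{F}})=\inf_{\substack{u\in\mathcal{T}(\bm{x}),\,v\in\mathcal{E}_{N}^{\mathcal{F}}\\ \|u\|_{\mathcal{F}}=1,\,\|v\|_{\mathcal{F}}=1}}\langle u,v\rangle_{\mathcal{F}}.$$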

Similarly, we can define the $N$ principal angles $\theta_{n}(\mathcal{T}(\bm{x}),\mathcal{E}^{\mathcal{F}}_{N})\in\left[0,\frac{\pi}{2}\right]$ for $n\in[N]$ between the subspaces $\mathcal{E}^{\mathcal{F}}_{N}$ and $\mathcal{T}(\bm{x})$ . These angles quantify the relative position of the two subspaces. See Appendix C.3 for more details about principal angles. Now, we have the following lemma.
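For intuition, principal angles between two finite-dimensional subspaces can be computed with a standard SVD-based recipe: orthonormalize a basis of each subspace and take the singular values of the cross-Gram matrix, which are the cosines of the principal angles. The sketch below uses $\mathbb{R}^{8}$ with the Euclidean inner product as a stand-in for $\mathcal{F}$, and random vectors as a stand-in for the functions $k(x_{j},.)$; both are assumptions made purely for illustration, and the names are hypothetical.

```python
import numpy as np

def principal_cosines(A, B):
    """Cosines of the principal angles between the column spans of A and B."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # Singular values are returned in decreasing order.
    return np.linalg.svd(Qa.T @ Qb, compute_uv=False)

rng = np.random.default_rng(1)
d, N = 8, 3
E_basis = np.eye(d)[:, :N]             # orthonormal vectors standing in for (e_n^F)
T_basis = rng.standard_normal((d, N))  # arbitrary basis standing in for (k(x_j, .))
cosines = principal_cosines(T_basis, E_basis)
cos_max_angle = cosines[-1]            # smallest cosine, i.e. largest principal angle

# Squared norm of the projection of each e_n onto the orthogonal complement of T.
Qt, _ = np.linalg.qr(T_basis)
residuals = E_basis - Qt @ (Qt.T @ E_basis)
interp_errors = np.sum(residuals ** 2, axis=0)
```

In this Euclidean toy setting, the largest entry of `interp_errors` is at most `1 / cos_max_angle**2 - 1`, which mirrors the kind of control the maximal principal angle provides over the interpolation errors.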

Lemma 2.

Let $\bm{x}=(x_{1},\dots,x_{N})\in\mathcal{X}^{N}$ such that $\operatorname{Det}\bm{E}(\bm{x})\neq 0$ . Then

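$$\max_{n\in[N]}\|\bm{\Pi}_{\mathcal{T}(\bm{x})^{\perp}}e_{n}^{\mathcal{F}}\|_{\mathcal{F}}^{2}\leq\frac{1}{\cos^{2}\theta_{N}(\mathcal{T}(\bm{x}),\mathcal{E}_{N}^{\mathcal{F}})}-1\leq\prod_{n\in[N]}\frac{1}{\cos^{2}\theta_{n}(\mathcal{T}(\bm{x}),\mathcal{E}_{N}^{\mathcal{F}})}-1.\tag{19}$$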

To sum up, we have so far bounded the approximation error by the geometric quantity in the right-hand side of (19). Where projection DPPs shine is in taking expectations of such geometric quantities.

4.2.2 Taking the expectation under the DPP

The analysis in Section 4.2.1 is valid whenever $\operatorname{Det}\bm{E}(\bm{x})\neq 0$ . As seen in Corollary 1, this condition is satisfied almost surely when $\bm{x}$ is drawn from the projection DPP of Theorem 1. Furthermore, the expectation of the right-hand side of (19) can be written in terms of the eigenvalues of the kernel $k$ .

Proposition 3.

Let $\bm{x}$ be a projection DPP with reference measure $\mathrm{d}\omega$ and kernel (10). Then,

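$$\mathbb{E}_{\mathrm{DPP}}\prod_{n\in[N]}\frac{1}{\cos^{2}\theta_{n}\big(\mathcal{T}(\bm{x}),\mathcal{E}^{\mathcal{F}}_{N}\big)}=\sum_{\substack{T\subset\mathbb{N}^{*}\\ |T|=N}}\frac{\prod_{t\in T}\sigma_{t}}{\prod_{n\in[N]}\sigma_{n}}.\tag{20}$$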
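The expectation in Proposition 3 is a ratio between the elementary symmetric polynomial of degree $N$ in the eigenvalues and the product of the first $N$ eigenvalues. A minimal sketch of its evaluation, assuming the spectrum is truncated to finitely many eigenvalues (so the result is a lower approximation of the infinite sum), with hypothetical function names:

```python
import numpy as np

def elementary_symmetric(values, order):
    """Sum over all subsets of size `order` of the product of their entries."""
    e = np.zeros(order + 1)
    e[0] = 1.0
    for v in values:
        # Go from high to low degree so each value enters a subset at most once.
        for k in range(order, 0, -1):
            e[k] += v * e[k - 1]
    return e[order]

def expected_inverse_cos_product(sigma, N):
    """Sum over |T| = N of prod_{t in T} sigma_t, divided by prod_{n <= N} sigma_n.

    `sigma` is a finite truncation of the spectrum, sorted in decreasing order.
    For fast-decaying spectra and large N the product may underflow; a log-domain
    version would then be needed.
    """
    sigma = np.asarray(sigma, dtype=float)
    return elementary_symmetric(sigma, N) / np.prod(sigma[:N])

# Toy spectrum whose first three eigenvalues are equal.
value = expected_inverse_cos_product([1.0, 1.0, 1.0, 0.1, 0.01], N=3)
```

The subset $T=[N]$ always contributes $1$ to the ratio, so the value is at least $1$; the excess over $1$ is what the tail eigenvalues contribute.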

The bound of Proposition 3, once plugged into Lemma 2 and Lemma 1, already yields Theorem 1 in the special case where $\sigma_{1}=\dots=\sigma_{N}$. This seems a very restrictive condition, but Proposition 4 below shows that we can always reduce the analysis to that case. In fact, let the kernel $\tilde{k}$ be defined by

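$$\tilde{k}(x,y)=\sum_{n\in[N]}\sigma_{1}e_{n}(x)e_{n}(y)+\sum_{n\geq N+1}\sigma_{n}e_{n}(x)e_{n}(y)=\sum_{n\in\mathbb{N}^{*}}\tilde{\sigma}_{n}e_{n}(x)e_{n}(y),$$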

and let $\tilde{\mathcal{F}}$ be the corresponding RKHS. Then one has the following inequality.

Proposition 4.

Let $\tilde{\mathcal{T}}(\bm{x})=\operatorname{\mathrm{Span}}\left(\tilde{k}(x_{j},.)\right)_{j\in[N]}$ and $\bm{\Pi}_{\tilde{\mathcal{T}}(\bm{x})^{\perp}}$ the orthogonal projection onto $\tilde{\mathcal{T}}(\bm{x})^{\perp}$ in $(\tilde{\mathcal{F}},\langle.,.\rangle_{\tilde{\mathcal{F}}})$ . Then,

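$$\forall n\in[N],\quad\sigma_{n}\|\bm{\Pi}_{\mathcal{T}(\bm{x})^{\perp}}e_{n}^{\mathcal{F}}\|_{\mathcal{F}}^{2}\leq\sigma_{1}\|\bm{\Pi}_{\tilde{\mathcal{T}}(\bm{x})^{\perp}}e_{n}^{\tilde{\mathcal{F}}}\|_{\tilde{\mathcal{F}}}^{2}.$$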

Simply put, capping the first eigenvalues of $k$ yields a new kernel $\tilde{k}$ that captures the interaction between the terms $\sigma_{n}$ and $\|\bm{\Pi}_{\mathcal{T}(\bm{x})^{\perp}}e_{n}^{\mathcal{F}}\|_{\mathcal{F}}^{2}$ such that we only have to deal with the term $\|\bm{\Pi}_{\tilde{\mathcal{T}}(\bm{x})^{\perp}}e_{n}^{\tilde{\mathcal{F}}}\|_{\tilde{\mathcal{F}}}^{2}$ . Combining Proposition 3 with Proposition 4 applied to the kernel $\tilde{k}$ yields Theorem 1.

4.3 Discussion

We have arbitrarily introduced a product in the right-hand side of (19), which is a rather loose majorization. Our motivation is that the expected value of this symmetric quantity is tractable under the DPP. Getting rid of the product could make the bound much tighter. Intuitively, taking the upper bound in (20) to the power $1/N$ results in a term in $\mathcal{O}(r_{N})$ for the RKHS $\tilde{\mathcal{F}}$ . Improving the bound in (20) would require a de-symmetrization by comparing the maximum of the $1/\cos^{2}\theta_{\ell}(\mathcal{T}(\bm{x}),\mathcal{E}^{\mathcal{F}}_{N})$ to their geometric mean. An easier route than de-symmetrization could be to replace the product in (19) by a sum, but this is beyond the scope of this article.

In comparison with [1], we emphasize that the dependence of our bound on the eigenvalues of the kernel $k$ , via $r_{N}$ , is explicit. This is in contrast with Proposition 1, which depends on the eigenvalues of $\bm{\Sigma}$ through the degrees of freedom $d_{\lambda}$ , so that the necessary number of samples $N$ diverges when $\lambda\rightarrow 0$ . In contrast, our quadrature requires a finite number of points for $\lambda=0$ . It would be interesting to extend the analysis of our quadrature to the regime $\lambda>0$ .

5 Numerical simulations

5.1 The periodic Sobolev space and the Korobov space

Let $\mathrm{d}\omega$ be the uniform measure on $\mathcal{X}=[0,1]$ , and let the RKHS kernel be [5]

[TABLE]

so that $\mathcal{F}=\mathcal{F}_{s}$ is the Sobolev space of order $s$ on $[0,1]$ . Note that $k_{s}$ can be expressed in closed form using Bernoulli polynomials [64]. We take $g\equiv 1$ in (1), so that the mean element $\mu_{g}\equiv 1$ . We compare the following algorithms: $(i)$ the quadrature rule DPPKQ we propose in Theorem 1, $(ii)$ the quadrature rule DPPUQ based on the same projection DPP but with uniform weights, implicitly studied in [29], $(iii)$ the kernel quadrature rule (5) of [1], which we denote LVSQ for leverage score quadrature, with regularization parameter $\lambda\in\{0,0.1,0.2\}$ (note that the optimal proposal is $q_{\lambda}^{*}\equiv 1$ ), $(iv)$ herding with uniform weights [2, 10], $(v)$ sequential Bayesian quadrature (SBQ) [26] with regularization to avoid numerical instability, and $(vi)$ Bayesian quadrature on the uniform grid (UGBQ). We take $N\in[5,50]$ . Figures 1(a) and 1(b) show log-log plots of the worst case quadrature error w.r.t. $N$ , averaged over 50 samples for each point, for $s\in\{1,3\}$ .
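As a rough illustration of the quantity plotted in these experiments, the sketch below evaluates the squared worst-case error of a kernel quadrature with optimal (unregularized) weights for $s=1$ and $g\equiv 1$ . It assumes the closed form $k_{1}(x,y)=1+2\pi^{2}B_{2}(\{x-y\})$ with $B_{2}(t)=t^{2}-t+1/6$ , that is, $k_{1}(x,y)=1+2\sum_{m\geq 1}m^{-2}\cos(2\pi m(x-y))$ . Under this assumption $\mu_{g}\equiv 1$ and $\|\mu_{g}\|_{\mathcal{F}}^{2}=1$ . The function names are ours and are not taken from the authors' code.

```python
import numpy as np

def sobolev_kernel_s1(x, y):
    """Assumed closed form of k_1: 1 + 2*pi^2 * B_2({x - y})."""
    t = np.mod(x[:, None] - y[None, :], 1.0)
    return 1.0 + 2.0 * np.pi**2 * (t**2 - t + 1.0 / 6.0)

def squared_worst_case_error(nodes):
    """Return ||mu_g - sum_j w_j k(x_j, .)||_F^2 and the optimal weights.

    With g = 1, mu_g = 1 and ||mu_g||_F^2 = 1, so the squared error of the
    optimal weights w = K(x)^{-1} mu_g(x) equals 1 - mu_g(x)^T w.
    """
    K = sobolev_kernel_s1(nodes, nodes)
    mu = np.ones(len(nodes))
    weights = np.linalg.solve(K, mu)
    return 1.0 - mu @ weights, weights

def uniform_grid(n):
    """Nodes of the uniform grid on [0, 1), as used by UGBQ."""
    return np.arange(n) / n
```

On the uniform grid, the Gram matrix is circulant and the error can be computed by hand: it equals $c/(1+c)$ with $c=\pi^{2}/(3N^{2})$ , which is consistent with the $\mathcal{O}(N^{-2s})$ rate for $s=1$ . Replacing `uniform_grid(n)` by a DPP sample gives the DPPKQ counterpart.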

We observe that the approximation errors of the first four quadratures converge to $0$ with different rates. Both UGBQ and DPPKQ converge to $0$ with a rate of $\mathcal{O}(N^{-2s})$ , which indicates that our $\mathcal{O}(N^{2-2s})$ bound in Theorem 1 is not tight in the Sobolev case. Meanwhile, the rate of DPPUQ is $\mathcal{O}(N^{-2})$ across the values of $s$ considered: it does not adapt to the regularity of the integrands. This corresponds to the CLT proven in [29]. LVSQ without regularization converges to $0$ slightly slower than $\mathcal{O}(N^{-2s})$ . Increasing $\lambda$ further slows down convergence. Herding converges at an empirical rate of $\mathcal{O}(N^{-2})$ , which is faster than the rate $\mathcal{O}(N^{-1})$ predicted by the theoretical analysis in [2, 10]. SBQ is the only one that seems to plateau for $s=3$ , although it consistently has the best performance for low $N$ . Overall, in the Sobolev case, DPPKQ and UGBQ have the best convergence rate. UGBQ, known to be optimal in this case [7], has a better constant.

Now, for a multidimensional example, consider the “Korobov” kernel $k_{s}$ defined on $[0,1]^{d}$ by

[TABLE]

We still take $g\equiv 1$ in (1) so that $\mu_{g}\equiv 1$ . We compare $(i)$ our DPPKQ, $(ii)$ LVSQ without regularization ( $\lambda=0$ ), $(iii)$ the kernel quadrature based on the uniform grid UGBQ, $(iv)$ the kernel quadrature SGBQ based on the sparse grid from [58], $(v)$ the kernel quadrature based on the Halton sequence HaltonBQ [22]. We take $N\in[5,1000]$ and $s=1$ . The results are shown in Figure 1(c). This time, UGBQ suffers from the dimension with a rate in $\mathcal{O}(N^{-2s/d})$ , while DPPKQ, HaltonBQ and LVSQ $(\lambda=0)$ all perform similarly well. They scale as $\mathcal{O}((\log N)^{2s(d-1)}N^{-2s})$ , which is a tight upper bound on $\sigma_{N+1}$ , see [1] and Appendix B. SGBQ seems to lag slightly behind with a rate $\mathcal{O}((\log N)^{2(s+1)(d-1)}N^{-2s})$ [24, 58].

5.2 The Gaussian kernel

We now consider $\mathrm{d}\omega$ to be the Gaussian measure on $\mathcal{X}=\mathbb{R}$ along with the RKHS kernel $\displaystyle k_{\gamma}(x,y)=\exp[-(x-y)^{2}/2\gamma^{2}]$ , and again $g\equiv 1$ . Figure 1(d) compares the empirical performance of DPPKQ to the theoretical bound of Theorem 1, herding, crude Monte Carlo with i.i.d. sampling from $\mathrm{d}\omega$ , and sequential Bayesian quadrature, where we again average over $50$ samples. We take $N\in[5,50]$ and $\gamma=\frac{1}{2}$ . Note that, this time, only the $y$ -axis is on the log scale for better display, and that LVSQ is not plotted since we do not know how to sample from $q_{\lambda}$ in (6) in this case. We observe that the approximation error of DPPKQ converges to $0$ as $\mathcal{O}(\alpha^{N})$ , while the discussion below Theorem 1 led us to expect a slightly slower $\mathcal{O}(N\alpha^{N})$ . Herding improves slightly upon Monte Carlo, which converges as $\mathcal{O}(N^{-1})$ . As in the Sobolev case, the convergence of sequential Bayesian quadrature plateaus, even though it has the smallest error for small $N$ . We also conclude that DPPKQ is a close runner-up to SBQ for small $N$ and definitely takes the lead for large enough $N$ .

6 Conclusion

In this article, we proposed a quadrature rule for functions living in an RKHS. The nodes are drawn from a DPP tailored to the RKHS kernel, while the weights are the solution to a tractable, non-regularized optimization problem. We proved that the expected value of the squared worst case error is bounded by a quantity that depends on the eigenvalues of the integral operator associated to the RKHS kernel, thus preserving the natural feel and the generality of the bounds for kernel quadrature [1]. Key intermediate quantities further have clear geometric interpretations in the ambient RKHS. Experimental comparisons suggest that DPP quadrature compares favourably with existing kernel-based quadratures. In specific cases where an optimal quadrature is known, such as the uniform grid for 1D periodic Sobolev spaces, DPPKQ seems to have the optimal convergence rate. However, our generic error bound does not reflect this optimality in the Sobolev case, and must thus be sharpened.

We have discussed room for improvement in our proofs. Further work should also address exact sampling algorithms, which do not exist yet when the spectral decomposition of the integral operator is not known. Approximate algorithms would also suffice, as long as the error bound is preserved.

Acknowledgments

We acknowledge support from ANR grant BoB (ANR-16-CE23-0003) and région Hauts-de-France. We also thank Adrien Hardy and the reviewers for their detailed and insightful comments.

Appendix A Implementation details

In this section, we give details on the repulsion kernels in each example of the main paper, and explain how we sampled from the corresponding DPPs. In short, we relied on matrix models for univariate cases, and vanilla DPP sampling [25] for multivariate settings.

A.1 The one-dimensional periodic Sobolev space

Consider the kernel $k_{s}:[0,1]\times[0,1]\rightarrow\mathbb{R}_{+}$ defined by

[TABLE]

The Mercer decomposition of $k_{s}$ associated to the uniform measure $\mathrm{d}\omega$ on $[0,1]$ writes

[TABLE]

The corresponding repulsion kernel is

[TABLE]

if $N$ is even and

[TABLE]

if not. The projection DPP with kernel $\operatorname*{\mathfrak{K}}$ and reference measure $\mathrm{d}\omega$ can be sampled through a matrix model. Indeed this DPP is also the distribution of the arguments (normalized by $2\pi$ ) of the eigenvalues of a random unitary matrix drawn from the Haar measure on $\mathbb{U}_{N+1}$ [66]. Sampling such matrices can be done, e.g., using the QR decomposition of a matrix with i.i.d. unit complex Gaussians as coefficients [41].
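The matrix model above can be sketched in a few lines. The code below is an illustration under assumptions, not the authors' implementation: it draws a Haar-distributed unitary matrix by QR decomposition of a complex Gaussian matrix, with the usual phase correction on the diagonal of the triangular factor, and returns the eigen-angles rescaled to $[0,1)$ . The matrix size `n` is left as a parameter, and the function names are ours.

```python
import numpy as np

def sample_haar_unitary(n, rng):
    """Draw an n x n unitary matrix from the Haar measure via QR."""
    z = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2.0)
    q, r = np.linalg.qr(z)
    d = np.diagonal(r)
    # Multiply column j of Q by the phase of r_jj, which makes the
    # factorization unique and the law of Q exactly Haar.
    return q * (d / np.abs(d))

def sample_circular_nodes(n, rng):
    """Eigen-angles of a Haar unitary matrix, normalized by 2*pi and sorted."""
    u = sample_haar_unitary(n, rng)
    angles = np.angle(np.linalg.eigvals(u))
    return np.sort(np.mod(angles, 2.0 * np.pi) / (2.0 * np.pi))
```

For instance, `sample_circular_nodes(10, np.random.default_rng(0))` returns ten repulsive nodes on the unit interval.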

A.2 The one-dimensional Gaussian kernel

Let $k_{\gamma}:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}_{+}$ and the reference measure $\mathrm{d}\omega$ be defined by

[TABLE]

For notational convenience, we further let

[TABLE]

and

[TABLE]

Now, the Mercer decomposition of $k_{\gamma}$ reads [51]

[TABLE]

where

[TABLE]

and $H_{m}$ is the $m$ -th Hermite polynomial (i.e., orthonormal polynomials for the pdf of a unit Gaussian). Now, denote

[TABLE]

and the measure

[TABLE]

The rescaled polynomials $(\tilde{e}_{m})_{m\in\mathbb{N}}$ are orthonormal with respect to the measure $\mathrm{d}\tilde{\omega}$ . Moreover, for $x\in\mathbb{R}$ ,

[TABLE]

Thus, for $\bm{x}=(x_{i})_{i\in[N]}\in\mathbb{R}^{N}$ , we have

[TABLE]

In other words, the projection DPP associated to the orthonormal family $(e_{n})_{n\in[N]}$ and the reference measure $\mathrm{d}\omega$ is equivalent to the projection DPP associated to the orthonormal family $(\tilde{e}_{n})_{n\in[N]}$ and the reference measure $\mathrm{d}\tilde{\omega}$ . The latter DPP is known to be the distribution of the eigenvalues of a symmetrized matrix with i.i.d. Gaussian entries [39], which is easily implemented.
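A minimal sketch of this matrix model follows. It assumes that the symmetrized matrix is a Hermitian matrix from the Gaussian unitary ensemble, normalized so that its diagonal entries are standard Gaussians, which is the classical model for the projection DPP built on Hermite polynomials orthonormal for the unit Gaussian. The `scale` argument is a hypothetical parameter standing for the standard deviation of $\mathrm{d}\tilde{\omega}$ , which has to be read off from the definition above.

```python
import numpy as np

def sample_gue_eigenvalues(n, rng, scale=1.0):
    """Eigenvalues of an n x n Hermitian Gaussian matrix (assumed GUE model).

    The matrix H = (G + G^*) / 2, with G having i.i.d. complex entries whose
    real and imaginary parts are standard Gaussians, has N(0, 1) diagonal
    entries and off-diagonal entries with E|H_ij|^2 = 1.
    """
    g = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    h = (g + g.conj().T) / 2.0
    # eigvalsh returns real eigenvalues in ascending order
    return scale * np.linalg.eigvalsh(h)
```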

A.3 The case of a tensor product of RKHSs

We consider the case where $\mathcal{F}$ writes as a tensor product of RKHSs, with the associated kernel

[TABLE]

with $k_{\ell}:\mathcal{X}_{\ell}\times\mathcal{X}_{\ell}\rightarrow\mathbb{R}$ .

A.3.1 The multivariate integral operator

The integral operator becomes

[TABLE]

In the main paper, we considered for instance the Korobov space $\mathbb{K}^{d}_{s}([0,1])$ , defined as the tensor product of unidimensional periodic Sobolev spaces. Note that an element $f$ of $\mathbb{K}^{d}_{s}([0,1])$ is such that

[TABLE]

This implies that $\mathbb{K}^{d}_{s}([0,1])$ is included in the multidimensional Sobolev space, which corresponds to the same requirement, but only for multi-indices such that $\|u_{i}\|_{1}\leq s$ . Another example, featured in this supplementary material, is the multidimensional Gaussian space associated to the Gaussian kernel on $\mathcal{X}_{\ell}=\mathbb{R}$ and the multidimensional Gaussian measure. In this case, the kernel $k_{\gamma,d}$ can be written as the tensor product of the Gaussian kernels on $\mathbb{R}$ :

[TABLE]

In general, the eigenpairs of the integral operator are the tensor products of the eigenpairs of the integral operators $\Sigma_{\ell}$ corresponding to the spaces $\mathcal{F}_{\ell}$ and measures $\mathrm{d}\omega_{\ell}$ . In other words, for $\bm{u}\in(\mathbb{N}\smallsetminus\{0\})^{d}$ ,

[TABLE]

A.3.2 Fixing an order on multi-indices

The definition of the projection DPP and its kernel $\mathfrak{K}$ now requires that we fix an order on multi-indices. We choose an order $\prec$ that keeps eigenvalues decreasing, as in the univariate case where $\sigma_{1}\geq\sigma_{2}\geq\dots$ . Whenever the univariate eigenvalues take the form $\sigma_{i}=\frac{1}{(1+i)^{\eta}}$ with $\eta>0$ , such as in the Korobov case, it holds that

[TABLE]

Now, if the eigenvalues take the form $\displaystyle\sigma_{i}={\eta^{-i}}$ , with $\eta>1$ , as in the Gaussian case,

[TABLE]

In the multivariate Korobov and the Gaussian cases, we thus define in this work $\bm{u}\prec\bm{v}$ as (44) or (47), respectively.
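To make the ordering concrete, here is a small sketch that enumerates the first $N$ multi-indices by decreasing tensor-product eigenvalue $\prod_{\ell}\sigma_{u_{\ell}}$ . It only assumes that the univariate eigenvalues are nonincreasing. Ties are broken lexicographically, which is our choice and not necessarily the authors' one. The function name and signature are illustrative.

```python
import heapq

def first_multi_indices(sigma, d, N):
    """Return the first N multi-indices in (N*)^d with their product eigenvalues.

    `sigma` is a callable i -> sigma_i, assumed nonincreasing for i >= 1.
    A max-heap (stored with negated values) explores the successors of each
    multi-index, which can only have smaller or equal eigenvalues.
    """
    start = (1,) * d
    heap = [(-sigma(1) ** d, start)]
    seen = {start}
    out = []
    while len(out) < N:
        neg_val, u = heapq.heappop(heap)
        out.append((u, -neg_val))
        for l in range(d):
            v = u[:l] + (u[l] + 1,) + u[l + 1:]
            if v not in seen:
                seen.add(v)
                val = 1.0
                for i in v:
                    val *= sigma(i)
                heapq.heappush(heap, (-val, v))
    return out
```

With `sigma = lambda i: (1.0 + i) ** -2.0` and `d = 2`, the enumeration starts with `(1, 1)`, followed by `(1, 2)` and `(2, 1)`, which share the same eigenvalue.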

Now, for $N\in\mathbb{N}$ , let $\textbf{u}_{N}=(\textbf{u}_{1,N},\dots,\textbf{u}_{d,N})\in\mathbb{N}^{d}$ be the $N$ -th multi-index according to $\prec$ . The repulsion kernel is defined as

[TABLE]

We sampled from the corresponding DPP using the generic sampling algorithm in [25], using the uniform and Gaussian distributions as proposal in the successive rejection sampling steps for the Korobov and Gaussian cases, respectively.

Appendix B Supplementary simulations

In this section, we give more plots of the convergence of the quadrature error. Before that, we experimentally assess whether the upper bounds given in [1] are sharp. The author proved upper bounds for $\sigma_{N+1}$ in cases where the univariate eigenvalues $\sigma_{\ell,N}$ decrease polynomially or geometrically in $N$ . In particular, for the Korobov spaces of dimension $d$ and regularity $s$ , we have

[TABLE]

For the Gaussian RKHS in dimension $d$ , it holds

[TABLE]

where $\beta\in]0,1[$ and $\delta>0$ are constants depending on the scale parameters of the kernel and the measure $\mathrm{d}\omega$ . In our experiments, we compare the errors of various quadratures to the two rates (49) and (50). We mean these rates to be proxies for plotting $\sigma_{\textbf{u}_{N}}$ , where $\textbf{u}_{N}$ refers to the order introduced in Section A.3. Figure 2 shows that in the Korobov case, the rate (49) is indeed close to the corresponding eigenvalue for large values of $N$ . The value of $(\log N)^{2s(d-1)}N^{-2s}$ could be larger than $1$ for $d\geq 4$ and small values of $N$ . As for the Gaussian case, Figure 2 shows that the rate (50) is also close to the corresponding eigenvalue for all values of $N$ .

B.1 The multi Fourier ensemble and Korobov RKHS

We consider the case of Korobov spaces with $d\in\{2,3\}$ and $s\in\{1,2\}$ and compare the quadrature error of the same algorithms as in Section 5.1. The results are compiled in Figure 3. The numerical simulations confirm the dependence of the theoretical bounds of the different algorithms on the dimension $d$ and the regularity $s$ . In particular, UGBQ has better performance for high values of $s$ and low values of $N$ , while its asymptotic behaviour remains $\mathcal{O}(N^{-2s/d})$ . Moreover, the empirical rate of SGBQ is similar to its theoretical rate $\mathcal{O}((\log N)^{2(s+1)(d-1)}N^{-2s})$ [24, 58]. Finally, the rate $\mathcal{O}((\log N)^{2s(d-1)}N^{-2s})$ is also confirmed for the algorithms DPPKQ, LVSQ $(\lambda=0)$ and HaltonBQ.

B.2 The multi Gaussian ensemble

We consider the case of Gaussian spaces with $d\in\{2,3\}$ . The kernel $k_{\gamma,d}$ and the reference measure are the tensor products of, respectively, the kernel and the measure used in Section 5.2. We compare DPPKQ and Bayesian quadrature based on the tensor product of Gauss-Hermite nodes, denoted GHBQ. Note that a variant of this algorithm was proposed in [31]: the quadrature nodes are the tensor product of the Gauss-Hermite nodes, but the weights are computed differently. The authors proved, under an assumption on the stability of the weights (that was verified empirically), that the rate of convergence is $\mathcal{O}(dr^{d}\beta^{\prime d}e^{-\delta^{\prime}dN^{1/d}})$ , where $r$ is a constant that quantifies the stability of the weights, and $\beta^{\prime},\delta^{\prime}$ are constants that depend simultaneously on the stability of the weights and the length scales of the kernel and the measure. The results are compiled in Figure 4.

The numerical simulations show that the empirical rate of DPPKQ is $\mathcal{O}(e^{-\delta dN^{1/d}})$ , which is slightly better than its theoretical rate $\mathcal{O}(e^{-\delta d!^{1/d}N^{1/d}})$ . Moreover, we observe that the empirical rate of DPPKQ is better than the empirical rate of GHBQ.

Appendix C Mercer’s theorem, leverage scores, and principal angles

For the sake of completeness, this section gathers some known results, which will be used to prove our own. We will need a general version of Mercer’s theorem, as usual for kernel methods, see Section C.1. On a more technical ground, we will also need formulas for leverage score changes under rank 1 updates, see Section C.2. Finally, Section C.3 covers principal angles between subspaces of a Hilbert space, which bridge the gap between pairs of Hilbert subspaces and determinants, and facilitate taking expectations in Theorem 1.

C.1 Mercer decomposition in non-compact subspaces

In this section we recall Mercer’s theorem and its extensions to non-compact spaces. Let $\mathcal{X}$ be a measurable space and $\mathrm{d}\omega$ a measure on $\mathcal{X}$ . Assume $k$ is a positive definite kernel on $\mathcal{X}$ . Whenever it is well-defined, we consider the operator

[TABLE]

Theorem 2.

Assume that $\mathcal{X}$ is a compact space and $\mathrm{d}\omega$ is a finite Borel measure on $\mathcal{X}$ . Then, there exists an orthonormal basis $(e_{n})_{n\in\mathbb{N}^{*}}$ of $\mathbb{L}_{2}(\mathrm{d}\omega)$ consisting of eigenfunctions of $\bm{\Sigma}$ , and the corresponding eigenvalues are non-negative. The eigenfunctions corresponding to non-vanishing eigenvalues can be taken to be continuous, and the kernel $k$ writes

[TABLE]

where the convergence is absolute and uniform.

Theorem 2 was first proven when $\mathcal{X}=[0,1]$ and $\mathrm{d}\omega$ is the Lebesgue measure in [40]. A modern proof can be found in [36], while the proof in the general case can be found in [13]. Note, however, that the compactness assumption in Theorem 2 excludes kernels such as the Gaussian or the Laplace kernels. Hence, extensions to non-compact spaces are usually required in ML. In [63], the author extended Theorem 2 to $\mathcal{X}=\cup_{i\in\mathbb{N}}\mathcal{X}_{i}$ , with the $\mathcal{X}_{i}$ compact and $\mathrm{d}\omega(\mathcal{X}_{i})<\infty$ . One can also extend Mercer’s theorem under a compact embedding assumption [62]: the RKHS $\mathcal{F}$ associated to $k$ is said to be compactly embedded in $\mathbb{L}_{2}(\mathrm{d}\omega)$ if the map

[TABLE]

is compact. A sufficient condition for this assumption is the integrability of the diagonal (Lemma 2.3, [62]):

[TABLE]

Note that this condition is not necessary (Example 2.9, [62]). Now, under the compact embedding assumption, the pointwise convergence of the Mercer decomposition to the kernel $k$ is equivalent to the injectivity of the embedding $I_{\mathcal{F}}$ (Theorem 3.1, [62]).

C.2 Leverage score changes under rank 1 updates

In this section we prove a lemma inspired by Lemma 5 in [11]. This lemma concerns the changes of leverage scores under rank 1 updates.

We start by recalling the definition of leverage scores, which play an important role in randomized linear algebra [18]. Let $N,M\in\mathbb{N}^{*}$ , $M\geq N$ . Let $\bm{A}\in\mathbb{R}^{N\times M}$ be a matrix of full rank. For $i\in[M]$ , denote $\bm{a}_{i}$ the $i$ -th column of the matrix $\bm{A}$ . Now, the $i$ -th leverage score of the matrix $\bm{A}$ is defined by

[TABLE]

while the cross-leverage score between the $i$ -th column and the $j$ -th column is defined by

[TABLE]

It holds [18]

[TABLE]

and we have the following result.

Lemma 3.

Let $N,M\in\mathbb{N}^{*}$ , $M\geq N$ . Let $\bm{A}\in\mathbb{R}^{N\times M}$ be of full rank, $\rho\in\mathbb{R}_{+}^{*}$ , and $i\in[M]$ . Let $\bm{W}\in\mathbb{R}^{M\times M}$ be a diagonal matrix such that $\bm{W}_{i,i}=\sqrt{1+\rho}$ and $\bm{W}_{j,j}=1$ for $j\neq i$ . Then

[TABLE]

and

[TABLE]

The proof of this lemma is similar to that of Lemma 5 in [11]. We recall the proof for completeness.

Proof.

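(Adapted from [11]) Note that $\bm{A}\bm{W}\bm{W}^{\operatorname{\intercal}}\bm{A}^{\operatorname{\intercal}}=\bm{A}\bm{A}^{\operatorname{\intercal}}+\rho\bm{a}_{i}\bm{a}_{i}^{\operatorname{\intercal}}$ . The Sherman-Morrison formula applied to $\bm{A}\bm{A}^{\operatorname{\intercal}}$ and the vector $\sqrt{\rho}\bm{a}_{i}$ thus yields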

[TABLE]

By definition of $\tau_{i}(\bm{A}\bm{W})$

[TABLE]

Now let $j\in[M]\smallsetminus\{i\}$ . By definition of $\tau_{j}(\bm{A}\bm{W})$

[TABLE]

∎
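The rank 1 update can be checked numerically. The sketch below computes (cross-)leverage scores $\bm{a}_{i}^{\operatorname{\intercal}}(\bm{A}\bm{A}^{\operatorname{\intercal}})^{-1}\bm{a}_{j}$ and compares the scores of $\bm{A}\bm{W}$ with the closed forms that follow from the Sherman-Morrison formula, namely $\tau_{i}(\bm{A}\bm{W})=(1+\rho)\tau_{i}(\bm{A})/(1+\rho\tau_{i}(\bm{A}))$ and $\tau_{j}(\bm{A}\bm{W})=\tau_{j}(\bm{A})-\rho\tau_{i,j}(\bm{A})^{2}/(1+\rho\tau_{i}(\bm{A}))$ for $j\neq i$ . We assume these are the identities stated in Lemma 3. Indices are 0-based in the code and the function names are ours.

```python
import numpy as np

def leverage_scores(a):
    """Matrix of (cross-)leverage scores a_i^T (A A^T)^{-1} a_j of an N x M matrix A."""
    return a.T @ np.linalg.solve(a @ a.T, a)

def reweight_column(a, i, rho):
    """Return A W, where W rescales column i by sqrt(1 + rho)."""
    aw = a.copy()
    aw[:, i] *= np.sqrt(1.0 + rho)
    return aw

def updated_scores_closed_form(tau, i, rho):
    """Leverage scores of A W predicted from those of A (Sherman-Morrison)."""
    denom = 1.0 + rho * tau[i, i]
    new = np.diag(tau) - rho * tau[i, :] ** 2 / denom
    new[i] = (1.0 + rho) * tau[i, i] / denom
    return new
```

In particular, reweighting column $i$ increases $\tau_{i}$ and decreases every other $\tau_{j}$ , while the sum of the leverage scores stays equal to $N$ .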

C.3 Principal angles between subspaces in Hilbert spaces

We recall in this section the definition of principal angles between subspaces in Hilbert spaces and connect them to the determinant of the Gramian matrix of their orthonormal bases.

Proposition 5.

Let $\mathcal{H}$ be a Hilbert space. Let $\mathcal{P}_{1}$ and $\mathcal{P}_{2}$ be two finite-dimensional subspaces of $\mathcal{H}$ with $N=\dim\mathcal{P}_{1}=\dim\mathcal{P}_{2}$ . Denote $\bm{\Pi}_{\mathcal{P}_{1}}$ and $\bm{\Pi}_{\mathcal{P}_{2}}$ the orthogonal projections of $\mathcal{H}$ onto these two subspaces. There exist two orthonormal bases for $\mathcal{P}_{1}$ and $\mathcal{P}_{2}$ denoted $(\bm{v}_{i}^{1})_{i\in[N]}$ and $(\bm{v}_{i}^{2})_{i\in[N]}$ , and a set of angles $\theta_{i}(\mathcal{P}_{1},\mathcal{P}_{2})\in[0,\frac{\pi}{2}]$ such that

[TABLE]

and for $i\in[N]$

[TABLE]

and

[TABLE]

and

[TABLE]

In particular

[TABLE]

We refer to [21] for the proof in the finite-dimensional case and [14] for the general case. The following result shows that the principal angles are somewhat independent of the choice of orthonormal bases. It can be found in [6, 42] for the finite dimensional case. We give here the proof for the general case, for the sake of completeness.

Corollary 2.

Let $(\bm{w}^{1}_{i})_{i\in[N]}$ be any orthonormal basis of $\mathcal{P}_{1}$ and $(\bm{w}^{2}_{i})_{i\in[N]}$ be any orthonormal basis of $\mathcal{P}_{2}$ , and let $\bm{W}=(\langle\bm{w}^{1}_{i},\bm{w}^{2}_{j}\rangle_{\mathcal{H}})_{1\leq i,j\leq N}$ and $\bm{G}=\bm{W}\bm{W}^{\operatorname{\intercal}}$ . Then the eigenvalues of $\bm{G}$ are the $\cos^{2}\theta_{i}(\mathcal{P}_{1},\mathcal{P}_{2})$ . In particular, $\operatorname{Det}^{2}\bm{W}=\operatorname{Det}\bm{G}=\prod\limits_{i\in[N]}\cos^{2}\theta_{i}(\mathcal{P}_{1},\mathcal{P}_{2})$ .

Proof.

Let $(\bm{v}^{\ell}_{i})_{i\in[N]}$ , $\ell\in\{1,2\}$ , be the bases of Proposition 5. Let $\bm{U}^{1}\in\mathbb{O}_{N}(\mathbb{R})$ be such that

[TABLE]

Similarly, there exists a matrix $\bm{U}^{2}\in\mathbb{O}_{N}(\mathbb{R})$ such that

[TABLE]

This implies that

[TABLE]

where $\displaystyle\bm{V}=(\langle\bm{v}^{1}_{i},\bm{v}^{2}_{j}\rangle_{\mathcal{H}})_{1\leq i,j\leq N}$ . Then

[TABLE]

Thus the eigenvalues of $\bm{G}$ are the eigenvalues of $\bm{V}\bm{V}^{\operatorname{\intercal}}$ . By Proposition 5, the diagonal elements of $\bm{V}$ are

[TABLE]

We finish the proof by showing that the off-diagonal elements satisfy

[TABLE]

By (65),

[TABLE]

Then

[TABLE]

Thus

[TABLE]

Finally, $\bm{V}$ is a diagonal matrix and the eigenvalues of $\bm{G}$ are the $\cos^{2}\theta_{i}(\mathcal{P}_{1},\mathcal{P}_{2})$ . ∎
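Corollary 2 is easy to check numerically in a finite-dimensional ambient space. The sketch below, an illustration in $\mathbb{R}^{n}$ rather than in a general Hilbert space, builds orthonormal bases of two random subspaces, forms $\bm{W}$ and $\bm{G}=\bm{W}\bm{W}^{\operatorname{\intercal}}$ , and returns the eigenvalues of $\bm{G}$ together with $\operatorname{Det}^{2}\bm{W}$ . The helper names are ours.

```python
import numpy as np

def orthonormal_basis(vectors):
    """Orthonormal basis (as columns) of the span of the columns of `vectors`."""
    q, _ = np.linalg.qr(vectors)
    return q

def squared_cosines(basis1, basis2):
    """Eigenvalues of G = W W^T and Det(W)^2, where W_ij = <w_i^1, w_j^2>."""
    w = basis1.T @ basis2
    return np.sort(np.linalg.eigvalsh(w @ w.T)), np.linalg.det(w) ** 2

def random_rotation(n, rng):
    """A random n x n orthogonal matrix, used to change orthonormal bases."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q
```

Changing the orthonormal bases, e.g. `basis1 @ random_rotation(N, rng)`, leaves the eigenvalues of $\bm{G}$ unchanged, and their product matches $\operatorname{Det}^{2}\bm{W}$ .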

Appendix D Proofs of our results

Section D.1 contains the proof of Proposition 2. In the main paper, we use it under the form of Corollary 1 to ensure that $\bm{K}(\bm{x})$ is almost surely invertible when $\bm{x}=\{x_{1},\dots,x_{N}\}$ is a projection DPP with reference measure $\mathrm{d}\omega$ and kernel (10). This allows computing the quadrature weights.

The rest of Section D deals with Theorem 1, our upper bound on the approximation error of DPP-based kernel quadrature. The proof is rather long, but can be decomposed into four steps, which we now introduce for ease of reading.

First, we prove Lemma 1, which separates the search for an upper bound into examining the contribution of the three terms in (17); this is Section D.2. The first two terms of (17) only depend on the function $g$ in (1), and we leave them be. The third term is more geometric, and relates to the approximation error of the space spanned by $(e^{\cal F}_{n})_{n\in[N]}$ by the (random) subspace ${\cal T}(\bm{x})$ .

Second, in Section D.3, we bound this geometric term for a fixed DPP realization $\bm{x}$ . We take care to obtain a bound that will later yield a tractable expectation under that DPP. This is done in Proposition 4, which in turn requires two intermediate results, Lemma 4 and Proposition 6.

Third, we take the expectation of the bound in Proposition 4 under the proposed DPP. This is done in Proposition 3, which is proven thanks to Proposition 2, Lemmas 2, 5 & 6. This is Section D.4.

Fourth, Theorem 1 is obtained in Section D.5, using the results of the previous steps, and an argument to reduce the proof to RKHSs with flat initial spectrum.

D.1 Proof of Proposition 2

Proof.

Recall the Mercer decomposition of $k$ :

[TABLE]

where the convergence is point-wise on $\mathcal{X}$ . Define for $M\in\mathbb{N}^{*},\>M\geq N$ the $M$ -th truncated kernel

[TABLE]

By (77)

[TABLE]

Let $\bm{x}=(x_{1},\dots,x_{N})\in\mathcal{X}^{N}$ such that $\operatorname{Det}\bm{E}(\bm{x})\neq 0$ , and define

[TABLE]

By the continuity of the function $\bm{M}\in\mathbb{R}^{N\times N}\mapsto\operatorname{Det}\bm{M}$ and by (79)

[TABLE]

Thus, to prove that $\operatorname{Det}\bm{K}(\bm{x})>0$ , it is enough to prove that $\operatorname{Det}\bm{K}_{M}(\bm{x})$ is bounded from below by a positive real number, uniformly in $M$ , for $M$ large enough. We write

[TABLE]

with $\bm{F}_{M}(\bm{x})=(e_{i}(x_{j}))_{(i,j)\in[M]\times[N]}$ and $\Sigma_{M}$ is a diagonal matrix containing the first $M$ eigenvalues $(\sigma_{m})$ . The Cauchy-Binet identity yields

[TABLE]

Therefore,

[TABLE]

so that $\bm{K}(\bm{x})$ is a.s. invertible. ∎
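The Cauchy-Binet step can be illustrated numerically. The sketch below expands $\operatorname{Det}(\bm{F}^{\operatorname{\intercal}}\bm{\Sigma}_{M}\bm{F})$ as a sum over subsets $T\subset[M]$ with $|T|=N$ of $\prod_{t\in T}\sigma_{t}\operatorname{Det}^{2}\bm{F}_{T}$ , and isolates the term $T=[N]$ , which is our reading of the positive lower bound used in the proof. A random matrix stands in for the evaluations $(e_{i}(x_{j}))$ , and the function names are ours.

```python
import itertools
import numpy as np

def det_via_cauchy_binet(f, sigma):
    """Det(F^T diag(sigma) F) as a sum over row subsets T of size N (F is M x N)."""
    m, n = f.shape
    total = 0.0
    for subset in itertools.combinations(range(m), n):
        idx = list(subset)
        total += np.prod(sigma[idx]) * np.linalg.det(f[idx, :]) ** 2
    return total

def leading_term(f, sigma):
    """Term T = [N] of the expansion: prod_{n <= N} sigma_n * Det(E(x))^2."""
    n = f.shape[1]
    return np.prod(sigma[:n]) * np.linalg.det(f[:n, :]) ** 2
```

Since every term of the sum is non-negative, the determinant of the truncated Gram matrix is at least `leading_term(f, sigma)`, whatever the truncation level $M\geq N$ .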

D.2 Proof of Lemma 1

Proof.

First, we prove that

[TABLE]

Recall that

[TABLE]

and that we assumed in Section 1 that $\mathcal{F}$ is dense in $\mathbb{L}_{2}(\mathrm{d}\omega)$ , so that $(e_{m})_{m\in\mathbb{N}}$ is an orthonormal basis of $\mathbb{L}_{2}(\mathrm{d}\omega)$ and the eigenvalues $\sigma_{n}$ are strictly positive. Now let $\bm{\Sigma}^{-1/2}:\mathcal{F}\rightarrow\mathbb{L}_{2}(\mathrm{d}\omega)$ and $\bm{\Sigma}^{1/2}:\mathbb{L}_{2}(\mathrm{d}\omega)\rightarrow\mathcal{F}$ be defined by

[TABLE]

Observe that $\bm{\Sigma}^{-1/2}\mu_{g}=\bm{\Sigma}^{-1/2}\bm{\Sigma}g=\bm{\Sigma}^{1/2}g\in\mathcal{F}$ . Now, for $m\in\mathbb{N}^{*}$ ,

[TABLE]

As a consequence,

[TABLE]

Now we turn to proving (17) from the main text. Define first the operators $\bm{\Sigma}_{N},\bm{\Sigma}_{N}^{1/2},\bm{\Sigma}_{N}^{\perp},\bm{\Sigma}_{N}^{\perp 1/2}:\mathbb{L}_{2}(\mathrm{d}\omega)\rightarrow\mathcal{F}$ by

[TABLE]

Note that $\bm{\Sigma}^{1/2}=\bm{\Sigma}_{N}^{1/2}+\bm{\Sigma}_{N}^{\perp 1/2}$ and

[TABLE]

Using (94), there exists $\tilde{\mu}_{g}\in\mathcal{F}$ such that $\|\tilde{\mu}_{g}\|_{\mathcal{F}}\leq 1$ and $\mu_{g}=\bm{\Sigma}^{1/2}\tilde{\mu}_{g}$ . Now, the approximation error writes

[TABLE]

The operator $\bm{\Pi}_{\mathcal{T}(\bm{x})^{\perp}}$ is an orthogonal projection and $\|\tilde{\mu}_{g}\|_{\mathcal{F}}\leq 1$ so that by (99)

[TABLE]

Now, recall that $(e_{n}^{\mathcal{F}})_{n\in[N]}$ is orthonormal. Moreover, for $n\in[N]$ , $e_{n}^{\mathcal{F}}$ is an eigenfunction of $\bm{\Sigma}_{N}^{1/2}$ and the corresponding eigenvalue is $\sqrt{\sigma_{n}}$ . Thus

[TABLE]

Then

[TABLE]

Remarking that $\|g\|_{d\omega,1}=\sum\limits_{n\in[N]}|\langle\tilde{\mu}_{g},e_{n}^{\mathcal{F}}\rangle_{\mathcal{F}}|$ concludes the proof of (17) and therefore Lemma 1. ∎

D.3 Proof of Proposition 4

Proposition 4 gives an upper bound to the term $\max\limits_{n\in[N]}\sigma_{n}\|\bm{\Pi}_{\mathcal{T}(\bm{x})^{\perp}}e_{n}^{\mathcal{F}}\|_{\mathcal{F}}^{2}$ that appears in Lemma 1. We first prove a technical result, Lemma 4, and then combine it with Proposition 6 to finish the proof. We conclude with the proof of Proposition 6.

D.3.1 A preliminary lemma

Let $\bm{x}=(x_{1},\dots,x_{N})\in\mathcal{X}^{N}$ . Recall that $\bm{K}(\bm{x})=(k(x_{i},x_{j}))_{1\leq i,j\leq N}$ and denote $\tilde{\bm{K}}(\bm{x})=(\tilde{k}(x_{i},x_{j}))_{1\leq i,j\leq N}$ , see Section 4.2.2. In the following, we define

[TABLE]

Lemma 4 below shows that each term of the form $\Delta_{n}^{\cal F}(\bm{x})$ measures the squared norm of the projection of $e_{n}^{\mathcal{F}}$ on ${\cal T}(\bm{x})$ . The same holds for $\Delta_{n}^{\tilde{\cal F}}(\bm{x})$ and the projection of $e_{n}^{\tilde{\mathcal{F}}}$ onto $\tilde{{\cal T}}(\bm{x})$ .

Indeed, $\displaystyle\|\bm{\Pi}_{\mathcal{T}(\bm{x})^{\perp}}e_{n}^{\mathcal{F}}\|^{2}_{\mathcal{F}}=1-\|\bm{\Pi}_{\mathcal{T}(\bm{x})}e_{n}^{\mathcal{F}}\|^{2}_{\mathcal{F}}$ since $\|e_{n}^{\mathcal{F}}\|^{2}_{\mathcal{F}}=1$ . Thus it is sufficient to prove that $\|\bm{\Pi}_{\mathcal{T}(\bm{x})}e_{n}^{\mathcal{F}}\|^{2}_{\mathcal{F}}=\Delta_{n}^{\cal F}(\bm{x})$ . This boils down to showing that $\bm{K}(\bm{x})^{-1}$ is the matrix of the inner product $\langle\cdot,\cdot\rangle_{\cal F}$ restricted to ${\mathcal{T}(\bm{x})}$ .

Lemma 4.

For $n\in\mathbb{N}^{*}$ , let $e_{n}^{\mathcal{F}}(\bm{x}),e_{n}^{\tilde{\mathcal{F}}}(\bm{x})\in\mathbb{R}^{N}$ be the vectors of the evaluations of $e_{n}^{\mathcal{F}}$ and $e_{n}^{\tilde{\mathcal{F}}}$ on the elements of $\bm{x}$ , respectively. Then

[TABLE]

We give the proof of (108); the proof of (109) follows the same lines.

Proof.

Let us write

[TABLE]

where the $c_{i}$ are the elements of the vector $\bm{c}=\bm{K}(\bm{x})^{-1}e_{n}^{\mathcal{F}}(\bm{x})$ . Then

[TABLE]

Since $(e_{m}^{\cal F})_{m\in\mathbb{N}^{*}}$ is orthonormal,

[TABLE]

Using Mercer’s theorem, see (79),

[TABLE]

Combining (112) and (113) along with the definition of the vector $\bm{c}=\bm{K}(\bm{x})^{-1}e_{n}^{\mathcal{F}}(\bm{x})$ yields

[TABLE]

∎
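The identity behind Lemma 4 can be checked on a truncated kernel. The sketch below is an illustration under assumptions: the kernel is truncated to its first $M$ terms, the eigenfunctions are taken to be the Fourier basis on $[0,1]$ , and functions of the truncated RKHS are represented by their coordinates in the orthonormal family $(e_{m}^{\mathcal{F}})_{m\in[M]}$ . In these coordinates, $k(x_{j},\cdot)$ is column $j$ of the feature matrix and $e_{n}^{\mathcal{F}}$ is a canonical basis vector. The quantity $e_{n}^{\mathcal{F}}(\bm{x})^{\operatorname{\intercal}}\bm{K}(\bm{x})^{-1}e_{n}^{\mathcal{F}}(\bm{x})$ is compared with the squared norm of the projection computed by least squares. Indices are 0-based in the code.

```python
import numpy as np

def feature_matrix(x, sigma):
    """M x N matrix with entries e_m^F(x_j) = sqrt(sigma_m) e_m(x_j) (Fourier basis)."""
    rows = []
    for m in range(len(sigma)):
        k = (m + 1) // 2
        if m == 0:
            e = np.ones_like(x)
        elif m % 2 == 1:
            e = np.sqrt(2.0) * np.cos(2.0 * np.pi * k * x)
        else:
            e = np.sqrt(2.0) * np.sin(2.0 * np.pi * k * x)
        rows.append(np.sqrt(sigma[m]) * e)
    return np.array(rows)

def delta_formula(phi, n):
    """e_n^F(x)^T K(x)^{-1} e_n^F(x), with K(x) = Phi^T Phi for the truncated kernel."""
    return phi[n] @ np.linalg.solve(phi.T @ phi, phi[n])

def delta_projection(phi, n):
    """Squared norm of the projection of e_n^F onto Span k(x_j, .), in coordinates."""
    target = np.zeros(phi.shape[0])
    target[n] = 1.0
    coeffs, *_ = np.linalg.lstsq(phi, target, rcond=None)
    return np.sum((phi @ coeffs) ** 2)
```

Both functions return the same number, which lies in $[0,1]$ and coincides with the $n$ -th leverage score of $\bm{\Phi}^{\operatorname{\intercal}}$ in the sense of Section C.2.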

D.3.2 End of the proof of Proposition 4

Proof.

By Lemma 4, the inequality (22) in Proposition 4 is equivalent to

[TABLE]

As an intermediate remark, note that in the special case $n=1$ , by construction

[TABLE]

where $\prec$ is the Loewner order, the partial order defined by the convex cone of positive semi-definite matrices. Thus

[TABLE]

Noting that $\tilde{\sigma}_{1}=\sigma_{1}$ and that

[TABLE]

yields (115) for $n=1$ :

[TABLE]

For $n\neq 1$ , the proof is much more subtle. Indeed, a naive application of the inequality (117) would lead to the following inequality

[TABLE]

Since $\forall n\in\mathbb{N}$ , $e_{n}^{\tilde{\mathcal{F}}}=\sqrt{\sigma_{1}/\sigma_{n}}e_{n}^{\mathcal{F}}$ , we get

[TABLE]

and hence the unsatisfactory inequality

[TABLE]

We can prove a better inequality by applying a sequence of rank-one updates to the kernel $k$ to build $N$ intermediate kernels $\tilde{k}^{(\ell)}$ that lead to $N$ inequalities sharp enough to prove (115) for $n\neq 1$ . Then inequality (115) will result as a corollary of Proposition 6 below. To this aim, we define $N$ RKHSs $\tilde{\mathcal{F}}_{\ell}$ , $1\leq\ell\leq N$ , that interpolate between $\mathcal{F}$ and $\tilde{\mathcal{F}}$ . For $\ell\in[N]$ , define the kernel $\tilde{k}^{(\ell)}$ by

[TABLE]

and let $\tilde{\mathcal{F}}_{\ell}$ be the RKHS corresponding to the kernel $\tilde{k}^{(\ell)}$ . For $\bm{x}\in\mathcal{X}^{N}$ , define $\tilde{\bm{K}}^{(\ell)}(\bm{x})=(\tilde{k}^{(\ell)}(x_{i},x_{j}))_{1\leq i,j\leq N}$ . Similarly to the previous notation, we also define

[TABLE]

Now we have the following useful proposition.

Proposition 6.

For $n\in[N]\smallsetminus\{1\}$ , we have

[TABLE]

and

[TABLE]

For ease of reading, we first show that inequality (115) and therefore Proposition 4 is easily deduced from this Proposition 6 and then give its proof.

Let $n\in[N]$ such that $n\neq 1$ . We first remark that $\mathcal{F}=\tilde{\mathcal{F}}_{1}$ and use $(n-2)$ times inequality (126) of Proposition 6:

[TABLE]

Then we use (125), which is connected to the rank-one update from the kernel $\tilde{k}^{(n-1)}$ to $\tilde{k}^{(n)}$ , so that

[TABLE]

Then we apply (126) to the r.h.s. again $N-n-1$ times to finally get:

[TABLE]

since $\tilde{k}^{(N)}=\tilde{k}$ and $\tilde{\mathcal{F}}_{N}=\tilde{\mathcal{F}}$ . This concludes the proof of the desired inequality (115) and therefore of Proposition 4. ∎

D.3.3 Proof of Proposition 6

Proof.

(Proposition 6) Let $n\in[N]\smallsetminus\{1\}$ , and $M\in\mathbb{N}$ such that $M\geq N$ . Let $\bm{A}_{\ell}\in\mathbb{R}^{N\times M}$ be defined by

[TABLE]

For $\ell\in[N]$ define

[TABLE]

Let $\bm{W}_{\ell}\in\mathbb{R}^{M\times M}$ be the diagonal matrix defined by

[TABLE]

Then one has the simple relation

[TABLE]

which prepares the use of Lemma 3 in Section C.2. By definition of the $n$ -th leverage score of the matrix $\bm{A}$ , see (54) in Section C.2,

[TABLE]

Define similarly $\Delta_{n,M}^{\tilde{\cal F}_{\ell}}(\bm{x})=e_{n}^{\tilde{\mathcal{F}}_{\ell}}(\bm{x})^{\operatorname{\intercal}}\tilde{\bm{K}}_{M}^{(\ell)}(\bm{x})^{-1}e_{n}^{\tilde{\mathcal{F}}_{\ell}}(\bm{x})$ . Thanks to (57) of Lemma 3 and (133), for $\ell=n$ ,

[TABLE]

where $\displaystyle\rho_{n}=\frac{\sigma_{1}}{\sigma_{n}}-1$ . Thus

[TABLE]

Then

[TABLE]

since $\rho_{n}\geq 0$ and $\tau_{n}\Big{(}\bm{A}_{n-1}\Big{)}\in[0,1]$ thanks to (56). This proves that for $M\in\mathbb{N}^{*}$ such that $M\geq N$ ,

[TABLE]

Now,

[TABLE]

Moreover, the map $\bm{X}\mapsto\bm{X}^{-1}$ is continuous on $GL_{N}(\mathbb{R})$ . This proves the inequality (125) of Proposition 6. To prove the inequality (126), we start by using (58):

[TABLE]

which implies that

[TABLE]

Then for $M\geq N$ ,

[TABLE]

As above, we conclude the proof by considering the limit $M\to\infty$

[TABLE]

This proves inequality (126) and concludes the proof of Proposition 6. ∎

D.4 Proof of Proposition 3

In this section, $\bm{x}=(x_{1},\dots,x_{N})\in\mathcal{X}^{N}$ is the realization of the DPP of Theorem 1. Let $\bm{E}^{\mathcal{F}}(\bm{x})=(e_{i}^{\mathcal{F}}(x_{j}))_{1\leq i,j\leq N}$ and $\bm{E}(\bm{x})=(e_{i}(x_{j}))_{1\leq i,j\leq N}$ , and $\bm{K}(\bm{x})=(k(x_{i},x_{j}))_{1\leq i,j\leq N}$ . Moreover, let $\mathcal{E}^{\mathcal{F}}_{N}=\operatorname{\mathrm{Span}}(e_{m}^{\mathcal{F}})_{m\in[N]}$ and $\mathcal{T}(\bm{x})=\operatorname{\mathrm{Span}}\left(k(x_{i},.)\right)_{i\in[N]}$ .

We first prove two lemmas that are necessary to prove Proposition 3.

D.4.1 Two preliminary lemmas

Lemma 5.

Let $\bm{x}=(x_{1},\dots,x_{N})\in\mathcal{X}^{N}$ such that $\operatorname{Det}^{2}\bm{E}(\bm{x})\neq 0$ . Then,

[TABLE]

Proof.

The condition $\operatorname{Det}^{2}\bm{E}(\bm{x})\neq 0$ implies, by Proposition 2, that $\bm{K}(\bm{x})$ is nonsingular. Thus $\dim\mathcal{T}(\bm{x})=N$. Let $(t_{i})_{i\in[N]}$ be an orthonormal basis of $\mathcal{T}(\bm{x})$ with respect to $\langle.,.\rangle_{\mathcal{F}}$. Using Corollary 2, and the fact that $(e_{n}^{\mathcal{F}})_{n\in[N]}$ is an orthonormal basis of $\mathcal{E}^{\mathcal{F}}_{N}$ with respect to $\langle.,.\rangle_{\mathcal{F}}$,

[TABLE]

Now, write for $i\in[N]$ ,

[TABLE]

Thus

[TABLE]

Then

[TABLE]

where

[TABLE]

Thus

[TABLE]

Now, let $\bm{c}_{i}$ denote the columns of the matrix $\bm{C}(\bm{x})$. Since $(t_{i})_{i\in[N]}$ is an orthonormal basis of $\mathcal{T}(\bm{x})$ with respect to $\langle.,.\rangle_{\mathcal{F}}$, (147) gives

[TABLE]

Therefore

[TABLE]

Thus

[TABLE]

Combining (146), (152) and (155) concludes the proof of Lemma 5:

[TABLE]

∎

Lemma 6.

[TABLE]

Proof.

Let $\bm{x}=(x_{1},\dots,x_{N})\in\mathcal{X}^{N}$ . From (79)

[TABLE]

Moreover,

[TABLE]

Now, for $T\subset[M]$ such that $|T|=N$, $(e_{t})_{t\in T}$ is an orthonormal family of $\mathbb{L}_{2}(\mathrm{d}\omega)$, so by Lemma 21 of [25]:

[TABLE]

Thus

[TABLE]

Now, $\displaystyle\sum\limits_{n\in\mathbb{N}^{*}}\sigma_{n}<\infty$ implies that $\displaystyle\sum\limits_{T\subset\mathbb{N}^{*},|T|=N}\prod\limits_{t\in T}\sigma_{t}<\infty$. Indeed, for $\ell\in[N]$ let $p_{\ell}$ denote the $\ell$-th elementary symmetric polynomial. By Maclaurin's inequality [60], for any vector $\bm{\nu}\in\mathbb{R}_{+}^{M}$,

[TABLE]

Thus

[TABLE]

This inequality does not depend on the dimension $M$, so it extends to $\bm{\nu}\in\mathbb{R}_{+}^{\mathbb{N}^{*}}$ with $\displaystyle\sum\limits_{n\in\mathbb{N}^{*}}\nu_{n}<\infty$. Therefore

[TABLE]
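As a numerical sanity check of the summability argument, the sketch below tests one standard consequence of Maclaurin's inequality, namely $p_{\ell}(\bm{\nu})\leq\frac{1}{\ell!}\big(\sum_{m}\nu_{m}\big)^{\ell}$ for a nonnegative vector $\bm{\nu}$. The vector `nu`, its length, and the helper `elementary_symmetric` are illustrative assumptions, and the exact form of the inequality displayed in the paper may differ.

```python
import itertools
import math

import numpy as np


def elementary_symmetric(values, order):
    """Brute-force p_order(values): sum of products over all subsets of size `order`."""
    return sum(math.prod(subset) for subset in itertools.combinations(values, order))


rng = np.random.default_rng(0)
nu = rng.uniform(0.0, 1.0, size=8)  # hypothetical nonnegative vector, M = 8
total = float(nu.sum())

for order in range(1, 5):
    lhs = elementary_symmetric(nu.tolist(), order)
    bound = total**order / math.factorial(order)
    print(order, lhs, bound)
    # The bound depends on nu only through its sum, not on the dimension M.
    assert lhs <= bound + 1e-12
```

Because the right-hand side involves only the sum of the entries, the same bound holds for every truncation level, which is what allows the passage to summable sequences.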

Furthermore,

[TABLE]

Then, by the monotone convergence theorem, $\displaystyle\bm{x}\mapsto\frac{1}{N!}\operatorname{Det}\bm{K}(\bm{x})$ is measurable and

[TABLE]

∎
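A finite-dimensional analogue may help to see why a normalized integral of $\operatorname{Det}\bm{K}(\bm{x})$ produces a sum of products of eigenvalues. For a positive semidefinite matrix, the sum of all $N\times N$ principal minors equals the $N$-th elementary symmetric polynomial of its eigenvalues. The sketch below checks this on a random matrix; the matrix `K`, its size, and the helper names are assumptions made for illustration and do not come from the paper.

```python
import itertools
import math

import numpy as np


def sum_principal_minors(K, size):
    """Sum of det(K[S, S]) over all index subsets S with |S| = size."""
    n = K.shape[0]
    return sum(
        np.linalg.det(K[np.ix_(S, S)])
        for S in itertools.combinations(range(n), size)
    )


def elementary_symmetric(values, order):
    """Brute-force elementary symmetric polynomial of the given order."""
    return sum(math.prod(c) for c in itertools.combinations(values, order))


rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
K = B @ B.T                     # hypothetical 6 x 6 PSD "kernel matrix"
sigma = np.linalg.eigvalsh(K)   # its eigenvalues

N = 3
minors = sum_principal_minors(K, N)
e_N = elementary_symmetric(sigma.tolist(), N)
print(minors, e_N)
```

In this discrete picture, summing over ordered $N$-tuples of distinct indices counts every subset $N!$ times, and tuples with a repeated index contribute a zero determinant, which mirrors the factor $\frac{1}{N!}$ in the lemma.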

D.4.2 End of the proof of Proposition 3

Proof.

Remember that

[TABLE]

Then by Lemma 5 and the fact that $\operatorname{Det}^{2}\bm{E}^{\mathcal{F}}(\bm{x})=\prod\limits_{n\in[N]}\sigma_{n}\operatorname{Det}^{2}\bm{E}(\bm{x})$

[TABLE]
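The determinant identity invoked in this step can be checked by row scaling. The short derivation below is a sketch that assumes the normalization $e_{n}^{\mathcal{F}}=\sqrt{\sigma_{n}}\,e_{n}$, which is consistent with the stated identity; the diagonal matrix $\bm{D}$ is a name introduced here for illustration only.

$$
\bm{E}^{\mathcal{F}}(\bm{x})=\bm{D}\,\bm{E}(\bm{x}),\quad \bm{D}=\operatorname{Diag}\big(\sqrt{\sigma_{1}},\dots,\sqrt{\sigma_{N}}\big)
\;\Longrightarrow\;
\operatorname{Det}^{2}\bm{E}^{\mathcal{F}}(\bm{x})=\operatorname{Det}^{2}\bm{D}\,\operatorname{Det}^{2}\bm{E}(\bm{x})=\prod_{n\in[N]}\sigma_{n}\,\operatorname{Det}^{2}\bm{E}(\bm{x}).
$$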

Then, taking the expectation with respect to $\bm{x}$ resulting from a DPP of kernel $\operatorname*{\mathfrak{K}}(x,y)$ ,

[TABLE]

Now, by Lemma 6

[TABLE]

Therefore,

[TABLE]

∎

D.5 Proof of Theorem 1

Proof.

Thanks to Proposition 4 and Lemma 2 (for $\tilde{\cal F}$ and $\tilde{k}$ )

[TABLE]

Then Proposition 3 applied to $\tilde{\cal F}$ with kernel $\tilde{k}$ yields

[TABLE]

Every subset $T\subset\mathbb{N}^{*}$ such that $|T|=N$ can be written as $T=V\cup W$ with $V\subset[N]$ and $W\subset\mathbb{N}^{*}\smallsetminus[N]$ , and this decomposition is unique. Then

[TABLE]

Therefore

[TABLE]

where for $\ell\in[N]$, $p_{\ell}$ is the $\ell$-th elementary symmetric polynomial, with the convention that $p_{0}=1$.
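The decomposition $T=V\cup W$ can be checked numerically on a truncated spectrum: the sum over all subsets of size $N$ factorizes into products of elementary symmetric polynomials of the first $N$ eigenvalues and of the remaining ones. The sketch below uses a hypothetical spectrum $\sigma_{m}=m^{-2}$ truncated to nine terms and indexes the split by $\ell=|V|$; the indexing used in the paper's display may differ.

```python
import itertools
import math

import numpy as np


def elementary_symmetric(values, order):
    """p_order(values); equals 1 for order 0 and 0 when order > len(values)."""
    return sum(math.prod(c) for c in itertools.combinations(values, order))


sigma = [1.0 / m**2 for m in range(1, 10)]  # hypothetical spectrum, truncated to 9 terms
N = 4
head, tail = sigma[:N], sigma[N:]           # indices in [N] and indices beyond N

# Direct sum over all subsets T of size N.
direct = elementary_symmetric(sigma, N)

# Same sum, grouped by ell = |V| with V inside [N] and W outside [N].
split = sum(
    elementary_symmetric(head, ell) * elementary_symmetric(tail, N - ell)
    for ell in range(N + 1)
)
print(direct, split)
```

The term with $\ell=N$ is the product $\prod_{n\in[N]}\sigma_{n}$ alone, since the convention $p_{0}=1$ applies to the tail.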

Finally, thanks to (163) above

[TABLE]

As a consequence, by writing $r_{N}=\sum\limits_{m\geq N+1}\sigma_{m}$ ,

[TABLE]

which can be plugged into Lemma 1 to conclude the proof. ∎

Appendix E The intuitions behind the algorithm

This section summarizes the intuitions behind the algorithm presented in this article.

E.1 The geometric intuition

Recall that the quadrature problem in an RKHS boils down to interpolating the mean element $\mu_{g}$ by a mixture of the $k(x_{i},.)$, where $g\in\mathbb{L}_{2}(\mathrm{d}\omega)$ is such that $\|g\|_{\mathrm{d}\omega}\leq 1$. A promising algorithm would thus select the nodes $\{x_{i},i\in[N]\}$ so as to minimize the distance between $\mu_{g}$ and its projection onto $\mathcal{T}(\bm{x})=\operatorname{\mathrm{Span}}(k(x_{i},\cdot);i\in[N])$. Upper bounding this approximation error $\|\mu_{g}-\bm{\Pi}_{\mathcal{T}(\bm{x})}\mu_{g}\|_{\mathcal{F}}$ is not easy in general. On the one hand, we propose to replace $\mu_{g}$ by its projection $\bm{\Pi}_{\mathcal{E}_{N}^{\mathcal{F}}}\mu_{g}$ onto the first $N$ eigenfunctions of $\bm{\Sigma}$. Then it is easy to prove that

[TABLE]

On the other hand, if we find a quadrature rule such that $\|\bm{\Pi}_{\mathcal{E}_{N}^{\mathcal{F}}}\mu_{g}-\bm{\Pi}_{\mathcal{T}(\bm{x})}\mu_{g}\|_{\mathcal{F}}$ is small, then we can guarantee an overall approximation error that is not much worse than the PCA error (179). After introducing an auxiliary RKHS $\tilde{\mathcal{F}}$ with kernel $\tilde{k}$, we express this second term using the principal angles between the subspaces $\tilde{\mathcal{T}}(\bm{x})$ and $\mathcal{E}_{N}^{\tilde{\mathcal{F}}}$ (see Section 4.2.2). This yields a bound on the interpolation error

[TABLE]

The first term in the right hand side of (180) is $2\sigma_{N+1}$ , which corresponds to the approximation error observed in numerical simulations. The second term depends on the largest principal angle $\theta_{N}$ between the subspaces $\tilde{\mathcal{T}}(\bm{x})$ and $\mathcal{E}_{N}^{\tilde{\mathcal{F}}}$ , see Figure 5. This term can in turn be bounded by the symmetrized quantity

[TABLE]

which has a tractable expectation under the projection DPP that we consider in this paper. As an illustration of (180), Figure 6 compares the quality of approximation of a mean element $\mu_{g}$ using kernel interpolation based on two configurations of nodes: the first configuration (top) is well spread and the second configuration (bottom) is not. Observe that the largest principal angle $\theta_{N}$ for the first configuration is around $\pi/4$ , so that $\tan^{2}\theta_{N}\approx 1$ ; while it is around $\pi/2$ for the second configuration so that $\tan^{2}\theta_{N}\gg 1$ . Now observe that the first design of nodes gives the best reconstruction. This observation is consistent with (180).
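To make the role of the largest principal angle concrete, the sketch below computes principal angles between two subspaces of a Euclidean space from the singular values of the product of orthonormal bases, a standard numerical approach of the kind studied in [6]. It is only an illustration: it works with the Euclidean inner product rather than the RKHS inner product $\langle.,.\rangle_{\tilde{\mathcal{F}}}$, and the matrices `A`, `B` and the angle `theta` are hypothetical.

```python
import numpy as np


def principal_angles(A, B):
    """Principal angles, in ascending order, between the column spans of A and B."""
    Qa, _ = np.linalg.qr(A)  # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B)  # orthonormal basis of span(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))


theta = 0.3
# span(A) is the (e1, e2) plane; span(B) is that plane tilted by theta around e1.
A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
B = np.array([[1.0, 0.0], [0.0, np.cos(theta)], [0.0, np.sin(theta)]])

angles = principal_angles(A, B)
largest = angles[-1]
print(angles, np.tan(largest) ** 2)
```

As the tilt `theta` approaches $\pi/2$, the quantity `np.tan(largest) ** 2` blows up, which matches the qualitative behaviour described for the poorly spread configuration.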

E.2 The inclusion probability of DPPs and the Christoffel functions

The optimal distribution $q_{\lambda}$, see Section 2.2, can be linked to the so-called Christoffel functions [50]. These functions are rooted in the literature on orthogonal polynomials [46]. For simplicity, we introduce them in dimension $d=1$. They are defined by

[TABLE]

Christoffel functions have a more explicit form [46] that can be used for pointwise evaluation

[TABLE]

where $(P_{m})_{m\in\mathbb{N}}$ are the orthonormal polynomials with respect to $\mathrm{d}\omega$ . To establish a connection with $q_{\lambda}$ , the authors of [50] defined regularized Christoffel functions for some kernel $k$ :

[TABLE]

The authors derived an asymptotic equivalent of the function $C_{\lambda,w,k}$ in the regime $\lambda\rightarrow 0$ under some assumptions on the kernel. Furthermore, they proved that $C_{\lambda,w,k}$ is tied to $q_{\lambda}$ by the following relationship (Lemma 5, [50]):

[TABLE]

On the other hand, assume that the $(\psi_{n})$ are the family of orthonormal polynomials with respect to $\mathrm{d}\omega$. Let $x\in\mathcal{X}$ and let $\bm{x}$ be a random subset of $\mathcal{X}^{N}$ drawn from the projection DPP $(\operatorname*{\mathfrak{K}},\mathrm{d}\omega)$; then

[TABLE]

In other words, the inclusion probability of the corresponding projection DPP is related to the inverse of the Christoffel function as defined in (183). Figure 7 shows evaluations of the inclusion probability of the projection DPP for the RKHS defined by the Gaussian kernel together with the Gaussian measure on the real line. Recall that in this case the eigenfunctions are given by

[TABLE]

The theoretical analysis of the "bumps" of the functions $x\mapsto\frac{1}{N}\operatorname*{\mathfrak{K}}(x,x)\,\mathrm{d}\omega(x)$ was carried out in [20]. More precisely, the authors studied the approximation of those bumps by Gaussians centred on the roots of the Hermite polynomials, see Figure 7 (b). We observe a similar behaviour in the multidimensional Gaussian case, as illustrated in Figure 8: the inclusion probability of the projection DPP has local maxima around the tensor products of the roots of the Hermite polynomials. In other words, quadratures based on nodes sampled according to a projection DPP are probabilistic relaxations of classical quadratures based on roots of orthogonal polynomials, and they can be defined even if $N$ is not the square of an integer (the cases $N\in\{17,21\}$ in Figure 8).
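The link between the inclusion probability and orthonormal polynomials can be explored numerically in the polynomial setting described above. The sketch below assumes that $\mathrm{d}\omega$ is the standard Gaussian measure and that $\psi_{n}=He_{n}/\sqrt{n!}$ are the orthonormal probabilists' Hermite polynomials; it evaluates $\operatorname*{\mathfrak{K}}(x,x)=\sum_{n<N}\psi_{n}(x)^{2}$ and checks that $\frac{1}{N}\operatorname*{\mathfrak{K}}(x,x)\,\mathrm{d}\omega(x)$ integrates to one. This is not the Gaussian-kernel eigenbasis used for Figure 7, and the function name and the value of `N` are illustrative choices.

```python
import math

import numpy as np
from numpy.polynomial import hermite_e as He


def projection_kernel_diagonal(x, N):
    """K(x, x) = sum_{n < N} psi_n(x)^2 with psi_n = He_n / sqrt(n!)."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for n in range(N):
        coeffs = np.zeros(n + 1)
        coeffs[n] = 1.0  # selects He_n in the Hermite-e basis
        psi_n = He.hermeval(x, coeffs) / math.sqrt(math.factorial(n))
        total += psi_n**2
    return total


N = 5
# Gauss-Hermite rule for the weight exp(-x^2 / 2), rescaled to the standard Gaussian measure.
nodes, weights = He.hermegauss(20)
weights = weights / math.sqrt(2.0 * math.pi)

mass = float(np.sum(weights * projection_kernel_diagonal(nodes, N)) / N)
christoffel_at_zero = 1.0 / projection_kernel_diagonal(np.array([0.0]), N)[0]
print(mass, christoffel_at_zero)
```

The reciprocal `1 / K(x, x)` computed here plays the role of a Christoffel function for this polynomial family, under the indexing convention assumed in the sketch.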

Bibliography

[1] F. Bach. On the equivalence between kernel quadrature rules and random feature expansions. The Journal of Machine Learning Research, 18(1):714–751, 2017.
[2] F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning, ICML'12, pages 1355–1362, 2012.
[3] R. Bardenet and A. Hardy. Monte Carlo with determinantal point processes. arXiv:1605.00361, May 2016.
[4] A. Belhadji, R. Bardenet, and P. Chainais. A determinantal point process for column subset selection. arXiv:1812.09771, 2018.
[5] A. Berlinet and Ch. Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.
[6] A. Björck and G. H. Golub. Numerical methods for computing angles between linear subspaces. Mathematics of Computation, 27(123):579–594, 1973.
[7] B. D. Bojanov. Uniqueness of the optimal nodes of quadrature formulae. Mathematics of Computation, 36(154):525–546, 1981.
[8] F. X. Briol, C. Oates, M. Girolami, and M. A. Osborne. Frank-Wolfe Bayesian quadrature: Probabilistic integration with theoretical guarantees. In Advances in Neural Information Processing Systems, pages 1162–1170, 2015.
