Worst-case optimal approximation with increasingly flat Gaussian kernels
Toni Karvonen, Simo Särkkä

Abstract
We study worst-case optimal approximation of positive linear functionals in reproducing kernel Hilbert spaces induced by increasingly flat Gaussian kernels. This provides a new perspective and some generalisations to the problem of interpolation with increasingly flat radial basis functions. When the evaluation points are fixed and unisolvent, we show that the worst-case optimal method converges to a polynomial method. In an additional one-dimensional extension, we allow also the points to be selected optimally and show that in this case convergence is to the unique Gaussian quadrature type method that achieves the maximal polynomial degree of exactness. The proofs are based on an explicit characterisation of the reproducing kernel Hilbert space of the Gaussian kernel in terms of exponentially damped polynomials.
Department of Electrical Engineering and Automation, Aalto University, Espoo, Finland
Keywords: Worst-case analysis, Reproducing kernel Hilbert spaces, Gaussian kernel, Gaussian quadrature
1 Introduction
Most popular kernels used in scattered data approximation [Fasshauer and McCourt, 2015, Wendland, 2005] and Gaussian process regression [Rasmussen and Williams, 2006] are isotropic (i.e., radial basis functions), depending only on the Euclidean distance between the points:
$$K_\ell(x, x') = \varphi\bigg( \frac{\|x - x'\|}{\ell} \bigg) \tag{1}$$
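for a continuous positive-definite function $\varphi$ and a length-scale parameter $\ell > 0$. Given any function $f \colon \mathbb{R}^d \to \mathbb{R}$ evaluated at distinct points $X = \{x_1, \ldots, x_N\} \subset \mathbb{R}^d$, such a kernel can be used to construct a unique kernel interpolant $s_{\ell,f}$ based on the translates $\{K_\ell(\cdot, x_n)\}_{n=1}^N$. The kernel interpolant is
$$s_{\ell,f}(x) = \sum_{n=1}^N f(x_n) \, u_{\ell,n}(x), \tag{2}$$
where $u_{\ell,n}$ are the Lagrange cardinal functions that solve the linear system
$$\begin{bmatrix} K_\ell(x_1, x_1) & \cdots & K_\ell(x_1, x_N) \\ \vdots & \ddots & \vdots \\ K_\ell(x_N, x_1) & \cdots & K_\ell(x_N, x_N) \end{bmatrix} \begin{bmatrix} u_{\ell,1}(x) \\ \vdots \\ u_{\ell,N}(x) \end{bmatrix} = \begin{bmatrix} K_\ell(x, x_1) \\ \vdots \\ K_\ell(x, x_N) \end{bmatrix} \tag{3}$$
and satisfy $u_{\ell,n}(x_m) = \delta_{nm}$. Uniqueness of the solution for each $x$ is guaranteed by positive-definiteness of the matrix on the left-hand side of this system.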
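The construction of the cardinal functions and the interpolant can be sketched numerically. The following is a minimal one-dimensional illustration, assuming the Gaussian form $\exp(-(x - x')^2 / (2\ell^2))$ for the kernel; the function names, the test function, the points, and the length-scale are hypothetical choices made only for this example.

```python
import numpy as np

def gaussian_kernel(x, y, ell):
    """Matrix of exp(-(x_i - y_j)^2 / (2 ell^2)) for 1-D arrays x and y (assumed kernel form)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.exp(-(x[:, None] - y[None, :]) ** 2 / (2.0 * ell ** 2))

def cardinal_functions(points, x_eval, ell):
    """Solve the kernel linear system for every evaluation point at once.

    Returns U with U[n, j] = u_n(x_eval[j]).
    """
    K = gaussian_kernel(points, points, ell)   # left-hand side matrix
    k = gaussian_kernel(points, x_eval, ell)   # one right-hand side per column
    return np.linalg.solve(K, k)

def kernel_interpolant(points, values, x_eval, ell):
    """Evaluate s(x) = sum_n f(x_n) u_n(x)."""
    return np.asarray(values, dtype=float) @ cardinal_functions(points, x_eval, ell)

# Hypothetical data: four points and samples of sin(2x).
points = np.array([-1.0, -0.3, 0.4, 1.0])
values = np.sin(2.0 * points)
ell = 0.8

U_at_points = cardinal_functions(points, points, ell)          # should be close to the identity
s_at_points = kernel_interpolant(points, values, points, ell)  # should reproduce the data
print(np.round(U_at_points, 8))
```

Evaluating the cardinal functions at the data points recovers the identity matrix up to rounding, which is the cardinality property stated above.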
When $\ell \to \infty$, the kernel becomes increasingly flat and the linear system (3) increasingly ill-conditioned. (Footnote 1: Most of the literature we cite parametrises the kernel in terms of the inverse length-scale $\varepsilon = 1/\ell$ and accordingly considers the case $\varepsilon \to 0$.) Nevertheless, the corresponding kernel interpolant is typically well-behaved at this limit. Starting with the work of Driscoll and Fornberg [2002], it has been shown that a certain unisolvency assumption on $X$ implies that the kernel interpolant converges to (i) a polynomial interpolant if the kernel is infinitely smooth [Driscoll and Fornberg, 2002, Fornberg et al., 2004, Larsson and Fornberg, 2005, Lee et al., 2007, Schaback, 2005, 2008] or (ii) a polyharmonic spline interpolant if the kernel is finitely smooth [Lee et al., 2014, Song et al., 2012]. Further generalisations appear in [Lee et al., 2015]. The former case covers kernels such as Gaussians, multiquadrics, and inverse multiquadrics while the latter applies to, for example, Matérn kernels and Wendland’s functions. Among the most interesting of these results is the one by Schaback [2005] who proved that the interpolant at the increasingly flat limit of the Gaussian kernel
$$K_\ell(x, x') = \exp\bigg( -\frac{\|x - x'\|^2}{2\ell^2} \bigg) \tag{4}$$
exists regardless of the geometry of $X$ and coincides with the de Boor and Ron polynomial interpolant [de Boor, 1994, de Boor and Ron, 1992]. Furthermore, numerical ill-conditioning for large $\ell$, mentioned above, has necessitated the development of techniques for stable evaluation of the kernel interpolant [Cavoretto et al., 2015, Fasshauer and McCourt, 2012, Fornberg et al., 2013, Wright and Fornberg, 2017]. Increasingly flat kernels have also been discussed independently in the literature on the use of Gaussian processes for numerical integration [Minka, 2000, O’Hagan, 1991, Särkkä et al., 2016], albeit accompanied only by non-rigorous arguments. Even though the intuition that the lowest degree terms in the Taylor expansion of the kernel dominate construction of the interpolant as $\ell \to \infty$ and that this ought to imply convergence to a polynomial interpolant is quite clear, this is not always translated into transparent proofs.
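The two effects described above, growing ill-conditioning and convergence of the interpolant to a polynomial interpolant, can be observed in a small experiment. This is only an illustrative sketch in one dimension, assuming the Gaussian kernel $\exp(-(x - x')^2/(2\ell^2))$; the three points, the test function $e^x$, and the length-scales are hypothetical choices, and the direct solve used here is not one of the stable evaluation techniques cited in the text.

```python
import numpy as np

def gram(points, ell):
    """Gaussian kernel matrix on the point set (assumed kernel form)."""
    diff = points[:, None] - points[None, :]
    return np.exp(-diff ** 2 / (2.0 * ell ** 2))

def gaussian_interpolant(points, values, x_eval, ell):
    """Kernel interpolant evaluated by a direct (not stabilised) linear solve."""
    cross = np.exp(-(points[:, None] - x_eval[None, :]) ** 2 / (2.0 * ell ** 2))
    return values @ np.linalg.solve(gram(points, ell), cross)

# Hypothetical setup: three points, which are unisolvent for quadratics.
points = np.array([-1.0, 0.0, 1.0])
values = np.exp(points)
x_eval = np.linspace(-1.0, 1.0, 21)

# Quadratic polynomial interpolant through the same data.
poly_values = np.polyval(np.polyfit(points, values, deg=2), x_eval)

length_scales = [2.0, 8.0, 32.0]
conds = [float(np.linalg.cond(gram(points, ell))) for ell in length_scales]
gaps = [float(np.max(np.abs(gaussian_interpolant(points, values, x_eval, ell) - poly_values)))
        for ell in length_scales]

for ell, c, g in zip(length_scales, conds, gaps):
    print(f"ell = {ell:5.1f}   cond(K) = {c:.3e}   max |s - p| = {g:.3e}")
```

For moderate length-scales such as these, double precision is still sufficient; for much larger $\ell$ the direct solve would lose accuracy, which is the motivation for the stable algorithms mentioned above.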
The purpose of this article is to generalise the aforementioned results on flat limits of kernel interpolants to worst-case optimal approximation of general positive linear functionals in the reproducing kernel Hilbert space (RKHS) of the Gaussian kernel (4). That such generalisations are possible is perhaps not surprising; it is rather the simple proof technique made possible by the worst-case framework and an explicit characterisation [Minh, 2010] of the Gaussian RKHS that we find the most interesting aspect of the present work.
1.1 Worst-case optimal approximation
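Let $\Omega$ be a subset of $\mathbb{R}^d$ with a non-empty interior and $L$ a positive linear functional acting on continuous real-valued functions defined on $\Omega$ and satisfying $|L(p)| < \infty$ for every polynomial $p$ on $\Omega$. The functionals most often discussed in this article are the point evaluation and the integration functionals
$$L_x(f) := f(x) \quad \text{and} \quad L_\mu(f) := \int_\Omega f \, \mathrm{d}\mu \tag{5}$$
(for a measure $\mu$ on $\Omega$), respectively. Derivative evaluation functionals are also often considered. A cubature rule (quadrature if $d = 1$) with the distinct points $X = \{x_1, \ldots, x_N\} \subset \Omega$ and weights $w = (w_1, \ldots, w_N) \in \mathbb{R}^N$ is a weighted approximation to $L$ of the form
$$Q(f) := \sum_{n=1}^N w_n f(x_n) \approx L(f). \tag{6}$$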
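When restricted to $\Omega$, the positive-definite kernel $K_\ell$ in (1) induces a unique reproducing kernel Hilbert space $\mathcal{H}_\ell(\Omega)$ where the reproducing property $\langle f, K_\ell(\cdot, x) \rangle_\ell = f(x)$ holds for every $f \in \mathcal{H}_\ell(\Omega)$ and $x \in \Omega$. With minor modifications everything in this section holds also when the kernel is not isotropic. Because the kernel is isotropic and hence bounded, the assumption that $L(p)$ is finite if $p$ is a polynomial (in particular a constant) guarantees that $L(K_\ell(\cdot, x))$ is finite for any $x \in \Omega$ and consequently that $L(f)$ is finite for any $f \in \mathcal{H}_\ell(\Omega)$.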
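The worst-case error of the cubature rule (6) in $\mathcal{H}_\ell(\Omega)$ is
$$e_\ell(Q) := \sup_{\|f\|_\ell \le 1} |L(f) - Q(f)|, \tag{7}$$
where $\|\cdot\|_\ell$ is the norm of $\mathcal{H}_\ell(\Omega)$. Given a fixed set $X$ of distinct points, we are interested in the kernel cubature rule $Q_\ell$ whose weights $w_\ell = (w_{\ell,1}, \ldots, w_{\ell,N})$ are chosen so as to minimise the worst-case error:
$$w_\ell = \operatorname*{arg\,min}_{w \in \mathbb{R}^N} \, \sup_{\|f\|_\ell \le 1} \bigg| L(f) - \sum_{n=1}^N w_n f(x_n) \bigg|.$$
These weights are unique and available as the solution to the linear system [Oettershagen, 2017, Section 3.2]
$$\begin{bmatrix} K_\ell(x_1, x_1) & \cdots & K_\ell(x_1, x_N) \\ \vdots & \ddots & \vdots \\ K_\ell(x_N, x_1) & \cdots & K_\ell(x_N, x_N) \end{bmatrix} \begin{bmatrix} w_{\ell,1} \\ \vdots \\ w_{\ell,N} \end{bmatrix} = \begin{bmatrix} L(K_\ell(\cdot, x_1)) \\ \vdots \\ L(K_\ell(\cdot, x_N)) \end{bmatrix}. \tag{8}$$
Although our notation does not make this explicit, the weights obviously depend on the linear functional $L$ and the evaluation points $X$. For each $x \in \Omega$, the kernel interpolant now arises as the kernel cubature rule for approximation of the point evaluation functional $L_x$ in (5) and the Lagrange functions of (3) are $u_{\ell,n}(x) = w_{\ell,n}$. In this case the worst-case error coincides with the power function [Schaback, 1993]. For an arbitrary $L$, the kernel cubature rule can be obtained by applying $L$ to the kernel interpolant $s_{\ell,f}$ in (2):
$$Q_\ell(f) = L(s_{\ell,f}) = \sum_{n=1}^N f(x_n) \, L(u_{\ell,n}).$$
That is, the weights are $w_{\ell,n} = L(u_{\ell,n})$.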
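As a sanity check of the two descriptions of the weights (solving the linear system versus applying $L$ to the cardinal functions), the following sketch compares them for integration over $[-1, 1]$ with respect to the Lebesgue measure. It assumes the Gaussian kernel $\exp(-(x - x')^2/(2\ell^2))$, for which the integral of a kernel translate has a closed form in terms of the error function; the function names, points, and length-scale are hypothetical, and the trapezoidal rule is used only as an independent numerical reference.

```python
import math
import numpy as np

def kernel_mean(x, ell, a=-1.0, b=1.0):
    """Closed form of the integral of exp(-(t - x)^2 / (2 ell^2)) over t in [a, b]."""
    scale = math.sqrt(2.0) * ell
    return ell * math.sqrt(math.pi / 2.0) * (
        math.erf((b - x) / scale) - math.erf((a - x) / scale))

def kernel_cubature_weights(points, ell):
    """Worst-case optimal weights: solve K w = z with z_n = L(K(., x_n))."""
    K = np.exp(-(points[:, None] - points[None, :]) ** 2 / (2.0 * ell ** 2))
    z = np.array([kernel_mean(x, ell) for x in points])
    return np.linalg.solve(K, z)

def weights_from_cardinal_functions(points, ell, n_grid=20001):
    """Same weights as w_n = L(u_n), integrating the cardinal functions numerically."""
    t = np.linspace(-1.0, 1.0, n_grid)
    K = np.exp(-(points[:, None] - points[None, :]) ** 2 / (2.0 * ell ** 2))
    cross = np.exp(-(points[:, None] - t[None, :]) ** 2 / (2.0 * ell ** 2))
    U = np.linalg.solve(K, cross)                            # U[n, j] = u_n(t_j)
    h = t[1] - t[0]
    return h * (U.sum(axis=1) - 0.5 * (U[:, 0] + U[:, -1]))  # trapezoidal rule

# Hypothetical point set and length-scale.
points = np.array([-0.8, 0.1, 0.7])
ell = 1.0
w_direct = kernel_cubature_weights(points, ell)
w_cardinal = weights_from_cardinal_functions(points, ell)
print(w_direct)
print(w_cardinal)
```

The two weight vectors agree up to the discretisation error of the trapezoidal rule.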
1.2 Contributions
Recall that we only consider the Gaussian kernel (4). This article contains two main theoretical contributions:
- In Section 2 we prove that if $X$ is unisolvent with respect to a full polynomial space $\Pi_m^d$ and $N = \dim \Pi_m^d$, then the kernel cubature rule $Q_\ell$ converges (as $\ell \to \infty$) to the unique cubature rule that is exact, in the sense of returning $L(p)$, for every polynomial $p$ of degree at most $m$. This result, contained in Theorem 2.4 and Corollary 2.5, is a generalisation to arbitrary positive linear functionals of the interpolation results cited earlier. If $\Omega$ is bounded, the results hold for any positive linear functional satisfying the mild assumptions imposed earlier. However, boundedness of $\Omega$ is not necessary: at the end of Section 2 we supply an example involving integration over $\mathbb{R}^d$ with respect to the Gaussian measure.
- In Section 3 we present a generalisation, based on a theorem of Barrow [1978], for optimal kernel quadrature rules [Oettershagen, 2017, Chapter 5] that have both their points and weights selected so as to minimise the worst-case error. The result, Theorem 3.4, states that such rules, if unique, converge to the $N$-point Gaussian quadrature rule for the functional $L$, which is the unique quadrature rule that returns $L(p)$ for every polynomial $p$ of degree at most $2N - 1$. This partially settles a conjecture posed by O’Hagan [1991, Section 3.3], and further discussed in [Minka, 2000, Särkkä et al., 2016], on convergence of optimal kernel quadrature rules to Gaussian quadrature rules.
Some generalisations for other kernels and cubature rules of more general form than (6) are briefly discussed in Section 4.
2 Fixed points
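The following theorem, which provides a characterisation of the RKHS of the Gaussian kernel (4), is the central tool of this article. This result is due to Steinwart et al. [2006] and Minh [2010]; see also [Steinwart and Christmann, 2008, Section 4.4] and [De Marchi and Schaback, 2009, Example 3]. In this theorem (and the remainder of the article) $\mathbb{N}_0^d$ stands for the collection of $d$-dimensional non-negative multi-indices: $\mathbb{N}_0^d = \{ \alpha = (\alpha_1, \ldots, \alpha_d) : \alpha_1, \ldots, \alpha_d \in \mathbb{N}_0 \}$. The absolute value and factorial of $\alpha \in \mathbb{N}_0^d$ are $|\alpha| = \alpha_1 + \cdots + \alpha_d$ and $\alpha! = \alpha_1! \cdots \alpha_d!$.
Theorem 2.1 (Steinwart 2006; Minh 2010)
Let $\Omega$ be a subset of $\mathbb{R}^d$ with a non-empty interior. Then the RKHS $\mathcal{H}_\ell(\Omega)$ induced by the Gaussian kernel (4) with length-scale $\ell$ consists of the functions
$$f(x) = e^{-\|x\|^2 / (2\ell^2)} \sum_{\alpha \in \mathbb{N}_0^d} f_\alpha x^\alpha \quad \text{such that} \quad \|f\|_\ell^2 = \sum_{\alpha \in \mathbb{N}_0^d} \ell^{2|\alpha|} \alpha! \, f_\alpha^2 < \infty, \tag{9}$$
where convergence is absolute. Its inner product is $\langle f, g \rangle_\ell = \sum_{\alpha \in \mathbb{N}_0^d} \ell^{2|\alpha|} \alpha! \, f_\alpha g_\alpha$. Furthermore, the collection
$$\bigg\{ \frac{1}{\ell^{|\alpha|} \sqrt{\alpha!}} \, x^\alpha e^{-\|x\|^2 / (2\ell^2)} \bigg\}_{\alpha \in \mathbb{N}_0^d} \tag{10}$$
of functions forms an orthonormal basis of $\mathcal{H}_\ell(\Omega)$.
Two crucial implications of this theorem are that $\mathcal{H}_\ell(\Omega)$ consists of functions expressible as series of exponentially damped polynomials, the damping effect vanishing as $\ell \to \infty$, and that, due to the terms $\ell^{2|\alpha|} \alpha!$ appearing in the RKHS norm, the high-degree terms contribute the most to the norm. Consequently, the worst-case error (7), taking into account only functions of at most unit norm, is dominated by low-degree terms when $\ell$ is large. The rest of this section formalises this intuition.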
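The role of the orthonormal basis can be checked numerically: for a reproducing kernel, summing products of orthonormal basis functions recovers the kernel. The sketch below does this in one dimension, assuming the kernel $\exp(-(x - y)^2/(2\ell^2))$ and the corresponding basis functions $x^k e^{-x^2/(2\ell^2)} / (\ell^k \sqrt{k!})$; the function names, evaluation points, and truncation level are hypothetical.

```python
import math

def basis_function(k, x, ell):
    """One-dimensional basis element x^k exp(-x^2 / (2 ell^2)) / (ell^k sqrt(k!)) (assumed form)."""
    return x ** k * math.exp(-x ** 2 / (2.0 * ell ** 2)) / (ell ** k * math.sqrt(math.factorial(k)))

def truncated_kernel(x, y, ell, n_terms):
    """Partial sum of basis_function(k, x) * basis_function(k, y) over k < n_terms."""
    return sum(basis_function(k, x, ell) * basis_function(k, y, ell) for k in range(n_terms))

# Hypothetical evaluation points; the truncation level is chosen generously.
x, y = 0.7, -0.4
results = {}
for ell in (1.0, 3.0):
    exact = math.exp(-(x - y) ** 2 / (2.0 * ell ** 2))
    approx = truncated_kernel(x, y, ell, n_terms=25)
    results[ell] = (exact, approx)
    print(f"ell = {ell}: kernel = {exact:.15f}, truncated series = {approx:.15f}")
```

The factor $1/(\ell^k \sqrt{k!})$ shows why high-degree terms matter less and less as $\ell$ grows: a unit-norm function can carry only a small amount of a high-degree damped monomial.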
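Let $\Pi_m^d$ stand for the space of $d$-variate polynomials of degree at most $m$:
$$\Pi_m^d := \operatorname{span} \{ x^\alpha : \alpha \in \mathbb{N}_0^d, \, |\alpha| \le m \}.$$
In this section we assume that the point set $X$ is $\Pi_m^d$-unisolvent. That is,
$$N = \dim \Pi_m^d = \binom{m + d}{d}$$
and the zero function is the only element of $\Pi_m^d$ that vanishes on $X$. This is equivalent to non-singularity of the (generalised) Vandermonde matrix
$$V := \big[ x_n^{\alpha_k} \big]_{n,k=1}^N, \tag{11}$$
where $\alpha_1$, …, $\alpha_N$ are the multi-indices satisfying $|\alpha| \le m$. It follows that there is a unique polynomial cubature rule $Q_{\mathrm{pol}}$ such that $Q_{\mathrm{pol}}(p) = L(p)$ for every $p \in \Pi_m^d$. Its weights $w^{\mathrm{pol}}$ solve the linear system $V^\mathsf{T} w^{\mathrm{pol}} = b$ of $N$ equations, where the $N$-vector $b$ has the elements $b_k = L(x^{\alpha_k})$. In this section we prove that the worst-case optimal weights $w_\ell$ for the Gaussian kernel (4) converge to $w^{\mathrm{pol}}$ as $\ell \to \infty$.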
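A small one-dimensional instance of the polynomial cubature rule may help fix ideas. The sketch below, with hypothetical names, solves the transposed Vandermonde system for three points on $[-1, 1]$ and the Lebesgue integral; for the equispaced points $-1, 0, 1$ the result is Simpson's rule with weights $(1/3, 4/3, 1/3)$.

```python
import numpy as np

def polynomial_weights(points, moments):
    """Weights of the rule exact for 1, x, ..., x^(N-1), given moments[k] = L(x^k)."""
    n_points = len(points)
    V = np.vander(points, n_points, increasing=True)   # V[n, k] = x_n^k
    # Exactness for x^k reads sum_n w_n x_n^k = L(x^k), i.e. V^T w = moments.
    return np.linalg.solve(V.T, moments)

points = np.array([-1.0, 0.0, 1.0])
# Moments of the Lebesgue measure on [-1, 1]: integral of x^k is 2/(k+1) for even k, 0 for odd k.
moments = np.array([2.0, 0.0, 2.0 / 3.0])
w_pol = polynomial_weights(points, moments)
print(w_pol)
```

Unisolvency corresponds here simply to the points being distinct, which makes the Vandermonde matrix non-singular.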
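Define then
$$\varphi_\alpha^\ell(x) := e^{-\|x\|^2 / (2\ell^2)} x^\alpha \tag{12}$$
so that functions in the Gaussian RKHS, characterised by Theorem 2.1, are of the form $f = \sum_{\alpha \in \mathbb{N}_0^d} f_\alpha \varphi_\alpha^\ell$ for coefficients $f_\alpha$ decaying sufficiently fast. Since the exponential function has no real roots, the determinant of the matrix
$$V_\ell := \big[ \varphi_{\alpha_k}^\ell(x_n) \big]_{n,k=1}^N \tag{13}$$
satisfies $\det V_\ell = \exp\big( -\sum_{n=1}^N \|x_n\|^2 / (2\ell^2) \big) \det V \neq 0$, where $V$ is the Vandermonde matrix (11) built from the multi-indices $\alpha_1, \ldots, \alpha_N$ of absolute value at most $m$, and $V_\ell$ is hence non-singular. From non-singularity it follows that there are unique weights $\tilde{w}_\ell$ such that $\sum_{n=1}^N \tilde{w}_{\ell,n} \varphi_\alpha^\ell(x_n) = L(\varphi_\alpha^\ell)$ for every $\alpha$ satisfying $|\alpha| \le m$. The weights solve $V_\ell^\mathsf{T} \tilde{w}_\ell = b_\ell$, where the $N$-vector $b_\ell$ has the elements $L(\varphi_{\alpha_k}^\ell)$. (Footnote 2: See [Fasshauer and McCourt, 2012] for an interpolation method based on a closely related basis derived from a Mercer eigendecomposition of the Gaussian kernel and [Karvonen and Särkkä, 2019] for an explicit construction of weights similar to $\tilde{w}_\ell$ in the case $L$ is the Gaussian integral.) This auxiliary cubature rule plays an important role in our argument. To summarise, the following three weights (or sequences of weights) appear in the proofs below:
1. The weights $w_\ell$, solved from (8), are the worst-case optimal weights for the Gaussian kernel (4). The results concern the behaviour of these weights as $\ell \to \infty$.
2. The weights $w^{\mathrm{pol}}$ are constructed such that the cubature rule defined by them is exact for all polynomials up to degree $m$: the rule returns $L(p)$ whenever $p \in \Pi_m^d$.
3. The auxiliary weights $\tilde{w}_\ell$ satisfy $\sum_{n=1}^N \tilde{w}_{\ell,n} \varphi_\alpha^\ell(x_n) = L(\varphi_\alpha^\ell)$ for every $\ell > 0$ and $|\alpha| \le m$.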
Lemma 2.2
Suppose that is -unisolvent and for every . Then there is a constant such that for any .
Proof
The assumption and unisolvency of imply that . Because for any polynomial , both the weights and are finite, which implies the claim. ∎
Lemma 2.3
Suppose that is -unisolvent and for every . If is any sequence of weights such that
[TABLE]
then .
Proof
We have and
[TABLE]
where each of the terms on the right-hand side vanishes as . Because and is non-singular, we conclude that . ∎
We are ready to prove the main result of the article for a fixed -unisolvent point set consisting of distinct points. First, by considering one of the basis functions (10) we show that for every . Second, the sub-optimal cubature rule defined above can be used, in combination with (9), to establish the upper bound . These two bounds imply that for every . If , Lemma 2.3 then implies that .
Theorem 2.4
Let for some and be -unisolvent. Suppose that for every such that and that
[TABLE]
for some and any sequence such that . Then
[TABLE]
where are the weights of the unique polynomial cubature rule such that for every .
Proof
For every select the function
[TABLE]
From Theorem 2.1 it follows that since is one of the basis functions (10). Thus, by definition of the worst-case error,
[TABLE]
Next we derive an appropriate upper bound on by considering the unique sub-optimal cubature rule that is exact for every with . In the expansion (9) of a function in we have for every term with . Consequently, the worst-case error admits the bound
[TABLE]
where are the coefficients that define in Theorem 2.1. A consequence of (9) is that implies for some real numbers such that . Therefore, for ,
[TABLE]
by assumption (14). Moreover, because
[TABLE]
for some and every , we have
[TABLE]
where follows from convergence of the last term and Lemma 2.2. Thus
[TABLE]
when . Since is worst-case optimal, we have thus established with (15) and (16) that, for sufficiently large ,
[TABLE]
for every such that and a constant independent of . That is,
[TABLE]
The claim then follows by setting in Lemma 2.3. ∎
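The assumptions of Theorem 2.4 hold, for instance, if the domain is bounded.
Corollary 2.5
Let $N = \dim \Pi_m^d$ for some $m \in \mathbb{N}_0$ and $X$ be $\Pi_m^d$-unisolvent. Suppose that $\Omega$ is bounded. Then
$$\lim_{\ell \to \infty} w_\ell = w^{\mathrm{pol}},$$
where $w_\ell$ are the worst-case optimal weights for the Gaussian kernel (4) and $w^{\mathrm{pol}}$ are the weights of the unique polynomial cubature rule $Q_{\mathrm{pol}}$ such that $Q_{\mathrm{pol}}(p) = L(p)$ for every $p \in \Pi_m^d$.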
Proof
On a bounded domain the convergence as is uniform. Thus
[TABLE]
as for every . Assumption (14) is also satisfied:
[TABLE]
where for and finiteness follows from the assumption . ∎
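The convergence asserted by the corollary can be observed numerically on a bounded domain. The following sketch assumes integration over $[-1, 1]$ with respect to the Lebesgue measure, the Gaussian kernel $\exp(-(x - x')^2/(2\ell^2))$, and the points $-1, 0, 1$, for which the polynomial rule of degree two is Simpson's rule; the function name and the chosen length-scales are hypothetical.

```python
import math
import numpy as np

def kernel_weights(points, ell, a=-1.0, b=1.0):
    """Worst-case optimal weights for Lebesgue integration over [a, b] (Gaussian kernel assumed)."""
    K = np.exp(-(points[:, None] - points[None, :]) ** 2 / (2.0 * ell ** 2))
    scale = math.sqrt(2.0) * ell
    # Closed-form integrals of the kernel translates over [a, b].
    z = np.array([ell * math.sqrt(math.pi / 2.0)
                  * (math.erf((b - x) / scale) - math.erf((a - x) / scale))
                  for x in points])
    return np.linalg.solve(K, z)

points = np.array([-1.0, 0.0, 1.0])
w_pol = np.array([1.0, 4.0, 1.0]) / 3.0        # Simpson's rule on [-1, 1]

length_scales = [2.0, 8.0, 32.0]
errors = [float(np.max(np.abs(kernel_weights(points, ell) - w_pol))) for ell in length_scales]
for ell, err in zip(length_scales, errors):
    print(f"ell = {ell:5.1f}   max |w_ell - w_pol| = {err:.3e}")
```

The direct solve is adequate for these length-scales only; it is not meant as a stable method for very flat kernels.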
However, boundedness of is not necessary. Consider Gaussian integration:
[TABLE]
If has an odd element, for every by symmetry. If for some the convergence as follows from the monotone convergence theorem. To verify (14), recall that the absolute moments of the standard Gaussian distribution are
[TABLE]
where is the Gamma function. Because for any and
[TABLE]
if is odd, we have
[TABLE]
Thus
[TABLE]
if .
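The Gaussian integration example can also be explored numerically in one dimension. This is a sketch under stated assumptions: the functional is the integral against the standard normal density, the kernel is $\exp(-(x - x')^2/(2\ell^2))$, and the integral of a kernel translate is then $\ell (\ell^2 + 1)^{-1/2} \exp(-x^2 / (2(\ell^2 + 1)))$ by the usual Gaussian convolution formula. The point set $\{-2, 0, 2\}$, the function name, and the length-scales are hypothetical.

```python
import math
import numpy as np

def gaussian_measure_kernel_weights(points, ell):
    """Worst-case optimal weights for integration against N(0, 1) (Gaussian kernel assumed)."""
    K = np.exp(-(points[:, None] - points[None, :]) ** 2 / (2.0 * ell ** 2))
    # Closed-form Gaussian integrals of the kernel translates.
    z = ell / math.sqrt(ell ** 2 + 1.0) * np.exp(-points ** 2 / (2.0 * (ell ** 2 + 1.0)))
    return np.linalg.solve(K, z)

points = np.array([-2.0, 0.0, 2.0])

# Polynomial rule exact for 1, x, x^2: match the moments 1, 0, 1 of N(0, 1).
V = np.vander(points, 3, increasing=True)
w_pol = np.linalg.solve(V.T, np.array([1.0, 0.0, 1.0]))

length_scales = [2.0, 8.0, 32.0]
errors = [float(np.max(np.abs(gaussian_measure_kernel_weights(points, ell) - w_pol)))
          for ell in length_scales]
for ell, err in zip(length_scales, errors):
    print(f"ell = {ell:5.1f}   max |w_ell - w_pol| = {err:.3e}")
```

For this point set the polynomial weights are $(1/8, 3/4, 1/8)$, and the kernel weights approach them as the length-scale grows even though the domain is unbounded.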
3 Optimal points in one dimension
Let and for . In this section we consider quadrature rules whose points are also selected so as to minimise the worst-case error. A kernel quadrature rule is optimal if its points and weights satisfy
[TABLE]
In order to eliminate degrees of freedom in ordering the points we require that the points are in ascending order (i.e., ). Even though optimal kernel quadrature rules have been studied since the 1970s [Barrar et al., 1974, Bojanov, 1979, Larkin, 1970, Richter, 1970, Richter-Dyn, 1971] for the integration functional , , their theory is still not complete (the main results have been recently collated by Oettershagen [2017, Section 5.1]). Although uniqueness results have been proved only for totally positive isotropic kernels of the form (1) and integration when [Braess and Dyn, 1982], there exists numerical evidence suggesting that the optimal rule is unique in more general settings [Oettershagen, 2017, p. 97]. Note that the Gaussian kernel (4) we consider is totally positive.
In Theorem 3.4 we show that uniqueness of an optimal kernel quadrature rule for each $\ell$ implies that its increasingly flat limit is the $N$-point Gaussian quadrature rule for the linear functional $L$. This is the unique quadrature rule that is exact for every polynomial of degree at most $2N - 1$: it returns $L(p)$ whenever $\deg p \le 2N - 1$. This degree of exactness is maximal; there are no $N$-point quadrature rules exact for all polynomials up to degree $2N$. The most familiar methods of this type are of course the classical Gaussian quadrature rules for numerical integration [Gautschi, 2004, Section 1.4]. For example, the Gauss–Legendre quadrature rule satisfies
$$\sum_{n=1}^N w_n p(x_n) = \int_{-1}^1 p(x) \, \mathrm{d}x$$
for every polynomial $p$ of degree at most $2N - 1$ and its points are the roots of the $N$th degree Legendre polynomial. Theorem 3.4 was conjectured by O’Hagan [1991, Section 3.3] in the form that the optimal kernel quadrature rule has the classical Gauss–Hermite quadrature rule as its increasingly flat limit if the kernel is Gaussian and $L$ is the Gaussian integral. More discussion of this conjecture, but no rigorous proofs, can be found in [Minka, 2000, Section 4].
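The maximal degree of exactness of the Gauss–Legendre rule is easy to verify with NumPy's `numpy.polynomial.legendre.leggauss`, which returns the nodes and weights on $[-1, 1]$. The sketch below uses $N = 3$ as an arbitrary illustrative choice and compares the rule against exact monomial integrals; the helper name is hypothetical.

```python
import numpy as np

def monomial_integral(k):
    """Integral of x^k over [-1, 1]."""
    return 2.0 / (k + 1) if k % 2 == 0 else 0.0

N = 3
nodes, weights = np.polynomial.legendre.leggauss(N)

# Quadrature error for the monomials 1, x, ..., x^(2N).
errors = [abs(float(weights @ nodes ** k) - monomial_integral(k)) for k in range(2 * N + 1)]
for k, err in enumerate(errors):
    print(f"degree {k}: error = {err:.3e}")
```

The errors vanish up to rounding for degrees $0, \ldots, 2N - 1$ and become visibly non-zero at degree $2N$.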
The proof of Theorem 3.4 is based on a general result by Barrow [1978] on existence and uniqueness of generalised Gaussian quadrature rules. This result replaces the polynomials in a Gaussian quadrature rule with generalised polynomials formed out of functions that constitute an extended Chebyshev system [Karlin and Studden, 1966, Chapter 1]. A collection of functions is an extended Chebyshev system if any non-trivial linear combination of the functions has at most zeroes, counting multiplicities. That is, if and for , , and , then . Any basis of the space of polynomials of degree at most is an extended Chebyshev system. Importantly, the functions in (12) are an extended Chebyshev system for any . To verify this, note that any can be written as for some polynomial of degree at most and consequently
[TABLE]
for some polynomials . From this expression we see that for every if and only if for every . Since can have at most zeroes, counting multiplicities, it follows that the same is true of .
Theorem 3.1 (Barrow 1978)
Let be an extended Chebyshev system and a positive linear functional on . Then there exist unique points and positive weights such that
[TABLE]
Lemma 3.2
Let and suppose that a cubature rule with non-negative weights satisfies for some positive function such that for all . Then
[TABLE]
Proof
The claim follows immediately from the inequalities
[TABLE]
∎
Lemma 3.3
Let be a metric space, a constant, and a function. If there is a continuous function such that uniformly as and a unique minimiser for which , then a function such that has .
Proof
The inequality shows that since by assumption and by uniformity of the convergence . Because is continuous, non-negative, and has a unique minimiser , this implies that . ∎
Theorem 3.4
Suppose that for . If for every there exists a unique optimal kernel quadrature rule , then its points and weights converge to those of the -point Gaussian quadrature rule for :
[TABLE]
where and are the unique points and weights such that for every . Moreover, .
Proof
In a manner identical to the proof of Theorem 2.4, we establish the lower bound
[TABLE]
that holds for every . Because are an extended Chebyshev system, Theorem 3.1 guarantees the existence of a unique -point quadrature rule such that for every . The points of this rule are distinct and lie inside and the weights are positive. We can then replicate the rest of the proof of Theorem 2.4 in one dimension but with and Lemma 2.2 replaced with Lemma 3.2 (applied to the function ) to show that, for sufficiently large and a constant independent of ,
[TABLE]
for every . Consequently,
[TABLE]
for every . We then fix and invoke Lemma 3.3 with the function
[TABLE]
domain , and . Because the domain is bounded, for every . Thus
[TABLE]
uniformly on . Since the unique minimiser of is , the claim follows from (18) and Lemma 3.3. ∎
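A crude numerical experiment illustrates the behaviour of optimal points. The sketch makes several assumptions that are not part of the theorem: integration over $[-1, 1]$ with respect to the Lebesgue measure, the Gaussian kernel $\exp(-(x - x')^2/(2\ell^2))$, $N = 2$, and a search restricted to symmetric point sets $\{-t, t\}$ on a grid. It uses the standard identity that, for optimal weights, the squared worst-case error equals a constant minus $z^\mathsf{T} K^{-1} z$, with $z_n = L(K_\ell(\cdot, x_n))$, so minimising the error amounts to maximising that quadratic form. All function names are hypothetical.

```python
import math
import numpy as np

def kernel_mean(x, ell, a=-1.0, b=1.0):
    """Closed form of the integral of exp(-(t - x)^2 / (2 ell^2)) over t in [a, b]."""
    scale = math.sqrt(2.0) * ell
    return ell * math.sqrt(math.pi / 2.0) * (
        math.erf((b - x) / scale) - math.erf((a - x) / scale))

def explained_part(t, ell):
    """z^T K^{-1} z for the point set {-t, t}; larger means smaller worst-case error."""
    points = np.array([-t, t])
    K = np.exp(-(points[:, None] - points[None, :]) ** 2 / (2.0 * ell ** 2))
    z = np.array([kernel_mean(x, ell) for x in points])
    return float(z @ np.linalg.solve(K, z))

def best_symmetric_point(ell, grid):
    """Grid search for the best t (symmetry of the optimal rule is an assumption here)."""
    scores = [explained_part(t, ell) for t in grid]
    return float(grid[int(np.argmax(scores))])

grid = np.linspace(0.3, 0.9, 6001)
gauss_legendre_point = 1.0 / math.sqrt(3.0)    # two-point Gauss-Legendre nodes are +-1/sqrt(3)

t_opt = {ell: best_symmetric_point(ell, grid) for ell in (1.0, 5.0)}
for ell, t in t_opt.items():
    print(f"ell = {ell}: best t = {t:.4f}, Gauss-Legendre node = {gauss_legendre_point:.4f}")
```

A grid search is used instead of a gradient-based optimiser to keep the example dependency-free; it says nothing about uniqueness of the optimal rule, which the theorem takes as an assumption.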
4 Generalisations
This section discusses some straightforward generalisations of the results in Sections 2 and 3.
4.1 Damped power series kernels
Theorem 2.1 for the Gaussian kernel (4) is a consequence of the identity
[TABLE]
where is a power series kernel [Zwicknagl, 2009]. Accordingly, the results in Sections 2 and 3 can be generalised for a class of kernels that we call damped power series kernels. Let be a non-zero function and define . Then a damped power series kernel is
[TABLE]
for and weight parameters such that the series converges for any and . Arguments identical to those used in [Minh, 2010, Zwicknagl, 2009] establish that is a positive-definite kernel and that its RKHS consists of functions
[TABLE]
The Gaussian kernel is recovered by setting , , and . Note that the Gaussian kernel is an exception; damped power series kernels are rarely stationary.
Denote . If we assume that (i) is bounded, (ii) for every , and (iii) a summability condition analogous to (14) holds, then a generalisation for damped power series kernels of Theorem 2.4 is readily obtained. To generalise Theorem 3.4 we also need to assume that constitutes an extended Chebyshev system.
4.2 Taylor space kernels
Let . Taylor space kernels [Dick, 2006, Zwicknagl and Schaback, 2013] are obtained by selecting in (19). As , the corresponding kernel quadrature rules then converge to polynomial rules. Perhaps the two most interesting special cases are the exponential kernel
[TABLE]
and the Szegő kernel
[TABLE]
The Szegő kernel induces a Hardy space on a disk of radius . Interestingly, it has been pointed out already in the 1970s that approximation with the Szegő kernel yields polynomial methods as [Larkin, 1970, Section 3]. See also [Minka, 2000, Section 4]. An extensive numerical investigation has been recently published by Oettershagen [2017, Section 6.2].
4.3 General information functionals
It would also be easy to replace the cubature rule (6) with a generalised version
[TABLE]
where are any bounded linear functionals. If are such that the matrices
[TABLE]
which are generalisations of (11) and (13), are non-singular, then Theorem 2.4 and Corollary 2.5 can be generalised.
4.4 Non-unisolvent point sets
If the kernel is Gaussian but the point set $X$ is not unisolvent, Schaback [2005] has proved that the kernel interpolant (2) converges to the de Boor and Ron polynomial interpolant [de Boor, 1994, de Boor and Ron, 1992], which is the unique interpolant to $f$ at $X$ in a point-dependent polynomial space having in a certain sense minimal degree. We expect that extensions for non-unisolvent points of the results in Section 2 are possible. The kernel cubature weights would presumably converge to the weights of the cubature rule that is exact for every polynomial in this point-dependent space.
Acknowledgements.
This work was supported by the Aalto ELEC Doctoral School and the Academy of Finland. We thank the reviewers for numerous comments that helped in improving the presentation.
References
- Barrar et al. [1974] R. B. Barrar, H. L. Loeb, and H. Werner. On the existence of optimal integration formulas for analytic functions. Numerische Mathematik, 23(2):105–117, 1974.
- Barrow [1978] D. L. Barrow. On multiple node Gaussian quadrature formulae. Mathematics of Computation, 32(142):431–439, 1978.
- Bojanov [1979] B. D. Bojanov. On the existence of optimal quadrature formulae for smooth functions. Calcolo, 16(1):61–70, 1979.
- Braess and Dyn [1982] D. Braess and N. Dyn. On the uniqueness of monosplines and perfect splines of least $L_1$- and $L_2$-norm. Journal d’Analyse Mathématique, 41(1):217–233, 1982.
- Cavoretto et al. [2015] R. Cavoretto, G. E. Fasshauer, and M. McCourt. An introduction to the Hilbert-Schmidt SVD using iterated Brownian bridge kernels. Numerical Algorithms, 68(2):393–422, 2015.
- de Boor [1994] C. de Boor. Polynomial interpolation in several variables. In J. Rice and R. A. De Millo, editors, Studies in Computer Science, pages 87–109. 1994.
- de Boor and Ron [1992] C. de Boor and A. Ron. The least solution for the polynomial interpolation problem. Mathematische Zeitschrift, 210(1):347–378, 1992.
- De Marchi and Schaback [2009] S. De Marchi and R. Schaback. Nonstandard kernels and their applications. Dolomites Research Notes on Approximation, 2(1):16–43, 2009.
