TL;DR
This paper introduces a near-linear complexity method for compressing, inverting, and performing approximate PCA on dense kernel matrices derived from elliptic boundary value problems, using sparse Cholesky factorization.
Contribution
It presents a novel algorithm that efficiently approximates dense kernel matrices with provable guarantees, improving computational complexity over previous methods.
Findings
Achieves near-linear complexity for matrix compression and inversion.
Provides an approximate sparse PCA with optimal convergence rate.
Demonstrates improved performance for elliptic PDE solvers and kernel matrix operations.
Abstract
Dense kernel matrices obtained from point evaluations of a covariance function at locations arise in statistics, machine learning, and numerical analysis. For covariance functions that are Green's functions of elliptic boundary value problems and homogeneously-distributed sampling points, we show how to identify a subset , with , such that the zero fill-in incomplete Cholesky factorisation of the sparse matrix is an -approximation of . This factorisation can provably be obtained in complexity in space and in time, improving upon the state of the art for general elliptic…
| 5.26e-03 | 20000 | 0.71 | 0.81 | 0.42 | 1.25e-03 (3.68e-06) | 1.11e-03 (3.01e-06) | |
| 5.26e-03 | 20000 | 0.71 | 0.81 | 0.42 | 1.25e-03 (3.68e-06) | 1.11e-03 (3.01e-06) | |
| 2.94e-03 | 40000 | 1.21 | 1.19 | 1.00 | 1.27e-03 (3.32e-06) | 1.12e-03 (3.56e-06) | |
| 1.62e-03 | 80000 | 2.72 | 2.82 | 2.55 | 1.30e-03 (3.20e-06) | 1.21e-03 (3.29e-06) | |
| 8.91e-04 | 160000 | 6.86 | 6.03 | 6.11 | 1.28e-03 (3.57e-06) | 1.16e-03 (3.32e-06) | |
| 4.84e-04 | 320000 | 17.22 | 13.79 | 15.66 | 1.23e-03 (3.19e-06) | 1.11e-03 (2.40e-06) | |
| 2.63e-04 | 640000 | 41.40 | 31.02 | 36.02 | 1.24e-03 (2.58e-06) | 1.09e-03 (3.02e-06) | |
| 1.41e-04 | 1280000 | 98.34 | 65.96 | 85.99 | 1.23e-03 (3.72e-06) | 1.10e-03 (3.74e-06) | |
| 7.55e-05 | 2560000 | 233.92 | 148.43 | 197.52 | 1.16e-03 (2.82e-06) | 1.04e-03 (3.36e-06) |
| 1.30e-02 | 20000 | 1.61 | 1.44 | 2.94 | 1.49e-03 (5.00e-06) | 1.20e-03 (5.09e-06) | |
| 7.60e-03 | 40000 | 3.26 | 3.32 | 8.33 | 1.21e-03 (4.29e-06) | 9.91e-04 (3.72e-06) | |
| 4.35e-03 | 80000 | 7.46 | 7.64 | 22.46 | 1.06e-03 (3.74e-06) | 8.51e-04 (2.93e-06) | |
| 2.45e-03 | 160000 | 20.95 | 18.42 | 57.64 | 9.81e-04 (2.33e-06) | 7.88e-04 (3.23e-06) | |
| 1.37e-03 | 320000 | 53.58 | 40.72 | 141.46 | 9.27e-04 (2.26e-06) | 7.53e-04 (2.72e-06) | |
| 7.61e-04 | 640000 | 133.55 | 96.67 | 350.10 | 8.98e-04 (3.25e-06) | 7.25e-04 (3.02e-06) | |
| 4.19e-04 | 1280000 | 312.43 | 212.57 | 820.07 | 8.59e-04 (2.79e-06) | 7.00e-04 (2.87e-06) | |
| 2.29e-04 | 2560000 | 795.68 | 480.17 | 1981.92 | 8.96e-04 (2.76e-06) | 7.73e-04 (4.28e-06) |
| 8.78e-05 | 254666 | 38.06 | 33.72 | 17.54 | 2.04e-02 (1.73e-02) | 2.34e-02 (2.75e-02) | |
| 1.76e-04 | 964858 | 71.07 | 67.85 | 61.35 | 2.32e-03 (6.02e-06) | 2.09e-03 (7.50e-06) | |
| 2.90e-04 | 999810 | 115.07 | 112.56 | 152.93 | 3.92e-04 (1.44e-06) | 3.72e-04 (2.32e-06) | |
| 4.26e-04 | 999999 | 165.91 | 166.60 | 312.19 | 6.70e-05 (2.98e-07) | 5.68e-05 (2.55e-07) | |
| 5.83e-04 | 1000000 | 227.62 | 229.76 | 566.94 | 1.45e-05 (6.69e-08) | 1.08e-05 (5.01e-08) | |
| 7.59e-04 | 1000000 | 292.52 | 300.65 | 944.33 | 4.05e-06 (4.96e-08) | 2.10e-06 (1.69e-08) | |
| 9.53e-04 | 1000000 | 363.90 | 380.07 | 1476.71 | 1.62e-06 (2.30e-08) | 4.08e-07 (9.47e-09) | |
| 1.16e-03 | 1000000 | 447.47 | 467.07 | 2200.32 | 8.98e-07 (1.44e-08) | 1.42e-07 (5.14e-09) |
| 1.87e-04 | 998046 | 87.83 | 56.44 | 85.20 | 1.69e-02 (6.89e-04) | 1.60e-02 (3.36e-04) | |
| 5.17e-04 | 1000000 | 226.84 | 158.42 | 599.86 | 8.81e-04 (3.21e-06) | 7.15e-04 (2.99e-06) | |
| 1.05e-03 | 1000000 | 446.52 | 326.27 | 2434.52 | 1.85e-04 (5.37e-07) | 1.59e-04 (5.30e-07) | |
| 1.82e-03 | 1000000 | 747.65 | 567.06 | 7227.45 | 2.89e-05 (1.94e-07) | 1.84e-05 (1.15e-07) | |
| 2.82e-03 | 1000000 | 1344.59 | 928.27 | 17640.58 | 1.15e-05 (1.06e-07) | 5.34e-06 (5.34e-08) |
| 1000000 | 1000000 | 1000000 | 1000000 | 1000000 | 1000000 | 1000000 | 999893 | |
|---|---|---|---|---|---|---|---|---|
| 7.04e-05 | 2.89e-05 | 2.49e-05 | 3.58e-05 | 6.03e-05 | 8.77e-05 | 1.18e-04 | 1.46e-04 | |
| (3.98e-07) | (1.79e-07) | (1.11e-07) | (1.19e-07) | (2.37e-07) | (3.06e-07) | (4.52e-07) | (5.39e-07) | |
| 5.19e-05 | 1.85e-05 | 1.77e-05 | 2.82e-05 | 4.88e-05 | 6.87e-05 | 9.06e-05 | 1.13e-04 | |
| (2.26e-07) | (1.18e-07) | (8.11e-08) | (1.30e-07) | (2.37e-07) | (3.50e-07) | (5.14e-07) | (5.45e-07) |
| 999923 | 1000000 | 1000000 | 1000000 | 1000000 | 1000000 | 1000000 | 1000000 | |
|---|---|---|---|---|---|---|---|---|
| 4.65e-04 | 5.98e-05 | 2.36e-05 | 1.19e-05 | 4.84e-06 | 4.17e-06 | 2.25e-06 | 1.42e-06 | |
| (4.23e-07) | (1.56e-07) | (9.53e-08) | (6.32e-08) | (4.14e-08) | (4.99e-08) | (1.86e-08) | (1.64e-08) | |
| 3.81e-04 | 3.49e-05 | 9.83e-06 | 4.65e-06 | 1.47e-06 | 8.49e-07 | 4.25e-07 | 2.12e-07 | |
| (4.98e-07) | (1.59e-07) | (5.56e-08) | (2.63e-08) | (7.73e-09) | (1.04e-08) | (4.81e-09) | (3.24e-09) |
| 1.76e-04 | 1.77e-04 | 1.78e-04 | 1.80e-04 | 1.82e-04 | 1.84e-04 | 1.85e-04 | |
| 61.92 | 62.15 | 62.81 | 64.27 | 64.87 | 65.50 | 66.12 | |
| 1000000 | 1000000 | 1000000 | 1000000 | 1000000 | 1000000 | 1000000 | |
| 1.17e-03 | 1.11e-03 | 1.28e-03 | 1.60e-03 | 1.72e-03 | 1.89e-03 | 2.11e-03 | |
| (2.74e-06) | (3.00e-06) | (2.73e-06) | (4.28e-06) | (3.95e-06) | (5.11e-06) | (5.07e-06) |
| 1.62e-04 | 997635 | 80.60 | 57.11 | 52.49 | 1.57e-02 (1.13e-03) | |
| 3.76e-04 | 1000000 | 173.86 | 135.61 | 248.78 | 2.88e-03 (1.14e-05) | |
| 6.76e-04 | 1000000 | 302.98 | 247.74 | 748.62 | 8.80e-04 (4.97e-06) | |
| 1.05e-03 | 1000000 | 462.98 | 397.42 | 1802.44 | 3.44e-04 (2.54e-06) | |
| 1.49e-03 | 1000000 | 645.56 | 556.72 | 3696.31 | 1.44e-04 (8.76e-07) | |
| 2.02e-03 | 1000000 | 891.08 | 758.88 | 6855.23 | 7.61e-05 (5.66e-07) | |
| 2.62e-03 | 1000000 | 1248.90 | 990.86 | 11598.66 | 4.57e-05 (4.36e-07) |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
\newsiamremark
remarkRemark \newsiamremarkhypothesisHypothesis
\newsiamthmclaimClaim \newsiamthmexampleExample
\newsiamthmconditionCondition
\newsiamthmconstructionConstruction
\headersInversion of dense kernel matrices Florian Schäfer, T. J. Sullivan, and Houman Owhadi
Compression, inversion, and approximate PCA of dense kernel matrices at near-linear computational complexity
Florian Schäfer California Institute of Technology, MC 305-16, 1200 East California Boulevard, Pasadena, CA 91125, USA,
, Phone: (626) 395-3531, Fax: (626) 578-0124,
Corresponding Author [email protected]
T. J. Sullivan Mathematics Institute and School of Engineering, The University of Warwick, Coventry, CV4 7AL, UK, ; and Zuse Institute Berlin, Takustraße 7, 14195 Berlin, Germany, [email protected]
Houman Owhadi California Institute of Technology
Abstract
Dense kernel matrices obtained from point evaluations of a covariance function at locations arise in statistics, machine learning, and numerical analysis. For covariance functions that are Green’s functions of elliptic boundary value problems and homogeneously-distributed sampling points, we show how to identify a subset , with , such that the zero fill-in incomplete Cholesky factorisation of the sparse matrix is an -approximation of . This factorisation can provably be obtained in complexity in space and in time, improving upon the state of the art for general elliptic operators; we further present numerical evidence that can be taken to be the intrinsic dimension of the data set rather than that of the ambient space. The algorithm only needs to know the spatial configuration of the and does not require an analytic representation of . Furthermore, this factorization straightforwardly provides an approximate sparse PCA with optimal rate of convergence in the operator norm. Hence, by using only subsampling and the incomplete Cholesky factorization, we obtain, at nearly linear complexity, the compression, inversion and approximate PCA of a large class of covariance matrices. By inverting the order of the Cholesky factorization we also obtain a solver for elliptic PDE with complexity in space and in time, improving upon the state of the art for general elliptic operators.
keywords:
Cholesky factorization, covariance function, gamblet transform, kernel matrix, sparsity, principal component analysis
{AMS}
65F30, 42C40, 65F50, 65N55, 65N75, 60G42, 68Q25, 68W40
1 Introduction
1.1 Dense kernel matrices and the -bottleneck
Kernel matrices, i.e. square matrices of the form
[TABLE]
obtained from pointwise evaluation of a symmetric positive-definite kernel at a collection of points in a domain , play an important role in statistics, machine learning, and scientific computing. In statistics, they are used as covariance matrices of Gaussian process priors. In machine learning, they equip the feature space with a meaningful inner product via the kernel trick [44]. In scientific computing, they appear as Green’s functions (i.e. fundamental solutions) of linear elliptic partial differential equations (PDEs).
For all these applications, it is usually necessary to perform some or all of the following tasks:
- (1)
compute , given ; 2. (2)
compute , given ; 3. (3)
compute ; 4. (4)
sample from the normal/Gaussian distribution ; 5. (5)
approximate eigenspaces corresponding to the leading eigenvalues of .
The first four of these tasks can be performed by computing the Cholesky factorization of (i.e. the decomposition where is lower triangular). For many popular covariance functions, most notably those of smooth random processes, the matrices will be dense. For large this results in a computational complexity of for the Cholesky factorization and a complexity of to even store the matrix. When is sparse, i.e. has relatively few non-zero entries, better complexity can be achieved — the obvious limiting case being (i.e. linear) complexity if is diagonal. However, for practical problems, the cubic scaling restricts dense Cholesky factorization to problems with . The breadth of kernel matrices’ uses means that there is correspondingly high interest in achieving approximate Cholesky factorization of at linear or near-linear cost.
1.2 Existing approaches
Many fast methods are available for approximating dense kernel matrices and their applicability depends on specific assumptions made on . If the precision matrix is sparse and can be approximated directly (e.g. by discretizing a PDE), then sparse linear solvers can be used. These include multigrid solvers [24, 17, 38, 40] and sparse Cholesky factorization methods with nested dissection ordering [31, 30, 57, 32]. This approach has been proposed for problems arising in spatial statistics [56, 71, 72, 70]. In other situations, available methods directly approximate the covariance matrix based on low-rank approximations, sparsity, and hierarchy. Low-rank techniques such as the Nyström approximation [85, 79, 28] or rank-revealing Cholesky factorization [5, 26] seek to approximate by low-rank matrices whereas sparsity-based methods like covariance tapering [29] seek to approximate with a sparse matrix by setting entries corresponding to long-range interactions to zero. These two approximations can also be combined to obtain sparse low-rank approximations [75, 68, 77, 6, 80], which can be interpreted as imposing a particular graphical structure on the Gaussian process. When is neither sufficiently sparse nor of sufficiently low rank, these approaches can be implemented in a hierarchical manner. For low-rank methods, this leads to hierarchical (- and -) matrices [42, 39, 41], hierarchical off-diagonal low rank (HODLR) matrices [3, 4], and hierarchically semiseparable (HSS) matrices [19, 86, 54] that rely on computing low-rank approximations of sub-blocks of corresponding to far-field interactions on different scales. The interpolative factorization developed by [43] combines hierarchical low-rank structure with the sparsity obtained from an elimination ordering of nested-dissection type. Hierarchical low-rank structure was originally developed as an algebraic abstraction of the fast multipole method of [35]. In order to construct hierarchical low-rank approximations from entries of the kernel matrix efficiently, both deterministic and randomized algorithms have been proposed [9, 60]. For many popular covariance functions, including Green’s functions of elliptic PDEs [8], hierarchical matrices allow for (near-)linear-in- complexity algorithms for the inversion and approximation of , at exponential accuracy. Wavelet-based methods [13, 33], using the separation and truncation of interactions on different scales, can be seen as a hierarchical application of sparse approximation approaches. The resulting algorithms have near-linear computational complexity and rigorous error bounds for asymptotically smooth covariance functions. [25] use operator-adapted wavelets to compress the expected solution operators of random elliptic PDEs. In [50], although no rigorous accuracy estimates are provided, the authors establish the near-linear computational complexity of algorithms resulting from the multi-scale generalization of probabilistically motivated sparse and low-rank approximations [75, 68, 77, 6, 80].
1.3 Our main result and and overview of the paper
Our main result is to show that a small modification of the Cholesky factorization algorithm is both accurate and scalable, when applied to kernel matrices obtained from kernels identified as Green’s functions of elliptic PDEs and a (roughly) homogeneously distributed cloud of points. Such kernels are oftentimes used as covariance functions of smooth Gaussian processes (to enforce a smoothness prior on the function to be recovered/interpolated) and therefore a large class of popular kernels fall into this category. The cheap, accurate, approximate Cholesky factors provided by our method thereby serve tasks (1–4) from Section 1.1. We furthermore show that by reversing the elimination order we obtain a fast direct solver for elliptic PDEs.
Contrary to the present belief that fast solvers for elliptic integral operators require the use of hierarchical low-rank structure or wavelets with a high order of vanishing moments, we show that state-of-the-art performance can be obtained just by zero fill-in Cholesky factorization (which just amounts to skipping some steps in the Cholesky factorization algorithm — wavelets are only used in the detailed rigorous analysis of the algorithm). While there is a huge literature on the sparse Cholesky factorization of sparse matrices, we are not aware of any prior literature on the sparse Cholesky factorization of dense matrices.
For elliptic PDEs with arbitrary -coefficients, -matrices can be used to compute -approximate Cholesky factors of both differential and integral operators in computational complexity [8, 39, 42, 7]. -matrices can improve these complexities to [41, 14, 15]. The “fast gamblet transform” of [64, 65] can invert stiffness matrices of arbitrary elliptic operators in computational complexity . Our computational complexities of for the Cholesky factorization of differential operators and for the Cholesky factorization of integral operators improve upon the state of the art while using a much simpler algorithm.
Our method relies upon a cleverly-constructed elimination ordering and sparsity pattern, which we use in the incomplete Cholesky factorization of the matrix . Simplified versions of these constructions are given in Section 2; Section 3 gives a overview, without detailed proof, of why the method yields the desired results. In particular, Section 2.4 shows how the method provides a sparse approximate principal component analysis (PCA), thereby serving task (5).
Section 4 presents detailed numerical experiments that illustrate the power of our method, and Section 5 gives the mathematical proofs of correctness and accuracy vs. complexity. Section 8 contains concluding remarks, and some technical results are deferred to an Appendix.
2 Overview of the algorithm and its setting
In this introductory section we give a brief overview of the setting in which our theoretical results apply (the class of kernels associated to elliptic operators) and highlight its main features. All detailed numerical experiments and analysis will be deferred to Sections 4 and 5 respectively.
2.1 The class of elliptic operators
In order to establish rigorous, a priori, complexity-vs.-accuracy estimates in Section 5 we will assume that is the Green’s function of an elliptic operator of order (), defined on a bounded domain with Lipschitz boundary, and acting on , the Sobolev space of (zero boundary value) functions having derivatives of order in . More precisely, writing for the dual space of with respect to the scalar product, our rigorous estimates will be stated for an arbitrary linear bijection
[TABLE]
that is symmetric (i.e. ), positive (i.e. ), and local in the sense that
[TABLE]
Let and denote the operator norms of and . The complexity and accuracy estimates for our algorithm will depend on (and only on) , , and the parameter
[TABLE]
which is a measure of the homogeneity of the distribution of the cloud of points .
Since our algorithm only requires the locations of the points and is oblivious to the exact knowledge of , for our numerical experiments in Section 4 we will consider (2.1), general elliptic operators with or without boundary conditions (these include Matérn kernels with fractional values of ) and exponential kernels.
2.2 Zero fill-in incomplete Cholesky factorization (ICHOL(0))
A simple approach to decreasing the computational complexity of Cholesky factorization is the zero fill-in incomplete Cholesky factorization [62] (ICHOL(0)). When performing Gaussian elimination using ICHOL(0), we treat all entries of both the input matrix and the output factors outside a prescribed sparsity pattern as zero and correspondingly ignore all operations in which they are involved. Figure 2.1 shows a comparison of ordinary Cholesky factorization and ICHOL(0). Our approach to kernel matrices consists of applying Algorithm 2 with an elimination ordering and a sparsity pattern that are chosen based on the locations of the ; 5.38 gives the details of this elimination ordering and sparsity pattern.
Write \|\hbox to5.71527pt{\hss\cdot\hss}\|_{\operatorname{Fro}} for the Frobenius matrix norm and for a constant depending only on , , , , , and . To simplify notation, the asymptotic bounds in this paper are stated in the case where the logarithmic factors are at least one. Our main result is the following:
Theorem 2.1**.**
Let and be defined as in (2.1) and (2.3). For , the sparse Cholesky factor , obtained from Algorithm 2 with the elimination ordering and sparsity pattern described in 5.38, satisfies
[TABLE]
*The selection of the ordering and sparsity pattern, as well as Algorithm 2, can be performed in computational complexity in time and in space. In particular, we can obtain an -accurate approximation in Frobenius norm in complexity in time and in space. *
Remark 2.2**.**
*For problems arising in Gaussian process regression, there will typically be no domain on the boundary of which the process is conditioned to be zero; equivalently, will be all of . This introduces an additional error, but we still observe good approximation of the covariances even of points close to the boundary (see Section 4.2 for a detailed discussion). *
We will now present a simplified version of the elimination ordering and sparsity pattern (compared to the one mentioned in Theorem 2.1). Although the proof of Theorem 2.1 does not cover the stability of ICHOL(0) under this simplified version (rather, it covers the one described in 5.38), extensive numerical experiments suggest that ICHOL(0) remains stable under this simplified version, and since it is also user-friendly we recommend this as the “go-to” version for a simple, practical implementation.111Although more complex, the ordering used in Theorem 2.1 has more potential for optimization by exploiting parallelism and dense linear algebra operations.
2.3 The elimination ordering and sparsity pattern
We use a maximum-minimum distance ordering (maximin ordering) [36] as the elimination ordering. This ordering is obtained by successively picking the point that is furthest away from and the points that were already picked. If , then we select an arbitrary as first index to eliminate; otherwise, we choose the first index as
[TABLE]
Then, for the first indices of the ordering already chosen, we choose
[TABLE]
until we have ordered all the points (see Fig. 2.2).
Let
[TABLE]
be the distance between and and the earlier points in the ordering. For , let be the sparsity pattern defined by
[TABLE]
Here, parameterizes a trade-off between computational efficiency and accuracy. For a given , the sparsity pattern will have entries and the Cholesky factorization will require floating-point operations. Figure 2.3 shows the sparsity pattern for . While a naïve implementation requires distance evaluations, Theorem 4.1 shows that Algorithm 3 delivers this sparsity pattern at computational complexity .
2.4 Sparse approximate PCA
The sparse Cholesky factorization described in Section 2 is also rank revealing in the sense that the low-rank approximation obtained by using only the first columns of the Cholesky factorization achieves an accuracy within a constant factor of optimal rank- approximation (measured in operator norm). This is illustrated by Fig. 2.4 and the following theorem:
Theorem 2.3**.**
In the setting of Theorem 2.1, let be the rank- matrix defined by the first columns of the (dense) Cholesky factor of . Then
[TABLE]
*where is the operator norm of and depends only on , , , , , and . *
The rank- approximation estimate (2.9) is a numerical homogenization accuracy estimate similar those obtained in [59, 67, 64, 65, 46]. Numerical homogenization basis functions can be identified by the last rows of the lower triangular Cholesky factor of , obtained with the reverse elimination ordering described in Section 6.2.
3 Why it works — justification of the method
The method described in Section 2 combines two crude approximations. First, it discards all but entries of the dense matrix . Second, it skips all but operations of the Cholesky factorization of (which has complexity ). The obvious question is: why is the resulting approximation of accurate for ?
3.1 Sparse Cholesky factors of dense matrices
The first part of the answer is that the Cholesky factors of decay exponentially quickly away from the sparsity pattern when the maximin ordering is used as the elimination ordering. This decay is illustrated in Fig. 3.1 and by the following Theorem 3.1. Write for a constant depending only on , , , , , and .
Theorem 3.1**.**
In the setting of Theorem 2.1, let be the full Cholesky factor of in the maximin ordering of Section 2. Then, for , as defined in Section 2, and
[TABLE]
*the inequality holds. *
Algorithm 2 computes the exact Cholesky factorization under the assumption that the entries of lying outside are zero. Theorem 3.1 shows that this assumption holds true up to an approximation error that decays exponentially in , which supports the claim of accuracy of Algorithm 2 for . We will now explain the exponential decay of based on a probabilistic interpretation of Gaussian elimination.
3.2 Gaussian elimination, conditioning of Gaussian random variables,
and the screening effect
The dense (block-)Cholesky factorization of a matrix can be seen as the recursive application of the matrix identity
[TABLE]
where, at each step of the outermost loop, the above identity is applied to the Schur complement obtained at the previous step. If the Schur complements appearing during the factorization are sparse, then the final Cholesky factorization will also be sparse.
For , the well-known identities
[TABLE]
imply that the sparsity of Cholesky factors of is equivalent to conditional independence of Gaussian vectors with covariance matrix . In the spatial statistics literature, it is well known that many smooth Gaussian processes are subject to the screening effect [81]. This effect, illustrated in Fig. 3.2, means that the value of the process at a given site, conditioned on the values at nearby sites, is only weakly dependent on the values at distant sites.
Consider now the th step of Cholesky factorization in the ordering described in Section 2. Any pair with will have points between them that have already been eliminated, as illustrated in Fig. 3.3. Thus, the screening effect suggests that their correlation will be weak, which supports choosing as a truncation radius.
3.3 Cholesky factorization and operator-adapted wavelets
Cholesky factorization in the maximin ordering is intimately related to computing operator-adapted wavelets. In Section 5 we will use this connection to prove the accuracy of our approximation.
Operator-adapted wavelets.
[64] and [65] introduced a novel class of operator-adapted wavelets called gamblets (see also [66]). For an operator defined as in (2.1), gamblets can be identified as conditional expectations of the Gaussian process . To construct the gamblets up to level we start with a hierarchy of measurement functions ; heuristically, labels a scale, and a location at that scale. These measurement functions are linearly nested in the sense that, for ,
[TABLE]
for some rank- matrices . Writing [\hbox to5.71527pt{\hss\cdot\hss},\hbox to5.71527pt{\hss\cdot\hss}] for the duality product between and , the conditional expectations
[TABLE]
act as -adapted pre-wavelets. These pre-wavelets can be identified as optimal recovery splines in the sense of [63] through the representation formula
[TABLE]
where is the th entry of the inverse of the matrix with entries . The linear nesting of the across scales implies that the linear spaces are nested (i.e. ). The multi-resolution decomposition is then obtained by defining as the orthogonal complement of in with respect to the energy scalar product . Basis functions for are identified (for ) by
[TABLE]
or, equivalently, by
[TABLE]
with , where J^{(k)}\cong\bigl{(}I^{(k)}\setminus I^{(k-1)}\bigr{)} and is a matrix such that (writing for the transpose of ). See Fig. 3.4 for an illustration.
For simplicity we write and . Write for the stiffness matrices B^{(k)}\coloneqq\bigl{\langle}\chi_{i}^{(k)},\chi_{j}^{(k)}\bigr{\rangle}. The gamblets are -adapted wavelets in the sense that, under sufficient conditions on the , they satisfy the following three properties:
- •
Scale orthogonality in the energy scalar product, i.e.
[TABLE]
This leads to the block-diagonalization of the operator (with the as diagonal blocks).
- •
Uniform Riesz stability in the energy norm: the condition numbers of the blocks are uniformly bounded in .
- •
Exponential decay, which leads to sparse blocks : the gamblets exhibit exponential decay on the scale associated with .
Although the scale-orthogonality property (3.10) is always satisfied, the two others (exponential decay and uniform Riesz stability) depend on the properties of and the . In the setting of the localization of numerical homogenization basis functions (where is an elliptic PDE and the measurements are local and possibly not explicitly introduced), rigorous exponential decay estimates were pioneered in [59] and generalized in [52, 64, 46, 65]; see Section 5.3.2 for detailed comparisons. For spanning the space of local polynomials of order up to , bounded condition numbers are shown by [64, 65]. The homogenization results obtained in the special case [59, 67, 46] are closely related to the lower bound on the spectrum of (see Section 5.3.3).
Relation to Cholesky factorization.
To explain the connection between gamblets and Cholesky factorization, let , let be the identity matrix, let be the identity matrix, and let be the symmetric matrix with block defined for by
[TABLE]
or equivalently by
[TABLE]
Then, the block-Cholesky factorization of satisfies the identity
[TABLE]
where is a block-diagonal matrix with the diagonal block equal to and
[TABLE]
Therefore, computing gamblets associated to the operator and measurement functions is equivalent to computing a block-Cholesky factorization of in the multiresolution basis given by the .
The Cholesky decomposition of (1.1) belongs to this setting. Indeed, although the maximin ordering of Section 2 has no explicit multiscale structure, this structure can be introduced, as described in Fig. 3.5, by decomposing into a nested hierarchy , and choosing \phi_{i}^{(k)}=\boldsymbol{\delta}(\hbox to5.71527pt{\hss\cdot\hss}-x_{i}) for and , where denotes the unit (unscaled) Dirac delta function. Under this choice, for and for . Letting label the indices in and choosing for and for implies . The exponential decay of and follows from known results [65] on exponential decay of the . The uniform bound on the condition number of the is proved in Section 5.3.3. The exponential decay and uniform bound on the condition numbers of the blocks imply the exponential decay of the Cholesky factors of and hence of . The approximation error estimate (2.4) is then obtained by matching the sparsity set with the near-sparse structure of .
4 Implementation and numerical results
4.1 Selection of the sparsity pattern and ordering
This section introduces an -complexity algorithm (Algorithm 3) for selecting the sparsity pattern and ordering used as inputs in Algorithm 2. This algorithm does not explicitly query the position of the and only uses pairwise distances by processing points one by one by updating a mutable binary heap, keeping track of the point to be processed at each step. With this approach, our proposed algorithm is oblivious to the dimension of the ambient space and, in particular, can automatically exploit low-dimensional structure in the point cloud . In order to avoid computing all pairwise distances, as illustrated in Fig. 4.1, Algorithm 3 uses the sparsity pattern obtained on the coarser scales to restrict computation at the finer scales to local neighborhoods.
Theorem 4.1**.**
*The output of Algorithm 3 is the ordering and sparsity pattern described in Section 2. Furthermore, in the setting of Theorem 3.1, if the oracles \operatorname{\mathtt{dist}}(\hbox to5.71527pt{\hss\cdot\hss},\hbox to5.71527pt{\hss\cdot\hss}) and \operatorname{\mathtt{dist}}_{\partial\Omega}(\hbox to5.71527pt{\hss\cdot\hss}) can be queried in complexity , then the complexity of Algorithm 3 is bounded by , where is a constant depending only on , and . *
Theorem 4.1 is proved in Appendix A. As discussed therein, in the case , Algorithm 3 has the advantage that its computational complexity depends only on the intrinsic dimension of the dataset, which can be much smaller than .
4.2 The case of the whole space ()
Many applications in Gaussian process statistics and machine learning are in the setting. In that setting, the Matérn family of kernels (4.5) is a popular choice that is equivalent to using the whole-space Green’s function of an elliptic PDE as covariance function [83, 84]. Let be a bounded domain containing the . The case is not covered in Theorem 3.1 because in this case the screening effect is weakened near the boundary of by the absence of measurements points outside of . Therefore, distant points close to the boundary of will have stronger conditional correlations than similarly distant points in the interior of (see Fig. 4.2). As observed by [70] and [20], Markov random field (MRF) approaches that use a discretization of the underlying PDE face similar challenges at the boundary. While the weakening of the exponential decay at the boundary worsens the accuracy of our method, the numerical results of Section 4.4 (which are all obtained without imposing boundary conditions) suggest that its overall impact is limited. In particular, as shown in Fig. 4.2, it does not cause significant artifacts in the quality of the approximation near the boundary. This differs from the significant boundary artifacts of MRF methods, which have to be mitigated against by a careful calibration of boundary conditions [70, 20]. Although the numerical results presented in this section are mostly obtained with , in many practical applications, the density of measurement points will slowly (rather than abruptly) decrease towards zero near the boundary of the sampled domain, which drastically decreases the boundary errors shown above. Accuracy can also be enhanced by adding artificial points at the boundary. By applying the Cholesky factorization to , and then restricting the resulting matrix to , we can obtain a very accurate approximate matrix-vector multiplication. Although not in the form of a Cholesky factorization, this approximation can be efficiently inverted using iterative methods such as conjugate gradient [78] preconditioned with the Cholesky factorization obtained from the original set of points.
4.3 Nuggets and measurement errors
In the Gaussian process regression setting it is common to to model measurement error by adding a nugget to the covariance matrix:
[TABLE]
The addition of a diagonal matrix diminishes the screening effect and thus the accuracy of Algorithm 2. This problem can be avoided by rewriting the modified covariance matrix as
[TABLE]
where . As noted in Section 6.2, can be interpreted as a discretized partial differential operator and has near-sparse Cholesky factors in the reverse elimination ordering. Adding a multiple of the identity to amounts to adding a zeroth-order term to the underlying PDE and thus preserves the sparsity of the Cholesky factors. This leads to the sparse decomposition
[TABLE]
where is the order-reversing permutation and is the Cholesky factor of . Fig. 4.3 shows that the exponential decay of these Cholesky factors is robust with respect .
This idea can be turned into an algorithm by first approximately computing using Algorithm 2; then using to approximate , which can be done in near-linear complexity by exploiting sparsity; and then approximating , again using Algorithm 2. While this algorithm is asymptotically efficient, our preliminary results suggest that the additional inversion step significantly increases the constants featured in the approximation accuracy. Therefore, when low accuracy is sufficient, we instead recommend simply applying Algorithm 2 to the matrix . This preserves the original approximation accuracy and the matrix inversion can then efficiently be performed using iterative methods such as conjugate gradient (CG) [78] by taking advantage of the fast matrix-vector multiplication obtained from the sparse factorization. For small values of (which would lead to slow convergence of CG) we can directly apply Algorithm 2 to . For large values of , will be well conditioned and the convergence of is fast. For intermediate values of , we can apply Algorithm 2 to and use the resulting factors as a preconditioner for CG. Sampling from can be done by adding independent samples from and . Approximations of the log-determinant could be obtained either by applying Algorithm 2 directly to (with some loss of accuracy) or by combining iterative methods [74, 27] with the fast matrix-vector multiplication obtained from the sparse factorization of . Just like CG, these methods benefit from the fact that we can work with well-conditioned matrices for small and large . A detailed investigation of the efficiency of the above mentioned strategies for computing with nuggets is beyond the scope of this work.
4.4 Numerical results
We will now present numerical evidence in support of our results. All experiments reported below were run on a workstation using an Intel®Core™i7-6400 CPU with 4.00GHz and 64 GB of RAM. The time-critical parts of the code are run on a single thread, leaving the exploration of parallelism to future work. The Julia scripts implementing the experiments can be found online under https://github.com/f-t-s/nearLinKernel. In the following, denotes the number of nonzero entries of the lower-triangular factor ; denotes the time taken by Algorithm 3 to compute the maximin ordering and sparsity pattern ; denotes the time taken to compute the entries of on ; and denotes the time taken to perform Algorithm 2 (ICHOL(0)), all measured in seconds. The relative error in Frobenius norm is approximated by
[TABLE]
where the pairs of indices are independently and uniformly distributed in . This experiment is repeated 50 times and the resulting mean and standard deviation (in brackets) are reported. For measurements in , in order to isolate the boundary effects, we also consider the quantity which is defined as , but with only those sample for which . Most of our experiments will use the Matérn class of covariance functions [61], defined by
[TABLE]
where is the modified Bessel function of second kind [1, Section 9.6] and , are parameters describing the degree of smoothness, and the length-scale of interactions, respectively [69]. In Fig. 4.5, the Matérn kernel is plotted for different degrees of smoothness. The Matérn covariance function is used in many branches of statistics and machine learning to model random fields with finite order of smoothness [37, 69].
As observed by [83, 84], the Matérn kernel is the Green’s function of an elliptic PDE of possibly fractional order in the whole space. Therefore, for , the Matérn kernel falls into the framework of our theoretical results, up to the behavior at the boundary discussed in Section 4.2. Since the locations of our points will be chosen at random, some of the points will be very close to each other, resulting in an almost singular matrix that can become nonpositive under the approximation introduced by ICHOL(0). If Algorithm 2 encounters a nonpositive pivot , then we set the corresponding column of to zero, resulting in a low-rank approximation of the original covariance matrix. We report the rank of in our experiments and note that we obtain a full-rank approximation for moderate values of .
We begin by investigating the scaling of our algorithm as increases. To this end, we consider (the exponential kernel), and choose randomly distributed points in for . The results are summarized in Table 4.1 and Table 4.4, and in Fig. 4.4, and confirm the near-linear computational complexity of our algorithm.
Next, we investigate the trade-off between computational efficiency and accuracy of the approximation. To this end, we choose , and , , corresponding to fourth-order equations in two and three dimensions. We choose data points and apply our method with different values of . The results of these experiments are tabulated in Tables 4.3 and 4.4 and the impact of on the approximation error is visualized in Fig. 4.4.
While our theoretical results only cover integer-order elliptic PDEs, we observe no practical difference between the numerical results for Matérn kernels corresponding to integer- and fractional-order smoothness. As an illustration, for the case , we provide approximation results for ranging around (corresponding to a fourth-order elliptic PDE) and (corresponding to a sixth-order elliptic PDE). As seen in Table 4.5, the results vary continuously as changes, with no qualitative differences between the behavior for integer- and fractional-order PDEs. To further illustrate the robustness of our method, we consider the Cauchy class of covariance functions introduced in [34]
[TABLE]
As far as we are aware, the Cauchy class has not been associated to an elliptic PDE. Furthermore, it does not have exponential decay in the limit , which allows us to emphasize the point that the exponential decay of the error is not due to the exponential decay of the covariance function itself. Table 4.6 gives the results for and .
In Gaussian process regression, the ambient dimension is typically too large to ensure computational efficiency of our algorithm. However, since our algorithm only requires access to pairwise distances between points, it can take advantage of low intrinsic dimension of the dataset. We might be concerned that in this case, interaction through the higher dimensional ambient space will disable the screening effect. As a first demonstration that this is not the case, we will draw points in and equip them with a third component according to , for i.i.d. standard Gaussian. Figure 4.6 shows the resulting point sets for different values of , and Table 4.7 shows that the approximation is robust to increasing values of .
An appealing feature of our method is that it can be formulated in terms of the pairwise distances alone. This means that the algorithm will automatically exploit any low-dimensional structure in the dataset. In order to illustrate this feature, we artificially construct a dataset with low-dimensional structure by randomly rotating four low-dimensional structures into a -dimensional ambient space (see Fig. 4.7). Table 4.8 shows that the resulting approximation is even better than the one obtained in dimension , illustrating that our algorithm did indeed exploit the low intrinsic dimension of the dataset.
5 Analysis of the algorithm
5.1 General Setting
We will start the analysis in a more general setting than that of Section Section 2.1. Let be a separable Banach space with dual space , and write [\hbox to5.71527pt{\hss\cdot\hss},\hbox to5.71527pt{\hss\cdot\hss}] for the duality product between and . Let be a linear bijection and let . Assume to be symmetric and positive (i.e. and for ). Let \|\hbox to5.71527pt{\hss\cdot\hss}\| be the quadratic (energy) norm defined by for and let \|\hbox to5.71527pt{\hss\cdot\hss}\|_{\ast} be its dual norm defined by
[TABLE]
Let be linearly independent elements of (known as measurement functions) and let be the symmetric positive-definite matrix defined by
[TABLE]
We assume that we are given and a partition of . We represent matrices as block matrices according to this partition. Given an matrix we write for the th block of and for the sub-matrix of defined by blocks ranging from to and to . Unless specified otherwise we write for the lower-triangular Cholesky factor of and define
[TABLE]
We interpret the as labelling a hierarchy of scales with representing the coarsest and the finest. We write for .
Throughout this section we assume that the ordering of the set of indices is compatible with the partition , i.e. , and together imply . We will write or for the Cholesky factor of in that ordering.
5.2 Main examples
We will prove the main results of this section in the setting where is defined as in Section 2.1 and the are chosen as in Examples 5.1 and 5.2. We will assume (without loss of generality after rescaling) that . As described in Fig. 3.5, successive points of the maximin ordering can be gathered into levels, so that, after appropriate rescaling of the measurements, the Cholesky factorization in the maximin ordering falls in the setting of Example 5.1.
Example 5.1**.**
Let . For let be a nested hierarchy of points in that are homogeneously distributed at each scale in the sense of the following three inequalities:
- (1)
, 2. (2)
, and 3. (3)
.
Let and for . Let denote the unit Dirac delta function and choose
[TABLE]
Given subsets we extend a matrix to an element of by padding it with zeros.
Example 5.2**.**
(See Fig. 5.1.) For , let be uniformly Lipschitz convex sets forming a regular nested partition of in the following sense. For , is a disjoint union except for the boundaries. is a nested set of indices, i.e. for . For and , there exists a subset such that and . Assume that each contains a ball of center and radius , and is contained in the ball . For and , let the submatrices satisfy and for each , where denotes the volume of . Let and for . Let be the matrix defined by . Let be the matrix defined by for , we set
[TABLE]
*and define . In order to keep track of the distance between the different of Example 5.2, we choose an arbitrary set of points with the property that for each . *
5.3 Exponential decay of Cholesky factors
Our bound on the ICHOL(0) approximation error will be based on the following exponential decay estimate on the entries of the Cholesky factor of :
[TABLE]
for a constant and a suitable distance measure d(\hbox to5.71527pt{\hss\cdot\hss},\hbox to5.71527pt{\hss\cdot\hss})\colon I\times I\to\mathbb{R}.
5.3.1 Algebraic Identities and roadmap
The following block-Cholesky decomposition of will be used to obtain the estimate (5.6).
Lemma 5.3**.**
*We have , with and defined by *
[TABLE]
*In particular, if is the lower-triangular Cholesky factor of , then the lower-triangular Cholesky factor of is given by . *
Proof 5.4**.**
*To obtain Lemma 5.3 we successively apply Lemma 5.5 to (see Appendix B for details). Lemma 5.5 summarizes classical identities satisfied by Schur complements. *
Lemma 5.5** ([87, Chapter 1.1]).**
Let be symmetric positive definite and its inverse. Then
[TABLE]
where
[TABLE]
Based on Lemma 5.3, (5.6) can be established by ensuring that:
- (1)
the matrices (and hence also ) decay exponentially according to d(\hbox to5.71527pt{\hss\cdot\hss},\hbox to5.71527pt{\hss\cdot\hss}); 2. (2)
the matrices have uniformly bounded condition numbers; 3. (3)
the products of exponentially decaying matrices decay exponentially; 4. (4)
the inverses of well-conditioned exponentially decaying matrices decay exponentially; 5. (5)
the Cholesky factors of the inverses of well-conditioned exponentially decaying matrices decay exponentially; and 6. (6)
if a block lower-triangular matrix with unit block-diagonal decays exponentially, then so does its inverse.
We will carry out this program in the setting of Examples 5.1 and 5.2 and prove that (5.6) holds with
[TABLE]
To prove (1), the matrices , (interpreted as coarse-grained versions of and ), and will be identified as stiffness matrices of the -adapted wavelets described in Section 3.3. This identification is established on the general identities for , , and where the and are defined as in (3.7) and (3.8).
5.3.2 Exponential decay of
Our proof of the exponential decay of will be based on that of as expressed in the following condition:
{condition}
Let be constants such that for and ,
[TABLE]
The matrices are coarse-grained versions of the local operator and as such inherit some of its locality in the form of exponential decay. Such exponential localization results were first obtained by [59] for the coarse-grained operators obtained from local orthogonal decomposition (LOD) applied to second-order elliptic PDEs with rough coefficients. [64] gives similar results for measurement functions chosen as in Example 5.2. [46] extend the results on exponential decay to higher-order operators satisfying a strong ellipticity condition. These results were obtained using similar mass chasing techniques that are difficult to extend to general higher-order operators. [52] present a simpler proof of the exponential decay of the LOD basis functions of [59] based on the exponential convergence of subspace iteration methods. [65] extend this technique (by presenting necessary and sufficient conditions expressed as frame inequalities in dual spaces) to elliptic PDEs of arbitrary (integer) order and new classes of (possibly non-conforming) measurements, including those of Example 5.1 and Example 5.2. More recently, [18] show localization results for the fractional partial differential operators by using the Caffarelli–Silvestre extension. The results of [65] are sufficient to show that Section 5.3.2 holds true in the setting of Example 5.1 and Example 5.2.
Theorem 5.6** ([65]).**
In Example 5.1, the matrices satisfy
[TABLE]
and in Example 5.2 they satisfy
[TABLE]
*with the constants and depending only on , , , , , and . In particular, they satisfy Section 5.3.2 with the constants described above. *
Proof 5.7**.**
Our Example 5.1 is equivalent to Example 2.29 of [65]. In [65, Theorem 2.25 and Theorem 2.26] it is shown that in the gamblets computed in this setting decay exponentially on the length-scale , with respect to the energy norm. By [65, Theorem 3.8] we have and, therefore, the exponential decay of gamblets implies the exponential decay of the .
We further note that Example 5.2 is equivalent to Example 2.27 in [65]. Therefore, by the same theorems, as above, the results of [65] imply exponential decay of the in this setting222We point out that the block in our notation is in the notation of [65]..
*See also [66, Theorem 15.45] for a detailed proof and [66, Theorem 15.43] for required sufficient lower bounds on . *
5.3.3 Bounded condition numbers
In this section, we will bound the condition numbers of based on the following condition, which we will show to be satisfied for Examples 5.1 and 5.2.
{condition}
Let be constants such that for ,
[TABLE]
Theorem 5.8**.**
Section 5.3.3* implies that, for all ,*
[TABLE]
and, for ,
[TABLE]
Proof 5.9**.**
The lower bound in (5.19) follows from (5.18) and
[TABLE]
*The upper bound in (5.19) follows from (5.17) and B^{(k)}=\bigl{(}\bigl{(}\Theta^{(k)}\bigr{)}^{-1}\bigr{)}_{k,k}. *
The following theorem shows that (5.18) is a Poincaré inequality closely related to the accuracy of numerical homogenization basis functions [59, 67, 46] and (5.17) is an inverse Sobolev inequality related to the regularity of the discretization of :
Theorem 5.10**.**
Section 5.3.3* holds true if the constants and satisfy*
- (1)
, for and ; and 2. (2)
, for , , and .
Proof 5.11**.**
Inequality (5.17) is a direct consequence of the first assumption of the theorem, whereas (5.18) follows from the variational property [87, Theorem 5.1] of the Schur complement:
[TABLE]
We will now show that Examples 5.1 and 5.2 satisfy the conditions of Theorem 5.10. For simplicity, for and we still write for the unique element such that for . The following Fenchel conjugate identity [16, Ex. 3.27, p. 93] will be useful throughout this section.
[TABLE]
The first condition can be verified similarly as is done in [65].
Lemma 5.12**.**
Let be given as in Examples 5.1 and 5.2. Then there exists a constant depending only on , , and , such that
[TABLE]
*for , , and . *
Proof 5.13**.**
*The proof can be found in Appendix B. *
In order to verify the second condition in Theorem 5.10, we will construct a such that integrates to zero against polynomials of order at most on domains of size . Then an application of the Bramble–Hilbert lemma [21] will yield the desired factor . To avoid scaling issues we define, for and ,
[TABLE]
noting that . To obtain estimates independent of the regularity of , for the simplicity of the proof and without loss of generality, we will partially work in the extended space (rather than on ). We write for the zero extension of to and for the extension of to an element of the dual space of . We introduce new measurement functions in the complement of as follows. For we consider countably infinite index sets . We choose points satisfying
[TABLE]
We then define, for and , for Example 5.1, and for Example 5.2. Let denote the linear space of polynomials of degree at most (on ).
Lemma 5.14**.**
Let be as in Example 5.1 or Example 5.2. Given and let be such that
[TABLE]
and . Then, for , and satisfy
[TABLE]
*with and as in (5.1). *
We proceed by proving Lemma 5.14 in the setting of Example 5.1. The proof in the setting of Example 5.2 can be found in Appendix B. For write and for write for the vector of partial derivatives of of order , i.e. \mathrm{D}^{k}u\coloneqq\Bigl{(}\frac{\partial^{k}u}{\partial_{i_{1}}\cdots\partial_{i_{k}}}\Bigr{)}_{i_{1},\ldots,i_{k}=1,\ldots,d}. The proof of Lemma 5.14 will use the following version of the Bramble–Hilbert lemma:
Lemma 5.15** ([21]).**
Let be convex and let be a sublinear functional on for such that
- (1)
there exists a constant such that, for all ,
[TABLE] 2. (2)
and for all .
Then, for all ,
[TABLE]
The following lemma is obtained from Lemma 5.15:
Lemma 5.16**.**
For and , let be as in Lemma 5.14 and Example 5.2 and define . Then there exists a constant such that, for all ,
[TABLE]
Proof 5.17**.**
We apply Lemma 5.15 to the linear functional . Since the second requirement of Lemma 5.15 is fulfilled by definition, it remains to bound . We only execute the proof for Example 5.1; the proof for Example 5.2 is analogous. We first note that while the sum in the definition of only ranges over , we can increase it to run over all of , since for , the support of is disjoint from that of . Let . Writing for the continuity constant of the embedding of into , the inequalities
[TABLE]
and
[TABLE]
imply that
[TABLE]
Therefore the first condition of Lemma 5.15 holds with
[TABLE]
*and we conclude the proof by writing for any constant depending only on and . *
We can now conclude the proof of Lemma 5.14.
Proof 5.18** (Proof of Lemma 5.14).**
Write and . Equation (5.24) implies that
[TABLE]
The packing inequality together with Lemma 5.16 yields
[TABLE]
Applying the inequality to each summand yields
[TABLE]
Since, for all ,
[TABLE]
*we have , and this completes the proof. *
The following geometric lemma shows that the assumption (5.28) of Lemma 5.14 can be satisfied with a uniform bound on the value of and the norm of weights .
Lemma 5.19**.**
There exists constants and such that for all there exists weights satisfying (5.28) and (with defined as in Lemma 5.14)
[TABLE]
Proof 5.20**.**
For Example 5.1, (5.28) is equivalent to
[TABLE]
where .
Fix , let and write . Write . Since the function p(\hbox to5.71527pt{\hss\cdot\hss})\mapsto p(\frac{\hbox to4.00069pt{\hss\cdot\hss}-x_{i}}{\lambda}) is surjective on , (5.43) is satisfied if
[TABLE]
For a multiindex and a point , write . Use the convention if and . To satisfy (5.44) it is sufficient to identify a subset of and w_{i,\hbox to4.00069pt{\hss\cdot\hss}}\in\mathbb{R}^{\tilde{I}^{(k)}} such that , for , and
[TABLE]
Let be the matrix defined by
[TABLE]
for a multiindex and a point . Let be defined by for . Equation (5.45) is then equivalent to
[TABLE]
where is defined by for . We will now identify by inverting (5.47). To achieve this while keeping the norm of under control we will seek to identify the subset and such that (the minimal singular value of ) is bounded from below by a constant depending only on and .
For let be elements of satisfying for all . Let and, for , let . Observe that for the points are on a regular grid. Let be the matrix defined by . Let be the Vandermonde matrix defined by . Writing for the minimal singular value of we have, for , by [45, Theorem 4.2.12],
[TABLE]
Since univariate polynomial interpolation on points with polynomials of degree is uniquely solvable, we have and . Therefore, the continuity of the minimal singular value with respect the entries of implies that there exists depending only on such that implies . Since (by construction) the form a covering of of radius , the form a covering of of radius and for each there exists an that is at distance at most from . Let be the collection of corresponding labels. It follows from that , and for . Selecting implies that and for . Defining
[TABLE]
*the weights satisfy and (5.28) with a depending only on and . This concludes the proof for Example 5.1. The proof is similar for Example 5.2 with minor changes (the bound on also depends on ). *
The following lemma concerns the satisfaction of the second condition of Theorem 5.10:
Lemma 5.21**.**
In the setting of Examples 5.1 and 5.2, there exists some constant such that, for , and ,
[TABLE]
Proof 5.22**.**
*Apply Lemma 5.14 with the bounds on and obtained in Lemma 5.19. *
The following theorem is a direct consequence of Theorem 5.10, Lemma 5.12 and Lemma 5.21.
Theorem 5.23**.**
*In the setting of Examples 5.1 and 5.2 there exists a constant such that Section 5.3.3 is fulfilled with and . *
5.3.4 Propagation of exponential decay
We will now derive the exponential decay of the Cholesky factors by combining the algebraic identities of Lemma 5.3 with the bounds on the condition numbers of the (implied by Section 5.3.3) and the exponential decay of the (specified in Section 5.3.2). The core of our proof is based on a combination/extension of the results of [23, 49, 11, 10, 53, 12] on decay algebras. The pseudodistance d(\hbox to5.71527pt{\hss\cdot\hss},\hbox to5.71527pt{\hss\cdot\hss}) appearing in (5.6) is not a pseudometric because it does not satisfy the triangle inequality. However, to prove (5.6) we we will only need the following weaker version of the triangle inequality:
Definition 5.24**.**
A function is called a hierarchical pseudometric if
- (1)
; 2. (2)
; 3. (3)
for all , d(\hbox to5.71527pt{\hss\cdot\hss},\hbox to5.71527pt{\hss\cdot\hss}) restricted to is a pseudometric; 4. (4)
for all and , we have .
Note that the d(\hbox to5.71527pt{\hss\cdot\hss},\hbox to5.71527pt{\hss\cdot\hss}) specified in (5.13) for Examples 5.1 and 5.2 is a hierarchical pseudometric. For a hierarchical pseudometric d(\hbox to5.71527pt{\hss\cdot\hss},\hbox to5.71527pt{\hss\cdot\hss}) and , let
[TABLE]
The following theorem states the main result of this section:
Theorem 5.25** (Exponential decay of the Cholesky factors).**
Assume that fulfils Sections 5.3.2 and 5.3.3 with the constants and the hierarchical pseudometric d(\hbox to5.71527pt{\hss\cdot\hss},\hbox to5.71527pt{\hss\cdot\hss}). Then
[TABLE]
*where , , , and is defined as in Theorem 5.8. *
The remaining part of this section will present the proof of Theorem 5.25. We will use the following lemma on the stability of exponential decay under matrix multiplication, the proof of which is a minor modification of that of [49].
Lemma 5.26**.**
Let be an index set that is partitioned as and let satisfy
[TABLE]
Let be such that for and let
[TABLE]
Then, for ,
[TABLE]
Proof 5.27**.**
Set , . Then
[TABLE]
The proof of the following lemma (on the stability of exponential decay under matrix inversion for well conditioned matrices) is nearly identical to that of [49] (we only keep track of constants; see also [23] for a related result on the inverse of sparse matrices).
Lemma 5.28**.**
Let be symmetric and positive definite with for some and a metric d(\hbox to5.71527pt{\hss\cdot\hss},\hbox to5.71527pt{\hss\cdot\hss}) on . It holds true that
[TABLE]
*where , , , and is the condition number of . *
Proof 5.29**.**
*On a compact set not containing [math], the function can be accurately approximated by low-order polynomials in . Then, the spread of the exponential decay can be controlled by Lemma 5.26. See Appendix B for details. *
By representing Schur complements as matrix inverses, Lemma 5.28 can also be used to show that the Cholesky factors of well-conditioned exponentially-decaying matrices are exponentially decaying. The following lemma appears in a similar form in [12] for banded matrices and in [53] without explicit constants.
Lemma 5.30**.**
Let be symmetric and positive definite with condition number and such that for some constant and some metric on . Let be the Cholesky factor (in an arbitrary order) of (). Then
[TABLE]
*where , , and . *
Proof 5.31**.**
Lemma 5.5* implies that the Schur complements of can be expressed as inverses of sub-matrices of . The result then follows from Lemma 5.28 (see B.5 for details). *
The last ingredient needed to prove the exponential decay of the Cholesky factors of is the following lemma showing the stability of exponential decay under inversion for block-lower-triangular matrices (this operation appears in the definition of in (5.7)):
Lemma 5.32**.**
Let be an index set that is partitioned as and assume that the matrix is block-lower triangular with respect to this partition, with identity matrices as diagonal blocks. If d(\hbox to5.71527pt{\hss\cdot\hss},\hbox to5.71527pt{\hss\cdot\hss}) is a hierarchical pseudometric such that (for some and ) then it holds true that
[TABLE]
*with . *
Proof 5.33**.**
The Neumann series of a block-lower-triangular matrix with identity matrices on the (block) diagonal can be written as
[TABLE]
*Since the sum terminates in steps, the thickening of the exponential decay can be bounded using Lemma 5.26. See B.6 for details. *
By applying the above results to the decomposition obtained in Lemma 5.3, we conclude the proof of Theorem 5.25. See B.7 for details.
5.4 Complexity and error estimates
The results of the previous sections allow us to prove the following theorem on the exponential decay of the Cholesky factors and the accuracy of their truncation:
Theorem 5.34**.**
In the setting of Examples 5.1 and 5.2 there exist constants depending only on , , , , , , and , such that the entries of the Cholesky factor of satisfy
[TABLE]
where is the hierarchical pseudometric defined by
[TABLE]
As a consequence, writing
[TABLE]
*with , we have \bigl{\|}\Theta-L^{S}L^{S,\top}\bigr{\|}_{\operatorname{Fro}}\leq\epsilon for . Furthermore, writing , using the -perturbation of as the input to Algorithm 2 returns as the output. *
Proof 5.35**.**
Theorems 5.6* and 5.23 imply that Sections 5.3.2 and 5.3.3 are fulfilled with constants depending only on , , , , , and . Theorem 5.25 concludes the exponential decay of . The accuracy of the truncated factors follows directly from the exponential decay. *
Theorem 3.1 is a direct consequence of Theorem 5.34.
Proof 5.36** (Proof of Theorem 3.1).**
As described in Section 3.3, the maximin ordering can be represented as a hierarchical ordering satisfying the conditions of Example 5.1. The result follows from Theorem 5.34 by observing that the sparsity pattern specified in Section 2 satisfies
[TABLE]
*Scaling the weights of the measurement functions to increases the error by a factor that is at most polynomial in , which can be subsumed into the -dependence of by increasing the constants in the decay estimates. *
While accurate (per Theorem 5.34), it is computationally inefficient to compute the full Cholesky factor first (with Algorithm 1) and then truncate it according to . Instead, we want to directly compute an approximation of from the incomplete factorization Algorithm 2, whose complexity is bounded by the following theorem:
Theorem 5.37**.**
*In the setting of Examples 5.1 and 5.2, there exists a constant , such that, for , the application of Algorithm 2 has computational complexity in space and in time. In particular, implies the upper bounds of on the space complexity, and of on the time complexity. *
Proof 5.38**.**
Defining , for implies that . Therefore implies the bound on space complexity.
*Consider the structure of the nested for-loops of Algorithm 2 and observe that, for every in the innermost loop, the number of distinct satisfying , and is at most . This implies the upper bound on the time complexity. *
Theorems 5.34 and 5.37 imply that the application of Algorithm 2 to (the -perturbation of described in Theorem 5.34) returns an -accurate Cholesky factorization of in computational complexity . In practice we do not have access to , so we need to rely on the stability of Algorithm 2 to deduce that and (used as inputs) would yield similar outputs, for sufficiently small . Even though such a stability property of ICHOL(0) would also be required by prior works on incomplete LU-factorization such as [33], we did not find this type of result in the literature. We also found it surprisingly difficult to prove (and were unable to do so) for the maximin ordering and sparsity pattern, although we always observed stability of Algorithm 2 in practice, for reasonable values of . We can however prove stability of Algorithm 2 when using a slight modification of the ordering and sparsity pattern that compromises neither the computational complexity nor the accuracy of the factorization. The modified ordering and sparsity pattern, being inspired by the concepts of red-black orderings [48] and supernodal factorizations [73, 58] also allows one to take advantage of parallelism and dense linear algebra operations and could therefore be used to improve the practical performance of the algorithm. For , and , write
[TABLE]
{construction}
[Supernodal multicolor ordering and sparsity pattern]
Let with and let d(\hbox to5.71527pt{\hss\cdot\hss},\hbox to5.71527pt{\hss\cdot\hss}) be a hierarchical pseudometric. For , define the supernodal multicolor ordering and sparsity pattern as follows. For each , select a subset of indices such that
[TABLE]
Assign every index in to the element of closest to it, using an arbitrary method to break ties. That is, writing for the assignment of to ,
[TABLE]
for all and such that . Define and define the auxiliary sparsity pattern by
[TABLE]
Define the sparsity pattern as
[TABLE]
and call the elements of supernodes. Color each in one of colors such that no with have the same color. For write for the such that and write for the color of . Define the supernodal multicolor ordering by reordering the elements of such that
- (1)
for , and ; 2. (2)
within each level , we order the elements of supernodes colored in the same color consecutively, i.e. given such that , for , and ; and 3. (3)
the elements of each supernode appear consecutively, i.e. given such that , for , and .
Starting from a hierarchical ordering and sparsity pattern, the modified ordering and sparsity pattern can be obtained efficiently:
Lemma 5.39**.**
*In the setting of Examples 5.1 and 5.2, given , there exist constants and depending only on the dimension and the cost of computing d(\hbox to5.71527pt{\hss\cdot\hss},\hbox to5.71527pt{\hss\cdot\hss}) such that the ordering and sparsity pattern presented in 5.38 can be constructed with , for each , in computational complexity . *
Proof 5.40**.**
*The aggregation into supernodes can be done via a greedy algorithm by keeping track of all nodes that are not already within distance of a supernode and removing them one-at-a-time. We can then go through -neighbourhoods and remove points within distance from our list of candidates for future supernodes. To create the coloring, we use the greedy graph coloring of [47] on the undirected graph with vertices and edges \bigl{\{}(\tilde{i},\tilde{j})\in\tilde{S}_{\rho}\,\big{|}\,\tilde{i},\tilde{j}\in\tilde{J}^{(k)}\bigr{\}}. Defining as the maximum number of edges connected to any vertex of , the computational complexity of greedy graph coloring is bounded above by and the number of colors used by . A sphere-packing argument shows that is at most a constant depending only on the dimension , which yields the result. *
Theorem 5.41**.**
In the setting of Examples 5.1 and 5.2, there exists a constant depending only on , , and such that, given the ordering and sparsity pattern defined as in 5.38 with , the incomplete Cholesky factor obtained from Algorithm 2 has accuracy
[TABLE]
*Furthermore, Algorithm 2 has complexity of at most in time and at most in space. *
Proof 5.42**.**
*The triangle inequality implies that and hence the bound on the complexity of Algorithm 2 follows from Theorem 5.37. The approximation property of the incomplete factors follows from the last part of Theorem 5.34 and a stability result for the incomplete Cholesky factorization with the supernodal multicolor ordering and sparsity pattern detailed in Appendix C. *
This allows us to prove the main theorem presented in the introduction.
Proof 5.43** (Proof of Theorem 2.1).**
Theorem 2.1* follows from Theorem 5.41 since rescaling the weights of the measurements to increases bounds on errors by at most a multiplicative polynomial factor in . By increasing the constant, this factor can be subsumed in the -dependence of . *
We have now established the results on exponential decay of the Cholesky factors of and the accuracy of Algorithm 2. Before proceeding to the next section, we will quickly establish a result on low-rank approximation of the Cholesky factors.
Theorem 5.44** (Approximate PCA).**
In the setting of Theorem 3.1, take and let be the matrix formed by the leading columns of the Cholesky factors of in the maximin ordering. Let be as in (2.7). Then there exists a constant depending only on , , , and such that
[TABLE]
Proof 5.45**.**
*Write with and . By Lemma 5.5, the approximation error made by keeping only the first columns of the Cholesky factorization is equal to the Schur complement . Consider the implicit hierarchy of the maximin ordering as in Fig. 3.5 with and let be such that . Write with and . The variational property (5.22) implies that . Theorem 5.23 (with obtained from the implicit hierarchy of Fig. 3.5) implies that (the extra multiplicative term arises because the measurement functions are scaled by in Example 5.1 with ). We conclude the proof using . *
6 Extensions and byproducts
6.1 The cases
Theorem 3.1 requires that to ensure that the elements of are continuous (by the Sobolev embedding theorem) and that pointwise evaluations of the Green’s function are well defined. The accuracy estimate of Theorem 3.1 can be extended to by replacing pointwise evaluations of the Green’s function by local averages and using variants of the Haar pre-wavelets of Example 5.2 instead of variants of the subsampled Diracs of Example 5.1 to decompose as in (3.13). Numerical experiments also suggest that the exponential decay of Cholesky factors still holds for if the local averages of Example 5.2 are sub-sampled as in Example 5.1, whereas the low-rank approximation becomes sub-optimal. As illustrated in Table 4.5, for Matérn kernels we observe no difference (in accuracy vs. complexity) between integer and non-integer values of .
6.2 Sparse factorization of
Let be the Cholesky factorization of the covariance matrix . Writing for the order-reversing permutation,
[TABLE]
Since is lower triangular, it is the Cholesky factor of in the reverse elimination ordering. Furthermore, since and both and are exponentially decaying, the Cholesky factors of are also exponentially decaying if the Gaussian elimination is performed using the reverse of Section 2’s ordering. In fact, the following, stronger, theorem holds:
Theorem 6.1**.**
In the setting of Theorem 3.1, let
[TABLE]
let be the Cholesky factor of in the reverse ordering, and define
[TABLE]
*Then there exists a constant depending only on , , , , and such that for , we have \bigl{\|}PAP-L^{\mathring{S}_{\rho}}L^{\mathring{S}_{\rho},\top}\bigr{\|}_{\operatorname{Fro}}\leq\epsilon. *
Using this result and the fact that has nonzero entries per column, one can prove that using Algorithm 2 with a supernodal ordering as described in 5.38 yields an -approximate Cholesky factorization of in computational complexity in time and in space. The matrix is essentially a discretized elliptic partial differential operator, and analogous results can be obtained in the setting where is obtained as a discretization of with regular finite elements and and is the inverse of that discretized operator. Numerical experiments suggest that exponential decay properties also hold for discretized second-order elliptic equations in two or three dimensions (where ) when using subsampling as in Example 5.1; see [76, Section 3.1] for a special case of this result on regular meshes. Thus, by computing the incomplete Cholesky factorization, we obtain a direct solver for general elliptic PDEs with complexity in time and in space. To the best of our knowledge, this is the best asymptotic complexity reported for such a solver in the literature (for elliptic PDEs with rough coefficients and rigorous a priori estimates of complexity vs. accuracy). It is not surprising that we obtain a fast solver for elliptic PDEs because our work is based on the fast solvers introduced in [64, 65], which in turn can be shown to be a block-wise version of the Cholesky factorization in nonstandard form introduced by [33], where the inverses of diagonal blocks are computed using iterative methods. By instead applying the Cholesky factorization in nonstandard form, the logarithmic factor in the complexity of the gamblet transform can be improved. However, the error estimates of [64] and [65] improve significantly upon those in [33] by establishing that exponential accuracy can be obtained with a finite number of vanishing moments even for rough coefficients. The present work further extends the results on Cholesky factorization to the setting of multiresolution schemes based on subsampling (without any vanishing moments). For such multiresolution basis the nonstandard form just reduces to computing an ordinary incomplete Cholesky factorization with the smaller sparsity pattern , thus greatly simplifying the implementation. We note that by using direct inversion methods similar to [55] it would be possible in principle to directly compute -approximations of the Cholesky factors of from entries of at computational cost of , but we defer a more detailed investigation to future work.
7 Comparison to related work
7.1 -matrix approximations from sparse Cholesky factorization
The -matrix data structure [39] uses low-rank approximations for blocks () fulfilling the admissibility condition
[TABLE]
The approximation property of the incomplete Cholesky factorization in maximin ordering (Theorem 3.1) directly implies bounds on the spectral decay of admissible blocks in the -matrix framework, as can be seen from the representation
[TABLE]
of the Cholesky factorization of . If is sparse according to the sparsity pattern obtained in Section 2 then can contribute to the rank of the sub-matrix only if
[TABLE]
The number of satisfying (7.3) is at most , which recovers (up to constants) the same rank bounds as obtained in [8] for second-order elliptic PDEs with rough coefficients. However the converse is not true and most hierarchical matrix representations can not be written in terms of a sparse Cholesky factorization of . For example, adding a diagonal matrix to does not affect the ranks of admissible blocks, but it diminishes the screening effect and thus the approximation property of the incomplete Cholesky factorization as obtained in Section 2 (see Section 4.3).
7.2 Comparison to Cholesky factorization in wavelet bases
[33] compute sparse Cholesky factorizations of (discretized) differential/integral operators represented in a wavelet basis. Using a fine-to-coarse elimination ordering, they establish that the resulting Cholesky factors decay polynomially with an exponent matching the number of vanishing moments of the underlying wavelet basis.
For differential operators, this coincides algorithmically with the Cholesky factorization described in Section 6.2 and the gamblet transform of [64] and [65], whose estimates guarantee exponential decay. In particular [33] numerically observe a uniform bound on which they relate to the approximate sparsity of their proposed Cholesky factorization.
For integral operators, [33] use a fine-to-coarse ordering and we use a coarse-to-fine ordering. While their results rely on the approximate sparsity of the integral operator represented in the wavelet basis, our approximation remains accurate for multiresolution bases (e.g. the maximin ordering in Section 2) in which is dense, which avoids the complexity of a basis transform (or the implementation of adaptive quadrature rules to mitigate this cost).
7.3 Vanishing moments
Let denote the set of polynomials of order at most that are supported on . [64] and [65] show that (5.18) and (5.17) hold when is an elliptic partial differential operator of order (as described in Section 2.1) and the measurements are local polynomials of order up to (i.e. with ). Using these as measurements is equivalent to using wavelets satisfying the vanishing moment condition
[TABLE]
The requirement for vanishing moments has three important consequences. First, it requires that the order of the operator be known a priori, so that a suitable number of vanishing moments can be ensured. Second, ensuring a suitable number of vanishing moments greatly increases the complexity of the implementation. Third, in order to provide vanishing moments, the measurements , , have to be obtained from weighted averages over domains of size of order . Therefore, even computing the first entry of the matrix in the multiresolution basis will have complexity , since it requires taking an average over almost all of . One of the main analytical result of this paper is to show that these vanishing moment conditions and local averages are not necessary for higher order operators (which, in particular, enables the generalization of the gamblet transform to hierarchies of measurements defined as in Examples 5.1 and 5.2).
7.4 Comparison to Multiresolution Approximation (M-RA)
In spatial statistics, the method most closely related to ours is the M-RA of [50] where a Gaussian process is approximated by a sum, at different scales, of predictive processes described in [6]. Following the intuition of the screening effect, these processes are assumed to be block-independent with respect to a domain decomposition at the respective scale, allowing for near-linear computational complexity. Although the specific multiresolution scheme and its accuracy are a function of the specific choice of basis functions and of the knots to be conditioned upon at each scale, no systematic strategy and no theoretical error bounds are provided for best accuracy. We suspect that no scheme relying on block-sparsity assumptions can also guarantee exponential accuracy in near-linear computational complexity, though we note that the taper-M-RA introduced by [51], independently of and after the first version of the present article, does not impose conditional block-independence and could therefore be made exponentially accurate. While our present work and that of [50] are both motivated by a hierarchical exploitation of the screening effect, we identify a concrete and simple algorithm that has a guaranteed exponential accuracy for a wide range of kernel matrices.
8 Conclusions
We have shown that the dense covariance matrices obtained from a wide range of covariance functions associated to smooth Gaussian processes have almost sparse Cholesky factors. Using this property, these matrices can be inverted in near-linear computational complexity just by applying zero fill-in incomplete Cholesky factorization with an a priori ordering and sparsity pattern. Sparse Cholesky factorization of sparse matrices is by now a classical field, but we are not aware of prior work on the sparse factorization of dense matrices, other than for the purpose of preconditioning. While our algorithm is subject to the curse of high dimensionality like other hierarchy-based methods, it is able to exploit low dimensionality in the data without any user intervention. Our results are motivated by the probabilistic interpretation of Cholesky factorization and proved rigorously by using and generalizing recent results on operator-adapted wavelets. By reversing the elimination order, we also obtain a fast direct solver for elliptic PDEs whose rigorous a priori accuracy-vs.-complexity estimates advance the current state of the art for general elliptic PDEs.
Acknowledgments
FS and HO gratefully acknowledge support by the Air Force Office of Scientific Research and the DARPA EQUiPS Program (award number FA9550-16-1-0054, Computational Information Games) and the Air Force Office of Scientific Research (award number FA9550-18-1-0271, Games for Computation and Learning). TJS has been supported by the Freie Universität Berlin within the Excellence Initiative of the German Research Foundation. This collaboration has been facilitated by the Statistical and Applied Mathematical Sciences Institute through the National Science Foundation award number DMS-1127914. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the above-named institutes and agencies. We would like to thank C. Oates and P. Schröder for helpful discussions, and C. Scovel for many helpful comments and suggestions.
Appendix A Correctness and computational complexity of the maximum-minimum distance ordering
Recall the variables used in Algorithm 3: the integer array contains the minimax ordering; the real array contains the distances of each point to the points that are already included in the minimax ordering; and the arrays of integer arrays and will contain the entries of the sparsity pattern in the sense that
[TABLE]
We begin by showing correctness of the algorithm.
Theorem A.1**.**
The ordering and sparsity pattern produced by Algorithm 3 coincide with those described in Section 1. Furthermore, whenever the while-loop in 22 is entered,
- (1)
for all , is as defined in Section 1; 2. (2)
the array contains all and for all other in , contains exactly those that satisfy ; and 3. (3)
for all , consists of and all those that satisfy .
Proof A.2**.**
*It is easy to see that if the for-loop in 27 were running over all , then the algorithm would yield the correct result. We claim that the restriction of the running variable to does not change the result of the algorithm. The proof will proceed by induction. Let us assume that the Algorithm 3 has been correct up to a given time that 27 is visited. Then, by choice of and the triangle inequality, any that is omitted by the for-loop must satisfy . Since was chosen to have maximal minimal distance among the points remaining in , and , this means that adding to the maximin ordering can not decrease the maximal minimal distance of . Thus, skipping the operation does not change the choice of and . Similarly, implies that the if-statement inside of the for-loop is false, meaning that skipping does not change the update of or , from which the result follows. *
Having established Theorem A.1, we will now use , , and to refer to the maximin ordering, the length-scale of the point with index , and the th index in the maximin ordering. We will now bound the complexity of Algorithm 3 in the setting of Theorem 2.1.
Theorem A.3**.**
*In the setting of Theorem 2.1, there exists a constant depending depending only on , , and , such that, for , Algorithm 3 has computational complexity in space and CN\bigl{(}\rho^{d}\log^{2}N+C_{\operatorname{\mathtt{dist}}_{\partial\Omega}}\bigr{)} in time, where is the computational complexity of invoking the function . *
Proof A.4**.**
*As a first step, we will upper-bound the number of iterations of the for-loop in 27 throughout the algorithm. To simplify the notation, will denote a positive constant that depends on , and that may change throughout the proof. We claim that there exists depending only on , , and , such that, for all , by the time it appears in the while-loop at 22, there exists an index such that and . Indeed, since has Lipschitz boundary, it satisfies an interior cone condition [2] in the sense that there exist and such that every point is the tip of a spherical cone within with opening angle and radius . This spherical cone contains a ball with radius , which depends only on and . Let be such a cone with tip . By a scaling argument, the spherical cone then contains a ball of radius , for all . For any and any ball with radius at least , there exists a such that and . Thus, for , there exists a with . By a sphere-packing argument, we can find a such that, for all , , which yields the claim. Because of the above, for , there exists a point satisfying the constraint in 25 with . Thus, the number of times the for-loop in 27 is visited for a given index is bounded above by . By a sphere-packing argument, , for a constant depending only on , , and . Summing the above over yields the upper bound . The most costly step in the for-loop in 27 is the decrease! operation requiring the restoration of the heap property, which has computational complexity . Thus, the overall computational complexity is at most CN\bigl{(}\rho^{d}\log^{2}N+C_{\operatorname{\mathtt{dist}}_{\partial\Omega}}\bigr{)}. The bound on the space complexity follows, since each iteration of the for-loop consumes memory. *
Proof A.5** (Proof of Theorem 4.1).**
Theorem 4.1* follows from Theorems A.1 and A.3. *
Algorithm 3 uses only pairwise distances between points, and thus automatically adapts to low-dimensional structure in the . Indeed, for , the computational complexity of Algorithm 3 depends only on the intrinsic dimension of the dataset.
{condition}
[Intrinsic dimension]
There exist constants , independent of , such that, for all and ,
[TABLE]
We say that the point set has intrinsic dimension .
{condition}
[Polynomial Scaling]
There exists a polynomial for which
[TABLE]
Theorem A.6**.**
*Let and . Then the computational complexity of Algorithm 3 is at most in space and CN\bigl{(}\log(N)\rho^{\tilde{d}}(\log N+C_{\operatorname{\mathtt{dist}}})+C_{\operatorname{\mathtt{dist}}_{\partial\Omega}}\bigr{)} in time, for a constant C=C\bigl{(}C_{\tilde{d}},\tilde{d},\boldsymbol{p}\bigr{)} depending only on the constants in A.5 and A.5. *
Proof A.7**.**
*The proof is analogous to that of Theorem A.3. The main difference is that the claim on the existence of is replaced by the fact — which follows directly from the definition of the maximin ordering — that, for all , there exists a such that and . In particular, any leads to near-linear computational complexity. *
Appendix B Proofs of Section 5
Proof B.1** (Proof of Lemma 5.3).**
The main idea is to recursively apply Lemma 5.5. First, applying Lemma 5.5 with first block and second block yields
[TABLE]
We now repeat this operation recursively. After the th step, the central matrix has an upper-left block consisting of . We then apply Lemma 5.5 to this upper-left block, with the splitting given by and . This reduces the central matrix more and more towards the block-diagonal matrix , while splitting off a triangular factor to either side. Doing this up to the th step yields the following identity:
[TABLE]
We now combine the lower-triangular factors, obtaining
[TABLE]
Here, we have used the formulae for the inverses and products of elementary lower-triangular matrices [82, pp.150–151],
[TABLE]
*where is the th standard Euclidean basis row vector, with . *
Proof B.2** (Proof of Lemma 5.12).**
We prove the result in the setting of Example 5.1, since the proof for Example 5.2 is similar. The inequality and (5.24) imply that
[TABLE]
*The identity \|\phi_{i}\|_{H^{-s}\left(B_{(\delta/2)h^{k}}^{2}(x_{i})\right)}^{2}=h^{2sk}(\delta/2)^{2s-d}\|\boldsymbol{\delta}(\hbox to5.71527pt{\hss\cdot\hss}-0)\|_{H^{-s}\left(B_{1}(0)\right)}^{2} concludes the proof with C_{\Phi}\coloneqq\|\mathcal{L}\|(\delta/2)^{d-2s}\left\|\boldsymbol{\delta}(\hbox to5.71527pt{\hss\cdot\hss}-0)\right\|_{H^{-s}\left(B_{1}(0)\right)}^{-1} and . *
Proof B.3** (Proof of Lemma 5.14 in the case of Example 5.2).**
Let be a set of points such that covers , and such that . For and , we write if is the element of closest to (using an arbitrary way to break ties). For , , and we have
[TABLE]
The Bramble–Hilbert lemma [22] and the vanishing moment property (5.28) of yield that
[TABLE]
Summing over all and choosing the constant appropriately yields
[TABLE]
Since the are -orthogonal to each other and ,
[TABLE]
Inserting the definition of the yields
[TABLE]
We will now use the fact that on , we have the norm inequalities n^{-1/2}|\hbox to5.71527pt{\hss\cdot\hss}|_{1}\leq|\hbox to5.71527pt{\hss\cdot\hss}|_{2}\leq|\hbox to5.71527pt{\hss\cdot\hss}|_{1}. By a sphere-packing argument, for any , we have Thus, the number of summands in the innermost sum is at most and using the above norm inequalites, we obtain
[TABLE]
*Putting the above together yields the result. *
Proof B.4** (Proof of Lemma 5.28).**
Define
[TABLE]
and observe that . Since , it follows from a Neumann series argument that . The positive definiteness of implies that
[TABLE]
Let . Lemma 5.26 implies that
[TABLE]
Combining the above estimates yields
[TABLE]
By choosing
[TABLE]
and , we obtain
[TABLE]
This yields the upper bound
[TABLE]
Optimising the term on line (B.39) over yields
[TABLE]
Proof B.5** (Proof of Lemma 5.30).**
In this proof we will use the notation to denote the individual indices from to , as opposed to matrix blocks. We will establish the result by showing that, for all , the th column of (when considered as an element of by zero padding) satisfies the exponential decay stated in the lemma. Let . Then . Lemma 5.5 implies that , and hence Lemma 5.28 yields that
[TABLE]
*Here we used the facts that the spectrum of is contained in and that the right-hand side of the above estimate is increasing in and . The estimate completes the proof. *
Proof B.6** (Proof of Lemma 5.32).**
For any matrix for which the Neumann series converges in the operator norm, we have . Therefore, if the right-hand side series is convergent. Since has the block-lower-triangular structure
[TABLE]
it follows that is -nilpotent, i.e. and the Neuman series terminates after the first summands. Using this we will now show that the exponential decay of is preserved under inversion. To this end, consider the th block of and observe that
[TABLE]
where the inequality follows from Lemma 5.26. Summing (B.43) over , we obtain, for ,
[TABLE]
*which concludes the proof of the lemma. *
With the above results on the propagation of exponential decay we can now conclude the proof of Theorem 5.25.
Proof B.7** (Proof of Theorem 5.25).**
Applying Lemma 5.28, Section 5.3.2, and the condition number bound in Section 5.3.3 yields the following estimate for :
[TABLE]
with C_{R}=\max\left\{1,\frac{2C_{\gamma}\bigl{\|}B^{(k),-1}\bigr{\|}}{1+\kappa}\right\} and . Lemma 5.5 and Section 5.3.3 yield
[TABLE]
Using these estimates, we obtain
[TABLE]
where , and . Applying Lemma 5.26 to the products appearing in the definition of in Lemma Lemma 5.3, we obtain
[TABLE]
Lemma 5.32* now yields the following decay bound for :*
[TABLE]
For a positive-definite matrix , let denote its lower-triangular Cholesky factor and set . Following the same procedure as in the bound of the decay of yields the decay bound
[TABLE]
Applying Lemma 5.26 to the product yields the decay bound
[TABLE]
Appendix C Proof of accuracy of incomplete Cholesky factorization in the supernodal multicolor ordering
We will now bound the approximation error of the Cholesky factors obtained from Algorithm 2, using the supernodal multicolor ordering and sparsity pattern described in 5.38. For , let be the submatrix and let be the (dense and lower-triangular) Cholesky factor of a matrix .
First observe that Algorithm 2 with supernodal multicolor ordering and sparsity pattern is equivalent to the block-incomplete Cholesky factorization described in Algorithm 4 where the function sets all entries of outside of to zero.
We will now reformulate the above algorithm using the fact that the elimination of nodes of the same color, on the same level of the hierarchy, happens consecutively. Let be the maximal number of colors used on any level of the hierarchy. We can then write , where is the set of indices on level colored in the color . Let be the restriction of to and write . We can then rewrite Algorithm 4 as
For and a matrix with for all , let be the matrix obtained by applying followed by the Schur complementation M\leftarrow M-M_{(:,:),(k,l)}\bigl{(}M_{(k,l),(k,l)}\bigr{)}^{-1}M_{(k,l),(:,:)}. We now prove a stability estimate for the operator . Let be the restriction of a matrix to .
Lemma C.1**.**
For and let be such that
[TABLE]
and (writing for the submatrix of and for maximal singular values) define
[TABLE]
If
[TABLE]
then the following perturbation estimate holds:
[TABLE]
Proof C.2**.**
Write , for the versions of , set to zero outside of . For ,
[TABLE]
where the second equality follows from the matrix identity
[TABLE]
Now recall that, for all , and . Therefore, and . Combining these estimates and using the triangle inequality yields
[TABLE]
Recursive application of the above lemma gives a stability result for the incomplete Cholesky factorization.
Lemma C.3**.**
For , let and be a supernodal ordering and sparsity pattern such that the maximal number of colors used on each level is at most . Let be an invertible lower-triangular matrix with nonzero pattern and define . Assume that satisfies Section 5.3.3 with constant . Then there exists a universal constant such that, for all and all with ,
[TABLE]
*where is the Cholesky factor obtained by applying Algorithm 5 to . *
Proof C.4**.**
*The result follows from applying Lemma C.1 at each step of Algorithm 5. *
We can prove Theorem 5.41 by using the stability result obtained above.
Proof C.5** (Proof of Theorem 5.41).**
Theorem 5.34* implies that by choosing there exists a lower-triangular matrix with sparsity pattern such that \bigl{\|}\Theta-\tilde{L}^{S_{\rho}}\tilde{L}^{S_{\rho},\top}\bigr{\|}_{\operatorname{Fro}}\leq\epsilon. Theorem 5.23 implies that the Example 5.1 and Example 5.2 satisfy . Therefore, choosing ensures that and thus that satisfies Section 5.3.3 with constant , where is the corresponding constant for . By possibly changing the constant again, also ensures that*
[TABLE]
where is the constant of Lemma C.3, since and, by Lemma 5.39, is bounded independently of . Thus, by Lemma C.3, the Cholesky factor obtained from applying Algorithm 5 to \Theta=\tilde{\Theta}+\bigl{(}\Theta-\tilde{\Theta}\bigr{)} satisfies
[TABLE]
*where is the constant with which satisfies Section 5.3.3 and the polynomial depends only on , , and . Since, for the ordering and sparsity pattern , the Cholesky factors obtained via Algorithms 2 and 5 coincide, we obtain the result. *
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] M. Abramowitz and I. A. Stegun , Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables , vol. 55 of National Bureau of Standards Applied Mathematics Series, U.S. Government Printing Office, Washington, D.C., 1964.
- 2[2] R. A. Adams and J. J. F. Fournier , Sobolev Spaces , vol. 140 of Pure and Applied Mathematics (Amsterdam), Elsevier/Academic Press, Amsterdam, second ed., 2003.
- 3[3] S. Ambikasaran and E. Darve , An 𝒪 ( N log N ) 𝒪 𝑁 𝑁 \mathcal{O}(N\log N) fast direct solver for partial hierarchically semi-separable matrices , J. Sci. Comput., 57 (2013), pp. 477–501, https://doi.org/10.1007/s 10915-013-9714-z . · doi ↗
- 4[4] S. Ambikasaran, D. Foreman-Mackey, L. Greengard, D. W. Hogg, and M. O’Neil , Fast direct methods for Gaussian processes , IEEE Trans. Pattern Anal. Mach. Intell., 38 (2016), pp. 252–265, https://doi.org/10.1109/TPAMI.2015.2448083 . · doi ↗
- 5[5] F. R. Bach and M. I. Jordan , Kernel independent component analysis , J. Mach. Learn. Res., 3 (2003), pp. 1–48, https://doi.org/10.1162/153244303768966085 . · doi ↗
- 6[6] S. Banerjee, A. E. Gelfand, A. O. Finley, and H. Sang , Gaussian predictive process models for large spatial data sets , J. R. Stat. Soc. Ser. B Stat. Methodol., 70 (2008), pp. 825–848, https://doi.org/10.1111/j.1467-9868.2008.00663.x . · doi ↗
- 7[7] M. Bebendorf , Hierarchical Matrices , vol. 63 of Lecture Notes in Computational Science and Engineering, Springer-Verlag, Berlin, 2008, https://doi.org/10.1007/978-3-540-77147-0 . · doi ↗
- 8[8] M. Bebendorf and W. Hackbusch , Existence of ℋ ℋ \mathcal{H} -matrix approximants to the inverse FE-matrix of elliptic operators with L ∞ superscript 𝐿 L^{\infty} -coefficients , Numer. Math., 95 (2003), pp. 1–28, https://doi.org/10.1007/s 00211-002-0445-6 . · doi ↗
