Compression, inversion, and approximate PCA of dense kernel matrices at near-linear computational complexity

Florian Schäfer; T. J. Sullivan; Houman Owhadi

arXiv:1706.02205·math.NA·November 3, 2020


TL;DR

This paper introduces a near-linear complexity method for compressing, inverting, and performing approximate PCA on dense kernel matrices derived from elliptic boundary value problems, using sparse Cholesky factorization.

Contribution

It presents a novel algorithm that efficiently approximates dense kernel matrices with provable guarantees, improving computational complexity over previous methods.

Findings

01. Achieves near-linear complexity for matrix compression and inversion.

02. Provides an approximate sparse PCA with optimal convergence rate.

03. Demonstrates improved performance for elliptic PDE solvers and kernel matrix operations.

Abstract

Dense kernel matrices $\Theta \in \mathbb{R}^{N \times N}$ obtained from point evaluations of a covariance function $G$ at locations $\{x_{i}\}_{1 \leq i \leq N} \subset \mathbb{R}^{d}$ arise in statistics, machine learning, and numerical analysis. For covariance functions that are Green's functions of elliptic boundary value problems and homogeneously-distributed sampling points, we show how to identify a subset $S \subset \{1, \dots, N\}^{2}$, with $\#S = \mathcal{O}(N \log(N) \log^{d}(N/\epsilon))$, such that the zero fill-in incomplete Cholesky factorisation of the sparse matrix $\Theta_{ij} \mathbf{1}_{(i,j) \in S}$ is an $\epsilon$-approximation of $\Theta$. This factorisation can provably be obtained in complexity $\mathcal{O}(N \log(N) \log^{d}(N/\epsilon))$ in space and $\mathcal{O}(N \log^{2}(N) \log^{2d}(N/\epsilon))$ in time, improving upon the state of the art for general elliptic…

Tables (8)

Table 4.1: $G^{\text{Matérn}}_{\nu,l}$ with $\nu=0.5$, $l=0.2$, $\rho=3.0$, and $d=2$.

$N$	$nnz (L) / N^{2}$	$rank (L)$	$t_{SortSparse}$	$t_{Entries}$	$t_{ICHOL(0)}$	$E$	$\bar{E}$
$20000$	5.26e-03	20000	0.71	0.81	0.42	1.25e-03 (3.68e-06)	1.11e-03 (3.01e-06)
$40000$	2.94e-03	40000	1.21	1.19	1.00	1.27e-03 (3.32e-06)	1.12e-03 (3.56e-06)
$80000$	1.62e-03	80000	2.72	2.82	2.55	1.30e-03 (3.20e-06)	1.21e-03 (3.29e-06)
$160000$	8.91e-04	160000	6.86	6.03	6.11	1.28e-03 (3.57e-06)	1.16e-03 (3.32e-06)
$320000$	4.84e-04	320000	17.22	13.79	15.66	1.23e-03 (3.19e-06)	1.11e-03 (2.40e-06)
$640000$	2.63e-04	640000	41.40	31.02	36.02	1.24e-03 (2.58e-06)	1.09e-03 (3.02e-06)
$1280000$	1.41e-04	1280000	98.34	65.96	85.99	1.23e-03 (3.72e-06)	1.10e-03 (3.74e-06)
$2560000$	7.55e-05	2560000	233.92	148.43	197.52	1.16e-03 (2.82e-06)	1.04e-03 (3.36e-06)

Table 4.2: $G^{\text{Matérn}}_{\nu,l}$ with $\nu=0.5$, $l=0.2$, $\rho=3.0$, and $d=3$.

$N$	$nnz (L) / N^{2}$	$rank (L)$	$t_{SortSparse}$	$t_{Entries}$	$t_{ICHOL(0)}$	$E$	$\bar{E}$
$20000$	1.30e-02	20000	1.61	1.44	2.94	1.49e-03 (5.00e-06)	1.20e-03 (5.09e-06)
$40000$	7.60e-03	40000	3.26	3.32	8.33	1.21e-03 (4.29e-06)	9.91e-04 (3.72e-06)
$80000$	4.35e-03	80000	7.46	7.64	22.46	1.06e-03 (3.74e-06)	8.51e-04 (2.93e-06)
$160000$	2.45e-03	160000	20.95	18.42	57.64	9.81e-04 (2.33e-06)	7.88e-04 (3.23e-06)
$320000$	1.37e-03	320000	53.58	40.72	141.46	9.27e-04 (2.26e-06)	7.53e-04 (2.72e-06)
$640000$	7.61e-04	640000	133.55	96.67	350.10	8.98e-04 (3.25e-06)	7.25e-04 (3.02e-06)
$1280000$	4.19e-04	1280000	312.43	212.57	820.07	8.59e-04 (2.79e-06)	7.00e-04 (2.87e-06)
$2560000$	2.29e-04	2560000	795.68	480.17	1981.92	8.96e-04 (2.76e-06)	7.73e-04 (4.28e-06)

Table 4.3: $G^{\text{Matérn}}_{\nu,l}$ with $\nu=1.0$, $l=0.2$, $N=10^{6}$, and $d=2$.

	$nnz (L) / N^{2}$	$rank (L)$	$t_{SortSparse}$	$t_{Entries}$	$t_{ICHOL(0)}$	$E$	$\bar{E}$
$ρ = 2.0$	8.78e-05	254666	38.06	33.72	17.54	2.04e-02 (1.73e-02)	2.34e-02 (2.75e-02)
$ρ = 3.0$	1.76e-04	964858	71.07	67.85	61.35	2.32e-03 (6.02e-06)	2.09e-03 (7.50e-06)
$ρ = 4.0$	2.90e-04	999810	115.07	112.56	152.93	3.92e-04 (1.44e-06)	3.72e-04 (2.32e-06)
$ρ = 5.0$	4.26e-04	999999	165.91	166.60	312.19	6.70e-05 (2.98e-07)	5.68e-05 (2.55e-07)
$ρ = 6.0$	5.83e-04	1000000	227.62	229.76	566.94	1.45e-05 (6.69e-08)	1.08e-05 (5.01e-08)
$ρ = 7.0$	7.59e-04	1000000	292.52	300.65	944.33	4.05e-06 (4.96e-08)	2.10e-06 (1.69e-08)
$ρ = 8.0$	9.53e-04	1000000	363.90	380.07	1476.71	1.62e-06 (2.30e-08)	4.08e-07 (9.47e-09)
$ρ = 9.0$	1.16e-03	1000000	447.47	467.07	2200.32	8.98e-07 (1.44e-08)	1.42e-07 (5.14e-09)

Table 4.4: $G^{\text{Matérn}}_{\nu,l}$ with $\nu=0.5$, $l=0.2$, $N=10^{6}$, and $d=3$.

	$nnz (L) / N^{2}$	$rank (L)$	$t_{SortSparse}$	$t_{Entries}$	$t_{ICHOL(0)}$	$E$	$\bar{E}$
$ρ = 2.0$	1.87e-04	998046	87.83	56.44	85.20	1.69e-02 (6.89e-04)	1.60e-02 (3.36e-04)
$ρ = 3.0$	5.17e-04	1000000	226.84	158.42	599.86	8.81e-04 (3.21e-06)	7.15e-04 (2.99e-06)
$ρ = 4.0$	1.05e-03	1000000	446.52	326.27	2434.52	1.85e-04 (5.37e-07)	1.59e-04 (5.30e-07)
$ρ = 5.0$	1.82e-03	1000000	747.65	567.06	7227.45	2.89e-05 (1.94e-07)	1.84e-05 (1.15e-07)
$ρ = 6.0$	2.82e-03	1000000	1344.59	928.27	17640.58	1.15e-05 (1.06e-07)	5.34e-06 (5.34e-08)

Table 4.5: We tabulate the approximation rank and error for $\rho=5.0$ and $N=10^{6}$ points uniformly distributed in $[0,1]^{3}$. The covariance function is $G^{\text{Matérn}}_{\nu,0.2}$ for $\nu$ ranging around $\nu=0.5$ and $\nu=1.5$. Even though the intermediate values of $\nu$ correspond to a fractional order elliptic PDE, the behavior of the approximation stays the same.

	$ν = 0.3$	$ν = 0.5$	$ν = 0.7$	$ν = 0.9$	$ν = 1.1$	$ν = 1.3$	$ν = 1.5$	$ν = 1.7$
$rank (L)$	1000000	1000000	1000000	1000000	1000000	1000000	1000000	999893
$E$	7.04e-05	2.89e-05	2.49e-05	3.58e-05	6.03e-05	8.77e-05	1.18e-04	1.46e-04
	(3.98e-07)	(1.79e-07)	(1.11e-07)	(1.19e-07)	(2.37e-07)	(3.06e-07)	(4.52e-07)	(5.39e-07)
$\bar{E}$	5.19e-05	1.85e-05	1.77e-05	2.82e-05	4.88e-05	6.87e-05	9.06e-05	1.13e-04
	(2.26e-07)	(1.18e-07)	(8.11e-08)	(1.30e-07)	(2.37e-07)	(3.50e-07)	(5.14e-07)	(5.45e-07)

Table 4.6: $G^{\text{Cauchy}}_{l,\alpha,\beta}$ for $(l,\alpha,\beta)=(0.4,0.5,0.025)$ (first table) and $(l,\alpha,\beta)=(0.2,1.0,0.20)$ (second table), for $N=10^{6}$ and $d=2$.

	$ρ = 2.0$	$ρ = 3.0$	$ρ = 4.0$	$ρ = 5.0$	$ρ = 6.0$	$ρ = 7.0$	$ρ = 8.0$	$ρ = 9.0$
$rank (L)$	999923	1000000	1000000	1000000	1000000	1000000	1000000	1000000
$E$	4.65e-04	5.98e-05	2.36e-05	1.19e-05	4.84e-06	4.17e-06	2.25e-06	1.42e-06
	(4.23e-07)	(1.56e-07)	(9.53e-08)	(6.32e-08)	(4.14e-08)	(4.99e-08)	(1.86e-08)	(1.64e-08)
$\bar{E}$	3.81e-04	3.49e-05	9.83e-06	4.65e-06	1.47e-06	8.49e-07	4.25e-07	2.12e-07
	(4.98e-07)	(1.59e-07)	(5.56e-08)	(2.63e-08)	(7.73e-09)	(1.04e-08)	(4.81e-09)	(3.24e-09)

Table 4.7: $G^{\text{Matérn}}_{\nu,l}$ for $\nu=0.5$, $l=0.2$, and $\rho=3.0$ with $N=10^{6}$ points chosen as in Fig. 4.6.

	$δ_{z} = 0.0$	$δ_{z} = 0.1$	$δ_{z} = 0.2$	$δ_{z} = 0.3$	$δ_{z} = 0.4$	$δ_{z} = 0.5$	$δ_{z} = 0.6$
$\frac{nnz (L)}{N^{2}}$	1.76e-04	1.77e-04	1.78e-04	1.80e-04	1.82e-04	1.84e-04	1.85e-04
$t_{ICHOL(0)}$	61.92	62.15	62.81	64.27	64.87	65.50	66.12
$rank (L)$	1000000	1000000	1000000	1000000	1000000	1000000	1000000
$E$	1.17e-03	1.11e-03	1.28e-03	1.60e-03	1.72e-03	1.89e-03	2.11e-03
	(2.74e-06)	(3.00e-06)	(2.73e-06)	(4.28e-06)	(3.95e-06)	(5.11e-06)	(5.07e-06)

Table 4.8: $G^{\text{Matérn}}_{\nu,l}$ for $\nu=0.5$, $l=0.5$, and $N=10^{6}$ points as in Fig. 4.7.

	$nnz (L) / N^{2}$	$rank (L)$	$t_{SortSparse}$	$t_{Entries}$	$t_{ICHOL(0)}$	$E$
$ρ = 2.0$	1.62e-04	997635	80.60	57.11	52.49	1.57e-02 (1.13e-03)
$ρ = 3.0$	3.76e-04	1000000	173.86	135.61	248.78	2.88e-03 (1.14e-05)
$ρ = 4.0$	6.76e-04	1000000	302.98	247.74	748.62	8.80e-04 (4.97e-06)
$ρ = 5.0$	1.05e-03	1000000	462.98	397.42	1802.44	3.44e-04 (2.54e-06)
$ρ = 6.0$	1.49e-03	1000000	645.56	556.72	3696.31	1.44e-04 (8.76e-07)
$ρ = 7.0$	2.02e-03	1000000	891.08	758.88	6855.23	7.61e-05 (5.66e-07)
$ρ = 8.0$	2.62e-03	1000000	1248.90	990.86	11598.66	4.57e-05 (4.36e-07)

Equations (340)

\Theta_{ij} \coloneqq G(x_{i}, x_{j}),

\mathcal{L} \colon H_{0}^{s}(\Omega) \to H^{-s}(\Omega)

\int_{\Omega} u \mathcal{L} v \,\mathrm{d}x = 0 \text{ for all } u, v \in H_{0}^{s}(\Omega) \text{ such that } \operatorname{supp} u \cap \operatorname{supp} v = \emptyset.

\delta \coloneqq \frac{\min_{i \neq j \in I} \operatorname{dist}(x_{i}, \{x_{j}\} \cup \partial\Omega)}{\max_{x \in \Omega} \operatorname{dist}(x, \{x_{i}\}_{i \in I} \cup \partial\Omega)}

\bigl\|\Theta - L^{\rho} L^{\rho,\top}\bigr\|_{\operatorname{Fro}} \leq \epsilon.

i_{1} \coloneqq \operatorname{arg\,max}_{i \in I} \operatorname{dist}(x_{i}, \partial\Omega).

i_{k+1} \coloneqq \operatorname{arg\,max}_{i \in I \setminus \{i_{1}, \dots, i_{k}\}} \operatorname{dist}(x_{i}, \{x_{i_{1}}, \dots, x_{i_{k}}\} \cup \partial\Omega).

l[i_{k}] \coloneqq \operatorname{dist}(x_{i_{k}}, \{x_{i_{1}}, \dots, x_{i_{k-1}}\} \cup \partial\Omega),

S_{\rho} \coloneqq \{(i,j) \in I \times I \mid \operatorname{dist}(x_{i}, x_{j}) \leq \rho \max(l[i], l[j])\}.

\bigl\|\Theta - L^{(k)} L^{(k),\top}\bigr\| \leq C \|\Theta\| k^{-\frac{2s}{d}},

L^{S_{\rho}}_{ij} \coloneqq L_{ij} \mathbf{1}_{(i,j) \in S_{\rho}} = \begin{cases} L_{ij}, & \text{for } (i,j) \in S_{\rho}, \\ 0, & \text{else,} \end{cases}

\begin{pmatrix} \Theta_{1,1} & \Theta_{1,2} \\ \Theta_{2,1} & \Theta_{2,2} \end{pmatrix} = \begin{pmatrix} \mathrm{Id} & 0 \\ \Theta_{2,1}(\Theta_{1,1})^{-1} & \mathrm{Id} \end{pmatrix} \begin{pmatrix} \Theta_{1,1} & 0 \\ 0 & \Theta_{2,2} - \Theta_{2,1}(\Theta_{1,1})^{-1}\Theta_{1,2} \end{pmatrix} \begin{pmatrix} \mathrm{Id} & (\Theta_{1,1})^{-1}\Theta_{1,2} \\ 0 & \mathrm{Id} \end{pmatrix},

\mathbb{E}[X_{2} \mid X_{1} = a]

\operatorname{Cov}[X_{2} \mid X_{1}]

\phi_{i}^{(k)} = \sum_{j \in I^{(l)}} \pi_{i,j}^{(k,l)} \phi_{j}^{(l)}.

\psi_{i}^{(k)} \coloneqq \mathbb{E}\bigl[\xi \mid [\phi_{j}^{(k)}, \xi] = \delta_{ij} \text{ for all } j \in I^{(k)}\bigr] \text{ for } i \in I^{(k)}

\psi_{i}^{(k)} = \sum_{j \in I^{(k)}} \Theta_{i,j}^{(k),-1} \mathcal{L}^{-1} \phi_{j}^{(k)} \text{ for } i \in I^{(k)},

\chi_{i}^{(k)} \coloneqq \sum_{j} W_{ij}^{(k)} \psi_{j}^{(k)} \text{ for } i \in J^{(k)},

\chi_{i}^{(k)} \coloneqq \mathbb{E}\bigl[\xi \mid [\phi_{j}^{(k),W}, \xi] = \delta_{ij}\delta_{kl} \text{ for all } 1 \leq l \leq k,\ j \in J^{(l)}\bigr] \text{ for } i \in J^{(k)},

\bigl\langle \chi_{i}^{(k)}, \chi_{j}^{(l)} \bigr\rangle = 0 \text{ for } l \neq k \text{ and } (i,j) \in J^{(k)} \times J^{(l)}.

\bar{\Theta}_{k,l} \coloneqq W^{(k)} \Theta^{(k)} \pi^{(k,l)} W^{(l),\top},

\bigl(\bar{\Theta}_{k,l}\bigr)_{ij} \coloneqq \bigl[\phi^{(k),W}_{i}, \mathcal{L}^{-1}\phi^{(l),W}_{j}\bigr].

\bar{\Theta} = \bar{L} D \bar{L}^{\top},

\bar{L}_{i,j} \coloneqq \begin{cases} \delta_{i,j}, & \text{if } i,j \in J^{(k)}, \\ 0, & \text{if } i \in J^{(k)},\ j \in J^{(k')}, \text{ and } k' > k, \\ [\phi_{i}^{(k)}, \chi_{j}^{(k')}], & \text{if } i \in J^{(k)},\ j \in J^{(k')}, \text{ and } k' < k. \end{cases}

\tilde{\Theta} = \Theta + \sigma^{2} \mathrm{Id}.

\tilde{\Theta} = \Theta(\sigma^{2} A + \mathrm{Id}),

\tilde{\Theta} = L L^{\top} P^{\updownarrow} \tilde{L} \tilde{L}^{\top} P^{\updownarrow},

E \coloneqq \frac{\|LL^{\top} - \Theta\|_{\operatorname{Fro}}}{\|\Theta\|_{\operatorname{Fro}}} \approx \frac{\sqrt{\sum_{k=1}^{m} \bigl\|\bigl(LL^{\top} - \Theta\bigr)_{i_{k}j_{k}}\bigr\|^{2}}}{\sqrt{\sum_{k=1}^{m} \|\Theta_{i_{k}j_{k}}\|^{2}}},

G^{\text{Matérn}}_{l,\nu}(x,y) \coloneqq \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,|x-y|}{l}\right)^{\nu} K_{\nu}\left(\frac{\sqrt{2\nu}\,|x-y|}{l}\right),

G^{\text{Cauchy}}_{l,\alpha,\beta}(x,y) \coloneqq \left(1 + \left(\frac{|x-y|}{l}\right)^{\alpha}\right)^{-\frac{\beta}{\alpha}}.

\|\phi\|_{*} \coloneqq \sup_{0 \neq u \in B} \frac{[\phi, u]}{\|u\|} = \sqrt{[\phi, G\phi]} \text{ for } \phi \in B^{*}.

Code & Models

Repositories

f-t-s/nearLinKernel
noneOfficial

Full text

Compression, inversion, and approximate PCA of dense kernel matrices at near-linear computational complexity

Florian Schäfer, California Institute of Technology, MC 305-16, 1200 East California Boulevard, Pasadena, CA 91125, USA. Phone: (626) 395-3531, Fax: (626) 578-0124. Corresponding author.

T. J. Sullivan, Mathematics Institute and School of Engineering, The University of Warwick, Coventry, CV4 7AL, UK; and Zuse Institute Berlin, Takustraße 7, 14195 Berlin, Germany.

Houman Owhadi, California Institute of Technology.

Abstract

Dense kernel matrices $\Theta\in\mathbb{R}^{N\times N}$ obtained from point evaluations of a covariance function $G$ at locations $\{x_{i}\}_{1\leq i\leq N}\subset\mathbb{R}^{d}$ arise in statistics, machine learning, and numerical analysis. For covariance functions that are Green’s functions of elliptic boundary value problems and homogeneously-distributed sampling points, we show how to identify a subset $S\subset\{1,\dots,N\}^{2}$, with $\#S=\mathcal{O}(N\log(N)\log^{d}(N/\epsilon))$, such that the zero fill-in incomplete Cholesky factorisation of the sparse matrix $\Theta_{ij}\mathbf{1}_{(i,j)\in S}$ is an $\epsilon$-approximation of $\Theta$. This factorisation can provably be obtained in complexity $\mathcal{O}(N\log(N)\log^{d}(N/\epsilon))$ in space and $\mathcal{O}(N\log^{2}(N)\log^{2d}(N/\epsilon))$ in time, improving upon the state of the art for general elliptic operators; we further present numerical evidence that $d$ can be taken to be the intrinsic dimension of the data set rather than that of the ambient space. The algorithm only needs to know the spatial configuration of the $x_{i}$ and does not require an analytic representation of $G$. Furthermore, this factorization straightforwardly provides an approximate sparse PCA with optimal rate of convergence in the operator norm. Hence, by using only subsampling and the incomplete Cholesky factorization, we obtain, at nearly linear complexity, the compression, inversion and approximate PCA of a large class of covariance matrices. By inverting the order of the Cholesky factorization we also obtain a solver for elliptic PDE with complexity $\mathcal{O}(N\log^{d}(N/\epsilon))$ in space and $\mathcal{O}(N\log^{2d}(N/\epsilon))$ in time, improving upon the state of the art for general elliptic operators.

keywords:

Cholesky factorization, covariance function, gamblet transform, kernel matrix, sparsity, principal component analysis

{AMS}

65F30, 42C40, 65F50, 65N55, 65N75, 60G42, 68Q25, 68W40

1 Introduction

1.1 Dense kernel matrices and the $N^{3}$-bottleneck

Kernel matrices, i.e. square matrices $\Theta$ of the form

$$\Theta_{ij}\coloneqq G(x_{i},x_{j}),$$

obtained from pointwise evaluation of a symmetric positive-definite kernel $G$ at a collection of points $\{x_{i}\}_{i\in I}$ in a domain $\Omega\subset\mathbb{R}^{d}$, play an important role in statistics, machine learning, and scientific computing. In statistics, they are used as covariance matrices of Gaussian process priors. In machine learning, they equip the feature space with a meaningful inner product via the kernel trick [44]. In scientific computing, they appear as Green’s functions (i.e. fundamental solutions) of linear elliptic partial differential equations (PDEs).

For all these applications, it is usually necessary to perform some or all of the following tasks:

(1) compute $v\mapsto\Theta v$, given $v\in\mathbb{R}^{I}$;

(2) compute $v\mapsto\Theta^{-1}v$, given $v\in\mathbb{R}^{I}$;

(3) compute $\operatorname{log\,det}\Theta$;

(4) sample from the normal/Gaussian distribution $\mathcal{N}(0,\Theta)$;

(5) approximate eigenspaces corresponding to the leading eigenvalues of $\Theta$.

The first four of these tasks can be performed by computing the Cholesky factorization of $\Theta$ (i.e. the decomposition $\Theta=LL^{\top}$ where $L$ is lower triangular). For many popular covariance functions, most notably those of smooth random processes, the matrices $\Theta$ will be dense. For large $N\coloneqq\#I$ this results in a computational complexity of $\mathcal{O}(N^{3})$ for the Cholesky factorization and a complexity of $\mathcal{O}(N^{2})$ to even store the matrix. When $\Theta$ is sparse, i.e. has relatively few non-zero entries, better complexity can be achieved, the obvious limiting case being $\mathcal{O}(N)$ (i.e. linear) complexity if $\Theta$ is diagonal. For dense $\Theta$, however, the cubic scaling restricts Cholesky factorization in practice to problems with $N\lessapprox 10^{5}$. The breadth of kernel matrices’ uses means that there is correspondingly high interest in achieving approximate Cholesky factorization of $\Theta$ at linear or near-linear cost.
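To make the role of the Cholesky factor concrete, the following is a minimal sketch (not taken from the paper) of how tasks (1) to (4) reduce to operations with a dense factor $L$. It uses only the Python standard library; the exponential kernel, the 30 random points, and all function names are assumptions chosen for illustration, and the dense factorization here is exactly the $\mathcal{O}(N^{3})$ computation that the paper aims to avoid.

```python
import math
import random


def kernel(x, y, length=0.2):
    """Exponential covariance exp(-|x - y| / length); an illustrative choice."""
    return math.exp(-math.dist(x, y) / length)


def cholesky(theta):
    """Dense Cholesky factor L (lower triangular, theta = L L^T); O(N^3) work."""
    n = len(theta)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pivot = theta[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(pivot)
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (theta[i][j] - s) / L[j][j]
    return L


def apply_lower(L, v):
    """Return L v."""
    return [sum(L[i][k] * v[k] for k in range(i + 1)) for i in range(len(L))]


def apply_upper(L, v):
    """Return L^T v."""
    n = len(L)
    return [sum(L[k][i] * v[k] for k in range(i, n)) for i in range(n)]


def solve(L, b):
    """Solve (L L^T) x = b by forward and then backward substitution."""
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(30)]
theta = [[kernel(p, q) for q in points] for p in points]
L = cholesky(theta)
v = [rng.gauss(0.0, 1.0) for _ in points]

theta_v = apply_lower(L, apply_upper(L, v))                      # task (1)
theta_inv_v = solve(L, v)                                        # task (2)
log_det = 2.0 * sum(math.log(L[i][i]) for i in range(len(L)))    # task (3)
sample = apply_lower(L, [rng.gauss(0.0, 1.0) for _ in points])   # task (4)
```

Once $L$ is available, each task costs only as much as a product or triangular solve with $L$, which is why a sparse approximate factor would make all four tasks cheap at once.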

1.2 Existing approaches

Many fast methods are available for approximating dense kernel matrices and their applicability depends on specific assumptions made on $\Theta$ . If the precision matrix $\Theta^{-1}$ is sparse and can be approximated directly (e.g. by discretizing a PDE), then sparse linear solvers can be used. These include multigrid solvers [24, 17, 38, 40] and sparse Cholesky factorization methods with nested dissection ordering [31, 30, 57, 32]. This approach has been proposed for problems arising in spatial statistics [56, 71, 72, 70]. In other situations, available methods directly approximate the covariance matrix based on low-rank approximations, sparsity, and hierarchy. Low-rank techniques such as the Nyström approximation [85, 79, 28] or rank-revealing Cholesky factorization [5, 26] seek to approximate $\Theta$ by low-rank matrices whereas sparsity-based methods like covariance tapering [29] seek to approximate $\Theta$ with a sparse matrix by setting entries corresponding to long-range interactions to zero. These two approximations can also be combined to obtain sparse low-rank approximations [75, 68, 77, 6, 80], which can be interpreted as imposing a particular graphical structure on the Gaussian process. When $\Theta$ is neither sufficiently sparse nor of sufficiently low rank, these approaches can be implemented in a hierarchical manner. For low-rank methods, this leads to hierarchical ( $\mathcal{H}$ - and $\mathcal{H}^{2}$ -) matrices [42, 39, 41], hierarchical off-diagonal low rank (HODLR) matrices [3, 4], and hierarchically semiseparable (HSS) matrices [19, 86, 54] that rely on computing low-rank approximations of sub-blocks of $\Theta$ corresponding to far-field interactions on different scales. The interpolative factorization developed by [43] combines hierarchical low-rank structure with the sparsity obtained from an elimination ordering of nested-dissection type. 
Hierarchical low-rank structure was originally developed as an algebraic abstraction of the fast multipole method of [35]. In order to construct hierarchical low-rank approximations from entries of the kernel matrix efficiently, both deterministic and randomized algorithms have been proposed [9, 60]. For many popular covariance functions, including Green’s functions of elliptic PDEs [8], hierarchical matrices allow for (near-)linear-in- $N$ complexity algorithms for the inversion and approximation of $\Theta$ , at exponential accuracy. Wavelet-based methods [13, 33], using the separation and truncation of interactions on different scales, can be seen as a hierarchical application of sparse approximation approaches. The resulting algorithms have near-linear computational complexity and rigorous error bounds for asymptotically smooth covariance functions. [25] use operator-adapted wavelets to compress the expected solution operators of random elliptic PDEs. In [50], although no rigorous accuracy estimates are provided, the authors establish the near-linear computational complexity of algorithms resulting from the multi-scale generalization of probabilistically motivated sparse and low-rank approximations [75, 68, 77, 6, 80].

1.3 Our main result and overview of the paper

Our main result is to show that a small modification of the Cholesky factorization algorithm is both accurate and scalable, when applied to kernel matrices obtained from kernels $G$ identified as Green’s functions of elliptic PDEs and a (roughly) homogeneously distributed cloud of points. Such kernels are oftentimes used as covariance functions of smooth Gaussian processes (to enforce a smoothness prior on the function to be recovered/interpolated) and therefore a large class of popular kernels fall into this category. The cheap, accurate, approximate Cholesky factors provided by our method thereby serve tasks (1–4) from Section 1.1. We furthermore show that by reversing the elimination order we obtain a fast direct solver for elliptic PDEs.

Contrary to the present belief that fast solvers for elliptic integral operators require the use of hierarchical low-rank structure or wavelets with a high order of vanishing moments, we show that state-of-the-art performance can be obtained just by zero fill-in Cholesky factorization (which just amounts to skipping some steps in the Cholesky factorization algorithm — wavelets are only used in the detailed rigorous analysis of the algorithm). While there is a huge literature on the sparse Cholesky factorization of sparse matrices, we are not aware of any prior literature on the sparse Cholesky factorization of dense matrices.

For elliptic PDEs with arbitrary $L^{\infty}$ -coefficients, $\mathcal{H}$ -matrices can be used to compute $\epsilon$ -approximate Cholesky factors of both differential and integral operators in computational complexity $\mathcal{O}\left(N\log^{2}\left(N\right)\log^{2d+2}\left(\epsilon^{-1}\right)\right)$ [8, 39, 42, 7]. $\mathcal{H}^{2}$ -matrices can improve these complexities to $\mathcal{O}\left(N\log\left(N\right)\log^{2d+2}\left(\epsilon^{-1}\right)\right)$ [41, 14, 15]. The “fast gamblet transform” of [64, 65] can invert stiffness matrices of arbitrary elliptic operators in computational complexity $\mathcal{O}\left(\log^{2d+1}\left(\epsilon^{-1}\right)\right)$ . Our computational complexities of $\mathcal{O}\left(N\log^{2d}\left(N/\epsilon\right)\right)$ for the Cholesky factorization of differential operators and $\mathcal{O}\left(N\log^{2}(N)\log^{2d}\left(N/\epsilon\right)\right)$ for the Cholesky factorization of integral operators improve upon the state of the art while using a much simpler algorithm.

Our method relies upon a cleverly-constructed elimination ordering and sparsity pattern, which we use in the incomplete Cholesky factorization of the matrix $\Theta$. Simplified versions of these constructions are given in Section 2; Section 3 gives an overview, without detailed proof, of why the method yields the desired results. In particular, Section 2.4 shows how the method provides a sparse approximate principal component analysis (PCA), thereby serving task (5).

Section 4 presents detailed numerical experiments that illustrate the power of our method, and Section 5 gives the mathematical proofs of correctness and accuracy vs. complexity. Section 8 contains concluding remarks, and some technical results are deferred to an Appendix.

2 Overview of the algorithm and its setting

In this introductory section we give a brief overview of the setting in which our theoretical results apply (the class of kernels associated to elliptic operators) and highlight its main features. All detailed numerical experiments and analysis will be deferred to Sections 4 and 5 respectively.

2.1 The class of elliptic operators

In order to establish rigorous, a priori, complexity-vs.-accuracy estimates in Section 5 we will assume that $G$ is the Green’s function of an elliptic operator $\mathcal{L}$ of order $2s>d$ ($s,d\in\mathbb{N}$), defined on a bounded domain $\Omega\subset\mathbb{R}^{d}$ with Lipschitz boundary, and acting on $H^{s}_{0}(\Omega)$, the Sobolev space of (zero boundary value) functions having derivatives of order $s$ in $L^{2}(\Omega)$. More precisely, writing $H^{-s}(\Omega)$ for the dual space of $H^{s}_{0}(\Omega)$ with respect to the $L^{2}(\Omega)$ scalar product, our rigorous estimates will be stated for an arbitrary linear bijection

$$\mathcal{L}\colon H^{s}_{0}(\Omega)\to H^{-s}(\Omega) \tag{2.1}$$

that is symmetric (i.e. $\int_{\Omega}u\mathcal{L}v\,\mathrm{d}x=\int_{\Omega}v\mathcal{L}u\,\mathrm{d}x$), positive (i.e. $\int_{\Omega}u\mathcal{L}u\,\mathrm{d}x\geq 0$), and local in the sense that

$$\int_{\Omega}u\mathcal{L}v\,\mathrm{d}x=0\text{ for all }u,v\in H^{s}_{0}(\Omega)\text{ such that }\operatorname{supp}u\cap\operatorname{supp}v=\emptyset. \tag{2.2}$$

Let $\|\mathcal{L}\|\coloneqq\sup_{u\in H_{0}^{s}}\|\mathcal{L}u\|_{H^{-s}}/\|u\|_{H^{s}_{0}}$ and $\|\mathcal{L}^{-1}\|\coloneqq\sup_{f\in H^{-s}}\|\mathcal{L}^{-1}f\|_{H^{s}_{0}}/\|f\|_{H^{-s}}$ denote the operator norms of $\mathcal{L}$ and $\mathcal{L}^{-1}$. The complexity and accuracy estimates for our algorithm will depend on (and only on) $d$, $s$, $\Omega$, $\|\mathcal{L}\|$, $\|\mathcal{L}^{-1}\|$, and the parameter

$$\delta\coloneqq\frac{\min_{i\neq j\in I}\operatorname{dist}(x_{i},\{x_{j}\}\cup\partial\Omega)}{\max_{x\in\Omega}\operatorname{dist}(x,\{x_{i}\}_{i\in I}\cup\partial\Omega)} \tag{2.3}$$

which is a measure of the homogeneity of the distribution of the cloud of points $x_{i}$.

Since our algorithm only requires the locations of the points $x_{i}$ and is oblivious to the exact knowledge of $G$ , for our numerical experiments in Section 4 we will consider (2.1), general elliptic operators with or without boundary conditions (these include Matérn kernels with fractional values of $s$ ) and exponential kernels.

2.2 Zero fill-in incomplete Cholesky factorization (ICHOL(0))

A simple approach to decreasing the computational complexity of Cholesky factorization is the zero fill-in incomplete Cholesky factorization [62] (ICHOL(0)). When performing Gaussian elimination using ICHOL(0), we treat all entries of both the input matrix and the output factors outside a prescribed sparsity pattern $S\subset I\times I$ as zero and correspondingly ignore all operations in which they are involved. Figure 2.1 shows a comparison of ordinary Cholesky factorization and ICHOL(0). Our approach to kernel matrices consists of applying Algorithm 2 with an elimination ordering $\prec$ and a sparsity pattern $S$ that are chosen based on the locations of the $x_{i}$ ; 5.38 gives the details of this elimination ordering and sparsity pattern.
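The rule "treat everything outside $S$ as zero and skip the operations that involve it" can be sketched in a few lines. The code below is a generic dense-storage illustration of ICHOL(0), not the paper's Algorithm 2; the function names, the 8-by-8 toy matrix, and the choice of its non-zero pattern as $S$ are assumptions made for the example. A practical implementation would use sparse storage so that skipped operations cost nothing.

```python
import math


def ichol0(theta, pattern):
    """Zero fill-in incomplete Cholesky of `theta` restricted to `pattern`.

    `pattern` is a set of index pairs (i, j) with i >= j that contains the
    diagonal. Entries outside the pattern are treated as zero, and every
    update that would touch them is skipped.
    """
    n = len(theta)
    L = [[theta[i][j] if (i, j) in pattern else 0.0 for j in range(n)]
         for i in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(L[j][j])
        rows = [i for i in range(j + 1, n) if (i, j) in pattern]
        for i in rows:
            L[i][j] /= L[j][j]
        for i in rows:
            L[i][i] -= L[i][j] ** 2
            for k in rows:
                if k < i and (i, k) in pattern:
                    L[i][k] -= L[i][j] * L[k][j]
    return L


def gram(L):
    """Return L L^T as a dense list of lists."""
    n = len(L)
    return [[sum(L[i][k] * L[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Toy input (an assumption, not a kernel matrix): a diagonally dominant
# matrix with off-diagonal entries -1 at index distance 1 and 3.
n = 8
theta = [[5.0 if i == j else (-1.0 if abs(i - j) in (1, 3) else 0.0)
          for j in range(n)] for i in range(n)]

sparse = {(i, j) for i in range(n) for j in range(i + 1) if theta[i][j] != 0.0}
full = {(i, j) for i in range(n) for j in range(i + 1)}

approx = gram(ichol0(theta, sparse))
exact = gram(ichol0(theta, full))

err_full = max(abs(exact[i][j] - theta[i][j]) for i in range(n) for j in range(n))
err_on_pattern = max(abs(approx[i][j] - theta[i][j]) for (i, j) in sparse)
err_off_pattern = max(abs(approx[i][j] - theta[i][j]) for (i, j) in full - sparse)
```

With the full lower-triangular pattern the routine reduces to ordinary Cholesky factorization. With the restricted pattern, the product $LL^{\top}$ still matches the input on the pattern, and the error sits entirely in the discarded fill-in positions. Whether that error is small depends on the ordering and pattern, which is the subject of the paper.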

Write $\|\cdot\|_{\operatorname{Fro}}$ for the Frobenius matrix norm and $C$ for a constant depending only on $d$, $\Omega$, $s$, $\|\mathcal{L}\|$, $\|\mathcal{L}^{-1}\|$, and $\delta$. To simplify notation, the asymptotic bounds in this paper are stated in the case where the logarithmic factors are at least one. Our main result is the following:

Theorem 2.1.

*Let $\mathcal{L}$ and $\delta$ be defined as in (2.1) and (2.3). For $\rho\geq C\log(N/\epsilon)$, the sparse Cholesky factor $L^{\rho}$, obtained from Algorithm 2 with the elimination ordering $\prec_{\rho}$ and sparsity pattern $\tilde{S}_{\rho}\subset I\times I$ described in 5.38, satisfies*

$$\bigl\|\Theta-L^{\rho}L^{\rho,\top}\bigr\|_{\operatorname{Fro}}\leq\epsilon. \tag{2.4}$$

*The selection of the ordering and sparsity pattern, as well as Algorithm 2, can be performed in computational complexity $C\rho^{2d}N\log^{2}N$ in time and $C\rho^{d}N\log N$ in space. In particular, we can obtain an $\epsilon$-accurate approximation in Frobenius norm in complexity $CN\log^{2}(N)\log(N/\epsilon)^{2d}$ in time and $CN\log(N)\log(N/\epsilon)^{d}$ in space.*

Remark 2.2.

*For problems arising in Gaussian process regression, there will typically be no domain $\Omega$ on the boundary of which the process is conditioned to be zero; equivalently, $\Omega$ will be all of $\mathbb{R}^{d}$. This introduces an additional error, but we still observe good approximation of the covariances even of points close to the boundary (see Section 4.2 for a detailed discussion).*

We will now present a simplified version of the elimination ordering and sparsity pattern (compared to the one mentioned in Theorem 2.1). Although the proof of Theorem 2.1 does not cover the stability of ICHOL(0) under this simplified version (rather, it covers the one described in 5.38), extensive numerical experiments suggest that ICHOL(0) remains stable under this simplified version, and since it is also user-friendly we recommend this as the “go-to” version for a simple, practical implementation. [Footnote 1: Although more complex, the ordering used in Theorem 2.1 has more potential for optimization by exploiting parallelism and dense linear algebra operations.]

2.3 The elimination ordering and sparsity pattern

We use a maximum-minimum distance ordering (maximin ordering) [36] as the elimination ordering. This ordering is obtained by successively picking the point $x_{i}$ that is furthest away from $\partial\Omega$ and the points that were already picked. If $\partial\Omega=\emptyset$ , then we select an arbitrary $i\in I$ as first index to eliminate; otherwise, we choose the first index as

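$$i_{1}\coloneqq\operatorname{arg\,max}_{i\in I}\operatorname{dist}(x_{i},\partial\Omega). \tag{2.5}$$

Then, for the first $k$ indices of the ordering already chosen, we choose

$$i_{k+1}\coloneqq\operatorname{arg\,max}_{i\in I\setminus\{i_{1},\dots,i_{k}\}}\operatorname{dist}(x_{i},\{x_{i_{1}},\dots,x_{i_{k}}\}\cup\partial\Omega) \tag{2.6}$$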

until we have ordered all the $N$ points (see Fig. 2.2).
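The selection rule just described can be sketched with a brute-force loop. The version below is an illustrative $\mathcal{O}(N^{2})$ implementation, not the near-linear Algorithm 3 of the paper; the function names, the unit-square domain, and the tie-breaking rule are assumptions. It also records, for each point, its distance to the boundary and to the previously picked points at the moment it is selected, which is the length scale $l[i]$ used in the text that follows.

```python
import math
import random


def maximin_ordering(points, dist_to_boundary=None):
    """Naive O(N^2) maximin ordering.

    Returns (order, length), where order[k] is the (k+1)-th index picked and
    length[i] is the length scale l[i]. If no boundary distance is given,
    the boundary is treated as empty and index 0 is (arbitrarily) picked first.
    """
    n = len(points)
    if dist_to_boundary is None:
        dist = [math.inf] * n
    else:
        dist = [dist_to_boundary(p) for p in points]
    order, length = [], [0.0] * n
    remaining = set(range(n))
    while remaining:
        # Next index: the remaining point furthest from the boundary and from
        # all points picked so far (ties broken by the smaller index).
        i = max(remaining, key=lambda r: (dist[r], -r))
        order.append(i)
        length[i] = dist[i]
        remaining.remove(i)
        for r in remaining:
            dist[r] = min(dist[r], math.dist(points[i], points[r]))
    return order, length


def unit_square_boundary(p):
    """Distance from p to the boundary of [0, 1]^2 (hypothetical domain)."""
    return min(p[0], 1.0 - p[0], p[1], 1.0 - p[1])


rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(200)]
order, length = maximin_ordering(points, unit_square_boundary)
scales = [length[i] for i in order]  # length scales in elimination order
```

Because each new point is the one furthest from everything picked before, the recorded length scales are non-increasing along the ordering, so the ordering runs from coarse to fine.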

Let

[TABLE]

be the distance between $x_{i_{k}}$ and $\partial\Omega$ and the earlier points in the ordering. For $\rho>0$ , let $S_{\rho}\subset I\times I$ be the sparsity pattern defined by

[TABLE]

Here, $\rho$ parameterizes a trade-off between computational efficiency and accuracy. For a given $\rho$ , the sparsity pattern will have $C\rho^{d}N\log N$ entries and the Cholesky factorization will require $C\rho^{2d}N\log^{2}N$ floating-point operations. Figure 2.3 shows the sparsity pattern for $\rho=1$ . While a naïve implementation requires $\mathcal{O}(N^{2})$ distance evaluations, Theorem 4.1 shows that Algorithm 3 delivers this sparsity pattern at computational complexity $C\rho^{d}N\log^{2}N$ .
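The naïve construction mentioned above can be sketched as follows. The display defining $S_{\rho}$ is not reproduced in this text, so the rule used here is an assumption: an entry is kept when the distance between the two points is at most $\rho$ times the smaller of their two length scales. The function name `sparsity_pattern` and the convention of storing only the lower-triangular part, indexed by position in the ordering, are likewise illustrative choices.

```python
import math

def sparsity_pattern(points, order, scales, rho):
    """Naive O(N^2) construction of the lower-triangular part of S_rho.

    Assumed rule: position pair (k, m) with m <= k is kept when
    dist(x_{i_k}, x_{i_m}) <= rho * min(scales[k], scales[m]).
    """
    pattern = set()
    for k, ik in enumerate(order):
        for m in range(k + 1):
            d = math.dist(points[ik], points[order[m]])
            if d <= rho * min(scales[k], scales[m]):
                pattern.add((k, m))
    return pattern

# Points, ordering and length scales of a small 1D example (scales[0] is
# infinite because the boundary is empty and nothing was picked before).
pts = [(0.0,), (1.0,), (0.5,), (0.25,), (0.75,)]
order = [0, 1, 2, 3, 4]
scales = [math.inf, 1.0, 0.5, 0.25, 0.25]
S = sparsity_pattern(pts, order, scales, rho=1.0)
```

Coarse points (large length scale) interact with many others, while fine points interact only with their neighbours, so the columns of the pattern become sparser as the ordering proceeds.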

2.4 Sparse approximate PCA

The sparse Cholesky factorization described in Section 2 is also rank revealing in the sense that the low-rank approximation obtained by using only the first $k$ columns of the Cholesky factorization achieves an accuracy within a constant factor of optimal rank- $k$ approximation (measured in operator norm). This is illustrated by Fig. 2.4 and the following theorem:

Theorem 2.3.

In the setting of Theorem 2.1, let $L^{(k)}$ be the rank- $k$ matrix defined by the first $k$ columns of the (dense) Cholesky factor $L$ of $\Theta$ . Then

[TABLE]

*where $\|\Theta\|$ is the operator norm of $\Theta$ and $C>0$ depends only on $d$ , $\Omega$ , $s$ , $\|\mathcal{L}\|$ , $\|\mathcal{L}^{-1}\|$ , and $\delta$ . *
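To make the object in Theorem 2.3 concrete, the sketch below computes a dense Cholesky factor, keeps its first $k$ columns, and forms the residual $\Theta-L^{(k)}L^{(k),\top}$. The residual vanishes on the first $k$ rows and columns and equals the Schur complement on the remaining block. The helper names and the three-point exponential-kernel example are assumptions chosen for illustration.

```python
import math

def cholesky(theta):
    """Dense lower-triangular Cholesky factor of an SPD list-of-lists."""
    n = len(theta)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(theta[j][j] - sum(L[j][p] ** 2 for p in range(j)))
        for i in range(j + 1, n):
            s = sum(L[i][p] * L[j][p] for p in range(j))
            L[i][j] = (theta[i][j] - s) / L[j][j]
    return L

def rank_k_residual(theta, k):
    """Theta minus the product of the first k columns of its Cholesky factor."""
    L = cholesky(theta)
    n = len(theta)
    return [[theta[i][j] - sum(L[i][p] * L[j][p] for p in range(k))
             for j in range(n)] for i in range(n)]

# Exponential kernel exp(-|x - y|) at the points 0, 1, 0.5 (maximin order).
pts = [0.0, 1.0, 0.5]
theta = [[math.exp(-abs(a - b)) for b in pts] for a in pts]
resid = rank_k_residual(theta, k=2)
```

Here the only nonzero residual entry is the conditional variance of the midpoint given the two endpoints, which is already well below the prior variance of one.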

The rank- $k$ approximation estimate (2.9) is a numerical homogenization accuracy estimate similar to those obtained in [59, 67, 64, 65, 46]. Numerical homogenization basis functions can be identified by the last $k$ rows of the lower triangular Cholesky factor of $A\coloneqq\Theta^{-1}$ , obtained with the reverse elimination ordering described in Section 6.2.

3 Why it works — justification of the method

The method described in Section 2 combines two crude approximations. First, it discards all but $\mathcal{O}(\rho^{d}N\log N)$ entries of the dense $N\times N$ matrix $\Theta$ . Second, it skips all but $\mathcal{O}(\rho^{2d}N\log^{2}N)$ operations of the Cholesky factorization of $\Theta$ (which has complexity $\mathcal{O}(N^{3})$ ). The obvious question is: why is the resulting approximation of $\Theta$ accurate for $\rho\gtrsim\log N$ ?

3.1 Sparse Cholesky factors of dense matrices

The first part of the answer is that the Cholesky factors of $\Theta$ decay exponentially quickly away from the sparsity pattern $S_{\rho}$ when the maximin ordering is used as the elimination ordering. This decay is illustrated in Fig. 3.1 and by the following Theorem 3.1. Write $C$ for a constant depending only on $d$ , $\Omega$ , $s$ , $\|\mathcal{L}\|$ , $\|\mathcal{L}^{-1}\|$ , and $\delta$ .

Theorem 3.1.

In the setting of Theorem 2.1, let $L$ be the full Cholesky factor of $\Theta$ in the maximin ordering of Section 2. Then, for $\rho\geq C\log(N/\epsilon)$ , $S_{\rho}$ as defined in Section 2, and

[TABLE]

*the inequality $\left\|\Theta-L^{S_{\rho}}L^{S_{\rho},\top}\right\|_{\operatorname{Fro}}\leq\epsilon$ holds. *

Algorithm 2 computes the exact Cholesky factorization under the assumption that the entries of $L$ lying outside $S_{\rho}$ are zero. Theorem 3.1 shows that this assumption holds true up to an approximation error that decays exponentially in $\rho$ , which supports the claim of accuracy of Algorithm 2 for $\rho\gtrsim\log N$ . We will now explain the exponential decay of $L$ based on a probabilistic interpretation of Gaussian elimination.
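The idea of factorizing while treating all entries outside the pattern as zero can be sketched as follows. This is a generic right-looking zero fill-in incomplete Cholesky on a dense array, written for readability; it is not a transcription of Algorithm 2, and the function name `ichol0` and the representation of the pattern as a set of lower-triangular index pairs are assumptions. Nonpositive pivots are handled by zeroing the corresponding column, as described in Section 4.4.

```python
import math

def ichol0(theta, pattern):
    """Zero fill-in incomplete Cholesky restricted to `pattern`.

    theta: SPD matrix as a list of lists.
    pattern: set of (i, j) with j <= i; must contain every diagonal entry.
    Entries outside the pattern are never read or written.
    """
    n = len(theta)
    L = [[theta[i][j] if j <= i and (i, j) in pattern else 0.0
          for j in range(n)] for i in range(n)]
    for j in range(n):
        if L[j][j] <= 0.0:
            # Nonpositive pivot: drop the column (yields a low-rank factor).
            for i in range(j, n):
                L[i][j] = 0.0
            continue
        L[j][j] = math.sqrt(L[j][j])
        for i in range(j + 1, n):
            if (i, j) in pattern:
                L[i][j] /= L[j][j]
        # Update the trailing entries, but only inside the pattern.
        for k in range(j + 1, n):
            if (k, j) not in pattern:
                continue
            for i in range(k, n):
                if (i, j) in pattern and (i, k) in pattern:
                    L[i][k] -= L[i][j] * L[k][j]
    return L

theta = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
full = {(i, j) for i in range(3) for j in range(i + 1)}
L_full = ichol0(theta, full)             # exact Cholesky factor
L_sparse = ichol0(theta, full - {(2, 0)})  # entry (2, 0) is dropped
```

With the full lower-triangular pattern the routine reduces to the ordinary Cholesky factorization; removing entries from the pattern changes only the entries that depended on them.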

3.2 Gaussian elimination, conditioning of Gaussian random variables, and the screening effect

The dense (block-)Cholesky factorization of a matrix $\Theta$ can be seen as the recursive application of the matrix identity

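$$\begin{pmatrix}\Theta_{1,1}&\Theta_{1,2}\\ \Theta_{2,1}&\Theta_{2,2}\end{pmatrix}=\begin{pmatrix}\textup{Id}&0\\ \Theta_{2,1}\left(\Theta_{1,1}\right)^{-1}&\textup{Id}\end{pmatrix}\begin{pmatrix}\Theta_{1,1}&0\\ 0&\Theta_{2,2}-\Theta_{2,1}\left(\Theta_{1,1}\right)^{-1}\Theta_{1,2}\end{pmatrix}\begin{pmatrix}\textup{Id}&\left(\Theta_{1,1}\right)^{-1}\Theta_{1,2}\\ 0&\textup{Id}\end{pmatrix},$$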

where, at each step of the outermost loop, the above identity is applied to the Schur complement $\Theta_{2,2}-\Theta_{2,1}\left(\Theta_{1,1}\right)^{-1}\Theta_{1,2}$ obtained at the previous step. If the Schur complements appearing during the factorization are sparse, then the final Cholesky factorization will also be sparse.

For $X=(X_{1},X_{2})\sim\mathcal{N}(0,\Theta)$ , the well-known identities

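$$\mathbb{E}[X_{2}\mid X_{1}=a]=\Theta_{2,1}\left(\Theta_{1,1}\right)^{-1}a\quad\text{and}\quad\operatorname{Cov}[X_{2}\mid X_{1}]=\Theta_{2,2}-\Theta_{2,1}\left(\Theta_{1,1}\right)^{-1}\Theta_{1,2}$$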

imply that the sparsity of Cholesky factors of $\Theta$ is equivalent to conditional independence of Gaussian vectors with covariance matrix $\Theta$ . In the spatial statistics literature, it is well known that many smooth Gaussian processes are subject to the screening effect [81]. This effect, illustrated in Fig. 3.2, means that the value of the process at a given site, conditioned on the values at nearby sites, is only weakly dependent on the values at distant sites.
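A small worked example may help to see the screening effect through the Schur complement. For the exponential kernel $\exp(-|x-y|)$ in one dimension (the covariance of a stationary Ornstein-Uhlenbeck process, which is Markov), conditioning on a point lying between two others makes them exactly uncorrelated. The helper `conditional_cov` is a hypothetical name and handles only conditioning on a single variable.

```python
import math

def conditional_cov(theta, a, b, m):
    """Cov[X_a, X_b | X_m] for a centred Gaussian vector with covariance theta.

    This is the Schur complement obtained by eliminating the single index m.
    """
    return theta[a][b] - theta[a][m] * theta[m][b] / theta[m][m]

# Three sites on a line; site 1 lies between sites 0 and 2.
pts = [0.0, 0.5, 1.0]
theta = [[math.exp(-abs(x - y)) for y in pts] for x in pts]

prior = theta[0][2]                          # correlation before conditioning
screened = conditional_cov(theta, 0, 2, 1)   # after conditioning on the midpoint
```

For smoother kernels and in higher dimensions the screening is only approximate, which is why the factors are near-sparse rather than exactly sparse.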

Consider now the $k$ th step of Cholesky factorization in the ordering described in Section 2. Any pair $x_{i},x_{j}$ with $\operatorname{dist}\left(x_{i},x_{j}\right)\gtrapprox l[k]$ will have points between them that have already been eliminated, as illustrated in Fig. 3.3. Thus, the screening effect suggests that their correlation will be weak, which supports choosing $\rho l[k]$ as a truncation radius.

3.3 Cholesky factorization and operator-adapted wavelets

Cholesky factorization in the maximin ordering is intimately related to computing operator-adapted wavelets. In Section 5 we will use this connection to prove the accuracy of our approximation.

Operator-adapted wavelets.

[64] and [65] introduced a novel class of operator-adapted wavelets called gamblets (see also [66]). For an operator $\mathcal{L}$ defined as in (2.1), gamblets can be identified as conditional expectations of the Gaussian process $\xi\sim\mathcal{N}\left(0,\mathcal{L}^{-1}\right)$ . To construct the gamblets up to level $q\in\mathbb{N}$ we start with a hierarchy of measurement functions $\{\phi^{(k)}_{i}\}_{1\leq k\leq q,i\in I^{(k)}}\subset H^{-s}(\Omega)$ ; heuristically, $k$ labels a scale, and $i$ a location at that scale. These measurement functions are linearly nested in the sense that, for $k<l$ ,

[TABLE]

for some rank- $|I^{(k)}|$ matrices $\pi^{(k,l)}\in\mathbb{R}^{I^{(k)}\times I^{(l)}}$ . Writing $[\,\cdot\,,\,\cdot\,]$ for the duality product between $H^{-s}(\Omega)$ and $H_{0}^{s}(\Omega)$ , the conditional expectations

[TABLE]

act as $\mathcal{L}$ -adapted pre-wavelets. These pre-wavelets can be identified as optimal recovery splines in the sense of [63] through the representation formula

[TABLE]

where $\Theta^{(k),-1}_{i,j}$ is the $(i,j)$ th entry of the inverse $\Theta^{(k),-1}$ of the matrix $\Theta^{(k)}\in\mathbb{R}^{I^{(k)}\times I^{(k)}}$ with entries $\Theta^{(k)}_{i,j}\coloneqq\int_{\Omega}\phi_{i}^{(k)}\mathcal{L}^{-1}\phi_{j}^{(k)}\,\mathrm{d}x$ . The linear nesting of the $\phi_{i}^{(k)}$ across scales implies that the linear spaces $\mathfrak{V}^{(k)}\coloneqq\operatorname{span}\{\psi_{i}^{(k)}\mid i\in I^{(k)}\}$ are nested (i.e. $\mathfrak{V}^{(k-1)}\subset\mathfrak{V}^{(k)}$ ). The multi-resolution decomposition $\mathfrak{V}^{(q)}=\mathfrak{V}^{(1)}\oplus\mathfrak{W}^{(2)}\oplus\cdots\oplus\mathfrak{W}^{(q)}$ is then obtained by defining $\mathfrak{W}^{(k)}$ as the orthogonal complement of $\mathfrak{V}^{(k-1)}$ in $\mathfrak{V}^{(k)}$ with respect to the energy scalar product $\langle u,v\rangle\coloneqq\int_{\Omega}u\mathcal{L}v\,\mathrm{d}x$ . Basis functions for $\mathfrak{W}^{(k)}$ are identified (for $2\leq k\leq q$ ) by

[TABLE]

or, equivalently, by

[TABLE]

with $\phi^{(k),W}_{i}\coloneqq\sum_{j\in I^{(k)}}W_{i,j}^{(k)}\phi^{(k)}_{j}$ , where $J^{(k)}\cong\bigl{(}I^{(k)}\setminus I^{(k-1)}\bigr{)}$ and $W^{(k)}$ is a $J^{(k)}\times I^{(k)}$ matrix such that $\mathop{\textup{Im}}W^{(k),\top}=\mathop{\textup{Ker}}\pi^{(k-1,k)}$ (writing $W^{(k),\top}$ for the transpose of $W^{(k)}$ ). See Fig. 3.4 for an illustration.

For simplicity we write $J^{(1)}\coloneqq I^{(1)}$ and $\chi_{i}^{(1)}\coloneqq\psi_{i}^{(1)}$ . Write $B^{(k)}$ for the $J^{(k)}\times J^{(k)}$ stiffness matrices $B^{(k)}\coloneqq\bigl{\langle}\chi_{i}^{(k)},\chi_{j}^{(k)}\bigr{\rangle}$ . The gamblets $\chi_{i}^{(k)}$ are $\mathcal{L}$ -adapted wavelets in the sense that, under sufficient conditions on the $\phi_{i}^{(k)}$ , they satisfy the following three properties:

• Scale orthogonality in the energy scalar product, i.e.

[TABLE]

This leads to the block-diagonalization of the operator (with the $B^{(k)}$ as diagonal blocks).

• Uniform Riesz stability in the energy norm: the condition numbers of the blocks $B^{(k)}$ are uniformly bounded in $k$ .

• Exponential decay, which leads to sparse blocks $B^{(k)}$ : the gamblets $\chi^{(k)}_{i}$ exhibit exponential decay on the scale associated with $k$ .

Although the scale-orthogonality property (3.10) is always satisfied, the two others (exponential decay and uniform Riesz stability) depend on the properties of $\mathcal{L}$ and the $\phi_{i}^{(k)}$ . In the setting of the localization of numerical homogenization basis functions (where $\mathcal{L}$ is an elliptic PDE and the measurements $\phi_{i}^{(k)}$ are local and possibly not explicitly introduced), rigorous exponential decay estimates were pioneered in [59] and generalized in [52, 64, 46, 65]; see Section 5.3.2 for detailed comparisons. For $\phi_{i}^{(k)}$ spanning the space of local polynomials of order up to $s-1$ , bounded condition numbers are shown by [64, 65]. The homogenization results obtained in the special case $q=2$ [59, 67, 46] are closely related to the lower bound on the spectrum of $B^{(2)}$ (see Section 5.3.3).

Relation to Cholesky factorization.

To explain the connection between gamblets and Cholesky factorization, let $J\coloneqq J^{(1)}\cup\cdots\cup J^{(q)}$ , let $W^{(1)}$ be the $I^{(1)}\times I^{(1)}$ identity matrix, let $\pi^{(k,k)}$ be the $I^{(k)}\times I^{(k)}$ identity matrix, and let $\bar{\Theta}$ be the $J\times J$ symmetric matrix with $J^{(k)}\times J^{(l)}$ block defined for $k\leq l$ by

[TABLE]

or equivalently by

[TABLE]

Then, the block-Cholesky factorization of $\bar{\Theta}$ satisfies the identity

[TABLE]

where $D$ is a block-diagonal matrix with the $J^{(k)}\times J^{(k)}$ diagonal block equal to $B^{(k),-1}$ and

[TABLE]

Therefore, computing gamblets associated to the operator $\mathcal{L}$ and measurement functions $\phi_{i}$ is equivalent to computing a block-Cholesky factorization of $\Theta$ in the multiresolution basis given by the $\phi_{i}^{(k),W}$ .

The Cholesky decomposition of $\Theta$ (1.1) belongs to this setting. Indeed, although the maximin ordering of Section 2 has no explicit multiscale structure, this structure can be introduced, as described in Fig. 3.5, by decomposing $x_{1},\ldots,x_{N}$ into a nested hierarchy $\{x_{i}\}_{i\in I^{(1)}}\subset\{x_{i}\}_{i\in I^{(2)}}\subset\cdots\subset\{x_{i}\}_{i\in I^{(q)}}$ , and choosing $\phi_{i}^{(k)}=\boldsymbol{\delta}(\,\cdot\,-x_{i})$ for $i\in I^{(k)}$ and $k\in\{1,\ldots,q\}$ , where $\boldsymbol{\delta}$ denotes the unit (unscaled) Dirac delta function. Under this choice, $\pi^{(k,k+1)}_{i,j}=1$ for $j\in I^{(k)}$ and $\pi^{(k,k+1)}_{i,j}=0$ for $j\not\in I^{(k)}$ . Letting $J^{(k)}$ label the indices in $I^{(k)}\setminus I^{(k-1)}$ and choosing $W^{(k)}_{i,j}=1$ for $j\in I^{(k)}\setminus I^{(k-1)}$ and $W^{(k)}_{i,j}=0$ for $j\in I^{(k-1)}$ implies $\Theta=\bar{\Theta}$ . The exponential decay of $\bar{L}$ and $D^{-1}$ follows from known results [65] on exponential decay of the $\chi_{j}^{(k)}$ . The uniform bound on the condition number of the $B^{(k)}$ is proved in Section 5.3.3. The exponential decay and uniform bound on the condition numbers of the blocks $B^{(k)}$ imply the exponential decay of the Cholesky factors $\hat{L}$ of $D$ and hence of $L=\bar{L}\hat{L}$ . The approximation error estimate (2.4) is then obtained by matching the sparsity set $S$ with the near-sparse structure of $L$ .

4 Implementation and numerical results

4.1 Selection of the sparsity pattern and ordering

This section introduces an $\mathcal{O}(\rho^{d}N\log^{2}N)$ -complexity algorithm (Algorithm 3) for selecting the sparsity pattern and ordering used as inputs in Algorithm 2. This algorithm does not explicitly query the positions of the $\{x_{i}\}_{i\in I}$ and only uses pairwise distances: it processes the points one by one, using a mutable binary heap to keep track of the point to be processed at each step. With this approach, our proposed algorithm is oblivious to the dimension $d$ of the ambient space and, in particular, can automatically exploit low-dimensional structure in the point cloud $\{x_{i}\}_{i\in I}$ . In order to avoid computing all $\mathcal{O}(N^{2})$ pairwise distances, as illustrated in Fig. 4.1, Algorithm 3 uses the sparsity pattern obtained on the coarser scales to restrict computation at the finer scales to local neighborhoods.

Theorem 4.1.

*The output of Algorithm 3 is the ordering and sparsity pattern described in Section 2. Furthermore, in the setting of Theorem 3.1, if the oracles $\operatorname{\mathtt{dist}}(\,\cdot\,,\,\cdot\,)$ and $\operatorname{\mathtt{dist}}_{\partial\Omega}(\,\cdot\,)$ can be queried in complexity $\mathcal{O}(1)$ , then the complexity of Algorithm 3 is bounded by $C\rho^{d}N\log^{2}N$ , where $C$ is a constant depending only on $d$ , $\Omega$ and $\delta$ . *

Theorem 4.1 is proved in Appendix A. As discussed therein, in the case $\Omega=\mathbb{R}^{d}$ , Algorithm 3 has the advantage that its computational complexity depends only on the intrinsic dimension of the dataset, which can be much smaller than $d$ .

4.2 The case of the whole space ( $\Omega=\mathbb{R}^{d}$ )

Many applications in Gaussian process statistics and machine learning are in the $\Omega=\mathbb{R}^{d}$ setting. In that setting, the Matérn family of kernels (4.5) is a popular choice that is equivalent to using the whole-space Green’s function of an elliptic PDE as covariance function [83, 84]. Let $\bar{\Omega}$ be a bounded domain containing the $\{x_{i}\}_{i\in I}$ . The case $\Omega=\mathbb{R}^{d}$ is not covered in Theorem 3.1 because in this case the screening effect is weakened near the boundary of $\bar{\Omega}$ by the absence of measurement points outside of $\bar{\Omega}$ . Therefore, distant points close to the boundary of $\bar{\Omega}$ will have stronger conditional correlations than similarly distant points in the interior of $\bar{\Omega}$ (see Fig. 4.2). As observed by [70] and [20], Markov random field (MRF) approaches that use a discretization of the underlying PDE face similar challenges at the boundary. While the weakening of the exponential decay at the boundary worsens the accuracy of our method, the numerical results of Section 4.4 (which are all obtained without imposing boundary conditions) suggest that its overall impact is limited. In particular, as shown in Fig. 4.2, it does not cause significant artifacts in the quality of the approximation near the boundary. This differs from the significant boundary artifacts of MRF methods, which have to be mitigated by a careful calibration of boundary conditions [70, 20]. Although the numerical results presented in this section are mostly obtained with $x_{i}\sim\mathop{\textup{UNIF}}([0,1]^{d})$ , in many practical applications, the density of measurement points will slowly (rather than abruptly) decrease towards zero near the boundary of the sampled domain, which drastically decreases the boundary errors shown above. Accuracy can also be enhanced by adding artificial points $\{x_{i}\}_{i\in\tilde{I}}$ at the boundary. By applying the Cholesky factorization to $\{x_{i}\}_{i\in I\cup\tilde{I}}$ , and then restricting the resulting matrix to $I\times I$ , we can obtain a very accurate approximate matrix-vector multiplication. Although not in the form of a Cholesky factorization, this approximation can be efficiently inverted using iterative methods such as conjugate gradient [78] preconditioned with the Cholesky factorization obtained from the original set of points.

4.3 Nuggets and measurement errors

In the Gaussian process regression setting it is common to model measurement error by adding a nugget $\sigma^{2}\textup{Id}$ to the covariance matrix:

$$\tilde{\Theta}\coloneqq\Theta+\sigma^{2}\textup{Id}.$$

The addition of a diagonal matrix diminishes the screening effect and thus the accuracy of Algorithm 2. This problem can be avoided by rewriting the modified covariance matrix $\tilde{\Theta}$ as

$$\tilde{\Theta}=\Theta+\sigma^{2}\textup{Id}=\Theta\bigl(\sigma^{2}A+\textup{Id}\bigr),$$

where $A\coloneqq\Theta^{-1}$ . As noted in Section 6.2, $A$ can be interpreted as a discretized partial differential operator and has near-sparse Cholesky factors in the reverse elimination ordering. Adding a multiple of the identity to $A$ amounts to adding a zeroth-order term to the underlying PDE and thus preserves the sparsity of the Cholesky factors. This leads to the sparse decomposition

$$\tilde{\Theta}=LL^{\top}P^{\updownarrow}\tilde{L}\tilde{L}^{\top}P^{\updownarrow},$$

where $P^{\updownarrow}$ is the order-reversing permutation and $\tilde{L}$ is the Cholesky factor of $P^{\updownarrow}(\sigma^{2}A+\textup{Id})P^{\updownarrow}$ . Fig. 4.3 shows that the exponential decay of these Cholesky factors is robust with respect to $\sigma$ .
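The rewriting of the nugget-perturbed covariance as a product with $\sigma^{2}A+\textup{Id}$ , where $A=\Theta^{-1}$ , can be checked numerically on a tiny example. The $2\times 2$ matrix, the value of $\sigma^{2}$ , and the helper `matmul` below are arbitrary illustrative choices.

```python
def matmul(A, B):
    """Product of two small dense matrices stored as lists of lists."""
    return [[sum(A[i][p] * B[p][j] for p in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

theta = [[2.0, 1.0], [1.0, 2.0]]
A = [[2 / 3, -1 / 3], [-1 / 3, 2 / 3]]   # inverse of theta (determinant 3)
sigma2 = 0.5

# Left-hand side: theta + sigma^2 * Id.
lhs = [[theta[i][j] + (sigma2 if i == j else 0.0) for j in range(2)]
       for i in range(2)]
# Right-hand side: theta * (sigma^2 * A + Id).
inner = [[sigma2 * A[i][j] + (1.0 if i == j else 0.0) for j in range(2)]
         for i in range(2)]
rhs = matmul(theta, inner)

max_gap = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
```

The point of the factorized form is that both factors have near-sparse Cholesky factors (in opposite orderings), whereas the sum itself screens less well.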

This idea can be turned into an algorithm by first approximately computing $L$ using Algorithm 2; then using $L$ to approximate $A$ , which can be done in near-linear complexity by exploiting sparsity; and then approximating $\tilde{L}$ , again using Algorithm 2. While this algorithm is asymptotically efficient, our preliminary results suggest that the additional inversion step significantly increases the constants featured in the approximation accuracy. Therefore, when low accuracy is sufficient, we instead recommend simply applying Algorithm 2 to the matrix $\Theta$ . This preserves the original approximation accuracy and the matrix inversion can then efficiently be performed using iterative methods such as conjugate gradient (CG) [78] by taking advantage of the fast matrix-vector multiplication obtained from the sparse factorization. For small values of $\sigma$ (which would lead to slow convergence of CG) we can directly apply Algorithm 2 to $\tilde{\Theta}$ . For large values of $\sigma$ , $\tilde{\Theta}$ will be well conditioned and the convergence of CG is fast. For intermediate values of $\sigma$ , we can apply Algorithm 2 to $\tilde{\Theta}$ and use the resulting factors as a preconditioner for CG. Sampling from $\mathcal{N}(0,\tilde{\Theta})$ can be done by adding independent samples from $\mathcal{N}(0,\Theta)$ and $\mathcal{N}(0,\sigma^{2}\textup{Id})$ . Approximations of the log-determinant could be obtained either by applying Algorithm 2 directly to $\tilde{\Theta}$ (with some loss of accuracy) or by combining iterative methods [74, 27] with the fast matrix-vector multiplication obtained from the sparse factorization of $\Theta$ . Just like CG, these methods benefit from the fact that we can work with well-conditioned matrices for small and large $\sigma$ . A detailed investigation of the efficiency of the above-mentioned strategies for computing with nuggets is beyond the scope of this work.

4.4 Numerical results

We will now present numerical evidence in support of our results. All experiments reported below were run on a workstation using an Intel® Core™ i7-6400 CPU at 4.00 GHz and 64 GB of RAM. The time-critical parts of the code are run on a single thread, leaving the exploration of parallelism to future work. The Julia scripts implementing the experiments can be found online at https://github.com/f-t-s/nearLinKernel. In the following, $\mathop{\textup{nnz}}(L)$ denotes the number of nonzero entries of the lower-triangular factor $L$ ; $t_{\texttt{SortSparse}}$ denotes the time taken by Algorithm 3 to compute the maximin ordering $\prec$ and sparsity pattern $S_{\rho}$ ; $t_{\texttt{Entries}}$ denotes the time taken to compute the entries of $\Theta$ on $S_{\rho}$ ; and $t_{\texttt{ICHOL(0)}}$ denotes the time taken to perform Algorithm 2 (ICHOL(0)), all measured in seconds. The relative error in Frobenius norm is approximated by

[TABLE]

where the $m=500000$ pairs of indices $i_{k},j_{k}\sim\mathop{\textup{UNIF}}(I)$ are independently and uniformly distributed in $I$ . This experiment is repeated 50 times and the resulting mean and standard deviation (in brackets) are reported. For measurements in $[0,1]^{d}$ , in order to isolate the boundary effects, we also consider the quantity $\bar{E}$ , which is defined like $E$ but uses only those samples $i_{k},j_{k}$ for which $x_{i_{k}},x_{j_{k}}\in[0.05,0.95]^{d}$ . Most of our experiments will use the Matérn class of covariance functions [61], defined by

[TABLE]

where $K_{\nu}$ is the modified Bessel function of the second kind [1, Section 9.6] and $\nu$ , $l$ are parameters describing the degree of smoothness and the length-scale of interactions, respectively [69]. In Fig. 4.5, the Matérn kernel is plotted for different degrees of smoothness. The Matérn covariance function is used in many branches of statistics and machine learning to model random fields with finite order of smoothness [37, 69].
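For half-integer smoothness the Bessel function reduces to elementary functions, which gives a dependency-free way to evaluate the kernel. The sketch below assumes the common parametrization in which the kernel depends on $\sqrt{2\nu}\,|x-y|/l$ and is normalized to one at zero distance; the function name `matern` is hypothetical, and other values of $\nu$ (such as $\nu=1.0$ ) would require a Bessel routine, for example `scipy.special.kv`.

```python
import math

def matern(r, nu, l):
    """Matern covariance at distance r for nu in {0.5, 1.5, 2.5}.

    Assumes the parametrization with scaled distance s = sqrt(2 nu) r / l
    and unit variance at r = 0.
    """
    s = math.sqrt(2.0 * nu) * r / l
    if nu == 0.5:
        return math.exp(-s)                      # exponential kernel
    if nu == 1.5:
        return (1.0 + s) * math.exp(-s)
    if nu == 2.5:
        return (1.0 + s + s * s / 3.0) * math.exp(-s)
    raise ValueError("only nu in {0.5, 1.5, 2.5} is supported in this sketch")

# Kernel values at one length scale for increasing smoothness.
values = [matern(0.2, nu, 0.2) for nu in (0.5, 1.5, 2.5)]
```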

As observed by [83, 84], the Matérn kernel is the Green’s function of an elliptic PDE of possibly fractional order $2(\nu+d/2)$ in the whole space. Therefore, for $2(\nu+d/2)\in\mathbb{N}$ , the Matérn kernel falls into the framework of our theoretical results, up to the behavior at the boundary discussed in Section 4.2. Since the locations of our points will be chosen at random, some of the points will be very close to each other, resulting in an almost singular matrix $\Theta$ that can become nonpositive under the approximation introduced by ICHOL(0). If Algorithm 2 encounters a nonpositive pivot $A_{ii}$ , then we set the corresponding column of $L$ to zero, resulting in a low-rank approximation of the original covariance matrix. We report the rank of $L$ in our experiments and note that we obtain a full-rank approximation for moderate values of $\rho$ .

We begin by investigating the scaling of our algorithm as $N$ increases. To this end, we consider $\nu=0.5$ (the exponential kernel), $l=0.2$ and choose $N$ randomly distributed points in $[0,1]^{d}$ for $d\in\{2,3\}$ . The results are summarized in Table 4.1 and Table 4.4, and in Fig. 4.4, and confirm the near-linear computational complexity of our algorithm.
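The error columns of these tables rely on the sampled error estimate introduced earlier in this section. Since the display defining $E$ is not reproduced in this text, the sketch below assumes a standard form: the square root of the ratio between the summed squared entrywise errors and the summed squared exact entries over $m$ uniformly sampled index pairs. The function and argument names are hypothetical.

```python
import math
import random

def sampled_relative_error(entry_exact, entry_approx, n, m, seed=0):
    """Monte Carlo estimate of a relative Frobenius-norm error.

    entry_exact(i, j) and entry_approx(i, j) return single matrix entries,
    so neither matrix has to be formed densely. Index pairs are drawn
    independently and uniformly from {0, ..., n-1}^2.
    """
    rng = random.Random(seed)
    num = 0.0
    den = 0.0
    for _ in range(m):
        i, j = rng.randrange(n), rng.randrange(n)
        exact = entry_exact(i, j)
        num += (exact - entry_approx(i, j)) ** 2
        den += exact ** 2
    return math.sqrt(num / den)

# Toy check: every entry is off by 10 percent, so the estimate is 0.1.
err = sampled_relative_error(lambda i, j: 1.0, lambda i, j: 0.9, n=100, m=1000)
```

Sampling entries avoids the $\mathcal{O}(N^{2})$ cost of forming the dense error matrix, at the price of a small statistical fluctuation, which is why the experiments report a mean and a standard deviation.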

Next, we investigate the trade-off between computational efficiency and accuracy of the approximation. To this end, we choose $d=2$ , $\nu=1.0$ and $d=3$ , $\nu=0.5$ , corresponding to fourth-order equations in two and three dimensions. We choose $N=10^{6}$ data points $x_{i}\sim\mathop{\textup{UNIF}}([0,1]^{d})$ and apply our method with different values of $\rho$ . The results of these experiments are tabulated in Tables 4.3 and 4.4 and the impact of $\rho$ on the approximation error is visualized in Fig. 4.4.

While our theoretical results only cover integer-order elliptic PDEs, we observe no practical difference between the numerical results for Matérn kernels corresponding to integer- and fractional-order smoothness. As an illustration, for the case $d=3$ , we provide approximation results for $\nu$ ranging around $\nu=0.5$ (corresponding to a fourth-order elliptic PDE) and $\nu=1.5$ (corresponding to a sixth-order elliptic PDE). As seen in Table 4.5, the results vary continuously as $\nu$ changes, with no qualitative differences between the behavior for integer- and fractional-order PDEs. To further illustrate the robustness of our method, we consider the Cauchy class of covariance functions introduced in [34]

[TABLE]

As far as we are aware, the Cauchy class has not been associated to an elliptic PDE. Furthermore, it does not have exponential decay in the limit $|x-y|\to\infty$ , which allows us to emphasize the point that the exponential decay of the error is not due to the exponential decay of the covariance function itself. Table 4.6 gives the results for $(l,\alpha,\beta)=(0.4,0.5,0.025)$ and $(l,\alpha,\beta)=(0.2,1.0,0.2)$ .

In Gaussian process regression, the ambient dimension $d$ is typically too large to ensure computational efficiency of our algorithm. However, since our algorithm only requires access to pairwise distances between points, it can take advantage of low intrinsic dimension of the dataset. We might be concerned that in this case, interaction through the higher dimensional ambient space will disable the screening effect. As a first demonstration that this is not the case, we will draw $N=10^{6}$ points in $[0,1]^{2}$ and equip them with a third component according to $x_{i}^{(3)}\coloneqq-\delta_{z}\sin(6x_{i}^{(1)})\cos(2(1-x_{i}^{(2)}))+\xi_{i}10^{-3}$ , for $\xi_{i}$ i.i.d. standard Gaussian. Figure 4.6 shows the resulting point sets for different values of $\delta_{z}$ , and Table 4.7 shows that the approximation is robust to increasing values of $\delta_{z}$ .
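The construction of this warped point set can be sketched as follows. The function name `warped_points` and the use of Python's `random` module with a fixed seed are assumptions made for reproducibility of the illustration; the formula for the third component follows the text.

```python
import math
import random

def warped_points(n, delta_z, seed=0):
    """Draw n points uniformly in [0, 1]^2 and lift them to a surface in 3D.

    The third coordinate is -delta_z * sin(6 x1) * cos(2 (1 - x2)) plus
    Gaussian noise of standard deviation 1e-3.
    """
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        x3 = (-delta_z * math.sin(6.0 * x1) * math.cos(2.0 * (1.0 - x2))
              + 1e-3 * rng.gauss(0.0, 1.0))
        pts.append((x1, x2, x3))
    return pts

flat = warped_points(200, delta_z=0.0)     # nearly planar point set
curved = warped_points(200, delta_z=0.3)   # same surface, visibly warped
```

Because the method only consumes pairwise distances, such a point set can be passed to the ordering and sparsity-pattern routines unchanged, with distances measured in the three-dimensional ambient space.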

An appealing feature of our method is that it can be formulated in terms of the pairwise distances alone. This means that the algorithm will automatically exploit any low-dimensional structure in the dataset. In order to illustrate this feature, we artificially construct a dataset with low-dimensional structure by randomly rotating four low-dimensional structures into a $20$ -dimensional ambient space (see Fig. 4.7). Table 4.8 shows that the resulting approximation is even better than the one obtained in dimension $3$ , illustrating that our algorithm did indeed exploit the low intrinsic dimension of the dataset.

5 Analysis of the algorithm

5.1 General Setting

We will start the analysis in a more general setting than that of Section 2.1. Let $\mathcal{B}$ be a separable Banach space with dual space $\mathcal{B}^{\ast}$ , and write $[\,\cdot\,,\,\cdot\,]$ for the duality product between $\mathcal{B}^{\ast}$ and $\mathcal{B}$ . Let $\mathcal{L}\colon\mathcal{B}\to\mathcal{B}^{\ast}$ be a linear bijection and let $G\coloneqq\mathcal{L}^{-1}$ . Assume $\mathcal{L}$ to be symmetric and positive (i.e. $[\mathcal{L}u,v]=[\mathcal{L}v,u]$ and $[\mathcal{L}u,u]\geq 0$ for $u,v\in\mathcal{B}$ ). Let $\|\,\cdot\,\|$ be the quadratic (energy) norm defined by $\|u\|^{2}\coloneqq[\mathcal{L}u,u]$ for $u\in\mathcal{B}$ and let $\|\,\cdot\,\|_{\ast}$ be its dual norm defined by

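$$\|\phi\|_{\ast}\coloneqq\sup_{0\neq u\in\mathcal{B}}\frac{[\phi,u]}{\|u\|}\quad\text{for }\phi\in\mathcal{B}^{\ast}.$$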

Let $\{\phi_{i}\}_{i\in I}$ be linearly independent elements of $\mathcal{B}^{\ast}$ (known as measurement functions) and let $\Theta\in\mathbb{R}^{I\times I}$ be the symmetric positive-definite matrix defined by

$$\Theta_{i,j}\coloneqq[\phi_{i},G\phi_{j}]\quad\text{for }i,j\in I.$$

We assume that we are given $q\in\mathbb{N}$ and a partition $I=\bigcup_{1\leq k\leq q}J^{(k)}$ of $I$ . We represent $I\times I$ matrices as $q\times q$ block matrices according to this partition. Given an $I\times I$ matrix $M$ we write $M_{k,l}$ for the $(k,l)$ th block of $M$ and $M_{k_{1}:k_{2},l_{1}:l_{2}}$ for the sub-matrix of $M$ defined by blocks ranging from $k_{1}$ to $k_{2}$ and $l_{1}$ to $l_{2}$ . Unless specified otherwise we write $L$ for the lower-triangular Cholesky factor of $\Theta$ and define

[TABLE]

We interpret the $\{J^{(k)}\}_{1\leq k\leq q}$ as labelling a hierarchy of scales with $J^{(1)}$ representing the coarsest and $J^{(q)}$ the finest. We write $I^{(k)}$ for $\bigcup_{1\leq k^{\prime}\leq k}J^{(k^{\prime})}$ .

Throughout this section we assume that the ordering of the set $I$ of indices is compatible with the partition $I=\bigcup_{k=1}^{q}J^{(k)}$ , i.e. $k<l$ , $i\in J^{(k)}$ and $j\in J^{(l)}$ together imply $i\prec j$ . We will write $L$ or $\operatorname{chol}(\Theta)$ for the Cholesky factor of $\Theta$ in that ordering.

5.2 Main examples

We will prove the main results of this section in the setting where $\mathcal{L}$ is defined as in Section 2.1 and the $\phi_{i}$ are chosen as in Examples 5.1 and 5.2. We will assume (without loss of generality after rescaling) that $\operatorname{diam}(\Omega)\leq 1$ . As described in Fig. 3.5, successive points of the maximin ordering can be gathered into levels, so that, after appropriate rescaling of the measurements, the Cholesky factorization in the maximin ordering falls in the setting of Example 5.1.

Example 5.1.

Let $s>d/2$ . For $h,\delta\in(0,1)$ let $\{x_{i}\}_{i\in I^{(1)}}\subset\{x_{i}\}_{i\in I^{(2)}}\subset\cdots\subset\{x_{i}\}_{i\in I^{(q)}}$ be a nested hierarchy of points in $\Omega$ that are homogeneously distributed at each scale in the sense of the following three inequalities:

(1) $\sup_{x\in\Omega}\min_{i\in I^{(k)}}|x-x_{i}|\leq h^{k}$ ,

(2) $\min_{i\in I^{(k)}}\inf_{x\in\partial\Omega}|x-x_{i}|\geq\delta h^{k}$ , and

(3) $\min_{i,j\in I^{(k)}:i\neq j}|x_{i}-x_{j}|\geq\delta h^{k}$ .

Let $J^{(1)}\coloneqq I^{(1)}$ and $J^{(k)}\coloneqq I^{(k)}\setminus I^{(k-1)}$ for $k\in\{2,\ldots,q\}$ . Let $\boldsymbol{\delta}$ denote the unit Dirac delta function and choose

[TABLE]

Given subsets $\tilde{I},\tilde{J}\subset I$ we extend a matrix $M\in\mathbb{R}^{\tilde{I}\times\tilde{J}}$ to an element of $\mathbb{R}^{I\times J}$ by padding it with zeros.

Example 5.2.

(See Fig. 5.1.) For $h,\delta\in(0,1)$ , let $(\tau_{i}^{(k)})_{i\in I^{(k)}}$ be uniformly Lipschitz convex sets forming a regular nested partition of $\Omega$ in the following sense. For $k\in\{1,\ldots,q\}$ , $\Omega=\bigcup_{i\in I^{(k)}}\tau_{i}^{(k)}$ is a disjoint union except for the boundaries. $I^{(k)}$ is a nested set of indices, i.e. $I^{(k)}\subset I^{(k+1)}$ for $k\in\{1,\ldots,q-1\}$ . For $k\in\{2,\ldots,q\}$ and $i\in I^{(k-1)}$ , there exists a subset $c_{i}\subset I^{(k)}$ such that $i\in c_{i}$ and $\tau_{i}^{(k-1)}=\bigcup_{j\in c_{i}}\tau_{j}^{(k)}$ . Assume that each $\tau_{i}^{(k)}$ contains a ball $B_{\delta h^{k}}(x_{i}^{(k)})$ of center $x_{i}^{(k)}$ and radius $\delta h^{k}$ , and is contained in the ball $B_{h^{k}}(x_{i}^{(k)})$ . For $k\in\{2,\ldots,q\}$ and $i\in I^{(k-1)}$ , let the submatrices $\mathfrak{w}^{(k),i}\in\mathbb{R}^{(c_{i}\setminus\{i\})\times c_{i}}$ satisfy $\sum_{j\in c_{i}}\mathfrak{w}^{(k),i}_{m,j}\mathfrak{w}^{(k),i}_{n,j}|\tau_{j}^{(k)}|=\delta_{mn}$ and $\sum_{j\in c_{i}}\mathfrak{w}^{(k),i}_{l,j}|\tau_{j}^{(k)}|=0$ for each $l\in c_{i}\setminus\{i\}$ , where $|\tau_{i}^{(k)}|$ denotes the volume of $\tau_{i}^{(k)}$ . Let $J^{(1)}\coloneqq I^{(1)}$ and $J^{(k)}\coloneqq I^{(k)}\setminus I^{(k-1)}$ for $k\in\{2,\ldots,q\}$ . Let $W^{(1)}$ be the $J^{(1)}\times I^{(1)}$ matrix defined by $W^{(1)}_{ij}\coloneqq\delta_{ij}$ . Let $W^{(k)}$ be the $J^{(k)}\times I^{(k)}$ matrix defined by $W^{(k)}\coloneqq\sum_{i\in I^{(k-1)}}\mathfrak{w}^{(k),i}$ for $k>2$ , we set

[TABLE]

*and define $[\phi_{i},u]\coloneqq\int_{\Omega}\phi_{i}u\,\mathrm{d}x$ . In order to keep track of the distance between the different $\phi_{i}$ of Example 5.2, we choose an arbitrary set of points $\{x_{i}\}_{i\in I}\subset\Omega$ with the property that $x_{i}\in\operatorname{supp}(\phi_{i})$ for each $i\in I$ . *

5.3 Exponential decay of Cholesky factors

Our bound on the ICHOL(0) approximation error will be based on the following exponential decay estimate on the entries of the Cholesky factor $L$ of $\Theta$ :

[TABLE]

for a constant $\gamma>0$ and a suitable distance measure $d(\,\cdot\,,\,\cdot\,)\colon I\times I\to\mathbb{R}$ .

5.3.1 Algebraic Identities and roadmap

The following block-Cholesky decomposition of $\Theta$ will be used to obtain the estimate (5.6).

Lemma 5.3.

*We have $\Theta=\bar{L}D\bar{L}^{T}$ , with $\bar{L}$ and $D$ defined by *

[TABLE]

*In particular, if $\tilde{L}$ is the lower-triangular Cholesky factor of $D$ , then the lower-triangular Cholesky factor $L$ of $\Theta$ is given by $L=\bar{L}\tilde{L}$ . *

Proof 5.4.

*To obtain Lemma 5.3 we successively apply Lemma 5.5 to $\Theta$ (see Appendix B for details). Lemma 5.5 summarizes classical identities satisfied by Schur complements. *

Lemma 5.5 ([87, Chapter 1.1]).

Let $\Theta=\left(\begin{smallmatrix}\Theta_{1,1}&\Theta_{1,2}\\ \Theta_{2,1}&\Theta_{2,2}\end{smallmatrix}\right)$ be symmetric positive definite and $A=\left(\begin{smallmatrix}A_{1,1}&A_{1,2}\\ A_{2,1}&A_{2,2}\end{smallmatrix}\right)$ its inverse. Then

[TABLE]

where

[TABLE]
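As a numerical sanity check of the kind of Schur-complement identity summarized in Lemma 5.5, the following NumPy sketch verifies three facts for a random symmetric positive definite matrix: the Schur complement of the leading block equals the inverse of the trailing block of $A=\Theta^{-1}$, the two-block factorization $\Theta=\bar{L}D\bar{L}^{T}$ holds, and the Cholesky factor of $\Theta$ is $\bar{L}$ times the Cholesky factor of $D$. The two-block setting, the block sizes, and all variable names are illustrative assumptions; the multilevel factorization of Lemma 5.3 is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 3, 4                      # illustrative block sizes
n = n1 + n2
M = rng.standard_normal((n, n))
theta = M @ M.T + n * np.eye(n)    # random symmetric positive definite matrix
A = np.linalg.inv(theta)

t11, t12 = theta[:n1, :n1], theta[:n1, n1:]
t21, t22 = theta[n1:, :n1], theta[n1:, n1:]

# Schur complement of the leading block equals the inverse of A_{2,2}.
schur = t22 - t21 @ np.linalg.solve(t11, t12)
schur_matches_inverse_block = np.allclose(schur, np.linalg.inv(A[n1:, n1:]))

# Two-block factorization theta = L_bar D L_bar^T with unit block diagonal.
L_bar = np.eye(n)
L_bar[n1:, :n1] = t21 @ np.linalg.inv(t11)
D = np.zeros_like(theta)
D[:n1, :n1] = t11
D[n1:, n1:] = schur
block_ldl_holds = np.allclose(L_bar @ D @ L_bar.T, theta)

# The Cholesky factor of theta is L_bar times the Cholesky factor of D.
L_tilde = np.linalg.cholesky(D)
factor_matches = np.allclose(L_bar @ L_tilde, np.linalg.cholesky(theta))

print(schur_matches_inverse_block, block_ldl_holds, factor_matches)
```

The last check relies on the uniqueness of the Cholesky factor: the product of a unit block-lower-triangular matrix and a block-diagonal lower-triangular matrix with positive diagonal is again lower triangular with positive diagonal.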

Based on Lemma 5.3, (5.6) can be established by ensuring that:

(1) the matrices $A^{(k)}$ (and hence also $B^{(k)}$) decay exponentially according to $d(\,\cdot\,,\,\cdot\,)$;

(2) the matrices $B^{(k)}$ have uniformly bounded condition numbers;

(3) the products of exponentially decaying matrices decay exponentially;

(4) the inverses of well-conditioned exponentially decaying matrices decay exponentially;

(5) the Cholesky factors of the inverses of well-conditioned exponentially decaying matrices decay exponentially; and

(6) if a $q\times q$ block lower-triangular matrix $\bar{L}$ with unit block-diagonal decays exponentially, then so does its inverse.

We will carry out this program in the setting of Examples 5.1 and 5.2 and prove that (5.6) holds with

[TABLE]

To prove (1), the matrices $\Theta^{(k)}$ , $A^{(k)}$ (interpreted as coarse-grained versions of $G$ and $\mathcal{L}$ ), and $B^{(1)}$ will be identified as stiffness matrices of the $\mathcal{L}$ -adapted wavelets described in Section 3.3. This identification is established on the general identities $\Theta_{i,j}^{(k)}=[\phi_{i},G\phi_{j}]$ for $i,j\in I^{(k)}$ , $A^{(k)}=(\Theta^{(k)})^{-1}$ , $A^{(k)}_{i,j}=[\mathcal{L}\psi_{i}^{(k)},\psi_{j}^{(k)}]$ and $B^{(k)}_{i,j}=[\mathcal{L}\chi_{i}^{(k)},\chi_{j}^{(k)}]$ where the $\psi_{i}^{(k)}$ and $\chi_{i}^{(k)}$ are defined as in (3.7) and (3.8).

5.3.2 Exponential decay of $A^{(k)}$

Our proof of the exponential decay of $L$ will be based on that of $A^{(k)}$ as expressed in the following condition:

Condition.

Let $\gamma,C_{\gamma}\in\mathbb{R}_{+}$ be constants such that for $1\leq k\leq q$ and $i,j\in I^{(k)}$ ,

[TABLE]

The matrices $A^{(k)}$ are coarse-grained versions of the local operator $\mathcal{L}$ and as such inherit some of its locality in the form of exponential decay. Such exponential localization results were first obtained by [59] for the coarse-grained operators obtained from local orthogonal decomposition (LOD) applied to second-order elliptic PDEs with rough coefficients. [64] gives similar results for measurement functions chosen as in Example 5.2. [46] extend the results on exponential decay to higher-order operators satisfying a strong ellipticity condition. These results were obtained using similar mass-chasing techniques, which are difficult to extend to general higher-order operators. [52] present a simpler proof of the exponential decay of the LOD basis functions of [59] based on the exponential convergence of subspace iteration methods. [65] extend this technique (by presenting necessary and sufficient conditions expressed as frame inequalities in dual spaces) to elliptic PDEs of arbitrary (integer) order and new classes of (possibly non-conforming) measurements, including those of Example 5.1 and Example 5.2. More recently, [18] show localization results for fractional partial differential operators by using the Caffarelli–Silvestre extension. The results of [65] are sufficient to show that the condition stated in Section 5.3.2 holds true in the setting of Example 5.1 and Example 5.2.

Theorem 5.6 ([65]).

In Example 5.1, the matrices $A^{(k)}$ satisfy

[TABLE]

and in Example 5.2 they satisfy

[TABLE]

with the constants $C_{\gamma}$ and $\gamma$ depending only on $\|\mathcal{L}\|$, $\|\mathcal{L}^{-1}\|$, $s$, $d$, $\Omega$, and $\delta$. In particular, they satisfy the condition of Section 5.3.2 with the constants described above.

Proof 5.7.

Our Example 5.1 is equivalent to Example 2.29 of [65]. In [65, Theorem 2.25 and Theorem 2.26] it is shown that the gamblets $\{\psi_{i}^{(k)}\}_{i\in I^{(k)}}$ computed in this setting decay exponentially on the length-scale $h^{k}$, with respect to the energy norm. By [65, Theorem 3.8] we have $A^{(k)}_{ij}=[\psi_{i}^{(k)},\mathcal{L}\psi_{j}^{(k)}]$ and, therefore, the exponential decay of gamblets implies the exponential decay of the $A^{(k)}$.

We further note that Example 5.2 is equivalent to Example 2.27 in [65]. Therefore, by the same theorems as above, the results of [65] imply exponential decay of the $A^{(k)}$ in this setting. (Footnote 2: We point out that the block $A^{(k)}_{m,l}$ in our notation is $W^{(m)}\pi^{(m,k)}A^{(k)}\pi^{(k,l)}W^{(l),\top}$ in the notation of [65].)

See also [66, Theorem 15.45] for a detailed proof and [66, Theorem 15.43] for the required lower bounds on $A^{(k)}_{ii}$.

5.3.3 Bounded condition numbers

In this section, we will bound the condition numbers of $B^{(k)}$ based on the following condition, which we will show to be satisfied for Examples 5.1 and 5.2.

Condition.

Let $H\in(0,1),C_{\Phi}\geq 1$ be constants such that for $1\leq k<l\leq q$ ,

[TABLE]

Theorem 5.8.

The condition of Section 5.3.3 implies that, for all $1\leq k\leq q$,

[TABLE]

and, for $\kappa\coloneqq H^{-2}C_{\Phi}^{2}$ ,

[TABLE]

Proof 5.9.

The lower bound in (5.19) follows from (5.18) and

[TABLE]

The upper bound in (5.19) follows from (5.17) and $B^{(k)}=\bigl(\bigl(\Theta^{(k)}\bigr)^{-1}\bigr)_{k,k}$.

The following theorem shows that (5.18) is a Poincaré inequality closely related to the accuracy of numerical homogenization basis functions [59, 67, 46] and (5.17) is an inverse Sobolev inequality related to the regularity of the discretization of $\mathcal{L}$ :

Theorem 5.10.

The condition of Section 5.3.3 holds true if the constants $C_{\Phi}\geq 1$ and $H\in(0,1)$ satisfy

(1) $\frac{1}{C_{\Phi}}H^{2k}\leq\frac{\|\phi\|_{\ast}^{2}}{|\alpha|^{2}}$, for $\alpha\in\mathbb{R}^{I^{(k)}}$ and $\phi=\sum_{i\in I^{(k)}}\alpha_{i}\phi_{i}$; and

(2) $\min_{\varphi\in\operatorname{span}(\phi_{i})_{i\in I^{(k-1)}}}\frac{\|\phi-\varphi\|_{\ast}^{2}}{|\alpha|^{2}}\leq C_{\Phi}H^{2(k-1)}$, for $\alpha\in\mathbb{R}^{J^{(l)}}$, $k<l\leq q$, and $\phi=\sum_{i\in J^{(l)}}\alpha_{i}\phi_{i}$.

Proof 5.11.

Inequality (5.17) is a direct consequence of the first assumption of the theorem, whereas (5.18) follows from the variational property [87, Theorem 5.1] of the Schur complement:

[TABLE]

We will now show that Examples 5.1 and 5.2 satisfy the conditions of Theorem 5.10. For simplicity, for $\tilde{\Omega}\subset\Omega$ and $\phi\in H^{-s}(\Omega)$ we still write $\phi$ for the unique element $\tilde{\phi}\in H^{-s}(\tilde{\Omega})$ such that $[\tilde{\phi},u]=[\phi,u]$ for $u\in H_{0}^{s}(\tilde{\Omega})$ . The following Fenchel conjugate identity [16, Ex. 3.27, p. 93] will be useful throughout this section.

[TABLE]

The first condition can be verified in a similar way to what is done in [65].

Lemma 5.12.

Let $\Theta$ be given as in Examples 5.1 and 5.2. Then there exists a constant $C$ depending only on $\delta$ , $s$ , and $d$ , such that

[TABLE]

for $C_{\Phi}=\|\mathcal{L}\|C$, $\alpha\in\mathbb{R}^{I^{(k)}}$, and $\phi=\sum_{i}\alpha_{i}\phi_{i}$.

Proof 5.13.

The proof can be found in Appendix B.

In order to verify the second condition in Theorem 5.10, we will construct a $\varphi$ such that $\phi-\varphi$ integrates to zero against polynomials of order at most $s-1$ on domains of size $h^{k}$ . Then an application of the Bramble–Hilbert lemma [21] will yield the desired factor $h^{ks}$ . To avoid scaling issues we define, for $1\leq k\leq q$ and $i\in I^{(k)}$ ,

[TABLE]

noting that $\operatorname{span}\{\phi^{(k)}_{i}\mid i\in I^{(k)}\}=\operatorname{span}\{\phi_{i}\mid i\in I^{(k)}\}$ . To obtain estimates independent of the regularity of $\Omega$ , for the simplicity of the proof and without loss of generality, we will partially work in the extended space $\mathbb{R}^{d}$ (rather than on $\Omega$ ). We write $v$ for the zero extension of $v\in H_{0}^{s}(\Omega)$ to $H^{s}(\mathbb{R}^{d})$ and $\phi_{i}^{(k)}$ for the extension of $\phi_{i}^{(k)}\in H^{-s}(\Omega)$ to an element of the dual space of $H_{\operatorname{loc}}^{s}(\mathbb{R}^{d})$ . We introduce new measurement functions in the complement of $\Omega$ as follows. For $1\leq k\leq q$ we consider countably infinite index sets $\tilde{I}^{(k)}\supset I^{(k)}$ . We choose points $(x_{i})_{i\in\tilde{I}^{(q)}\setminus I^{(q)}}$ satisfying

[TABLE]

We then define, for $1\leq k\leq q$ and $i\in\tilde{I}^{(k)}$ , $\phi_{i}^{(k)}\coloneqq\delta_{x_{i}}$ for Example 5.1, and $\phi_{i}^{(k)}\coloneqq\frac{\mathbf{1}_{B_{\delta h^{k}}(x_{i})}}{|B_{\delta h^{k}}(x_{i})|}$ for Example 5.2. Let $\mathcal{P}^{s-1}$ denote the linear space of polynomials of degree at most $s-1$ (on $\mathbb{R}^{d}$ ).

Lemma 5.14.

Let $\Theta$ be as in Example 5.1 or Example 5.2. Given $\rho\in(2,\infty)$ and $1\leq k<l\leq q$ let $w\in\mathbb{R}^{J^{(l)}\times\tilde{I}^{(k)}}$ be such that

[TABLE]

and $w_{ij}\neq 0\Rightarrow\operatorname{supp}\left(\phi_{j}^{(k)}\right)\subset B_{\rho h^{k}}(x_{i})$ . Then, for $\alpha\in\mathbb{R}^{J^{(l)}}$ , $\phi\coloneqq\sum_{i\in J^{(l)}}\alpha_{i}\phi_{i}$ and $\varphi\coloneqq\sum_{i\in J^{(l)},j\in I^{(k)}}\alpha_{i}w_{ij}\phi_{j}^{(k)}$ satisfy

[TABLE]

with $\omega_{l,k}\coloneqq\sup_{i\in J^{(l)}}\sum_{j\in\tilde{I}^{(k)}}|w_{ij}|$ and $\|\phi\|_{\ast}\coloneqq\sup_{u\in H^{s}_{0}(\Omega)}[\phi,u]/[\mathcal{L}u,u]^{\frac{1}{2}}$ as in (5.1).

We proceed by proving Lemma 5.14 in the setting of Example 5.1. The proof in the setting of Example 5.2 can be found in Appendix B. For $u\in H^{s}(\Omega)$ write $\mathrm{D}^{0}u\coloneqq u$ and for $1\leq k\leq s$ write $\mathrm{D}^{k}u$ for the vector of partial derivatives of $u$ of order $k$ , i.e. $\mathrm{D}^{k}u\coloneqq\Bigl{(}\frac{\partial^{k}u}{\partial_{i_{1}}\cdots\partial_{i_{k}}}\Bigr{)}_{i_{1},\ldots,i_{k}=1,\ldots,d}$ . The proof of Lemma 5.14 will use the following version of the Bramble–Hilbert lemma:

Lemma 5.15 ([21]).

Let $\Omega\subset\mathbb{R}^{d}$ be convex and let $\phi$ be a sublinear functional on $H^{s}(\Omega)$ for $s\in\mathbb{N}$ such that

(1) there exists a constant $\tilde{C}$ such that, for all $u\in H^{s}(\Omega)$,

[TABLE]

(2) $\phi(p)=0$ for all $p\in\mathcal{P}^{s-1}$.

Then, for all $u\in H^{s}(\Omega)$ ,

[TABLE]

The following lemma is obtained from Lemma 5.15:

Lemma 5.16.

For $1\leq k<l\leq q$ and $i\in J^{(l)}$ , let $\phi_{i},w_{ij}$ be as in Lemma 5.14 and Example 5.2 and define $\varphi_{i}\coloneqq\sum_{j\in I^{(k)}}w_{ij}\phi_{j}^{(k)}$ . Then there exists a constant $C(d,s)$ such that, for all $v\in H_{0}^{s}(\Omega)$ ,

[TABLE]

Proof 5.17.

We apply Lemma 5.15 to the linear functional $u\mapsto\int_{B_{\rho h^{k}}}(\phi_{i}-\varphi_{i})u$ . Since the second requirement of Lemma 5.15 is fulfilled by definition, it remains to bound $\tilde{C}$ . We only execute the proof for Example 5.1; the proof for Example 5.2 is analogous. We first note that while the sum in the definition of $\varphi_{i}$ only ranges over $j\in I^{(k)}$ , we can increase it to run over all of $j\in\tilde{I}^{(k)}$ , since for $j\in\tilde{I}^{(k)}\setminus I^{(k)}$ , the support of $\phi^{(k)}_{j}$ is disjoint from that of $v\in H^{s}_{0}(\Omega)$ . Let $u\in H^{s}(\Omega)$ . Writing $C(d,s)$ for the continuity constant of the embedding of $H^{s}(B_{1}(0))$ into $C_{b}(B_{1}(0))$ , the inequalities

[TABLE]

and

[TABLE]

imply that

[TABLE]

Therefore the first condition of Lemma 5.15 holds with

[TABLE]

and we conclude the proof by writing $C(d,s)$ for any constant depending only on $d$ and $s$.

We can now conclude the proof of Lemma 5.14.

Proof 5.18 (Proof of Lemma 5.14).

Write $\varphi\coloneqq\sum_{i\in J^{(l)}}\alpha_{i}\varphi_{i}$ and $\varphi_{i}\coloneqq\sum_{j\in I^{(k)}}w_{ij}\phi_{j}^{(k)}$ . Equation (5.24) implies that

[TABLE]

The packing inequality $\sum_{i\in J^{(l)}}\|\mathrm{D}^{s}v\|_{L^{2}\left(B_{\rho h^{k}}(x_{i})\right)}^{2}\leq C(d)\left(h^{k-l}\rho/\delta\right)^{d}\|v\|_{H_{0}^{s}(\Omega)}^{2}$ together with Lemma 5.16 yields

[TABLE]

Applying the inequality $2ax-bx^{2}\leq a^{2}/b$ to each summand yields

[TABLE]
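For completeness, the elementary inequality used in this step follows by completing the square, assuming $b>0$ (which is implicit in the way the inequality is applied above):

```latex
\[
  2ax - bx^{2}
  \;=\; \frac{a^{2}}{b} - b\Bigl(x - \frac{a}{b}\Bigr)^{2}
  \;\leq\; \frac{a^{2}}{b}
  \qquad \text{for all } x\in\mathbb{R},\ b>0,
\]
```

with equality exactly at $x=a/b$.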

Since, for all $f\in H^{-s}(\Omega)$ ,

[TABLE]

we have $\|\phi-\varphi\|_{\ast}\leq\sqrt{\|\mathcal{L}^{-1}\|}\|\phi-\varphi\|_{H^{-s}(\Omega)}$, and this completes the proof.

The following geometric lemma shows that the assumption (5.28) of Lemma 5.14 can be satisfied with a uniform bound on the value of $\rho$ and the norm of weights $w_{i,j}$ .

Lemma 5.19.

There exist constants $\rho(d,s)$ and $C(d,s,\delta)$ such that for all $1\leq k<l\leq q$ there exist weights $w\in\mathbb{R}^{J^{(l)}\times\tilde{I}^{(k)}}$ satisfying (5.28) and (with $\omega_{l,k}$ defined as in Lemma 5.14)

[TABLE]

Proof 5.20.

For Example 5.1, (5.28) is equivalent to

[TABLE]

where $\tilde{I}^{(k)}_{\rho}\coloneqq\{j\in\tilde{I}^{(k)}\mid x_{j}\in B(x_{i},\rho h^{k})\}$ .

Fix $i\in J^{(l)}$, let $\lambda>0$ and write $x_{j}^{\lambda}\coloneqq\frac{x_{j}-x_{i}}{\lambda}$. Write $\mathbf{0}\coloneqq(0,\ldots,0)\in\mathbb{R}^{d}$. Since the function $p(\,\cdot\,)\mapsto p\bigl(\frac{\,\cdot\,-x_{i}}{\lambda}\bigr)$ is surjective on $\mathcal{P}^{s-1}$, (5.43) is satisfied if

[TABLE]

For a multiindex $n=(n_{1},\ldots,n_{d})\in\mathbb{N}^{d}$ and a point $z=(z_{1},\ldots,z_{d})\in\mathbb{R}^{d}$, write $z^{n}\coloneqq\prod_{m=1}^{d}z_{m}^{n_{m}}$. Use the convention $\mathbf{0}^{n}=0$ if $n\not=\mathbf{0}$ and $\mathbf{0}^{\mathbf{0}}=1$. To satisfy (5.44) it is sufficient to identify a subset $\sigma$ of $\tilde{I}^{(k)}_{\rho}$ and $w_{i,\cdot}\in\mathbb{R}^{\tilde{I}^{(k)}}$ such that $\#\sigma=s^{d}$, $w_{i,j}=0$ for $j\not\in\sigma$, and

[TABLE]

Let $\mathbb{V}^{\lambda}\in\mathbb{R}^{\{0,1,\dots,s-1\}^{d}\times\sigma}$ be the $s^{d}\times s^{d}$ matrix defined by

[TABLE]

where, for a multiindex $n\in\mathbb{N}^{d}$ and a point $x\in\mathbb{R}^{d}$, $x^{n}\coloneqq\prod_{m=1}^{d}x_{m}^{n_{m}}$. Let $\mathbf{w}\in\mathbb{R}^{\sigma}$ be defined by $\mathbf{w}_{j}\coloneqq w_{i,j}$ for $j\in\sigma$. Equation (5.45) is then equivalent to

[TABLE]

where $\mathbf{e}\in\mathbb{R}^{\{0,1,\dots,s-1\}^{d}}$ is defined by $\mathbf{e}_{n}\coloneqq\mathbf{0}^{n}$ for $n\in\{0,1,\dots,s-1\}^{d}$ . We will now identify $\mathbf{w}$ by inverting (5.47). To achieve this while keeping the norm of $\mathbf{w}$ under control we will seek to identify the subset $\sigma$ and $\lambda>0$ such that $\sigma_{\min}(\mathbb{V}^{\lambda})$ (the minimal singular value of $\mathbb{V}^{\lambda}$ ) is bounded from below by a constant depending only on $s$ and $d$ .

For $\alpha\geq 0$ let $(\epsilon_{j})_{j\in\{0,1,\dots,s-1\}^{d}}$ be elements of $\mathbb{R}^{d}$ satisfying $|\epsilon_{j}|\leq\alpha$ for all $j\in\{0,1,\dots,s-1\}^{d}$ . Let $\mathbf{1}\coloneqq(1,\ldots,1)\in\mathbb{R}^{d}$ and, for $j\in\{0,1,\dots,s-1\}^{d}$ , let $z_{j}\coloneqq\mathbf{1}+j+\epsilon_{j}$ . Observe that for $\alpha=0$ the points $z_{j}$ are on a regular grid. Let $\bar{\mathbb{V}}^{\alpha}\in\mathbb{R}^{\{0,1,\dots,s-1\}^{d}\times\{0,1,\dots,s-1\}^{d}}$ be the $s^{d}\times s^{d}$ matrix defined by $\bar{\mathbb{V}}^{\alpha}_{n,j}\coloneqq\left(z_{j}\right)^{n}$ . Let $V$ be the $s\times s$ Vandermonde matrix defined by $V_{i,j}=i^{j}$ . Writing $\sigma_{\min}(V)$ for the minimal singular value of $V$ we have, for $\alpha=0$ , by [45, Theorem 4.2.12],

[TABLE]

Since univariate polynomial interpolation on $s$ points with polynomials of degree $s-1$ is uniquely solvable, we have $\sigma_{\min}\left(V\right)>0$ and $\sigma_{\min}(\bar{\mathbb{V}}^{0})>C(d,s)>0$. Therefore, the continuity of the minimal singular value with respect to the entries of $\bar{\mathbb{V}}^{\alpha}$ implies that there exist $\alpha^{\ast},\sigma^{\ast}>0$ depending only on $s,d$ such that $\alpha\leq\alpha^{\ast}$ implies $\sigma_{\min}(\bar{\mathbb{V}}^{\alpha})>\sigma^{\ast}$. Since (by construction) the $(x_{i})_{i\in\tilde{I}^{(k)}}$ form a covering of $\mathbb{R}^{d}$ of radius $h^{k}$, the $(x_{i}^{\lambda})_{i\in\tilde{I}^{(k)}}$ form a covering of $\mathbb{R}^{d}$ of radius $h^{k}/\lambda$ and for each $n\in\{0,1,\dots,s-1\}^{d}$ there exists an $x_{j_{n}}^{\lambda}$ that is at distance at most $h^{k}/\lambda$ from $n$. Let $\sigma\coloneqq\{j_{n}\mid n\in\{0,1,\dots,s-1\}^{d}\}\subset\tilde{I}^{(k)}$ be the collection of corresponding labels. It follows from $|x_{j_{n}}^{\lambda}|\leq\sqrt{d}s+h^{k}/\lambda$ that $|x_{j_{n}}-x_{i}|\leq\lambda\sqrt{d}s+h^{k}$, and $\sigma\subset\tilde{I}^{(k)}_{\rho}$ for $\rho>1+\lambda\sqrt{d}s/h^{k}$. Selecting $\lambda=h^{k}/\alpha^{\ast}$ implies that $\sigma_{\min}(\mathbb{V}^{\lambda})>\sigma^{\ast}$ and $\sigma\subset\tilde{I}^{(k)}_{\rho}$ for $\rho>1+\sqrt{d}s/\alpha^{\ast}$. Defining

[TABLE]

the weights $w_{ij}$ satisfy $\omega_{l,k}\leq C(s,d)h^{ld/2}$ and (5.28) with a $\rho$ depending only on $s$ and $d$. This concludes the proof for Example 5.1. The proof is similar for Example 5.2 with minor changes (the bound on $\omega$ also depends on $\delta$).
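The construction in this proof can be illustrated numerically. The sketch below is a toy instance under stated assumptions: $d=2$, $s=3$, a hypothetical perturbation size `alpha = 0.05`, and points $z_{j}=\mathbf{1}+j+\epsilon_{j}$ already expressed in rescaled coordinates. It builds the multivariate monomial matrix, solves for weights $\mathbf{w}$ whose right-hand side is $\mathbf{e}_{n}=\mathbf{0}^{n}$, and checks that the weighted point evaluations reproduce evaluation at the origin for a random polynomial with exponents in $\{0,\dots,s-1\}^{d}$. It also checks that, on the unperturbed grid, the minimal singular value is the $d$-th power of that of the univariate Vandermonde matrix. All function and variable names are illustrative.

```python
import itertools
import numpy as np

def monomial_matrix(points, s):
    """Rows: multi-indices n in {0..s-1}^d. Columns: points z_j. Entry: z_j^n."""
    d = points.shape[1]
    exponents = list(itertools.product(range(s), repeat=d))
    V = np.array([[np.prod(z ** np.array(n)) for z in points] for n in exponents])
    return V, exponents

s, d, alpha = 3, 2, 0.05                     # illustrative choices
rng = np.random.default_rng(1)
grid = np.array(list(itertools.product(range(s), repeat=d)), dtype=float)
eps = rng.uniform(-alpha, alpha, size=grid.shape) / np.sqrt(d)   # |eps_j| <= alpha
points = 1.0 + grid + eps                    # z_j = 1 + j + eps_j

V, exponents = monomial_matrix(points, s)
e = np.array([1.0 if sum(n) == 0 else 0.0 for n in exponents])   # e_n = 0^n
w = np.linalg.solve(V, e)
sigma_min = np.linalg.svd(V, compute_uv=False).min()

# Random polynomial with exponents in {0..s-1}^d.
coeffs = rng.standard_normal(len(exponents))
def p(x):
    return sum(c * np.prod(x ** np.array(n)) for c, n in zip(coeffs, exponents))

reproduced = sum(w_j * p(z) for w_j, z in zip(w, points))
value_at_origin = p(np.zeros(d))

# Unperturbed grid: Kronecker structure gives sigma_min(V0) = sigma_min(V1)^d.
V0, _ = monomial_matrix(1.0 + grid, s)
V1 = np.vander(np.arange(1, s + 1, dtype=float), increasing=True).T
sigma0 = np.linalg.svd(V0, compute_uv=False).min()
sigma1 = np.linalg.svd(V1, compute_uv=False).min()

print(sigma_min, abs(reproduced - value_at_origin), sigma0, sigma1 ** d)
```

Because $\|\mathbf{w}\|\leq\|\mathbf{e}\|/\sigma_{\min}$, a lower bound on the minimal singular value directly controls the size of the weights, which is the role it plays in the argument above.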

The following lemma concerns the satisfaction of the second condition of Theorem 5.10:

Lemma 5.21.

In the setting of Examples 5.1 and 5.2, there exists some constant $C(d,s,\delta)>0$ such that, for $2\leq k<l\leq q$ , $\alpha\in\mathbb{R}^{J^{(l)}}$ and $\phi=\sum_{i}\alpha_{i}\phi_{i}$ ,

[TABLE]

Proof 5.22.

Apply Lemma 5.14 with the bounds on $\rho$ and $\omega$ obtained in Lemma 5.19.

The following theorem is a direct consequence of Theorem 5.10, Lemma 5.12 and Lemma 5.21.

Theorem 5.23.

In the setting of Examples 5.1 and 5.2 there exists a constant $C(d,s,\delta)$ such that the condition of Section 5.3.3 is fulfilled with $C_{\Phi}\coloneqq\max(\|\mathcal{L}\|,\|\mathcal{L}^{-1}\|)C(d,s,\delta)$ and $H\coloneqq h^{s}$.

5.3.4 Propagation of exponential decay

We will now derive the exponential decay of the Cholesky factors $L$ by combining the algebraic identities of Lemma 5.3 with the bounds on the condition numbers of the $B^{(k)}$ (implied by the condition of Section 5.3.3) and the exponential decay of the $A^{(k)}$ (specified in the condition of Section 5.3.2). The core of our proof is based on a combination/extension of the results of [23, 49, 11, 10, 53, 12] on decay algebras. The pseudodistance $d(\,\cdot\,,\,\cdot\,)$ appearing in (5.6) is not a pseudometric because it does not satisfy the triangle inequality. However, to prove (5.6) we will only need the following weaker version of the triangle inequality:

Definition 5.24.

A function $d\colon I\times I\longrightarrow\mathbb{R}_{+}$ is called a hierarchical pseudometric if

(1) $d(i,i)=0$ for all $i\in I$;

(2) $d(i,j)=d(j,i)$ for all $i,j\in I$;

(3) for all $1\leq k\leq q$, $d(\,\cdot\,,\,\cdot\,)$ restricted to $J^{(k)}\times J^{(k)}$ is a pseudometric;

(4) for all $1\leq k\leq l\leq m\leq q$ and $i\in J^{(k)},s\in J^{(l)},j\in J^{(m)}$, we have $d(i,j)\leq d(i,s)+d(s,j)$.
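To make the difference between property (4) and the full triangle inequality concrete, here is a small sketch. It assumes, hypothetically, a distance of the form $d(i,j)=h^{-\min(k,l)}|x_{i}-x_{j}|$ for $i\in J^{(k)}$, $j\in J^{(l)}$; the actual definition used in the paper is given in (5.13), which is not reproduced in this section, so this form should be read as an illustration only. The point set, the level assignment, and the value of `h` are likewise made up.

```python
import itertools
import numpy as np

def d(i, j, x, levels, h):
    """Assumed form: Euclidean distance rescaled by the coarser of the two levels."""
    return h ** (-min(levels[i], levels[j])) * np.linalg.norm(x[i] - x[j])

h = 0.5
rng = np.random.default_rng(2)
levels = np.repeat([1, 2, 3], 5)             # level k of each index, 5 points per level
x = rng.uniform(0.0, 1.0, size=(levels.size, 2))

# Property (4): the triangle inequality holds when the middle index lies
# on an intermediate level, i.e. level(i) <= level(mid) <= level(j).
weak_triangle_ok = all(
    d(i, j, x, levels, h)
    <= d(i, mid, x, levels, h) + d(mid, j, x, levels, h) + 1e-12
    for i, mid, j in itertools.product(range(levels.size), repeat=3)
    if levels[i] <= levels[mid] <= levels[j]
)

# The unrestricted triangle inequality can fail: two fine-level points
# routed through a coarse-level point.
x3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
lev3 = np.array([1, 3, 3])
full_triangle_fails = (
    d(1, 2, x3, lev3, h) > d(1, 0, x3, lev3, h) + d(0, 2, x3, lev3, h)
)

print(weak_triangle_ok, full_triangle_fails)
```

Under this assumed form the restricted inequality holds because $h<1$ gives $h^{-k}\leq h^{-l}$ for $k\leq l$, while routing two fine-level points through a coarse-level point shrinks both legs of the detour.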

Note that the $d(\,\cdot\,,\,\cdot\,)$ specified in (5.13) for Examples 5.1 and 5.2 is a hierarchical pseudometric. For a hierarchical pseudometric $d(\,\cdot\,,\,\cdot\,)$ and $\gamma\in\mathbb{R}_{+}$, let

[TABLE]

The following theorem states the main result of this section:

Theorem 5.25 (Exponential decay of the Cholesky factors).

Assume that $\Theta$ fulfils the conditions of Sections 5.3.2 and 5.3.3 with the constants $\gamma,C_{\gamma},H,C_{\Phi}$ and the hierarchical pseudometric $d(\,\cdot\,,\,\cdot\,)$. Then

[TABLE]

where $C_{R}\coloneqq\max\left\{1,\frac{2C_{\gamma}C_{\Phi}}{1+\kappa}\right\}$, $r\coloneqq\frac{1-\kappa^{-1}}{1+\kappa^{-1}}$, $\tilde{\gamma}\coloneqq\frac{-\log(r)}{1+\log(c_{d}(\gamma/2))+\log(C_{R})-\log(r)}\frac{\gamma}{2}$, and $\kappa=H^{-2}C_{\Phi}^{2}$ is defined as in Theorem 5.8.

The remaining part of this section will present the proof of Theorem 5.25. We will use the following lemma on the stability of exponential decay under matrix multiplication, the proof of which is a minor modification of that of [49].

Lemma 5.26.

Let $I$ be an index set that is partitioned as $I=J^{(1)}\cup\cdots\cup J^{(q)}$ and let $d\colon I\times I\to\mathbb{R}_{\geq 0}$ satisfy

[TABLE]

Let $M^{(k)}\in\mathbb{R}^{J^{(k)}\times J^{(k+1)}}$ be such that $|M^{(k)}_{i,j}|\leq C\exp(-\gamma d(i,j))$ for $1\leq k\leq q-1$ and let

[TABLE]

Then, for $1\leq n\leq q-1$ ,

[TABLE]

Proof 5.27.

Set $i_{1}\coloneqq i$ , $i_{n+1}\coloneqq j$ . Then

[TABLE]
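The mechanism behind this estimate can be checked numerically in a simplified setting. The sketch below assumes a single index set $\{0,\dots,n-1\}$ with the genuine metric $d(i,j)=|i-j|$ and two square matrices, rather than the rectangular level-to-level blocks of the lemma. Splitting $\exp(-\gamma(d(i,k)+d(k,j)))$ into two half-rate factors and using the triangle inequality on one of them gives $|(M_{1}M_{2})_{ij}|\leq C^{2}c_{d}(\gamma/2)\exp(-\tfrac{\gamma}{2}d(i,j))$. The constant in this toy bound is derived for this sketch and is not claimed to match the one in the lemma; the values of `n`, `C`, and `gamma` are arbitrary.

```python
import numpy as np

n, C, gamma = 60, 2.0, 0.7                   # illustrative values
rng = np.random.default_rng(3)
idx = np.arange(n)
dist = np.abs(idx[:, None] - idx[None, :]).astype(float)   # metric d(i, j) = |i - j|
envelope = C * np.exp(-gamma * dist)

# Two random matrices whose entries obey |M_ij| <= C exp(-gamma d(i, j)).
M1 = rng.uniform(-1.0, 1.0, (n, n)) * envelope
M2 = rng.uniform(-1.0, 1.0, (n, n)) * envelope

# c_d(gamma / 2) = sup_j sum_i exp(-gamma / 2 * d(i, j)).
c_d = np.exp(-0.5 * gamma * dist).sum(axis=0).max()

# The product decays at half the rate, with a constant inflated by c_d.
bound = C ** 2 * c_d * np.exp(-0.5 * gamma * dist)
product_decays = bool(np.all(np.abs(M1 @ M2) <= bound + 1e-12))

print(product_decays, c_d)
```

Iterating the same splitting over $n$ factors is what produces a constant that grows geometrically in the number of factors while the decay rate stays at $\gamma/2$.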

The proof of the following lemma (on the stability of exponential decay under matrix inversion for well conditioned matrices) is nearly identical to that of [49] (we only keep track of constants; see also [23] for a related result on the inverse of sparse matrices).

Lemma 5.28.

Let $A\in\mathbb{R}^{I\times I}$ be symmetric and positive definite with $|A_{i,j}|\leq C\exp(-\gamma d(i,j))$ for some $C,\gamma>0$ and a metric $d(\,\cdot\,,\,\cdot\,)$ on $I$. It holds true that

[TABLE]

where $c_{d}(\gamma/2)\coloneqq\sup_{j\in I}\sum_{i\in I}\exp\left(-\frac{\gamma}{2}d(i,j)\right)$, $C_{R}\coloneqq\max\left\{1,\frac{2C}{\|A\|+\|A^{-1}\|^{-1}}\right\}=\max\left\{1,\frac{2C\|A^{-1}\|}{1+\kappa}\right\}$, $r\coloneqq\frac{1-\frac{1}{\|A\|\|A^{-1}\|}}{1+\frac{1}{\|A\|\|A^{-1}\|}}=\frac{1-\kappa^{-1}}{1+\kappa^{-1}}$, and $\kappa\coloneqq\|A\|\|A^{-1}\|$ is the condition number of $A$.

Proof 5.29.

On a compact set not containing $0$, the function $x\mapsto x^{-1}$ can be accurately approximated by low-order polynomials in $x$. Then, the spread of the exponential decay can be controlled by Lemma 5.26. See Appendix B for details.
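The polynomial-approximation idea can be made explicit in the simplest sparse case. The sketch below assumes a tridiagonal symmetric positive definite matrix (a special case closer to the sparse-matrix result of [23] than to the general lemma). Writing $A=\frac{\lambda_{\max}+\lambda_{\min}}{2}(\mathrm{Id}-R)$ gives $\|R\|=r$ and a Neumann series for $A^{-1}$ in powers of $R$; since $R^{k}$ has bandwidth $k$, only the tail $k\geq|i-j|$ contributes to entry $(i,j)$, which yields $|(A^{-1})_{ij}|\leq\|A^{-1}\|\,r^{|i-j|}$. The matrix and its size are arbitrary illustrative choices.

```python
import numpy as np

n = 40
# Tridiagonal SPD test matrix; eigenvalues lie in (0.5, 4.5).
A = 2.5 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

eigs = np.linalg.eigvalsh(A)
lam_min, lam_max = eigs[0], eigs[-1]
kappa = lam_max / lam_min
r = (1.0 - 1.0 / kappa) / (1.0 + 1.0 / kappa)

# A = c (Id - R) with c = (lam_max + lam_min) / 2, so A^{-1} = (1 / c) sum_k R^k.
c = 0.5 * (lam_max + lam_min)
R = np.eye(n) - A / c
norm_R_equals_r = np.isclose(np.linalg.norm(R, 2), r)

# Truncated Neumann series converges to the inverse.
A_inv = np.linalg.inv(A)
partial = sum(np.linalg.matrix_power(R, k) for k in range(400)) / c
series_converges = np.allclose(partial, A_inv)

# Entrywise decay bound |A^{-1}_ij| <= ||A^{-1}|| r^{|i-j|}.
idx = np.arange(n)
dist = np.abs(idx[:, None] - idx[None, :])
bound = r ** dist / lam_min
inverse_decays = bool(np.all(np.abs(A_inv) <= bound * (1.0 + 1e-10)))

print(norm_R_equals_r, series_converges, inverse_decays)
```

For matrices that are exponentially decaying rather than banded, the powers of $R$ are no longer banded, and this is where a product estimate such as Lemma 5.26 is needed to control how the decay spreads.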

By representing Schur complements as matrix inverses, Lemma 5.28 can also be used to show that the Cholesky factors of well-conditioned exponentially-decaying matrices are exponentially decaying. The following lemma appears in a similar form in [12] for banded matrices and in [53] without explicit constants.

Lemma 5.30.

Let $B\in\mathbb{R}^{I\times I}\simeq\mathbb{R}^{N\times N}$ be symmetric and positive definite with condition number $\kappa$ and such that $\left|B_{i,j}\right|\leq C\exp(-\gamma d(i,j))$ for some constant $C>0$ and some metric $d$ on $I$ . Let $L$ be the Cholesky factor (in an arbitrary order) of $B^{-1}$ ( $B^{-1}=LL^{T}$ ). Then

[TABLE]

where $c_{d}(\gamma/2)\coloneqq\sup_{j\in I}\sum_{i\in I}\exp\left(-\frac{\gamma}{2}d(i,j)\right)$, $C_{R}\coloneqq\max\left\{1,\frac{2C\|B^{-1}\|}{1+\kappa}\right\}$, and $r\coloneqq\frac{1-\kappa^{-1}}{1+\kappa^{-1}}$.

Proof 5.31.

Lemma 5.5 implies that the Schur complements of $B^{-1}$ can be expressed as inverses of sub-matrices of $B$. The result then follows from Lemma 5.28 (see B.5 for details).
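The identity underlying this proof can be verified directly. In the sketch below, which uses an arbitrary tridiagonal test matrix $B$ and the natural elimination order, the $k$-th column of the Cholesky factor of $B^{-1}$ is recovered from the inverse of the trailing sub-matrix `B[k:, k:]` alone, since that inverse is the Schur complement of $B^{-1}$ after eliminating the first $k$ indices. The helper name `column_from_submatrix` is hypothetical.

```python
import numpy as np

n = 12
B = 2.5 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # well-conditioned SPD
L = np.linalg.cholesky(np.linalg.inv(B))                 # B^{-1} = L L^T

def column_from_submatrix(B, k):
    """k-th Cholesky column of B^{-1}, computed from inv(B[k:, k:]) only."""
    S = np.linalg.inv(B[k:, k:])     # Schur complement of B^{-1} w.r.t. indices < k
    return S[:, 0] / np.sqrt(S[0, 0])

columns_match = all(
    np.allclose(L[k:, k], column_from_submatrix(B, k)) for k in range(n)
)
print(columns_match)
```

Each sub-matrix `B[k:, k:]` inherits the decay of $B$ and has condition number at most $\kappa$, so a decay estimate for inverses applies to every column with the same constants.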

The last ingredient needed to prove the exponential decay of the Cholesky factors of $\Theta$ is the following lemma showing the stability of exponential decay under inversion for block-lower-triangular matrices (this operation appears in the definition of $\bar{L}$ in (5.7)):

Lemma 5.32.

Let $I$ be an index set that is partitioned as $I=J^{(1)}\cup\cdots\cup J^{(q)}$ and assume that the matrix $L\in\mathbb{R}^{I\times I}$ is block-lower triangular with respect to this partition, with identity matrices as diagonal blocks. If $d(\,\cdot\,,\,\cdot\,)$ is a hierarchical pseudometric such that $|L_{ij}|\leq C\exp\left(-\gamma d(i,j)\right)$ (for some $C\geq 1$ and $\gamma>0$) then it holds true that

[TABLE]

with $c_{d}(\gamma)\coloneqq\sup_{1\leq k\leq l\leq q}\sup_{j\in J^{(l)}}\sum_{i\in J^{(k)}}\exp\left(-\gamma d(i,j)\right)$.

Proof 5.33.

The Neumann series of a $q\times q$ block-lower-triangular matrix with identity matrices on the (block) diagonal can be written as

[TABLE]

Since the sum terminates in $q$ steps, the thickening of the exponential decay can be bounded using Lemma 5.26. See B.6 for details.
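That the Neumann series terminates is easy to confirm numerically. The sketch below builds a random block-lower-triangular matrix with identity diagonal blocks (block sizes chosen arbitrarily, $q=4$), forms $N\coloneqq\mathrm{Id}-L$, and checks that $N^{q}=0$ and $L^{-1}=\sum_{k=0}^{q-1}N^{k}$. The sign convention used here is one of several equivalent ways of writing the series and is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
block_sizes = [2, 3, 4, 3]                       # q = 4 blocks, sizes are arbitrary
q, n = len(block_sizes), sum(block_sizes)
offsets = np.concatenate(([0], np.cumsum(block_sizes)))

# Block-lower-triangular matrix with identity matrices as diagonal blocks.
L = np.eye(n)
for k in range(1, q):
    rows = slice(offsets[k], offsets[k + 1])
    L[rows, :offsets[k]] = rng.standard_normal((block_sizes[k], offsets[k]))

N = np.eye(n) - L                                # block strictly lower triangular
nilpotent = np.allclose(np.linalg.matrix_power(N, q), 0.0)

# Finite Neumann series: L^{-1} = sum_{k=0}^{q-1} N^k.
neumann = sum(np.linalg.matrix_power(N, k) for k in range(q))
series_is_inverse = np.allclose(neumann, np.linalg.inv(L))

print(nilpotent, series_is_inverse)
```

Each multiplication by $N$ moves nonzero blocks at least one block-row further below the diagonal, which is why at most $q-1$ powers survive.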

By applying the above results to the decomposition obtained in Lemma 5.3, we conclude the proof of Theorem 5.25. See B.7 for details.

5.4 Complexity and error estimates

The results of the previous sections allow us to prove the following theorem on the exponential decay of the Cholesky factors and the accuracy of their truncation:

Theorem 5.34.

In the setting of Examples 5.1 and 5.2 there exist constants $C,\gamma,\alpha>0$ depending only on $d$ , $\Omega$ , $s$ , $\|\mathcal{L}\|$ , $\|\mathcal{L}^{-1}\|$ , $h$ , and $\delta$ , such that the entries of the Cholesky factor $L$ of $\Theta$ satisfy

[TABLE]

where $d\colon I\times I\to\mathbb{R}$ is the hierarchical pseudometric defined by

[TABLE]

As a consequence, writing

[TABLE]

with $S\supset S_{d,\rho}\coloneqq\{(i,j)\mid d(i,j)\leq\rho\}$, we have $\bigl\|\Theta-L^{S}L^{S,\top}\bigr\|_{\operatorname{Fro}}\leq\epsilon$ for $\rho\geq\tilde{C}(C,\gamma)\log(N/\epsilon)$. Furthermore, writing $E\coloneqq\Theta-L^{S}L^{S,\top}$, using the $\epsilon$-perturbation $\Theta-E$ of $\Theta$ as the input to Algorithm 2 returns $L^{S}$ as the output.

Proof 5.35.

Theorems 5.6 and 5.23 imply that the conditions of Sections 5.3.2 and 5.3.3 are fulfilled with constants depending only on $d$, $s$, $\|\mathcal{L}\|$, $\|\mathcal{L}^{-1}\|$, $h$, and $\delta$. Theorem 5.25 then yields the exponential decay of $L$. The accuracy of the truncated factors follows directly from the exponential decay.

Theorem 3.1 is a direct consequence of Theorem 5.34.

Proof 5.36 (Proof of Theorem 3.1).

As described in Section 3.3, the maximin ordering can be represented as a hierarchical ordering satisfying the conditions of Example 5.1. The result follows from Theorem 5.34 by observing that the sparsity pattern $S_{\rho}$ specified in Section 2 satisfies

[TABLE]

Scaling the weights of the measurement functions $\phi_{i}$ to $1$ increases the error by a factor that is at most polynomial in $N$, which can be subsumed into the $\log(N)$-dependence of $\rho$ by increasing the constants in the decay estimates.

While accurate (per Theorem 5.34), it is computationally inefficient to compute the full Cholesky factor first (with Algorithm 1) and then truncate it according to $S_{\rho}$ . Instead, we want to directly compute an approximation of $L$ from the incomplete factorization Algorithm 2, whose complexity is bounded by the following theorem:

Theorem 5.37.

In the setting of Examples 5.1 and 5.2, there exists a constant $C(d,\delta)$, such that, for $S\subset\{(i,j)\mid d(i,j)\leq\rho\}$, the application of Algorithm 2 has computational complexity $C(d,\delta)Nq\rho^{d}$ in space and $C(d,\delta)Nq^{2}\rho^{2d}$ in time. In particular, $q\propto\log N/\ln\frac{1}{h^{d}}$ implies the upper bounds of $C(d,\delta,h)\rho^{d}N\log N$ on the space complexity, and of $C(d,\delta,h)\rho^{2d}N\log^{2}N$ on the time complexity.

Proof 5.38.

Defining $m\coloneqq\max_{j\in I,1\leq k\leq q}\#\{i\in J^{(k)}\mid i\prec j\text{ and }d(i,j)\leq\rho\}$ , $|x_{i}-x_{j}|\geq\delta^{-1}h^{l}$ for $i,j\in I^{(l)}$ implies that $m\leq C(d,\delta)\rho^{d}$ . Therefore $\#\{i\in I\mid i\prec j\text{ and }d(i,j)\leq\rho\}\leq qmN$ implies the bound on space complexity.

Consider the structure of the nested for-loops of Algorithm 2 and observe that, for every $k$ in the innermost loop, the number of distinct $(i,j)$ satisfying $i\prec j\prec k$, $(j,k)\in S$ and $(i,j)\in S$ is at most $(qm)^{2}$. This implies the upper bound $N(qm)^{2}$ on the time complexity.
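Algorithm 2 is not reproduced in this section, so the following is only a generic sketch of a zero fill-in incomplete Cholesky factorization restricted to a sparsity pattern, written with dense arrays for readability (so it does not attain the complexity stated above). The function name `ichol0`, the banded pattern, and the test matrix are assumptions made for illustration. The sketch checks two things: with the full lower-triangular pattern it reproduces the exact Cholesky factor, and, in line with the last statement of Theorem 5.34, applying it to $L^{S}L^{S,\top}$ with pattern $S$ returns the truncated factor $L^{S}$.

```python
import numpy as np

def ichol0(theta, pattern):
    """Right-looking incomplete Cholesky; updates only entries inside `pattern`.

    `pattern` is a boolean lower-triangular mask that contains the diagonal.
    """
    n = theta.shape[0]
    L = np.where(pattern, np.tril(theta), 0.0)
    for i in range(n):
        L[i, i] = np.sqrt(L[i, i])
        below = np.nonzero(pattern[i + 1:, i])[0] + i + 1   # rows k > i with (k, i) in S
        L[below, i] /= L[i, i]
        for j in below:
            for k in below[below >= j]:
                if pattern[k, j]:                           # skip fill-in outside S
                    L[k, j] -= L[k, i] * L[j, i]
    return L

n = 30
B = 2.5 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
theta = np.linalg.inv(B)                 # dense SPD matrix with decaying entries
L_exact = np.linalg.cholesky(theta)

full = np.tril(np.ones((n, n), dtype=bool))
idx = np.arange(n)
band = full & (np.abs(idx[:, None] - idx[None, :]) <= 3)    # illustrative pattern S

exact_with_full_pattern = np.allclose(ichol0(theta, full), L_exact)

# Truncate the exact factor to S, perturb theta accordingly, and refactorize.
L_S = np.where(band, L_exact, 0.0)
recovers_truncation = np.allclose(ichol0(L_S @ L_S.T, band), L_S)

print(exact_with_full_pattern, recovers_truncation)
```

The second check works because every update that the pattern suppresses would have multiplied an entry of $L^{S}$ that is zero anyway.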

Theorems 5.34 and 5.37 imply that the application of Algorithm 2 to $\Theta-E$ (the $\epsilon$ -perturbation of $\Theta$ described in Theorem 5.34) returns an $\epsilon$ -accurate Cholesky factorization of $\Theta$ in computational complexity $\mathcal{O}(N\log^{2}(N)\log^{2d}(N/\epsilon))$ . In practice we do not have access to $E$ , so we need to rely on the stability of Algorithm 2 to deduce that $\Theta$ and $\Theta-E$ (used as inputs) would yield similar outputs, for sufficiently small $E$ . Even though such a stability property of ICHOL(0) would also be required by prior works on incomplete LU-factorization such as [33], we did not find this type of result in the literature. We also found it surprisingly difficult to prove (and were unable to do so) for the maximin ordering and sparsity pattern, although we always observed stability of Algorithm 2 in practice, for reasonable values of $\rho$ . We can however prove stability of Algorithm 2 when using a slight modification of the ordering and sparsity pattern that compromises neither the computational complexity nor the accuracy of the factorization. The modified ordering and sparsity pattern, being inspired by the concepts of red-black orderings [48] and supernodal factorizations [73, 58] also allows one to take advantage of parallelism and dense linear algebra operations and could therefore be used to improve the practical performance of the algorithm. For $r>0$ , $1\leq k\leq q$ and $i\in J^{(k)}$ , write

[TABLE]

Construction (Supernodal multicolor ordering and sparsity pattern).

Let $\Theta\in\mathbb{R}^{I\times I}$ with $I\coloneqq\bigcup_{1\leq k\leq q}J^{(k)}$ and let $d(\,\cdot\,,\,\cdot\,)$ be a hierarchical pseudometric. For $\rho\geq 1$, define the supernodal multicolor ordering $\prec_{\rho}$ and sparsity pattern $S_{\rho}$ as follows. For each $k\in\{1,\ldots,q\}$, select a subset $\tilde{J}^{(k)}\subset J^{(k)}$ of indices such that

[TABLE]

Assign every index in $J^{(k)}$ to the element of $\tilde{J}^{(k)}$ closest to it, using an arbitrary method to break ties. That is, writing $j\leadsto\tilde{j}$ for the assignment of $j$ to $\tilde{j}$ ,

[TABLE]

for all $j\in J^{(k)}$ and $\tilde{j}\in\tilde{J}^{(k)}$ such that $j\leadsto\tilde{j}$ . Define $\tilde{I}\coloneqq\bigcup_{1\leq k\leq q}\tilde{J}^{(k)}$ and define the auxiliary sparsity pattern $\tilde{S}_{\rho}\subset\tilde{I}\times\tilde{I}$ by

[TABLE]

Define the sparsity pattern $S_{\rho}\subset I\times I$ as

[TABLE]

and call the elements of $\tilde{J}^{(k)}$ supernodes. Color each $\tilde{j}\in\tilde{J}^{(k)}$ in one of $p^{(k)}$ colors such that no $\tilde{i},\tilde{j}\in\tilde{J}^{(k)}$ with $\left(\tilde{i},\tilde{j}\right)\in\tilde{S}_{\rho}$ have the same color. For $i\in J^{(k)}$ write $\text{node}(i)$ for the $\tilde{i}\in\tilde{J}^{(k)}$ such that $i\leadsto\tilde{i}$ and write $\text{color}(\tilde{i})$ for the color of $\tilde{i}$ . Define the supernodal multicolor ordering $\prec_{\rho}$ by reordering the elements of $I$ such that

(1) $i\prec_{\rho}j$ for $i\in J^{(k)}$, $j\in J^{(l)}$ and $k<l$;

(2) within each level $J^{(k)}$, we order the elements of supernodes colored in the same color consecutively, i.e. given $i,j\in J^{(k)}$ such that $\text{color}(\text{node}(i))\not=\text{color}(\text{node}(j))$, $i\prec_{\rho}j\implies i^{\prime}\prec_{\rho}j^{\prime}$ for $\text{color}(\text{node}(i^{\prime}))=\text{color}(\text{node}(i))$, and $\text{color}(\text{node}(j^{\prime}))=\text{color}(\text{node}(j))$; and

(3) the elements of each supernode appear consecutively, i.e. given $i,j\in J^{(k)}$ such that $\text{node}(i)\not=\text{node}(j)$, $i\prec_{\rho}j\implies i^{\prime}\prec_{\rho}j^{\prime}$ for $\text{node}(i^{\prime})=\text{node}(i)$, and $\text{node}(j^{\prime})=\text{node}(j)$.

Starting from a hierarchical ordering and sparsity pattern, the modified ordering and sparsity pattern can be obtained efficiently:

Lemma 5.39.

In the setting of Examples 5.1 and 5.2, given $\{(i,j)\mid d(i,j)\leq\rho\}$, there exist constants $C$ and $p_{\max}$ depending only on the dimension $d$ and the cost of computing $d(\,\cdot\,,\,\cdot\,)$ such that the ordering and sparsity pattern presented in 5.38 can be constructed with $p^{(k)}\leq p_{\max}$, for each $1\leq k\leq q$, in computational complexity $Cq\rho^{d}N$.

Proof 5.40.

*The aggregation into supernodes can be done via a greedy algorithm by keeping track of all nodes that are not already within distance $\rho/2$ of a supernode and removing them one at a time. We can then go through $\rho$-neighbourhoods and remove points within distance $\rho/2$ from our list of candidates for future supernodes. To create the coloring, we use the greedy graph coloring of [47] on the undirected graph $G$ with vertices $\tilde{J}^{(k)}$ and edges $\bigl\{(\tilde{i},\tilde{j})\in\tilde{S}_{\rho}\,\big|\,\tilde{i},\tilde{j}\in\tilde{J}^{(k)}\bigr\}$. Defining $\deg(G)$ as the maximum number of edges connected to any vertex of $G$, the computational complexity of greedy graph coloring is bounded above by $\deg(G)\#\left(J^{(k)}\right)$ and the number of colors used by $\deg(G)+1$. A sphere-packing argument shows that $\deg(G)$ is at most a constant depending only on the dimension $d$, which yields the result.*
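
The coloring step of the proof can be illustrated with a minimal sketch. It assumes a generic greedy coloring (not necessarily the exact variant of [47]) and uses a hypothetical toy graph in place of the supernode graph $G$; the names `greedy_coloring`, `vertices`, and `edges` are illustrative only. Each vertex receives the smallest color not used by an already-colored neighbor, so at most $\deg(G)+1$ colors appear.

```python
from collections import defaultdict


def greedy_coloring(vertices, edges):
    """Give each vertex the smallest color unused by its colored neighbors."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)
    color = {}
    for v in vertices:
        used = {color[w] for w in neighbors[v] if w in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


# Hypothetical toy graph: a 5-cycle of supernodes, maximum degree 2.
vertices = list(range(5))
edges = [(i, (i + 1) % 5) for i in range(5)]
max_degree = 2

color = greedy_coloring(vertices, edges)

# No edge joins two vertices of the same color.
assert all(color[u] != color[v] for u, v in edges)
# The number of colors is at most deg(G) + 1.
assert max(color.values()) + 1 <= max_degree + 1
```

Each vertex inspects only its own neighbors, which is where the cost bound proportional to $\deg(G)$ times the number of vertices comes from.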

Theorem 5.41.

In the setting of Examples 5.1 and 5.2, there exists a constant $C$ depending only on $d,s,\|\mathcal{L}\|,\|\mathcal{L}^{-1}\|$ , $h$ , and $\delta$ such that, given the ordering $\prec_{\rho}$ and sparsity pattern $S_{\rho}$ defined as in 5.38 with $\rho\geq C\log(N/\epsilon)$ , the incomplete Cholesky factor $L$ obtained from Algorithm 2 has accuracy

[TABLE]

*Furthermore, Algorithm 2 has complexity of at most $CN\rho^{2d}\log^{2}N$ in time and at most $CN\rho^{d}\log N$ in space. *

Proof 5.42.

*The triangle inequality implies that $S_{\rho}\subset\{(i,j)\mid d(i,j)\leq 2\rho\}$ and hence the bound on the complexity of Algorithm 2 follows from Theorem 5.37. The approximation property of the incomplete factors follows from the last part of Theorem 5.34 and a stability result for the incomplete Cholesky factorization with the supernodal multicolor ordering and sparsity pattern detailed in Appendix C. *

This allows us to prove the main theorem presented in the introduction.

Proof 5.43 (Proof of Theorem 2.1).

*Theorem 2.1 follows from Theorem 5.41 since rescaling the weights of the measurements to $1$ increases bounds on errors by at most a multiplicative polynomial factor in $N$. By increasing the constant, this factor can be subsumed in the $N$-dependence of $\rho$.*

We have now established the results on exponential decay of the Cholesky factors of $\Theta$ and the accuracy of Algorithm 2. Before proceeding to the next section, we will quickly establish a result on low-rank approximation of the Cholesky factors.

Theorem 5.44 (Approximate PCA).

In the setting of Theorem 3.1, take $\rho=\infty$ and let $L^{(k)}$ be the matrix formed by the leading $k$ columns of the Cholesky factors of $\Theta$ in the maximin ordering. Let $l[i_{k}]$ be as in (2.7). Then there exists a constant $C$ depending only on $\|\mathcal{L}\|$ , $\|\mathcal{L}^{-1}\|$ , $d$ , and $s$ such that

[TABLE]

Proof 5.45.

*Write $I=I_{1}\cup I_{2}$ with $I_{1}\coloneqq\left\{i_{1},\dots,i_{k}\right\}$ and $I_{2}\coloneqq I\setminus I_{1}$ . By Lemma 5.5, the approximation error made by keeping only the first $k$ columns of the Cholesky factorization is equal to the Schur complement $\Theta_{2,2}-\Theta_{2,1}\Theta_{1,1}^{-1}\Theta_{1,2}$ . Consider the implicit hierarchy of the maximin ordering as in Fig. 3.5 with $h=1/2$ and let $p\in\{1,\ldots,q\}$ be such that $2^{-p}\leq l[k]/l[1]\leq 2^{-p+1}$ . Write $I=I_{a}\cup I_{b}$ with $I_{a}\coloneqq I^{(p)}$ and $I_{b}\coloneqq I\setminus I^{(p)}$ . The variational property (5.22) implies that $\Theta_{2,2}-\Theta_{2,1}\Theta_{1,1}^{-1}\Theta_{1,2}\leq\Theta_{b,b}-\Theta_{b,a}\Theta_{a,a}^{-1}\Theta_{a,b}$ . Theorem 5.23 (with $h=1/2$ obtained from the implicit hierarchy of Fig. 3.5) implies that $\Theta_{b,b}-\Theta_{b,a}\Theta_{a,a}^{-1}\Theta_{a,b}\leq C(\frac{1}{2})^{2s(p-1)-d}$ (the extra multiplicative $(\frac{1}{2})^{-d}$ term arises because the measurement functions are scaled by $h^{kd/2}$ in Example 5.1 with $h=\tfrac{1}{2}$ ). We conclude the proof using $2^{-p-1}\leq l[k+1]/l[1]\leq 2^{-p+1}$ . *
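
The first step of this proof, that truncating the Cholesky factor to its leading $k$ columns leaves exactly the Schur complement as error, can be checked numerically. The sketch below is an illustration under stated assumptions: it uses a random symmetric positive definite matrix rather than a kernel matrix in the maximin ordering, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3
X = rng.standard_normal((n, n))
Theta = X @ X.T + n * np.eye(n)      # symmetric positive definite test matrix

L = np.linalg.cholesky(Theta)        # lower triangular, Theta = L @ L.T
Lk = L[:, :k]                        # leading k columns of the factor
residual = Theta - Lk @ Lk.T         # error of the rank-k approximation

# Schur complement of the leading k x k block.
schur = Theta[k:, k:] - Theta[k:, :k] @ np.linalg.solve(Theta[:k, :k], Theta[:k, k:])

# The error vanishes in the first k rows and columns ...
assert np.allclose(residual[:k, :], 0.0)
assert np.allclose(residual[:, :k], 0.0)
# ... and equals the Schur complement on the remaining block.
assert np.allclose(residual[k:, k:], schur)
```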

6 Extensions and byproducts

6.1 The cases $s\leq d/2\text{ or }s\notin\mathbb{N}$

Theorem 3.1 requires that $s>d/2$ to ensure that the elements of $H^{s}(\Omega)$ are continuous (by the Sobolev embedding theorem) and that pointwise evaluations of the Green’s function are well defined. The accuracy estimate of Theorem 3.1 can be extended to $s\leq d/2$ by replacing pointwise evaluations of the Green’s function by local averages and using variants of the Haar pre-wavelets of Example 5.2 instead of variants of the subsampled Diracs of Example 5.1 to decompose $\Theta$ as in (3.13). Numerical experiments also suggest that the exponential decay of Cholesky factors still holds for $s\leq d/2$ if the local averages of Example 5.2 are sub-sampled as in Example 5.1, whereas the low-rank approximation becomes sub-optimal. As illustrated in Table 4.5, for Matérn kernels we observe no difference (in accuracy vs. complexity) between integer and non-integer values of $s$ .

6.2 Sparse factorization of $A=\Theta^{-1}$

Let $LL^{\top}=\Theta$ be the Cholesky factorization of the covariance matrix $\Theta$ . Writing $P^{\updownarrow}$ for the order-reversing permutation,

[TABLE]

Since $P^{\updownarrow}L^{-\top}P^{\updownarrow}$ is lower triangular, it is the Cholesky factor of $\Theta^{-1}$ in the reverse elimination ordering. Furthermore, since $L^{-\top}=AL$ and both $A$ and $L$ are exponentially decaying, the Cholesky factors of $A$ are also exponentially decaying if the Gaussian elimination is performed using the reverse of Section 2’s ordering. In fact, the following, stronger, theorem holds:

Theorem 6.1.

In the setting of Theorem 3.1, let

[TABLE]

let $L$ be the Cholesky factor of $A$ in the reverse ordering, and define

[TABLE]

*Then there exists a constant $C$ depending only on $d$, $\Omega$, $s$, $\|\mathcal{L}\|$, $\|\mathcal{L}^{-1}\|$ and $\delta$ such that for $\rho\geq C\log(N/\epsilon)$, we have $\bigl\|PAP-L^{\mathring{S}_{\rho}}L^{\mathring{S}_{\rho},\top}\bigr\|_{\operatorname{Fro}}\leq\epsilon$.*

Using this result and the fact that $\mathring{S}_{\rho}$ has $\mathcal{O}(\rho^{d}+1)$ nonzero entries per column, one can prove that using Algorithm 2 with a supernodal ordering as described in 5.38 yields an $\epsilon$-approximate Cholesky factorization of $A$ in computational complexity $\mathcal{O}(N\log(N/\epsilon)^{2d})$ in time and $\mathcal{O}(N\log(N/\epsilon)^{d})$ in space. The matrix $A$ is essentially a discretized elliptic partial differential operator, and analogous results can be obtained in the setting where $A$ is obtained as a discretization of $\mathcal{L}$ with regular finite elements and $\Theta$ is the inverse of that discretized operator. Numerical experiments suggest that exponential decay properties also hold for discretized second-order elliptic equations in two or three dimensions (where $s=1\leq d/2$) when using subsampling as in Example 5.1; see [76, Section 3.1] for a special case of this result on regular meshes. Thus, by computing the incomplete Cholesky factorization, we obtain a direct solver for general elliptic PDEs with complexity $\mathcal{O}(N\log(N/\epsilon)^{2d})$ in time and $\mathcal{O}(N\log(N/\epsilon)^{d})$ in space. To the best of our knowledge, this is the best asymptotic complexity reported for such a solver in the literature (for elliptic PDEs with rough coefficients and rigorous a priori estimates of complexity vs. accuracy). It is not surprising that we obtain a fast solver for elliptic PDEs because our work is based on the fast solvers introduced in [64, 65], which in turn can be shown to be a block-wise version of the Cholesky factorization in nonstandard form introduced by [33], where the inverses of diagonal blocks are computed using iterative methods. By instead applying the Cholesky factorization in nonstandard form, the logarithmic factor in the complexity of the gamblet transform can be improved.
However, the error estimates of [64] and [65] improve significantly upon those in [33] by establishing that exponential accuracy can be obtained with a finite number of vanishing moments even for rough coefficients. The present work further extends the results on Cholesky factorization to the setting of multiresolution schemes based on subsampling (without any vanishing moments). For such multiresolution bases, the nonstandard form just reduces to computing an ordinary incomplete Cholesky factorization with the smaller sparsity pattern $\mathring{S}_{\rho}$, thus greatly simplifying the implementation. We note that by using direct inversion methods similar to [55] it would be possible in principle to directly compute $\epsilon$-approximations of the Cholesky factors of $\Theta^{-1}$ from $\mathcal{O}(N\log(N/\epsilon)^{d})$ entries of $\Theta$ at computational cost of $\mathcal{O}(N\log(N/\epsilon)^{2d})$, but we defer a more detailed investigation to future work.
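
The observation at the start of this subsection, that $P^{\updownarrow}L^{-\top}P^{\updownarrow}$ is the Cholesky factor of $\Theta^{-1}$ in the reverse elimination ordering, can be verified on a small dense example. The sketch below assumes a random symmetric positive definite matrix in place of a kernel matrix and says nothing about sparsity or decay; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
X = rng.standard_normal((n, n))
Theta = X @ X.T + n * np.eye(n)      # symmetric positive definite test matrix
L = np.linalg.cholesky(Theta)        # Theta = L @ L.T

P = np.eye(n)[::-1]                  # order-reversing permutation matrix
A = np.linalg.inv(Theta)             # A = Theta^{-1}
L_inv_T = np.linalg.inv(L).T         # L^{-T}, upper triangular
factor = P @ L_inv_T @ P             # reversing rows and columns makes it lower triangular

assert np.allclose(factor, np.tril(factor))
# factor is a Cholesky factor of the reordered inverse ...
assert np.allclose(factor @ factor.T, P @ A @ P)
# ... and agrees with the one computed directly.
assert np.allclose(factor, np.linalg.cholesky(P @ A @ P))
# The identity L^{-T} = A L used in the text.
assert np.allclose(L_inv_T, A @ L)
```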

7 Comparison to related work

7.1 $\mathcal{H}$ -matrix approximations from sparse Cholesky factorization

The $\mathcal{H}$ -matrix data structure [39] uses low-rank approximations for blocks $\Theta_{\bar{I}\bar{J}}$ ( $\bar{I},\bar{J}\subset I$ ) fulfilling the admissibility condition

[TABLE]

The approximation property of the incomplete Cholesky factorization in maximin ordering (Theorem 3.1) directly implies bounds on the spectral decay of admissible blocks in the $\mathcal{H}$ -matrix framework, as can be seen from the representation

[TABLE]

of the Cholesky factorization of $\Theta$ . If $L$ is sparse according to the sparsity pattern obtained in Section 2 then $L_{:i}\otimes L_{:i}$ can contribute to the rank of the sub-matrix $\Theta_{\bar{I}\bar{J}}$ only if

[TABLE]

The number of $i\in I$ satisfying (7.3) is at most $C(\eta,d)\rho^{d}\log N$, which recovers (up to constants) the same rank bounds as obtained in [8] for second-order elliptic PDEs with rough coefficients. However, the converse is not true and most hierarchical matrix representations cannot be written in terms of a sparse Cholesky factorization of $\Theta$. For example, adding a diagonal matrix to $\Theta$ does not affect the ranks of admissible blocks, but it diminishes the screening effect and thus the approximation property of the incomplete Cholesky factorization as obtained in Section 2 (see Section 4.3).

7.2 Comparison to Cholesky factorization in wavelet bases

[33] compute sparse Cholesky factorizations of (discretized) differential/integral operators represented in a wavelet basis. Using a fine-to-coarse elimination ordering, they establish that the resulting Cholesky factors decay polynomially with an exponent matching the number of vanishing moments of the underlying wavelet basis.

For differential operators, this coincides algorithmically with the Cholesky factorization described in Section 6.2 and the gamblet transform of [64] and [65], whose estimates guarantee exponential decay. In particular [33] numerically observe a uniform bound on $\operatorname{cond}(B^{(k)})$ which they relate to the approximate sparsity of their proposed Cholesky factorization.

For integral operators, [33] use a fine-to-coarse ordering and we use a coarse-to-fine ordering. While their results rely on the approximate sparsity of the integral operator represented in the wavelet basis, our approximation remains accurate for multiresolution bases (e.g. the maximin ordering in Section 2) in which $\Theta$ is dense, which avoids the $\mathcal{O}(N^{2})$ complexity of a basis transform (or the implementation of adaptive quadrature rules to mitigate this cost).

7.3 Vanishing moments

Let $\mathcal{P}^{s-1}(\tau)$ denote the set of polynomials of order at most $s-1$ that are supported on $\tau\subset\Omega$ . [64] and [65] show that (5.18) and (5.17) hold when $\mathcal{L}$ is an elliptic partial differential operator of order $s$ (as described in Section 2.1) and the measurements are local polynomials of order up to $s-1$ (i.e. $\phi_{i,\alpha}=1_{\tau_{i}}p_{\alpha}$ with $p_{\alpha}\in\mathcal{P}^{s-1}(\tau_{i})$ ). Using these $\phi_{i,\alpha}$ as measurements is equivalent to using wavelets $\phi_{i}$ satisfying the vanishing moment condition

[TABLE]

The requirement for vanishing moments has three important consequences. First, it requires that the order of the operator be known a priori, so that a suitable number of vanishing moments can be ensured. Second, ensuring a suitable number of vanishing moments greatly increases the complexity of the implementation. Third, in order to provide vanishing moments, the measurements $\phi_{i}$, $i\in J^{(k)}$, have to be obtained from weighted averages over domains of size of order $h^{k}$. Therefore, even computing the first entry of the matrix $\Theta$ in the multiresolution basis will have complexity $\mathcal{O}(N^{2})$, since it requires taking an average over almost all of $I\times I$. One of the main analytical results of this paper is to show that these vanishing moment conditions and local averages are not necessary for higher order operators (which, in particular, enables the generalization of the gamblet transform to hierarchies of measurements defined as in Examples 5.1 and 5.2).

7.4 Comparison to Multiresolution Approximation (M-RA)

In spatial statistics, the method most closely related to ours is the M-RA of [50] where a Gaussian process is approximated by a sum, at different scales, of predictive processes described in [6]. Following the intuition of the screening effect, these processes are assumed to be block-independent with respect to a domain decomposition at the respective scale, allowing for near-linear computational complexity. Although the specific multiresolution scheme and its accuracy are a function of the specific choice of basis functions and of the knots to be conditioned upon at each scale, no systematic strategy and no theoretical error bounds are provided for best accuracy. We suspect that no scheme relying on block-sparsity assumptions can also guarantee exponential accuracy in near-linear computational complexity, though we note that the taper-M-RA introduced by [51], independently of and after the first version of the present article, does not impose conditional block-independence and could therefore be made exponentially accurate. While our present work and that of [50] are both motivated by a hierarchical exploitation of the screening effect, we identify a concrete and simple algorithm that has a guaranteed exponential accuracy for a wide range of kernel matrices.

8 Conclusions

We have shown that the dense covariance matrices obtained from a wide range of covariance functions associated to smooth Gaussian processes have almost sparse Cholesky factors. Using this property, these matrices can be inverted in near-linear computational complexity just by applying zero fill-in incomplete Cholesky factorization with an a priori ordering and sparsity pattern. Sparse Cholesky factorization of sparse matrices is by now a classical field, but we are not aware of prior work on the sparse factorization of dense matrices, other than for the purpose of preconditioning. While our algorithm is subject to the curse of high dimensionality like other hierarchy-based methods, it is able to exploit low dimensionality in the data without any user intervention. Our results are motivated by the probabilistic interpretation of Cholesky factorization and proved rigorously by using and generalizing recent results on operator-adapted wavelets. By reversing the elimination order, we also obtain a fast direct solver for elliptic PDEs whose rigorous a priori accuracy-vs.-complexity estimates advance the current state of the art for general elliptic PDEs.

Acknowledgments

FS and HO gratefully acknowledge support by the Air Force Office of Scientific Research and the DARPA EQUiPS Program (award number FA9550-16-1-0054, Computational Information Games) and the Air Force Office of Scientific Research (award number FA9550-18-1-0271, Games for Computation and Learning). TJS has been supported by the Freie Universität Berlin within the Excellence Initiative of the German Research Foundation. This collaboration has been facilitated by the Statistical and Applied Mathematical Sciences Institute through the National Science Foundation award number DMS-1127914. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the above-named institutes and agencies. We would like to thank C. Oates and P. Schröder for helpful discussions, and C. Scovel for many helpful comments and suggestions.

Appendix A Correctness and computational complexity of the maximum-minimum distance ordering

Recall the variables used in Algorithm 3: the integer array $P$ contains the maximin ordering; the real array $l[i]$ contains the distances of each point to the points that are already included in the maximin ordering; and the arrays of integer arrays $c$ and $p$ will contain the entries of the sparsity pattern in the sense that

[TABLE]

We begin by showing correctness of the algorithm.

Theorem A.1.

The ordering and sparsity pattern produced by Algorithm 3 coincide with those described in Section 1. Furthermore, whenever the while-loop in 22 is entered,

(1) for all $i\in P$, $l[i]$ is as defined in Section 1;

(2) the array $c[P[1]]$ contains all $1\leq j\leq N$ and for all other $i$ in $P$, $c[i]$ contains exactly those $1\leq j\leq N$ that satisfy $\operatorname{\mathtt{dist}}(i,j)\leq\rho l[i]$; and

(3) for all $1\leq j\leq N$, $p[j]$ consists of $P[1]$ and all those $i\in P$ that satisfy $\operatorname{\mathtt{dist}}(i,j)\leq\rho l[i]$.

Proof A.2.

*It is easy to see that if the for-loop in 27 were running over all $1\leq j\leq N$, then the algorithm would yield the correct result. We claim that the restriction of the running variable to $\{j\in c[k]\mid\operatorname{\mathtt{dist}}(j,k)\leq\operatorname{\mathtt{dist}}(i,k)+\rho l[i]\}$ does not change the result of the algorithm. The proof will proceed by induction. Let us assume that Algorithm 3 has been correct up to a given time that 27 is visited. Then, by choice of $k$ and the triangle inequality, any $j$ that is omitted by the for-loop must satisfy $\operatorname{\mathtt{dist}}(i,j)>\rho l[i]$. Since $i$ was chosen to have maximal minimal distance among the points remaining in $H$, and $\rho>1$, this means that adding $i$ to the maximin ordering cannot decrease the maximal minimal distance of $j$. Thus, skipping the $\mathtt{decrease!}$ operation does not change the choice of $P$ and $l$. Similarly, $\operatorname{\mathtt{dist}}(i,j)>\rho l[i]$ implies that the if-statement inside of the for-loop is false, meaning that skipping $j$ does not change the update of $c$ or $p$, from which the result follows.*

Having established Theorem A.1, we will now use $\prec$ , $l[i]$ , and $i_{k}$ to refer to the maximin ordering, the length-scale of the point with index $i$ , and the $k$ th index in the maximin ordering. We will now bound the complexity of Algorithm 3 in the setting of Theorem 2.1.

Theorem A.3.

*In the setting of Theorem 2.1, there exists a constant $C$ depending only on $d$, $\Omega$, and $\delta$, such that, for $\rho>C$, Algorithm 3 has computational complexity $C\rho^{d}N\log N$ in space and $CN\bigl(\rho^{d}\log^{2}N+C_{\operatorname{\mathtt{dist}}_{\partial\Omega}}\bigr)$ in time, where $C_{\operatorname{\mathtt{dist}}_{\partial\Omega}}$ is the computational complexity of invoking the function $\operatorname{\mathtt{dist}}_{\partial\Omega}$.*

Proof A.4.

*As a first step, we will upper-bound the number of iterations of the for-loop in 27 throughout the algorithm. To simplify the notation, $C$ will denote a positive constant that depends on $d$, $\Omega$ and $\delta$ that may change throughout the proof. We claim that there exists $1\leq k_{\min}\leq N$ depending only on $d$, $\Omega$, and $\delta$, such that, for all $i\succ i_{k_{\min}}$, by the time it appears in the while-loop at 22, there exists an index $k\prec i$ such that $l[k]\geq 2l[i]$ and $\operatorname{\mathtt{dist}}(i,k)\leq Cl[i]$. Indeed, since $\Omega$ has Lipschitz boundary, it satisfies an interior cone condition [2] in the sense that there exist $\theta\in(0,2\pi]$ and $r>0$ such that every point $x\in\Omega$ is the tip of a spherical cone within $\Omega$ with opening angle $\theta$ and radius $r$. This spherical cone contains a ball with radius $r_{\gamma}$, which depends only on $\theta$ and $r$. Let $\gamma_{i}$ be such a cone with tip $x_{i}$. By a scaling argument, the spherical cone $\gamma_{i}\cap B_{\tilde{r}}(x_{i})$ then contains a ball of radius $r_{\gamma}(\tilde{r}/r)$, for all $\tilde{r}<r$. For any $i\in I$ and any ball $B\subset\Omega$ with radius at least $4l[i]/\delta$, there exists a $k\prec i$ such that $l[k]\geq 2l[i]$ and $x_{k}\in B$. Thus, for $l[i]\leq\delta r_{\gamma}/4$, there exists a $k\prec i$ with $x_{k}\in\gamma_{i}\cap B_{2l[i]}(x_{i})$. By a sphere-packing argument, we can find a $k_{\min}$ such that, for all $i\succ i_{k_{\min}}$, $l[i]\leq\delta r_{\gamma}/4$, which yields the claim. Because of the above, for $\rho>C$, there exists a point satisfying the constraint in 25 with $\operatorname{\mathtt{dist}}(k,i)\leq 2Cl[i]$. Thus, the number of times the for-loop in 27 is visited for a given index $i$ is bounded above by $C_{i}\coloneqq\#\{j\in I\mid\operatorname{\mathtt{dist}}(i,j)\leq 2(C+\rho)l[i]\}$.
By a sphere-packing argument, $C_{i_{m}}\leq C(N/m)\rho^{d}$, for a constant $C$ depending only on $d$, $\Omega$, and $\delta$. Summing the above over $1\leq m\leq N$ yields the upper bound $C\rho^{d}N\log N$. The most costly step in the for-loop in 27 is the $\mathtt{decrease!}$ operation requiring the restoration of the heap property, which has computational complexity $\mathcal{O}(\log N)$. Thus, the overall computational complexity is at most $CN\bigl(\rho^{d}\log^{2}N+C_{\operatorname{\mathtt{dist}}_{\partial\Omega}}\bigr)$. The bound on the space complexity follows, since each iteration of the for-loop consumes $\mathcal{O}(1)$ memory.*
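
The summation step in the proof above is the standard harmonic-sum estimate. Written out, with $C$ allowed to change from one expression to the next as in the proof and assuming $N\geq 3$ so that $1+\log N\leq 2\log N$:

```latex
\sum_{m=1}^{N} C_{i_{m}}
  \;\leq\; C\rho^{d} N \sum_{m=1}^{N} \frac{1}{m}
  \;\leq\; C\rho^{d} N \,(1+\log N)
  \;\leq\; 2C\rho^{d} N \log N .
```

Multiplying by the $\mathcal{O}(\log N)$ cost of each heap update gives the $\rho^{d}N\log^{2}N$ term in the time complexity.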

Proof A.5 (Proof of Theorem 4.1).

*Theorem 4.1 follows from Theorems A.1 and A.3.*
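
To make the roles of $P$ and $l[i]$ concrete, the following is a naive sketch of the maximin ordering. It is not Algorithm 3: it has quadratic cost, uses no heap and no sparsity pattern, and assumes $\Omega=\mathbb{R}^{d}$ so that no distance to the boundary is needed. The function name, the choice of index `0` as first point, and the convention of assigning it an infinite length-scale are assumptions made for illustration.

```python
import numpy as np


def naive_maximin_ordering(points, first=0):
    """Return (order, lengths): order[k] is the k-th selected index and
    lengths[k] is its distance to the points selected before it."""
    n = len(points)
    # min_dist[j] = distance from point j to the set selected so far.
    min_dist = np.full(n, np.inf)
    order, lengths = [], []
    current = first
    for _ in range(n):
        order.append(current)
        lengths.append(min_dist[current])
        dist_to_current = np.linalg.norm(points - points[current], axis=1)
        min_dist = np.minimum(min_dist, dist_to_current)
        min_dist[order] = -np.inf          # never reselect a chosen point
        current = int(np.argmax(min_dist))  # farthest remaining point
    return np.array(order), np.array(lengths)


# Five hypothetical points on a line.
points = np.array([[0.0], [1.0], [0.5], [0.25], [0.75]])
order, lengths = naive_maximin_ordering(points)

# After the first point, the length-scales never increase.
assert np.all(np.diff(lengths[1:]) <= 0)
```

Only pairwise distances enter the computation, which is the property the following paragraph relies on.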

Algorithm 3 uses only pairwise distances between points, and thus automatically adapts to low-dimensional structure in the $\{x_{i}\}_{i\in I}$ . Indeed, for $\Omega=\mathbb{R}^{d}$ , the computational complexity of Algorithm 3 depends only on the intrinsic dimension of the dataset.

Condition (Intrinsic dimension).

There exist constants $C_{\tilde{d}},\tilde{d}>0$ , independent of $N$ , such that, for all $r,R>0$ and $x\in\mathbb{R}^{d}$ ,

[TABLE]

We say that the point set $\{x_{i}\}_{i\in I}$ has intrinsic dimension $\tilde{d}$ .

Condition (Polynomial Scaling).

There exists a polynomial $\boldsymbol{p}$ for which

[TABLE]

Theorem A.6.

*Let $\Omega=\mathbb{R}^{d}$ and $\rho\geq 2$. Then the computational complexity of Algorithm 3 is at most $C\rho^{\tilde{d}}N\log N$ in space and $CN\bigl(\log(N)\rho^{\tilde{d}}(\log N+C_{\operatorname{\mathtt{dist}}})+C_{\operatorname{\mathtt{dist}}_{\partial\Omega}}\bigr)$ in time, for a constant $C=C\bigl(C_{\tilde{d}},\tilde{d},\boldsymbol{p}\bigr)$ depending only on the constants in A.5 and A.5.*

Proof A.7.

*The proof is analogous to that of Theorem A.3. The main difference is that the claim on the existence of $k_{\min}$ is replaced by the fact — which follows directly from the definition of the maximin ordering — that, for all $i$ , there exists a $k\prec i$ such that $l[k]\geq 2l[i]$ and $\operatorname{\mathtt{dist}}(k,i)\leq 2l[i]$ . In particular, any $\rho\geq 2$ leads to near-linear computational complexity. *

Appendix B Proofs of Section 5

Proof B.1 (Proof of Lemma 5.3).

The main idea is to recursively apply Lemma 5.5. First, applying Lemma 5.5 with first block $\Theta_{1:q-1,1:q-1}$ and second block $\Theta_{q,q}$ yields

[TABLE]

We now repeat this operation recursively. After the $k$ th step, the central matrix has an upper-left block consisting of $\Theta^{(q-k)}$ . We then apply Lemma 5.5 to this upper-left block, with the splitting given by $\Theta_{1:q-k-1,1:q-k-1}$ and $\Theta_{q-k,q-k}$ . This reduces the central matrix more and more towards the block-diagonal matrix $D$ , while splitting off a triangular factor to either side. Doing this up to the $(q-1)$ th step yields the following identity:

[TABLE]

We now combine the lower-triangular factors, obtaining

[TABLE]

Here, we have used the formulae for the inverses and products of elementary lower-triangular matrices [82, pp.150–151],

[TABLE]

*where $\boldsymbol{e}_{k}$ is the $k$ th standard Euclidean basis row vector, with $k<l$ . *

Proof B.2 (Proof of Lemma 5.12).

We prove the result in the setting of Example 5.1, since the proof for Example 5.2 is similar. The inequality $\|\phi\|_{\ast}^{2}\geq\frac{1}{\|\mathcal{L}\|}\|\phi\|_{H^{-s}(\Omega)}^{2}$ and (5.24) imply that

[TABLE]

*The identity $\|\phi_{i}\|_{H^{-s}\left(B_{(\delta/2)h^{k}}^{2}(x_{i})\right)}^{2}=h^{2sk}(\delta/2)^{2s-d}\|\boldsymbol{\delta}(\cdot-0)\|_{H^{-s}\left(B_{1}(0)\right)}^{2}$ concludes the proof with $C_{\Phi}\coloneqq\|\mathcal{L}\|(\delta/2)^{d-2s}\left\|\boldsymbol{\delta}(\cdot-0)\right\|_{H^{-s}\left(B_{1}(0)\right)}^{-1}$ and $H\coloneqq h^{s}$.*

Proof B.3 (Proof of Lemma 5.14 in the case of Example 5.2).

Let $\zeta$ be a set of points such that $\left\{B_{\rho h^{k}}(z)\right\}_{z\in\zeta}$ covers $\Omega$ , and such that $\sup_{x\in\Omega}\#\left\{z\in\zeta:x\in B_{2\rho h^{k}}(z)\right\}\leq C(d)$ . For $i\in J^{(l)}$ and $z\in\zeta$ , we write $i\leadsto z$ if $z$ is the element of $\zeta$ closest to $i$ (using an arbitrary way to break ties). For $1\leq k<l\leq q$ , $\phi\coloneqq\sum_{i\in J^{(l)}}\alpha_{i}\phi_{i}$ , $\varphi\coloneqq\sum_{i\in J^{(l)}}\alpha_{i}\varphi_{i}$ and $\varphi_{i}\coloneqq\sum_{j\in I^{(k)}}w_{ij}\phi_{j}^{(k)}$ we have

[TABLE]

The Bramble–Hilbert lemma [22] and the vanishing moment property (5.28) of $\phi_{i}-\varphi_{i}$ yield that

[TABLE]

Summing over all $z\in\zeta$ and choosing the constant $C$ appropriately yields

[TABLE]

Since the $\phi_{i}$ are $L^{2}$ -orthogonal to each other and $\|\phi_{i}\|_{L^{2}}^{2}\leq C$ ,

[TABLE]

Inserting the definition of the $\varphi_{i}$ yields

[TABLE]

We will now use the fact that on $\mathbb{R}^{n}$, we have the norm inequalities $n^{-1/2}|\cdot|_{1}\leq|\cdot|_{2}\leq|\cdot|_{1}$. By a sphere-packing argument, for any $z\in\zeta$, we have $\#\left\{i\in J^{(l)}\,\middle|\,i\leadsto z\right\}\leq C(d)(\rho/\delta)^{d}h^{d(k-l)}$. Thus, the number of summands in the innermost sum is at most $C(d)(\rho/\delta)^{d}h^{(k-l)d}$ and, using the above norm inequalities, we obtain

[TABLE]

*Putting the above together yields the result. *

Proof B.4 (Proof of Lemma 5.28).

Define

[TABLE]

and observe that $\|R\|=r$ . Since $A=\frac{\|A\|+\|A^{-1}\|^{-1}}{2}\left(\textup{Id}-R\right)$ , it follows from a Neumann series argument that $A^{-1}=\frac{2}{\|A\|+\|A^{-1}\|^{-1}}\sum_{k=0}^{\infty}R^{k}$ . The positive definiteness of $A$ implies that

[TABLE]

Let $C_{R}\coloneqq\max\left\{1,\frac{2C}{\|A\|+\|A^{-1}\|^{-1}}\right\}$ . Lemma 5.26 implies that

[TABLE]

Combining the above estimates yields

[TABLE]

By choosing

[TABLE]

and $n+1\coloneqq\lceil\nu\rceil$ , we obtain

[TABLE]

This yields the upper bound

[TABLE]

Optimising the term on line (B.39) over $\left(1+\log\left(c_{d}\left(\gamma/2\right)\right)+\log(C_{R})\right)$ yields

[TABLE]

Proof B.5 (Proof of Lemma 5.30).

In this proof we will use the notation $k{:}l$ to denote the individual indices from $k$ to $l$ , as opposed to matrix blocks. We will establish the result by showing that, for all $1\leq k\leq N$ , the $k$ th column of $L$ (when considered as an element of $\mathbb{R}^{I\times I}$ by zero padding) satisfies the exponential decay stated in the lemma. Let $S^{(k)}\coloneqq B_{k:N,k:N}-B_{k:N,1:k-1}(B_{1:k-1,1:k-1})^{-1}B_{1:k-1,k:N}$ . Then $L_{k:N,k}=S^{(k)}_{:,1}/\sqrt{S^{(k)}_{k,k}}$ . Lemma 5.5 implies that $S^{(k)}=(B_{k:N,k:N})^{-1}$ , and hence Lemma 5.28 yields that

[TABLE]

*Here we used the facts that the spectrum of $B_{k:n,k:n}$ is contained in $[\lambda_{\min}(B),\lambda_{\max}(B)]$ and that the right-hand side of the above estimate is increasing in $r$ and $C_{R}$ . The estimate $S^{(k)}_{k,k}\geq\frac{1}{\|S^{(k),-1}\|}\geq\frac{1}{\|B\|}$ completes the proof. *

Proof B.6 (Proof of Lemma 5.32).

For any matrix $T$ for which the Neumann series $\sum_{k=0}^{\infty}T^{k}$ converges in the operator norm, we have $\left(\textup{Id}-T\right)^{-1}=\sum_{k=0}^{\infty}T^{k}$ . Therefore, $L^{-1}=\sum_{k=0}^{\infty}(\textup{Id}-L)^{k}$ if the right-hand side series is convergent. Since $\textup{Id}-L$ has the block-lower-triangular structure

[TABLE]

it follows that $\textup{Id}-L$ is $q$-nilpotent, i.e. $(\textup{Id}-L)^{q}=0$, and the Neumann series terminates after the first $q$ summands. Using this we will now show that the exponential decay of $L$ is preserved under inversion. To this end, consider the $(k,l)$th block of $(\textup{Id}-L)^{p}$ and observe that

[TABLE]

where the inequality follows from Lemma 5.26. Summing (B.43) over $p$ , we obtain, for $i\neq j$ ,

[TABLE]

*which concludes the proof of the lemma. *
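
The termination of the Neumann series used in this proof is easy to check numerically. The sketch below assumes that $L$ has identity diagonal blocks and arbitrary entries strictly below the block diagonal, which is the structure that makes $\textup{Id}-L$ nilpotent of order $q$; the block sizes and variable names are hypothetical, and the example does not address the decay estimate itself.

```python
import numpy as np

rng = np.random.default_rng(2)
block_sizes = [2, 3, 4]                      # q = 3 hypothetical levels
q, n = len(block_sizes), sum(block_sizes)
offsets = np.cumsum([0] + block_sizes)

# Identity diagonal blocks, random entries strictly below the block diagonal.
L = np.eye(n)
for k in range(q):
    for l in range(k):
        rows = slice(offsets[k], offsets[k + 1])
        cols = slice(offsets[l], offsets[l + 1])
        L[rows, cols] = rng.standard_normal((block_sizes[k], block_sizes[l]))

T = np.eye(n) - L                            # strictly block-lower-triangular

# T is q-nilpotent, so the Neumann series has only q nonzero terms.
assert np.allclose(np.linalg.matrix_power(T, q), 0.0)

neumann_sum = sum(np.linalg.matrix_power(T, p) for p in range(q))
assert np.allclose(neumann_sum, np.linalg.inv(L))
```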

With the above results on the propagation of exponential decay we can now conclude the proof of Theorem 5.25.

Proof B.7 (Proof of Theorem 5.25).

Applying Lemma 5.28, Section 5.3.2, and the condition number bound in Section 5.3.3 yields the following estimate for $B^{(k),-1}$ :

[TABLE]

with $C_{R}=\max\left\{1,\frac{2C_{\gamma}\bigl\|B^{(k),-1}\bigr\|}{1+\kappa}\right\}$ and $r=\frac{1-\kappa^{-1}}{1+\kappa^{-1}}$. Lemma 5.5 and Section 5.3.3 yield

[TABLE]

Using these estimates, we obtain

[TABLE]

where $\tilde{C}_{R}=\max\left\{1,\frac{2C_{\gamma}C_{\Phi}}{1+\kappa}\right\}$, $r=\frac{1-\kappa^{-1}}{1+\kappa^{-1}}$ and $\tilde{\gamma}\coloneqq\frac{-\log(r)}{\left(1+\log\left(c_{d}\left(\gamma/2\right)\right)+\log\left(\tilde{C}_{R}\right)-\log(r)\right)}\frac{\gamma}{2}$. Applying Lemma 5.26 to the products $B^{(i),-1}A^{(i)}_{ij}$ appearing in the definition of $\bar{L}^{-1}$ in Lemma 5.3, we obtain

[TABLE]

*Lemma 5.32 now yields the following decay bound for $\bar{L}$:*

[TABLE]

For a positive-definite matrix $M$ , let $\operatorname{chol}\left(M\right)$ denote its lower-triangular Cholesky factor and set $L^{(k)}\coloneqq\operatorname{chol}\left(B^{(k),-1}\right)$ . Following the same procedure as in the bound of the decay of $B^{(k)}$ yields the decay bound

[TABLE]

Applying Lemma 5.26 to the product $\bar{L}\operatorname{chol}(D)=\operatorname{chol}(\Theta)$ yields the decay bound

[TABLE]

Appendix C Proof of accuracy of incomplete Cholesky factorization in the supernodal multicolor ordering

We will now bound the approximation error of the Cholesky factors obtained from Algorithm 2, using the supernodal multicolor ordering and sparsity pattern described in 5.38. For $\tilde{i},\tilde{j}\in\tilde{I}$ , let $\Theta_{\tilde{i},\tilde{j}}$ be the submatrix $(\Theta_{ij})_{i\in\tilde{i},j\in\tilde{j}}$ and let $\sqrt{M}$ be the (dense and lower-triangular) Cholesky factor of a matrix $M$ .

First observe that Algorithm 2 with supernodal multicolor ordering $\prec_{\rho}$ and sparsity pattern $S_{\rho}$ is equivalent to the block-incomplete Cholesky factorization described in Algorithm 4 where the function $\texttt{Restrict!}(\Theta,S_{\rho})$ sets all entries of $\Theta$ outside of $S_{\rho}$ to zero.
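
For orientation, a zero fill-in incomplete Cholesky factorization restricted to a sparsity pattern can be sketched as follows. This is an illustration under stated assumptions, not the paper's Algorithm 2 or Algorithm 4: it stores everything densely, uses plain loops of cubic cost, assumes the elimination ordering has already been applied to the rows and columns, and takes the pattern as a symmetric boolean array `mask` that contains the diagonal. The function name `incomplete_cholesky` is hypothetical; the first line of its body plays the role of $\texttt{Restrict!}$.

```python
import numpy as np


def incomplete_cholesky(Theta, mask):
    """Zero fill-in incomplete Cholesky: entries outside `mask` are set to
    zero before the factorization and never created during it."""
    n = Theta.shape[0]
    A = np.where(mask, Theta, 0.0)           # restrict Theta to the pattern
    L = np.zeros_like(A)
    for j in range(n):
        L[j, j] = np.sqrt(A[j, j])           # gives nan if a pivot is negative
        for i in range(j + 1, n):
            if mask[i, j]:
                L[i, j] = A[i, j] / L[j, j]
        # Schur complement update of the lower triangle, inside the pattern only.
        for i in range(j + 1, n):
            for k in range(j + 1, i + 1):
                if mask[i, k]:
                    A[i, k] -= L[i, j] * L[k, j]
    return L


n = 6
# With the full pattern the result is the exact Cholesky factor.
rng = np.random.default_rng(3)
X = rng.standard_normal((n, n))
Theta = X @ X.T + n * np.eye(n)
full = np.ones((n, n), dtype=bool)
L_full = incomplete_cholesky(Theta, full)
assert np.allclose(L_full, np.linalg.cholesky(Theta))

# A tridiagonal matrix has no fill-in, so a tridiagonal pattern is also exact.
T = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
idx = np.arange(n)
band = np.abs(idx[:, None] - idx[None, :]) <= 1
L_band = incomplete_cholesky(T, band)
assert np.allclose(L_band @ L_band.T, T)
```

For a dense $\Theta$ and a genuinely sparse pattern the factor is only approximate, and bounding that error is the purpose of the stability analysis in this appendix.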

We will now reformulate the above algorithm using the fact that the elimination of nodes of the same color, on the same level of the hierarchy, happens consecutively. Let $p$ be the maximal number of colors used on any level of the hierarchy. We can then write $I=\bigcup_{1\leq k\leq q,1\leq l\leq p}J^{(k,l)}$ , where $J^{(k,l)}$ is the set of indices on level $k$ colored in the color $l$ . Let $\Theta_{(k,l),(m,n)}$ be the restriction of $\Theta$ to $J^{(k,l)}\times J^{(m,n)}$ and write $(m,n)\prec(k,l)\iff m<k\text{ or }(m=k\text{ and }n<l)$ . We can then rewrite Algorithm 4 as

For $1\leq k\leq q,1\leq l\leq p$ and a matrix $M\in\mathbb{R}^{I\times I}$ with $M_{(:,:),(m,n)},M_{(m,n),(:,:)}=0$ for all $(m,n)\prec(k,l)$ , let $\mathbb{S}\left[M\right]$ be the matrix obtained by applying $\texttt{Restrict!}(M,S_{\rho})$ followed by the Schur complementation $M\leftarrow M-M_{(:,:),(k,l)}\bigl{(}M_{(k,l),(k,l)}\bigr{)}^{-1}M_{(k,l),(:,:)}$ . We now prove a stability estimate for the operator $\mathbb{S}$ . Let $M_{k,(m,n)}$ be the restriction of a matrix $M\in\mathbb{R}^{I\times I}$ to $J^{(k)}\times J^{(m,n)}$ .
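A single application of $\mathbb{S}$ can be sketched as follows. This is a dense illustration under stated assumptions: `schur_step` is a hypothetical name, the sparsity pattern is a boolean mask, and `block` plays the role of the index set $J^{(k,l)}$ being eliminated.

```python
import numpy as np

def schur_step(m, pattern, block):
    """Restrict m to `pattern`, then eliminate the index set `block`.

    Implements M <- M - M[:, J] inv(M[J, J]) M[J, :] with J = block,
    after zeroing every entry of M outside the pattern.
    """
    m = np.where(pattern, m, 0.0)
    col = m[:, block]                 # M_{(:,:),(k,l)}
    row = m[block, :]                 # M_{(k,l),(:,:)}
    pivot = m[np.ix_(block, block)]   # M_{(k,l),(k,l)}
    return m - col @ np.linalg.solve(pivot, row)

# Illustrative input: exponential kernel, full pattern, eliminate two indices.
idx = np.arange(6)
theta = np.exp(-np.abs(idx[:, None] - idx[None, :]))
full = np.ones((6, 6), dtype=bool)
block = [0, 1]
out = schur_step(theta, full, block)
```

After the step, the rows and columns indexed by `block` vanish, which matches the requirement that $M_{(:,:),(m,n)}$ and $M_{(m,n),(:,:)}$ are zero for all previously eliminated $(m,n)$ before the next application.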

Lemma C.1.

For $1\leq k^{\circ}\leq q$ and $1\leq l^{\circ}\leq p$ let $\Theta,E\in\mathbb{R}^{I\times I}$ be such that

[TABLE]

and (writing $\Theta_{k,l}$ for the $J^{(k)}\times J^{(l)}$ submatrix of $\Theta$ and $\lambda_{\max}$ for maximal singular values) define

[TABLE]

If

[TABLE]

then the following perturbation estimate holds:

[TABLE]

Proof C.2.

Write $\tilde{\Theta}$ , $\tilde{E}$ for the versions of $\Theta$ , $E$ set to zero outside of $S_{\rho}$ . For $k^{\circ}\leq k,l\leq q$ ,

[TABLE]

where the second equality follows from the matrix identity

[TABLE]

Now recall that, for all $A\in\mathbb{R}^{n\times m},B\in\mathbb{R}^{m\times s}$ , $\|M\|\leq\|M\|_{\operatorname{Fro}}$ and $\|AB\|_{\operatorname{Fro}}\leq\|A\|\|B\|_{\operatorname{Fro}}$ . Therefore, $\|(A+E)^{-1}\|\leq 2/\lambda_{\min}$ and $\|A+E\|\leq 2\lambda_{\max}$ . Combining these estimates and using the triangle inequality yields

[TABLE]
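The two operator-norm bounds invoked in the proof follow from a standard perturbation argument. The hypotheses of the lemma sit in displays that are not reproduced here, so the following is a sketch under the assumption that $A$ is symmetric positive definite with spectrum in $[\lambda_{\min},\lambda_{\max}]$ and that $\|E\|\leq\lambda_{\min}/2$; $\sigma_{\min}$ denotes the smallest singular value.

```latex
\sigma_{\min}(A+E)\;\geq\;\sigma_{\min}(A)-\|E\|\;\geq\;\lambda_{\min}-\frac{\lambda_{\min}}{2}
=\frac{\lambda_{\min}}{2}
\quad\Longrightarrow\quad
\bigl\|(A+E)^{-1}\bigr\|=\frac{1}{\sigma_{\min}(A+E)}\leq\frac{2}{\lambda_{\min}},

\|A+E\|\;\leq\;\|A\|+\|E\|\;\leq\;\lambda_{\max}+\frac{\lambda_{\min}}{2}\;\leq\;2\lambda_{\max}.
```

The Frobenius-norm inequalities recalled in the proof are then used to convert these operator-norm bounds into a bound on the Frobenius norm of the perturbed Schur complement.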

Recursive application of the above lemma gives a stability result for the incomplete Cholesky factorization.

Lemma C.3.

For $\rho>0$ , let $\prec_{\rho}$ and $S_{\rho}$ be a supernodal ordering and sparsity pattern such that the maximal number of colors used on each level is at most $p$ . Let $L^{S_{\rho}}$ be an invertible lower-triangular matrix with nonzero pattern $S_{\rho}$ and define $M\coloneqq L^{S_{\rho}}L^{S_{\rho},\top}$ . Assume that $M$ satisfies Section 5.3.3 with constant $\kappa$ . Then there exists a universal constant $C$ such that, for all $0<\epsilon<\frac{\lambda_{\min}(M)}{2q^{2}(C\kappa)^{2qp}}$ and all $E\in\mathbb{R}^{I\times I}$ with $\|E\|_{\operatorname{Fro}}\leq\epsilon$ ,

[TABLE]

where $\tilde{L}^{(S_{\rho})}$ is the Cholesky factor obtained by applying Algorithm 5 to $M+E$.

Proof C.4.

The result follows from applying Lemma C.1 at each step of Algorithm 5.

We can prove Theorem 5.41 by using the stability result obtained above.

Proof C.5 (Proof of Theorem 5.41).

Theorem 5.34 implies that, by choosing $\rho\geq\tilde{C}\log(N/\epsilon)$, there exists a lower-triangular matrix $\tilde{L}^{S_{\rho}}$ with sparsity pattern $S_{\rho}$ such that $\bigl{\|}\Theta-\tilde{L}^{S_{\rho}}\tilde{L}^{S_{\rho},\top}\bigr{\|}_{\operatorname{Fro}}\leq\epsilon$. Theorem 5.23 implies that Examples 5.1 and 5.2 satisfy $\lambda_{\min}\geq 1/\operatorname{poly}(N)$. Therefore, choosing $\rho\geq\tilde{C}\log N$ ensures that $\epsilon<\frac{\lambda_{\min}(\Theta)}{2}$ and thus that $\tilde{\Theta}\coloneqq\tilde{L}^{S_{\rho}}\tilde{L}^{S_{\rho},\top}$ satisfies Section 5.3.3 with constant $2C_{\Phi}$, where $C_{\Phi}$ is the corresponding constant for $\Theta$. By possibly changing the constant $\tilde{C}$ again, $\rho\geq\tilde{C}\log N$ also ensures that

[TABLE]

where $C$ is the constant of Lemma C.3, since $q\approx\log N$ and, by Lemma 5.39, $p$ is bounded independently of $N$ . Thus, by Lemma C.3, the Cholesky factor $L^{S_{\rho}}$ obtained from applying Algorithm 5 to $\Theta=\tilde{\Theta}+\bigl{(}\Theta-\tilde{\Theta}\bigr{)}$ satisfies

[TABLE]

where $\kappa$ is the constant with which $\Theta$ satisfies Section 5.3.3 and the polynomial depends only on $C$, $\kappa$, and $p$. Since, for the ordering $\prec_{\rho}$ and sparsity pattern $S_{\rho}$, the Cholesky factors obtained via Algorithms 2 and 5 coincide, we obtain the result.
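The reason a logarithmic $\rho$ suffices is that the admissible perturbation size in Lemma C.3 is only polynomially small in $N$. A short worked bound, assuming $q\leq c\log N$ for some constant $c$ (a concrete form of $q\approx\log N$ introduced here for illustration) and $C\kappa\geq 1$:

```latex
(C\kappa)^{2qp}\;\leq\;(C\kappa)^{2pc\log N}
=\exp\bigl(2pc\,\log(N)\,\log(C\kappa)\bigr)
=N^{2pc\log(C\kappa)},

\frac{\lambda_{\min}(M)}{2q^{2}(C\kappa)^{2qp}}
\;\geq\;
\frac{\lambda_{\min}(M)}{2c^{2}\log^{2}(N)\,N^{2pc\log(C\kappa)}}
\;\geq\;\frac{1}{\operatorname{poly}(N)}
\quad\text{if }\lambda_{\min}(M)\geq\frac{1}{\operatorname{poly}(N)} .
```

Since the approximation error of the sparse factor decays exponentially in $\rho$ (Theorem 5.34 requires only $\rho\geq\tilde{C}\log(N/\epsilon)$), reaching an accuracy of order $1/\operatorname{poly}(N)$ costs no more than a constant multiple of $\log N$ in $\rho$.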

Bibliography

[1] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, vol. 55 of National Bureau of Standards Applied Mathematics Series, U.S. Government Printing Office, Washington, D.C., 1964.
[2] R. A. Adams and J. J. F. Fournier, Sobolev Spaces, vol. 140 of Pure and Applied Mathematics (Amsterdam), Elsevier/Academic Press, Amsterdam, second ed., 2003.
[3] S. Ambikasaran and E. Darve, An $\mathcal{O}(N\log N)$ fast direct solver for partial hierarchically semi-separable matrices, J. Sci. Comput., 57 (2013), pp. 477–501, https://doi.org/10.1007/s10915-013-9714-z.
[4] S. Ambikasaran, D. Foreman-Mackey, L. Greengard, D. W. Hogg, and M. O’Neil, Fast direct methods for Gaussian processes, IEEE Trans. Pattern Anal. Mach. Intell., 38 (2016), pp. 252–265, https://doi.org/10.1109/TPAMI.2015.2448083.
[5] F. R. Bach and M. I. Jordan, Kernel independent component analysis, J. Mach. Learn. Res., 3 (2003), pp. 1–48, https://doi.org/10.1162/153244303768966085.
[6] S. Banerjee, A. E. Gelfand, A. O. Finley, and H. Sang, Gaussian predictive process models for large spatial data sets, J. R. Stat. Soc. Ser. B Stat. Methodol., 70 (2008), pp. 825–848, https://doi.org/10.1111/j.1467-9868.2008.00663.x.
[7] M. Bebendorf, Hierarchical Matrices, vol. 63 of Lecture Notes in Computational Science and Engineering, Springer-Verlag, Berlin, 2008, https://doi.org/10.1007/978-3-540-77147-0.
[8] M. Bebendorf and W. Hackbusch, Existence of $\mathcal{H}$-matrix approximants to the inverse FE-matrix of elliptic operators with $L^{\infty}$-coefficients, Numer. Math., 95 (2003), pp. 1–28, https://doi.org/10.1007/s00211-002-0445-6.
