A Sketched Finite Element Method for Elliptic Models

Robert Lung; Yue Wu; Dimitris Kamilis; Nick Polydorides

arXiv:1907.09852·math.NA·April 22, 2020


TL;DR

This paper introduces a sketched finite element method for elliptic PDEs that uses random sampling based on leverage scores to significantly speed up computations while maintaining accuracy.

Contribution

It proposes a novel algorithm combining low-dimensional projection and randomized sketching with leverage score sampling for efficient high-dimensional elliptic PDE solutions.

Findings

01 Achieves nearly optimal performance with leverage score sampling

02 Provides theoretical bounds on error and complexity

03 Demonstrates two orders of magnitude speedup in simulations

Abstract

We consider a sketched implementation of the finite element method for elliptic partial differential equations on high-dimensional models. Motivated by applications in real-time simulation and prediction we propose an algorithm that involves projecting the finite element solution onto a low-dimensional subspace and sketching the reduced equations using randomised sampling. We show that a sampling distribution based on the leverage scores of a tall matrix associated with the discrete Laplacian operator, can achieve nearly optimal performance and a significant speedup. We derive an expression of the complexity of the algorithm in terms of the number of samples that are necessary to meet an error tolerance specification with high probability, and an upper bound for the distance between the sketched and the high-dimensional solutions. Our analysis shows that the projection not only reduces…

Tables3

Table 1. Numerical results for the tests performed with $p\sim\mathcal{U}([10^{-1},10^{2}])$. The quantities reported are averages over 100 runs with different $p$ realisations. The results show the impact of $c$ and $\rho$ on the various error components and the computing times. Note that for a sufficiently large $c$ the total error is only marginally larger than the projection error, which manifests the regularising effect of the projection on the sketching-induced error.

$\rho$	$c$ [$10^{6}$]	time [s]	$c'/3k$	$\frac{\|\Pi u_{\mathrm{opt}}-u_{\mathrm{opt}}\|}{\|u_{\mathrm{opt}}\|}$	$\|\hat{G}^{-1}G-I\|$	$\frac{\|\hat{u}_{\mathrm{reg}}-u_{\mathrm{reg}}\|}{\|u_{\mathrm{reg}}\|}$	$\frac{\|\hat{u}_{\mathrm{reg}}-u_{\mathrm{opt}}\|}{\|u_{\mathrm{opt}}\|}$
50	0.5	0.43	0.04	0.07	1.60	0.07	0.09
50	1	0.78	0.06	0.07	1.07	0.05	0.08
100	0.5	0.49	0.04	0.03	3.99	0.11	0.11
100	1	0.80	0.06	0.03	2.30	0.06	0.07
100	5	3.22	0.11	0.03	0.77	0.02	0.04

Table 2. Numerical results for the tests with a lognormal random field drawn from a Whittle-Matérn model with a smooth covariance. The algorithm yields solutions with less than 10% error with as few as 50 basis functions. Similar to the uniformly random case in Table 1, the total errors are sustained close to the projection errors when $\|\hat{G}^{-1}G-I\|<1$.

$\rho$	$c$ [$10^{6}$]	time [s]	$c'/3k$	$\frac{\|\Pi u_{\mathrm{opt}}-u_{\mathrm{opt}}\|}{\|u_{\mathrm{opt}}\|}$	$\|\hat{G}^{-1}G-I\|$	$\frac{\|\hat{u}_{\mathrm{reg}}-u_{\mathrm{reg}}\|}{\|u_{\mathrm{reg}}\|}$	$\frac{\|\hat{u}_{\mathrm{reg}}-u_{\mathrm{opt}}\|}{\|u_{\mathrm{opt}}\|}$
25	0.5	0.52	0.04	0.15	0.73	0.05	0.17
50	1	0.52	0.06	0.07	0.95	0.04	0.08
50	5	3.51	0.12	0.07	0.35	0.02	0.07
100	1	0.85	0.06	0.03	1.97	0.05	0.06
100	5	3.51	0.12	0.03	0.65	0.04	0.04

Table 3. Numerical results for the non-smooth parameter field. In this case the algorithm requires a far more extensive basis, and thus considerably more samples and computing time to yield solutions within the required 10% error margin.

$\rho$	$c$ [$10^{6}$]	time [s]	$c'/3k$	$\frac{\|\Pi u_{\mathrm{opt}}-u_{\mathrm{opt}}\|}{\|u_{\mathrm{opt}}\|}$	$\|\hat{G}^{-1}G-I\|$	$\frac{\|\hat{u}_{\mathrm{reg}}-u_{\mathrm{reg}}\|}{\|u_{\mathrm{reg}}\|}$	$\frac{\|\hat{u}_{\mathrm{reg}}-u_{\mathrm{opt}}\|}{\|u_{\mathrm{opt}}\|}$
1000	1	2.67	0.06	0.07	4.61	0.01	0.26
1000	5	5.96	0.12	0.05	1.25	0.01	0.26
2000	1	4.87	0.06	0.02	77.36	0.02	0.08
2000	5	9.95	0.12	0.03	9.64	0.01	0.08

Equations201

-\nabla\cdot p\nabla u=f\quad\text{in }\Omega,

u=g^{(D)}\quad\text{on }\partial\Omega,

0<p_{\min}\leq p\leq p_{\max}<\infty\quad\text{on }\Omega\cup\partial\Omega,

\int_{\Omega}\mathrm{d}x\,\nabla u\cdot p\nabla v=\int_{\Omega}\mathrm{d}x\,fv,

\mathcal{H}^{1}(\Omega)\doteq\Bigl\{u\in L^{2}(\Omega)\Bigm|\frac{\partial u}{\partial x_{q}}\in L^{2}(\Omega),\quad q=1,\ldots,d\Bigr\},

\mathcal{H}^{1}_{U}\doteq\Bigl\{u\in\mathcal{H}^{1}(\Omega)\Bigm|u=g^{(D)}\;\mathrm{on}\;\partial\Omega\Bigr\},\quad\mathcal{H}^{1}_{0}\doteq\Bigl\{v\in\mathcal{H}^{1}(\Omega)\Bigm|v=0\;\mathrm{on}\;\partial\Omega\Bigr\}.

\int_{\Omega}\mathrm{d}x\,\nabla u\cdot p\nabla v=\int_{\Omega}\mathrm{d}x\,fv,\quad\forall v\in\mathcal{H}^{1}_{0}.

\mathcal{S}^{1}_{\Omega}\doteq\mathrm{span}\{\phi_{1},\ldots,\phi_{n},\ldots,\phi_{n+n_{\partial}}\}

u=\sum_{i=1}^{n}u_{i}\phi_{i}+\sum_{i=n+1}^{n+n_{\partial}}u_{i}\phi_{i}.

\sum_{\Omega_{\ell}\in\mathcal{T}_{\Omega}}\int_{\Omega_{\ell}}\mathrm{d}x\,\nabla u\cdot p\nabla v=\sum_{\Omega_{\ell}\in\mathcal{T}_{\Omega}}\int_{\Omega_{\ell}}\mathrm{d}x\,fv,\quad\forall v\in\mathcal{S}^{1}_{\Omega}.

p_{\ell}=\frac{1}{|\Omega_{\ell}|}\int_{\Omega_{\ell}}\mathrm{d}x\,p,\quad\text{and}\quad f_{\ell}=\frac{1}{|\Omega_{\ell}|}\int_{\Omega_{\ell}}\mathrm{d}x\,f,\quad\ell=1,\ldots,k

\sum_{j=1}^{n}\Bigl(\sum_{\Omega_{\ell}\in\mathcal{T}_{\Omega}}\int_{\Omega_{\ell}}\mathrm{d}x\,\nabla\phi_{i}\cdot p_{\ell}\nabla\phi_{j}\Bigr)u_{j}=\sum_{\Omega_{\ell}\in\mathcal{T}_{\Omega}}\int_{\Omega_{\ell}}\mathrm{d}x\,f_{\ell}\phi_{i},\quad i=1,\ldots,n.

Au=b,

Y=ZD

A=\sum_{\ell=1}^{k}Y_{\ell}^{T}Y_{\ell}=Y^{T}Y,

u_{\mathrm{opt}}=u_{\mathrm{LS}}=\arg\min_{u\in\mathbb{R}^{n}}\bigl\|Yu-(Y^{T})^{\dagger}b\bigr\|^{2},

u_{\mathrm{LS}}=(Y^{T}Y)^{-1}Y^{T}(Y^{T})^{\dagger}b=A^{-1}Y^{T}(Y^{T})^{\dagger}b=A^{-1}b=u_{\mathrm{opt}}.

\hat{u}_{\mathrm{LS}}=\arg\min_{u\in\mathbb{R}^{n}}\bigl\|\hat{Y}u-(\hat{Y}^{T})^{\dagger}b\bigr\|^{2},

\hat{Y}^{T}\hat{Y}u=b,

Y^{T}Yu=b+\bigl(Y^{T}Y(\hat{Y}^{T}\hat{Y})^{-1}-I\bigr)b=\hat{b}.

\frac{\|\hat{b}-b\|}{\|b\|}\leq\bigl\|Y^{T}Y(\hat{Y}^{T}\hat{Y})^{-1}-I\bigr\|

\mathcal{S}_{\rho}\doteq\{\Psi r\mid r\in\mathbb{R}^{\rho}\},

\Psi r_{\mathrm{opt}}=\Pi u_{\mathrm{opt}}.

\Pi u_{\mathrm{opt}}\approx u_{\mathrm{reg}}=\arg\min_{u\in\mathcal{S}_{\rho}}\bigl\|Yu-(Y^{T})^{\dagger}b\bigr\|^{2},

r^{\prime}=\arg\min_{r\in\mathbb{R}^{\rho}}\bigl\|A\Psi r-b\bigr\|^{2},

r^{\prime}=(\Psi^{T}A^{2}\Psi)^{-1}\Psi^{T}Ab=\Psi^{T}u+(\Psi^{T}A^{2}\Psi)^{-1}\Psi^{T}A^{2}(I-\Pi)u,

r_{\mathrm{reg}}=\arg\min_{r\in\mathbb{R}^{\rho}}\bigl\|Y\Psi r-(Y^{T})^{\dagger}b\bigr\|^{2}.

\arg\min_{u\in\mathcal{S}_{\rho}}\bigl\|Yu-(Y^{T})^{\dagger}b\bigr\|^{2}=\arg\min_{u\in\mathcal{S}_{\rho}}\bigl\|Y\Pi u-(\Psi^{T}Y^{T})^{\dagger}\Psi^{T}b\bigr\|^{2}.

\Psi^{T}Y^{T}Y\Psi r=\Psi^{T}Y^{T}(Y^{T})^{\dagger}b\iff r_{\mathrm{reg}}=(\Psi^{T}Y^{T}Y\Psi)^{-1}\Psi^{T}b.

\arg\min_{r\in\mathbb{R}^{\rho}}\bigl\|Y\Pi\Psi r-(\Psi^{T}Y^{T})^{\dagger}\Psi^{T}b\bigr\|^{2},

Full text

A sketched finite element method for elliptic models

Robert Lung, School of Engineering, University of Edinburgh, UK

Yue Wu, Mathematical Institute, University of Oxford, Oxford, UK

Dimitris Kamilis, School of Engineering, University of Edinburgh, EH9 3JL Edinburgh, UK

Nick Polydorides, School of Engineering, University of Edinburgh, EH9 3JL Edinburgh, UK & The Alan Turing Institute, London, UK

Abstract.

We consider a sketched implementation of the finite element method for elliptic partial differential equations on high-dimensional models. Motivated by applications in real-time simulation and prediction, we propose an algorithm that involves projecting the finite element solution onto a low-dimensional subspace and sketching the reduced equations using randomised sampling. We show that a sampling distribution based on the leverage scores of a tall matrix associated with the discrete Laplacian operator can achieve nearly optimal performance and a significant speedup. We derive an expression of the complexity of the algorithm in terms of the number of samples that are necessary to meet an error tolerance specification with high probability, and an upper bound for the distance between the sketched and the high-dimensional solutions. Our analysis shows that the projection not only reduces the dimension of the problem but also regularises the reduced system against sketching error. Our numerical simulations suggest speed improvements of two orders of magnitude in exchange for a small loss in the accuracy of the prediction.

Key words and phrases:

Randomised linear algebra, Galerkin finite element method, statistical leverage scores, real-time simulation.

2019 Mathematics Subject Classification:

65F05, 65M60, 68W20

1. Introduction

Motivated by applications in digital manufacturing twins and real-time simulation in robotics, we consider the implementation of the Finite Element Method (FEM) in high-dimensional discrete models associated with elliptic partial differential equations (PDE). In particular, we focus on the many-query context, where a stream of approximate solutions is sought for various PDE parameter fields [8], aiming to expedite computations in situations where speedy model prediction is critical. Realising real-time simulation with high-dimensional models is instrumental in enabling digital economy functions and has been driving developments in model reduction over the last decade [12]. Reducing the computational complexity of models is also central to the practical performance of statistical inference and uncertainty quantification algorithms, where a multitude of model evaluations are necessary to achieve convergence [16]. When real-time prediction is coupled with noisy sensor data, as in the digital twins paradigm, a fast, somewhat inaccurate model prediction typically suffices [4].

Our approach is thus tailored to applications where some of the accuracy of the solution can be traded off with speed. In these circumstances the framework of randomised linear algebra presents a competitive alternative [23]. In the seminal work [6], Drineas and Mahoney propose an algorithm for computing the solution of the Laplacian of a graph, making the case for sampling the rows of the matrices involved based on their statistical leverage scores. Although aimed explicitly at symmetric diagonally dominant systems, their approach provides inspiration for the numerical solution of PDEs on unstructured meshes. Apart from the algebraic resemblance to the Galerkin FEM systems, the authors introduced sampling based on leverage scores of matrices through the concept of ‘effective resistance’ of a graph derived by mimicking Ohmic relations in resistor networks. As it turns out, the complexity of computing the leverage scores is similar to that of solving the high-dimensional problem deterministically; however, efficient methods to approximate them have since been suggested [7]. More recently, Avron and Toledo have proposed an extension of [6] for preconditioning the FEM equations by introducing the ‘effective stiffness’ of an element in a finite element mesh [1]. Specifically, for sparse symmetric positive definite (SSPD) stiffness matrices, they derive an expression for the effective stiffness of an element and show its equivalence to the statistical leverage scores. Sampling $O(n\log n)$ elements leads to a sparser preconditioner.

In situations where the solution of a single, high-dimensional linear system is sought, randomised algorithms suited to SSPD systems are readily available. The method of Gower and Richtárik, for example, randomises the row-action iterative methods by taking a sequence of random projections onto convex sets [9]. This algorithm is equivalent to a stochastic gradient descent method with provable convergence, while their alternative approach in [10] iteratively sketches the inverse of the matrix. In [2], Bertsekas and Yu present a Monte Carlo method for simulating approximate solutions to linear fixed-point equations, arising in evaluating the cost of stationary policies in Markovian decisions. Their algorithm is based on approximate dynamic programming and has subsequently led to [20], which extends some of the proposed importance sampling ideas in the context of linear ill-posed inverse problems.

Real-time FEM computing in the many-query paradigm is hindered by two fundamental challenges: the fast assembly of the stiffness matrix for each parameter field, and the efficient solution of the resulting system to the required accuracy. To mitigate these, we compromise slightly on accuracy in order to capitalise on speed. To achieve this we first transform the linear SSPD system into an overdetermined least squares problem, and then project its solution onto a low-dimensional subspace. This amounts to inverting a low-dimensional, dense matrix whose entries are perturbed by random errors. Our emphasis and contributions are in developing the projected sketching algorithm, and in optimising the sampling process so that it is both efficient in the multi-query context and effective in suppressing the variance of the solution. We also analyse the complexity of our algorithm and derive probabilistic error bounds for the quality of the approximation.

Our paper is organised as follows: In section 2 we provide a concise introduction to the Galerkin formulation for elliptic boundary value problems, and subsequently derive the projected least squares formulation of the problem. We then describe the sampling distribution used in the sketching and provide the conditions under which the reduced sketched system has a unique solution. Section 4 contains a description of our algorithm, and our main result that describes the complexity of our algorithm in achieving an error tolerance in high probability. We then provide an error analysis addressing the various types of errors imparted on the solution through the various stages of the methodology, before concluding with some numerical experiments based on the steady-state diffusion equation.

1.1. Notation

Let $[m]$ denote the set of integers between 1 and $m$ inclusive. For a matrix $X\in\mathbb{R}^{m\times n}$ , $X_{(\ell)}$ and $X^{(\ell)}$ denote its $\ell$ -th row and column respectively, and $X_{ij}$ its $(i,j)$ -th entry. $X^{\dagger}$ is the pseudo-inverse of $X$ and $\kappa(X)$ its condition number. If $m\geq n$ we define the singular value decomposition $X=U_{X}\Sigma_{X}V_{X}^{T}$ where $U_{X}\in\mathbb{R}^{m\times n}$ , $\Sigma_{X}\in\mathbb{R}^{n\times n}$ and $V_{X}\in\mathbb{R}^{n\times n}$ . Unless stated otherwise, singular values and eigenvalues are ordered in non-increasing order. Analogously, for a symmetric and positive definite matrix $A\in\mathbb{R}^{m\times m}$ , $\lambda_{\max}(A)$ is the largest eigenvalue, and $\lambda_{\min}(A)$ the smallest. By $\mathrm{nnz}(A)$ we denote the number of non-zero elements in $A$ . Further we write $\|\cdot\|$ for the Euclidean norm of a vector or the spectral norm of a matrix, and $\|\cdot\|_{F}$ for the Frobenius norm of a matrix. For matrices $X$ and $Y$ with the same number of rows, $(X|Y)$ is the augmented matrix formed by column concatenation. The identity matrix is expressed as $I$ , or as $I_{n}$ to specify its dimension $n$ when important to the context. We write $y\otimes 1_{n}$ for the Kronecker product of the vector $y$ with the ones vector in $n$ dimensions.

2. Galerkin finite element method preliminaries

Consider the elliptic partial differential equation

$\displaystyle -\nabla\cdot p\nabla u=f\quad\text{in }\Omega,\qquad(1)$

on a bounded, simply connected domain $\Omega\subset\mathbb{R}^{d}$ , $d\in\{2,3\}$ with Dirichlet conditions

$\displaystyle u=g^{(D)}\quad\text{on }\partial\Omega,\qquad(2)$

on a Lipschitz smooth boundary $\partial\Omega$ . Further let $p$ be a bounded positive parameter function in the Banach space $L^{\infty}(\Omega)$ such that

$\displaystyle 0<p_{\min}\leq p\leq p_{\max}<\infty\quad\text{on }\Omega\cup\partial\Omega,$

for some finite constants $p_{\min}$ and $p_{\max}$ . Multiplying (1) by an appropriate test function $v$ , then integrating over the domain and invoking the divergence theorem yields

$\displaystyle \int_{\Omega}\mathrm{d}x\,\nabla u\cdot p\nabla v=\int_{\Omega}\mathrm{d}x\,fv,$

where $\mathrm{d}x$ denotes the $d$ -dimensional integration element. Using the standard definition of the Sobolev space on this domain as

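$\displaystyle \mathcal{H}^{1}(\Omega)\doteq\Bigl\{u\in L^{2}(\Omega)\Bigm|\frac{\partial u}{\partial x_{q}}\in L^{2}(\Omega),\quad q=1,\ldots,d\Bigr\},$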

where $L^{2}(\Omega)$ is the space of square-integrable functions on $\Omega$ we define the solution and test function spaces respectively by

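$\displaystyle \mathcal{H}^{1}_{U}\doteq\Bigl\{u\in\mathcal{H}^{1}(\Omega)\Bigm|u=g^{(D)}\;\mathrm{on}\;\partial\Omega\Bigr\},\quad\mathcal{H}^{1}_{0}\doteq\Bigl\{v\in\mathcal{H}^{1}(\Omega)\Bigm|v=0\;\mathrm{on}\;\partial\Omega\Bigr\}.$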

Let $f\in L^{2}(\Omega)$ and $g^{(D)}\in\mathcal{H}^{1/2}(\partial\Omega)$ , where the Sobolev space $\mathcal{H}^{1/2}$ is to be understood in terms of a surjective trace operator from $\mathcal{H}_{U}^{1}(\Omega)$ to $\mathcal{H}^{1/2}(\partial\Omega)$ . Then the weak form of the boundary value problem (1)-(2) is to find a function $u\in\mathcal{H}^{1}_{U}$ such that

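$\displaystyle \int_{\Omega}\mathrm{d}x\,\nabla u\cdot p\nabla v=\int_{\Omega}\mathrm{d}x\,fv,\quad\forall v\in\mathcal{H}^{1}_{0}.\qquad(7)$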

The existence and uniqueness of the weak solution $u$ is guaranteed by the Lax-Milgram theorem [8].

To derive the Galerkin finite element approximation method from the weak form (7), we consider $\mathcal{T}_{\Omega}\doteq\{\Omega_{1},\ldots,\Omega_{k}\}$ , a mesh comprising $k$ elements and having $n$ interior and $n_{\partial}$ boundary vertices (nodes). Further let $\mathcal{S}^{1}_{\Omega}\subset\mathcal{H}^{1}_{0}$ be the conforming finite dimensional space associated with the chosen finite element basis defined on $\mathcal{T}_{\Omega}$ . Choosing

$\displaystyle \mathcal{S}^{1}_{\Omega}\doteq\mathrm{span}\{\phi_{1},\ldots,\phi_{n},\ldots,\phi_{n+n_{\partial}}\}$

to comprise linear interpolation shape functions with local support over the elements in $\mathcal{T}_{\Omega}$ then we can express the FEM approximation of $u$ in this basis for a set of coefficients $u_{1},\ldots,u_{n+n_{\partial}}$ as

$\displaystyle u=\sum_{i=1}^{n}u_{i}\phi_{i}+\sum_{i=n+1}^{n+n_{\partial}}u_{i}\phi_{i}.$

We have made slight abuse of notation by using $u$ for the function in $\mathcal{H}^{1}_{U}$ as well as its FEM approximation in $\mathcal{S}^{1}_{\Omega}$ . In effect, the finite element formulation of the boundary value problem is to find $u\in\mathcal{S}^{1}_{\Omega}$ such that

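$\displaystyle \sum_{\Omega_{\ell}\in\mathcal{T}_{\Omega}}\int_{\Omega_{\ell}}\mathrm{d}x\,\nabla u\cdot p\nabla v=\sum_{\Omega_{\ell}\in\mathcal{T}_{\Omega}}\int_{\Omega_{\ell}}\mathrm{d}x\,fv,\quad\forall v\in\mathcal{S}^{1}_{\Omega}.$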

We further define the element-average coefficients

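$\displaystyle p_{\ell}=\frac{1}{|\Omega_{\ell}|}\int_{\Omega_{\ell}}\mathrm{d}x\,p,\quad\text{and}\quad f_{\ell}=\frac{1}{|\Omega_{\ell}|}\int_{\Omega_{\ell}}\mathrm{d}x\,f,\quad\ell=1,\ldots,k$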

and applying the Dirichlet boundary conditions on the boundary nodes $n_{\partial}$ we arrive at the Galerkin system of equations for the vector $\{u_{1},\ldots,u_{n}\}$

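$\displaystyle \sum_{j=1}^{n}\Bigl(\sum_{\Omega_{\ell}\in\mathcal{T}_{\Omega}}\int_{\Omega_{\ell}}\mathrm{d}x\,\nabla\phi_{i}\cdot p_{\ell}\nabla\phi_{j}\Bigr)u_{j}=\sum_{\Omega_{\ell}\in\mathcal{T}_{\Omega}}\int_{\Omega_{\ell}}\mathrm{d}x\,f_{\ell}\phi_{i},\quad i=1,\ldots,n.\qquad(11)$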

The equations in (11) are expressed in a matrix form as

$\displaystyle Au=b,\qquad(12)$

where $A\in\mathbb{R}^{n\times n}$ is the symmetric, sparse and positive-definite stiffness matrix, whose dependence on the parameters $p$ is implicit and suppressed for clarity. The FEM construction guarantees the consistency of the system (12), thus $b\in\mathbb{R}^{n}$ is always in the column space of $A$ and consequently it admits a unique solution $u_{\mathrm{opt}}=A^{-1}b$ . As we focus on the efficient approximation of $u_{\mathrm{opt}}$ in the many-query context we contend with two challenges: the efficient assembly of the stiffness matrix, and the speedy solution of the resulting FEM system.

2.1. The stiffness matrix

Let $\mathcal{I}_{\ell}$ be the index set of the $d+1$ vertices of the $\ell$ -th element, and consider $D_{\ell}\in\mathbb{R}^{d\times n}$ to be the sparse matrix holding the gradients of the linear shape functions $\phi_{i}$ where $i\in\mathcal{I}_{\ell}$ . Here $D_{\ell}^{(i)}$ is the constant gradient vector associated with the $i$ -th node of $\Omega_{\ell}$ . Further let $z_{\ell}=|\Omega_{\ell}|p_{\ell}$ be the $\ell$ -th element of a vector $z\in\mathbb{R}^{k}$ , let $Z$ be the diagonal matrix with $Z^{2}=\mathrm{diag}(z\otimes 1_{d})$ , and let $D\in\mathbb{R}^{kd\times n}$ be the row concatenation of the $D_{\ell}$ matrices for all elements. If we define $Y_{\ell}=\sqrt{z_{\ell}}D_{\ell}$ and $Y\in\mathbb{R}^{kd\times n}$ as the row concatenation of the $Y_{\ell}$ matrices, so that

$\displaystyle Y=ZD$

then the stiffness matrix takes the form of a high-dimensional sum or product of sparse matrices

$\displaystyle A=\sum_{\ell=1}^{k}Y_{\ell}^{T}Y_{\ell}=Y^{T}Y,$

which for large $k$ requires efficient assembly using reference elements and geometry mappings [15]. The above construction typically leads to a stiffness matrix that is well-conditioned for inversion, with the exception of acute element skewness [14] and parameter vectors with wild variation [22], which cause the condition number $\kappa(A)$ to increase dramatically. Explicit bounds on the largest and smallest eigenvalues of $A$ , and respectively the singular values of $Y$ , are given in [13].
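As an illustration of the factorisation $A=Y^{T}Y$, the following minimal NumPy sketch builds $Y$ element by element on a small structured triangulation of the unit square. The mesh generator, the function names `unit_square_mesh` and `assemble_Y`, the dense storage and the piecewise-constant parameter drawn from $\mathcal{U}([10^{-1},10^{2}])$ are assumptions made for the example, not the authors' implementation, which would use sparse matrices and unstructured meshes.

```python
import numpy as np

def unit_square_mesh(m):
    """Structured triangulation of [0,1]^2 with 2*m*m triangles (illustrative only)."""
    xs = np.linspace(0.0, 1.0, m + 1)
    gx, gy = np.meshgrid(xs, xs, indexing="ij")
    nodes = np.column_stack([gx.ravel(), gy.ravel()])
    idx = lambda i, j: i * (m + 1) + j
    tris = []
    for i in range(m):
        for j in range(m):
            a, b, c, d = idx(i, j), idx(i + 1, j), idx(i + 1, j + 1), idx(i, j + 1)
            tris += [(a, b, c), (a, c, d)]
    interior = np.array([idx(i, j) for i in range(1, m) for j in range(1, m)])
    return nodes, np.array(tris), interior

def assemble_Y(nodes, tris, p, interior):
    """Stack Y_l = sqrt(|Omega_l| p_l) D_l over all elements, keeping interior columns."""
    d = 2
    Y = np.zeros((d * len(tris), len(nodes)))
    for l, tri in enumerate(tris):
        M = np.column_stack([np.ones(3), nodes[tri]])  # rows (1, x_i, y_i)
        area = 0.5 * abs(np.linalg.det(M))             # |Omega_l|
        grads = np.linalg.inv(M)[1:, :]                # column i = gradient of phi_i
        Y[d * l:d * (l + 1), tri] = np.sqrt(area * p[l]) * grads
    return Y[:, interior]                              # Dirichlet nodes removed

rng = np.random.default_rng(0)
nodes, tris, interior = unit_square_mesh(6)
p = rng.uniform(0.1, 100.0, size=len(tris))            # element-wise parameter p_l
Y = assemble_Y(nodes, tris, p, interior)
A = Y.T @ Y                                            # stiffness matrix on interior nodes
print(Y.shape, A.shape)
```

In this toy setting $Y$ has $kd=144$ rows and $n=25$ columns, so it is tall, and changing the parameter field only rescales its rows through $\sqrt{z_{\ell}}$.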

3. A regularised sketched formulation

The sought solution $u_{\mathrm{opt}}=A^{-1}b$ can be alternatively obtained by solving the over-determined least squares problem

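$\displaystyle u_{\mathrm{opt}}=u_{\mathrm{LS}}=\arg\min_{u\in\mathbb{R}^{n}}\bigl\|Yu-(Y^{T})^{\dagger}b\bigr\|^{2},\qquad(15)$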

since

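$\displaystyle u_{\mathrm{LS}}=(Y^{T}Y)^{-1}Y^{T}(Y^{T})^{\dagger}b=A^{-1}Y^{T}(Y^{T})^{\dagger}b=A^{-1}b=u_{\mathrm{opt}}.$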
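The equivalence between the square system and the over-determined least squares problem can be checked numerically. The sketch below uses a random tall matrix as a stand-in for $Y$ (an assumption made purely for illustration; it is not a finite element matrix) and compares the least squares solution with $A^{-1}b$.

```python
import numpy as np

rng = np.random.default_rng(1)
m_rows, n = 200, 20
Y = rng.standard_normal((m_rows, n))      # tall stand-in for Y, full column rank almost surely
A = Y.T @ Y
b = rng.standard_normal(n)

rhs = np.linalg.pinv(Y.T) @ b             # (Y^T)^dagger b, a vector with m_rows entries
u_ls, *_ = np.linalg.lstsq(Y, rhs, rcond=None)
u_opt = np.linalg.solve(A, b)

print(np.linalg.norm(u_ls - u_opt))       # difference at the level of rounding error
```

The identity relies on $Y^{T}(Y^{T})^{\dagger}=I$, which holds whenever $Y$ has full column rank.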

The fact that the above problem is over-determined implies, at least to some extent, robustness against noise, such as random perturbations on the elements of the matrix $Y$ and vector $b$ . A similar error is induced by randomised sketching where we replace (15) with

$\displaystyle \hat{u}_{\mathrm{LS}}=\arg\min_{u\in\mathbb{R}^{n}}\bigl\|\hat{Y}u-(\hat{Y}^{T})^{\dagger}b\bigr\|^{2},\qquad(16)$

and look for a random approximation $\hat{Y}$ of $Y$ in the sense that ${\hat{u}}_{\mathrm{LS}}\approx u_{\mathrm{LS}}$ . We note that $\hat{Y}$ and $Y$ don’t have to be similar as such, e.g. have the same dimensions, as long as the problems are well defined and the optimisers remain similar. Following [6] and [19] we seek to approximate $Y$ with some sketch $\hat{Y}$ by sampling and scaling rows according to probabilities that will be specified later. The number of rows in $\hat{Y}$ in that case equals the number of drawn samples. Clearly $\hat{Y}$ must have at least $n$ rows as otherwise the problem (16) will be under-determined and, due to the non-uniqueness of the solution, the error could become arbitrarily large. On the other hand, if around $n\log(n)$ rows are sampled from a suitable distribution, then Drineas and Mahoney show that the resulting sketch is a good approximation with high probability. However, if substantially fewer than $n\log(n)$ samples are drawn then the sketching-induced error outweighs its computational benefits. In order to understand how this issue can be addressed we note that, if $\hat{Y}$ has full column rank and thus the optimiser of (16) is unique, the solution of the sketched problem can be obtained by solving the linear system

$\displaystyle \hat{Y}^{T}\hat{Y}u=b,$

which is equivalent to solving

$\displaystyle Y^{T}Yu=b+\bigl(Y^{T}Y(\hat{Y}^{T}\hat{Y})^{-1}-I\bigr)b=\hat{b}.\qquad(17)$

From (17) it becomes clear that the sketching induced error can be regarded as an error on the right-hand side of the linear system (12) or the least squares problem (15). We can easily obtain a bound for the relative error given by

$\displaystyle \frac{\|\hat{b}-b\|}{\|b\|}\leq\bigl\|Y^{T}Y(\hat{Y}^{T}\hat{Y})^{-1}-I\bigr\|$
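The interpretation of sketching as a perturbation of the right-hand side can be reproduced in a few lines. In the sketch below, the random stand-in for $Y$, the uniform sampling with replacement and the rescaling by $\sqrt{m/c}$ are assumptions chosen for simplicity; the text defers the actual sampling probabilities to a later stage.

```python
import numpy as np

rng = np.random.default_rng(2)
m_rows, n, c = 2000, 10, 400
Y = rng.standard_normal((m_rows, n))          # stand-in for the tall matrix Y
b = rng.standard_normal(n)
A = Y.T @ Y

idx = rng.integers(0, m_rows, size=c)         # uniform row sampling with replacement
Y_hat = Y[idx] * np.sqrt(m_rows / c)          # rescaled so that E[Y_hat^T Y_hat] = Y^T Y
A_hat = Y_hat.T @ Y_hat

u_hat = np.linalg.solve(A_hat, b)             # sketched normal equations
b_hat = A @ u_hat                             # perturbed right-hand side, as in (17)

rel_err = np.linalg.norm(b_hat - b) / np.linalg.norm(b)
bound = np.linalg.norm(A @ np.linalg.inv(A_hat) - np.eye(n), 2)
print(rel_err, bound)                         # rel_err never exceeds bound
```

The inequality holds for any invertible $\hat{Y}^{T}\hat{Y}$ because $\hat{b}-b$ is the matrix $Y^{T}Y(\hat{Y}^{T}\hat{Y})^{-1}-I$ applied to $b$.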

A standard way of dealing with noise as in (17) is regularisation [18]. Suppose that there exists a low-dimensional subspace

$\displaystyle \mathcal{S}_{\rho}\doteq\{\Psi r\mid r\in\mathbb{R}^{\rho}\},$

spanned by a basis of $\rho\ll n$ orthonormal functions arranged in the columns of matrix $\Psi$ , and assume that it is sufficient to approximate $u_{\mathrm{opt}}$ within some acceptable level of accuracy, in the sense of incurring a small subspace error $\|(I-\Pi)u_{\mathrm{opt}}\|$ . The orthogonal projection operator $\Pi\doteq\Psi\Psi^{T}$ maps vectors from $\mathbb{R}^{n}$ onto the subspace $\mathcal{S}_{\rho}$ . Of course, such a subspace can’t accommodate all but rather only sufficiently regular $u\in\mathbb{R}^{n}$ . For that reason $\mathcal{S}_{\rho}$ has to be constructed using prior information (e.g. degree of smoothness) about the solution. Orthogonality of $\Psi$ ensures for any $u_{\mathrm{opt}}=\Pi u_{\mathrm{opt}}+(I-\Pi)u_{\mathrm{opt}}$ the existence of a unique, optimal low-dimensional vector $r_{\mathrm{opt}}$ satisfying

$\displaystyle \Psi r_{\mathrm{opt}}=\Pi u_{\mathrm{opt}}.$

In these conditions we can pose a projected-regularised least-squares problem replacing (15) by

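$\displaystyle \Pi u_{\mathrm{opt}}\approx u_{\mathrm{reg}}=\arg\min_{u\in\mathcal{S}_{\rho}}\bigl\|Yu-(Y^{T})^{\dagger}b\bigr\|^{2},\qquad(20)$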

in order to improve the robustness of the solution against sketching-induced errors. The problem in (20) still involves high-dimensional quantities such as $Y$ and $b$ , but the solution is unique as soon as $\mathcal{S}_{\rho}$ and the null-space of $Y$ have $\{0\}$ intersection. We start by introducing the low-dimensional problem

$\displaystyle r_{\mathrm{reg}}=\arg\min_{r\in\mathbb{R}^{\rho}}\bigl\|Y\Psi r-(Y^{T})^{\dagger}b\bigr\|^{2}.\qquad(21)$

Footnote. We emphasise the contrast between the projected equations in (21) and the projected variable least squares problem

$r^{\prime}=\arg\min_{r\in\mathbb{R}^{\rho}}\bigl\|A\Psi r-b\bigr\|^{2},$

whose solution is

$\displaystyle r^{\prime}=(\Psi^{T}A^{2}\Psi)^{-1}\Psi^{T}Ab=\Psi^{T}u+(\Psi^{T}A^{2}\Psi)^{-1}\Psi^{T}A^{2}(I-\Pi)u,$

and incurs a subspace regression error term that is quadratic in $A$ . Moreover, note that the right-hand side vector in the normal equations $\Psi^{T}A^{T}A\Psi r^{\prime}=\Psi^{T}A^{T}b$ has dependence on the parameter through $A$ .

A solution $r_{\mathrm{reg}}$ of (21) yields a solution $u_{\mathrm{reg}}=\Psi r_{\mathrm{reg}}$ of (20) because the columns of $\Psi$ form an ONB of $\mathcal{S}_{\rho}$ . In addition, we have the following.

Lemma 3.1.

If $Y$ has full column rank and the columns of $\Psi$ form an ONB of $\mathcal{S}_{\rho}$ so that $\Pi=\Psi\Psi^{T}$ is the projection onto $\mathcal{S}_{\rho}$ , then

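$\displaystyle \arg\min_{u\in\mathcal{S}_{\rho}}\bigl\|Yu-(Y^{T})^{\dagger}b\bigr\|^{2}=\arg\min_{u\in\mathcal{S}_{\rho}}\bigl\|Y\Pi u-(\Psi^{T}Y^{T})^{\dagger}\Psi^{T}b\bigr\|^{2}.\qquad(22)$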

In particular, both problems have a unique solution.

Proof.

Both problems have unique solutions because $\mathcal{S}_{\rho}$ is convex and $Y$ has (by assumption) full column rank. Therefore it suffices to show that there exists an element $u_{\mathrm{reg}}\in\mathcal{S}_{\rho}$ that solves both problems. The solution $r_{\mathrm{reg}}$ of (21) can be found explicitly by solving the linear system

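$\displaystyle \Psi^{T}Y^{T}Y\Psi r=\Psi^{T}Y^{T}(Y^{T})^{\dagger}b\iff r_{\mathrm{reg}}=(\Psi^{T}Y^{T}Y\Psi)^{-1}\Psi^{T}b.$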

We have used that $Y$ has full column rank so that $Y^{T}(Y^{T})^{\dagger}=I$ and $\Psi^{T}Y^{T}Y\Psi$ is invertible. Similarly we may consider

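$\displaystyle \arg\min_{r\in\mathbb{R}^{\rho}}\bigl\|Y\Pi\Psi r-(\Psi^{T}Y^{T})^{\dagger}\Psi^{T}b\bigr\|^{2},$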

which produces solutions $r_{\Psi}$ such that $\Psi r_{\Psi}$ is a solution of the right-hand side of (22). Since $\Pi\Psi=\Psi$ and $Y\Psi$ has full column rank we can write $r_{\Psi}$ as

[TABLE]

We conclude that $\Psi(\Psi^{T}Y^{T}Y\Psi)^{-1}\Psi^{T}b$ is a solution to both sides of (22) which completes the proof. ∎
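Lemma 3.1 can be verified numerically on a small example. The sketch below is a minimal check under stated assumptions: $Y$ is a random tall matrix rather than a finite element matrix, $\Psi$ is obtained from a QR factorisation of a random matrix rather than from prior information about the solution, and the name `G` for $\Psi^{T}Y^{T}Y\Psi$ is used here only as shorthand.

```python
import numpy as np

rng = np.random.default_rng(3)
m_rows, n, rho = 300, 40, 5
Y = rng.standard_normal((m_rows, n))                    # stand-in with full column rank
A = Y.T @ Y
b = rng.standard_normal(n)
Psi, _ = np.linalg.qr(rng.standard_normal((n, rho)))    # orthonormal basis of S_rho

# Closed form from the proof: r_reg = (Psi^T Y^T Y Psi)^{-1} Psi^T b
G = Psi.T @ A @ Psi
r_reg = np.linalg.solve(G, Psi.T @ b)
u_reg = Psi @ r_reg

# Left-hand problem of the lemma, written in the coefficients r
rhs_left = np.linalg.pinv(Y.T) @ b
r_left, *_ = np.linalg.lstsq(Y @ Psi, rhs_left, rcond=None)

# Right-hand (embedded) problem of the lemma, using Pi Psi = Psi
rhs_right = np.linalg.pinv(Psi.T @ Y.T) @ (Psi.T @ b)
r_right, *_ = np.linalg.lstsq(Y @ Psi, rhs_right, rcond=None)

print(np.linalg.norm(r_left - r_reg), np.linalg.norm(r_right - r_reg))
```

Both least squares solves return the same coefficients as the $\rho\times\rho$ linear system, which is the only system that has to be inverted once $\Psi^{T}b$ is available.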

The right hand side of (22) has a very natural interpretation and is obtained by embedding the rows of $Y$ , the vector $b$ and the variable $u$ in $\mathcal{S}_{\rho}$ using its low dimensional representation from the basis induced by the columns of $\Psi$ . In view of Lemma 3.1 we may regularise the problem from (16) and obtain an embedded sketched counterpart to (20) as

[TABLE]

We argue that (23) is much more robust to the noise imparted by the approximation $\hat{Y}$ and produces solutions with controlled errors even if substantially fewer than $n$ suitably drawn samples are used for the approximation. In order to see why, notice that the problem (23) can be expressed in terms of the low-dimensional vector of coefficients

[TABLE]

so that $\Psi\hat{r}_{\mathrm{reg}}={\hat{u}}_{\mathrm{reg}}$ . Recalling that $A=Y^{T}Y$ , it is convenient to introduce

[TABLE]

together with their sketched approximations

[TABLE]

Lemma 3.2.

If $\hat{X}=\hat{Y}\Psi$ has full column rank then the solution of the least-squares problem (24) is given by $\hat{r}_{\mathrm{reg}}=\hat{G}^{-1}\Psi^{T}b$ and we have

[TABLE]

where $u_{\mathrm{reg}}$ and ${\hat{u}}_{\mathrm{reg}}$ are the solutions of (20) and (23) respectively.

Proof.

If $\hat{Y}\Psi$ has linearly independent columns then $\Psi^{T}\hat{Y}^{T}(\Psi^{T}\hat{Y}^{T})^{\dagger}=I$ and the solution $\hat{r}_{\mathrm{reg}}$ of (24) solves

[TABLE]

Again $\hat{G}$ is invertible because $\hat{Y}\Psi$ has linearly independent columns and the first claim follows. The matrix $A$ is positive definite which implies that $G$ is positive definite and $u_{\mathrm{reg}}=\Psi G^{-1}\Psi^{T}b$ . The matrix $\Psi$ has orthonormal columns which implies $\Psi^{T}b=G\Psi^{T}u_{\mathrm{reg}}$ . Since ${\hat{u}}_{\mathrm{reg}}=\Psi\hat{r}_{\mathrm{reg}}$ we can use the formula we have just shown and obtain

[TABLE]

where the last identity is due to $u_{\mathrm{reg}}\in\mathcal{S}_{\rho}$ . ∎

In order to understand the effect of row sampling and why it can be a good approximation, we start by writing

[TABLE]

as a sum of outer products of rows. Introduce for some sample size $c\in\mathbb{N}$ the iid random indices $\mathbf{i}_{1},\dots,\mathbf{i}_{c}$ taking values in $[kd]$ with distribution

[TABLE]

for each $j\in[c]$ and $i\in[kd]$ . Instead of (28) we may consider the sketch

[TABLE]

If we define the random matrix $R\in\mathbb{R}^{kd\times c}$ and the random diagonal matrix $W\in\mathbb{R}^{c\times c}$ via

[TABLE]

then we can put $S=RW$ and construct the sketch $\hat{G}$ as

[TABLE]

Lastly, we can write $\hat{Y}=S^{T}Y$ as well as $\hat{X}=\hat{Y}\Psi=S^{T}Y\Psi$ for the sketches of $Y$ and $X$ . A simple computation together with an application of the strong law of large numbers shows the following.

Proposition 3.3 (Lemma 3 and 4 in [DrineasMahoneyKannan]).

Assume that the sampling probabilities satisfy the consistency condition

[TABLE]

In this case we have for the matrix $\hat{G}$ as defined in (30) that $\mathbb{E}[\hat{G}]=G$ and $\mathbb{E}[\|\hat{G}-G\|_{F}^{2}]=\mathcal{O}\left(c^{-1}\right)$ . As a consequence, $\hat{G}\to G$ almost surely for $c\to\infty$ .
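The two statements of Proposition 3.3 can be observed empirically. The sketch below assumes the standard importance-sampling construction, in which a row drawn with probability $q_{i}$ is rescaled by $1/\sqrt{cq_{i}}$; this scaling, the random stand-in for $X$, the row-norm probabilities and the helper names `sketch_G` and `mean_sq_error` are assumptions made for the illustration, since the sketch definitions are not reproduced in this extract.

```python
import numpy as np

rng = np.random.default_rng(4)
m_rows, rho = 5000, 8
# Stand-in for the tall dense matrix X, with rows of varying magnitude
X = rng.standard_normal((m_rows, rho)) * rng.uniform(0.1, 3.0, size=(m_rows, 1))
G = X.T @ X

q = np.sum(X**2, axis=1)
q /= q.sum()                     # probabilities proportional to squared row norms

def sketch_G(c):
    """Sample c rows iid from q, rescale by 1/sqrt(c q_i), return the sketched Gram matrix."""
    idx = rng.choice(m_rows, size=c, p=q)
    Xs = X[idx] / np.sqrt(c * q[idx])[:, None]
    return Xs.T @ Xs

def mean_sq_error(c, trials=200):
    """Monte Carlo estimate of E||G_hat - G||_F^2 for sample size c."""
    return np.mean([np.linalg.norm(sketch_G(c) - G, "fro") ** 2 for _ in range(trials)])

e1, e2 = mean_sq_error(200), mean_sq_error(800)
print(e1 / e2)                   # close to 4 if the error decays like 1/c
```

Quadrupling the sample size should reduce the mean squared Frobenius error by roughly a factor of four, in line with the $\mathcal{O}(c^{-1})$ rate.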

Proposition 3.3 summarises the asymptotic properties of the used sketch. The condition (33) is very mild and holds for a wide range of distributions such as sampling from scaled row norms or uniform sampling. The convergence rate of $c^{-1}$ cannot be improved although the constant depends on the chosen probabilities $q_{j}$ . In other words, as long as we sample all non-zero rows with positive probability we will obtain a sketch that has good asymptotic properties when considered as an approximation for $G$ . However, in order to find good sampling probabilities $q_{j}$ we have to consider the non-asymptotic behaviour of the sketch. In fact, the main purpose of the regularisation/dimensionality reduction was to avoid situations where sampling a large number of rows is necessary. If $\rho\ll n$ , then the regularised problem (21) has substantially fewer degrees of freedom than the high dimensional formulation in (15). Consequently, the dependence of $G$ on the rows of $X$ is a lot smoother than the dependence of $A$ on $Y_{(j)}$ . In other words, approximating $X$ by row sampling has a much smaller effect on the regularised solution $u_{\mathrm{reg}}$ than an approximation of $Y$ with the same sample size $c$ would have on the solution $u$ of the full system (12). For example, a much smaller number of rows needs to be sampled to obtain the correct null-space which results in a full-rank approximation of $G$ . Note that, conditional on $\hat{G}$ being invertible, $u_{\mathrm{reg}}\in\mathcal{S}_{\rho}$ in combination with Lemma 3.2 implies

[TABLE]

so the randomisation error of the regularised problem is entirely controlled by low dimensional structures. This property is the key to a small sketching error and thus to an overall accurate approximation when only few samples are drawn. Using the notation from before and letting $X=U_{X}\Sigma_{X}V_{X}^{T}$ be the singular value decomposition of $X$ , we can write the bound from (34) as

[TABLE]

From the above formulation it becomes apparent that the error will be small if the sketch is constructed such that $(U_{X}^{T}SS^{T}U_{X})^{-1}\approx I$ in spectral norm. We argue that this is essentially equivalent to $U_{X}^{T}SS^{T}U_{X}\approx I$ . Indeed, we have the following.

Lemma 3.4.

If $\|U^{T}_{X}SS^{T}U_{X}-I\|<\varepsilon<1$ then

[TABLE]

Proof.

Under the condition of the lemma we know that $U_{X}^{T}SS^{T}U_{X}$ is invertible and that

[TABLE]

which implies the upper bound by considering the estimate

[TABLE]

Denote by $\lambda_{i}(U_{X}^{T}SS^{T}U_{X})$ the $i$ -th eigenvalue of $U_{X}^{T}SS^{T}U_{X}$ . Then we may write

[TABLE]

where $\lambda_{\min}(U_{X}^{T}SS^{T}U_{X})$ is the smallest eigenvalue. By assumption of the lemma

[TABLE]

which implies the claim after dividing by $\|I-U^{T}_{X}SS^{T}U_{X}\|$ and taking the inverse. ∎
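The relation between $U_{X}^{T}SS^{T}U_{X}\approx I$ and its inverse can be checked numerically. The sketch below does not reproduce the exact constants of Lemma 3.4. Instead it tests the elementary two-sided bound $\delta/(1+\delta)\leq\|M^{-1}-I\|\leq\delta/(1-\delta)$, valid for any symmetric $M$ with $\delta=\|M-I\|<1$. The matrix `U` is a random orthonormal stand-in for $U_{X}$, and the rescaling by $1/\sqrt{c\,q_{j}}$ is an assumed convention.

```python
import numpy as np

rng = np.random.default_rng(0)
kd, rho, c = 2000, 5, 4000

# U plays the role of U_X: orthonormal columns from a thin QR factorisation.
U, _ = np.linalg.qr(rng.standard_normal((kd, rho)))
ell = np.sum(U**2, axis=1)            # leverage scores: squared row norms of U
q = ell / ell.sum()                   # sampling probabilities proportional to them

idx = rng.choice(kd, size=c, p=q)
SU = U[idx] / np.sqrt(c * q[idx])[:, None]    # S^T U with the assumed rescaling
M = SU.T @ SU                                  # U^T S S^T U

delta = np.linalg.norm(M - np.eye(rho), 2)
inv_dev = np.linalg.norm(np.linalg.inv(M) - np.eye(rho), 2)
lower, upper = delta / (1 + delta), delta / (1 - delta)
print(f"delta = {delta:.3f}:  {lower:.3f} <= {inv_dev:.3f} <= {upper:.3f}")
```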

An approximation of $U^{T}_{X}SS^{T}U_{X}$ can be obtained by sampling with probabilities that are proportional to the statistical leverage scores

[TABLE]

i.e. the row norms of the left singular vectors of $X$ [7]. At first sight it seems that taking sampling probabilities proportional to the leverage scores in (35) in order to obtain a sketch of (21) is very similar to using the leverage scores of $Y$ to obtain (16) from (15), as was proposed by Drineas and Mahoney in [6] for a similar problem. A key difference is that $X$ is tall and dense while $Y$ is sparse, and thus $G$ is quite different from the initial stiffness matrix $A$ . Consequently, an interpretation of the leverage scores from (35) in terms of effective stiffness [1] is, to the best of our knowledge, not possible. The following lemma will be useful for our further developments.

Lemma 3.5 ([21] section 6.4).

Assume that $S$ is constructed as before with sampling probabilities $q_{i}$ satisfying

[TABLE]

for some $\beta\in(0,1]$ . Then we have $\forall\varepsilon>0$

[TABLE]

An important corollary of the above lemma is that a sketch which is constructed by sampling from leverage score probabilities will virtually always be invertible and therefore the sketched problem (24) has a unique solution. The following result states that this property is preserved even when the rows are re-weighted, an operation which changes the leverage scores.

Proposition 3.6.

Let $\Gamma\in\mathbb{R}^{kd\times kd}$ be a diagonal matrix with positive entries, i.e. $\Gamma_{ii}>0$ for each $i=1,\dots,kd$ . Assume that the sketching matrix $S$ is constructed with sampling probabilities $q_{i}=\rho^{-1}\ell_{i}(X)$ . For the scaled sketch $\hat{H}=X^{T}\Gamma SS^{T}\Gamma X$ we have

[TABLE]

Proof.

It is sufficient to show that

[TABLE]

because the probability bound follows immediately from

[TABLE]

after applying (37) from Lemma 3.5. The above matrices are always positive semi-definite and therefore invertibility is equivalent to positive definiteness. For any diagonal matrix $\Gamma$ it holds that $S^{T}\Gamma=\hat{\Gamma}S^{T}$ where $\hat{\Gamma}$ is a random diagonal matrix with entries $\hat{\Gamma}_{jj}=\Gamma_{\mathbf{i}_{j}\mathbf{i}_{j}}$ . Thus for any $x\in\mathbb{R}^{\rho}$ we have

[TABLE]

Since $X$ has full column rank we know that $\Sigma_{X}V_{X}^{T}$ corresponds to a change of basis and $\Sigma_{X}V_{X}^{T}x\neq 0$ whenever $x\neq 0$ . It follows that $\hat{H}$ is positive definite if and only if $U^{T}_{X}S\hat{\Gamma}^{2}S^{T}U_{X}$ is positive definite. As $\hat{\Gamma}$ is a diagonal matrix such that $\hat{\Gamma}_{jj}>0$ with probability $1$ , the latter is equivalent to $U^{T}_{X}SS^{T}U_{X}$ being positive definite. The case of $\hat{G}$ is covered by $\Gamma=I$ . ∎
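The argument of the proof can be illustrated numerically: for one and the same set of sampled rows, the sketch of $X$ and the sketch of $\Gamma X$ always have the same rank. The code below is a minimal sketch under stated assumptions. `X` and the diagonal of $\Gamma$ are random stand-ins, `leverage_scores` and `sketch_ranks` are hypothetical helper names, the rescaling by $1/\sqrt{c\,q_{j}}$ is an assumed convention, and the rank of $S^{T}X$ is used in place of the rank of $\hat{G}=X^{T}SS^{T}X$ since the two coincide.

```python
import numpy as np

def leverage_scores(X):
    """Squared row norms of the left singular vectors of X (full column rank assumed)."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sum(U**2, axis=1)

rng = np.random.default_rng(0)
kd, rho = 400, 6
X = rng.standard_normal((kd, rho))        # random stand-in for X
ell = leverage_scores(X)
q = ell / ell.sum()                       # q_i = l_i(X) / rho, as the scores sum to rho
gamma = rng.uniform(0.1, 1.0, size=kd)    # positive diagonal of Gamma (arbitrary values)

def sketch_ranks(c):
    """Ranks of S^T X and S^T Gamma X for one draw of c samples."""
    idx = rng.choice(kd, size=c, p=q)
    w = 1.0 / np.sqrt(c * q[idx])         # assumed rescaling of the sampled rows
    SX = w[:, None] * X[idx]
    SGX = gamma[idx][:, None] * SX        # same rows, re-weighted by Gamma
    return np.linalg.matrix_rank(SX), np.linalg.matrix_rank(SGX)

for c in (3, 6, 60):
    print(c, sketch_ranks(c))
```

With fewer than $\rho$ draws both sketches are necessarily rank deficient, while a modest surplus of draws is typically enough to make both full rank at once.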

Proposition 3.6 states that re-scaling the rows does not affect the invertibility of the sketch, and that after sampling $\rho\log(\rho)$ rows the probability of the sketch being singular decays exponentially fast with each additional draw. In practice this makes knowledge of $\ell_{i}(X)$ valuable because we only need to sample $\rho\log(\rho)+M$ rows for some moderately large $M$ to obtain a sketch that is virtually never singular. On the other hand, we need at least $\rho$ samples for there to be any hope of obtaining a non-singular matrix. The remarkable thing about Proposition 3.6 is that the failure probability is independent of both the inner dimension $kd$ of the product $X^{T}X$ and the scaling matrix $\Gamma$ , and is equivalent to the bound which could be obtained by sampling from $\ell_{i}(\Gamma X)$ . This suggests that a sketch which is constructed by drawing samples from $\ell_{i}(X)$ is not too different compared to sampling from $\ell_{i}(\Gamma X)$ . This intuition is supported by the following result which describes the change in the leverage scores after re-weighting a single row.

Proposition 3.7 ([5] Lemma 5).

Let $\Gamma^{\langle i\rangle}\in\mathbb{R}^{kd\times kd}$ be a diagonal matrix with $\Gamma^{\langle i\rangle}_{ii}=\sqrt{\gamma}\in(0,1)$ and $\Gamma^{\langle i\rangle}_{jj}=1$ for each $j\neq i$ . Then

[TABLE]

and for $i\neq j$

[TABLE]

where $\ell_{ij}(X)=(U_{X}U_{X}^{T})_{ij}$ are the cross leverage scores.

Since $U_{X}$ has orthonormal columns, we have $\|v\|=\|U_{X}v\|$ for any $v\in\mathbb{R}^{\rho}$ and thus the cross leverage scores from Proposition 3.7 satisfy

[TABLE]

For a general diagonal matrix $\Gamma$ as in Proposition 3.6 we may without loss of generality assume that each entry lies in $(0,1]$ since we can divide the elements by their maximum. The re-weighting can thus be considered as a superposition of single row operations

[TABLE]

where the $\Gamma^{\langle i\rangle}$ are as in Proposition 3.7. Since the $\Gamma^{\langle i\rangle}$ commute we can apply them in any order without changing the outcome. Considering Lemma 3.5, if we could ensure that $\ell_{i}(X)$ isn’t substantially smaller than $\ell_{i}(\Gamma X)$ then sampling from $q_{i}=\rho^{-1}\ell_{i}(X)$ will produce good sketches for $\Gamma X$ .

Large leverage scores $\ell_{i}(X)\approx 1$

Equation (39) shows that the relative change of the $i$ -th leverage score after a re-weighting of the $i$ -th row shrinks when $\ell_{i}(X)\to 1$ . In the extreme case when $\ell_{i}(X)=1$ the re-weighting has no effect. In addition to this stability property it trivially holds that $\ell_{i}(X)\leq 1$ which suggests that large leverage scores are fairly stable when rows are re-weighted.

Small leverage scores $\ell_{i}(X)\ll 1$

From Equation (40) we know that the increase of $\ell_{j}(X)$ after re-weighting of row $i$ is proportional to $\ell_{ij}(X)$ . If the entries of the scaling matrix $\Gamma$ don’t vary too much, then (41) suggests that we can expect the total increase, i.e. after applying $\Gamma^{\langle j\rangle}$ for each $j\neq i$ to be roughly of order $\ell_{i}(X)-\ell_{i}^{2}(X)\approx\ell_{i}(X)$ . On the other hand, small $\ell_{i}(X)$ are fairly sensitive to re-weighting of row $i$ since $\ell_{i}(\Gamma^{\langle i\rangle}X)\approx(\Gamma^{\langle i\rangle}_{ii})^{2}\ell_{i}(X)$ in that case. Thus we can expect that the re-weighting of row $i$ will counterbalance the effects from re-weighting the other rows. In addition, we know that

[TABLE]

Since large leverage scores will likely be quite stable and $\ell_{i}(\Gamma X)\geq 0$ we would expect that not too many small leverage scores will become large.
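The effect of re-weighting a single row can be verified on a small random matrix. The update formulas used below are derived here from the Sherman-Morrison identity applied to $X^{T}(\Gamma^{\langle i\rangle})^{2}X=X^{T}X-(1-\gamma)X_{(i)}^{T}X_{(i)}$. They are assumed, not confirmed, to coincide with (39) and (40): $\ell_{i}(\Gamma^{\langle i\rangle}X)=\gamma\ell_{i}/(1-(1-\gamma)\ell_{i})$ and $\ell_{j}(\Gamma^{\langle i\rangle}X)=\ell_{j}+(1-\gamma)\ell_{ij}^{2}/(1-(1-\gamma)\ell_{i})$ for $j\neq i$. The helper name `leverage_matrix` is hypothetical.

```python
import numpy as np

def leverage_matrix(X):
    """U_X U_X^T: leverage scores on the diagonal, cross leverage scores elsewhere."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U @ U.T

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))          # random stand-in for X
L = leverage_matrix(X)

i, gamma = 7, 0.25                        # row to re-weight, and gamma in (0, 1)
W = X.copy()
W[i] *= np.sqrt(gamma)                    # Gamma^{<i>} X
L_new = leverage_matrix(W)

denom = 1.0 - (1.0 - gamma) * L[i, i]
predicted = np.diag(L) + (1.0 - gamma) * L[i]**2 / denom   # rows j != i increase
predicted[i] = gamma * L[i, i] / denom                      # row i itself decreases

print("max deviation from the closed form:", np.max(np.abs(np.diag(L_new) - predicted)))
# Because U_X U_X^T is a projection, the squared cross scores of row i sum to l_i.
print("sum_j l_ij^2 - l_i:", np.sum(L[i]**2) - L[i, i])
```

The second printed quantity reflects the identity $\sum_{j\neq i}\ell_{ij}^{2}(X)=\ell_{i}(X)-\ell_{i}^{2}(X)$, which is the order of magnitude quoted above for the total increase.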

So far we have discussed the projection of the high-dimensional system without providing explicit details on how the basis $\Psi$ is selected. A desired property is to sustain a small projection error for all admissible parameter choices under the constraint $\rho\ll n$ . Suitable options include subsets of the right singular vectors of $A$ or orthogonalised Krylov-subspace bases [11]; however, these have to be computed for each individual parameter vector, which can be detrimental to the speed of the solver. Instead, we opt for a generic basis exploiting the smoothness of $u$ on domains with smooth Lipschitz boundaries. A simple choice is to select the basis among the eigenvectors of the discrete Laplacian operator

[TABLE]

for $Z^{2}_{\Delta}=\mathrm{diag}\bigl([|\Omega_{1}|,\ldots,|\Omega_{k}|]\otimes 1_{d}\bigr)$ . Writing the eigen-decomposition as $U_{\Delta}^{T}\Delta U_{\Delta}=\Sigma_{\Delta}$ , we split the eigenvectors as

[TABLE]

such that the columns of $\Psi$ correspond to the last $\rho$ columns of $U_{\Delta}$ , and respectively to the $\rho$ smallest eigenvalues $\{\lambda_{n-\rho+1}(\Delta),\ldots,\lambda_{n}(\Delta)\}$ . In effect, with $\Delta$ constrained by the Dirichlet boundary conditions, the norm $\|\Delta\Psi^{(i)}\|$ provides a measure of the smoothness of $\Psi^{(i)}$ in the interior of $\Omega$ . It is not difficult to see that this basis satisfies

[TABLE]

We remark that the computation of the basis is very expensive for large $n$ , as the eigen-decomposition of $\Delta$ is necessary; however, this is computed only once, in an offline stage prior to the beginning of the simulation. After the matrix $\Psi$ has been obtained we can compute the leverage scores $\ell_{i}(Z_{\Delta}D\Psi)$ . The Laplacian $\Delta$ differs from a general stiffness matrix $A$ only by different diagonal weights, i.e. $Z^{2}_{\Delta}$ is replaced by the diagonal matrix $Z^{2}=Z^{2}_{\Delta}\mathrm{diag}\bigl[(p_{1},\ldots,p_{k})\otimes 1_{d}\bigr]$ where the $p_{i}$ contain information about the parameter from (1). Propositions 3.6 and 3.7 along with the developments thereafter suggest that the Laplacian leverage scores $\ell_{i}(Z_{\Delta}D\Psi)$ can nonetheless be used to construct sketches $\hat{G}=X^{T}SS^{T}X$ of the projected matrix $G=X^{T}X=\Psi^{T}Y^{T}Y\Psi$ because the difference in the stiffness matrices is just a diagonal weighting.
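The offline construction can be sketched on a one-dimensional toy problem. Everything below is an assumption made for illustration and is not the three-dimensional model of the paper: a uniform mesh of `k` linear elements on $(0,1)$ with homogeneous Dirichlet ends, a dense gradient matrix `D`, and the reading of condition (36) as $q_{i}\geq\beta\,\ell_{i}(X)/\rho$, so that the largest admissible $\beta$ is $\min_{i}\ell_{i}(X_{\Delta})/\ell_{i}(X)$.

```python
import numpy as np

def leverage_scores(X):
    """Squared row norms of an orthonormal basis of the column space of X."""
    Q, _ = np.linalg.qr(X)                 # thin QR, X assumed to have full column rank
    return np.sum(Q**2, axis=1)

k, rho = 200, 10                           # elements and subspace dimension (toy values)
h = 1.0 / k
n = k - 1                                  # interior nodes

# Gradient matrix: row e holds (u_{e+1} - u_e) / h on element e.
D = np.zeros((k, n))
for e in range(k):
    if e < n:
        D[e, e] = 1.0 / h
    if e > 0:
        D[e, e - 1] = -1.0 / h

z_delta = np.full(k, h)                    # diagonal of Z_Delta^2: the element sizes
Delta = D.T @ (z_delta[:, None] * D)       # discrete Laplacian D^T Z_Delta^2 D

# eigh sorts eigenvalues in ascending order, so the smoothest modes come first.
_, eigvecs = np.linalg.eigh(Delta)
Psi = eigvecs[:, :rho]

X_delta = np.sqrt(z_delta)[:, None] * (D @ Psi)
q = leverage_scores(X_delta) / rho         # parameter-independent sampling distribution

# One random parameter realisation and the leverage scores of the weighted matrix.
rng = np.random.default_rng(0)
p = rng.uniform(0.1, 100.0, size=k)
X = np.sqrt(z_delta * p)[:, None] * (D @ Psi)
beta = np.min(rho * q / leverage_scores(X))
print(f"largest admissible beta for this realisation: {beta:.3f}")
```

Since both sets of leverage scores sum to $\rho$, the computed `beta` can never exceed one; how far below one it falls indicates how much the diagonal weighting has shifted the scores.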

4. Complexity and error analysis

Motivated by the developments from the previous sections we propose the following algorithm for computing solutions to a sequence of $N$ problems of the form (1). We assume that each problem is specified by its parameter vector $z^{(t)}\in\mathbb{R}^{kd}$ for $t=1,\dots,N$ (see section 2.1).

The complexity and approximation error of Algorithm 1 are obviously linked. The more samples we draw the better we expect our solutions to be. Although the size of the reduced system matrix $G$ (and therefore its sketched counterpart $\hat{G}$ as well) is independent of $c$ , the computational burden for building $\hat{G}$ is higher when drawing more samples. More precisely, we need:

• $\mathcal{O}(c)$ operations in order to find $\mathbf{i}_{1},\dots,\mathbf{i}_{c}\stackrel{\mathrm{iid}}{\sim}q$ . This is possible because $q$ is fixed and we can perform the necessary pre-processing offline [3].

• $\mathcal{O}(c)$ operations for computing the sampled indices $\{\mathbf{j}_{1},\dots,\mathbf{j}_{c^{\prime}}\}$ and their frequencies $m_{j}$ as this requires a single loop through the set $\{\mathbf{i}_{1},\dots,\mathbf{i}_{c}\}$ of initial samples.

• $\mathcal{O}(c^{\prime})$ operations for assembling the diagonal matrices $M$ and $\hat{Z}$ .

• $\mathcal{O}(c^{\prime}\rho)$ operations for computing $M\hat{Z}D_{(J)}\Psi$ . This can be achieved since computing $M\hat{Z}D_{(J)}$ requires $\mathrm{nnz}(D_{(J)})=\mathcal{O}(c^{\prime})$ multiplications and $\rho\cdot\mathrm{nnz}(M\hat{Z}D_{(J)})=\rho\cdot\mathrm{nnz}(D_{(J)})=\mathcal{O}(\rho c^{\prime})$ multiplications are enough for computing $[M\hat{Z}D_{(J)}]\Psi$ due to sparsity of $D$ .

• $\mathcal{O}(c^{\prime}\rho^{2})$ operations in order to build $\hat{G}$ which corresponds to the cost of multiplication for dense matrices.

• $\mathcal{O}(\rho^{3})$ operations for solving $\hat{G}r=\Psi^{T}b$ with a direct method.
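The steps listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: `sketched_solve` is a hypothetical name, the diagonal of $M$ is taken to be $\sqrt{m_{j}/(c\,q_{j})}$, `D` is stored densely (so the product `D[J] @ Psi` does not attain the $\mathcal{O}(c^{\prime}\rho)$ cost that a sparse `D` allows), `np.unique` sorts and therefore costs $\mathcal{O}(c\log c)$, not $\mathcal{O}(c)$, and the test problem is a one-dimensional toy model with an assumed load $f=1$.

```python
import numpy as np

def sketched_solve(D, Psi, z, q, b, c, rng):
    """One online query: solve G_hat r = Psi^T b for a sketch built from c draws."""
    draws = rng.choice(len(q), size=c, p=q)           # i_1, ..., i_c iid from q
    J, m = np.unique(draws, return_counts=True)       # distinct indices and frequencies m_j
    M = np.sqrt(m / (c * q[J]))                       # diagonal of M (assumed rescaling)
    Z_hat = np.sqrt(z[J])                             # diagonal of Z restricted to J
    SX = (M * Z_hat)[:, None] * (D[J] @ Psi)          # M Z_hat D_(J) Psi
    G_hat = SX.T @ SX                                 # rho x rho sketched matrix
    return np.linalg.solve(G_hat, Psi.T @ b), len(J)

# Toy 1D model: n_el linear elements on (0, 1), Dirichlet ends, dense gradient matrix.
n_el, rho = 400, 10
h = 1.0 / n_el
D = (np.eye(n_el, n_el - 1) - np.eye(n_el, n_el - 1, k=-1)) / h
Psi = np.linalg.eigh(h * D.T @ D)[1][:, :rho]         # smoothest Laplacian eigenvectors

# Offline: sampling distribution from the Laplacian leverage scores.
Q, _ = np.linalg.qr(np.sqrt(h) * D @ Psi)
q = np.sum(Q**2, axis=1)
q /= q.sum()

# Online: one parameter realisation z (diagonal of Z^2) and a load vector b.
rng = np.random.default_rng(0)
z = h * rng.uniform(0.1, 100.0, size=n_el)
b = np.full(n_el - 1, h)

X = np.sqrt(z)[:, None] * (D @ Psi)
r = np.linalg.solve(X.T @ X, Psi.T @ b)               # unsketched reduced solution
r_hat, c_distinct = sketched_solve(D, Psi, z, q, b, 100_000, rng)
rel_err = np.linalg.norm(r_hat - r) / np.linalg.norm(r)
print(f"distinct rows: {c_distinct},  relative error in r: {rel_err:.3e}")
```

In this toy problem nearly every row is drawn at least once, so it does not show the regime $c^{\prime}\ll kd$ reported in section 5. It only demonstrates the sequence of operations.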

The sketch $\hat{G}$ will be singular if we draw $c^{\prime}<\rho$ distinct samples, so any useful run has $c^{\prime}\geq\rho$ and building the sketch $\hat{G}$ dominates the complexity of Algorithm 1. If the sampling probabilities are a good approximation in the sense that $\beta$ in Lemma 3.5 can be chosen close to $1$ , then we need $c=\mathcal{O}(\varepsilon^{-2}\rho\log(\rho))$ samples in order to have a provably controlled error. The worst case, i.e. the largest increase of $\ell_{i}(X)$ , will be observed if $z^{(t)}_{j}\ll z^{(t)}_{i}$ for $j\neq i$ . A parameter $p$ corresponding to such a situation essentially renders the implementation of the classical Galerkin FEM problematic, as $\kappa(A)$ scales with $p_{\max}/p_{\min}$ ; see Theorem 5.2 in [13]. The following theorem summarises the findings of this section.

Theorem 4.1.

Let $\varepsilon\in(0,1)$ and let $\beta\in(0,1]$ be such that the sampling probabilities $q_{i}$ from Algorithm 1 satisfy (36), i.e.

[TABLE]

where $Z^{2}=\mathrm{diag}(z^{(t)})$ . Let $G=X^{T}X=\Psi^{T}D^{T}Z^{2}D\Psi$ be the reduced system matrix corresponding to parameter $z^{(t)}$ and $\kappa(G)$ its condition number. For the choice $c=15\rho\log(15\rho)\beta^{-1}\varepsilon^{-2}$ Algorithm 1 requires $\mathcal{O}(\rho^{3}\log(\rho)\beta^{-1}\varepsilon^{-2})$ operations and outputs, with probability exceeding $0.999$ , a vector $\hat{r}^{(t)}$ that satisfies

[TABLE]

Proof.

As stated before, the complexity of Algorithm 1 is $\mathcal{O}(c\rho^{2})$ which immediately implies that it requires $\mathcal{O}(\rho^{3}\log(\rho)\beta^{-1}\varepsilon^{-2})$ operations for a single query. It remains to prove the error bound. In view of (34) and the developments thereafter it follows, conditional on $\hat{G}$ being invertible, that

[TABLE]

Since $\kappa^{2}(X)=\kappa(G)$ we only need to show that

[TABLE]

because $\hat{G}$ is necessarily invertible on that event which implies validity of the estimates from before. But plugging the value for $c$ into (37) we obtain for any $\rho\geq 1$

[TABLE]

∎
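The sample size prescribed by Theorem 4.1 is easy to evaluate. The helper below, with the hypothetical name `sample_size`, simply computes $c=15\rho\log(15\rho)\beta^{-1}\varepsilon^{-2}$ with the natural logarithm and rounds up. The value of $\beta$ is not known in practice, so the choices shown are illustrative.

```python
import math

def sample_size(rho, eps, beta=1.0):
    """Number of samples c = 15 rho log(15 rho) / (beta eps^2), rounded up."""
    return math.ceil(15 * rho * math.log(15 * rho) / (beta * eps**2))

for rho in (50, 100):
    for beta in (1.0, 0.5):
        c = sample_size(rho, eps=0.1, beta=beta)
        # Leading-order cost c * rho^2 of building the sketch, constants ignored.
        print(f"rho = {rho:3d}  beta = {beta:.1f}  c = {c:9d}  c*rho^2 = {c * rho**2:.2e}")
```

For $\rho=50$, $\varepsilon=0.1$ and $\beta=1$ this gives roughly $5\cdot 10^{5}$ samples, and halving $\beta$ doubles the requirement.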

Algorithm 1 is most attractive when we can tolerate an error somewhere between 1% and 10%, in which case we can obtain the solution to a single query in about $\mathcal{O}(\beta^{-1}\rho^{3}\log(\rho))$ time. In practice the value of $\beta$ is unobtainable since it requires knowledge of the true leverage scores, but considering Proposition 3.7 and the arguments thereafter, we expect that for a moderately large $\beta^{-1}$ the required bound will hold for all but a few small leverage scores. The statement in Lemma 3.5 is rather pessimistic when there are few misaligned leverage scores since it requires a uniform bound. For practical purposes we expect that $\beta^{-1}$ can be substituted with a small constant and we take $\varepsilon=0.1$ which will ensure regularity of the sketch. Up until now we have only considered the randomisation error of the sketched solution, i.e. we have analysed $\|\hat{u}_{\mathrm{reg}}-u_{\mathrm{reg}}\|$ . However, the total error of $\hat{u}_{\mathrm{reg}}$ compared to the high dimensional solution $u$ of (12) has two components. If we decompose the process into two steps

[TABLE]

it becomes apparent that even with a perfect sketch, i.e. if we solved the noiseless projected problem (20) and (46) is negligible, we could still not achieve an error smaller than $\|u_{\mathrm{opt}}-\Pi u_{\mathrm{opt}}\|$ . The next result tells us that the error from (45) is close to the optimal one.

Theorem 4.2.

Let $u_{\mathrm{opt}}$ be the solution of (12) and $u_{\mathrm{reg}}$ be the optimum of (20). If $\kappa(A)$ is the condition number of the stiffness matrix $A$ and $\Pi=\Psi\Psi^{T}$ the projection onto $\mathcal{S}_{\rho}$ , then

[TABLE]

Proof.

Recall that $A=Y^{T}Y$ and $G=X^{T}X=\Psi^{T}Y^{T}Y\Psi$ . From the developments in Lemma 3.2 we know that $u_{\mathrm{reg}}=\Psi G^{-1}\Psi^{T}b$ . We may write as before $X=U_{X}\Sigma_{X}V_{X}^{T}$ so that $G^{-1}=V_{X}\Sigma_{X}^{-2}V_{X}^{T}$ and

[TABLE]

If we write $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ for the smallest and largest eigenvalues of $A$ , then it must hold that

[TABLE]

because $\Psi$ has orthogonal columns. Indeed, if $\mathbb{S}^{n-1}\doteq\{w\in\mathbb{R}^{n}:\|w\|=1\}$ is the unit sphere in $\mathbb{R}^{n}$ , then

[TABLE]

is obviously true. Since the columns of $\Psi$ form an ONB of $\mathcal{S}_{\rho}$ we have

[TABLE]

Thus, $\|\Sigma^{-1}_{X}\|^{2}=\lambda^{-1}_{\min}(G)\leq\lambda^{-1}_{\min}(A)$ . Clearly we also have $\|Y\|^{2}=\lambda_{\max}(A)$ . Due to orthogonality we know that $\|\Psi\|=\|V_{X}\|=\|U_{X}\|=1$ . Combining those estimates we obtain

[TABLE]

which yields the desired bound. ∎
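The eigenvalue relations used in the proof, namely $\lambda_{\min}(A)\leq\lambda_{\min}(G)$ and $\lambda_{\max}(G)\leq\lambda_{\max}(A)$ for $G=\Psi^{T}A\Psi$ with orthonormal $\Psi$, can be confirmed numerically. The matrices below are random stand-ins chosen only for illustration; they are not finite element matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 60, 8

Y = rng.standard_normal((150, n))          # random stand-in for Y
A = Y.T @ Y                                # symmetric positive definite "stiffness" matrix
Psi, _ = np.linalg.qr(rng.standard_normal((n, rho)))   # orthonormal basis of a subspace
G = Psi.T @ A @ Psi                        # reduced matrix

lam_A = np.linalg.eigvalsh(A)              # ascending eigenvalues
lam_G = np.linalg.eigvalsh(G)
kappa_A = lam_A[-1] / lam_A[0]
kappa_G = lam_G[-1] / lam_G[0]

print(f"lambda_min: A = {lam_A[0]:.2f} <= G = {lam_G[0]:.2f}")
print(f"lambda_max: G = {lam_G[-1]:.2f} <= A = {lam_A[-1]:.2f}")
print(f"kappa(G) = {kappa_G:.2f} <= kappa(A) = {kappa_A:.2f}")
```

Both inequalities follow from the Rayleigh quotient of $A$ restricted to $\mathcal{S}_{\rho}$, so the projection can only improve the conditioning.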

If the subspace $\mathcal{S}_{\rho}$ is such that the relative projection error is small, then the norm of $u_{\mathrm{reg}}$ will be similar to the norm of $u_{\mathrm{opt}}$ . More precisely,

[TABLE]

so that Theorem 4.1 applies to $\|u_{\mathrm{reg}}-\hat{u}_{\mathrm{reg}}\|/\|u_{\mathrm{opt}}\|$ with a small $\delta$ -dependent constant. By combining the previous two theorems we obtain the following.

Corollary 4.3.

Let $\varepsilon_{\mathrm{R}}\in(0,1)$ and assume that the assumptions of Theorem 4.1 are satisfied for $\varepsilon=\varepsilon_{\mathrm{R}}$ . If $u_{\mathrm{opt}}$ is the solution of (12) and the subspace $\mathcal{S}_{\rho}$ is such that

[TABLE]

for some $\varepsilon_{\mathrm{P}}\in(0,1)$ , then the total error of the solution $\hat{u}_{\mathrm{reg}}=\Psi\hat{r}$ produced by Algorithm 1 satisfies the bound

[TABLE]

Proof.

We can start with the estimate

[TABLE]

Using the estimate from Theorem 4.2 we get

[TABLE]

It remains to bound the other term. Since $\Psi$ has orthogonal columns we obtain from Theorem 4.1

[TABLE]

Since we have shown in the proof of Theorem 4.2 that

[TABLE]

we can estimate

[TABLE]

As before, we have used the fact that

[TABLE]

From $\|u_{\mathrm{opt}}-\Pi u_{\mathrm{opt}}\|\leq\varepsilon_{\mathrm{P}}\|u_{\mathrm{opt}}\|$ it follows that

[TABLE]

which completes the proof. ∎

If we assume that $\varepsilon_{\mathrm{P}}\sqrt{\kappa(G)}\approx 1$ , then the error estimate from Corollary 4.3 states, with small leading constants, that

[TABLE]

It therefore makes sense to have a sketching error $\varepsilon_{\mathrm{R}}$ that is of the same order as the projection error $\varepsilon_{\mathrm{P}}$ . In practice we found that projection errors of roughly 1% to 10% can be expected so that the sketching induced error isn’t very harmful if we choose the sample size as in Theorem 4.1 with $\varepsilon_{\mathrm{R}}=0.1$ .

5. Numerical results

To test the performance of Algorithm 1 we consider the finite element formulation of the elliptic equation (1) with homogeneous Dirichlet boundary conditions $u=0$ on $\partial\Omega$ and a forcing term derived from a piecewise constant approximation of the function

[TABLE]

We discretise the model on a spherical domain $\Omega$ $(d=3)$ of unit radius comprising $k=684560$ unstructured linear tetrahedral elements. This leads to a total of $116805$ nodes of which $n=101509$ are situated in the interior of the domain. In these circumstances $X$ is a tall matrix with $2053680$ rows, the stiffness matrix $A$ has dimensions $101509\times 101509$ and the sample space is $[2053680]$ .

We seek to assess the practical performance of our algorithm in terms of its speed and accuracy in computing the sketched solution under various choices of sampling budgets and low-dimensional subspaces, for the proposed sampling distribution. To achieve this we perform three benchmark tests involving realisations of (i) a uniformly distributed random parameter field, (ii) a smoothly varying lognormal random field, and (iii) a random field with jump discontinuities. For each of these we run a sequence of $N=100$ simulations, i.e. $p$ queries, and record timings and error measures on average. For each realisation we also compute the conventional FEM solution to provide a reference for comparison. The high-dimensional $u_{\mathrm{opt}}$ is computed using Matlab’s built-in A\b command [17], and the times provided include the efficient assembly of the full stiffness matrix as a triple product of sparse matrices $A=D^{T}Z^{2}D$ . Our code was implemented in Matlab R2018b and executed on a workstation equipped with two 14-core Intel Xeon dual processors, running Linux NixOS with 384GB RAM.

In the offline phase of Algorithm 1 we form a low-dimensional ONB for the projection by computing the last eigenfunctions of the sparse Laplacian matrix discretised on $\Omega$ . For this time consuming and memory demanding operation we have resorted to the svds and qr commands, which respectively avoid computing the complete spectrum and produce a sparse ONB. The computation of the sampling distribution based on the leverage scores of $X_{\Delta}=Z_{\Delta}D\Psi$ was also performed once during the offline phase and took about 4 hours, using the svd(,’econ’) command. The distribution $q$ was sampled with replacement during the online phase of the algorithm using the efficient command datasample, which, for the chosen $q$ , outputs a million samples in less than 0.3 s. Notice that although this sampling implementation is not independent of the dimension $kd$ , there exist alternative schemes that can handle arbitrarily large distributions with constant complexity [3].
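As one classical illustration of constant-time sampling from a fixed discrete distribution, the sketch below implements the alias method in Vose's formulation. It is a swapped-in textbook technique, not necessarily the scheme analysed in [3] and not the Matlab `datasample` routine. The function names and the small test distribution are hypothetical.

```python
import random

def build_alias_table(q):
    """O(len(q)) preprocessing of a probability vector q into alias tables."""
    n = len(q)
    scaled = [qi * n for qi in q]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]               # keep column s with this probability ...
        alias[s] = l                      # ... otherwise redirect the draw to l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:               # leftovers equal 1 up to rounding
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, rng):
    """One O(1) draw: pick a column uniformly, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

q = [0.5, 0.3, 0.15, 0.05]
prob, alias = build_alias_table(q)
rng = random.Random(0)
counts = [0] * len(q)
for _ in range(100_000):
    counts[alias_draw(prob, alias, rng)] += 1
print([cnt / 100_000 for cnt in counts])
```

The table is built once offline, after which each draw costs one uniform integer, one uniform real and one comparison, independently of the length of `q`.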

In the implementation of the algorithm we record the following diagnostic quantities, which provide evidence on the performance in the conditions of each benchmark: the ratio $c^{\prime}/3k$ indicating how many of the rows of $X$ are used in the sketch, the relative subspace projection error $\|\Pi u_{\mathrm{opt}}-u_{\mathrm{opt}}\|/\|u_{\mathrm{opt}}\|$ , the upper bound of the randomisation error $\|\hat{G}^{-1}G-I\|$ , the relative regression error $\|\hat{u}_{\mathrm{reg}}-u_{\mathrm{reg}}\|/\|u_{\mathrm{reg}}\|$ , and the relative total error $\|\hat{u}_{\mathrm{reg}}-u_{\mathrm{opt}}\|/\|u_{\mathrm{opt}}\|$ . In the context of real-time model prediction in manufacturing processes an upper limit of 10% for the total error is deemed reasonable.
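The recorded quantities translate directly into code. The function below is a hypothetical helper, not taken from the authors' implementation, and the data used to exercise it are synthetic: `A` is a random positive definite matrix, `Psi` a random orthonormal basis, and `G_hat` an artificially perturbed copy of `G`, not a genuine sketch.

```python
import numpy as np

def diagnostics(u_opt, u_reg, u_hat, G, G_hat, Psi, c_distinct, n_rows):
    """Return the five diagnostics as a dictionary (spectral norm for the matrix term)."""
    nrm = np.linalg.norm
    proj = Psi @ (Psi.T @ u_opt)                       # Pi u_opt
    return {
        "row_fraction": c_distinct / n_rows,           # c' / (3k) in the 3D setting
        "projection_error": nrm(proj - u_opt) / nrm(u_opt),
        "sketch_factor": nrm(np.linalg.solve(G_hat, G) - np.eye(G.shape[0]), 2),
        "regression_error": nrm(u_hat - u_reg) / nrm(u_reg),
        "total_error": nrm(u_hat - u_opt) / nrm(u_opt),
    }

# Synthetic data, for exercising the function only.
rng = np.random.default_rng(0)
n, rho = 40, 5
Y = rng.standard_normal((120, n))
A = Y.T @ Y
b = rng.standard_normal(n)
Psi, _ = np.linalg.qr(rng.standard_normal((n, rho)))
G = Psi.T @ A @ Psi
G_hat = G + 1e-3 * np.diag(np.diag(G))                 # artificial perturbation of G

u_opt = np.linalg.solve(A, b)
u_reg = Psi @ np.linalg.solve(G, Psi.T @ b)
u_hat = Psi @ np.linalg.solve(G_hat, Psi.T @ b)

d = diagnostics(u_opt, u_reg, u_hat, G, G_hat, Psi, c_distinct=30, n_rows=120)
for name, value in d.items():
    print(f"{name:18s} {value:.4e}")
```

Because the sketched solution lies in $\mathcal{S}_{\rho}$, its total error can never fall below the projection error. With a random, non-smooth basis as used here, the projection error is of course large.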

5.1. Uniformly random parameter field

In this first instance we simulate sketched solutions for 100 parameter vectors $p\in\mathbb{R}^{k}$ drawn at random from $\mathcal{U}\bigl([10^{-1},10^{2}]\bigr)$ . Five sets of simulations were performed using ONBs incorporating the last $\rho\in\{50,100\}$ singular functions of the Laplacian. Our focus was on monitoring the trade-off between accuracy and time consumption when $c\in\{5\times 10^{5},10^{6},5\times 10^{6}\}$ iid samples are drawn from $q$ . The results are tabulated in table 1.

Although the values in $p$ vary over three orders of magnitude, the parameter has a homogeneous expectation within the domain and thus overall the algorithm yields sketched solutions at 10% or less total error, with only 100 basis functions. The results show that the sampling is highly non-uniform since even in the case where a million iid samples were taken these involved only 41074, a mere 6%, of the rows of $X$ . The sketching-induced error factor $\|\hat{G}^{-1}G-I\|$ appears to reduce almost linearly with the number of samples $c$ . Comparing the relative subspace projection error $\|\Pi u_{\mathrm{opt}}-u_{\mathrm{opt}}\|$ with the total error $\|\hat{u}_{\mathrm{reg}}-u_{\mathrm{opt}}\|$ , note that for $\|\hat{G}^{-1}G-I\|\approx 1$ the latter is only marginally larger than the former, which verifies the regularising effect of the projection on the sketching-induced noise. It is also important to see that in switching from $\rho=50$ to $\rho=100$ the projection error is halved to 0.03; however, the number of samples necessary to yield the same levels of the error increases by about 5 times. For relative error tolerances around the 10% mark, the times recorded are below 1 s, while by comparison the time for computing $u_{\mathrm{opt}}$ was on average found to be 23.75 s.

The trade-off between speed and accuracy can be seen by comparing the results in the first and last rows of table 1, where the algorithm achieves a 4% total error, when the projection error is at 3%, after five million samples. On the other hand, solutions within a 10% error margin, when the projection error is at 7%, are obtained in less than 0.5 s, which is 55 times faster than computing the standard $u_{\mathrm{opt}}$ . Sketching the more accurate solution with $\rho=100$ and $c=5$ million is still 7 times faster than the FEM solver. The histograms in figure 1 provide a further insight on how the various error components vary within the ensemble of the 100 problems. We point out that the numerical results are in good agreement with the assertion of Theorem 4.1. For the example shown in figure 1, i.e. when $\rho=50$ and the error tolerance is $\varepsilon=10\%$ , our theorem predicts $c=15\rho\log(15\rho)\beta^{-1}\varepsilon^{-2}\approx 5.0\cdot 10^{5}\beta^{-1}$ samples, which is consistent with the observed $c=10^{6}$ when $\beta^{-1}\approx 2$ . In the histograms we see that the sketching error virtually never exceeds $10\%$ and that $\|\hat{G}^{-1}G-I\|$ exhibits the same pattern as $\|u_{\mathrm{opt}}-u_{\mathrm{reg}}\|/\|u_{\mathrm{opt}}\|$ which supports the claim that this quantity is driving the sketching error. Similar observations can be made for the other cases of table 1. Figure 1 also shows that, although their magnitude is comparable, the variability in the projection error is much smaller than that of the sketching error. This is not surprising as the sketching is an intrinsically random method while the differences in the projection are only due to perturbations in the parameter.

5.2. Smooth parameter field

In the second benchmark we turn our attention to parameter functions with smooth spatial variation like those encountered in the context of uncertainty quantification for PDEs [16]. As the anticipated FEM solution is smooth we maintain the bases used in 5.1. In this case, the parameter $p$ is a lognormal random field given by $p\doteq\exp(b)$ , where $b$ is a zero-mean Gaussian random field with Whittle-Matérn covariance function with smoothness parameter $\nu>0$ given by

[TABLE]

where $\Gamma(\nu)$ is the Gamma function, $\|x\|_{M}^{2}=x^{T}M^{-1}x$ is the weighted Euclidean norm with positive definite matrix $M$ and $K_{\nu}$ is the order $\nu>0$ modified Bessel function of the second kind. Here we use $\nu=15/2$ , $M^{1/2}=\mathrm{diag}(1/5,1/5,1/5)$ and $\operatorname{Var}[b]=1$ . We draw realisations of $p$ by calculating once the Karhunen-Loève expansion of $b$ and then drawing iid from $\mathcal{N}(0,1)$ .

The results presented in table 2 show a similar performance to the uniformly random case in subsection 5.1. The suitability of the low-dimensional subspace is evidenced by the 7% relative projection error attained at $\rho=50$ . Sketched solutions within an error tolerance of 10% were computed in less than 1 s. Further, note that the total error is within a 2% margin of the projection error, which demonstrates the effectiveness of our sketching regularisation approach. The exception is the test with $\rho=100$ and $c=1$ million, where $\|\hat{G}^{-1}G-I\|$ is considerably higher, implying that $c$ was not large enough for that test. This observation is consistent with the error bound in Theorem 4.1. Comparing the results for $(\rho=50,c=5)$ and $(\rho=100,c=1)$ , with $c$ in millions, shows that in the former case, despite using five times more samples, the total error is still 1% larger than in the latter because halving the number of basis functions increases the projection error. The images presented in figure 2 correspond to one of the simulations in this benchmark with $\rho=100$ and $c=1$ million, illustrating a cross section of the profile of $p$ , the exact FEM solution, the sketched solution and the relative error between the two.

5.3. Non-smooth parameter field

A more challenging benchmark test is to consider the FEM solution for a parameter field with non-smooth variation. In this case it is natural to anticipate that any significant jump discontinuities in the profile of $p$ will have an adverse effect on the condition number of the stiffness matrix [13]. For our simulations we choose a piecewise constant approximation of the positive function

[TABLE]

which is discontinuous along the three axes. The sign function $\mathrm{sgn}:\mathbb{R}\to\mathbb{R}$ is given by $\mathrm{sgn}(x)=x/\lvert x\rvert$ when $x\neq 0$ and $\mathrm{sgn}(0)=0$ . In constructing the projection subspace we found that the smooth basis utilised in the previous cases was not appropriate for this case and we thus resorted to a sparse ONB, taking a subset of the columns of the sparse unitary matrix computed from the QR decomposition of the Laplacian.

The results in table 3 suggest that the chosen basis is not very appropriate since not only is the number of basis functions substantially larger, but the reduction in the projection error for a 100% increase in $\rho$ is also quite marginal. In turn, this increase in the dimension of $\hat{G}$ affects the level of sketching error, as even with $c=5$ million samples $\|\hat{G}^{-1}G-I\|>1$ . Consequently, this has a profound effect on timings, although the sketched approach maintains a fivefold advantage over the standard FEM solver. For the tests with $(\rho=2\times 10^{3},c=10^{6})$ and $(\rho=2\times 10^{3},c=5\times 10^{6})$ notice that increasing the samples by five times does not yield a significant improvement in the results, which is likely triggered by the large $\kappa(A)\approx 10^{5}$ in the error term of Theorem 4.2, which causes $\|u_{\mathrm{reg}}-u_{\mathrm{opt}}\|$ to grow.

6. Conclusions

We have considered expediting the solution of the finite element method equations arising from the discretisation of elliptic PDEs on high-dimensional models. Taking into consideration the multi-query context and the smooth profile of the FEM solution, we proposed a practical sketch-based algorithm that involves projection onto a lower-dimensional subspace and sketching using a generic sampling distribution derived from the leverage scores of a tall matrix associated with the Laplacian operator. We have elaborated on the impact of the projection in reducing the dimensionality as well as mitigating the effects of sketching noise. The performance of our method was evaluated in a series of benchmark tests of FEM simulations that demonstrated substantial speed improvements at the cost of a small compromise in accuracy when the stiffness matrix is well conditioned.

Bibliography

[1] Avron, H., and Toledo, S. Effective Stiffness: Generalizing Effective Resistance Sampling to Finite Element Matrices. arXiv, Oct 2011.
[2] Bertsekas, D. P., and Yu, H. Projected equation methods for approximate solution of large linear systems. Journal of Computational and Applied Mathematics 227, 1 (2009), 27–50.
[3] Bringmann, K., and Panagiotou, K. Efficient sampling methods for discrete distributions. Algorithmica 79, 2 (Oct 2017), 484–508.
[4] Calvetti, D., Dunlop, M., Somersalo, E., and Stuart, A. Iterative updating of model error for Bayesian inversion. Inverse Problems 34, 2 (Feb 2018), 025008.
[5] Cohen, M. B., Lee, Y. T., Musco, C., Musco, C., Peng, R., and Sidford, A. Uniform sampling for matrix approximation. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science (New York, NY, USA, 2015), ITCS ’15, ACM, pp. 181–190.
[6] Drineas, P., and Mahoney, M. W. Effective Resistances, Statistical Leverage, and Applications to Linear Equation Solving. arXiv, May 2010.
[7] Drineas, P., Magdon-Ismail, M., Mahoney, M. W., and Woodruff, D. P. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research 13, 1 (2012), 3441–3472.
[8] Elman, H., Silvester, D., and Wathen, A. Finite Elements and Fast Iterative Solvers, 2nd ed. Oxford University Press, 2014.
