Multilevel adaptive sparse Leja approximations for Bayesian inverse problems

Ionut-Gabriel Farcas; Jonas Latz; Elisabeth Ullmann; Tobias Neckel; Hans-Joachim Bungartz

arXiv:1904.12204·stat.CO·July 15, 2020

TL;DR

This paper introduces a multilevel adaptive sparse Leja algorithm that efficiently approximates Bayesian posteriors in inverse problems by combining coarse and fine model discretizations with adaptive sparse grids, reducing computational costs.

Contribution

It proposes a novel multilevel adaptive sparse Leja method that improves Bayesian inverse problem solutions by efficiently focusing on high posterior probability regions.

Findings

1. The algorithm accurately approximates posteriors with fewer expensive model evaluations.

2. It outperforms standard multilevel methods and MCMC in computational efficiency.

3. Numerical experiments demonstrate effectiveness in 2D and 3D elliptic inverse problems.

Abstract

Deterministic interpolation and quadrature methods are often unsuitable to address Bayesian inverse problems depending on computationally expensive forward mathematical models. While interpolation may give precise posterior approximations, deterministic quadrature is usually unable to efficiently investigate an informative and thus concentrated likelihood. This leads to a large number of required expensive evaluations of the mathematical model. To overcome these challenges, we formulate and test a multilevel adaptive sparse Leja algorithm. At each level, adaptive sparse grid interpolation and quadrature are used to approximate the posterior and perform all quadrature operations, respectively. Specifically, our algorithm uses coarse discretizations of the underlying mathematical model to investigate the parameter space and to identify areas of high posterior probability. Adaptive sparse…

Tables

Table 1: Results for the quadrature problem Eq. 16 using a reference MH solution with $3\cdot 10^{5}$ samples, integration w.r.t. the prior density as in Eq. 17, and our proposed approach, in which we integrate w.r.t. the Gaussian approximation of the posterior, as shown in Eq. 18.

Method	No. quadrature points	Result
MH	$3 \cdot 10^{5}$	0.33813
Integration w.r.t. $π_{0}$	1603	0.33813
Integration w.r.t. ${\hat{π}}^{𝒚} (𝜽)$	49	0.33811

Table 2: Multilevel setup for the 2D inversion problem with forward model Eq. 19.

Level	$h$	$\mathrm{tol}^{in}$	$𝝉^{in}$	$\mathrm{tol}^{qu}$
$ℓ (1)$	$h_{1} = \sqrt{2} / 2^{4}$	$\mathrm{tol}_{3}^{in} = 10^{- 5}$	$𝝉_{3}^{in} = (10^{- 7}, 10^{- 7})$	$\mathrm{tol}_{3}^{qu} = 10^{- 12}$
$ℓ (2)$	$h_{2} = \sqrt{2} / 2^{5}$	$\mathrm{tol}_{2}^{in} = 10^{- 4}$	$𝝉_{2}^{in} = (10^{- 6}, 10^{- 6})$	$\mathrm{tol}_{2}^{qu} = 10^{- 11}$
$ℓ (3)$	$h_{3} = \sqrt{2} / 2^{6}$	$\mathrm{tol}_{1}^{in} = 10^{- 3}$	$𝝉_{1}^{in} = (10^{- 5}, 10^{- 5})$	$\mathrm{tol}_{1}^{qu} = 10^{- 10}$

Table 3: Comparison of estimates of $\mathbb{E}[\pi^{\boldsymbol{y}}(\boldsymbol{\theta})]$ for the source inversion problem with forward model Eq. 19. We first compute a reference solution using $2\cdot 10^{5}$ MH samples. Afterwards, we employ StdML and the two variants of our proposed multilevel approach, MLLejaStd and MLLejaDV.

Method	$𝔼_{π^{𝒚}} [𝜽]$
MH	$(0.3628, 0.6370)$
StdML	$(0.3631, 0.6368)$
MLLejaStd	$(0.3630, 0.6369)$
MLLejaDV	$(0.3630, 0.6369)$

Table 4: Multilevel setup for the 2D inversion problem with forward model Section 5.3.

Level	$h$	$\mathrm{tol}^{in}$	$𝝉^{in}$	$\mathrm{tol}^{qu}$
$ℓ (1)$	$h_{1} = \sqrt{2} / 2^{5}$	$\mathrm{tol}_{3}^{in} = 10^{- 6}$	$𝝉_{3}^{in} = (10^{- 8}, 10^{- 8})$	$\mathrm{tol}_{3}^{qu} = 10^{- 13}$
$ℓ (2)$	$h_{2} = \sqrt{2} / 2^{6}$	$\mathrm{tol}_{2}^{in} = 10^{- 5}$	$𝝉_{2}^{in} = (10^{- 7}, 10^{- 7})$	$\mathrm{tol}_{2}^{qu} = 10^{- 12}$
$ℓ (3)$	$h_{3} = \sqrt{2} / 2^{7}$	$\mathrm{tol}_{1}^{in} = 10^{- 4}$	$𝝉_{1}^{in} = (10^{- 6}, 10^{- 6})$	$\mathrm{tol}_{1}^{qu} = 10^{- 11}$

Table 5: Comparison of estimates of $\mathbb{E}[\pi^{\boldsymbol{y}}(\boldsymbol{\theta})]$ for the source inversion problem with forward model Section 5.3. We first compute a reference solution using $2\cdot 10^{5}$ MH samples. Afterwards, we employ StdML and the two variants of our proposed multilevel approach, MLLejaStd and MLLejaDV.

Method	$𝔼_{π^{𝒚}} [𝜽]$
MH	$(0.5032, 0.5068)$
StdML	$(0.5002, 0.5002)$
MLLejaStd	$(0.6688, 0.6548)$
MLLejaDV	$(0.6648, 0.6548)$

Table 6: Multilevel setup for the 8D inversion problem with forward model Eq. 21.

Level	$h$	$\mathrm{tol}^{in}$	$𝝉^{in}$	$\mathrm{tol}^{qu}$
$ℓ (1)$	$h_{1} = \sqrt{3} / 2^{4}$	$\mathrm{tol}_{3}^{in} = 10^{- 5}$	$𝝉_{3}^{in} = 10^{- 7} \cdot 𝟏_{8}$	$\mathrm{tol}_{3}^{qu} = 10^{- 9}$
$ℓ (2)$	$h_{2} = \sqrt{3} / 2^{5}$	$\mathrm{tol}_{2}^{in} = 10^{- 4}$	$𝝉_{2}^{in} = 10^{- 6} \cdot 𝟏_{8}$	$\mathrm{tol}_{2}^{qu} = 10^{- 8}$
$ℓ (3)$	$h_{3} = \sqrt{3} / 2^{6}$	$\mathrm{tol}_{1}^{in} = 10^{- 3}$	$𝝉_{1}^{in} = 10^{- 5} \cdot 𝟏_{8}$	$\mathrm{tol}_{1}^{qu} = 10^{- 7}$

Table 7: Estimation of the posterior’s mean value for the fourth test case using a reference MH solution with $10^{5}$ samples and the two variants of our proposed multilevel approach for Bayesian inversion.

Method	$𝔼_{π^{𝒚}} [θ_{1}]$	$𝔼_{π^{𝒚}} [θ_{2}]$	$𝔼_{π^{𝒚}} [θ_{3}]$	$𝔼_{π^{𝒚}} [θ_{4}]$	$𝔼_{π^{𝒚}} [θ_{5}]$	$𝔼_{π^{𝒚}} [θ_{6}]$	$𝔼_{π^{𝒚}} [θ_{7}]$	$𝔼_{π^{𝒚}} [θ_{8}]$
MH	$0.2532$	$0.2123$	$- 0.1363$	$0.1326$	$0.1486$	$0.0753$	$- 0.1584$	$0.0066$
MLStd	$0.2642$	$0.2111$	$- 0.1630$	$0.1539$	$0.1429$	$0.0670$	$- 0.1816$	$- 0.0053$
MLDV	$0.2620$	$0.2114$	$- 0.1600$	$0.1542$	$0.1448$	$0.0689$	$- 0.1808$	$- 0.0050$

Equations

G (θ) + η = y

L (θ ∣ y)

Φ (θ; y)

π^{y} (θ)

Z (y)

E_{μ^{y}} [g] = \int_{X} g (θ) π^{y} (θ) d θ .

E_{μ^{y}} [g] = \frac{1}{Z ( y )} \int g (θ) L (θ ∣ y) π_{0} (θ) d θ = \frac{E _{μ_{0}} [ g ( \cdot ) L ( \cdot ∣ y )]}{E _{μ_{0}} [ L ( \cdot ∣ y )]} .

Δ_{k}^{i} [f^{i}] := U_{k}^{i} [f^{i}] - U_{k - 1}^{i} [f^{i}], i = 1, 2, \dots, N_{sto},

U_{K} [f^{N_{sto}}] = \sum_{k \in K} (Δ_{k_{1}}^{1} \otimes Δ_{k_{2}}^{2} \otimes \dots \otimes Δ_{k_{N_{sto}}}^{N_{sto}}) [f^{N_{sto}}] = \sum_{k \in K} Δ_{k} [f^{N_{sto}}],

θ_{1} = \operatorname{argmax}_{θ \in X_{i}} w (θ), θ_{n} = \operatorname{argmax}_{θ \in X_{i}} w (θ) \prod_{m = 1}^{n - 1} |θ - θ_{m}|, n = 2, 3, \dots

X = \bigotimes_{i = 1}^{N_{sto}} X_{i},

π_{0} (θ) = \prod_{i = 1}^{N_{sto}} π_{0, i} (θ_{i}),

tol_{1}^{op} \leq tol_{2}^{op} \leq \dots \leq tol_{J}^{op} .

m_{ℓ (1)}^{i} := Z_{ℓ (1)}^{- 1} (y) \int_{X} θ_{i} L_{ℓ (1)} (θ ∣ y) π_{0} (θ) d θ

C_{ℓ (1)}^{n p} := Z_{ℓ (1)}^{- 1} (y) \int_{X} θ_{n} θ_{p} L_{ℓ (1)} (θ ∣ y) π_{0} (θ) d θ - m_{ℓ (1)}^{n} m_{ℓ (1)}^{p},

Φ_{ℓ (1)} (θ; y) = AdaptSGInterp (tol_{1}^{in}, K_{max}^{in}, Φ_{1}, π_{0})

Z_{ℓ (1)} (y) = AdaptSGQuad (tol_{1}^{qu}, K_{max}^{qu}, L_{ℓ (1)}, π_{0})

π_{ℓ (1)}^{y} (θ) := \frac{π _{0} ( θ ) L _{ℓ (1)} ( θ ∣ y )}{Z _{ℓ (1)} ( y )}

π_{ℓ (j - 1)}^{y} (θ) := N (m_{ℓ (j - 1)}, C_{ℓ (j - 1)}),

θ = T_{ℓ (j - 1)} (ζ) := m_{ℓ (j - 1)} + C_{ℓ (j - 1)}^{1/2} ζ \Rightarrow θ \sim N (m_{ℓ (j - 1)}, C_{ℓ (j - 1)}),

π_{j}^{y} (θ) := \frac{L _{j} ( θ ∣ y ) π _{0} ( θ )}{Z _{j} ( y )} = \frac{L _{j} ( θ ∣ y ) π _{0} ( θ )}{Z _{j} ( y )} \frac{L _{j - 1} ( θ ∣ y )}{L _{j - 1} ( θ ∣ y )} \frac{Z _{j - 1} ( y )}{Z _{j - 1} ( y )} = \frac{L _{j - 1} ( θ ∣ y ) π _{0} ( θ )}{Z _{j - 1} ( y )} \frac{L _{j} ( θ ∣ y )}{L _{j - 1} ( θ ∣ y )} \frac{Z _{j - 1} ( y )}{Z _{j} ( y )} = \frac{π _{j - 1}^{y} ( θ ) \frac{L _{j} ( θ ∣ y )}{L _{j - 1} ( θ ∣ y )}}{\frac{Z _{j} ( y )}{Z _{j - 1} ( y )}} \approx \frac{π _{ℓ (j - 1)}^{y} ( θ ) L _{δ j} ( θ ∣ y )}{Z _{δ j} ( y )} = \frac{π _{ℓ (j - 1)}^{y} ( θ ) L _{δ j} ( θ ∣ y ) \frac{π _{ℓ (j - 1)}^{y} ( θ )}{π _{ℓ (j - 1)}^{y} ( θ )}}{Z _{δ j} ( y )},

L_{δ j} (θ ∣ y) := \frac{\exp ( - Φ _{j} ( θ ; y ))}{\exp ( - Φ _{j - 1} ( θ ; y ))} = \exp (- Φ_{δ j} (θ; y)) .

Φ_{δ j} (θ; y) \approx Φ_{ℓ (δ j)} (T_{ℓ (j - 1)} (ζ); y) .

Z_{δ j} (y) = \int_{X} π_{j - 1}^{y} (θ) L_{δ j} (θ ∣ y) \frac{π _{j - 1}^{y} ( θ )}{π _{j - 1}^{y} ( θ )} d θ \approx \int_{X} π_{ℓ (j - 1)}^{y} (T_{ℓ (j - 1)} (ζ)) \cdot L_{ℓ (δ j)} (T_{ℓ (j - 1)} (ζ) ∣ y) \cdot \frac{π _{ℓ (j - 1)}^{y} ( T _{ℓ (j - 1)} ( ζ ))}{π _{ℓ (j - 1)}^{y} ( T _{ℓ (j - 1)} ( ζ ))} d ζ

π_{j}^{y} (θ) \approx π_{ℓ (j)}^{y} (θ) := Z_{ℓ (δ j)}^{- 1} (y) π_{ℓ (j - 1)}^{y} (T_{ℓ (j - 1)} (ζ)) L_{ℓ (δ j)} (T_{ℓ (j - 1)} (ζ) ∣ y) .

\int_{X} g (T_{ℓ (j - 1)} (ζ)) L_{ℓ (δ j)} (T_{ℓ (j - 1)} (ζ) ∣ y) R_{ℓ (j - 1)} (T_{ℓ (j - 1)} (ζ)) π_{ℓ (j - 1)}^{y} (T_{ℓ (j - 1)} (ζ)) d ζ,

π_{ℓ (1)}^{y} (θ), m_{ℓ (1)}, C_{ℓ (1)} = Level1SparseLeja (h_{1}, tol_{1}^{in}, tol_{1}^{qu}, K_{max}, Φ, π_{0})

π_{ℓ (j - 1)}^{y} (θ) := N (m_{ℓ (j - 1)}, C_{ℓ (j - 1)}), T_{ℓ (j - 1)} (ζ) := m_{ℓ (j - 1)} + C_{ℓ (j - 1)}^{1/2} ζ

Φ_{ℓ (δ j)} (θ; y) = AdaptSGInterp (tol_{j}^{in}, K_{max}^{in}, Φ_{δ j} \circ T_{ℓ (j - 1)} (ζ), π_{ℓ (j - 1)}^{y})

Z_{ℓ (δ j)} (y) = AdaptSGQuad (tol_{j}^{qu}, K_{max}^{qu}, L_{ℓ (δ j)}, π_{ℓ (j - 1)}^{y})

π_{ℓ (j)}^{y} (θ) := \frac{π _{ℓ (j - 1)}^{y} ( θ ) L _{ℓ (δ j)} ( θ ∣ y )}{Z _{ℓ (δ j)} ( y )}

Full text

Multilevel Adaptive Sparse Leja Approximations for Bayesian Inverse Problems

Funding: This work was supported by the German Research Foundation (DFG) through the TUM International Graduate School of Science and Engineering (IGSSE) within Project 10.02 BAYES, and by the EPSRC under grant number EP/R014604/1.

I-G. Farcaș: Department of Informatics, Technical University of Munich, Boltzmannstr. 3, 85748 Garching, Germany.

J. Latz: Department of Mathematics, Technical University of Munich, Boltzmannstr. 3, 85748 Garching, Germany.

E. Ullmann: same affiliation as J. Latz.

T. Neckel and H.-J. Bungartz: same affiliation as I-G. Farcaș.

Abstract

Deterministic interpolation and quadrature methods are often unsuitable to address Bayesian inverse problems depending on computationally expensive forward mathematical models. While interpolation may give precise posterior approximations, deterministic quadrature is usually unable to efficiently investigate an informative and thus concentrated likelihood. This leads to a large number of required expensive evaluations of the mathematical model. To overcome these challenges, we formulate and test a multilevel adaptive sparse Leja algorithm. At each level, adaptive sparse grid interpolation and quadrature are used to approximate the posterior and perform all quadrature operations, respectively. Specifically, our algorithm uses coarse discretizations of the underlying mathematical model to investigate the parameter space and to identify areas of high posterior probability. Adaptive sparse grid algorithms are then used to place points in these areas, and ignore other areas of small posterior probability. The points are weighted Leja points. As the model discretization is coarse, the construction of the sparse grid is computationally efficient. On this sparse grid, the posterior measure can be approximated accurately with few expensive, fine model discretizations. The efficiency of the algorithm can be enhanced further by exploiting more than two discretization levels. We apply the proposed multilevel adaptive sparse Leja algorithm in numerical experiments involving elliptic inverse problems in 2D and 3D space, in which we compare it with Markov chain Monte Carlo sampling and a standard multilevel approximation.

Keywords: Bayesian inference, multilevel method, adaptive sparse grids, partial differential equation

AMS subject classifications: 35R30, 62F15, 65C60, 65D30, 65N30, 68T05

1 Introduction

Mathematical models in science and engineering often require input parameters which cannot be observed directly, yet these parameters are required for predictions based on the model. A standard procedure is to estimate the inputs from indirect observations, which is known as an inverse problem. In contrast, the corresponding forward problem maps from the parameter to the observation space.

In many applications, for instance the geosciences or medical sciences, the observations are noisy and their number is insufficient to identify a unique associated parameter value. The Bayesian approach to inverse problems [21, 41, 43] provides a consistent mechanism to combine noisy or incomplete data with prior knowledge, and to quantify the uncertainty in the parameter estimate. The prior knowledge is incorporated into a probability distribution over the parameter space; this is termed prior (measure) $\mu_{0}$ . The Bayesian solution to the inverse problem is then the posterior (measure) $\mu^{\boldsymbol{y}}$ arising from conditioning the prior $\mu_{0}$ on the observations. Unfortunately, the posterior is often intractable in the sense that it does not admit closed form analytic expressions. Hence approximations have to be used in practice.

Sampling-based posterior approximations such as Markov chain Monte Carlo (MCMC) [1] or Sequential Monte Carlo (SMC) [7] do not rely on the smoothness of the parameter-to-observation map, and can be conducted in high-dimensional parameter spaces. The drawback is that, without exploiting smoothness or low-dimensional structure in the parameter space, a prohibitive number of samples is often required to obtain the desired accuracy. Since each sample entails the evaluation of the forward map, the total cost of the Bayesian inversion becomes prohibitive if the forward model is specified by a partial differential equation (PDE), and is thus computationally expensive. In recent years, many works have suggested computationally feasible, yet accurate approximations of the posterior. In this work we focus on sampling-free approximations for computationally expensive problems for parameter spaces of small to moderate dimension. We assume that the prior and posterior have a probability density function with respect to (w.r.t.) the Lebesgue measure and approximate the posterior density using sparse grid interpolation and sparse grid quadrature.

Sampling-free approaches often involve surrogates for the forward response operator to decrease the computational cost. Typical surrogates are based on generalized polynomial chaos [10, 26, 30, 47], sparse grids [3, 5, 28, 36, 37, 38], Gaussian process regression [22, 42], model reduction [13, 24, 27], and combinations, e.g., sparse grids and reduced bases [3, 4]. To obtain an accurate (and convergent) approximation, these surrogates require certain types of smoothness of the response surface or likelihood function w.r.t. the input parameters. The smoothness assumptions can be weakened by using piecewise polynomial approximations together with Voronoi tessellations of the parameter space [31]. Of course, surrogates can also be used to accelerate sampling-based approximations such as MCMC, see e.g., [26, 33]. We remark that Quasi-Monte Carlo [8, 35] is in principle a sampling-free method which does not rely on surrogates. However, it again requires a smooth approximand, and is often used together with randomization.

Theoretical analysis shows that if the surrogate converges to the forward model at a specific rate w.r.t. the prior weighted $L^{2}$ -norm, then the approximate posterior converges to the exact posterior with at least the same rate [30, 41]. This result has been improved recently in [46], where it was shown that the convergence rate of the posterior approximation is at least twice as large as the convergence rate of the surrogate, for general priors. However, constructing an accurate surrogate over the entire support of the prior might not be feasible and is in fact often unnecessary. Indeed, in inference problems where the data is informative, the posterior differs significantly from the prior, and is supported only in a small part of the prior support. This suggests adapting and localizing the surrogate construction to the support of the posterior. We adopt this approach in our work and construct multilevel, adaptive surrogates with localized support using adaptive sparse grid approximations.

The idea of posterior-focused surrogates is not new; however, it has received little attention in the literature to date. Li and Marzouk [26] borrow ideas from statistics and construct an efficient polynomial chaos surrogate associated with a density that minimizes the cross entropy between the posterior and a family of multivariate normal distributions. Jiang and Ou [20] suggest a two-stage surrogate based on generalized multiscale finite elements and least-squares stochastic collocation. Yan and Zhou [47] propose a multifidelity polynomial chaos surrogate which combines a large number of inexpensive low-fidelity model evaluations with a small number of expensive high-fidelity model evaluations, following the idea of multifidelity approximations [34].

One challenge of posterior-focused surrogates is the need to handle arbitrary densities which can deviate significantly from the prior, which is usually a classical density such as uniform or Gaussian. We address this by constructing adaptive sparse grid approximations based on weighted (L)-Leja sequences (see, e.g., [16, 32]). Note that sparse grid approximations with Leja points have been devised for forward uncertainty propagation in [11, 12, 14, 32]. In [11] the adaptive construction of the points is guided by sensitivity scores, and the strategy is applied in plasma microturbulence analysis. In [14] the Leja points are constructed with the help of an adjoint-based error indicator. The use of Leja points to approximate posterior densities is a novel contribution to the literature. Leja points offer further computational advantages since they are nested and thus allow the reuse of (expensive) model evaluations.

We address the possibly high-dimensional parameter space in Bayesian inversion by the use of multilevel approximations. At each level, dimension-adaptive sparse grids [6, 15, 32] are employed, either in standard form, or using directional variances to better exploit anisotropies in the parameter space. In particular, at the first level, our algorithm uses a coarse discretization of the given model to investigate the parameter space and to identify areas of high posterior probability. Adaptive algorithms are then used to place weighted (L)-Leja points in these areas, and ignore other areas of small posterior probability. Starting with the second level, we sequentially update the prior such that the previous sparse grid approximation of the posterior is reused. In this way, the current posterior measure can be approximated accurately with few expensive, finer model discretizations. We point out that sparse grid approximations are based on point sequences in one dimension. However, starting with the second level, the posterior densities are in general not separable, and hence we cannot rely on a simple tensorization of univariate Leja points. Instead, we construct Leja points w.r.t. a Gaussian approximation of the posterior, which is separable, and we then correct the bias introduced by this approximation in quadrature computations.

The remainder of this paper is structured as follows. In Section 2 we provide the necessary background information. In particular, in Section 2.1 we formulate the Bayesian inverse problem and discuss the computation of posterior expectations by importance sampling. Section 2.2 reviews multilevel approaches, and Section 2.3 discusses generalized sparse grids. Section 3 contains the major contribution of our work, the multilevel adaptive sparse Leja approximation to the posterior density in a Bayesian inverse problem. This method is independent of the specific implementation of the adaptive sparse grid approximations. Details on the used dimension-adaptive approximations are given in Section 4. In Section 5, we present numerical results, comparing our multilevel algorithm with sampling methods and the classical multilevel approach based on telescoping sums. Finally, Section 6 offers concluding remarks.

2 Background

2.1 Bayesian inversion

To begin, we formulate the Bayesian inverse problem. Let $\boldsymbol{X}\subset\mathbb{R}^{N_{\mathrm{sto}}}$ denote the parameter space. In addition, let $\boldsymbol{Y}=\mathbb{R}^{N_{\mathrm{obs}}}$ be a separable Banach space that denotes the data space. Here, $N_{\mathrm{sto}}$ is the dimension of the parameter space, and $N_{\mathrm{obs}}$ is the number of observations. Notably, the parameter and data space are finite-dimensional. This allows us to work with densities w.r.t. the Lebesgue measure. The underlying mathematical model is formalized by a function $\mathcal{G}:\boldsymbol{X}\rightarrow\boldsymbol{Y}$ , which maps from the parameter space to the data space. Noisy observations $\boldsymbol{y}\in\boldsymbol{Y}$ are obtained. To model the noise we assume that $\boldsymbol{y}$ is a realisation of the random variable $\mathcal{G}(\boldsymbol{\theta}^{\mathrm{true}})+\boldsymbol{\eta}$ where $\boldsymbol{\eta}\sim\mathrm{N}(0,\Gamma)$ is non-degenerate Gaussian noise, and $\boldsymbol{\theta}^{\mathrm{true}}\in\boldsymbol{X}$ is the true parameter. In an inverse problem we wish to identify the parameter $\boldsymbol{\theta}^{\mathrm{true}}$ , i.e., solve the equation

$$\mathcal{G}(\boldsymbol{\theta})+\boldsymbol{\eta}=\boldsymbol{y} \tag{1}$$

for $\boldsymbol{\theta}$ . This problem is typically ill-posed in the sense of Hadamard [18], due to noise, and since the low-dimensional data space is often not sufficiently rich to allow the identification of a unique parameter in the high-dimensional space $\boldsymbol{X}$ . The ill-posedness can be cured by reformulating Eq. 1 as a Bayesian inverse problem. Next, we introduce our notation and briefly discuss Bayesian inversion, and refer to [41] for more details.

We assume that $\boldsymbol{\theta}$ is an $\boldsymbol{X}$ -valued random variable that is distributed according to a prior measure $\mu_{0}$ with Lebesgue density $\pi_{0}$ on the parameter space $\boldsymbol{X}$ . Moreover, we assume that $\boldsymbol{\theta}$ is stochastically independent of the noise $\boldsymbol{\eta}$ . The density $\pi_{0}$ reflects our knowledge about $\boldsymbol{\theta}$ before we make an observation $\boldsymbol{y}$ . The information provided by $\boldsymbol{y}$ is modelled by the (data) likelihood. Since the noise $\boldsymbol{\eta}$ is Gaussian by assumption, the likelihood is given by

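$$L(\boldsymbol{\theta}\mid\boldsymbol{y}) := \exp\left(-\Phi(\boldsymbol{\theta};\boldsymbol{y})\right), \qquad \Phi(\boldsymbol{\theta};\boldsymbol{y}) := \frac{1}{2}\left\|\Gamma^{-1/2}\left(\mathcal{G}(\boldsymbol{\theta})-\boldsymbol{y}\right)\right\|_{2}^{2}. \tag{2}$$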

The function $\Phi$ is called the potential or negative log-likelihood. The solution of the Bayesian inverse problem is the posterior (measure) $\mu^{\boldsymbol{y}}$ , i.e., the conditional measure of $\boldsymbol{\theta}$ given that the event $\{\mathcal{G}(\boldsymbol{\theta})+\boldsymbol{\eta}=\boldsymbol{y}\}$ occurred. The posterior measure $\mu^{\boldsymbol{y}}$ also has a density $\pi^{\boldsymbol{y}}$ which can be computed using Bayes’s formula:

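$$\pi^{\boldsymbol{y}}(\boldsymbol{\theta}) = \frac{L(\boldsymbol{\theta}\mid\boldsymbol{y})\,\pi_{0}(\boldsymbol{\theta})}{Z(\boldsymbol{y})}, \qquad Z(\boldsymbol{y}) := \int_{\boldsymbol{X}} L(\boldsymbol{\theta}\mid\boldsymbol{y})\,\pi_{0}(\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta}, \tag{3}$$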

provided that $0<Z(\boldsymbol{y})<\infty$ . In the given setting (non-degenerate Gaussian additive noise, finite dimensional data space), one can show that $Z(\boldsymbol{y})$ is always finite and bounded away from $0$. This implies existence of the posterior measure, see [23]. The work [23] also establishes that Bayesian inverse problems of this type are always well-posed.
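As a concrete illustration of the potential, likelihood, evidence and posterior density, the following minimal sketch evaluates them on a grid for a toy one-dimensional problem. The forward map `forward`, the noise covariance `gamma`, the uniform prior on $[0,1]$ and the trapezoid rule are assumptions chosen for illustration; they are not taken from the paper. The potential is written in the standard form for additive Gaussian noise, $\Phi=\frac{1}{2}\|\Gamma^{-1/2}(\mathcal{G}(\boldsymbol{\theta})-\boldsymbol{y})\|^{2}$.

```python
import numpy as np

def forward(theta):
    # Hypothetical forward map G: R -> R^2 (an assumption, not the paper's model).
    return np.array([np.sin(np.pi * theta), theta ** 2])

def potential(theta, y, gamma):
    # Phi(theta; y) = 0.5 * (G(theta) - y)^T Gamma^{-1} (G(theta) - y)
    r = forward(theta) - y
    return 0.5 * r @ np.linalg.solve(gamma, r)

def likelihood(theta, y, gamma):
    # L(theta | y) = exp(-Phi(theta; y))
    return np.exp(-potential(theta, y, gamma))

def trapezoid(values, grid):
    # Composite trapezoid rule on a (possibly non-uniform) 1D grid.
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid)))

gamma = 0.05 ** 2 * np.eye(2)      # assumed noise covariance
theta_true = 0.3
y = forward(theta_true)            # noise-free synthetic data, for reproducibility

grid = np.linspace(0.0, 1.0, 2001)
prior = np.ones_like(grid)         # uniform prior density on [0, 1]
like = np.array([likelihood(t, y, gamma) for t in grid])

evidence = trapezoid(like * prior, grid)          # Z(y)
posterior = like * prior / evidence               # Bayes's formula
posterior_mean = trapezoid(grid * posterior, grid)
```

Because the likelihood is concentrated around `theta_true`, most of the 2001 grid points contribute almost nothing to `evidence`; this is the inefficiency of prior-based deterministic quadrature that motivates posterior-focused grids.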

Finally, let $g:\boldsymbol{X}\rightarrow\mathbb{R}$ denote a quantity of interest (QoI) depending on the parameter $\boldsymbol{\theta}$ . Since $\boldsymbol{\theta}$ is a random variable, one is typically interested in the forward propagation of uncertainties through the action of ${g}$ . For instance, we wish to evaluate integrals of $g$ w.r.t. the posterior measure:

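$$\mathbb{E}_{\mu^{\boldsymbol{y}}}[g] = \int_{\boldsymbol{X}} g(\boldsymbol{\theta})\,\pi^{\boldsymbol{y}}(\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta}. \tag{4}$$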

In practice, we approximate integrals of this type via numerical quadrature. Note that we can write the expected value in Eq. 4 in terms of a ratio of two expected values w.r.t. the prior measure:

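$$\mathbb{E}_{\mu^{\boldsymbol{y}}}[g] = \frac{1}{Z(\boldsymbol{y})}\int g(\boldsymbol{\theta})\,L(\boldsymbol{\theta}\mid\boldsymbol{y})\,\pi_{0}(\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta} = \frac{\mathbb{E}_{\mu_{0}}[g(\cdot)L(\cdot\mid\boldsymbol{y})]}{\mathbb{E}_{\mu_{0}}[L(\cdot\mid\boldsymbol{y})]}. \tag{5}$$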

Note that the prior is typically much more accessible than the posterior, via i.i.d. sampling or a closed form probability density function (pdf). If the two expected values in Eq. 5 are approximated with samples from $\mu_{0}$ , we refer to this method as importance sampling. If the expected values are approximated with other numerical quadrature rules, e.g., Quasi-Monte Carlo [35] or sparse grid quadrature [36, 38], we refer to any of these methods as importance-sampling-based methods.
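The ratio of prior expectations described above can be sketched with plain Monte Carlo samples from the prior. The toy setting below (standard normal prior, a single direct observation with Gaussian noise) is an assumption chosen because its posterior mean is known in closed form; the function name `ratio_estimate` is hypothetical.

```python
import numpy as np

def ratio_estimate(g, likelihood, prior_samples):
    """Estimate E_post[g] as E_prior[g * L] / E_prior[L] from prior samples."""
    weights = likelihood(prior_samples)
    return np.sum(g(prior_samples) * weights) / np.sum(weights)

rng = np.random.default_rng(0)

# Assumed toy problem: theta ~ N(0, 1), y = theta + eta, eta ~ N(0, sigma^2).
sigma, y = 0.5, 1.0
likelihood = lambda t: np.exp(-0.5 * ((t - y) / sigma) ** 2)

samples = rng.standard_normal(200_000)            # i.i.d. prior samples
estimate = ratio_estimate(lambda t: t, likelihood, samples)

# Conjugate Gaussian update gives the exact posterior mean y / (1 + sigma^2).
exact = y / (1.0 + sigma ** 2)
```

Replacing the prior samples by the nodes and weights of a deterministic rule turns the same ratio into one of the importance-sampling-based methods mentioned in the text.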

2.2 Multilevel approximation of the quantity of interest

The approximation of the expected value $\mathbb{E}_{\mu^{\boldsymbol{y}}}[g]$ involves two sources of error: $(i)$ the quadrature error associated with the approximation of the measure $\mu^{\boldsymbol{y}}$ , and $(ii)$ the discretization error associated with the approximation of $g$ . The quadrature error is typically controlled by the number of particles or grid points in the parameter domain. The discretization error in $g$ is often controlled by the mesh size in the physical domain. In our setting, the evaluation of $g$ typically involves a computationally expensive model, e.g., a partial differential equation (PDE). If we wish to construct accurate approximations to $\mathbb{E}_{\mu^{\boldsymbol{y}}}[g]$ , then $g$ has to be discretized with a high resolution on a fine grid in physical space, and $g$ has to be evaluated for a large number of parameter values. For these reasons the approximation of $\mathbb{E}_{\mu^{\boldsymbol{y}}}[g]$ is computationally demanding.

Multilevel methods provide a framework to approximate high-resolution problems efficiently by combining evaluations from fine and coarse grid approximations. Specifically, approximations based on quadrature and physical domain discretization grids of complementary resolution are combined such that most computations are done on coarse grids while fine grid approximations are evaluated only a limited number of times. In this way, the overall computational cost is reduced, while the accuracy is preserved. A large number of multilevel approaches for Bayesian inverse problems proceed by using a telescoping sum based on the linearity of the expectation operator, see, e.g., [9]. Alternatively, it is possible to construct a multilevel approximation without relying on such a telescoping sum. Examples are the multifidelity preconditioned MCMC method in [33], Multilevel Sequential$^{2}$ Monte Carlo [25], and our multilevel sparse Leja approximation presented in Section 3.
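For orientation, the telescoping-sum construction mentioned above can be written out as follows. The notation is assumed here for illustration: $J$ discretization levels, with $\mu^{\boldsymbol{y}}_{j}$ and $g_{j}$ denoting the level-$j$ posterior and QoI.

```latex
\mathbb{E}_{\mu^{\boldsymbol{y}}_{J}}[g_{J}]
  = \mathbb{E}_{\mu^{\boldsymbol{y}}_{1}}[g_{1}]
  + \sum_{j=2}^{J} \left( \mathbb{E}_{\mu^{\boldsymbol{y}}_{j}}[g_{j}]
                         - \mathbb{E}_{\mu^{\boldsymbol{y}}_{j-1}}[g_{j-1}] \right)
```

The identity is exact; the savings come from estimating the coarse term with many cheap evaluations and each correction term, which is small when consecutive levels are close, with few expensive ones.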

2.3 Approximation with generalized sparse grids

We aim to approximate posterior density functions with sparse grid interpolation and quadrature at each level in our multilevel approach. To this end, we employ generalized, adaptive Smolyak approximations [39]. Smolyak’s algorithm, also known as the combination scheme [17], is a strategy to construct multivariate sparse grid approximations by weakening the assumed coupling between the input dimensions (see, e.g., [6, 11]). We briefly summarize Smolyak’s algorithm. For a more detailed overview of this approach and its application in scientific computing, see [2, 44] and the references therein.

Let $f^{i}\colon X_{i}\subset\mathbb{R}\rightarrow\mathbb{R}$ denote univariate functions, where $i=1,2,\ldots,N_{\mathrm{sto}}$ . In addition, let $f^{\boldsymbol{N_{\mathrm{sto}}}}:\boldsymbol{X}\rightarrow\mathbb{R}$ denote a multivariate function with a scalar output. For example, $f^{\boldsymbol{N_{\mathrm{sto}}}}$ could be the potential function, $\Phi(\boldsymbol{\theta};\boldsymbol{y})$ , or the QoI, $g(\boldsymbol{\theta})$ . Let $\mathcal{U}^{i}[f^{i}]$ for $i=1,2,\ldots,N_{\mathrm{sto}}$ denote either univariate interpolation or integration operators defined w.r.t. a weight function $w:X_{i}\rightarrow\mathbb{R}_{+}$ , which in our context is the $i$ th component of the prior density. Further, consider approximations $\mathcal{U}^{i}_{k}[f^{i}]$ that converge as $k\rightarrow\infty$ , where $k$ is typically referred to as level. Starting from one-dimensional difference or hierarchical surplus operators,

$$\Delta^{i}_{k}[f^{i}]:=\mathcal{U}^{i}_{k}[f^{i}]-\mathcal{U}^{i}_{k-1}[f^{i}],\tag{6}$$

with the convention $\Delta^{i}_{1}[f^{i}]:=\mathcal{U}^{i}_{1}[f^{i}]$ , Smolyak’s approximation formula reads

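$$\mathcal{U}_{\mathcal{K}}[f^{\boldsymbol{N_{\mathrm{sto}}}}]:=\sum_{\boldsymbol{k}\in\mathcal{K}}\boldsymbol{\Delta}_{\boldsymbol{k}}[f^{\boldsymbol{N_{\mathrm{sto}}}}]=\sum_{\boldsymbol{k}\in\mathcal{K}}\left(\Delta^{1}_{k_{1}}\otimes\Delta^{2}_{k_{2}}\otimes\cdots\otimes\Delta^{N_{\mathrm{sto}}}_{k_{N_{\mathrm{sto}}}}\right)[f^{\boldsymbol{N_{\mathrm{sto}}}}],\tag{7}$$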

where $\boldsymbol{k}:=(k_{1},k_{2},\ldots,k_{N_{\mathrm{sto}}})\in\mathbb{N}^{N_{\mathrm{sto}}}$ is a multiindex and $\mathcal{K}$ is a finite set of multiindices. Note that by construction, Eq. 7 requires the underlying multivariate space, $\boldsymbol{X}$ , as well as the corresponding weight function to be separable. When the weight function is a density, this means that the stochastic parameters need to be independent.
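To make the role of the tensorized difference operators concrete, the following Python sketch evaluates a two-dimensional Smolyak quadrature as a sum of surpluses over a multiindex set. It is only an illustration under stated assumptions: Gauss-Legendre rules on $[0,1]$ stand in for the weighted (L)-Leja rules used in this paper, and the function names `rule`, `tensor_quad`, `surplus` and `smolyak` are hypothetical.

```python
import numpy as np
from itertools import product


def rule(k):
    """k-point Gauss-Legendre rule mapped to [0, 1] (uniform weight); stand-in for a Leja rule."""
    x, w = np.polynomial.legendre.leggauss(k)
    return 0.5 * (x + 1.0), 0.5 * w


def tensor_quad(f, k1, k2):
    """Tensor product of the univariate rules U_{k1} and U_{k2}, with the convention U_0 := 0."""
    if k1 == 0 or k2 == 0:
        return 0.0
    x1, w1 = rule(k1)
    x2, w2 = rule(k2)
    X1, X2 = np.meshgrid(x1, x2, indexing="ij")
    return float(np.sum(np.outer(w1, w2) * f(X1, X2)))


def surplus(f, k):
    """Tensorized difference (Delta_{k1} x Delta_{k2})[f], expanded into four tensor rules."""
    k1, k2 = k
    return (tensor_quad(f, k1, k2) - tensor_quad(f, k1 - 1, k2)
            - tensor_quad(f, k1, k2 - 1) + tensor_quad(f, k1 - 1, k2 - 1))


def smolyak(f, index_set):
    """Sum of the surpluses over a finite multiindex set."""
    return sum(surplus(f, k) for k in index_set)


f = lambda x, y: np.exp(x + y)
full_box = list(product(range(1, 5), repeat=2))          # all k with k1, k2 <= 4
sparse_set = [k for k in full_box if k[0] + k[1] <= 5]   # downward closed subset
exact = (np.e - 1.0) ** 2                                # integral of f over [0, 1]^2

# On the full box the sum telescopes to the plain tensor rule U_4 x U_4.
print(abs(smolyak(f, full_box) - tensor_quad(f, 4, 4)))
# The sparse set uses 10 of the 16 multiindices and remains accurate for this smooth f.
print(abs(smolyak(f, sparse_set) - exact))
```

The first printed value illustrates the telescoping property, whereas the second shows that dropping the multiindices with large $k_{1}+k_{2}$ changes the result only slightly for this smooth integrand.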

Since Eq. 7 is written in terms of tensorizations of the univariate difference operators in Eq. 6, the set $\mathcal{K}$ must be constructed such that the summation in Eq. 7 telescopes correctly. Suitable sets of multiindices $\mathcal{K}$ are called admissible (or downward closed; see [15]). In particular, for an admissible set $\mathcal{K}$ it holds that $\boldsymbol{k}\in\mathcal{K}\Rightarrow\boldsymbol{k}-\boldsymbol{e}_{i}\in\mathcal{K}$ for all $i=1,2,\ldots,{N_{\mathrm{sto}}}$ with $k_{i}>1$ , where $\boldsymbol{e}_{i}$ denotes the $i$ th unit vector in $\mathbb{R}^{N_{\mathrm{sto}}}$ .
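The admissibility condition can be checked mechanically. The short Python sketch below does so for multiindices stored as tuples with entries starting at one; the helper name `is_admissible` is hypothetical and not part of any library.

```python
def is_admissible(index_set):
    """Return True if the set of multiindices (tuples with entries >= 1) is downward closed."""
    indices = set(map(tuple, index_set))
    for k in indices:
        for i, k_i in enumerate(k):
            if k_i > 1:
                # backward neighbour k - e_i must also belong to the set
                backward = k[:i] + (k_i - 1,) + k[i + 1:]
                if backward not in indices:
                    return False
    return True


admissible = {(1, 1), (2, 1), (1, 2), (2, 2)}
not_admissible = {(1, 1), (2, 2)}   # (1, 2) and (2, 1) are missing

print(is_admissible(admissible), is_admissible(not_admissible))
```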

To construct the approximations $\mathcal{U}^{i}_{k}[f^{i}]$ , we employ weighted (L)-Leja sequences (see, e.g., [16, 32]). Given the weight function $w:X_{i}\rightarrow\mathbb{R}_{+}$ , weighted (L)-Leja sequences are constructed recursively as follows:

[TABLE]

When $w$ is the standard uniform density with support $[0,1]$ , we choose $\theta_{1}=0.5$ . Note that the above point sequence is in general not uniquely defined, because Eq. 8 might have multiple maximizers. In that case we simply pick one of the maximizers. Weighted (L)-Leja sequences allow the construction of sparse grid approximations for arbitrary probability densities. Moreover, they lead to accurate approximations with low cardinality (see, e.g., [32]). To fully define Eq. 7, we need to specify the multiindex set $\mathcal{K}$ as well. We construct $\mathcal{K}$ adaptively based on the dimension-adaptive algorithm of [15, 19], which we outline in Section 4.
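As an illustration of the recursive construction, the Python sketch below generates weighted Leja-type points by a greedy search over a fine candidate grid. It assumes the objective $w(\theta)\prod_{m}|\theta-\theta_{m}|$ (some references use $\sqrt{w}$ instead of $w$), and the function name `weighted_leja`, the candidate grid size and the tie-breaking rule (first maximizer on the grid) are assumptions of this sketch.

```python
import numpy as np


def weighted_leja(weight, n_points, lower, upper, first_point, n_candidates=20001):
    """Greedy weighted Leja points on [lower, upper] via search over a fine candidate grid."""
    candidates = np.linspace(lower, upper, n_candidates)
    points = [first_point]
    # objective w(theta) * prod_m |theta - theta_m|, updated after each new point
    objective = weight(candidates) * np.abs(candidates - first_point)
    for _ in range(n_points - 1):
        new_point = candidates[np.argmax(objective)]   # ties: first maximizer is picked
        points.append(new_point)
        objective = objective * np.abs(candidates - new_point)
    return np.array(points)


# standard uniform density on [0, 1], starting from theta_1 = 0.5
uniform = lambda t: np.ones_like(t)
pts = weighted_leja(uniform, 5, 0.0, 1.0, first_point=0.5)
print(pts)
```

For the uniform density, the second and third points are the two endpoints of $[0,1]$, and the order in which they appear depends only on how ties are broken, which reflects the non-uniqueness mentioned above.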

2.4 Assumptions

As discussed in Section 2.3, sparse grid approximations require a tensor domain and a tensorized prior measure. Hence, we assume:

A1.

The parameter space is a tensor product space, i.e.,

$$\boldsymbol{X}:=X_{1}\times X_{2}\times\cdots\times X_{N_{\mathrm{sto}}},$$

where $X_{i}\subset\mathbb{R}$ for $i=1,\dots,N_{\mathrm{sto}}$ .

A2.

The prior density $\pi_{0}$ is separable, i.e.,

$$\pi_{0}(\boldsymbol{\theta})=\prod_{i=1}^{N_{\mathrm{sto}}}\pi_{0,i}(\theta_{i}),\tag{9}$$

where $\pi_{0,i}\colon X_{i}\rightarrow\mathbb{R}$ , $i=1,\dots,N_{\mathrm{sto}}$ .

Assumption A1 can always be satisfied by embedding a non-tensorized parameter space into a hyperrectangle of suitable dimension. Assumption A2 is fulfilled if the components of $\boldsymbol{\theta}$ are stochastically independent under the prior measure. If Assumption A2 is not satisfied, the Gaussian approximation of $\pi_{0}$ is needed (see Section 3).

3 Multilevel sparse Leja algorithm

This section contains the major contribution of our work. Our goal is to address the challenges of Bayesian inversion in computationally expensive problems. To this end, we formulate a deterministic, multilevel, sampling-free methodology based on sparse grids in which we sequentially update the prior information as the level in the multilevel hierarchy increases.

Most computations in Bayesian inversion involve evaluations of the forward operator, $\mathcal{G}(\boldsymbol{\theta})$ , which in this paper is assumed to be computationally expensive. When a large number of such evaluations is needed, the corresponding computational cost is prohibitive. To reduce the cost, we employ multilevel approximations. At each level, we construct sparse grid surrogates of the potential function $\Phi(\boldsymbol{\theta};\boldsymbol{y})$ . Our motivation is two-fold. First, evaluating $\Phi(\boldsymbol{\theta};\boldsymbol{y})$ means evaluating the forward model, $\mathcal{G}(\boldsymbol{\theta})$ (see Eq. 2), hence the computationally expensive part. Second, even if $\mathcal{G}(\boldsymbol{\theta})$ is vector valued, $\Phi(\boldsymbol{\theta};\boldsymbol{y})$ is a scalar. Sparse grid approximations can be constructed for vector-valued functions; however, a separate approximation is usually needed for each output component, which can be infeasibly expensive. After the surrogate is obtained, all other single-level operations, which typically involve integration, employ this surrogate, making them computationally cheap. In the following, we use the superscripts in and qu to specifically refer to interpolation and quadrature, respectively, whereas superscript op is used to refer to either of the two operations.

In this paper, the surrogates of the potential function are constructed via adaptive sparse grid interpolation, whereas all integration operations are performed using adaptive sparse grid quadrature. The specific implementations do not influence the formulation of our multilevel approach. Thus, we assume we have two adaptive strategies, AdaptSGInterp( $tol^{\mathrm{in}}$ , $K_{\mathrm{max}}^{\mathrm{in}}$ , $g$ , $\pi$ ) and AdaptSGQuad( $tol^{\mathrm{qu}}$ , $K_{\mathrm{max}}^{\mathrm{qu}}$ , $g$ , $\pi$ ), which depend on a tolerance, $tol^{\mathrm{op}}$ . The other inputs are a maximum reachable sparse grid level, $K_{\mathrm{max}}$ , the target function, $g$ , and the density function w.r.t. which the approximation is performed, $\pi$ . Note that specific implementations might have additional input arguments; however, the four inputs considered here are sufficient to illustrate these algorithms. The adaptive strategies are summarized in Section 4.

3.1 General setup

Let $J>1$ , $J\in\mathbb{N}$ , denote the number of levels in our multilevel formulation. Further, let $A\in\{\mathcal{G},\Phi,L,Z,\pi^{\boldsymbol{y}}\}$ denote a generic quantity depending on both physical and stochastic parameters. Let $j=1,2,\ldots,J$ . By $h_{j}$ we characterize the discretization of the physical domain of the forward response operator $\mathcal{G}(\boldsymbol{\theta})$ , where $h_{1}$ is the coarsest and $h_{J}$ the finest discretization level. Hence, by $A_{j}$ we denote the semi-discrete approximation of $A$ depending on $h_{j}$ , whereas $A_{\delta j}$ denotes either $A_{j}-A_{j-1}$ or $A_{j}/A_{j-1}$ . In addition, $tol^{\mathrm{op}}_{j}$ denotes the tolerance employed in the adaptive sparse grid approximations of $A_{j}$ such that

[TABLE]

Thus by $A_{j,s}$ we denote the sparse grid approximation of $A_{j}$ depending on $tol^{\mathrm{op}}_{s}$ .

In our multilevel formulation, we determine $A_{j,J-j+1}$ for all $j=1,2,\ldots,J$ . To simplify the notation we use the subscript $\ell(j)$ to refer to $(h_{j},tol^{\mathrm{op}}_{J-j+1})$ and the subscript $\ell(\delta j)$ to denote approximations $A_{\ell(\delta j)}\approx A_{\delta j}$ . Hence we refer to the level $j$ in our multilevel approach by $\ell(j)$ or $\ell(\delta j)$ . Note that levels are used to characterize both sparse grid and multilevel formulations. To avoid confusion, we will explicitly specify what is meant by level in each context.
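The index bookkeeping behind $\ell(j)=(h_{j},tol^{\mathrm{op}}_{J-j+1})$ can be sketched in a few lines of Python. The helper name `multilevel_pairs` and the placeholder labels are hypothetical; they only illustrate how the mesh sizes are paired with the tolerances in reversed order.

```python
def multilevel_pairs(h, tol):
    """Pair h_j with tol_{J-j+1} for j = 1, ..., J (lists are 0-based, levels are 1-based)."""
    J = len(h)
    return [(h[j - 1], tol[J - j]) for j in range(1, J + 1)]


# placeholder labels instead of numerical values
pairs = multilevel_pairs(["h1", "h2", "h3"], ["tol1", "tol2", "tol3"])
print(pairs)
```

For $J=3$ this pairs $h_{1}$ with $tol_{3}$, $h_{2}$ with $tol_{2}$ and $h_{3}$ with $tol_{1}$.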

3.2 Level $\ell(1)$

At $\ell(1)$ we compute the approximation $\Phi_{\ell(1)}(\boldsymbol{\theta};\boldsymbol{y})\approx\Phi_{1}(\boldsymbol{\theta};\boldsymbol{y})$ using adaptive sparse grid interpolation w.r.t. the prior $\pi_{0}$ . This is possible since $\pi_{0}$ has a product structure by Assumption A2, see Eq. 9. The potential’s approximation $\Phi_{\ell(1)}(\boldsymbol{\theta};\boldsymbol{y})$ is used for $L_{1}(\boldsymbol{\theta}|\boldsymbol{y})\approx L_{\ell(1)}(\boldsymbol{\theta}|\boldsymbol{y}):=\exp{(-\Phi_{\ell(1)}(\boldsymbol{\theta};\boldsymbol{y}))}$ , the level one likelihood surrogate. Afterwards, we employ $L_{\ell(1)}(\boldsymbol{\theta}|\boldsymbol{y})$ to compute the evidence $Z_{\ell(1)}(\boldsymbol{y})\approx Z_{1}(\boldsymbol{y})$ via adaptive sparse grid quadrature w.r.t. the prior density $\pi_{0}$ . Having computed $L_{\ell(1)}(\boldsymbol{\theta}|\boldsymbol{y})$ and $Z_{\ell(1)}(\boldsymbol{y})$ we apply formula Eq. 3 and obtain the posterior at level $\ell(1)$ , $\pi_{\ell(1)}^{\boldsymbol{y}}(\boldsymbol{\theta})$ . In addition, we also compute the posterior expectation, $\boldsymbol{m}_{\ell(1)}\in\mathbb{R}^{N_{\mathrm{sto}}}$ , where for $i=1,\ldots,N_{\mathrm{sto}}$ ,

[TABLE]

and covariance matrix, $\boldsymbol{C}_{\ell(1)}\in\mathbb{R}^{N_{\mathrm{sto}}\times N_{\mathrm{sto}}}$ , where for $n,p=1,\ldots,N_{\mathrm{sto}}$ ,

[TABLE]

which we need to construct the Gaussian approximation $\widehat{\pi}_{\ell(1)}^{\boldsymbol{y}}(\boldsymbol{\theta}):=\mathrm{N}(\boldsymbol{m}_{\ell(1)},\boldsymbol{C}_{\ell(1)})$ of $\pi_{\ell(1)}^{\boldsymbol{y}}(\boldsymbol{\theta})$ at the second level $\ell(2)$ (see Section 3.3.1).
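The sequence of operations at level $\ell(1)$ (likelihood from the potential, evidence, posterior moments) can be illustrated on a one-dimensional toy problem. The Python sketch below is not the implementation used in this paper: it evaluates the potential directly instead of a sparse grid surrogate, uses a fixed Gauss-Legendre rule instead of adaptive (L)-Leja quadrature, and assumes a hypothetical forward model $\mathcal{G}(\theta)=\theta$ with a Gaussian-noise potential.

```python
import numpy as np


def level_one_posterior(potential, n_nodes=40):
    """Evidence, posterior mean and posterior variance for a 1D parameter with prior U(0, 1)."""
    x, w = np.polynomial.legendre.leggauss(n_nodes)
    theta, w = 0.5 * (x + 1.0), 0.5 * w        # nodes and weights w.r.t. the U(0, 1) prior
    likelihood = np.exp(-potential(theta))      # L = exp(-Phi)
    Z = np.sum(w * likelihood)                  # evidence: integral of L w.r.t. the prior
    post_w = w * likelihood / Z                 # discrete posterior weights
    mean = np.sum(post_w * theta)
    var = np.sum(post_w * (theta - mean) ** 2)
    return Z, mean, var


# hypothetical toy problem: G(theta) = theta, datum y = 0.5, noise standard deviation 0.1
y, sigma = 0.5, 0.1
potential = lambda theta: 0.5 * ((y - theta) / sigma) ** 2
Z, mean, var = level_one_posterior(potential)
print(Z, mean, var)
```

The returned mean and variance play the role of $\boldsymbol{m}_{\ell(1)}$ and $\boldsymbol{C}_{\ell(1)}$ in the Gaussian approximation used at the next level.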

These steps are summarized in Algorithm 1. The first three inputs are the spatial discretization parameter $h_{1}$ and the tolerances for sparse grid interpolation and quadrature, $tol^{\mathrm{in}}_{1}$ and $tol^{\mathrm{qu}}_{1}$ . Moreover, $\boldsymbol{K}_{\mathrm{max}}:=(K_{\mathrm{max}}^{\mathrm{in}},K_{\mathrm{max}}^{\mathrm{qu}})$ comprises the maximum attainable sparse grid levels for the two adaptive operations. The last two inputs are the potential function, $\Phi(\boldsymbol{\theta};\boldsymbol{y})$ , and the prior density, $\pi_{0}(\boldsymbol{\theta})$ .

3.3 Level $\ell(j)$ with $j\geq 2$

3.3.1 Gaussian approximation

At levels $\ell(j)$ with $j\geq 2$ , we sequentially update the prior density such that the previous level posterior, $\pi_{\ell(j-1)}^{\boldsymbol{y}}(\boldsymbol{\theta})$ , is used as the prior; we detail the sequential update in Section 3.3.2. To be able to construct sparse grid approximations w.r.t. $\pi_{\ell(j-1)}^{\boldsymbol{y}}(\boldsymbol{\theta})$ , the underlying stochastic space needs to have a separable density (recall Assumption A2). However, $\pi_{\ell(j-1)}^{\boldsymbol{y}}(\boldsymbol{\theta})$ is usually not separable. Therefore, we approximate $\pi_{\ell(j-1)}^{\boldsymbol{y}}(\boldsymbol{\theta})$ with a density $\widehat{\pi}_{\ell(j-1)}^{\boldsymbol{y}}(\boldsymbol{\theta})$ that allows us to obtain the required product structure. In this paper, we approximate $\pi_{\ell(j-1)}^{\boldsymbol{y}}(\boldsymbol{\theta})$ with the Gaussian density $\widehat{\pi}_{\ell(j-1)}^{\boldsymbol{y}}(\boldsymbol{\theta})$ defined as

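$$\widehat{\pi}_{\ell(j-1)}^{\boldsymbol{y}}(\boldsymbol{\theta}):=\mathrm{N}(\boldsymbol{m}_{\ell(j-1)},\boldsymbol{C}_{\ell(j-1)}),\tag{10}$$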

where $\boldsymbol{m}_{\ell(j-1)}$ and $\boldsymbol{C}_{\ell(j-1)}$ are the expectation and covariance matrix of $\pi_{\ell(j-1)}^{\boldsymbol{y}}(\boldsymbol{\theta})$ .

In most cases $\boldsymbol{C}_{\ell(j-1)}$ is not diagonal, i.e., $\widehat{\pi}_{\ell(j-1)}^{\boldsymbol{y}}(\boldsymbol{\theta})$ does not have a product structure. Nevertheless, from the spectral decomposition $\boldsymbol{C}_{\ell(j-1)}=VDV^{-1}$ , we have $\boldsymbol{C}_{\ell(j-1)}^{1/2}=VD^{1/2}V^{-1}$ . We arrive at

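$$\boldsymbol{\theta}=T_{\ell(j-1)}(\boldsymbol{\zeta}):=\boldsymbol{m}_{\ell(j-1)}+\boldsymbol{C}_{\ell(j-1)}^{1/2}\boldsymbol{\zeta}\sim\mathrm{N}(\boldsymbol{m}_{\ell(j-1)},\boldsymbol{C}_{\ell(j-1)}),\tag{11}$$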

where $\boldsymbol{\zeta}$ is a standard Gaussian random variable, i.e., $\boldsymbol{\zeta}\sim\mathrm{N}(\boldsymbol{0},I)$ .

Formula Eq. 11 allows us to write a general multivariate Gaussian random variable with correlated components as a mapping of a standard multivariate Gaussian random variable, which has the desired product structure since the components of $\boldsymbol{\zeta}$ are uncorrelated and thus independent. In our context, we use Eq. 11 as follows. We first generate $1$ D (L)-Leja points weighted w.r.t. the standard normal density, that is, $w(\theta):=\exp{(-\theta^{2}/2)}/{\sqrt{2\pi}}$ in Eq. 8. Moreover, since the maximization defined in Eq. 8 is typically performed over a compact domain, we truncate the domain and consider $X_{i}:=\mathbb{R}\approx[-4,4]$ . For quadrature, we compute $1$ D quadrature weights w.r.t. normalized Hermite polynomials. We extend these constructions to $N_{\mathrm{sto}}$ dimensions via tensorization and employ Eq. 11 to obtain the desired weighted (L)-Leja points. Note that $T_{\ell(j-1)}$ in Eq. 11 can be seen as an affine transport map (see [29]).
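The affine map can be illustrated with a few lines of Python. The sketch below builds the symmetric square root of a covariance matrix from its eigendecomposition and pushes tensorized standard-normal quadrature nodes through $T(\boldsymbol{\zeta})=\boldsymbol{m}+\boldsymbol{C}^{1/2}\boldsymbol{\zeta}$. Gauss-Hermite nodes stand in for the weighted (L)-Leja points, and the mean and covariance values as well as the helper name `affine_map` are hypothetical.

```python
import numpy as np


def affine_map(mean, cov):
    """Return T(zeta) = mean + cov^{1/2} zeta and the symmetric square root cov^{1/2}."""
    eigvals, V = np.linalg.eigh(cov)                  # cov = V diag(eigvals) V^T
    sqrt_cov = V @ np.diag(np.sqrt(eigvals)) @ V.T

    def T(zeta):
        zeta = np.atleast_2d(zeta)                    # each row is one point zeta
        return mean + zeta @ sqrt_cov.T

    return T, sqrt_cov


# hypothetical posterior mean and covariance from the previous level
mean = np.array([0.35, 0.65])
cov = np.array([[0.010, 0.003],
                [0.003, 0.006]])
T, sqrt_cov = affine_map(mean, cov)

# 1D nodes/weights w.r.t. the standard normal density, tensorized to 2D
z, w = np.polynomial.hermite_e.hermegauss(3)
w = w / w.sum()
Z1, Z2 = np.meshgrid(z, z, indexing="ij")
zeta = np.column_stack([Z1.ravel(), Z2.ravel()])
weights = np.outer(w, w).ravel()

theta = T(zeta)                                       # mapped nodes in the parameter space
m_hat = weights @ theta
C_hat = (theta - m_hat).T @ (weights[:, None] * (theta - m_hat))
print(m_hat, C_hat)
```

The weighted mean and covariance of the mapped nodes reproduce the prescribed mean and covariance, which is the property that allows quadrature w.r.t. the correlated Gaussian to be carried out on a tensorized standard-normal grid.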

3.3.2 Level update on tensor domain

We assume that $\pi_{\ell(j-1)}^{\boldsymbol{y}}(\boldsymbol{\theta})$ for $j\geq 2$ is not separable, hence we employ the Gaussian approximation Eq. 11 for all adaptive sparse grid operations; if $\pi_{\ell(j-1)}^{\boldsymbol{y}}(\boldsymbol{\theta})$ is separable, everything that follows is computed directly using $\pi_{\ell(j-1)}^{\boldsymbol{y}}(\boldsymbol{\theta})$ for all levels $\ell(j)$ with $j\geq 2$ . Thus, we sequentially update the prior in Bayes’ formula Eq. 3 such that we reuse the Gaussian approximation of the posterior from the previous level, i.e.,

[TABLE]

where $L_{\delta j}(\boldsymbol{\theta}|\boldsymbol{y}):=L_{j}(\boldsymbol{\theta}|\boldsymbol{y})/L_{j-1}(\boldsymbol{\theta}|\boldsymbol{y})$ and $Z_{\delta j}(\boldsymbol{y}):=Z_{j}(\boldsymbol{y})/Z_{j-1}(\boldsymbol{y})$ . Note that in Eq. 12 we correct the bias introduced by the Gaussian approximation of the posterior from the previous level, $\widehat{\pi}_{\ell(j-1)}^{\boldsymbol{y}}$ , with the ratio $\pi_{\ell(j-1)}^{\boldsymbol{y}}/\widehat{\pi}_{\ell(j-1)}^{\boldsymbol{y}}$ .

First, we construct an adaptive sparse grid interpolation surrogate $\Phi_{\ell(\delta j)}(\boldsymbol{\theta};\boldsymbol{y})$ of $\Phi_{\delta j}(\boldsymbol{\theta};\boldsymbol{y})$ w.r.t. the density $\widehat{\pi}_{\ell(j-1)}^{\boldsymbol{y}}(\boldsymbol{\theta})$ because it holds

[TABLE]

Recall that the mapping $T_{\ell(j-1)}$ defined in Eq. 11 allows us to use adaptive sparse grid interpolation w.r.t. the Gaussian approximation $\widehat{\pi}_{\ell(j-1)}^{\boldsymbol{y}}(\boldsymbol{\theta})$ . To construct the surrogate for the potential function $\Phi_{\delta j}(\boldsymbol{\theta};\boldsymbol{y})$ we employ $T_{\ell(j-1)}$ and obtain

[TABLE]

$\Phi_{\ell({\delta j})}$ gives the approximation $L_{\ell({\delta j})}(T_{\ell(j-1)}(\boldsymbol{\zeta})|\boldsymbol{y}):=\exp{(-\Phi_{\ell(\delta j)}(T_{\ell(j-1)}(\boldsymbol{\zeta});\boldsymbol{y}))}$ .

To evaluate the ratio of evidences $Z_{\delta j}(\boldsymbol{y})$ we make use of Eq. 12, i.e.,

[TABLE]

and numerically integrate $\Big{(}L_{\ell({\delta j})}\pi_{\ell(j-1)}^{\boldsymbol{y}}/\widehat{\pi}_{\ell(j-1)}^{\boldsymbol{y}}\Big{)}\circ T_{\ell(j-1)}$ w.r.t. the density $\widehat{\pi}_{\ell(j-1)}^{\boldsymbol{y}}(\boldsymbol{\theta})$ via adaptive sparse grid quadrature to obtain the approximation $Z_{\ell(\delta j)}(\boldsymbol{y})\approx Z_{\delta j}(\boldsymbol{y})$ . Thus, at each level $\ell(j)$ with $j\geq 2$ we obtain the posterior approximation

[TABLE]

For all other quadrature computations, we proceed analogously to Eq. 13. To simplify the notation, denote $R_{\ell(j-1)}:=\pi_{\ell(j-1)}^{\boldsymbol{y}}/\widehat{\pi}_{\ell(j-1)}^{\boldsymbol{y}}$ . Given an integrable function $g(\boldsymbol{\theta})$ , we integrate $\int_{\boldsymbol{X}}g(\boldsymbol{\theta})\pi_{j}^{\boldsymbol{y}}(\boldsymbol{\theta})\mathrm{d}\boldsymbol{\theta}$ , which reads as

[TABLE]

via adaptive sparse grid quadrature w.r.t. $\widehat{\pi}_{\ell(j-1)}^{\boldsymbol{y}}$ . The above formula is used to assess the expectation and covariance matrix of $\pi_{j}^{\boldsymbol{y}}(\boldsymbol{\theta})$ . Moreover, at level $J$ of our approach, we integrate the QoI $g\approx g_{\ell(J)}$ w.r.t. $\widehat{\pi}_{\ell(J)}^{\boldsymbol{y}}(\boldsymbol{\theta})$ . In this way, we perform the multilevel decomposition implicitly, different from standard multilevel methods in which the QoI is assessed explicitly via telescoping sums. Note that at the end of our multilevel algorithm we also obtain a surrogate for the posterior density which can be further used, for example, in an uncertainty propagation setting.
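The reweighted quadrature can be sketched on a one-dimensional toy problem. In the Python code below, the integrand $g\,L_{\delta j}\,R_{\ell(j-1)}$ is evaluated at mapped standard-normal nodes, and the ratio of evidences is obtained from the same nodes. All densities and numbers are hypothetical: the previous-level posterior is a Gaussian, the Gaussian approximation is deliberately mismatched so that $R\neq 1$, Gauss-Hermite quadrature replaces adaptive (L)-Leja quadrature, and the names `normal_pdf` and `reweighted_expectation` are not part of any library.

```python
import numpy as np


def normal_pdf(x, mean, std):
    """Density of N(mean, std^2)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def reweighted_expectation(g, L_delta, prev_post, m, s, n_nodes=30):
    """Integrate g * L_delta * R w.r.t. the Gaussian approximation N(m, s^2), then normalize."""
    zeta, w = np.polynomial.hermite_e.hermegauss(n_nodes)
    w = w / np.sqrt(2.0 * np.pi)                      # weights w.r.t. the N(0, 1) density
    theta = m + s * zeta                              # 1D version of the affine map T
    R = prev_post(theta) / normal_pdf(theta, m, s)    # bias-correcting density ratio
    Z_delta = np.sum(w * L_delta(theta) * R)          # ratio of evidences
    expectation = np.sum(w * g(theta) * L_delta(theta) * R) / Z_delta
    return expectation, Z_delta


prev_post = lambda t: normal_pdf(t, 0.45, 0.05)             # stand-in for the previous posterior
L_delta = lambda t: np.exp(-0.5 * ((t - 0.5) / 0.2) ** 2)   # stand-in for the likelihood ratio
post_mean, Z_delta = reweighted_expectation(lambda t: t, L_delta, prev_post, m=0.44, s=0.06)
print(post_mean, Z_delta)
```

Because both factors are Gaussian in this toy setting, the posterior mean and the ratio of evidences are available in closed form, which makes the sketch easy to verify.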

We summarize the steps for levels $\ell(j)$ with $j\geq 2$ in Algorithm 2. The first three inputs are the number of levels, $J$ , the sequence of mesh sizes, $\boldsymbol{h}$ , and $\boldsymbol{tol}$ , which comprises the tolerances for adaptive sparse grid interpolation and quadrature at all levels. The next input denotes the maximum reachable levels for the adaptive algorithms, $\boldsymbol{K}_{\mathrm{max}}:=(K_{\mathrm{max}}^{\mathrm{in}},K_{\mathrm{max}}^{\mathrm{qu}})$ . Finally, $\pi_{0}$ is the prior density, $\Phi(\boldsymbol{\theta};\boldsymbol{y})$ is the potential function and $g$ is the QoI. We combine Algorithms 1 and 2 and depict all steps in our proposed multilevel approach in Fig. 1.

3.4 Computational cost

The largest computational effort in the proposed multilevel methodology is spent in finding the interpolation surrogates since this involves evaluations of the forward operator. Thus, for $j=1,2,\ldots,J$ , let $C_{h_{j}}$ denote the cost of evaluating once the forward model discretized using a mesh depending on $h_{j}$ . Additionally, let $N_{tol_{J-j+1}^{\mathrm{in}}}$ denote the number of forward operator solves needed to achieve tolerance $tol_{J-j+1}^{\mathrm{in}}$ for adaptive sparse interpolation. Then, the total interpolation cost of our approach reads

[TABLE]

We obtain the above number because for each level except the last one, the same potential function and thus the same forward model enters two different likelihood ratios (see Eq. 12). However, because we update the prior as the level increases, we expect the number of forward model solves to decrease significantly with the level.
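The cost accounting described above can be sketched in Python as follows. The sketch assumes that at every level $\ell(j)$ with $j\geq 2$ each interpolation node requires one solve on mesh $h_{j}$ and one on mesh $h_{j-1}$, whereas level $\ell(1)$ requires only solves on $h_{1}$; this is one reading of the argument and the exact formula may differ. The function name and the numbers are hypothetical.

```python
def total_interpolation_cost(cost_per_solve, n_nodes):
    """Total cost, with cost_per_solve[j-1] = C_{h_j} and n_nodes[j-1] = nodes used at level l(j)."""
    J = len(cost_per_solve)
    total = cost_per_solve[0] * n_nodes[0]            # level l(1): coarsest model only
    for j in range(1, J):
        # levels l(j+1): both the current and the previous discretization are evaluated
        total += (cost_per_solve[j] + cost_per_solve[j - 1]) * n_nodes[j]
    return total


# hypothetical costs per solve and node counts that decrease with the level
cost = total_interpolation_cost([1.0, 4.0, 16.0], [100, 20, 5])
print(cost)  # 100*1 + 20*(4+1) + 5*(16+4) = 300.0
```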

All other costs are due to computations depending on the interpolation surrogates. These costs are however insignificant compared to the evaluation cost of a computationally expensive forward operator.

4 Dimension adaptivity with sparse grids

In this section we discuss the construction of the multiindex set $\mathcal{K}$ used in the sparse grid approximations defined in Section 2.3 via adaptive refinement. For interpolation we consider a standard adaptive strategy as well as an enhanced approach that employs directional variance information. For quadrature we employ a standard adaptive strategy. In the following, our notation is similar to [11].

4.1 Standard dimension-adaptive interpolation and quadrature

Adaptive refinement is preferred especially when the underlying problem has a richer structure, such as anisotropic coupling of the input parameters or lower intrinsic dimensionality, which is the case in most problems (see, e.g., [6, 11, 12, 45]). The standard strategy is based on the dimension-adaptive algorithm of [15, 19]. The algorithm is described, e.g., in [6, 15, 32]. We summarize only the basic idea below. Initially, $\mathcal{K}=\{\boldsymbol{1}\}$ . Each refinement step is performed using the following principle: if a current multiindex contributes significantly to the approximation, its adjacent neighbours are likely to contribute as well. Therefore, the forward neighbours of the multiindex with the largest contribution are added to $\mathcal{K}$ provided that $\mathcal{K}$ remains admissible. The contribution of each $\boldsymbol{k}$ is assessed via a refinement indicator $\epsilon(\boldsymbol{k})$ , whose choice has a crucial impact on the performance of the adaptive algorithm.
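The refinement principle can be sketched as a short Python loop in the spirit of the dimension-adaptive algorithm. This is a simplified illustration, not the implementation used in this paper: the refinement indicator is a hypothetical closed-form decay instead of a computed surplus, the stopping rule (sum of active indicators below a tolerance) is an assumption, and the function name `dimension_adaptive` is not part of any library.

```python
def dimension_adaptive(indicator, dim, tol, max_level):
    """Return the (old, active) multiindex sets of a greedy dimension-adaptive refinement."""
    first = (1,) * dim
    old, active = set(), {first: indicator(first)}
    while active and sum(active.values()) > tol:
        k = max(active, key=active.get)               # multiindex with the largest indicator
        del active[k]
        old.add(k)
        for i in range(dim):
            fwd = k[:i] + (k[i] + 1,) + k[i + 1:]     # forward neighbour k + e_i
            if fwd[i] > max_level:
                continue
            backward = [fwd[:j] + (fwd[j] - 1,) + fwd[j + 1:]
                        for j in range(dim) if fwd[j] > 1]
            if all(b in old for b in backward):       # keeps the index set admissible
                active[fwd] = indicator(fwd)
    return old, set(active)


# hypothetical anisotropic indicator: slow decay in direction 1, fast decay in direction 2
indicator = lambda k: 2.0 ** (-(k[0] - 1)) * 8.0 ** (-(k[1] - 1))
old, active = dimension_adaptive(indicator, dim=2, tol=1e-3, max_level=20)
K = old | active
print(max(k[0] for k in K), max(k[1] for k in K))
```

With this anisotropic indicator, the resulting multiindex set extends much further in the first direction than in the second, which is the behaviour dimension adaptivity is designed to produce.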

We define

[TABLE]

where $s^{\mathrm{op}}$ is a function depending on the multivariate surplus $\boldsymbol{\Delta}_{\boldsymbol{k}}^{\mathrm{op}}[f^{\boldsymbol{N_{\mathrm{sto}}}}]$ and $\delta N_{\boldsymbol{k}}$ is the number of model evaluations needed to assess $\boldsymbol{\Delta}_{\boldsymbol{k}}^{\mathrm{op}}[f^{\boldsymbol{N_{\mathrm{sto}}}}]$ . Note that $\delta N_{\boldsymbol{k}}$ penalizes subspaces with a large number of points.

For sparse grid quadrature, we consider

[TABLE]

which is a surrogate for the local quadrature error. For sparse interpolation, we use

[TABLE]

As in [11], we employ $\|\boldsymbol{\Delta}^{\mathrm{in}}_{\boldsymbol{k}}[f^{\boldsymbol{N_{\mathrm{sto}}}}]\|_{L^{2}}$ in the standard refinement indicator Eq. 15 because it yields the local variance contribution of the surplus to the total variance.

4.2 Directional variance dimension-adaptive sparse interpolation

Dimension adaptivity based on error indicators such as Eq. 14 does not inherently distinguish between the individual input parameters. Since in most problems the input parameters are anisotropically coupled, we wish to exploit this structure and tune the adaptive process such that it stops refining directions that are rendered unimportant. This is particularly important in our proposed approach since we update the prior starting with level $\ell(2)$ and thus we have updated information about the model’s stochastic input parameters. To this end, for sparse interpolation, we enhance the standard adaptive strategy such that we additionally compute a global measure of importance of each input parameter via total directional variances, and we stop refining the directions having insignificant total directional variances.

To enhance the standard adaptive approach for interpolation, we proceed analogously to [11, 45] and perform a Sobol’ decomposition [40] of the active set $\mathcal{A}$ to obtain directional variance surpluses, that is,

[TABLE]

where $\Delta V_{\mathcal{A}}^{0}:=\gamma_{\boldsymbol{0}}^{2}$ refers to the expectation contribution,

[TABLE]

where $\mathcal{J}_{i}=\{\boldsymbol{0}<\boldsymbol{\ell}\leq\boldsymbol{I}_{\boldsymbol{k}}:\boldsymbol{\ell}_{i}\neq 0\land\boldsymbol{\ell}_{j}=0,\forall j\neq i\}$ , are surplus contributions to all individual variances, and $\Delta V_{\mathcal{A}}^{\mathrm{int}}:=\sum_{\boldsymbol{m}\in\mathcal{J}_{\mathrm{int}}}\Delta\gamma_{\boldsymbol{m}}^{2}$ refers to the variance surplus due to all possible interactions, where $\mathcal{J}_{\mathrm{int}}=\bigcup_{i=1}^{N_{\mathrm{sto}}}\mathcal{J}_{i,\mathrm{int}}$ , $\mathcal{J}_{i,\mathrm{int}}=\{\boldsymbol{0}<\boldsymbol{m}\leq\boldsymbol{I}_{\boldsymbol{k}}:\boldsymbol{m}_{i}\neq 0\}$ . Further, we compute total directional variance surpluses as

[TABLE]

where $\Delta V_{\mathcal{A}}^{i,\mathrm{int}}:=\sum_{\boldsymbol{m}\in\mathcal{J}_{i,\mathrm{int}}}\Delta\gamma_{\boldsymbol{m}}^{2}$ denotes the contribution due to all interactions involving direction $i$ . Note that $\Delta V_{\mathcal{A}}^{i,\mathrm{tot}}$ can be seen as a global measure of importance for each stochastic input: a large $\Delta V_{\mathcal{A}}^{i,\mathrm{tot}}$ implies that the $i$ th parameter is significant from a stochastic perspective. To this end, we prescribe $N_{\mathrm{sto}}$ user-defined directional tolerances $\boldsymbol{\tau}^{\mathrm{in}}:=(\tau_{1}^{2},\tau_{2}^{2},\ldots,\tau_{N_{\mathrm{sto}}}^{2})$ and ascertain the importance of each input direction by comparing $\Delta V_{\mathcal{A}}^{i,\mathrm{tot}}$ with $\tau_{i}^{2}$ for $i=1,2,\ldots,N_{\mathrm{sto}}$ . When the stochastic direction $i$ is rendered unimportant, i.e., $\Delta V_{\mathcal{A}}^{i,\mathrm{tot}}$ falls below $\tau_{i}^{2}$ , we simply stop adding multiindices whose $i$ th component exceeds the maximum $i$ th index in the current multiindex set $\mathcal{K}$ . In this way, the algorithm preferentially refines the most important directions, thus decreasing the overall computational cost. When none of the directional tolerances is met, the enhanced algorithm reduces to the standard approach in Section 4.1.
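The stopping rule for unimportant directions can be sketched as a filter on candidate forward neighbours. In the Python code below, the total directional variance surpluses are given as plain numbers (their computation from the Sobol' decomposition is not shown), and the function names `refinable_directions` and `filter_forward_neighbours` as well as all numerical values are hypothetical.

```python
def refinable_directions(total_dir_var, tau_sq):
    """Direction i stays refinable as long as its total directional variance is not below tau_i^2."""
    return [v >= t for v, t in zip(total_dir_var, tau_sq)]


def filter_forward_neighbours(candidates, index_set, refinable):
    """Drop candidates that would raise the maximum index of a direction deemed unimportant."""
    dim = len(refinable)
    max_idx = [max(k[i] for k in index_set) for i in range(dim)]
    kept = []
    for c in candidates:
        blocked = any((not refinable[i]) and c[i] > max_idx[i] for i in range(dim))
        if not blocked:
            kept.append(c)
    return kept


index_set = {(1, 1), (2, 1), (1, 2)}
candidates = [(3, 1), (2, 2), (1, 3)]
refinable = refinable_directions([1e-3, 1e-9], [1e-7, 1e-7])   # direction 2 is unimportant
kept = filter_forward_neighbours(candidates, index_set, refinable)
print(refinable, kept)
```

In this example the candidate $(1,3)$ is discarded because it would increase the maximum index in the second, unimportant direction, whereas $(2,2)$ is kept since it does not exceed the current maximum.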

5 Numerical experiments

In this section we present the numerical results obtained using our proposed multilevel approach for Bayesian inversion.

5.1 Simple quadrature showcase

In this test case we investigate the behaviour of weighted (L)-Leja points in integration problems of the form

[TABLE]

where $g(\boldsymbol{\theta})$ is an integrable function and $\pi^{\boldsymbol{y}}(\boldsymbol{\theta})$ is the posterior density; we outline the setup used to compute $\pi^{\boldsymbol{y}}(\boldsymbol{\theta})$ below. We assess Eq. 16 via quadrature w.r.t. two different weight functions. In the first case, we employ a standard importance-sampling-based strategy (recall Eq. 5). Specifically, adaptive sparse grid quadrature w.r.t. the prior density, $\pi_{0}(\boldsymbol{\theta})$ , with tolerance $tol^{\mathrm{qu}}_{\pi_{0}}$ is used:

[TABLE]

where $\{\boldsymbol{\theta}_{n,\mathrm{pr}}\}_{n=1}^{N_{\mathrm{pr}}}$ are (L)-Leja nodes computed w.r.t. $\pi_{0}(\boldsymbol{\theta})$ and

[TABLE]

In the second strategy, we compute Eq. 16 using our proposed approach. We integrate Eq. 16 numerically via adaptive sparse grid quadrature w.r.t. the Gaussian approximation $\widehat{\pi}^{\boldsymbol{y}}(\boldsymbol{\theta})$ of the posterior density (recall Eq. 10), using a tolerance $tol^{\mathrm{qu}}_{\widehat{\pi}^{\boldsymbol{y}}}$ :

[TABLE]

where $\{\boldsymbol{\zeta}_{n,\mathrm{post}}\}_{n=1}^{N_{\mathrm{post}}}$ are (L)-Leja nodes computed w.r.t. the standard multivariate normal density, $\mathrm{N}(\boldsymbol{0},I)$ , and $T(\boldsymbol{\zeta}):=\boldsymbol{m}+\boldsymbol{C}^{1/2}\boldsymbol{\zeta}$ , where $\boldsymbol{m}$ and $\boldsymbol{C}$ are the expectation and covariance matrix associated with the density $\widehat{\pi}^{\boldsymbol{y}}(\boldsymbol{\theta})$ .

We consider the following forward model

[TABLE]

where $x\in[0,1],A(\theta_{1})=20\theta_{1}+1$ and $w(\theta_{2})=\theta_{2}+1.2$ . We employ Bayesian inversion to infer $(\theta_{1},\theta_{2})$ . The prior is the uniform density in $[0,1]^{2}$ , i.e., $\pi_{0}=U(0,1)^{2}$ . The observation data $\boldsymbol{y}$ are generated synthetically using $(\theta_{1},\theta_{2})_{\mathrm{true}}=(0.45,0.65)$ . We take $N_{obs}=9$ measurements at locations $o_{j}=0.1j,\quad j=1,\ldots,9$ , assumed to be corrupted by additive Gaussian noise $\eta\sim\mathrm{N}(\boldsymbol{0},0.1^{2}I)$ . We depict the prior and posterior densities in the left figure in Fig. 2. Observe that the posterior is unimodal and non-symmetric, but it can be well approximated with a Gaussian density.

In the numerical experiments, we let $g(\boldsymbol{\theta}):=\exp{(-\theta_{1}-\theta_{2})}$ in Eq. 16. We compute a reference solution using $3\cdot 10^{5}$ Metropolis-Hastings MCMC (MH) samples obtained from a random walk Gaussian proposal with initial sample $\boldsymbol{\theta}_{0}=(1,1)$ and covariance matrix $C_{\mathrm{MH}}=7\cdot 10^{-3}I$ . The acceptance rate is $44\%$ . Additionally, we employ a tolerance $tol^{\mathrm{qu}}_{\pi_{0}}=10^{-11}$ in Eq. 17 and a tolerance $tol^{\mathrm{qu}}_{\widehat{\pi}^{\boldsymbol{y}}}=10^{-5}$ in Eq. 18.

The results are summarized in Table 1. The employed tolerances in the two (L)-Leja sparse grid quadrature approaches are sufficient to match four digits of the reference results. However, integrating w.r.t. the prior requires $1603$ nodes, whereas our approach which uses the Gaussian approximation of the posterior as weight function requires only $49$ quadrature nodes, i.e., almost $33$ times fewer points. This is because the support of the prior density, $\pi_{0}(\boldsymbol{\theta})$ , is significantly larger than the support of the posterior: when integrating w.r.t. the prior density, the adaptive algorithm places a large number of quadrature points outside of the support of the posterior.

We visualize the quadrature nodes corresponding to Eqs. 17 and 18 in the center and right figures in Fig. 2, respectively.

Remark 5.1.

*Adaptive sparse grid quadrature w.r.t. an uninformative prior can sometimes stop early since the quadrature points will fall outside of the support of the integrated function, yielding null evaluations and thus null error indicators. To overcome this, one could employ non-adaptive quadrature with a sufficiently large, a priori chosen number of nodes to cover the support of the integrated function.*

5.2 Source inversion with one source in a 2D spatial domain

Consider a two-dimensional Bayesian inverse problem in which the forward model $\mathcal{G}(\boldsymbol{\theta})$ is an elliptic PDE defined on $\Omega:=[0,1]^{2}$ ,

[TABLE]

with $A(\alpha)=5/(2\pi\alpha^{2})$ and $\alpha=0.2$ .

The goal is to infer the coordinates $(\theta_{1},\theta_{2})$ of the source term in the right-hand side, i.e., we seek the solution to a source inversion problem. We perform the multilevel Bayesian inversion as described in Algorithms 1 and 2 with three levels, i.e., $J=3$ . Thus $j=1,2,3$ . The employed multilevel setup is summarized in Table 2. Standard triangular finite elements (FEs) with mesh widths $h_{j}$ are used for spatial discretization. To find surrogates for the potential function at each level in our proposed multilevel approach we employ both adaptive sparse interpolation variants summarized in Section 4. Recall that for standard adaptive interpolation we have tolerances $tol^{\mathrm{in}}_{1},tol^{\mathrm{in}}_{2},tol^{\mathrm{in}}_{3}$ (see Section 4.1) whereas for directional variance-based adaptivity from Section 4.2 we additionally have the directional tolerances $\{\boldsymbol{\tau}^{\mathrm{in}}_{j}\}_{j=1}^{3}$ . We choose the FE mesh widths and adaptive interpolation tolerances such that the approximation errors are quantitatively similar. At level $\ell(1)$ we combine $h_{1}$ with $tol^{\mathrm{in}}_{3}$ and $\boldsymbol{\tau}^{\mathrm{in}}_{3}$ , at level $\ell(2)$ , $h_{2}$ is combined with $tol^{\mathrm{in}}_{2}$ and $\boldsymbol{\tau}^{\mathrm{in}}_{2}$ , and at level $\ell(3)$ we employ $h_{3}$ together with $tol^{\mathrm{in}}_{1}$ and $\boldsymbol{\tau}^{\mathrm{in}}_{1}$ . For adaptive sparse grid quadrature we employ small tolerances $tol^{\mathrm{qu}}_{1},tol^{\mathrm{qu}}_{2},tol^{\mathrm{qu}}_{3}$ to prevent the adaptive algorithm from stopping too early, especially when integrating w.r.t. the prior density (recall Remark 5.1).

Since the source locations need to reside inside $\Omega$ , the prior is the uniform density in $[0,1]^{2}$ , i.e., $\pi_{0}=U(0,1)^{2}$ . We consider 16 sensor locations at $(0.2i,0.2j)$ for $i,j=1,2,3,4$ . The measurements are obtained synthetically by discretizing the forward model on a finer mesh, i.e., $h=\sqrt{2}/2^{7}$ , to avoid committing an “inverse crime”. Moreover, $\boldsymbol{\theta}_{\mathrm{true}}=(0.35,0.65)$ and the additive Gaussian noise is $\boldsymbol{\eta}\sim\text{N}(\mathbf{0},0.2^{2}I)$ .

The QoI is the posterior mean $\mathbb{E}_{\pi^{\boldsymbol{y}}}[\boldsymbol{\theta}]$ . We compute a reference solution using $2\cdot 10^{5}$ samples obtained from a random walk Metropolis-Hastings algorithm with Gaussian proposal having covariance matrix $C_{\mathrm{MH}}=4\cdot 10^{-3}I$ , started from $\boldsymbol{\theta}_{0}=(1,1)$ . The acceptance rate of the chain is $64\%$ . To obtain a comprehensive overview of the accuracy and cost of our approach, we compare it with the standard three-level approach in which all adaptive sparse grid operations are performed w.r.t. the prior density. Moreover, the QoI is assessed using the classical telescoping sum. To simplify the notation, in the following we use the abbreviation $\mathrm{StdML}$ to refer to the standard multilevel approach. $\mathrm{MLLejaStd}$ refers to our approach in which standard dimension-adaptive interpolation is used at each level and $\mathrm{MLLejaDV}$ refers to our approach combined with directional variance-based adaptive interpolation summarized in Section 4.2. The results are presented in Table 3. Observe that all multilevel methods yield results very close to the reference estimate. Thus, the two variants of our proposed approach, $\mathrm{MLLejaStd}$ and $\mathrm{MLLejaDV}$ , are as accurate as the sampling-based and the standard multilevel solutions.

In Fig. 3, we visualize the results for all employed multilevel methods as follows. The left subplots show the results for $\mathrm{StdML}$ , whereas the center and right subplots depict the results for $\mathrm{MLLejaStd}$ and $\mathrm{MLLejaDV}$ , respectively. Furthermore, at each level, the prior as well as the corresponding (L)-Leja points used to find the interpolation surrogate are visualized in the top part. In the bottom plots, we depict the resulting posterior densities. At level $\ell(1)$ we obtain the same three posteriors since the prior is the same in all cases. However, starting with level $\ell(2)$ the sequential update of the prior in the proposed approach leads to significantly fewer interpolation points compared to $\mathrm{StdML}$ , which places a large number of weighted (L)-Leja points outside of the support of the corresponding posterior. Moreover, comparing the two variants of the proposed approach, $\mathrm{MLLejaDV}$ requires fewer (L)-Leja points than $\mathrm{MLLejaStd}$ . This is because at both levels $\ell(2)$ and $\ell(3)$ in $\mathrm{MLLejaDV}$ , the two total directional variances fall below the imposed tolerances. Thus $\mathrm{MLLejaDV}$ discovers and exploits more structure in the underlying approximation problem.

We visualize the multiindex sets for the three multilevel variants in Fig. 4. At level $\ell(1)$ the multiindex sets corresponding to $\mathrm{StdML}$ and $\mathrm{MLLejaStd}$ are symmetric; this is because the two stochastic parameters have equal importance w.r.t. the prior density, which is the $2$ D (symmetric) uniform density. $\mathrm{MLLejaDV}$ leads to a smaller multiindex set since the directional variances $\boldsymbol{\tau}^{\mathrm{in}}_{3}$ fall below $(10^{-7},10^{-7})$ , but again there is no clear distinction between the two input directions. However, at level $\ell(2)$ we observe a different behaviour in the two variants of our proposed approach, $\mathrm{MLLejaStd}$ and $\mathrm{MLLejaDV}$ . Recall that in these two variants the prior density is the Gaussian approximation of the posterior density from level $\ell(1)$ . Computing the eigenvalues $(\lambda_{1},\lambda_{2})$ of its covariance matrix, we obtain $\lambda_{1}=0.0097$ and $\lambda_{2}=0.0055$ . Therefore, the first direction is more important than the second, which is reflected in the two multiindex sets. On the other hand, the multiindex set for the standard approach remains symmetric because the prior density is unchanged. Finally, at level $\ell(3)$ we see low-cardinality multiindex sets for both $\mathrm{MLLejaStd}$ and $\mathrm{MLLejaDV}$ . This is because at this stage we have an informative prior, thus the likelihood ratio is close to $1$ , which requires little approximation effort. Hence going beyond level $\ell(3)$ is not necessary for our approach.
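The eigenvalue argument can be illustrated with a small sketch. The covariance matrix itself is not printed in this section, so the matrix below is hypothetical: it is built from the two reported eigenvalues and an arbitrary rotation angle, purely to show how the directions would be ranked.

```python
import numpy as np

# Hypothetical covariance: eigenvalues taken from the text, rotation angle made up.
angle = 0.3
rot = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle),  np.cos(angle)]])
cov = rot @ np.diag([0.0097, 0.0055]) @ rot.T

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]            # most important direction first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

anisotropy = eigvals[0] / eigvals[1]         # how much the first direction dominates
```

With the reported values the ratio $\lambda_{1}/\lambda_{2}$ is roughly $1.76$, a moderate anisotropy, which is consistent with multiindex sets that favour the first direction without ignoring the second.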

The costs of all multilevel methods are visualized in Fig. 5. The number of forward model evaluations needed to find the adaptive sparse grid surrogate of the potential function is shown on the left side. In the right plot we depict the number of evaluations of the surrogate in all quadrature computations. Note that in all multilevel variants we need quadrature to assess the evidences and expectations at all three levels. Additionally, in $\mathrm{MLLejaStd}$ and $\mathrm{MLLejaDV}$ we need to compute the covariance matrices at levels $\ell(1)$ and $\ell(2)$ as well, which are needed in the Gaussian approximation of the associated posteriors and the affine mapping (recall Eqs. 10 and 11). However, since we integrate w.r.t. the same weight function, we keep all surrogate evaluations in a look-up table and reuse them whenever the same grid points are used for different evaluations. We observe that at level $\ell(1)$ our proposed approach is slightly more expensive for interpolation, which is due to the need to evaluate the FE solver on level $\ell(1)$ at level $\ell(2)$ as well: recall that at level $\ell(2)$ we construct a sparse grid surrogate for the ratio of potential functions. However, the increased cost is not significant since it involves evaluations of the coarsest FE solver, which are very fast. Starting with level $\ell(2)$ we see significant cost savings for both interpolation and quadrature. On the one hand, at level $\ell(2)$ , $\mathrm{MLLejaStd}$ leads to about $2$ times fewer forward model evaluations for sparse grid interpolation and around $12.5$ times fewer sparse grid quadrature evaluations. Moreover, at level $\ell(3)$ we obtain about $15$ times fewer interpolation points and $7.5$ times fewer quadrature nodes. On the other hand, $\mathrm{MLLejaDV}$ leads to about $5$ times fewer interpolation nodes and about $12.5$ times fewer quadrature evaluations at level $\ell(2)$ . Furthermore, we obtain $20$ times fewer interpolation nodes and $9.5$ times fewer quadrature nodes at level $\ell(3)$ . These results clearly show that updating the prior information in our multilevel approach for Bayesian inversion leads to a significant cost reduction in finding and evaluating sparse grid surrogates.
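The look-up table for surrogate evaluations can be sketched as a memoised wrapper keyed by grid point. This is a hypothetical illustration only: the class name, the rounding-based key and the toy surrogate are assumptions and do not reproduce the authors' implementation.

```python
import numpy as np

class CachedSurrogate:
    """Wraps a surrogate and stores its value at each grid point for later reuse."""

    def __init__(self, surrogate, decimals=12):
        self.surrogate = surrogate
        self.decimals = decimals
        self.table = {}      # grid point (as tuple) -> surrogate value
        self.n_evals = 0     # number of actual surrogate evaluations

    def __call__(self, point):
        point = np.asarray(point, dtype=float)
        # Round so that the same grid point always maps to the same key.
        key = tuple(np.round(point, self.decimals))
        if key not in self.table:
            self.table[key] = self.surrogate(point)
            self.n_evals += 1
        return self.table[key]

# Toy surrogate standing in for the sparse grid interpolant.
toy = CachedSurrogate(lambda t: float(np.exp(-np.sum(t ** 2))))
grid = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5)]

evidence_terms = [toy(p) for p in grid]       # first integral: three new evaluations
mean_terms = [p[0] * toy(p) for p in grid]    # second integral: reuses all three
```

Because the evidence, the expectation and the covariance are integrated w.r.t. the same weight function, they share quadrature nodes, so after the first integral the remaining ones mostly hit the table.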

5.3 Source inversion with two sources in a 2D spatial domain

For a more comprehensive overview of the proposed approach, we now consider a test case with multimodal observation data. In particular, we consider another source inversion test case in which we use two sources to generate the data, so that the observation data are bimodal, and only one source to perform the Bayesian inference.

The elliptic forward operator defined on $\Omega:=[0,1]^{2}$ reads:

[TABLE]

where $A(\alpha)=5/(2\pi\alpha^{2})$ , $\alpha=0.15$ , and the binary parameter is $b=1$ when generating the data and $b=0$ when performing the inference. Therefore, we are solving a source inversion problem similar to the one in Section 5.2 but starting from bimodal data.

To generate the data we choose the locations of the two sources far apart, i.e., $(\theta_{1},\theta_{2})_{\mathrm{true}}=(0.15,0.15)$ and $(\theta_{3},\theta_{4})_{\mathrm{true}}=(0.85,0.85)$ . For Bayesian inference we employ StdML, MLLejaStd and MLLejaDV using three levels. The multilevel setup is outlined in Table 4.

We visualize in Fig. 6 the bimodal posterior density obtained via the standard Bayes’ formula (Eq. 3), for which we used $50^{2}=2500$ Gauss-Legendre points to assess the evidence. Note that the two peaks are symmetric around $(0.5,0.5)$ . Therefore, Bayesian inference using only one source in the forward model can yield, in the best case, a posterior density centered at $(0.5,0.5)$ .
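A tensor-product Gauss-Legendre rule with $50$ points per direction can be sketched as follows. The bimodal likelihood used here is a hypothetical stand-in with peaks at $(0.15,0.15)$ and $(0.85,0.85)$; it is not the PDE-based likelihood of this test case, and the helper names are illustrative.

```python
import numpy as np

def tensor_gauss_legendre(n):
    """Nodes and weights of the n x n Gauss-Legendre rule on [0, 1]^2."""
    x, w = np.polynomial.legendre.leggauss(n)      # rule on [-1, 1]
    x01, w01 = 0.5 * (x + 1.0), 0.5 * w            # affine map to [0, 1]
    t1, t2 = np.meshgrid(x01, x01, indexing="ij")
    nodes = np.column_stack([t1.ravel(), t2.ravel()])
    weights = np.outer(w01, w01).ravel()
    return nodes, weights

def toy_likelihood(theta, s=0.1):
    # Hypothetical symmetric bimodal likelihood, not the paper's forward model.
    d1 = np.sum((theta - 0.15) ** 2, axis=1)
    d2 = np.sum((theta - 0.85) ** 2, axis=1)
    return np.exp(-0.5 * d1 / s**2) + np.exp(-0.5 * d2 / s**2)

nodes, weights = tensor_gauss_legendre(50)         # 50^2 = 2500 points
like = toy_likelihood(nodes)
evidence = np.sum(weights * like)                  # uniform prior density is 1 on [0, 1]^2
post_mean = (weights * like) @ nodes / evidence
```

For this symmetric toy likelihood the posterior mean lies at the midpoint between the two peaks, which illustrates why a single-source model can at best centre its posterior at $(0.5,0.5)$.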

The QoI is again the posterior mean $\mathbb{E}_{\pi^{\boldsymbol{y}}}[\boldsymbol{\theta}]$ . In Table 5 we show the obtained estimates. First, a reference Metropolis-Hastings estimate with $2\cdot 10^{5}$ samples is computed using a random walk Gaussian proposal with initial sample $\boldsymbol{\theta}_{0}=(1.0,1.0)$ and covariance matrix $C_{\mathrm{MH}}=10^{-1}I$ . The acceptance rate is $45\%$ . We observe that the MH and StdML solutions yield estimates close to the center, $(0.5,0.5)$ . However, the estimates given by MLLejaStd and MLLejaDV are far away from this value.

We depict in Fig. 7 the prior and posterior density as well as the weighted (L)-Leja points used to construct the adaptive sparse grid interpolation surrogate of the potential function for all employed multilevel methods. We observe that the Gaussian approximation used in our proposed approach at levels $\ell(2)$ and $\ell(3)$ is very spread out and quite different from the approximated posterior. Hence the bias-correcting ratio in quadrature operations, $\pi_{\ell(j)}^{\boldsymbol{y}}/\widehat{\pi}_{\ell(j)}^{\boldsymbol{y}}$ for $j=2,3$ (recall Eq. 12), is different from $1$ in most regions of the domain. The large variations of this ratio lead to large variations of the error indicators in adaptive sparse grid quadrature, which prevent the algorithm from converging and hence from yielding accurate estimates. Therefore, the estimates of the expectation and covariance matrix of the posteriors from levels $\ell(1)$ and $\ell(2)$ , and with that, the Gaussian approximations employed at levels $\ell(2)$ and $\ell(3)$ , are inaccurate. Note that the large spread of the Gaussian approximations leads to weighted (L)-Leja points outside of the domain of the uniform prior, which coincides with the domain of the forward operator, $\Omega$ (see Section 5.3). Whenever this happens, we impose the corresponding likelihood evaluation to be zero.
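Setting the likelihood to zero for points outside the domain can be sketched with a thin wrapper. The wrapper name and the toy likelihood are assumptions made for illustration; in the actual setting the wrapped function would involve a PDE solve, which is skipped for out-of-domain points.

```python
import numpy as np

def guarded_likelihood(likelihood, lower=0.0, upper=1.0):
    """Return a likelihood that is zero outside the box [lower, upper]^d."""
    def wrapped(theta):
        theta = np.asarray(theta, dtype=float)
        if np.any(theta < lower) or np.any(theta > upper):
            return 0.0          # point lies outside the domain: no forward solve
        return likelihood(theta)
    return wrapped

# Hypothetical stand-in for the likelihood of the source inversion problem.
toy = guarded_likelihood(lambda t: float(np.exp(-np.sum((t - 0.5) ** 2))))

inside = toy([0.5, 0.5])    # evaluated normally
outside = toy([1.2, 0.4])   # a Leja point mapped outside [0, 1]^2
```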

5.4 Higher-dimensional problem in a 3D spatial domain

In the final test case, we apply the proposed approach in a more challenging and computationally more expensive problem. We consider an elliptic forward model defined on $\Omega:=[0,1]^{3}$ with a permeability field projected onto a Fourier basis:

[TABLE]

where $f\equiv 5$ and

[TABLE]

where $s_{1}=0.1785$ , $s_{2}=s_{3}=s_{4}=0.1428$ , $s_{5}=s_{6}=s_{7}=0.1071$ , $s_{8}=0.0714$ are normalized scaling factors, i.e., $\sum_{n=1}^{8}s_{n}=1$ , and $(p_{n,1},p_{n,2},p_{n,3})\in\{1,2\}^{3}$ . Note that with the chosen setup, $\theta_{1}$ is the most important parameter, $\theta_{2},\theta_{3},\theta_{4}$ are the second most important parameters, etc., under the prior density.
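A quick arithmetic check of the normalization: the printed four-digit values sum to $0.9996$ rather than exactly $1$. One reading, which is an inference and not stated in the text, is that the factors are $5/28$, $4/28$, $3/28$ and $2/28$ truncated to four decimals. The sketch below verifies that this reading is consistent with the printed digits.

```python
import math
from fractions import Fraction

# Printed scaling factors s_1, ..., s_8, stored as integer multiples of 1e-4.
printed = [1785, 1428, 1428, 1428, 1071, 1071, 1071, 714]
printed_sum = sum(printed) / 10**4

# Assumed exact factors (an inference): numerators 5, 4, 4, 4, 3, 3, 3, 2 over 28.
numerators = [5, 4, 4, 4, 3, 3, 3, 2]
exact = [Fraction(k, 28) for k in numerators]
exact_sum = sum(exact)

# Truncating the assumed exact values to four decimals reproduces the printed digits.
truncated = [math.floor(f * 10**4) for f in exact]
```

Under this assumption the exact factors sum to $1$, and the small deficit in the printed sum is a truncation effect.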

Bayesian inference is carried out for the weights $(\theta_{1},\theta_{2},\ldots,\theta_{8})$ of the permeability field $k(x,y,z,\boldsymbol{\theta})$ . Thus, we are solving an $8$ D inversion problem. These weights follow a standard normal prior distribution, i.e., $\mu_{0}=\mathrm{N}(\boldsymbol{0},I)$ . To perform the Bayesian inference we employ both the standard and our proposed multilevel approach with the two variants, $\mathrm{MLLejaStd}$ and $\mathrm{MLLejaDV}$ , considering $J=3$ . The employed multilevel setup is outlined in Table 6. The forward model Eq. 21 is discretized via standard tetrahedral FE meshes $h_{j}$ . Moreover, the sparse grid interpolation tolerances $tol^{\mathrm{in}}_{j}$ and $\boldsymbol{\tau}^{\mathrm{in}}_{j}$ are chosen to yield quantitatively similar errors to the FE approximation for $j=1,2,3$ . Finally, we choose small tolerances for quadrature to prevent the adaptive algorithm from stopping too early.

The observation data consist of $729$ measurements at $\{0.1,0.2,\ldots,0.9\}^{3}\subset\Omega$ , stemming from the FE solution of the forward model discretized using a finer mesh width $h=\sqrt{3}/2^{7}$ and assuming measurement noise $\eta\sim\mathrm{N}(\boldsymbol{0},0.1^{2}I)$ . In addition, $\boldsymbol{\theta}_{\mathrm{true}}$ is drawn from the standard multivariate Gaussian density, i.e.,
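The measurement setup can be sketched as follows: a $9\times 9\times 9$ sensor grid and additive Gaussian noise with standard deviation $0.1$. The function `toy_forward` is a hypothetical placeholder for the FE solution evaluated at the sensors, and the seed is arbitrary.

```python
import itertools
import numpy as np

# Sensor coordinates 0.1, 0.2, ..., 0.9 in each spatial direction.
coords = np.round(np.arange(1, 10) * 0.1, 1)
sensors = np.array(list(itertools.product(coords, repeat=3)))   # 9^3 = 729 points

def toy_forward(points):
    # Hypothetical stand-in for the FE solution u(x, y, z) at the sensor locations.
    return np.prod(np.sin(np.pi * points), axis=1)

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 0.1, size=len(sensors))   # eta ~ N(0, 0.1^2 I)
y_obs = toy_forward(sensors) + noise
```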

[TABLE]

The QoI is again the expectation of the posterior density, $\mathbb{E}_{\pi^{\boldsymbol{y}}}[\boldsymbol{\theta}]$ . We begin with level $\ell(1)$ in the standard multilevel approach. Both the expectation and the covariance matrix of the corresponding posterior are computed since we need these evaluations at $\ell(2)$ . We obtain, however, an indefinite covariance matrix with a negative variance for $\theta_{1}$ . This is mainly due to the limitations of the standard approach: adaptive sparse grid quadrature w.r.t. the prior density becomes challenging when the complexity of the underlying Bayesian inverse problem increases. To overcome this limitation, we employ instead standard sparse grids of a priori fixed levels having sufficiently many points to guarantee a positive definite covariance matrix. In particular, we consider an interpolation grid of level $10$ ( $24310$ grid points) and a quadrature grid of the same level comprising $598417$ points. Since our goal is to compare multilevel methods based on adaptive sparse grid algorithms, we do not perform the standard multilevel approach on the remaining two levels using a priori chosen sparse grids, but rather focus only on the two variants of our proposed approach starting from the Gaussian approximation of the posterior at level $\ell(1)$ obtained with the aforementioned standard sparse grids. Moreover, evaluating the FE discretizations depending on $h_{2}$ and $h_{3}$ on the standard sparse grid of level $10$ ( $24310$ evaluations) would require a significant computational cost.
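Two small checks related to the paragraph above are sketched below. The first tests a covariance estimate for positive definiteness via a Cholesky factorisation; the matrices are made-up examples. The second counts multiindices under the assumption, not stated in this section, that a level-$10$ grid in $8$ dimensions uses the simplex index set $|\boldsymbol{\ell}-\boldsymbol{1}|_{1}\leq 9$ with one new point per level; under that assumption the count matches the $24310$ interpolation points quoted above.

```python
import math
import numpy as np

def is_positive_definite(cov):
    """True if the symmetric part of cov admits a Cholesky factorisation."""
    sym = 0.5 * (cov + cov.T)
    try:
        np.linalg.cholesky(sym)
        return True
    except np.linalg.LinAlgError:
        return False

# Hypothetical covariance estimates: one with a negative variance, one valid.
bad_cov = np.array([[-0.02, 0.0], [0.0, 0.5]])
good_cov = np.array([[0.02, 0.001], [0.001, 0.5]])

# Number of multiindices l in N^8 (components >= 1) with |l - 1|_1 <= 9.
n_multiindices = math.comb(9 + 8, 8)
```

A failed factorisation, as for `bad_cov`, signals that the quadrature estimate cannot be used to define a Gaussian approximation, which is the situation described for the adaptive standard approach at level $\ell(1)$.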

We first compute a reference MH solution with $10^{5}$ samples. The Gaussian proposal has covariance matrix $C_{\mathrm{MH}}=0.5I$ . To reduce the burn-in, the chain is started from $\boldsymbol{\theta}_{\mathrm{true}}$ . The acceptance rate is $24\%$ . Afterwards, we employ MLLejaStd and MLLejaDV as described above. The results are shown in Table 7. The two variants of our approach produce results comparable to the reference solution, thus making our proposed approach competitive with sampling methods in this test case as well. Observe, however, that the accuracy of all estimates deteriorates compared with $\boldsymbol{\theta}_{\mathrm{true}}$ . This is because the likelihood becomes less informative as the index increases from $1$ to $8$ : the likelihood updates the prior very well for the first direction, relatively well for the next three, and almost not at all for the last four directions. Nevertheless, since the last four directions are the least important by construction (Eq. 22), we expect that the inaccuracy of the corresponding mean estimates will not be too significant.

To assess the quality of the expectation estimates, we use them to represent the permeability field as $k(x,y,z,\mathbb{E}_{\pi^{\boldsymbol{y}}}[\boldsymbol{\theta}])$ and we compare the results with $k(x,y,z,\boldsymbol{\theta}_{\mathrm{true}})$ . In Fig. 8 we depict $2$ D slices of the field in which the spatial coordinates $x$ (top), $y$ (middle) and $z$ (bottom), respectively, are fixed to $0.5$ . We observe that having inaccurate estimates for the latter four components of $\boldsymbol{\theta}$ does not significantly affect the estimation of the true permeability field. This is due to having good estimates for the first components of $\boldsymbol{\theta}_{\mathrm{true}}$ , which are the most important by construction.

In Fig. 9 the costs for interpolation (left) and quadrature (right) are shown. Note that for interpolation we show costs for level $\ell(1)$ as well because evaluations of the forward PDE discretized using $h_{1}$ are needed at $\ell(2)$ . MLLejaDV is cheaper than MLLejaStd at all three levels, requiring about $4.3$ times fewer evaluations at level $\ell(1)$ , $4.8$ times fewer evaluations at level $\ell(2)$ and $7$ times fewer evaluations at level $\ell(3)$ . Observe that the overall interpolation costs are very small given that we have an $8$ D inversion problem at hand. For example, at level $\ell(3)$ , MLLejaDV requires only $22$ PDE evaluations. The maximum reached level in the corresponding multiindex set is $4$ and all its multiindices have components larger than $1$ only in the first four directions. Indeed, the directional variance-based algorithm detects that the latter four directions are unimportant and thus invests effort only in the first four directions. Therefore, we see once again that using the enhanced adaptive algorithm for adaptive sparse grid interpolation leads to significant cost savings. For quadrature, the total number of evaluations of the interpolation surrogates is similar for the two variants.

6 Conclusions

We proposed a novel multilevel Leja algorithm for computing posterior approximations in computationally expensive, higher-dimensional Bayesian inverse problems. At each level adaptive sparse grid interpolation is employed to find a surrogate of the potential function, and adaptive sparse grid quadrature is then used to perform all integration operations with respect to the posterior. We considered two adaptive strategies for interpolation: (i) a standard method and (ii) an enhanced adaptive algorithm in which directional variances are used to ensure that only the most important stochastic directions are refined. The backbone of the proposed approach is the sequential update of the prior density. In this way, we can create weighted (L)-Leja points in areas of high posterior probability. Numerical experiments with elliptic inverse problems in 2D and 3D space show that the sequential update of the prior leads to considerably fewer model evaluations compared to the standard multilevel approach, which employs the prior density at all levels. We remark that the proposed approach is not designed to handle multimodal posterior densities well. In future research we will extend it such that it employs more general nonlinear mappings, e.g., transport maps, to accurately approximate arbitrary, multimodal posterior densities.

Acknowledgements

IGF thankfully acknowledges the support of the German Academic Exchange Service (DAAD). IGF, JL, and EU would like to thank the Isaac Newton Institute for Mathematical Sciences for support and hospitality during the programme Uncertainty quantification for complex systems: theory and methodologies when work on this paper was undertaken.

