Bayesian Probabilistic Numerical Methods

Jon Cockayne; Chris Oates; Tim Sullivan; Mark Girolami

arXiv:1702.03673·stat.ME·November 15, 2019

TL;DR

This paper formalizes Bayesian probabilistic numerical methods as solutions to inverse problems, providing conditions for their well-definition, convergence, and compositional use in complex tasks, bridging numerical analysis and uncertainty quantification.

Contribution

It establishes a rigorous Bayesian framework for probabilistic numerics, including convergence analysis and methods for composing solutions to complex numerical problems.

Findings

01

Bayesian probabilistic numerical methods are well-defined under general conditions.

02

A numerical approximation scheme with proven asymptotic convergence is proposed.

03

The framework is extended to pipelines of computation for complex tasks.

Abstract

The emergent field of probabilistic numerics has thus far lacked clear statistical principles. This paper establishes Bayesian probabilistic numerical methods as those which can be cast as solutions to certain inverse problems within the Bayesian framework. This allows us to establish general conditions under which Bayesian probabilistic numerical methods are well-defined, encompassing both non-linear and non-Gaussian models. For general computation, a numerical approximation scheme is proposed and its asymptotic convergence established. The theoretical development is then extended to pipelines of computation, wherein probabilistic numerical methods are composed to solve more challenging numerical tasks. The contribution highlights an important research frontier at the interface of numerical analysis and uncertainty quantification, with a challenging industrial application presented.

Jon Cockayne (University of Warwick)

Chris Oates (Newcastle University and Alan Turing Institute)

Tim Sullivan (Free University of Berlin and Zuse Institute Berlin)

Mark Girolami (Imperial College London and Alan Turing Institute)

1 Introduction

Numerical computation underpins almost all of modern scientific and industrial research and development. The impact of a finite computational budget is that problems whose solutions are high- or infinite-dimensional, such as the solution of differential equations, must be discretised in order to be solved. The result is an approximation to the object of interest. The declining rate of processor improvement as physical limits are reached is in contrast to the surge in complexity of modern inference problems, and as a result the error incurred by discretisation is attracting increased interest (e.g. Capistrán et al., 2016).

The situation is epitomised in modern climate models, where use of single-precision arithmetic has been explored to permit finer temporal resolution. However, when computing in single-precision, a detailed time discretisation can increase total error, due to the increased number of single precision computations, and in practice some form of ad-hoc trade-off is sought (Harvey and Verseghy, 2015). It has been argued that statistical considerations can permit more principled error control strategies for such models (Hennig et al., 2015).

Numerical methods are designed to mitigate discretisation errors of all forms (Press et al., 2007). Nonetheless, the introduction of error is unavoidable and it is the role of the numerical analyst to provide control of this error (Oberkampf and Roy, 2013). The central theoretical results of numerical analysis have in general not been obtained through statistical considerations. Even so, the connection of discretisation error to statistics was noted as far back as Henrici (1963) and Hull and Swenson (1966), who argued that discretisation error can be modelled using a series of independent random perturbations to standard numerical methods. Numerical analysts have since cast doubt on this approach, because discretisation error can be highly structured; see Kahan (1996) and Higham (2002, Section 2.8). To address these objections, the field of probabilistic numerics has emerged with the aim of properly quantifying the uncertainty introduced through discretisation in numerical methods.

The foundations of probabilistic numerics were laid in the 1970s and 1980s, when an important shift in emphasis occurred from the descriptive statistical models of the 1960s to the use of formal inference modalities that generalise across classes of numerical tasks. In a remarkable series of papers (Larkin, 1969, 1970, 1972; Kuelbs et al., 1972; Larkin, 1974, 1979a, 1979b), Mike Larkin presented now-classical results in probabilistic numerics, in particular establishing the correspondence between Gaussian measures on Hilbert spaces and optimal numerical methods. Although these results have been rediscovered and re-emphasised on a number of occasions, the role for statisticians in this new outlook was clearly captured in Kadane and Wasilkowski (1985):

Statistics can be thought of as a set of tools used in making decisions and inferences in the face of uncertainty. Algorithms typically operate in such an environment. Perhaps then, statisticians might join the teams of scholars addressing algorithmic issues.

The 1980s culminated in the development of Bayesian optimisation methods (Mockus, 1989; Törn and Žilinskas, 1989), as well as the relation of smoothing splines to Bayesian estimation (Kimeldorf and Wahba, 1970b; Diaconis and Freedman, 1983).

The modern notion of a probabilistic numerical method (henceforth PNM) was described in Hennig et al. (2015); these are algorithms whose output is a distribution over an unknown, deterministic quantity of interest, such as the numerical value of an integral. Recent research in this field includes PNMs for numerical linear algebra (Hennig, 2015; Bartels and Hennig, 2016), numerical solution of ordinary differential equations (ODEs; Schober et al., 2014; Kersting and Hennig, 2016; Schober et al., 2016; Conrad et al., 2016; Chkrebtii et al., 2016), numerical solution of partial differential equations (PDEs; Owhadi, 2015; Cockayne et al., 2016; Conrad et al., 2016) and numerical integration (O’Hagan, 1991; Briol et al., 2016).

Open Problems

Despite numerous recent successes and achievements, there is currently no general statistical foundation for PNMs, due to the infinite-dimensional nature of the problems being solved. For instance, at present it is not clear under what conditions a PNM is well-defined, except in the standard conjugate Gaussian framework considered in Larkin (1972). This limits the extent to which domain-specific knowledge, such as boundedness of an integrand or monotonicity of a solution to a differential equation, can be encoded in PNMs. In contrast, classical numerical methods often exploit such information to achieve substantial reduction in discretisation error. For instance, finite element methods for solution of PDEs proceed based on a mesh that is designed to be more refined in areas of the domain where greater variation of the solution is anticipated (Strang and Fix, 1973).

Furthermore, although PNMs have been proposed for many standard numerical tasks (see Section 2.6.1), the lack of common theoretical foundations makes comparison of these methods difficult. Again taking PDEs as an example, Cockayne et al. (2016) placed a probability distribution on the unknown solution of the PDE, whereas Conrad et al. (2016) placed a probability distribution on the unknown discretisation error of a numerical method. The uncertainty modelled in each case is fundamentally different, but at present there is no framework in which to articulate the relationship between the two approaches. Furthermore, though PNMs are often reported as being “Bayesian” there is no clear definition of what this ought to entail.

A more profound consequence of the lack of common foundation occurs when we seek to compose multiple PNMs. For example, multi-physics cardiac models involve coupled ODEs and PDEs which must each be discretised and approximately solved to estimate a clinical quantity of interest (Niederer et al., 2011). The composition of successive discretisations leads to non-trivial error propagation and accumulation that could be quantified, in a statistical sense, with PNMs. However, proper composition of multiple PNMs for solutions of ODEs and PDEs requires that these PNMs share common statistical foundations that ensure coherence of the overall statistical output. These foundations remain to be established.

Contributions

The main contribution of this paper is to establish rigorous foundations for PNMs:

The first contribution is to argue for an explicit definition of a “Bayesian” PNM. Our framework generalises the seminal work of Larkin (1972) and builds on the modern and popular mathematical framework of Stuart (2010). This illuminates subtle distinctions among existing methods and clarifies the sense in which non-Bayesian methods are approximations to Bayesian PNMs.

The second contribution is to establish when PNMs are well-defined outside of the conjugate Gaussian context. For exploration of non-linear, non-Gaussian models, a numerical approximation scheme is developed and shown to asymptotically approach the posterior distribution of interest. Our aim here is not to develop new or more computationally efficient PNMs, but to understand when such development can be well-defined.

The third contribution is to discuss pipelines of composed PNMs. This is a critical area of development for probabilistic numerics; in isolation, the error of a numerical method can often be studied and understood, but when composed into a pipeline the resulting error structure may be non-trivial and its analysis becomes more difficult. The real power of probabilistic numerics lies in its application to pipelines of numerical methods, where the probabilistic formulation permits analysis of variance (ANOVA) to understand the contribution of each discretisation to the overall numerical error. This paper introduces conditions under which a composition of PNMs can be considered to provide meaningful output, so that ANOVA can be justified.

Structure of the Paper

In Section 2 we argue for an explicit definition of Bayesian PNM and establish when such methods are well-defined. Section 3 establishes connections to other related fields, in particular with relation to evaluating the performance of PNMs. In Section 4 we develop useful numerical approximations to the output of Bayesian PNMs. Section 5 develops the theory of composition for multiple PNMs. Finally, in Section 6 we present applications of the techniques discussed in this paper.

All proofs can be found in either the Appendix or the Electronic Supplement.

2 Probabilistic Numerical Methods

The aim of this section is to provide rigorous statistical foundations for PNMs.

2.1 Notation

For a measurable space $(\mathcal{X},\Sigma_{\mathcal{X}})$ , the shorthand $\mathcal{P}_{\mathcal{X}}$ will be used to denote the set of all distributions on $(\mathcal{X},\Sigma_{\mathcal{X}})$ . For $\mu,\mu^{\prime}\in\mathcal{P}_{\mathcal{X}}$ we write $\mu\ll\mu^{\prime}$ when $\mu$ is absolutely continuous with respect to $\mu^{\prime}$ . The notation $\delta(x)$ will be used to denote a Dirac measure on $x\in\mathcal{X}$ , so that $\delta(x)\in\mathcal{P}_{\mathcal{X}}$ . Let $1[S]$ denote the indicator function of an event $S\in\Sigma_{\mathcal{X}}$ . For a measurable function $f:\mathcal{X}\to\mathbb{R}$ and a distribution $\mu\in\mathcal{P}_{\mathcal{X}}$ , we will on occasion use the notation $\mu(f)=\int f(x)\mu(\mathrm{d}x)$ and $\|f\|_{\infty}=\sup_{x\in\mathcal{X}}|f(x)|$ . The point-wise product of two functions $f$ and $g$ is denoted $f\cdot g$ . For a function or operator $T$ , $T_{\#}$ denotes the associated push-forward operator that acts on measures on the domain of $T$ . (Recall that, for measurable $T:\mathcal{X}\to\mathcal{A}$ , the push-forward $T_{\#}\mu$ of a distribution $\mu\in\mathcal{P}_{\mathcal{X}}$ is defined as $T_{\#}\mu(A)=\mu(T^{-1}(A))$ for all $A\in\Sigma_{\mathcal{A}}$ .) Let $\mathbin{\perp\!\!\!\perp}$ denote conditional independence. The subset $\ell^{p}\subset\mathbb{R}^{\infty}$ is defined to consist of sequences $(u_{i})$ for which $\sum_{i=1}^{\infty}|u_{i}|^{p}$ is convergent. $C(0,1)$ will be used to denote the set of continuous functions on $(0,1)$ .

2.2 Definition of a PNM

To first build intuition, consider numerical approximation of the Lebesgue integral

$$\int x(t)\,\nu(\mathrm{d}t)$$

for some integrable function $x\colon D\to\mathbb{R}$ , with respect to a measure $\nu$ on $D$ . Here we may directly interrogate the integrand $x(t)$ at any $t\in D$ , but unless $D$ is finite we cannot evaluate $x$ at all $t\in D$ with a finite computational budget. Nonetheless, there are many algorithms for approximation of this integral based on information $\{x(t_{i})\}_{i=1}^{n}$ at some collection of locations $\{t_{i}\}_{i=1}^{n}$ .

To see the abstract structure of this problem, assume the state variable $x$ exists in a measurable space $(\mathcal{X},\Sigma_{\mathcal{X}})$ . Information about $x$ is provided through an information operator $A\colon\mathcal{X}\to\mathcal{A}$ whose range is a measurable space $(\mathcal{A},\Sigma_{\mathcal{A}})$ . Thus, for the Lebesgue integration problem, the information operator is

$$A(x)=\begin{bmatrix}x(t_{1})\\ \vdots\\ x(t_{n})\end{bmatrix}=a\in\mathcal{A}. \tag{2.1}$$

The space $\mathcal{X}$ , in this case a space of functions, can be high- or infinite-dimensional, but the space $\mathcal{A}$ of information is assumed to be finite-dimensional in accordance with our finite computational budget. In this paper we make explicit a quantity of interest (QoI) $Q(x)$ , defined by a map $Q\colon\mathcal{X}\rightarrow\mathcal{Q}$ into a measurable space $(\mathcal{Q},\Sigma_{\mathcal{Q}})$ . This captures that $x$ itself may not be the object of interest for the numerical problem; for the Lebesgue integration illustration, the QoI is not $x$ itself but $Q(x)=\int x(t)\nu(\mathrm{d}t)$ .

The standard approach to such computational problems is to construct an algorithm which, when applied, produces some approximation $\hat{q}(a)$ of $Q(x)$ based on the information $a$ , whose theoretical convergence order can be studied. A successful algorithm will often tailor the information operator $A$ to the QoI $Q$ . For example, classical Gaussian cubature specifies sigma points $\{t_{i}^{*}\}_{i=1}^{n}$ at which the integrand must be evaluated, based on exact integration of certain polynomial test functions.
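The classical workflow described above can be sketched in a few lines. The following is a minimal illustration, assuming $D=[0,1]$ with $\nu$ the Lebesgue measure and using the Gauss-Legendre rule as one instance of Gaussian cubature; the function names (`gauss_legendre_unit_interval`, `information_operator`, `q_hat`) are hypothetical and chosen only to mirror the notation $A$ and $\hat{q}$ .

```python
import numpy as np


def gauss_legendre_unit_interval(n):
    """Nodes t_i* and weights w_i of the n-point Gauss-Legendre rule on [0, 1]."""
    u, w = np.polynomial.legendre.leggauss(n)  # rule on [-1, 1]
    return 0.5 * (u + 1.0), 0.5 * w            # affine map to [0, 1]


def information_operator(x, nodes):
    """A(x) = (x(t_1), ..., x(t_n)): evaluate the integrand at the nodes."""
    return np.array([x(t) for t in nodes])


def q_hat(a, weights):
    """Point estimate of Q(x) = int x(t) dt computed from the information a."""
    return float(np.dot(weights, a))


nodes, weights = gauss_legendre_unit_interval(3)

# A 3-point rule integrates polynomials up to degree 2*3 - 1 = 5 exactly ...
est_deg5 = q_hat(information_operator(lambda t: t**5, nodes), weights)
# ... but not degree 6, where a discretisation error remains.
est_deg6 = q_hat(information_operator(lambda t: t**6, nodes), weights)

print(abs(est_deg5 - 1.0 / 6.0), abs(est_deg6 - 1.0 / 7.0))
```

The nodes are tailored to the QoI: they are fixed by requiring exactness on polynomial test functions, which is the sense in which the information operator depends on $Q$ .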

The probabilistic numerical approach, instead, begins with the introduction of a random variable $X$ on $(\mathcal{X},\Sigma_{\mathcal{X}})$ . The true state $X=x$ is fixed but unknown; the randomness is an abstract device used to represent epistemic uncertainty about $x$ prior to evaluation of the information operator (Hennig et al., 2015). This is now formalised:

Definition 2.1 (Belief Distribution).

An element $\mu\in\mathcal{P}_{\mathcal{X}}$ is a belief distribution for $x$ if it carries the formal semantics of belief about the true, unknown state variable $x$ . (Two remarks are in order. First, we have avoided the use of “prior” as this abstract framework encompasses both Bayesian and non-Bayesian PNMs, to be defined. Second, the use of “belief” differs from the set-valued belief functions in Dempster–Shafer theory, which do not require that $\mu(E)+\mu(E^{\text{c}})=1$ ; see Shafer, 1976.)

Thus we may consider $\mu$ to be the law of $X$ . The construction of an appropriate belief distribution $\mu$ for a specific numerical task is not the focus of this research and has been considered in detail in previous work; see the Electronic Supplement for an overview of this material. Rather we consider the problem of how one updates the belief distribution $\mu$ in response to the information $A(x)=a$ obtained about the unknown $x$ . Generic approaches to update belief distributions, which generalise Bayesian inference beyond the unique update demanded in Bayes’ theorem, were formalised in Bissiri et al. (2016); de Carvalho et al. (2017).

Definition 2.2 (Probabilistic Numerical Method).

Let $(\mathcal{X},\Sigma_{\mathcal{X}})$ , $(\mathcal{A},\Sigma_{\mathcal{A}})$ and $(\mathcal{Q},\Sigma_{\mathcal{Q}})$ be measurable spaces and let $A\colon\mathcal{X}\to\mathcal{A}$ , $Q\colon\mathcal{X}\rightarrow\mathcal{Q}$ and $B\colon\mathcal{P}_{\mathcal{X}}\times\mathcal{A}\to\mathcal{P}_{\mathcal{Q}}$ where $A$ and $Q$ are measurable functions. The pair $M=(A,B)$ is called a probabilistic numerical method for estimation of a quantity of interest $Q$ . The map $A$ is called an information operator, and the map $B$ is called a belief update operator.

The output of a PNM is a distribution $B(\mu,a)\in\mathcal{P}_{\mathcal{Q}}$ . This holds the formal status of a belief distribution for the value of $Q(x)$ , based on both the initial belief $\mu$ about the value of $x$ and the information $a$ that are input to the PNM.

An objection sometimes raised to this construction is that $x$ itself is not random. We emphasise that this work does not propose that $x$ should be considered as such; the random variable $X$ is a formal statistical device used to represent epistemic uncertainty (Kadane, 2011; Lindley, 2014). Thus, there is no distinction from traditional statistics, in which $x$ represents a fixed but unknown parameter and $X$ encodes epistemic uncertainty about this parameter.

Before presenting specific instances of this general framework, we comment on the potential analogy between $A$ and the likelihood function, and between $B$ and Bayes’ theorem. Whilst the analogy is intuitively correct, the mathematical developments in this paper are not well-suited to these terms; in Section 2.5 we show that Bayes’ formula is not well-defined, as the posterior distribution is not absolutely continuous with respect to the prior.

To strengthen intuition we now give specific examples of established PNMs:

Example 2.3 (Probabilistic Integration).

Consider the numerical integration problem earlier discussed. Take $D\subseteq\mathbb{R}^{d}$ , $\mathcal{X}$ a separable Banach space of real-valued functions on $D$ , and $\Sigma_{\mathcal{X}}$ the Borel $\sigma$ -algebra for $\mathcal{X}$ . The space $(\mathcal{X},\Sigma_{\mathcal{X}})$ is endowed with a Gaussian belief distribution $\mu\in\mathcal{P}_{\mathcal{X}}$ . Given information $A(x)=a$ , define $\mu^{a}$ to be the restriction of $\mu$ to those functions which interpolate $x$ at the points $\{t_{i}\}_{i=1}^{n}$ ; that $\mu^{a}$ is again Gaussian follows from linearity of the information operator (see Bogachev, 1998, for details). The QoI $Q$ remains $Q(x)=\int x(t)\nu(\mathrm{d}t)$ .

This problem was first considered by Larkin (1972). The belief update operator proposed therein, and later considered in Diaconis (1988); O’Hagan (1991) and others, was $B(\mu,a)=Q_{\#}\mu^{a}$ . Since Gaussians are closed under linear projection, the PNM output $B(\mu,a)$ is a univariate Gaussian whose mean and variance can be expressed in closed-form for certain choices of Gaussian covariance function and reference measure $\nu$ on $D$ . Specifically, if $\mu$ has mean function $m\colon D\rightarrow\mathbb{R}$ and covariance function $k\colon D\times D\rightarrow\mathbb{R}$ , then

$$B(\mu,a)=\mathrm{N}\left(z^{\top}\mathrm{K}^{-1}(a-\bar{m}),\; z_{0}-z^{\top}\mathrm{K}^{-1}z\right)$$

where $\bar{m},z\in\mathbb{R}^{n}$ are defined as $\bar{m}_{i}=m(t_{i})$ , $z_{i}=\int k(t,t_{i})\nu(\mathrm{d}t)$ , $\mathrm{K}\in\mathbb{R}^{n\times n}$ is defined as $\mathrm{K}_{i,j}=k(t_{i},t_{j})$ and $z_{0}=\iint k(t,t^{\prime})\,\nu(\mathrm{d}t)\,\nu(\mathrm{d}t^{\prime})\in\mathbb{R}$ . This method was extensively studied in Briol et al. (2016), who provided a listing of $(\nu,k)$ combinations for which $z$ and $z_{0}$ possess a closed form.

An interesting fact is that the mean of $B(\mu,a)$ coincides with classical cubature rules for different choices of $\mu$ and $A$ (Diaconis, 1988; Särkkä et al., 2016). In Section 3 we will show that this is a typical feature of PNMs. The crucial distinction between PNMs and classical numerical methods is the distributional nature of $B(\mu,a)$ , which carries the formal semantics of belief about the QoI. The full distribution $B(\mu,a)$ was examined in Briol et al. (2016), who established contraction to the exact value of the integral under certain smoothness conditions on the Gaussian covariance function and on the integrand. See also Kanagawa et al. (2016); Karvonen and Särkkä (2017).
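The closed-form output of Example 2.3 can be evaluated directly once $z$ , $z_{0}$ and $\mathrm{K}$ are available. The sketch below assumes one particular combination that is not prescribed by the text: a zero-mean Brownian motion prior on $D=[0,1]$ , so that $k(t,t^{\prime})=\min(t,t^{\prime})$ and $\bar{m}=0$ , with $\nu$ the Lebesgue measure. Under these assumptions $z_{i}=t_{i}-t_{i}^{2}/2$ and $z_{0}=1/3$ . The function name and the node placement $t_{i}=2i/(2n+1)$ are illustrative choices.

```python
import numpy as np


def bayesian_quadrature_brownian(nodes, values):
    """Mean and variance of B(mu, a) = Q_# mu^a for a zero-mean Brownian prior.

    Assumes k(t, t') = min(t, t') on [0, 1] and nu = Lebesgue measure, so the
    kernel integrals z and z0 are available in closed form.
    """
    t = np.asarray(nodes, dtype=float)
    a = np.asarray(values, dtype=float)
    K = np.minimum.outer(t, t)       # K_ij = k(t_i, t_j)
    z = t - 0.5 * t**2               # z_i = int_0^1 min(t, t_i) dt
    z0 = 1.0 / 3.0                   # z0 = double integral of min(t, t')
    w = np.linalg.solve(K, z)        # cubature weights K^{-1} z
    mean = float(w @ a)              # z^T K^{-1} a   (prior mean is zero)
    var = float(z0 - w @ z)          # z0 - z^T K^{-1} z
    return mean, var


n = 5
nodes = 2.0 * np.arange(1, n + 1) / (2 * n + 1)
mean, var = bayesian_quadrature_brownian(nodes, nodes)  # integrand x(t) = t
print(mean, var)
```

The posterior mean is a weighted sum of the function evaluations, which is the sense in which it coincides with a classical cubature rule, while the variance depends only on the nodes and the prior and not on the observed values.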

Example 2.4 (Probabilistic Meshless Method).

As a canonical example of a PDE, take the following elliptic problem with Dirichlet boundary conditions

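$$\begin{aligned}-\nabla\cdot(\kappa\nabla x)&=f&&\text{in }D\\ x&=g&&\text{on }\partial D\end{aligned} \tag{2.3}$$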

where we assume $D\subset\mathbb{R}^{d}$ and $\kappa\colon D\rightarrow\mathbb{R}^{d\times d}$ is a known coefficient. Let $\mathcal{X}$ be a separable Banach space of appropriately differentiable real-valued functions and take $\Sigma_{\mathcal{X}}$ to be the Borel $\sigma$ -algebra for $\mathcal{X}$ . In contrast to the first illustration, the QoI here is $Q(x)=x$ , as the goal is to make inferences about the solution of the PDE itself.

Such problems were considered in Cockayne et al. (2016) wherein $\mu$ was restricted to be a Gaussian distribution on $\mathcal{X}$ . The information operator was constructed by choosing finite sets of locations $T_{1}=\left\{t_{1,1},\dots,t_{1,n_{1}}\right\}\subset D$ and $T_{2}=\left\{t_{2,1},\dots,t_{2,n_{2}}\right\}\subset\partial D$ at which the system defined in Eq. (2.3) was evaluated, so that

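$$A(x)=\begin{bmatrix}-\nabla\cdot(\kappa(t_{1,1})\nabla x(t_{1,1}))\\ \vdots\\ -\nabla\cdot(\kappa(t_{1,n_{1}})\nabla x(t_{1,n_{1}}))\\ x(t_{2,1})\\ \vdots\\ x(t_{2,n_{2}})\end{bmatrix},\qquad a=\begin{bmatrix}f(t_{1,1})\\ \vdots\\ f(t_{1,n_{1}})\\ g(t_{2,1})\\ \vdots\\ g(t_{2,n_{2}})\end{bmatrix}.$$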

The belief update operator was chosen to be $B(\mu,a)=\mu^{a}$ , where $\mu^{a}$ is the restriction of $\mu$ to those functions for which $A(x)=a$ is satisfied. In the setting of a linear system of PDEs such as that in Eq. (2.3), the distribution $B(\mu,a)$ is again Gaussian (Bogachev, 1998). Full details are provided in Cockayne et al. (2016).

As in the previous example, we note that the mean of $B(\mu,a)$ coincides with the numerical solution to the PDE provided by a classical method (the symmetric collocation method; Fasshauer, 1999). The full distribution $B(\mu,a)$ provides uncertainty quantification for the unknown exact solution and can again be shown to contract to the exact solution under certain smoothness conditions (Cockayne et al., 2016). This method was further analysed for a specific choice of covariance operator in the belief distribution $\mu$ , in an impressive contribution from Owhadi (2017).
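A one-dimensional toy version of Example 2.4 may help fix ideas. The sketch below is an illustration under assumptions that are not taken from Cockayne et al. (2016): $D=(0,1)$ , $\kappa\equiv 1$ so that the operator is $-\mathrm{d}^{2}/\mathrm{d}t^{2}$ , a zero-mean Gaussian prior with squared-exponential covariance of length-scale `ELL`, and the test problem $f(t)=\pi^{2}\sin(\pi t)$ , $g=0$ , whose exact solution is $\sin(\pi t)$ . All function names are hypothetical.

```python
import numpy as np

ELL = 0.2  # length-scale of the Gaussian covariance (an illustrative choice)


def _r(t, s):
    return np.asarray(t, float)[:, None] - np.asarray(s, float)[None, :]


def k(t, s):
    """Prior covariance Cov(x(t), x(s))."""
    return np.exp(-0.5 * _r(t, s) ** 2 / ELL**2)


def k_L(t, s):
    """Cov(x(t), (Lx)(s)) for L = -d^2/dt^2; by symmetry also Cov((Lx)(t), x(s))."""
    r = _r(t, s)
    return (1 / ELL**2 - r**2 / ELL**4) * np.exp(-0.5 * r**2 / ELL**2)


def k_LL(t, s):
    """Cov((Lx)(t), (Lx)(s)): fourth derivative of the kernel in t - s."""
    r = _r(t, s)
    poly = 3 / ELL**4 - 6 * r**2 / ELL**6 + r**4 / ELL**8
    return poly * np.exp(-0.5 * r**2 / ELL**2)


def meshless_posterior(T1, T2, f, g, t_star):
    """Mean and pointwise variance of mu^a at t_star, given A(x) = a."""
    a = np.concatenate([f(T1), g(T2)])                  # right-hand sides
    G = np.block([[k_LL(T1, T1), k_L(T1, T2)],
                  [k_L(T2, T1), k(T2, T2)]])            # Cov(A(X), A(X))
    C = np.hstack([k_L(t_star, T1), k(t_star, T2)])     # Cov(X(t*), A(X))
    mean = C @ np.linalg.solve(G, a)
    var = 1.0 - np.einsum("ij,ji->i", C, np.linalg.solve(G, C.T))
    return mean, np.clip(var, 0.0, None)                # clip round-off


T1 = np.linspace(0.0, 1.0, 11)[1:-1]   # interior collocation points
T2 = np.array([0.0, 1.0])              # boundary points
t_star = np.linspace(0.0, 1.0, 51)
mean, var = meshless_posterior(
    T1, T2,
    lambda t: np.pi**2 * np.sin(np.pi * t),
    lambda t: np.zeros_like(t),
    t_star,
)
max_err = np.max(np.abs(mean - np.sin(np.pi * t_star)))
print(max_err)
```

The mean computed here is a symmetric collocation solution, and the pointwise variance is the additional output that distinguishes the probabilistic method from its classical counterpart. Squared-exponential Gram matrices can be poorly conditioned, so a practical implementation would likely need more care than this sketch takes.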

2.2.1 Classical Numerical Methods

Standard numerical methods fit into the above framework, as can be seen by taking

$$B(\mu,a)=\delta\circ b(a)$$

independent of the distribution $\mu$ , where a function $b\colon\mathcal{A}\rightarrow\mathcal{Q}$ gives the output of some classical numerical method for solving the problem of interest. Here $\delta\colon\mathcal{Q}\rightarrow\mathcal{P}_{\mathcal{Q}}$ maps $b(a)\in\mathcal{Q}$ to a Dirac measure centred on $b(a)$ . Thus, information in $a\in\mathcal{A}$ is used to construct a point estimate $b(a)\in\mathcal{Q}$ for the QoI.

The formal language of probabilities is not used in classical numerical analysis to describe numerical error. However, in many cases the classical and probabilistic analyses are mathematically equivalent. For instance, there is an equivalence between the standard deviation of $B(\mu,a)$ for probabilistic integration and the worst-case error for numerical cubature rules from numerical analysis (Novak and Woźniakowski, 2010). The explanation for this phenomenon will be given in Section 3.
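The embedding of a classical method into the pair $M=(A,B)$ can be written out explicitly. The following sketch uses the composite trapezoidal rule as the classical method $b$ ; the classes `Dirac` and `PNM` and the factory `trapezoid_rule_pnm` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable

import numpy as np


@dataclass(frozen=True)
class Dirac:
    """Point mass delta(q) on the quantity-of-interest space."""
    location: float


@dataclass(frozen=True)
class PNM:
    """A probabilistic numerical method M = (A, B)."""
    information: Callable[[Callable[[float], float]], np.ndarray]  # A
    belief_update: Callable[[Any, np.ndarray], Dirac]              # B


def trapezoid_rule_pnm(nodes):
    """Wrap the composite trapezoidal rule as a PNM with B = delta o b."""
    t = np.asarray(nodes, dtype=float)

    def A(x):
        return np.array([x(ti) for ti in t])

    def b(a):
        # Classical point estimate of int x(t) dt from the values a = A(x).
        return float(np.sum(0.5 * (a[1:] + a[:-1]) * np.diff(t)))

    def B(mu, a):
        # The belief distribution mu is ignored: the output is delta(b(a)).
        return Dirac(b(a))

    return PNM(information=A, belief_update=B)


method = trapezoid_rule_pnm(np.linspace(0.0, 1.0, 5))
a = method.information(lambda t: 2.0 * t)
out = method.belief_update(None, a)
print(out)
```

Because `B` never reads `mu`, any initial belief yields the same point mass, which is exactly the independence from $\mu$ noted above.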

2.3 Bayesian PNMs

Having defined a PNM, we now state the central definition of this paper, that of a Bayesian PNM. Define $\mu^{a}$ to be the conditional distribution of the random variable $X$ , given the event $A(X)=a$ . For now we assume that this can be defined without ambiguity and reserve a more technical treatment of conditional probabilities for Section 2.5.

In this work we follow Larkin (1972) and cast the problem of determining $x$ in Eq. (2.1) as a problem of Bayesian inversion, a framework now popular in applied mathematics and uncertainty quantification research (Stuart, 2010). However, in a standard Bayesian inverse problem the observed quantity $a$ is assumed to be corrupted with measurement error, which is described by a “likelihood”. This leads, under mild assumptions, to general versions of Bayes’ theorem (see Stuart, 2010, Section 2.2).

For PNM, however, the information is not corrupted with measurement error. As a result, the support of the likelihood is a null set under the prior, making the standard approaches to such problems, including Bayes’ theorem, ill-defined outside of the conjugate Gaussian case when unknowns are infinite-dimensional. This necessitates a new definition:

Definition 2.5 (Bayesian Probabilistic Numerical Method).

A probabilistic numerical method $M=(A,B)$ is said to be Bayesian for a quantity of interest $Q$ if, for all $\mu\in\mathcal{P}_{\mathcal{X}}$ , the output

$$B(\mu,a)=Q_{\#}\mu^{a},\quad\text{for }A_{\#}\mu\text{-almost-all }a\in\mathcal{A}.$$

(The use of “Bayesian” contrasts with Bissiri et al. (2016), for whom all belief update operators represent Bayesian learning algorithms to some greater or lesser extent. An alternative term could be “lossless”, since all the information in $a$ is conditioned upon in $\mu^{a}$ .)

That is, a PNM is Bayesian if the output of the PNM is the push-forward of the conditional distribution $\mu^{a}$ through $Q$ . This definition is familiar from the examples in Section 2.2, which are both examples of Bayesian PNMs.

For Bayesian PNMs we adopt the traditional terminology in which $\mu$ is the prior for $x$ and the output $Q_{\#}\mu^{a}$ the posterior for $Q(x)$ . Note that, for fixed $A$ and $\mu$ , the Bayesian choice of belief update operator $B$ (if it exists) is uniquely defined.

It is emphasised that the class of Bayesian PNMs is a subclass of all PNMs; examples of non-Bayesian PNMs are provided in Section 2.6.1. Our analysis is focussed on Bayesian PNMs due to their appealing Bayesian interpretation and ease of generalisation to pipelines of computation in Section 5. For non-Bayesian PNMs, careful definition and analysis of the belief update operator is necessary to enable proper interpretation of the uncertainty quantification being provided. In particular, the analysis of non-Bayesian PNMs may present considerable challenges in the context of computational pipelines, whereas for Bayesian PNMs this is shown in Section 5 to be straightforward.
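In a finite-dimensional linear-Gaussian setting the Bayesian output $Q_{\#}\mu^{a}$ can be computed by standard Gaussian conditioning, which gives a concrete instance of Definition 2.5. The sketch below assumes $\mathcal{X}=\mathbb{R}^{d}$ , $\mu=\mathrm{N}(m,\Sigma)$ , and linear maps $A$ and $Q$ represented by matrices; these assumptions and the function name `bayesian_pnm_output` are illustrative only and are much narrower than the general framework.

```python
import numpy as np


def bayesian_pnm_output(m, Sigma, A, Q, a):
    """Mean and covariance of Q_# mu^a for mu = N(m, Sigma), linear A and Q."""
    m, Sigma = np.asarray(m, float), np.asarray(Sigma, float)
    A, Q, a = np.asarray(A, float), np.asarray(Q, float), np.asarray(a, float)

    S = A @ Sigma @ A.T                       # covariance of A(X)
    gain = np.linalg.solve(S, A @ Sigma).T    # Sigma A^T S^{-1}
    m_a = m + gain @ (a - A @ m)              # mean of mu^a
    Sigma_a = Sigma - gain @ A @ Sigma        # covariance of mu^a

    # Push-forward through the linear quantity of interest Q.
    return Q @ m_a, Q @ Sigma_a @ Q.T


# X in R^3 with a standard Gaussian belief; observe x_1 exactly; QoI is the sum.
mean_q, cov_q = bayesian_pnm_output(
    m=np.zeros(3),
    Sigma=np.eye(3),
    A=np.array([[1.0, 0.0, 0.0]]),
    Q=np.array([[1.0, 1.0, 1.0]]),
    a=np.array([2.0]),
)
print(mean_q, cov_q)
```

In this example the conditional covariance of $\mu^{a}$ is singular in the observed direction, because the information is noise-free. This is the finite-dimensional shadow of the difficulty with Bayes’ theorem described above.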

2.4 Model Evidence

A cornerstone of the Bayesian framework is the model evidence, or marginal likelihood (MacKay, 1992). Let $\mathcal{A}\subseteq\mathbb{R}^{n}$ be equipped with the Lebesgue reference measure $\lambda$ , such that $A_{\#}\mu$ admits a density $p_{A}=\mathrm{d}A_{\#}\mu/\mathrm{d}\lambda$ . Then the model evidence $p_{A}(a)$ , based on the information that $A(x)=a$ , can be used as the basis for Bayesian model comparison. In particular, two prior distributions $\mu$ , $\tilde{\mu}$ , can be compared through the Bayes factor

[TABLE]

where $\tilde{p}_{A}=\mathrm{d}A_{\#}\tilde{\mu}/\mathrm{d}\lambda$ . Here the second expression is independent of the choice of reference measure $\lambda$ and is thus valid for general $\mathcal{A}$ . The model evidence has been explored in connection with the design of Bayesian PNM. For the integration and PDE examples 2.3 and 2.4, the model evidence has a closed form and was investigated in Briol et al. (2016); Cockayne et al. (2016). In Section 6 we investigate the model evidence in the context of non-linear ODEs and PDEs for which it must be approximated.
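As a hedged illustration of how model evidence can drive prior comparison, the sketch below evaluates $\log p_{A}(a)$ in closed form for a scaled Brownian-motion prior with $x(0)=0$ and covariance $\sigma^{2}\min(t,t^{\prime})$, using point evaluations as the information. The prior family, the scale parameter `sigma`, the function names and the data are all assumptions made for this example. The orientation of the ratio is a convention chosen here and is not taken from the text.

```python
import math

def log_evidence_brownian(knots, values, sigma):
    """Log density of A#mu at `values` under a scaled Brownian-motion prior.

    Assumes x(0) = 0 and Cov(x(s), x(t)) = sigma**2 * min(s, t), so increments
    between consecutive knots are independent zero-mean Gaussians.
    `knots` must be strictly increasing and strictly positive.
    """
    log_p = 0.0
    prev_t, prev_a = 0.0, 0.0
    for t, a in zip(knots, values):
        var = sigma ** 2 * (t - prev_t)      # variance of this increment
        incr = a - prev_a
        log_p += -0.5 * math.log(2.0 * math.pi * var) - 0.5 * incr ** 2 / var
        prev_t, prev_a = t, a
    return log_p

def log_bayes_factor(knots, values, sigma, sigma_tilde):
    """log p_A(a) - log p~_A(a) for two prior scales (orientation is a convention)."""
    return (log_evidence_brownian(knots, values, sigma)
            - log_evidence_brownian(knots, values, sigma_tilde))

# Hypothetical information: four evaluations of the unknown function.
knots = [0.25, 0.5, 0.75, 1.0]
values = [0.1, -0.2, 0.05, 0.3]
log_bf = log_bayes_factor(knots, values, sigma=1.0, sigma_tilde=2.0)
print(log_bf)
```

A positive value favours the prior with scale `sigma` over the one with scale `sigma_tilde` for the observed information.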

2.5 The Disintegration Theorem

The purpose of this section is to formalise $\mu^{a}$ and to determine conditions under which $\mu^{a}$ exists and is well-defined. From Definition 2.5, the output of a Bayesian PNM is $B(\mu,a)=Q_{\#}\mu^{a}$ . If $\mu^{a}$ exists, the pushforward $Q_{\#}\mu^{a}$ exists as $Q$ is assumed to be measurable; thus, in this section, we focus on the rigorous definition of $\mu^{a}$ .

Unlike many problems of Bayesian inversion, proceeding by an analogue of Bayes’ theorem is not possible. Let $\mathcal{X}^{a}=\left\{x\in\mathcal{X}:A(x)=a\right\}$ . Then we observe that, if it is measurable, $\mathcal{X}^{a}$ may be a set of zero measure under $\mu$ . Standard techniques for infinite-dimensional Bayesian inversion rely on constructing a posterior distribution based on its Radon–Nikodým derivative with respect to the prior (Stuart, 2010). However, when $\mu^{a}\not\ll\mu$ no Radon–Nikodým derivative exists and we must turn to other approaches to establish when a Bayesian PNM is well-defined.

Conditioning on null sets is technical and was formalised in the celebrated construction of measure-theoretic probability by Kolmogorov (1933). The central challenge is to establish uniqueness of conditional probabilities. For this work we exploit the disintegration theorem to ensure our constructions are well-defined. The definition below is due to Dellacherie and Meyer (1978, p.78), and a statistical introduction to disintegration can be found in Chang and Pollard (1997).

Definition 2.6 (Disintegration).

For $\mu\in\mathcal{P}_{\mathcal{X}}$ , a collection $\{\mu^{a}\}_{a\in\mathcal{A}}\subset\mathcal{P}_{\mathcal{X}}$ is a disintegration of $\mu$ with respect to the (measurable) map $A\colon\mathcal{X}\rightarrow\mathcal{A}$ if:

1. (Concentration:) $\mu^{a}(\mathcal{X}\setminus\mathcal{X}^{a})=0$ for $A_{\#}\mu$ -almost all $a\in\mathcal{A}$ ;

and for each measurable $f\colon\mathcal{X}\rightarrow[0,\infty)$ it holds that

2. (Measurability:) $a\mapsto\mu^{a}(f)$ is measurable;

3. (Conditioning:) $\mu(f)=\int\mu^{a}(f)A_{\#}\mu(\mathrm{d}a)$ .

The concept of disintegration extends the usual concept of conditioning of random variables to the case where $\mathcal{X}^{a}$ is a null set, in a way closely related to regular conditional distributions (Kolmogorov, 1933). Existence of disintegrations is guaranteed under general weak conditions:

Theorem 2.7 (Disintegration Theorem; Thm. 1 of Chang and Pollard (1997)).

Let $\mathcal{X}$ be a metric space, $\Sigma_{\mathcal{X}}$ be the Borel $\sigma$ -algebra and $\mu\in\mathcal{P}_{\mathcal{X}}$ be Radon. Let $\Sigma_{\mathcal{A}}$ be countably generated and contain all singletons $\{a\}$ for $a\in\mathcal{A}$ . Then there exists a disintegration $\{\mu^{a}\}_{a\in\mathcal{A}}$ of $\mu$ with respect to $A$ . Moreover, if $\{\nu^{a}\}_{a\in\mathcal{A}}$ is another such disintegration, then $\{a\in\mathcal{A}:\mu^{a}\neq\nu^{a}\}$ is an $A_{\#}\mu$ -null set.

The requirement that $\mu$ is Radon is weak and is implied when $\mathcal{X}$ is a Radon space, which encompasses, for example, separable complete metric spaces. The requirement that $\Sigma_{\mathcal{A}}$ is countably generated is also weak and includes the standard case where $\mathcal{A}=\mathbb{R}^{n}$ with the Borel $\sigma$ -algebra. From Theorem 2.7 it follows that $\{\mu^{a}\}_{a\in\mathcal{A}}$ exists and is essentially unique for all of the examples considered in this paper. Thus, under mild conditions, we have established that Bayesian PNMs are well-defined, in that an essentially unique disintegration $\{\mu^{a}\}_{a\in\mathcal{A}}$ exists. It is noted that a variational definition of $\mu^{a}$ has been posited as an alternative approach, for when the existence of a disintegration is difficult to establish (p3 of Garcia Trillos and Sanz-Alonso, 2017).
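The three defining properties of a disintegration can be checked by hand in a finite setting. The sketch below is an illustrative toy only: on a finite state space every fibre $\mathcal{X}^{a}$ with $A_{\#}\mu(\{a\})>0$ has positive mass, so it does not exhibit the null-set difficulty that motivates the theorem. The state space, the map `A` and the test function `f` are assumptions chosen for the example.

```python
from collections import defaultdict
from fractions import Fraction

def disintegrate(mu, A):
    """Return (A#mu, {a: mu^a}) for a finite probability measure mu: dict x -> prob."""
    push = defaultdict(Fraction)
    for x, p in mu.items():
        push[A(x)] += p                       # pushforward A#mu
    cond = {}
    for a, pa in push.items():
        # mu^a is mu restricted to the fibre {x : A(x) = a}, renormalised
        cond[a] = {x: p / pa for x, p in mu.items() if A(x) == a}
    return dict(push), cond

def integrate(nu, f):
    """nu(f) for a finite measure nu."""
    return sum(f(x) * p for x, p in nu.items())

# Uniform prior on {0,...,5}; the information is the parity of x.
mu = {x: Fraction(1, 6) for x in range(6)}
A = lambda x: x % 2
f = lambda x: x * x

push, cond = disintegrate(mu, A)
lhs = integrate(mu, f)                                         # mu(f)
rhs = sum(integrate(cond[a], f) * push[a] for a in push)       # int mu^a(f) A#mu(da)
print(lhs, rhs)
```

Exact rational arithmetic is used so that the conditioning identity holds with equality rather than up to rounding.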

2.6 Prior Construction

The Gaussian distribution is popular as a prior in the PNM literature for its tractability, both in the fact that finite-dimensional distributions take a closed form and that an explicit conditioning formula exists. More general priors, such as Besov priors (Dashti et al., 2012) and Cauchy priors (Sullivan, 2016), are less easily accessed. In this section we summarise a common construction for these prior distributions, designed to ensure that a disintegration will exist.

Let $\left\{\phi_{i}\right\}_{i=0}^{\infty}$ denote an orthogonal Schauder basis for $\mathcal{X}$ , assumed to be a separable Banach space in this section. Then any $x\in\mathcal{X}$ can be represented through an expansion

[TABLE]

for some fixed element $x_{0}\in\mathcal{X}$ and a sequence $u\in\mathbb{R}^{\infty}$ . Construction of measures $\mu\in\mathcal{P}_{\mathcal{X}}$ is then reduced to construction of almost-surely convergent measures on $\mathbb{R}^{\infty}$ and studying the pushforward of such measures into $\mathcal{X}$ . In particular, this will ensure that $\mu\in\mathcal{P}_{\mathcal{X}}$ is Radon (as $\mathcal{X}$ is a separable complete metric space), a key requirement for existence of a disintegration $\{\mu^{a}\}_{a\in\mathcal{A}}$ .

To this end it is common to split $u$ into a stochastic and deterministic component; let $\xi\in\mathbb{R}^{\infty}$ represent an i.i.d. sequence of random variables, and $\gamma\in\ell^{p}$ for some $p\in(1,\infty)$ . Then with $u_{i}=\gamma_{i}\xi_{i}$ , for the prior distribution to be well-posed we require that almost-surely $u\in\ell^{1}$ . Different choices of $(\xi,\gamma)$ give rise to different distributions on $\mathcal{X}$ . For instance, $\xi_{i}\sim\text{Uniform}(-1,1)$ , $\gamma\in\ell^{1}$ is termed a uniform prior and $\xi_{i}\sim\mathcal{N}(0,1)$ gives a Gaussian prior, where $\gamma$ determines the regularity of the covariance operator $\mathcal{C}$ (Bogachev, 1998). The choice of $\xi_{i}\sim\textup{Cauchy}(0,1)$ gives a Cauchy prior in the sense of Sullivan (2016); here we require $\gamma\in\ell^{1}\cap\ell\log\ell$ for $\mathcal{X}$ a separable Banach space, or $\gamma\in\ell^{2}$ for when $\mathcal{X}$ is a Hilbert space.

A range of prior specifications will be explored in Section 6, including non-Gaussian prior distributions for numerical solution of nonlinear ODEs.
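The series construction with $u_{i}=\gamma_{i}\xi_{i}$ can be sketched as follows for the uniform, Gaussian and Cauchy choices of $\xi$. This is a minimal sketch under stated assumptions: the sine basis $\phi_{i}(t)=\sin((i+1)\pi t)$, the decay $\gamma_{i}=(i+1)^{-2}$, the truncation level and the scalar offset `x0` are all hypothetical choices made for illustration, not taken from the text.

```python
import math
import random

def sample_series_prior(n_terms, gamma, draw_xi, grid, x0=0.0):
    """Draw one truncated sample x(t) = x0 + sum_i gamma(i) * xi_i * phi_i(t).

    The basis phi_i(t) = sin((i + 1) * pi * t) is an assumption of this sketch.
    """
    u = [gamma(i) * draw_xi() for i in range(n_terms)]
    return [x0 + sum(u_i * math.sin((i + 1) * math.pi * t) for i, u_i in enumerate(u))
            for t in grid]

rng = random.Random(0)                       # fixed seed for reproducibility
grid = [k / 20 for k in range(21)]
gamma = lambda i: (i + 1) ** -2.0            # summable deterministic weights

# xi_i ~ Uniform(-1, 1): "uniform prior"
uniform_draw = sample_series_prior(50, gamma, lambda: rng.uniform(-1.0, 1.0), grid)
# xi_i ~ N(0, 1): Gaussian prior
gaussian_draw = sample_series_prior(50, gamma, lambda: rng.gauss(0.0, 1.0), grid)
# xi_i ~ Cauchy(0, 1), drawn by inverting the CDF
cauchy_draw = sample_series_prior(
    50, gamma, lambda: math.tan(math.pi * (rng.random() - 0.5)), grid)
```

Only the distribution of `draw_xi` changes between the three priors; the deterministic weights control how quickly the coefficients decay.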

2.6.1 Dichotomy of Existing PNMs

This section concludes with an overview of existing PNMs with respect to our definition of a Bayesian PNM. This serves to clarify some subtle distinctions in existing literature, as well as to highlight the generality of our framework. To maintain brevity we have summarised our findings in a comparison table.

3 Decision-Theoretic Treatment

Next we assess the performance of PNMs from a decision-theoretic perspective (Berger, 1985) and explore connections to average-case analysis of classical numerical methods (Ritter, 2000). Note that the treatment here is agnostic to whether the PNM in question is Bayesian, and also encompasses classical numerical methods. Throughout, the existence of a disintegration $\{\mu^{a}\}_{a\in\mathcal{A}}$ will be assumed.

3.1 Loss and Risk

Consider a generic loss function $L\colon\mathcal{Q}\times\mathcal{Q}\rightarrow\mathbb{R}$ where $L(q^{\dagger},q)$ describes the loss incurred when the true QoI $q^{\dagger}=Q(x)$ is estimated with $q\in\mathcal{Q}$ . Integrability of $L$ is assumed.

The belief update operator $B$ returns a distribution over $\mathcal{Q}$ which can be cast as a randomised decision rule for estimation of $q^{\dagger}$ . For randomised decision rules, the risk function $r\colon\mathcal{Q}\times\mathcal{P}_{\mathcal{Q}}\rightarrow\mathbb{R}$ is defined as

[TABLE]

The average risk of the PNM $M=(A,B)$ with respect to $\mu\in\mathcal{P}_{\mathcal{X}}$ is defined as

[TABLE]

Here a state $x\sim\mu$ is drawn at random and the risk of the PNM output $B(\mu,A(x))$ is computed. We follow the convention of terming $R(\mu,M)$ the Bayes risk of the PNM, though the usual objection that a frequentist expectation enters into the definition of the Bayes risk could be raised.

Next, we consider a sequence $A^{(n)}$ of information operators indexed such that $A^{(n)}(x)$ is $n$ -dimensional (i.e. $n$ pieces of information are provided about $x$ ).

Definition 3.1 (Contraction).

A sequence $M^{(n)}=(A^{(n)},B^{(n)})$ of PNMs is said to contract at a rate $r_{n}$ under a belief distribution $\mu$ if $R(\mu,M^{(n)})=O(r_{n})$ .

This definition allows for comparison of classical and probabilistic numerical methods (Kadane and Wasilkowski, 1983; Diaconis, 1988). In each case an important goal is to determine methods $M^{(n)}$ that contract as quickly as possible for a given distribution $\mu$ that defines the Bayes risk. This is the approach taken in average-case analysis (ACA; Ritter, 2000) and will be discussed in Section 3.4. For Examples 2.3 and 2.4 of Bayesian PNMs, Briol et al. (2016) and Cockayne et al. (2016) established rates of contraction for particular prior distributions $\mu$ ; we refer the reader to those papers for details.

3.2 Bayes Decision Rules

A (possibly randomised) decision rule is said to be a Bayes rule if it achieves the minimum Bayes risk among all decision rules. In the context of (not necessarily Bayesian) PNMs, let $M=(A,B)$ and let

[TABLE]

That is, for fixed $A$ , $\mathfrak{B}(A)$ is the set of all belief update operators that achieve minimum Bayes risk.

This raises the natural question of which belief update operators yield Bayes rules. Although the definition of a Bayes rule applies generically to both probabilistic and deterministic numerical methods, it can be shown (the proof is included in the Electronic Supplement) that if $\mathfrak{B}(A)$ is non-empty, then there exists a $B\in\mathfrak{B}(A)$ which takes the form of a classical numerical method, as expressed in Eq. (2.4). Thus in general, Bayesian PNMs do not constitute Bayes rules, as the extra uncertainty inflates the Bayes risk, so that such methods are not optimal.

Nonetheless, there is a natural connection between Bayesian PNMs and Bayes rules, as exposed in Kadane and Wasilkowski (1983):

Theorem 3.2.

Let $M=(A,B)$ be a Bayesian probabilistic numerical method for the QoI $Q$ . Let $(\mathcal{Q},\langle\cdot,\cdot\rangle_{\mathcal{Q}})$ be an inner-product space and let the loss function $L$ have the form $L(q^{\dagger},q)=\|q^{\dagger}-q\|_{\mathcal{Q}}^{2}$ , where $\|\cdot\|_{\mathcal{Q}}$ is the norm induced by the inner product. Then the decision rule that returns the mean of the distribution $B(\mu,a)$ is a Bayes rule for estimation of $q^{\dagger}$ .

This well-known fact from Bayesian decision theory, namely that the Bayes act is the posterior mean under squared-error loss (Berger, 1985), is interesting in light of recent research in constructing PNMs whose mean functions correspond to classical numerical methods (Schober et al., 2014; Hennig, 2015; Särkkä et al., 2016; Teymur et al., 2016; Schober et al., 2016). Theorem 3.2 explains the results in Examples 2.3 and 2.4, in which both instances of Bayesian PNMs were demonstrated to be centred on an established classical method.

3.3 Optimal Information

The previous section considered selection of the belief update operator $B$ , but not of the information operator $A$ . The choice of $A$ determines the Bayes risk for a PNM, which leads to a problem of experimental design to minimise that risk.

The theoretical study of optimal information is the focus of the information complexity literature (Traub et al., 1988; Novak and Woźniakowski, 2010), while other fields such as quasi-Monte Carlo (QMC, Dick and Pillichshammer, 2010) attempt to develop asymptotically optimal information operators for specific numerical tasks, such as the choice of evaluation points for numerical approximation of integrals in the case of QMC. Here we characterise optimal information for Bayesian PNMs.

Consider the choice of $A$ from a fixed subset $\Lambda$ of the set of all possible information operators. To build intuition, for the task of numerical integration, $\Lambda$ could represent all possible choices of locations $\{t_{i}\}_{i=1}^{n}$ where the integrand is evaluated. For Bayesian PNM, one can ask for optimal information:

[TABLE]

where we have made explicit the fact that the optimal information depends on the choice of prior $\mu$ . Next we characterise $A_{\mu}$ , while an explicit example of optimal information for a Bayesian PNM is detailed in Example 3.4.

3.4 Connection to Average Case Analysis

The decision theoretic framework in Section 3.1 is closely related to average-case analysis (ACA) of classical numerical methods (Ritter, 2000). In ACA the performance of a classical numerical method $b\colon\mathcal{A}\rightarrow\mathcal{Q}$ is studied in terms of the Bayes risk $R(\mu,M)$ given in Eq. (3.1), for the PNM $M=(A,B)$ with belief operator $B(\mu,a)=\delta\circ b(a)$ as in Eq. (2.4). ACA is concerned with the study of optimal information:

[TABLE]

In general there is no reason to expect $A_{\mu}$ and $A_{\mu}^{*}$ to coincide, since Bayesian PNMs are not Bayes rules (the distribution $Q_{\#}\mu^{a}$ will in general not be supported on the set of Bayes acts). Indeed, an explicit example where $A_{\mu}\neq A_{\mu}^{*}$ is presented in Appendix S3. However, we can establish sufficient conditions under which optimal information for a Bayesian PNM is the same as optimal information for ACA:

Theorem 3.3.

Let $(\mathcal{Q},\langle\cdot,\cdot\rangle_{\mathcal{Q}})$ be an inner product space and the loss function $L$ have the form $L(q^{\dagger},q)=\|q^{\dagger}-q\|_{\mathcal{Q}}^{2}$ where $\|\cdot\|_{\mathcal{Q}}$ is the norm induced by the inner product. Then the optimal information $A_{\mu}$ for a Bayesian PNM and $A_{\mu}^{*}$ for ACA are identical.

It is emphasised that this result is not a trivial consequence of the correspondence between Bayes rules and worst case optimal methods, as exposed in Kadane and Wasilkowski (1983). To the best of our knowledge, information-based complexity research has studied $A_{\mu}^{*}$ but not $A_{\mu}$ .

Theorem 3.3 establishes that, for the squared norm loss, we can extract results on optimal average case information from the ACA literature and use them to construct optimal Bayesian PNMs. An example is provided next.

Example 3.4 (Optimal Information for Probabilistic Integration).

To illustrate optimal information for Bayesian PNMs, we revisit the first worked example of ACA, due to Sul′din (1959, 1960). Set $\mathcal{X}=\{x\in C(0,1):x(0)=0\}$ and take the belief distribution $\mu$ to be induced from the Wiener process on $\mathcal{X}$ , i.e. a Gaussian process with mean $0$ and covariance function $k(t,t^{\prime})=\min(t,t^{\prime})$ . Our QoI is $Q(x)=\int_{0}^{1}x(t)\mathrm{d}t$ and the loss function is $L(q,q^{\prime})=(q-q^{\prime})^{2}$ .

Consider standard information $A(x)=(x(t_{1}),\dots,x(t_{n}))$ for $n$ fixed knots $0\leq t_{1}<\dots<t_{n}\leq 1$ . Our aim is to determine knots $t_{i}$ that represent optimal information for a Bayesian PNM with respect to $\mu$ and $L$ .

Motivated by Theorem 3.3 we first solve the optimal information problem for ACA and then derive the associated PNM. It will be sufficient to restrict attention to linear methods $b(a)=\sum_{i=1}^{n}w_{i}x(t_{i})$ with $w_{i}\in\mathbb{R}$ . This allows a closed-form expression for the average error:

[TABLE]

Standard calculus can be used to minimise Eq. (3.2) over both the weights $\{w_{i}\}_{i=1}^{n}$ and the locations $\{t_{i}\}_{i=1}^{n}$ ; the full calculation can be found in Chapter 2, Section 3.3 of Ritter (2000). The result is an ACA optimal method

[TABLE]

which is recognised as the trapezium rule with equally spaced knots. The associated contraction rate $r_{n}$ is $n^{-1}$ (Lee and Wasilkowski, 1986).

From Theorem 3.3 we have that ACA optimal information is also optimal information for the Bayesian PNM. Thus the optimal Bayesian PNM $M=(A,B)$ for the belief distribution $\mu$ is uniquely determined:

[TABLE]

Note how the PNM is centred on the ACA optimal method. However, the PNM itself is not a Bayes rule; it in fact carries twice the Bayes risk of the ACA method.
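The centring of the Bayesian PNM on a trapezium-type rule can be checked numerically. The sketch below conditions the Wiener-process prior with kernel $\min(t,t^{\prime})$ on evaluations at given knots and compares the posterior-mean quadrature weights with those obtained by integrating the piecewise-linear interpolant through $(0,0)$ and the data, held constant after the last knot. The knots used are a hypothetical choice for illustration and are not claimed to be the optimal knots of the example. The helper names are assumptions of this sketch.

```python
def solve(K, z):
    """Solve K w = z by Gaussian elimination with partial pivoting."""
    n = len(z)
    M = [row[:] + [z_i] for row, z_i in zip(K, z)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (M[r][n] - sum(M[r][k] * w[k] for k in range(r + 1, n))) / M[r][r]
    return w

def bq_weights_and_variance(knots):
    """Posterior-mean weights and posterior variance of int_0^1 x(t) dt."""
    K = [[min(s, t) for t in knots] for s in knots]   # Cov(x(s), x(t))
    z = [t - 0.5 * t * t for t in knots]              # int_0^1 min(s, t) ds
    w = solve(K, z)
    var = 1.0 / 3.0 - sum(w_i * z_i for w_i, z_i in zip(w, z))  # prior variance is 1/3
    return w, var

def trapezium_weights(knots):
    """Weights from integrating the linear interpolant through (0, 0) and the knots."""
    pts = [0.0] + list(knots)
    n = len(knots)
    weights = []
    for i in range(1, n + 1):
        left = 0.5 * (pts[i] - pts[i - 1])
        # after the last knot the interpolant is held constant up to t = 1
        right = 0.5 * (pts[i + 1] - pts[i]) if i < n else 1.0 - pts[i]
        weights.append(left + right)
    return weights

knots = [0.25, 0.5, 0.75, 1.0]
w_bq, var_bq = bq_weights_and_variance(knots)
w_trap = trapezium_weights(knots)
print(w_bq, w_trap, var_bq)
```

The posterior variance returned alongside the weights is the quantity that a classical trapezium rule does not report.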

This illustration can be generalised. It is known that for $\mu$ induced from the Wiener process on $\partial^{s}x$ , $Q$ a linear functional and $\phi$ a loss function that is convex and symmetric, equi-spaced evaluation points are essentially optimal information, the Bayes rule is the natural spline of degree $2s+1$ , and the contraction rate $r_{n}$ is essentially $n^{-(s+1)}$ ; see Lee and Wasilkowski (1986) for a complete treatment.
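The factor of two noted in Example 3.4 can be seen in a small simulation. Under squared-error loss, reporting the posterior mean incurs an average loss equal to the posterior variance $v$, whereas reporting an independent draw from the posterior incurs $2v$. The sketch below assumes a Gaussian posterior $\mathcal{N}(m,v)$ purely for convenience; the values of `post_mean` and `post_var` are hypothetical.

```python
import random

def average_risks(post_mean, post_var, n_draws=200_000, seed=0):
    """Monte Carlo estimates of the squared-error risk of two decision rules.

    Returns (risk of the posterior-mean rule, risk of the randomised rule that
    reports an independent draw from the posterior).
    """
    rng = random.Random(seed)
    sd = post_var ** 0.5
    risk_mean = 0.0
    risk_random = 0.0
    for _ in range(n_draws):
        q_true = rng.gauss(post_mean, sd)     # true QoI, distributed as the posterior
        q_draw = rng.gauss(post_mean, sd)     # independent draw from the PNM output
        risk_mean += (q_true - post_mean) ** 2
        risk_random += (q_true - q_draw) ** 2
    return risk_mean / n_draws, risk_random / n_draws

r_mean, r_random = average_risks(post_mean=0.3, post_var=0.05)
print(r_mean, r_random, r_random / r_mean)
```

The ratio is close to two regardless of the chosen mean and variance, since the variance of a difference of two independent draws is the sum of their variances.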

This completes our performance assessment for PNMs; next we turn to computational matters.

4 Numerical Disintegration

In this section we discuss algorithms to access the output from a Bayesian PNM. The approach considered in this paper is to form an explicit approximation to $\mu^{a}$ that can be sampled. The construction of a sampling scheme can exploit sophisticated Monte Carlo methods and allow probing $B(\mu,a)$ at a computational cost that is de-coupled from the potentially substantial cost of obtaining the information $a$ itself.

The construction of an approximation to $\mu^{a}$ is non-trivial on a technical level. As shown in Section 2.5, under weak conditions on the space $\mathcal{X}$ and the operator $A$ , the disintegration $\mu^{a}$ is well-defined for $A_{\#}\mu$ -almost all $a\in\mathcal{A}$ . The approach considered in this work is based on sampling from an approximate distribution $\mu_{\delta}^{a}$ which converges in an appropriate sense to $\mu^{a}$ in the $\delta\downarrow 0$ limit. This follows in a similar spirit to Ackerman et al. (2017).

4.1 Sequential Approximation of a Disintegration

Suppose that $\mathcal{A}$ is an open subset of $\mathbb{R}^{n}$ and that the distribution $A_{\#}\mu\in\mathcal{P}_{\mathcal{A}}$ , admits a continuous and positive density $p_{A}$ with respect to Lebesgue measure on $\mathcal{A}$ . Further endow $\mathcal{A}$ with the structure of a Hilbert space, with norm $\|\cdot\|_{\mathcal{A}}$ .

Let $\phi\colon\mathbb{R}_{+}\rightarrow\mathbb{R}_{+}$ denote a decreasing function, to be specified, that is continuous at $0$ , with $\phi(0)=1$ and $\lim_{r\rightarrow\infty}\phi(r)=0$ . Consider

[TABLE]

where the normalisation constant

[TABLE]

is non-zero since $p_{A}$ is bounded away from 0 on a neighbourhood of $a\in\mathcal{A}$ and $\phi$ is bounded away from 0 on a sufficiently small interval $[0,\gamma]$ . Our aim is to approximate $\mu^{a}$ with $\mu^{a}_{\delta}$ for small bandwidth parameter $\delta$ . The construction, which can be considered a mathematical generalisation of approximate Bayesian computation (Del Moral et al., 2012), ensures that $\mu^{a}_{\delta}\ll\mu$ . The role of $\phi$ is to admit states $x\in\mathcal{X}$ for which $A(x)$ is close to $a$ but not necessarily equal. It is assumed to be sufficiently regular:

Assumption 4.1.

There exists $\alpha>0$ such that $C_{\phi}^{\alpha}\coloneqq\int r^{\alpha+n-1}\phi(r)\mathrm{d}r<\infty$ .

To discuss the convergence of $\mu_{\delta}^{a}$ to $\mu^{a}$ we must first select a metric on $\mathcal{P}_{\mathcal{X}}$ . Let $\mathcal{F}$ be a normed space of (measurable) functions $f\colon\mathcal{X}\to\mathbb{R}$ with norm $\left\|\cdot\right\|_{\mathcal{F}}$ . For measures $\nu,\nu^{\prime}\in\mathcal{P}_{\mathcal{X}}$ , define

[TABLE]

This formulation encompasses many common probability metrics such as the total variation distance and Wasserstein distance (Müller, 1997). However, not all spaces of functions $\mathcal{F}$ lead to useful theory. In particular the total variation distance between $\mu^{a}$ and $\mu^{a^{\prime}}$ for $a\neq a^{\prime}$ will be one in general. Furthermore, depending on the choice of $\mathcal{F}$ , $d_{\mathcal{F}}$ may be merely a pseudometric (for a pseudometric, $d_{\mathcal{F}}(x,y)=0\implies x=y$ need not hold). Sufficient conditions for weak convergence with respect to $\mathcal{F}$ are now established:

Assumption 4.2.

The map $a\mapsto\mu^{a}$ is almost everywhere $\alpha$ -Hölder continuous in $d_{\mathcal{F}}$ , i.e.

[TABLE]

for some constant $C_{\mu}^{\alpha}>0$ and for $A_{\#}\mu$ almost all $a,a^{\prime}\in\mathcal{A}$ .

Sufficient conditions for Assumption 4.2 are discussed in Ackerman et al. (2017), but are somewhat technical.

Theorem 4.3.

Let $\bar{C}_{\phi}^{\alpha}\coloneqq C_{\phi}^{\alpha}/C_{\phi}^{0}$ . Then, for $\delta>0$ sufficiently small,

[TABLE]

for $A_{\#}\mu$ almost all $a\in\mathcal{A}$ .

This result justifies the approximation of $\mu^{a}$ by $\mu_{\delta}^{a}$ when the QoI can be well-approximated by integrals with respect to $\mathcal{F}$ . This result is stronger than that of earlier work, such as Pfanzagl (1979), in that it holds for infinite-dimensional $\mathcal{X}$ , though it also relies upon the stronger Hölder continuity assumption.

The specific form for $\phi$ is not fundamental, but can impact upon rate constants. For the choice $\phi(r)=1[r<1]$ we have $\bar{C}_{\phi}^{\alpha}=\frac{n}{\alpha+n}$ , which can be bounded independent of the dimension $n$ of $\mathcal{A}$ . On the other hand, for $\phi(r)=\exp(-\frac{1}{2}r^{2})$ it can be shown that, for $\alpha\in\mathbb{N}$ ,

[TABLE]

so that the constant $\bar{C}_{\phi}^{\alpha}$ might not be bounded. In general this necessitates effective Monte Carlo methods that are able to sample from the regime where $\delta$ can be extremely small, in order to control the overall approximation error.
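The effect of the bandwidth $\delta$ can be illustrated on a toy problem where the exact answer is known. The sketch below assumes that the relaxed measure has the form $\mu_{\delta}^{a}(\mathrm{d}x)\propto\phi(\|A(x)-a\|_{\mathcal{A}}/\delta)\,\mu(\mathrm{d}x)$, which is how the construction is described in the text but is not copied from the displayed equation. The prior (two independent standard Gaussians), the information operator $A(x)=x_{1}+x_{2}$, the QoI $Q(x)=x_{1}$ and the use of self-normalised importance sampling from the prior are all assumptions of this example. For the Gaussian choice of $\phi$ the relaxed posterior mean of $x_{1}$ is $a/(2+\delta^{2})$, which tends to the exact conditional mean $a/2$ as $\delta\downarrow 0$.

```python
import math
import random

def relaxed_posterior_mean(a, delta, phi, n_samples=50_000, seed=0):
    """Self-normalised importance-sampling estimate of E[Q(x)] under mu_delta^a.

    Toy model (an assumption of this sketch): x = (x1, x2) with independent
    N(0, 1) components, A(x) = x1 + x2, Q(x) = x1.
    """
    rng = random.Random(seed)
    num = 0.0
    den = 0.0
    for _ in range(n_samples):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)   # draw x from the prior
        w = phi(abs((x1 + x2) - a) / delta)                  # relaxation weight
        num += w * x1
        den += w
    return num / den

gaussian_phi = lambda r: math.exp(-0.5 * r * r)

a = 1.0
for delta in (1.0, 0.5, 0.1):
    estimate = relaxed_posterior_mean(a, delta, gaussian_phi)
    print(delta, estimate, a / (2.0 + delta ** 2))
```

As `delta` shrinks, fewer prior samples receive appreciable weight, which is one way to see why more effective Monte Carlo methods are needed in the small-$\delta$ regime.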

4.2 Computation for Series Priors

The series representation of $\mu$ in Eq. (2.6) of Section 2.6 is infinite-dimensional and thus cannot, in general, be instantiated. To this end, define $\mathcal{X}_{N}=x_{0}+\text{span}\{\phi_{0},\dots,\phi_{N}\}$ and define the associated projection operator $P_{N}\colon\mathcal{X}\rightarrow\mathcal{X}_{N}$ as

[TABLE]

A natural approach is to compute with the modified information operator $A\circ P_{N}$ instead of $A$ . This has the effect of updating the distribution of the first $N+1$ coefficients and leaving the tail unchanged, to produce an output $\mu_{\delta,N}^{a}$ . Then computation performed in the Bayesian update step is finite-dimensional, whilst instantiation of the posterior itself remains infinite-dimensional. A “likelihood-informed” choice of basis $\left\{\phi_{i}\right\}$ in such problems was considered in Cui et al. (2016).

Inspired by this approach, we next consider convergence of the output $\mu_{\delta,N}^{a}$ to $\mu_{\delta}^{a}$ in the limit $N\rightarrow\infty$ . In this section it is additionally required that $\phi$ be everywhere continuous with $\phi>0$ . Let $\varphi=-\log\phi$ , so that $\varphi$ is a continuous bijection of $\mathbb{R}_{+}$ to itself. The following are also assumed:

Assumption 4.4.

For each $R>0$ , it holds that $|\varphi(r)-\varphi(r^{\prime})|\leq C_{R}|r-r^{\prime}|$ for some constant $C_{R}$ and all $r,r^{\prime}<R$ .

Assumption 4.5.

$\|A(x)-A\circ P_{N}(x)\|_{\mathcal{A}}\leq\exp(m(\|x\|_{\mathcal{X}}))\Psi(N)$ for all $x\in\mathcal{X}$ , where $m$ is measurable and satisfies $\mathbb{E}_{X\sim\mu}[\exp(2m(\|X\|_{\mathcal{X}}))]<\infty$ and $\Psi(N)$ vanishes as $N$ is increased.

Assumption 4.6.

$\sup_{x\in\mathcal{X}}\|A(x)\|_{\mathcal{A}}<\infty$ .

Assumption 4.7.

$\left\|f\right\|_{\infty}\leq C_{\mathcal{F}}\left\|f\right\|_{\mathcal{F}}$ for some constant $C_{\mathcal{F}}$ and all $f\in\mathcal{F}$ .

Assumption 4.4 holds for the case $\varphi(r)=\frac{1}{2}r^{2}$ with constant $C_{R}=R$ . Assumption 4.5 is standard in the inverse problem literature; for instance it is shown to hold for certain series priors in Theorem 3.4 of Cotter et al. (2010). Assumption 4.6 is, in essence, a compactness assumption, in that it is implied by compactness of the state space $\mathcal{X}$ when $A$ is linear. In this sense it is a strong assumption; however it can be enforced in our experiments, where $\mathcal{X}$ is unbounded, through a threshold map

[TABLE]

where $\lambda_{\max}$ is a large pre-defined constant. Assumption 4.7 places a restriction on the probability metric $d_{\mathcal{F}}$ in which our result is stated.

The following theorem has its proof in the Electronic Supplement:

Theorem 4.8.

For some constant $C_{\delta}$ , dependent on $\delta$ , it holds that $d_{\mathcal{F}}(\mu_{\delta,N}^{a},\mu_{\delta}^{a})\leq C_{\delta}\Psi(N)$ .

An immediate consequence of Theorems 4.3 and 4.8 is that the total approximation error can be bounded by applying the triangle inequality:

[TABLE]

In particular, we have convergence of $\mu_{\delta,N}^{a}$ to $\mu^{a}$ in the $\delta\downarrow 0$ limit provided that the number of basis functions satisfies $C_{\delta}\Psi(N)=o(1)$ .

The approximate posterior $\mu_{\delta,N}^{a}$ analysed above can be sampled when $\mu$ is Gaussian, since the first $N+1$ coefficients can be handled with MCMC and the tail $\sum_{i=N+1}^{\infty}u_{i}\phi_{i}$ , being Gaussian, can be sampled. However, when $\mu$ is non-Gaussian the tail is not recognised in a form that can be sampled. For the experiments in Section 6, in which both Gaussian and non-Gaussian priors $\mu$ are considered, the series in Eq. (2.6) was truncated at level $N+1$ , with the resultant prior denoted $\mu_{N}$ . The associated posterior was then entirely supported on the finite-dimensional subspace $\mathcal{X}_{N}$ ; this is mathematically equivalent to working with the projected output $P_{N}\mu_{\delta,N}^{a}$ . Analysis of prior truncation, as opposed to modification of the information operator just reported, is known to be difficult. Indeed, while $\mu_{N}$ converges to $\mu$ weakly, it does not do so in total variation, and this deficiency generally transfers to the associated posteriors. In general the impact of prior perturbation is a subtle topic — see e.g. Owhadi et al. (2015) and the references therein — and we therefore defer theoretical analysis of this approximation to future work.

4.3 Monte Carlo Methods for Numerical Disintegration

The previous sections established a sequence of well-defined distributions $\mu_{\delta}^{a}$ (or $\mu_{\delta,N}^{a}$ for non-Gaussian models) which converge (in a specific weak sense) to the exact disintegration $\mu^{a}$ . By construction, $\mu_{\delta}^{a}\ll\mu$ and this is sufficient to allow standard Monte Carlo methods to be used. The construction of Monte Carlo methods is de-coupled from the core material in the main text and the main methodological considerations are well-documented (e.g. Girolami and Calderhead, 2011).

For the experiments reported in subsequent sections two approaches were explored: a Sequential Monte Carlo (SMC) method (Doucet et al., 2001) and a parallel tempering method (Geyer, 1991). These provide transparent sampling schemes whose non-asymptotic approximation error can be theoretically understood. In particular, they provide robust estimators of model evidence that can be used for Bayesian model comparison. Full details of the Monte Carlo methods used for this work, along with associated theoretical analysis for the SMC method, are contained in Section S4.1 of the Electronic Supplement.

5 Computational Pipelines and PNM

The last theoretical development in this paper concerns composition of several PNMs. Most analysis of numerical methods focuses on the error incurred by an individual method. However, real-world computational procedures typically rely on the composition of several numerical methods. The manner in which accumulated discretisation error affects computational output may be highly non-trivial (Roy, 2010; Anderson, 2011; Babuška and Söderlind, 2016). An extreme example occurs when one of the numerical methods in a pipeline is charged with integration of a chaotic dynamical system (Strogatz, 2014).

In recent work, Chkrebtii et al. (2016), Conrad et al. (2016) and Cockayne et al. (2016) each used PNMs within a broader statistical procedure to estimate unknown parameters in systems of ODEs and PDEs. The probabilistic description of discretisation error was incorporated into the data-likelihood, resulting in posterior distributions for parameters with inflated uncertainty to properly account for the inferential impact of discretisation error. However, beyond these limited works, no examination of the composition of PNMs has been performed. In particular, the question of which PNMs can be composed, and when the output of such a composition is meaningful, has not been addressed. This is important; for instance, if the output of a composition of PNMs is to be used for analysis of variance to elucidate the main sources of discretisation error, then it is important that such output is meaningful.

This section defines a pipeline as an abstract graphical object that may be combined with a collection of compatible PNMs. It is proven that when compatible Bayesian PNMs are employed in the pipeline, the distributional output of the pipeline carries a Bayesian interpretation under an explicit conditional independence condition on the prior $\mu$ .

To build intuition, for the simple case where two Bayesian PNMs are composed in series, our results provide conditions for when, informally, the output $B_{2}(B_{1}(\mu,a_{1}),a_{2})$ corresponds to a single Bayesian procedure $B(\mu,(a_{1},a_{2}))$ . To reduce the notational and technical burden, in this section we will not provide rigorous measure theoretic details; however we note that those details broadly follow the same pattern as in Section 2.5.
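The informal identity $B_{2}(B_{1}(\mu,a_{1}),a_{2})=B(\mu,(a_{1},a_{2}))$ can be checked in the simplest possible setting, where the first method passes on the full conditional distribution of the state (that is, its QoI is the identity) and everything is Gaussian. The sketch below is only that special case, under the assumption of a three-point Wiener-process prior with hypothetical knots and data; it does not address the conditional independence conditions needed in general.

```python
def condition_on_coordinate(mean, cov, j, value):
    """Condition N(mean, cov) on coordinate j being exactly `value`."""
    n = len(mean)
    gain = [cov[i][j] / cov[j][j] for i in range(n)]
    new_mean = [mean[i] + gain[i] * (value - mean[j]) for i in range(n)]
    new_cov = [[cov[i][k] - gain[i] * cov[j][k] for k in range(n)] for i in range(n)]
    return new_mean, new_cov

def condition_on_pair(mean, cov, j1, j2, v1, v2):
    """Condition jointly on coordinates j1 and j2 via the explicit 2x2 inverse."""
    n = len(mean)
    a, b, d = cov[j1][j1], cov[j1][j2], cov[j2][j2]
    det = a * d - b * b
    r1, r2 = v1 - mean[j1], v2 - mean[j2]
    s1 = (d * r1 - b * r2) / det              # first entry of S^{-1} r
    s2 = (-b * r1 + a * r2) / det             # second entry of S^{-1} r
    new_mean = [mean[i] + cov[i][j1] * s1 + cov[i][j2] * s2 for i in range(n)]
    new_cov = [[cov[i][k]
                - (cov[i][j1] * (d * cov[j1][k] - b * cov[j2][k])
                   + cov[i][j2] * (-b * cov[j1][k] + a * cov[j2][k])) / det
                for k in range(n)] for i in range(n)]
    return new_mean, new_cov

# Prior: Wiener process evaluated at three knots, Cov = min(s, t).
knots = [0.25, 0.5, 1.0]
mean0 = [0.0, 0.0, 0.0]
cov0 = [[min(s, t) for t in knots] for s in knots]

# Series composition: condition on x(0.25) = 0.1, then on x(1.0) = 0.4.
m1, c1 = condition_on_coordinate(mean0, cov0, 0, 0.1)
seq_mean, seq_cov = condition_on_coordinate(m1, c1, 2, 0.4)

# Single Bayesian update on both pieces of information at once.
joint_mean, joint_cov = condition_on_pair(mean0, cov0, 0, 2, 0.1, 0.4)
print(seq_mean[1], joint_mean[1], seq_cov[1][1], joint_cov[1][1])
```

Both routes give the same belief about $x(0.5)$, which here is the Brownian-bridge mean and variance between the two conditioned points.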

5.1 Computational Pipelines

To analyse pipelines of PNMs, we consider $n$ such methods $M_{1},\dots,M_{n}$ , where each method $M_{i}=(A_{i},B_{i})$ is defined on a common state space $\mathcal{X}$ (this is without loss of generality, since $\mathcal{X}$ can be taken as the union of all state spaces required by the individual methods) and targets a QoI $Q_{i}\in\mathcal{Q}_{i}$ . A pipeline will be represented as a directed graphical model, wherein the QoIs $Q_{i}$ from parent methods constitute information operators for child methods. It may be that a method will take quantities from multiple parents as input. To allow for this, we suppose that the information operator $A_{i}\colon\mathcal{X}\rightarrow\mathcal{A}_{i}$ can be decomposed into components $A_{i,j}\colon\mathcal{X}\rightarrow\mathcal{A}_{i,j}$ such that $A_{i}=(A_{i,1},\dots,A_{i,m(i)})$ and $\mathcal{A}_{i}=\mathcal{A}_{i,1}\times\dots\times\mathcal{A}_{i,m(i)}$ . Thus, each component $A_{i,j}$ can be thought of as the QoI output by one of the parents of the method $M_{i}$ .

Without loss of generality we designate the $n$ th QoI $Q_{n}$ to be the principal QoI. That is, the purpose of the computational pipeline is to estimate $Q_{n}$ . The case of multiple principal QoI is a simple extension not described herein. Nodes with no immediate children are called terminal nodes, while nodes with no immediate parents are called source nodes. We denote by $A$ the set of all source nodes.

Definition 5.1 (Pipeline).

A pipeline $P$ is a directed acyclic graph defined as follows:

•

Nodes are of two kinds: Information nodes are depicted by $\square$ , and method nodes are depicted by $\blacksquare$ .

•

The graph is bipartite, so that edges connect a method node to an information node or vice-versa. That is, edges are of the form $\square\rightarrow\blacksquare$ or $\blacksquare\rightarrow\square$ .

•

There are $n$ method nodes, each with a unique label in $\{1,\dots,n\}$ .

•

The method node labelled $i$ has $m(i)$ parents and one child. Its in-edges are assigned a unique label in $\{1,\dots,m(i)\}$ .

•

There is a unique terminal node and it is the child of method node $n$ . This represents the principal QoI $Q_{n}$ .

Example 5.2 (Distributed Integration).

Recall the numerical integration problem of Example 3.4 and, as a thought experiment, consider partitioning the domain of integration in order to distribute computation:

[TABLE]

To keep presentation simple we consider an integral over $[0,1]$ with $2m+1$ equidistant knots $t_{i}=i/(2m)$ , $i=0,\dots,2m$ . Let $M_{1}$ be a Bayesian PNM for estimating $Q_{1}(x)=$ (a) and $M_{2}$ be a Bayesian PNM for estimating $Q_{2}(x)=$ (b).

In terms of our notational convention, we divide the information operator into four components; $A_{i,j}$ , for $i,j\in\left\{1,2\right\}$ . $A_{1,1}$ and $A_{2,2}$ contain the information unique to $M_{1}$ and $M_{2}$ . Specifically

[TABLE]

$A_{1,2}$ and $A_{2,1}$ contain the information that is shared between the two methods; that is $A_{1,2}=A_{2,1}=\left\{x(t_{m})\right\}$ . To complete the specification we need a third PNM for estimation of $Q_{3}(x)=$ (c) which we denote $M_{3}$ and which combines the outputs of $M_{1}$ and $M_{2}$ by simply adding them together. Formally this has information operator $A_{3}(x)=(A_{3,1}(x),A_{3,2}(x))$ where $A_{3,1}(x)=$ (a) and $A_{3,2}(x)=$ (b). Its belief update operator is given by:

[TABLE]

An intuitive graphical representation of this set-up is shown in Figure 1. The pipeline $P$ itself, which is identical to Figure 1 but with additional node and edge labels, is shown in Figure 2.
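The structural requirements of Definition 5.1 can be checked mechanically. The following is a minimal sketch for the pipeline of Example 5.2, not a definitive implementation. The node names (`A11`, `A12`, `A22`, `Q1`, `Q2`, `Q3`), the dictionary encoding and the function `check_pipeline` are assumptions made for illustration; in particular, the in-edge labels are taken to be list positions, and the graph is bipartite by construction because edges are only ever stored between a method node (an integer) and an information node (a string).

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical encoding of the pipeline of Example 5.2. Information nodes are
# strings and method nodes are integers; the node names are illustrative only.
IN_EDGES = {            # method i -> ordered list of its m(i) information parents
    1: ["A11", "A12"],  # "A12" is the shared node x(t_m), which also plays the role of A21
    2: ["A12", "A22"],
    3: ["Q1", "Q2"],
}
OUT_EDGE = {1: "Q1", 2: "Q2", 3: "Q3"}  # each method node has exactly one child


def check_pipeline(in_edges, out_edge):
    """Check the requirements of Definition 5.1 and return the source nodes."""
    n = len(in_edges)
    if set(in_edges) != set(range(1, n + 1)) or set(out_edge) != set(in_edges):
        raise ValueError("method nodes must carry the labels 1..n, each with one child")
    has_child = {a for ps in in_edges.values() for a in ps}
    info_nodes = set(out_edge.values()) | has_child
    if any(not isinstance(a, str) for a in info_nodes):
        raise ValueError("edges must join a method node to an information node")
    # Predecessor map over all nodes, used to test acyclicity.
    preds = {i: set(ps) for i, ps in in_edges.items()}
    for i, child in out_edge.items():
        preds.setdefault(child, set()).add(i)
    try:
        tuple(TopologicalSorter(preds).static_order())
    except CycleError as err:
        raise ValueError("the pipeline contains a cycle") from err
    # The only information node without children must be the child of method n.
    if info_nodes - has_child != {out_edge[n]}:
        raise ValueError("the unique terminal node must be the child of method n")
    # Source nodes are the information nodes that no method outputs.
    return sorted(info_nodes - set(out_edge.values()))


sources = check_pipeline(IN_EDGES, OUT_EDGE)
print(sources)
```

Under this encoding the three source nodes are recovered as the information nodes that are not the output of any method, in agreement with the description of the set $A$ of source nodes.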

In general, the method node labelled $i$ is taken to represent the method $M_{i}$ . The in-edge to this node labelled $j$ is taken to represent the information provided by the relationship $A_{i,j}(x)=a_{i,j}$ . Here $a_{i,j}$ can either be deterministic information provided to the pipeline, or statistical information derived from the output of another PNM. To make this formal and to “match the input-output spaces” we next define what it means for the collection of methods $M_{i}$ to be compatible with the pipeline $P$ . Informally, this describes the conditions that must be satisfied for method nodes in a pipeline to be able to connect to each other.

Definition 5.3 (Compatible).

The collection $(M_{1},\dots,M_{n})$ of PNMs is compatible with the pipeline $P$ if the following two requirements are satisfied:

(i)

(Method nodes which share an information node must have consistent information spaces and information operators.) For a motif

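[Motif: a single information node $\square$ whose out-edges enter method node $i$ as its in-edge $i^{\prime}$ and method node $j$ as its in-edge $j^{\prime}$ .]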

we have that $A_{i,i^{\prime}}=A_{j,j^{\prime}}$ and $\mathcal{A}_{i,i^{\prime}}=\mathcal{A}_{j,j^{\prime}}$ .

(ii)

(The space $\mathcal{Q}_{i}$ for the output of a previous method must be consistent with the information space of the next method.) For a motif

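[Motif: method node $i$ , its child information node $\square$ , and the out-edge of that node entering method node $j$ as its in-edge $j^{\prime}$ .]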

we have that $\mathcal{Q}_{i}=\mathcal{A}_{j,j^{\prime}}$ .

Note that we do not require the converse of (i) at this stage; that is, the same information can be represented by more than one node in the pipeline. This permits redundancy in the pipeline, in that information is not recycled. It will transpire that pipelines with such redundancy are non-Bayesian.

The role of the pipeline $P$ is to specify the order in which information, either deterministic or statistical, is propagated through the collection of PNMs. This is illustrated next:

Example 5.4 (Propagation of Information).

For the pipeline in Figure 2, the propagation of information proceeds as follows:

1. The source nodes, representing $A(x)=\{A_{1,1}(x),A_{1,2}(x)=A_{2,1}(x),A_{2,2}(x)\}$ are evaluated as $\{a_{1,1},a_{1,2}=a_{2,1},a_{2,2}\}$ . This represents all the information on $x$ that is provided.

2. The distributions

[TABLE]

are computed.

3. The push-forward distribution

[TABLE]

is computed.

Here $\mu^{(1)}\times\mu^{(2)}$ is defined on the Cartesian product $\Sigma_{\mathcal{A}_{3,1}}\times\Sigma_{\mathcal{A}_{3,2}}$ with independent components $\mu^{(1)}$ and $\mu^{(2)}$ . The notation $(B_{3})_{\#}$ refers to the push-forward of the function $B_{3}(\mu,\cdot)$ over its second argument. The distribution $\mu^{(3)}$ is the output of the pipeline and is a distribution over the principal QoI $Q_{3}(x)$ .

The procedure in Example 5.4 can be formalised, but to keep the presentation and notation succinct, we leave this implicit:

Definition 5.5 (Computation).

For a collection $(M_{1},\dots,M_{n})$ of PNMs that are compatible with a pipeline $P$ , the computation $P(M_{1},\dots,M_{n})$ is defined as the PNM with information operator $A$ and belief update operator $B$ that takes $\mu$ and $A(x)=a$ as input and returns the distribution $\mu^{(n)}$ as its output $B(\mu,a)$ , obtained through the procedure outlined in Example 5.4.

That is, the computation $P(M_{1},\dots,M_{n})$ is a PNM for the principal QoI $Q_{n}$ . Note that this definition includes a classical numerical work-flow just as a PNM encompasses a standard numerical method.

5.2 Bayesian Computational Pipelines

Noting that $P(M_{1},\dots,M_{n})$ is itself a PNM, there is a natural definition for when such a computation can be called Bayesian:

Definition 5.6 (Bayesian Computation).

Denote by $(A,B)$ the information and belief operators associated with the computation $P(M_{1},\dots,M_{n})$ and let $\{\mu^{a}\}_{a\in\mathcal{A}}$ be a disintegration of $\mu$ with respect to the information operator $A$ . The computation $P(M_{1},\dots,M_{n})$ is said to be Bayesian for the QoI $Q_{n}$ if

[TABLE]

This is clearly an appealing property; the output of a Bayesian computation can be interpreted as a posterior distribution over the QoI $Q_{n}(x)$ given the prior $\mu$ and the information $A(x)$ . Or, more informally, the “pipeline is lossless with information”. However, at face value it seems difficult to verify whether a given computation $P(M_{1},\dots,M_{n})$ is Bayesian, since it depends on both the individual PNMs $M_{i}$ and the pipeline $P$ that combines them. Our next aim is to establish verifiable sufficient conditions, for which we require another definition:

Definition 5.7 (Dependence Graph).

The dependence graph of a pipeline $P$ is the directed acyclic graph $G(P)$ obtained by taking the pipeline $P$ , removing the method nodes and replacing all $\square\rightarrow\blacksquare\rightarrow\square$ motifs with direct edges $\square\rightarrow\square$ .

The dependence graph for Example 5.2 is shown in Figure 3.
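The construction in Definition 5.7 amounts to contracting every method node. The sketch below illustrates this for Example 5.2 under an assumed encoding: the node names and the function `dependence_graph` are hypothetical and are not taken from the paper. It also produces a topological ordering with the source nodes listed first, in line with the indexing convention for the $Y_{j}$ used below.

```python
from graphlib import TopologicalSorter

# Hypothetical encoding of the Example 5.2 pipeline (node names are illustrative).
IN_EDGES = {1: ["A11", "A12"], 2: ["A12", "A22"], 3: ["Q1", "Q2"]}
OUT_EDGE = {1: "Q1", 2: "Q2", 3: "Q3"}


def dependence_graph(in_edges, out_edge):
    """Map each information node to its parent set in G(P)."""
    parents = {a: set() for ps in in_edges.values() for a in ps}
    for i, child in out_edge.items():
        # Every motif  parent -> method i -> child  becomes  parent -> child.
        parents.setdefault(child, set()).update(in_edges[i])
    return parents


parents = dependence_graph(IN_EDGES, OUT_EDGE)

# static_order() lists every node after all of its parents. The stable sort then
# moves the source nodes (empty parent set) to the front without breaking this.
order = list(TopologicalSorter(parents).static_order())
order.sort(key=lambda node: bool(parents[node]))

for j, node in enumerate(order, start=1):
    print(f"Y_{j} = {node}, parents = {sorted(parents[node])}")
```

The parent sets returned here play the role of $\pi(j)$ in Definition 5.8, once the nodes are relabelled by their position in `order`.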

For a computation $P(M_{1},\dots,M_{n})$ , each of the $J$ distinct nodes in $G(P)$ can be associated with a random variable $Y_{j}$ where either $Y_{j}=A_{k,l}(X)$ for some $k,l$ , when the node is a source, or otherwise $Y_{j}=Q_{k}(X)$ , for some $k$ . Randomness here is understood to be due to $X\sim\mu$ , so that the distribution of the $\{Y_{j}\}_{j=1}^{J}$ is a function of $\mu$ . The convention used here is that the $Y_{j}$ are indexed according to a topological ordering on $G(P)$ , which has the properties that (i) the source nodes correspond to indices $1,\dots,I$ , and (ii) the final random variable is $Y_{J}=Q_{n}(X)$ .

Definition 5.8 (Coherence).

Consider a computation $P(M_{1},\dots,M_{n})$ . Denote by $\pi(j)\subseteq\{1,\dots,j-1\}$ the parent set of node $j$ in the dependence graph $G(P)$ . Then we say that $\mu\in\mathcal{P}_{\mathcal{X}}$ is coherent for the computation $P(M_{1},\dots,M_{n})$ if the implied joint distribution of the random variables $Y_{1},\dots,Y_{J}$ satisfies:

[TABLE]

for all $j=I+1,\dots,J$ .

Note that this is weaker than the Markov condition for directed acyclic graphs (see Lauritzen, 1991), since we do not insist that the variables represented by the source nodes are independent. It is emphasised that, for a given $\mu\in\mathcal{P}_{\mathcal{X}}$ , the coherence condition can in general be checked and verified.

The following result provides sufficient and verifiable conditions which ensure that a computation composed of individual Bayesian PNMs is a Bayesian computation:

Theorem 5.9.

Let $M_{1},\dots,M_{n}$ be Bayesian PNMs and let $\mu\in\mathcal{P}_{\mathcal{X}}$ be coherent for the computation $P(M_{1},\dots,M_{n})$ . Then it holds that the computation $P(M_{1},\dots,M_{n})$ is Bayesian for the QoI $Q_{n}$ .

Conversely, if non-Bayesian PNMs are combined then the computation $P(M_{1},\dots,M_{n})$ need not be Bayesian in general.

Example 5.10 (Example 5.2, continued).

The random variables $Y_{i}$ in this example are:

[TABLE]

From $G(P)$ in Figure 3, the coherence condition in Definition 5.8 requires that the non-trivial conditional independences $Y_{4}\mathbin{\perp\!\!\!\perp}Y_{3}\;|\;\{Y_{1},Y_{2}\}$ and $Y_{5}\mathbin{\perp\!\!\!\perp}Y_{1}\;|\;\{Y_{2},Y_{3}\}$ hold. Thus the distribution $\mu$ is coherent for the computation $P(M_{1},M_{2},M_{3})$ if and only if, for $X\sim\mu$ , the associated information variables satisfy $\int_{0}^{0.5}X(t)\mathrm{d}t\mathbin{\perp\!\!\!\perp}\{X(t_{i})\}_{i=m+1}^{2m}|\{X(t_{i})\}_{i=1}^{m}$ and $\int_{0.5}^{1}X(t)\mathrm{d}t\mathbin{\perp\!\!\!\perp}\{X(t_{i})\}_{i=1}^{m-1}|\{X(t_{i})\}_{i=m}^{2m}$ .

The distribution $\mu$ induced by the Wiener process on $x$ in Example 3.4 satisfies these conditions. Indeed, under $\mu$ the stochastic process $\{x(t):t>t_{m}\}$ is conditionally independent of its history $\{x(t):t<t_{m}\}$ given the current state $x(t_{m})$ . Thus for this choice of $\mu$ , from Theorem 5.9 we have that $P(M_{1},M_{2},M_{3})$ is Bayesian and parallel computation of $(a)$ and $(b)$ in Eq. (5.1) can be justified from a Bayesian statistical standpoint.

However, for the alternative of belief distributions induced by the Wiener process on $\partial^{s}x$ , this condition is not satisfied and the computation $P(M_{1},M_{2},M_{3})$ is not Bayesian. To turn this into a Bayesian procedure for these alternative belief distributions it would be required that $A_{1,2}(x)$ provides information about the derivatives $\partial^{k}x(t_{m})$ for all orders $k\leq s$ .
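For centred Gaussian priors the first coherence condition can be checked numerically, since for jointly Gaussian variables conditional independence is equivalent to a vanishing conditional cross-covariance. The sketch below does this for $s=0$ and $s=1$ . It rests on several assumptions that are not fixed by the text: the Wiener process is standard and started at zero (so that $x(t_{0})=0$ carries no information), the case $s=1$ is taken to be the once-integrated Wiener process with zero initial value and derivative, $m=4$ , and the function names are hypothetical. The closed-form covariances with the integral were obtained by integrating the kernels over $[0,0.5]$ .

```python
import numpy as np

m = 4
h = 1.0 / (2 * m)
t = np.arange(1, 2 * m + 1) * h     # knots t_1, ..., t_{2m}; x(t_0) = x(0) = 0 is known
left, right = t[:m], t[m:]          # {t_1, ..., t_m} and {t_{m+1}, ..., t_{2m}}
a = 0.5                             # upper limit of the integral (a) in Eq. (5.1)


def k_wiener(s, u):
    """Cov(X(s), X(u)) for a standard Wiener process X."""
    return np.minimum(s, u)


def c_wiener(s):
    """Cov(int_0^a X(t) dt, X(s)) for a standard Wiener process X."""
    return np.where(s >= a, a**2 / 2, a * s - s**2 / 2)


def k_integrated(s, u):
    """Cov(X(s), X(u)) when X is a once-integrated Wiener process."""
    lo, hi = np.minimum(s, u), np.maximum(s, u)
    return lo**2 * (3 * hi - lo) / 6


def c_integrated(s):
    """Cov(int_0^a X(t) dt, X(s)) when X is a once-integrated Wiener process."""
    below = s**4 / 24 + s**2 * a**2 / 4 - s**3 * a / 6
    above = s * a**3 / 6 - a**4 / 24
    return np.where(s >= a, above, below)


def conditional_cross_cov(k, c):
    """Cov(int_0^a X dt, X(right) | X(left)) for a centred Gaussian prior."""
    K_ll = k(left[:, None], left[None, :])
    K_lr = k(left[:, None], right[None, :])
    return c(right) - c(left) @ np.linalg.solve(K_ll, K_lr)


print(conditional_cross_cov(k_wiener, c_wiener))          # zero up to rounding
print(conditional_cross_cov(k_integrated, c_integrated))  # not zero
```

The first output vanishes up to rounding, as the Markov property predicts, while the second does not: point evaluations at $t_{1},\dots,t_{m}$ alone do not screen off the right half of the domain once the prior is placed on a derivative of $x$ .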

5.3 Monte Carlo Methods for Probabilistic Computation

The most direct approach to access $\mu^{(n)}$ is to sample from each Bayesian PNM and treat the output samples as inputs to subsequent PNM. This is sometimes known as ancestral sampling in the Bayesian network literature (e.g. Paige and Wood, 2015), and is illustrated in the following example:

Example 5.11 (Ancestral Sampling for PNM).

For Example 5.2, ancestral sampling proceeds as follows:

1. Draw initial samples

[TABLE]

2. Draw a final sample

[TABLE]

Then $q_{3}$ is a draw from $\mu^{(3)}$ .
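As a concrete illustration, the two steps can be sketched for the distributed integration problem under a standard Wiener prior started at zero, for which the outputs of $M_{1}$ and $M_{2}$ are Gaussian and can be sampled exactly. This is a sketch under stated assumptions rather than the procedure used in the experiments: the integrand values `x`, the choice $m=4$ , the sample size and all function names are hypothetical. The last lines compare the pipeline output with the result of conditioning on all of the information at once.

```python
import numpy as np


def cov_int_point(a, b, s):
    """Cov(int_a^b X(t) dt, X(s)) under a standard Wiener prior on X."""
    s = np.asarray(s, dtype=float)
    inside = (s**2 - a**2) / 2 + s * (b - s)
    return np.where(s <= a, s * (b - a), np.where(s >= b, (b**2 - a**2) / 2, inside))


def var_int(a, b):
    """Var(int_a^b X(t) dt) under a standard Wiener prior on X."""
    return (b**3 - a**3) / 3 - a**2 * (b - a)


def integral_posterior(a, b, knots, values):
    """Mean and variance of int_a^b X(t) dt given X(knots) = values."""
    K = np.minimum.outer(knots, knots)
    c = cov_int_point(a, b, knots)
    w = np.linalg.solve(K, c)
    return w @ values, max(var_int(a, b) - w @ c, 0.0)


rng = np.random.default_rng(0)
m = 4
t = np.arange(1, 2 * m + 1) / (2 * m)   # knots t_1, ..., t_{2m}; x(0) = 0 under the prior
x = np.sin(3 * t)                       # hypothetical integrand evaluations

# Step 1: sample the parents. M_1 sees x(t_1..t_m) and M_2 sees x(t_m..t_{2m}).
mean1, var1 = integral_posterior(0.0, 0.5, t[:m], x[:m])
mean2, var2 = integral_posterior(0.5, 1.0, t[m - 1:], x[m - 1:])
q1 = rng.normal(mean1, np.sqrt(var1), size=10_000)
q2 = rng.normal(mean2, np.sqrt(var2), size=10_000)

# Step 2: push each pair of samples through M_3, which adds its two inputs.
q3 = q1 + q2

# Reference: condition on all of the information at once.
mean_all, var_all = integral_posterior(0.0, 1.0, t, x)
print(q3.mean(), mean_all)
print(q3.var(), var_all)
```

For this coherent prior the sample moments of `q3` agree, up to Monte Carlo error, with the posterior obtained from the full information, which is the behaviour guaranteed by Theorem 5.9.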

Ancestral sampling requires that PNM outputs can be sampled. Such sampling methods were discussed in Section 4.3. For a more general approach, sequential Monte Carlo methods can be used to propagate a collection of particles through the pipeline $P$ , similar to work on SMC for general graphical models (Briers et al., 2005; Ihler and McAllester, 2009; Lienart et al., 2015; Lindsten et al., 2017; Paige and Wood, 2015).

6 Numerical Experiments

In this final section of the paper we present three numerical experiments. The first is a linear PDE, the second is a nonlinear ODE and the third is an application to a problem in industrial process monitoring, described by a pipeline of PNM. In each case we experiment with non-Gaussian belief distributions and, in doing so, go beyond previous work.

6.1 Poisson Equation

Our first illustration is an instance of the Poisson equation, a linear PDE with mixed Dirichlet-Neumann boundary conditions:

[TABLE]

A model solution to this system, generated with a finite-element method on a fine mesh, is shown in Figure 4.

As the spatial domain for this problem is two-dimensional, the basis used for specification of the belief distribution is more complex. Here tensor products of orthogonal polynomials have been used: $\phi_{i}(t)=C_{j}(2t_{1}-1)C_{k}(2t_{2}-1)$ , $j+k\leq N_{C}$ . The polynomials $C_{j}$ were chosen to be normalised Chebyshev polynomials of the first kind. Prior specification then follows the formulation given in Section 2.6, where the remaining parameters were chosen to be $x_{0}\equiv 1$ , and $\gamma_{i}=\alpha(i+1)^{-2}$ . The random variables $\xi$ were taken to be either Gaussian or Cauchy and the polynomial basis was truncated to $N=45$ terms, corresponding to a maximum polynomial degree of $N_{C}=8$ . For both priors the parameter $\alpha$ was set to $\alpha=1$ . Note that closed-form expressions are available for analysis under the Gaussian prior (Cockayne et al., 2016) but, to simplify interpretation of empirical results, were not exploited. Mathematical background on Cauchy priors can be found in Sullivan (2016).
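A draw from a prior of this form might be generated as in the following sketch. Several details are assumptions made for illustration and are not specified above: the ordering of the multi-indices $(j,k)$ by total degree, the index $i$ starting from zero, the use of unnormalised Chebyshev polynomials, and the names `pairs` and `sample_prior`. The count of multi-indices with $j+k\leq 8$ confirms the truncation level $N=45$ .

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

N_C = 8
alpha = 1.0
# Multi-indices (j, k) with j + k <= N_C, ordered by total degree (an assumed ordering).
pairs = [(d - k, k) for d in range(N_C + 1) for k in range(d + 1)]
gamma = alpha * (np.arange(len(pairs)) + 1.0) ** -2


def sample_prior(t1, t2, rng, heavy_tailed=False):
    """Draw x = x_0 + sum_i gamma_i xi_i phi_i at points (t1, t2) in [0, 1]^2."""
    if heavy_tailed:
        xi = rng.standard_cauchy(len(pairs))
    else:
        xi = rng.standard_normal(len(pairs))
    coef = np.zeros((N_C + 1, N_C + 1))
    for (j, k), u in zip(pairs, gamma * xi):
        coef[j, k] = u
    # Unnormalised Chebyshev polynomials are used here for simplicity; x_0 = 1.
    return 1.0 + cheb.chebval2d(2 * t1 - 1, 2 * t2 - 1, coef)


rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 5)
T1, T2 = np.meshgrid(grid, grid)
draw = sample_prior(T1, T2, rng, heavy_tailed=True)
print(len(pairs), draw.shape)
```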

The information operator was defined by a set of locations $t_{i}\in[0,1]^{2}$ , $i=0,\dots,N_{t}$ , where either the interior condition or one of the boundary conditions was enforced. Denote by $\left\{t^{I,i}\right\}$ the set of interior points, $\left\{t^{D,j}\right\}$ the set of Dirichlet boundary points and $\left\{t^{N,k}\right\}$ the set of Neumann boundary points, where $i=1,\dots,N_{I}$ , $j=1,\dots,N_{D}$ and $k=1,\dots,N_{N}$ , with $n=N_{I}+N_{D}+N_{N}$ . Then, the information operator is given by the concatenation of the conditions defined above:

[TABLE]

The Bayesian PNM output was approximated by numerical disintegration and sampled with a Monte Carlo method whose description is reserved for the Electronic Supplement. In Figure 5 the mean and pointwise standard-deviations of the posterior distributions are plotted for Gaussian and Cauchy priors with $n=16$ . There is little qualitative difference between the posterior distributions for the Gaussian and Cauchy priors. The mean functions match closely to the mean function from the model solution, as given in Figure 4. The posterior variance is lowest near to the Dirichlet boundaries where the solution is known, and peaks where the Neumann condition is imposed. This is to be expected, as evaluations of the Neumann boundary condition provide less information about the solution itself.

Next, the posterior distribution of the spectrum $\{u_{i}\}$ was investigated. In Figure 6 the posterior distribution over these coefficients is plotted and it is seen that the correlation structure between coefficients is non-trivial, cf. the joint distribution between $u_{0}$ and $u_{3}$ .

Last, in Figure 7 convergence of the posterior distribution is plotted as the number of design points is varied, for $n=16,25,36$ . In each case a Gaussian prior was used. As expected, the standard deviation in the posterior distribution is seen to decrease as the number of design points is increased. At $n=36$ , the shape of the region of highest uncertainty changes markedly, with the most uncertain region lying between the Dirichlet boundary and the first evaluation points on the Neumann boundary. This is likely due to the fact that the number of evaluation points is approaching the size of the polynomial basis; when the number of points equals the size of the basis the system is completely determined for a linear model. Thus, we need $N\gg n$ in order for discretisation error to be quantified.

6.2 The Painlevé ODE

In this section a Bayesian PNM is developed to solve a nonlinear ODE based on Painlevé’s first transcendental

[TABLE]

To permit computation, the right-boundary condition was relaxed by truncating the domain to $[0,10]$ and using the modified condition $x(10)=\sqrt{10}$ .

Two distinct solutions are known, illustrated in Figure 8 (left). These model solutions were obtained using the deflation technique described in Farrell et al. (2015). The spectrum plot in Figure 8 (right) represents the coefficients $\{u_{i}\}$ obtained when each solution is represented over a basis of normalised Chebyshev polynomials. As those polynomials are orthonormal with respect to the $L_{2}$ -inner-product, the slower decay for the negative solution compared to the positive solution is equivalent to the negative solution having a larger $L_{2}$ -norm. This explains the preference that optimisation-based numerical solvers have for returning the positive solution in general, and also explains some of the results now presented.

Such systems for which multiple solutions exist have been studied before in the context of PNM, both in Chkrebtii et al. (2016) and in Cockayne et al. (2016). It was noted in both papers that existence of multiple solutions can present a substantial challenge to classical numerical methods.

To build a Bayesian PNM, a prior $\mu$ for this problem was defined by using a series expansion as in Eq. (2.6). The basis functions were $\phi_{i}(t)=C_{i}(\frac{1}{2}(t-5))$ where the $C_{i}$ were normalised Chebyshev polynomials of the first kind. Both Gaussian and Cauchy priors were considered by taking $u_{i}:=\gamma_{i}\xi_{i}$ , where $\xi_{i}$ were taken to be either standard Gaussian or standard Cauchy and in each case $x_{0}(t)\equiv 0$ . In accordance with the exponential convergence rate for spectral methods when the solution to the system is a smooth function, the sequence of scale parameters was set to $\gamma_{i}=\alpha\beta^{-i}$ , where $\alpha=8$ and $\beta=1.5$ . These values were chosen by inspection of the true spectra (obtained with Matlab’s “chebfun” package) to ensure that both solutions were in the support of the prior.

The information operator $A$ was defined by the choice of locations $\left\{t_{j}\right\}$ , $j=1,\dots,m$ , which determine the locations at which the posterior will be constrained. Analysis for several values of $m$ was performed. In each case $t_{1}=0$ , $t_{m}=10$ and the remaining $t_{j}$ were equally spaced on $[0,10]$ . To be explicit, the information operator was

[TABLE]

with the last two elements enforcing the boundary conditions. Thus our information was $a=[-t_{1},\dots,-t_{m},0,\sqrt{10}]$ , which is $n=m+2$ dimensional.
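To make the structure of the information concrete, the sketch below evaluates an operator of this type on a truncated Chebyshev series. It is an illustration under assumptions: the interior condition is taken to be the residual $x^{\prime\prime}(t_{j})-x(t_{j})^{2}$ , a form chosen because it is consistent with the data $a$ stated above; the basis is built on the standard affine map of $[0,10]$ onto $[-1,1]$ with unnormalised polynomials, which may differ from the scaling used in the experiments; and the name `information_operator` is hypothetical.

```python
import numpy as np
from numpy.polynomial import Chebyshev


def information_operator(coef, t):
    """Evaluate A(x) for x given by Chebyshev coefficients on [0, 10]."""
    x = Chebyshev(coef, domain=[0.0, 10.0])
    residual = x.deriv(2)(t) - x(t) ** 2      # assumed form of the interior condition
    return np.concatenate([residual, [x(0.0), x(10.0)]])


m = 15
t = np.linspace(0.0, 10.0, m)                   # t_1 = 0, t_m = 10, equally spaced
a = np.concatenate([-t, [0.0, np.sqrt(10.0)]])  # the information a, of dimension n = m + 2

# x(t) = t has coefficients (5, 5) in this basis; it does not satisfy A(x) = a.
misfit = information_operator(np.array([5.0, 5.0]), t) - a
print(a.shape, np.abs(misfit).max())
```

A coefficient vector is consistent with the information only if the misfit vanishes; numerical disintegration relaxes this to a neighbourhood whose size is governed by the bandwidth $\delta$ .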

The Bayesian PNM output $B(\mu,a)$ was approximated via numerical disintegration with the first $N=40$ terms of the series representation used. This was sampled with Monte Carlo methods, the details of which are reserved for the Electronic Supplement.

Results for a selection of bandwidths $\delta$ , with $n=17$ , are shown in Figure 9. Note that a strong preference for the positive solution is expressed at the smallest $\delta$ , with mass around both solutions at larger $\delta$ . For the Gaussian prior, some mass remained around the negative solution at the smallest $\delta$ , while this was not so for the Cauchy prior. This reflects the fact that, for a collection of independent univariate Cauchy random variables, one element is likely to be significantly larger in magnitude than the others, which favours faster decay for the remaining elements.

Using the calculation described in Section S4.4, model evidence was computed for both the Gaussian and the Cauchy prior at $n=15$ . The Bayes factor for the Cauchy, compared to the Gaussian prior, was found to be $20.26$ , which constitutes strong evidence in favour of a Cauchy prior for this problem at the given level of discretisation.

In Figure 10 the posterior distributions for the first six coefficients $u_{i}$ at $n=17$ and $\delta=1$ are plotted. Strong multimodality is clear, as well as skewed correlation structure between the coefficients. Illustration of such posteriors for smaller $\delta$ is difficult as the posteriors become extremely peaked.

Figure 11 displays convergence of the posterior distributions as $n$ is increased. Of particular interest is that for $n=12$ , the posterior distribution based on a Gaussian prior becomes trimodal. For each prior, the posterior mass settles on the positive solution to the system at $n=22$ . This is in accordance with the fact that this solution has smaller $L_{2}$ -norm. This perhaps reflects the fact that, while in the limiting case both solutions should have an equal likelihood, the curvature of the likelihood at each mode may differ. Prior truncation may also be influential; in Figure 12 the log-likelihood of the negative solution increases at a slower rate than that of the positive solution. Thus, while in the setting of an infinite prior series neither solution should be preferred, in practice truncation might bias one solution over the other. Lastly, it is clear that the parameters $\alpha$ and $\beta$ may also have a significant effect on which solution is preferred. Further theoretical work will be required to understand many of the phenomena that we have just described.

Of particular interest is how a preference for the negative solution could be encoded into a PNM. Owing to the flexible specification of the information operator, there is considerable choice in this matter. An elegant approach is the introduction of additional, inequality-based information

[TABLE]

Such information can be difficult to incorporate in standard numerical algorithms, but is of interest in many physical problems (Kinderlehrer and Stampacchia, 2000). For Bayesian PNM we can extend the information operator to include $1[x^{\prime}(0)\leq 0]$ . Posterior distributions for the Gaussian prior at $n=17$ are shown in Figure 13. Note that posterior mass has settled close to the negative solution. This highlights the simplicity with which Bayesian PNMs can encode a preference for a particular solution when a multiplicity of solutions exist.

6.3 Application to Industrial Process Monitoring

This final application illustrates how statistical models for discretisation error can be propagated through a pipeline of computation to model how these errors are accumulated.

Hydrocyclones are machines used to separate solid particles from a liquid in which they are suspended, or two liquids of different densities, using centrifugal forces. High pressure fluid is injected into the top of a tank to create a vortex. The induced centrifugal force causes denser material to move to the wall of the tank while lighter material concentrates in the centre, where it can be extracted. They have widespread applications, including in areas such as environmental engineering and the petrochemical industry (Sripriya et al., 2007). An illustration of the operation is given in Figure 14.

To ensure the materials are well-separated the hydrocyclone must be monitored to allow adjustment of the input flow-rate. This is also important for safe operation, owing to the high pressures involved (Bradley, 2013). However, direct monitoring is impossible owing to the opaque walls of the equipment and the high interior pressure. For this purpose electrical impedance tomography (EIT) has been proposed to allow monitoring of the contents (Gutierrez et al., 2000).

EIT is a technique which allows recovery of an interior conductivity field based upon measurements of voltage obtained from applying a stimulating current on the boundary. It is suited to this problem, as the two materials in the hydrocyclone will generally be of different conductivities. In its simplified form due to Calderón (1980), EIT is described by a linear partial differential equation similar to that in Section 6.1, but with modified boundary conditions to incorporate the stimulating currents and measured voltages:

[TABLE]

where $D$ denotes the domain, modelling the hydrocyclone tank, $e$ indexes the stimulating electrodes, $t_{e}\in\partial D$ are the corresponding locations of the electrodes on $\partial D$ , $a$ is the unknown conductivity field to be determined and $\frac{\partial}{\partial n}$ denotes the derivative with respect to the outward pointing normal vector. The electrode $t^{1}$ is referred to as the reference electrode. The vector $c=(c_{1},\dots,c_{N_{e}})$ denotes the stimulation current pattern. Several stimulation patterns were considered, denoted $c^{j}$ , $j=1,\dots,N_{j}$ .

The experimental data described in West et al. (2005) were considered. In the experiment, a cylindrical perspex tank was used with a single ring of eight electrodes. Translation invariance in the vertical direction means that the contents are effectively a single 2D region and electrical conductivity can be modelled as a 2D field. At the start of the experiment, a mixing impeller was used to create a rotational flow. This was then removed and, after a few seconds, concentrated potassium chloride solution was carefully injected into the tap water initially filling the tank. Data, denoted $y_{\tau}$ , were collected at regular time intervals by application of several stimulation patterns $c^{1},\dots,c^{M}$ .

To formulate the statistical problem, consider parameterising the conductivity field as $a(\tau,t)$ , where $\tau\in[0,T]$ is a temporal index while $t\in D$ is the spatial coordinate and $D$ is the circular domain representing the perspex tank in the experiment. A log-Gaussian prior was placed over the conductivity field so that $\log a$ is a Gaussian process with separable covariance function $k_{a}((\tau,t),(\tau^{\prime},t^{\prime})):=\lambda\min(\tau,\tau^{\prime})\exp\left(-\frac{\left\|t-t^{\prime}\right\|^{2}}{2\ell^{2}}\right)$ where $\ell$ is a length-scale parameter representing the anticipated spatial variation of the conductivity field and $\lambda$ is a parameter controlling the amplitude of the field. Here $\ell$ was fixed to $\ell=0.3$ , while $\lambda=10^{-3}$ . The problem of estimating $a$ based on data can be well-posed in the Bayesian framework (Dunlop and Stuart, 2016). Full details of this experiment can be found in the accompanying report Oates et al. (2017).
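The separable covariance function $k_{a}$ can be assembled and sampled as in the following sketch. The spatial design (twenty points on two rings inside a unit disc), the eight unit-spaced time points, the jitter added before factorisation and the function name `k_a` are all assumptions made for illustration; only the form of the kernel and the values of $\lambda$ and $\ell$ are taken from the text.

```python
import numpy as np

lam, ell = 1e-3, 0.3


def k_a(tau, t, tau2, t2):
    """Separable covariance of log a: Brownian in time, squared-exponential in space."""
    sq_dist = np.sum((t - t2) ** 2, axis=-1)
    return lam * np.minimum(tau, tau2) * np.exp(-sq_dist / (2 * ell**2))


# Hypothetical design: 8 time points and 20 spatial points inside the unit disc.
taus = np.arange(1.0, 9.0)
theta = np.linspace(0.0, 2 * np.pi, 10, endpoint=False)
ring = np.column_stack([np.cos(theta), np.sin(theta)])
pts = np.concatenate([0.4 * ring, 0.8 * ring])

# Stack all (tau, t) pairs and build the Gram matrix by broadcasting.
tau_all = np.repeat(taus, len(pts))
t_all = np.tile(pts, (len(taus), 1))
K = k_a(tau_all[:, None], t_all[:, None, :], tau_all[None, :], t_all[None, :, :])

# One draw of the conductivity field a = exp(log a); jitter stabilises the factorisation.
rng = np.random.default_rng(2)
L = np.linalg.cholesky(K + 1e-10 * np.eye(len(K)))
a_draw = np.exp(L @ rng.standard_normal(len(K))).reshape(len(taus), len(pts))
print(K.shape, a_draw.shape)
```

The factor $\min(\tau,\tau^{\prime})$ makes the prior variance of $\log a$ grow linearly in $\tau$ , so the field is allowed to drift further from its initial state as the experiment proceeds.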

Our aim is to use a PNM to account for the effect of discretisation on inferences that are made on the conductivity field. For fixed $\tau$ , a Gaussian prior was posited for $x$ , with covariance $k_{x}(t,t^{\prime}):=\exp\left(-\frac{\left\|t-t^{\prime}\right\|^{2}}{2\ell_{x}^{2}}\right)$ where $\ell_{x}$ was fixed to $\ell_{x}=0.3$ . The associated Bayesian PNM, a probabilistic meshless method (PMM), was described in Example 2.4.

The statistical inference procedure is formulated in a pipeline of computations in Figure 15. It is assumed that the desired outcome is to monitor the contents of the tank while the current contents are being mixed. This suggests a particle filter approach where a PMM $M_{\tau}$ is employed to handle the intractable likelihood $p(y_{\tau}|a_{\tau})$ that involves the exact solution of a PDE. The distribution of $a_{\tau}$ given $y_{1},\dots,y_{\tau}$ is denoted $\pi_{\tau}$ and the computation $P(M_{1},\dots,M_{\tau})$ is Bayesian only if the particle approximation error due to the use of a particle filter is overlooked.

To briefly illustrate the method, Figure 16 presents posterior means for the field $a(\tau,\cdot)$ , for each post-injection time point $\tau=1,\dots,8$ . These are based on a particle approximation of size $P=500$ , with method nodes based upon a Bayesian PNM, as in Example 2.4, with $n=119$ design points. The high conductivity region representing the potassium chloride solution can be seen rotating through the domain in the frames after injection, with its conductivity reducing as it mixes with the water. The full posterior distribution over the conductivity field is inflated as a result of explicitly modelling the discretisation error; an extensive analysis of these results will be reported in the upcoming Oates et al. (2017).

In Figure 17, the integrated standard-deviation $\int_{D}\sigma(t)\>\mathrm{d}t$ is shown for $\tau=1,\dots,8$ for both the “pipeline”, as described above, and a “static” approach in which no uncertainty was propagated. In this static approach a symmetric collocation PDE solver was used to solve the forward problem (recall that the PMM has a corresponding symmetric collocation solution to the PDE as its mean function), and a separate Bayesian inversion problem was solved at each time point. The parameters of the symmetric collocation solver were identical to those used in the PMM. In the left panel we observe some structural periodicity, present in both the pipeline and the static approach. We speculate that this may be due to the rotation of the medium causing the area of high conductivity to periodically reach an area of the domain, relative to the 8 sensors, in which it is particularly easy to recover. With this periodicity subtracted, the right panel shows a clear increase in posterior uncertainty in the pipeline compared to the static approach. Temporal regularisation would usually be expected to reduce uncertainty; thus, the fact that the overall uncertainty increased with $\tau$ , relative to the static formulation, demonstrates that we have quantified and propagated uncertainty due to successive discretisation of the PDE at each time point.

7 Discussion

This paper has established statistical foundations for PNMs and investigated the Bayesian case in detail. Through connection to Bayesian inverse problems (Stuart, 2010), we have established when Bayesian PNM can be well-defined and when the output can be considered meaningful. The presentation touched on several important issues and a brief discussion of the most salient points is now provided.

Bayesian vs Non-Bayesian PNMs

The decision to focus on Bayesian PNMs was motivated by the observation that the output of a pipeline of PNMs can only be guaranteed to admit a valid Bayesian interpretation if the constituent PNMs are each Bayesian and the prior distribution is coherent. Indeed, Theorem 5.9 demonstrated that prior coherence can be established at a local level, essentially via a local Markov condition, so that Bayesian PNMs provide an extensible modelling framework as required to solve more challenging numerical tasks. These results support a research strategy that focuses on Bayesian PNMs, so that error can be propagated in a manner that is meaningful.

On the other hand, there are pragmatic reasons why either approximations to Bayesian PNMs, or indeed, non-Bayesian PNMs might be useful. The predominant reason would be to circumvent the off-line computational costs that can be associated with Bayesian PNMs, such as the use of numerical disintegration developed in this research. Recent research efforts, such as Schober et al. (2014, 2016) and Kersting and Hennig (2016) for the solution of ODEs, have aimed for computational costs that are competitive with classical methods, at the expense of fully Bayesian estimation for the solution of the ODE. Such methods are of interest as non-Bayesian PNMs, but their role in pipelines of PNMs is unclear. Our contribution serves to make this explicit.

Computational Cost

The present research focused on the more fundamental cost of access to the information $A(x)$ , rather than the additional CPU time required to obtain the PNM output. Indeed, numerical disintegration constituted the predominant computational cost in the applications that were reported. However, we stress that in many challenging applications gated by discretisation error, such as occur with climate models, the fundamental cost of the information $A(x)$ will be dominant. Furthermore, the Monte Carlo methods that were employed for numerical disintegration admit substantial improvements (e.g. in a similar vein to Botev and Kroese, 2012; Koskela et al., 2016). The objective of this paper was to establish statistical foundations that will permit the development of more sophisticated and efficient Bayesian PNMs.

Prior Elicitation

Throughout this work we assumed that a belief distribution $\mu$ was provided. The question of whose belief is represented in $\mu$ has been discussed by several authors and a chronology is included in the Electronic Supplement. Of these perspectives we mention in particular Hennig et al. (2015), wherein $\mu$ is the belief of an agent that “we get to design”. This offers a connection to frequentist statistics, in that an agent can be designed to ensure favourable frequentist properties hold.

A robust statistics perspective is also relevant and one such approach would be to consider a generalised Bayes risk (Eq. (3.1)) wherein the state variable $X$ used for assessment is assumed to be drawn from a distribution $\tilde{\mu}\neq\mu$ . This offers an opportunity to derive Bayesian PNMs that are robust to certain forms of prior mis-specification. This direction was not considered in the present paper, but has been pursued in the ACA literature for classical numerical methods (see Chapter IV, Section 4 of Ritter, 2000).

In general, the specification of prior distributions for robust inference on an infinite-dimensional state space can be difficult. The consistency and robustness of Bayesian inference procedures — particularly with respect to perturbations of the prior such as those arising from numerical approximations — in such settings is a subtle topic, with both positive (Castillo and Nickl, 2014; Doob, 1949; Kleijn and van der Vaart, 2012; Le Cam, 1953) and negative (Diaconis and Freedman, 1986; Freedman, 1963; Owhadi et al., 2015) results depending upon fine topological and geometric details.

In the context of computational pipelines, the challenge of eliciting a coherent prior is closely connected to the challenge of eliciting a single unified prior based on the conflicting input of multiple experts (French, 2011; Albert et al., 2012).

Consistent Estimation

The present paper focused on foundations. Further methodological work will be required to establish sufficient conditions for when $B(\mu,A_{n}(x^{\dagger}))$ collapses to an atom on a single element $q^{\dagger}=Q(x^{\dagger})$ representing the data-generating QoI in the limit as the amount of information, $n$ , is increased. There are two questions here: (i) when is $q^{\dagger}$ identifiable from the given information, and (ii) at what rate does $B(\mu,A_{n}(x^{\dagger}))$ concentrate on $q^{\dagger}$ ?
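As a minimal illustration of question (ii), consider the integration example with $Q(x)=\int_{0}^{1}x(t)\,\mathrm{d}t$ and $A_{n}(x)=(x(t_{1}),\dots,x(t_{n}))$ . The sketch below assumes, purely for illustration, a Brownian motion prior (covariance $\min(s,t)$ ) and equispaced nodes $t_{i}=i/n$ ; neither choice is prescribed by the text, and the function name is hypothetical. Under these assumptions the posterior over the QoI is Gaussian, so its variance follows from standard Gaussian conditioning and the rate of concentration can be read off directly.

```python
import numpy as np

def posterior_variance_of_integral(n):
    """Posterior variance of Q(x) = int_0^1 x(t) dt given x(t_1), ..., x(t_n).

    Assumptions (not from the paper): Brownian motion prior, nodes t_i = i/n.
    """
    t = np.arange(1, n + 1) / n            # equispaced nodes in (0, 1]
    K = np.minimum.outer(t, t)             # prior covariance of the observations, k(s, t) = min(s, t)
    k_int = t - 0.5 * t**2                 # Cov(Q(x), x(t)) = int_0^1 min(s, t) ds
    prior_var = 1.0 / 3.0                  # Var(Q(x)) = int int min(s, t) ds dt
    return prior_var - k_int @ np.linalg.solve(K, k_int)

for n in (1, 2, 4, 8, 16):
    # Compare with the closed form 1 / (12 n^2) for this special case.
    print(n, posterior_variance_of_integral(n), 1.0 / (12 * n**2))
```

In this special case the variance equals $1/(12n^{2})$ , the sum of $n$ independent Brownian bridge contributions of size $h^{3}/12$ with $h=1/n$ . The calculation only describes how quickly the posterior narrows under the assumed prior; whether it narrows around the correct $q^{\dagger}$ for a given $x^{\dagger}$ is the identifiability question (i).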

Generalisation and Extensions

Two more directions are highlighted for extension of this work. First, note that in this paper the information operator $A:\mathcal{X}\rightarrow\mathcal{A}$ was treated as a deterministic object. However, in some applications there is auxiliary randomness in the acquisition of information. For our integration example, nodes $t_{i}$ might arise as random samples from a reference distribution on $[0,1]$ . Or, observations $x(t_{i})$ themselves might occur with measurement error, for example due to finite precision arithmetic. Then a more elaborate model $A\colon\mathcal{X}\times\Omega\rightarrow\mathcal{A}$ would be required, where $\Omega$ is a probability space that injects randomness into the information operator. This is the setting of, for instance, randomised quasi-Monte Carlo methods. Future work will extend the framework of PNMs to include randomised information operators of this kind.
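The distinction between the two models can be sketched in code. The snippet below is a hypothetical illustration only: the function names, the uniform reference distribution on $[0,1]$ and the Gaussian form of the measurement error are all assumptions made for the example. The random number generator plays the role of $\omega\in\Omega$ .

```python
import numpy as np

def deterministic_information(x, nodes):
    """A: X -> A, evaluating the function x at fixed nodes t_1, ..., t_n."""
    return np.array([x(t) for t in nodes])

def randomised_information(x, n, rng, noise_sd=0.0):
    """A: X x Omega -> A, with randomness entering through `rng`.

    Assumptions: nodes are uniform on [0, 1]; measurement error is additive Gaussian.
    """
    nodes = rng.uniform(0.0, 1.0, size=n)                # random nodes
    values = np.array([x(t) for t in nodes])
    values = values + noise_sd * rng.standard_normal(n)  # optional measurement error
    return nodes, values

x = np.sin
fixed = deterministic_information(x, np.linspace(0.0, 1.0, 5))
nodes, noisy = randomised_information(x, 5, np.random.default_rng(0), noise_sd=1e-3)
```

A belief update operator for the randomised case would need to condition on the realised nodes as well as the observed values, which is why the nodes are returned alongside the data.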

As a second direction, recall that in an adaptive algorithm the choice of the information is made in an iterative procedure that is informed by the information observed up to that point. For the canonical illustration in Example 3.4 and its generalisations discussed there, it can be proven that adaptive algorithms do not out-perform non-adaptive algorithms in average case error (Lee and Wasilkowski, 1986). However, outside this setting adaptation can be beneficial and should be investigated in the context of Bayesian PNM.

Connection with Probabilistic Programming

The central goal of probabilistic programming (PP) is to automate statistical computation, through symbolic representation of statistical objects and operations on those objects. The formalism of pipelines as graphical models presented in this work can be compared to similar efforts to establish PP languages (Goodman et al., 2012). For instance, a method node in a pipeline can be related to a monad aggregating several distributions into a single output distribution (Ścibior et al., 2015). An important challenge in PP is the automation of computing conditional distributions (Shan and Ramsey, 2017). Numerical disintegration and extensions thereof might be of independent interest to this field (e.g. extending Wood et al., 2014).

Acknowledgements

CJO was supported by the Australian Research Council (ARC) Centre of Excellence for Mathematical and Statistical Frontiers. TJS was supported by the Excellence Initiative of the German Research Foundation (DFG) through the Free University of Berlin. MG was supported by the Engineering and Physical Sciences Research Council (EPSRC) grants EP/J016934/1, EP/K034154/1, an EPSRC Mathematical Sciences Established Career Research Fellowship and a Lloyds Register Foundation grant for Programme on Data-Centric Engineering. This material was based upon work partially supported by the National Science Foundation under Grant DMS-1127914 to the Statistical and Applied Mathematical Sciences Institute. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

The authors are grateful to Amazon for the provision of AWS credits and to the authors of the Eigen and Eigency libraries in Python.

Appendices

Appendix A Proofs

Proof of Theorem 3.3.

The following observation will be required; the joint density of $X$ and $A=A(X)$ can be expressed in two ways:

[TABLE]

which holds almost everywhere from the definition of a disintegration $\{\mu^{a}\}_{a\in\mathcal{A}}$ . Note that our integrability assumption justifies the interchange of integrals from Fubini’s theorem.

The Bayes risk for a Bayesian PNM $M_{\text{BPNM}}=(A,B_{\text{BPNM}})$ , $B_{\text{BPNM}}(\mu,a)=Q_{\#}\mu^{a}$ , can be expressed as:

[TABLE]

On the other hand, let

[TABLE]

be a Bayes act. Then the Bayes risk associated with such a method $M_{\text{BR}}=(A,B_{\text{BR}})$ , $B_{\text{BR}}(\mu,a)=\delta(b(a))$ , can be expressed as:

[TABLE]

Next we use the inner product structure on $\mathcal{Q}$ and the form of the loss function as $L(q,q^{\prime})=\|q-q^{\prime}\|_{\mathcal{Q}}^{2}$ to argue that $R(\mu,M_{\text{BPNM}})=2R(\mu,M_{\text{BR}})$ , which in turn implies that the optimal information $A_{\mu}$ for Bayesian PNM and $A_{\mu}^{*}$ for ACA are identical.

For this final step, fix $a\in\mathcal{A}$ and denote the random variables $Q^{a}(X)=Q(X)-b(a)$ that are induced according to $X\sim\mu^{a}$ . Denote by $\tilde{Q}^{a}$ an independent copy of $Q^{a}$ generated from $\tilde{X}\sim\mu^{a}$ . The notation $\mathbb{E}$ will be used to refer to the expectation taken over $X,\tilde{X}$ . Then we have

[TABLE]

and moreover, from Theorem 3.2 the posterior mean of $Q(X)$ is $b(a)$ and thus $\mathbb{E}[Q^{a}]=\mathbb{E}[\tilde{Q}^{a}]=0$ . Then

[TABLE]

as required. ∎
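The identity used in the final step, namely that the expected squared distance between two independent draws from the same posterior is twice the expected squared distance from a single draw to the posterior mean, can be checked numerically. The sketch below uses a hypothetical discrete posterior on $\mathbb{R}^{3}$ with five atoms; the atoms and their weights are arbitrary and serve only to illustrate the factor of two for a fixed $a$ .

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 3))        # hypothetical atoms of Q_# mu^a in R^3
p = rng.dirichlet(np.ones(5))          # hypothetical posterior probabilities of the atoms

b = p @ q                              # posterior mean, the Bayes act under squared-error loss

# Risk of reporting the point estimate b: E || Q - b ||^2
risk_bayes_act = float(p @ np.sum((q - b) ** 2, axis=1))

# Risk of reporting the full posterior: E || Q - Q~ ||^2 with Q, Q~ independent draws
pairwise = np.sum((q[:, None, :] - q[None, :, :]) ** 2, axis=2)
risk_sampling = float(p @ pairwise @ p)

print(risk_sampling, 2.0 * risk_bayes_act)
```

Integrating this pointwise identity over $a$ gives the relation between the two Bayes risks stated above.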

Proof of Theorem 4.3.

Fix $f\in\mathcal{F}$ and $a\in\mathcal{A}$ . Then:

[TABLE]

Thus

[TABLE]

Now consider the random variable

[TABLE]

induced from $X\sim\mu$ . The existence of a continuous and positive density $p_{A}$ implies that $R$ also admits a density on $[0,\infty)$ , denoted $p_{R,\delta}$ . The fact that $p_{A}$ is uniform on an infinitesimal neighbourhood of $a$ implies that $p_{R,\delta}(r)$ is proportional to the surface area of a hypersphere of radius $\delta r$ centred on $a\in\mathcal{A}$ :

[TABLE]

This is valid since $\mathcal{A}$ is open and the hypersphere will be contained in $\mathcal{A}$ for $r$ sufficiently small. Eq. (A.2) can then be evaluated:

[TABLE]

Thus, for $\delta$ sufficiently small, Eq. (A.5) can be bounded above by $\delta^{\alpha}(1+\bar{C}_{\phi}^{\alpha})$ where $\bar{C}_{\phi}^{\alpha}\coloneqq C_{\phi}^{\alpha}/C_{\phi}^{0}$ and “ $1$ ” is in this case an arbitrary positive constant. This establishes the upper bound

[TABLE]

for $\delta$ sufficiently small and completes the proof. ∎

Proof of Theorem 5.9.

To reduce the notation, suppose that the random variables $Y_{1},\dots,Y_{J}$ admit a joint density $p(y_{1},\dots,y_{J})$ . However, we emphasise that existence of a density is not required for the proof to hold. To further reduce notation, denote $y_{a:b}=(y_{a},\dots,y_{b})$ .

The output of the computation $P(M_{1},\dots,M_{n})$ was defined algorithmically in Definition 5.5 and illustrated in Example 5.4. Our aim is to show that this algorithmic output coincides with the distribution $(Q_{n})_{\#}\mu^{a}$ on $\mathcal{Q}_{n}$ , which is identified in the present notation with $p(y_{J}|y_{1:I})$ .

For $j\in\{I+1,\dots,J\}$ , the coherence condition on $Y_{1},\dots,Y_{J}$ translates into the present notation as $p(y_{j}|y_{1:j-1})=p(y_{j}|y_{\pi(j)})$ . This allows us to deduce that:

[TABLE]

The right hand side is recognised as the output of the computation $P(M_{1},\dots,M_{n})$ , as defined in Definition 5.5. This completes the proof. ∎

S1 Philosophical Status of the Belief Distribution

The aim of this section is to discuss in detail the semantic status of the belief distribution $\mu$ in a probabilistic numerical method (PNM). In Section S1.1 we survey historical work on this topic, while in Section S1.2 more recent literature is covered. Then in Section S1.3 we highlight some philosophical objections and their counter-arguments.

S1.1 Historical Precedent

The use of probabilistic and statistical methods to model a deterministic mathematical object can be traced back to Poincaré [1912], who used a stochastic model to construct interpolation formulae. In brief, Poincaré formulated a polynomial

$$a_{0}+a_{1}x+\dots+a_{m}x^{m}$$

whose coefficients $a_{i}$ were modelled as independent Gaussian random variables. Thus Poincaré in effect constructed a Gaussian measure over the Hilbert space with basis $\{1,x,\dots,x^{m}\}$ . This pre-empted Kimeldorf and Wahba [1970a, b] and others, which associated spline interpolation formulae to the means of Gaussian measures over Hilbert spaces.

The first explicit statistical model for numerical error (of which we are aware) was in the literature on rounding error in the numerical solution of ordinary differential equations (ODE), as summarised in Hull and Swenson [1966]. Therein it was supposed that rounding, by which we mean representation of a real number

$$x=0.a_{1}a_{2}a_{3}\dots$$

in a truncated form

$$\hat{x}=0.a_{1}a_{2}\dots a_{n}$$

is such that the error $e=x-\hat{x}$ can be reasonably modelled by a uniform random variable on $[-5\times 10^{-(n+1)},5\times 10^{-(n+1)}]$ . This implies a distribution $\mu$ over the unknown value of $x$ given $\hat{x}$ . The contribution of Hull and Swenson [1966] and others was to replace the last digit $a_{n}$ , in each stored number that arises in the numerical solution of an ODE, with a uniformly chosen element of $\{0,\dots,9\}$ . This performs approximate propagation of the numerical uncertainty due to rounding error through further computation and, in their case, induces a distribution over the solution space of the ODE. Note that this work focused on rounding error, rather than the (time) discretisation error that is intrinsic to numerical ODE solvers; this could reflect the limited precision arithmetic that was available from the computer hardware of the period.
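The digit-randomisation idea can be sketched as follows. This is a simplified illustration rather than a reproduction of the historical procedure: it assumes a fixed-point decimal representation with $n$ places, the test problem $x^{\prime}=-x$ , $x(0)=1$ , and an explicit Euler solver, and all function names are hypothetical.

```python
import math
import random

def randomise_last_digit(x, n, rng):
    """Keep n decimal places of x, then replace the n-th digit by a uniform draw from {0, ..., 9}."""
    scaled = math.floor(abs(x) * 10**n)                 # integer holding the first n decimal digits
    scaled = scaled - scaled % 10 + rng.randrange(10)   # overwrite the last stored digit
    return math.copysign(scaled / 10**n, x)

def euler_randomised(x0, h, steps, n, rng):
    """Explicit Euler for x' = -x, randomising the last digit of every stored value."""
    x = x0
    for _ in range(steps):
        x = randomise_last_digit(x + h * (-x), n, rng)
    return x

rng = random.Random(0)
samples = [euler_randomised(1.0, 0.01, 100, 6, rng) for _ in range(200)]
print(min(samples), max(samples), math.exp(-1.0))
```

The spread of `samples` reflects only the propagated rounding uncertainty. The gap between the ensemble and $e^{-1}$ is dominated by the time discretisation error of the Euler scheme, which this construction does not model.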

Larkin [1972] was an important historical paper for PNMs, being the first to set out the modern statistical agenda for PNMs:

In any particular problem situation we are given certain specific properties of the solution, e.g. a finite number of ordinate or derivative values at fixed abscissae. If we can assume no more than this basic information we can conclude only that our required solution is a member of that class of functions which possesses the given properties - a tautology which is unlikely to appeal to an experimental scientist! Clearly, we need to be given, or to assume, extra information in order to make more definite statements about the required function.

Typically, we shall assume general properties, such as continuity or non-negativity of the solution and/or its derivatives, and use the given specific properties in order to assist in making a selection from the class $K$ of all functions possessing the assumed general properties. We shall choose $K$ either to be a Hilbert space or to be simply related to one.

This description defines a set $K$ of permissible functions, rather than an explicit distribution over $K$ , but it is clear that Larkin envisaged numerical analysis as an instance of statistical estimation:

In the present approach, an a priori localisation is achieved effectively by making an assumption about the relative likelihoods of elements of the Hilbert space of possible candidates for the solution to the original problem. Among other things, this permits, at least in principle, the derivation of joint probability density functions for functionals on the space and also allows us to evaluate confidence limits on the estimate of a required functional (in terms of given values of other functionals) without any extra information about the norm of the function in question.

Later, Diaconis [1988] re-iterated this argument for the construction of $K$ more explicitly, considering numerical integration of the function

[TABLE]

over the unit interval. In particular, Diaconis asked:

“What does it mean to ‘know’ a function?” The formula says some things (e.g. $f$ is smooth, positive and bounded by $20$ on $[0,1]$ ) but there are many other facts about $f$ that we don’t know (e.g. is $f$ monotone, unimodal or convex?)

This argument was provided as justification for belief distributions that encode certain basic features, such as the smoothness of the integrand. The belief distributions that were then considered in Diaconis’ paper were Gaussian distributions on $K$ . Diaconis, as well as Larkin [1972] and Kadane and Wasilkowski [1985], observed that some classical numerical methods are Bayes rules in this context.

The arguments of these papers are intrinsic to modern PNMs. However, the associated theoretical analysis of computation under finite information has proceeded outside of statistics, in the applied mathematical literature, where it is usually presented without a statistical context. That research is reviewed next.

S1.2 Contemporary Outlook

The mathematical foundations of computation based on finite information are established in the field of information-based complexity (IBC). The monograph of Traub et al. [1988] presents the foundations of IBC. In brief, the starting point for IBC is the mantra that

To compute fast you need to compute with partial information ( $\sim$ Houman Owhadi, SIAM UQ 2016)

This motivates the search for optimal approximations based on finite information, in either the worst-case or average-case sense of optimal. The particular development of PNMs that we presented in the main text is somewhat aligned to average-case analysis (ACA) and we focus on that literature in what follows.

Among the earliest work on ACA, Sul'din [1959, 1960] studied numerical integration and $L_{2}$ function approximation in the setting where $\mu$ was induced from the Wiener process, with a focus on optimal linear methods. Later, Sacks and Ylvisaker [1970] moved from analysis with fixed $\mu$ to analysis over a class of $\mu$ defined by the smoothness properties of their covariance kernels. At the same time Kimeldorf and Wahba [1970a, b] established optimality properties of splines in reproducing kernel Hilbert spaces in the ACA context. Kadane and Wasilkowski [1985] and Diaconis [1988] discussed the connection between ACA and Bayesian statistics. A general framework for ACA was formalised in the IBC monograph of Traub et al. [1988], while Ritter [2000] provides a more recent account.

Game theoretic arguments have recently been explored in Owhadi [2015], who argued that the optimal prior for probabilistic meshless methods [Cockayne et al., 2016] is a particular Gaussian measure under a game theoretic framework where the energy norm is the loss function. This provides one route to the specification of default or objective priors for PNMs which deserves further exploration in general.

The question of “whose” belief is captured in $\mu$ was addressed in Hennig et al. [2015], where it was argued that the prior information in $\mu$ represents that of a hypothetical agent (numerical analyst) which

[ $\dots$ ] we are allowed to design ( $\sim$ Michael Osborne, personal correspondence, 2016).

This represents a more pragmatic approach to the design of PNM.

S1.3 Paradise Lost?

Typical numerical algorithms contain several different sources of discretisation error. Consider the solution of the wave equation: A standard finite element method involves both spatial and temporal discretisations, a series of numerical quadrature problems, as well as the use of finite precision arithmetic for all numerical calculations. Yet, decades of numerical analysis have led to highly optimised computer codes such that these methods can be routinely used. To develop PNM for solution of the wave equation, which accounts for each separate source of discretisation error, is it required to unpick and reconstruct such established numerical algorithms? This would be an unattractive prospect that would detract from further research into PNMs.

Our view is that there is a choice for which discretisation errors to model. In practice the PNMs implemented in this work were run on floating point precision machines, yet we did not model rounding error in their output. This was because, in our examples, floating point error is insignificant compared to discretisation error and so we chose not to model it. This is in line with the view that a model is a useful simplification of the real world.

S2 Existence of Non-Randomised Bayes Rule

In this section we recall an argument for the general existence of non-randomised Bayes rules, that was stated without proof in the main text. Sufficient conditions for Fubini’s theorem to hold are assumed.

Proposition S2.1.

Let $\mathfrak{B}(A)$ be non-empty. Then $\mathfrak{B}(A)$ contains a classical numerical method of the form $B(\mu,a)=\delta\circ b(a)$ where $b(a)$ is a Bayes act for each $a\in\mathcal{A}$ .

Proof.

Let $\mathfrak{C}$ be the set of belief update operators of the classical form $B(\mu,a)=\delta\circ b(a)$ . Suppose there exists a belief update operator $B^{*}\in\mathfrak{B}(A)\setminus\mathfrak{C}$ . Then $B^{*}$ can be characterised as a non-atomic distribution $\pi$ over the elements of $\mathfrak{C}$ . Its risk can be computed as:

[TABLE]

If we had $R(\mu,(A,B^{*}))<R(\mu,(A,\delta\circ b))$ for all $\delta\circ b\in\mathfrak{C}$ we would have a contradiction, so it follows that $\mathfrak{B}(A)\cap\mathfrak{C}$ is non-empty. This completes the proof. ∎

S3 Optimal Information: A Counterexample

In this section we demonstrate that the optimal information $A_{\mu}$ for Bayesian PNM and the optimal information $A_{\mu}^{*}$ from average case analysis are different in general.

Let $\mathcal{X}=\{\spadesuit,\diamondsuit,\heartsuit,\clubsuit\}$ be a discrete set, with quantity of interest $Q(x)=1[x=\spadesuit]$ and information operator $A(x)=1[x\in S]$ so that $\mathcal{Q}=\mathcal{A}=\left\{0,1\right\}$ . In particular, $\mathcal{Q}$ is not a vector space and hence not an inner product space as specified in Theorem 3.3.

Consider two possible choices, $S=\{\spadesuit,\diamondsuit\}$ and $S=\{\spadesuit,\diamondsuit,\heartsuit\}$ . Assume a uniform prior over $\mathcal{X}$ . Consider the 0-1 loss function $L(q,q^{\prime})=1[q\neq q^{\prime}]$ . It will be shown that ACA optimal information for this example can be based on either $S=\{\spadesuit,\diamondsuit\}$ or $S=\{\spadesuit,\diamondsuit,\heartsuit\}$ whereas PNM optimal information must be based on $S=\{\spadesuit,\diamondsuit\}$ . Thus Bayesian PNM optimal information $A_{\mu}$ and ACA optimal information $A_{\mu}^{*}$ need not coincide in general.

The classical case considers a method of the form $M_{\text{BR}}=(A,B_{\text{BR}})$ , $B_{\text{BR}}=\delta\circ b$ , where

[TABLE]

for some $c_{0},c_{1}\in\{0,1\}$ . The Bayes risk is

[TABLE]

Case of $S=\{\spadesuit,\diamondsuit\}$ :

We have

[TABLE]

which is minimised by $c_{1}\in\{0,1\}$ and $c_{0}=0$ to obtain a minimum Bayes risk of $\frac{1}{4}$ .

Case of $S=\{\spadesuit,\diamondsuit,\heartsuit\}$ :

We have

[TABLE]

which is minimised by $c_{0}=0$ and $c_{1}=0$ to again obtain a minimum Bayes risk of $\frac{1}{4}$ . Thus the ACA optimal information can be based on either $S=\{\spadesuit,\diamondsuit\}$ or $S=\{\spadesuit,\diamondsuit,\heartsuit\}$ .

On the other hand, for the Bayesian PNM we have that $M_{\text{BPNM}}=(A,B_{\text{BPNM}})$ , $B_{\text{BPNM}}=Q_{\#}\mu^{A}$ and

[TABLE]

Case of $S=\{\spadesuit,\diamondsuit\}$ :

We have

[TABLE]

Case of $S=\{\spadesuit,\diamondsuit,\heartsuit\}$ :

We have

[TABLE]

Thus the PNM optimal information is $S=\{\spadesuit,\diamondsuit\}$ and not $S=\{\spadesuit,\diamondsuit,\heartsuit\}$ . Hence, PNM and ACA optimal information differ in general.
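The four Bayes risks in this counterexample can be verified by direct enumeration. The sketch below is an independent check written for this purpose: the string labels for the suits and the function names are hypothetical, and exact rational arithmetic is used so that the values can be compared without rounding. For the Bayesian PNM the loss is evaluated against an independent draw from the posterior over the QoI, so for a binary QoI with posterior probability $p$ the conditional risk is $2p(1-p)$ .

```python
from fractions import Fraction
from itertools import product

X = ["spade", "diamond", "heart", "club"]
prior = {x: Fraction(1, 4) for x in X}           # uniform prior over the state space

def Q(x):
    return int(x == "spade")                      # quantity of interest 1[x = spade]

def aca_risk(S):
    """Minimum Bayes risk over point estimates b(0) = c0, b(1) = c1 under 0-1 loss."""
    risks = []
    for c0, c1 in product((0, 1), repeat=2):
        risk = sum(prior[x] * int(Q(x) != (c1 if x in S else c0)) for x in X)
        risks.append(Fraction(risk))
    return min(risks)

def bpnm_risk(S):
    """Bayes risk when the output is the posterior Q_# mu^a, under 0-1 loss."""
    risk = Fraction(0)
    for a in (0, 1):
        cell = [x for x in X if int(x in S) == a]
        p_a = sum(prior[x] for x in cell)
        if p_a == 0:
            continue
        p1 = sum(prior[x] for x in cell if Q(x) == 1) / p_a   # posterior P(Q = 1 | A = a)
        risk += p_a * 2 * p1 * (1 - p1)                       # P(Q != Q~) for independent draws
    return risk

S2 = {"spade", "diamond"}
S3 = {"spade", "diamond", "heart"}
print(aca_risk(S2), aca_risk(S3))     # prints: 1/4 1/4
print(bpnm_risk(S2), bpnm_risk(S3))   # prints: 1/4 1/3
```

The ACA risk is indifferent between the two choices of $S$ , whereas the Bayesian PNM risk is strictly smaller for $S=\{\spadesuit,\diamondsuit\}$ , in agreement with the conclusion above.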

S4 Monte Carlo Methods for Numerical Disintegration

In this section, Monte Carlo methods for sampling from the distribution $\mu_{\delta}^{a}$ (or $\mu_{\delta,N}^{a}$ ; the $N$ subscript will be suppressed to reduce notation in the sequel) are considered. The Monte Carlo approximation of $\mu_{\delta}^{a}$ is, in effect, a problem in rare event simulation as most of the mass of $\mu_{\delta}^{a}$ will be confined to a set $S$ such that $\mu(S)$ is small. Rare events pose some difficulties for classical Monte Carlo, as an enormous number of draws can be required to study the rare event of interest.

In the literature there are two major solutions proposed. Importance sampling [Robert and Casella, 2013] samples from a modified process, under which the event of interest is more likely, then re-weights these samples to compensate for the adjustment. Conversely, in splitting [Botev and Kroese, 2012] trajectories of the process are constructed in a genetic fashion, by retaining and duplicating those which approach the events of interest and discarding others. Splitting is closely related to SMC [Cérou et al., 2012] and Feynman–Kac models [Del Moral, 2004].

The SMC approach, which is closely related to splitting, is described in the following section, while in Section S4.3 a parallel tempering (PT) algorithm is described. In spirit these approaches are similar in that both employ tempering to ease sampling from the relaxed posterior distribution for a small value of $\delta$ . The SMC method employs a particle approximation to accomplish this, while the PT algorithm uses coupled Markov chains.

S4.1 Sequential Monte Carlo Algorithms for Numerical Disintegration

Let $\left\{\delta_{i}\right\}_{i=0}^{m}$ be such that $\delta_{0}=\infty$ , $\delta_{m}=\delta$ and $\delta_{i}>\delta_{i+1}>0$ for all $i\leq m-1$ . Furthermore let $\left\{K_{i}\right\}_{i=1}^{m}$ be some set of Markov transition kernels that leave $\mu^{a}_{\delta_{i}}$ invariant, for which $K_{i}(\cdot,S)$ is measurable for all $S\in\Sigma_{\mathcal{X}}$ and $K_{i}(x,\cdot)$ is an element of $\mathcal{P}_{\mathcal{X}}$ for all $x\in\mathcal{X}$ . Then our SMC for numerical disintegration (SMC-ND) algorithm, based on $P$ particles, is given in Algorithm 1. Here we have used $\text{Discrete}(\{x_{j}\}_{j=1}^{P};\{w_{j}\}_{j=1}^{P})$ to denote the discrete distribution which puts mass proportional to $w_{j}$ on the state $x_{j}\in\mathcal{X}$ .

The output of the SMC-ND algorithm is an empirical approximation (the bandwidth parameter $\delta$ should not be confused with the use of $\delta$ to denote an atomic distribution)

[TABLE]

to $\mu^{a}_{\delta_{m}}$ based on a population of $P$ particles $\{x_{j}^{m}\}_{j=1}^{P}$ . There is substantial room to extend and improve the SMC-ND algorithm based on the wide body of literature available on this subject [e.g. Doucet et al., 2001, Del Moral et al., 2006, Beskos et al., 2017, Ellam et al., 2016], but we defer all such improvements for future work. Our aim in the remainder is to establish the approximation properties of the SMC-ND output. This will be based on theoretical results in Del Moral et al. [2006].
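A minimal sketch of a tempered particle scheme of this kind is given below. It is an illustration under stated assumptions rather than a transcription of Algorithm 1: the prior is taken to be $\mu=N(0,I)$ on $\mathbb{R}^{2}$ , the information operator is linear with scalar output, the relaxation function is $\phi(r)=\exp(-r^{2}/2)$ , the kernels $K_{i}$ are preconditioned Crank–Nicolson Metropolis kernels, and the steps are applied in the common re-weight, re-sample, move order. All function and parameter names are hypothetical.

```python
import numpy as np

def log_phi(r):
    # Assumed relaxation function phi(r) = exp(-r^2 / 2): continuous, decreasing, positive on R_+.
    return -0.5 * r**2

def smc_nd(a, A, deltas, P, n_mcmc=5, beta=0.3, seed=0):
    """Particle approximation of mu_delta^a for the final delta in `deltas`.

    Assumptions: mu = N(0, I); A(x) = <A, x> is scalar; `deltas` is strictly decreasing.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((P, A.shape[0]))       # delta_0 = inf, so the initial particles follow mu
    prev = np.inf
    for delta in deltas:
        # Re-weight with the ratio phi(r / delta_{i+1}) / phi(r / delta_i).
        r = np.abs(x @ A - a)
        log_w = log_phi(r / delta) - log_phi(r / prev)
        w = np.exp(log_w - log_w.max())
        # Re-sample from the weighted discrete distribution.
        x = x[rng.choice(P, size=P, p=w / w.sum())]
        # Move with a Metropolis kernel that leaves mu_delta^a invariant.
        for _ in range(n_mcmc):
            prop = np.sqrt(1.0 - beta**2) * x + beta * rng.standard_normal(x.shape)
            log_acc = log_phi(np.abs(prop @ A - a) / delta) - log_phi(np.abs(x @ A - a) / delta)
            accept = np.log(rng.uniform(size=P)) < log_acc
            x[accept] = prop[accept]
        prev = delta
    return x

particles = smc_nd(a=1.0, A=np.ones(2), deltas=[2.0, 1.0, 0.5, 0.25, 0.1], P=2000)
print(particles.sum(axis=1).mean(), particles.sum(axis=1).std())
```

In this Gaussian toy problem the relaxed posterior is available in closed form, so the output can be compared against it: the particles should concentrate near the hyperplane $x_{1}+x_{2}=1$ , with spread in $x_{1}+x_{2}$ of order $\delta$ .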

Assumption S4.1.

$\phi>0$ on $\mathbb{R}_{+}$ .

Assumption S4.2.

For all $i=0,\dots,m-1$ and all $x,y\in\mathcal{X}$ , it holds that $K_{i+1}(x,\cdot)\ll K_{i+1}(y,\cdot)$ . Furthermore there exist constants $\epsilon_{i}>0$ such that the Radon–Nikodým derivative

[TABLE]

Assumption S4.1 ensures that Algorithm 1 is well-defined, else it can happen that all particles are assigned zero weight and re-sampling will fail. However, the result that we obtain in Theorem S4.3 below can also be established in the special case of an indicator function $\phi(r)=1[r<1]$ . The details for this variation of the results are also included in the sequel.

The interpretation of Assumption S4.2 is that, for fixed $i$ , transition kernels do not allocate arbitrarily large or small amounts of mass to different areas of the state space, as a function of their first argument. This poses a constraint on the choice of Markov kernels for the SMC-ND algorithm.

Theorem S4.3.

For all $\delta\in\{\delta_{i}\}_{i=0}^{m}$ and fixed $p\geq 1$ it holds that

[TABLE]

for some constant $C_{p}$ independent of $P$ but dependent on $\{\delta_{i}\}_{i=0}^{m}$ , $p$ and $\{\epsilon_{i}\}_{i=0}^{m-1}$ .

The proof of Theorem S4.3 is presented next. Note that the established bound is independent of $\delta\in\{\delta_{i}\}_{i=0}^{m}$ ; this is therefore a uniform convergence result. The assumptions and the conclusion of Theorem S4.3 can be weakened in several directions, as discussed in detail in [Del Moral et al., 2006]. Development of SMC methods in the context of high-dimensional and infinite-dimensional state spaces has also been considered in Beskos et al. [2014, 2015].

S4.2 Proof of Theorem S4.3

In this section we establish the uniform convergence of the SMC-ND algorithm as claimed in Theorem S4.3. This relies on a powerful technical result from Del Moral [2004], whose context is now established.

S4.2.1 Feynman–Kac Models

Let $(E_{i},\mathcal{E}_{i})$ for $i=0,\dots,m$ be a collection of measurable spaces. Let $\eta_{0}$ be a measure on $E_{0}$ and let $\Gamma_{i}$ index a collection of Markov transition kernels from $E_{i-1}$ to $E_{i}$ . Let $G_{i}\colon E_{i}\rightarrow(0,1]$ be a collection of functions, which are referred to as potentials. The triplets $(\eta_{0},G_{i},\Gamma_{i})$ are associated with Feynman–Kac measures $\eta_{i}$ on $E_{i}$ defined as, for bounded and measurable functions $f_{i}$ on $E_{i}$ ;

[TABLE]

where the expectation is taken with respect to the Markov process $X^{i}$ defined by $X^{0}\sim\eta_{0}$ and $X^{i}|X^{i-1}\sim\Gamma_{i}(X^{i-1},\cdot)$ .

The Feynman–Kac measures can be associated with a (non-unique) McKean interpretation of the form $\eta_{i+1}=\eta_{i}\Lambda_{i+1,\eta_{i}}$ where the $\Lambda_{i+1,\eta}$ are a collection of Markov transitions for which the following compatibility condition holds:

[TABLE]

Then the $\eta_{i}$ can be interpreted as the $i$ th step marginal distribution of the non-homogeneous Markov chain defined by $X^{0}\sim\eta_{0}$ and $X^{i+1}|X^{i}\sim\Lambda_{i+1,\eta_{i}}(X^{i},\cdot)$ . The corresponding $P$ -particle model is defined on $E_{i}^{P}=E_{i}\times\dots\times E_{i}$ and has

[TABLE]

where $\eta_{i}^{P}=\frac{1}{P}\sum_{j=1}^{P}\delta(X_{j}^{i})$ is an empirical (random) measure on $E_{i}$ . The SMC-ND algorithm can be cast as an instance of such a $P$ -particle model, as is made clear later.

The result that we require from Del Moral [2004] is given next. Denote by $\text{Osc}_{1}(E_{i})$ the set of measurable functions $f_{i}$ on $E_{i}$ for which $\sup\{|f_{i}(x^{i})-f_{i}(y^{i})|\;:\;x^{i},y^{i}\in E_{i}\}\leq 1$ .

Theorem (Theorem 7.4.4 in Del Moral [2004]).

Suppose that:

$(G)$

There exist $\epsilon_{i}^{G}\in(0,1]$ such that $G_{i}(x^{i})\geq\epsilon_{i}^{G}G_{i}(y^{i})>0$ for all $x^{i},y^{i}\in E_{i}$ .

$(M_{1})$

There exist $\epsilon_{i}^{\Gamma}\in(0,1)$ such that $\Gamma_{i+1}(x^{i},\cdot)\geq\epsilon_{i}^{\Gamma}\Gamma_{i+1}(y^{i},\cdot)$ for all $x^{i},y^{i}\in E_{i}$ .

Then for $p\geq 1$ and any valid McKean interpretation $\Lambda_{i,\eta}$ , the associated $P$ -particle model $\eta_{i}^{P}$ satisfies the uniform (in $i$ ) bound

[TABLE]

for some constant $C_{p}$ independent of $P$ but dependent on $\{\epsilon_{i}^{G}\}_{i=0}^{m}$ and $\{\epsilon_{i}^{\Gamma}\}_{i=0}^{m-1}$ .

The actual statement in Del Moral [2004] contains a more general version of $(M_{1})$ and a more explicit decomposition of the constant $C_{p}$ ; however the simpler version presented here is sufficient for the purposes of the present paper.

S4.2.2 Case A: Positive Function $\phi(r)>0$

First we prove Theorem S4.3 as it is stated. Later the assumption of $\phi>0$ will be relaxed.

SMC-ND as a Feynman–Kac Model

The aim here is to demonstrate that the SMC-ND algorithm fits into the framework of Section S4.2.1 for a specific McKean interpretation. This connection will then be used to establish uniform convergence for the SMC-ND algorithm as a consequence of Theorem 7.4.4 in Del Moral [2004].

For the state spaces we associate each $E_{i}=\mathcal{X}$ and $\mathcal{E}_{i}=\Sigma_{\mathcal{X}}$ . For the potentials we associate

[TABLE]

which clearly does not vanish and takes values in $(0,1]$ since $\delta_{i}>\delta_{i+1}$ and $\phi$ is decreasing. For the Markov transitions we associate $\Gamma_{i+1}$ with $K_{i+1}$ .

The Feynman–Kac measures associated with the SMC-ND algorithm can be cast as a non-homogeneous Markov chain with transitions $\Lambda_{i+1,\eta}$ . Here $\Lambda_{i+1,\eta_{i}}$ acts on the current measure $\eta_{i}$ on $\mathcal{X}$ by first propagating as $\eta_{i}K_{i+1}$ and then “warping” this measure with the potential $G_{i}$ ; i.e.

[TABLE]

This demonstrates that the SMC-ND algorithm is the $P$ -particle model corresponding to the McKean interpretation $\Lambda_{i+1,\eta}$ of the Feynman–Kac triplet $(\eta_{0},G_{i},\Gamma_{i})$ . Thus the SMC-ND algorithm can be studied in the context of Section S4.2.1, which we report next.

Note that it is common in applications of SMC to perform the “Re-sample” step before the “Move” step - our choice of order was required for the McKean framework that is the basis of the theoretical results in Del Moral et al. [2006]. It is known in the SMC “folk lore” that the order of these steps can be interchanged.

Proof of Uniform Convergence Result for SMC-ND

It remains to verify the hypotheses of Theorem 7.4.4 in Del Moral [2004]. Condition $(G)$ is satisfied if and only if

[TABLE]

is bounded below, since

[TABLE]

is bounded above by 1. Since $\phi$ is continuous, decreasing and satisfies $\phi>0$ (Assumption S4.1), it suffices to show that its argument $\frac{1}{\delta_{i+1}}\|A(x)-a\|_{\mathcal{A}}$ is upper-bounded. This is the content of Assumption 4.6 in the main text, which shows that

[TABLE]

Condition $(M_{1})$ requires that

[TABLE]

for all $x^{i},y^{i}\in E_{i}$ and $S\in\mathcal{E}_{i+1}$ . From construction this is equivalent to

[TABLE]

for all $x^{i},y^{i}\in\mathcal{X}$ and $S\in\Sigma_{\mathcal{X}}$ . This is the content of Assumption S4.2.

Thus we have established the hypotheses of Theorem 7.4.4 in Del Moral [2004] for the SMC-ND algorithm. Theorem S4.3 is a re-statement of this result. For the statement of the result we used the $\|f\|_{\mathcal{F}}$ norm, based on the fact that (from Assumption 4.7) $\|f_{i}\|_{\text{Osc}(E_{i})}\leq 2\|f\|_{\infty}\leq 2C_{\mathcal{F}}\|f\|_{\mathcal{F}}$ .

S4.2.3 Case B: Indicator Function $\phi(r)=1[r<1]$

The previous analysis required that $\phi>0$ on $\mathbb{R}_{+}$ . However, the most basic choice for $\phi$ is the indicator function $\phi(r)=1[r<1]$ which can take the value 0. The case of an indicator function demands special attention, since Algorithm 1 can fail in this case if all particles are assigned zero weight. If this occurs, then we just define $\mu_{\delta,P}^{a}(f)=0$ . To be specific, the SMC-ND algorithm associated to the indicator function $\phi$ for approximation of the integral $\mu_{\delta}^{a}(f)$ is stated as Algorithm 2 next.
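The failure convention described above can be sketched as follows. As before this is an illustration under assumptions and not a transcription of Algorithm 2: the prior is $\mu=N(0,I)$ , the information operator is linear with scalar output, the move step uses a preconditioned Crank–Nicolson proposal that is accepted only if it stays inside the current set, and all names are hypothetical.

```python
import numpy as np

def smc_nd_indicator(f, a, A, deltas, P, n_mcmc=5, beta=0.3, seed=0):
    """Estimate mu_delta^a(f) with phi(r) = 1[r < 1]; returns 0.0 if every particle is killed.

    Assumptions: mu = N(0, I); A(x) = <A, x> is scalar; `deltas` is strictly decreasing.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((P, A.shape[0]))           # initial particles drawn from mu
    for delta in deltas:
        alive = np.abs(x @ A - a) < delta              # indicator weights in {0, 1}
        if not alive.any():
            return 0.0                                 # failure: define the estimate to be zero
        # Re-sample uniformly among the surviving particles.
        x = x[rng.choice(np.flatnonzero(alive), size=P)]
        # Move within {x : |A(x) - a| < delta}, leaving the restricted prior invariant.
        for _ in range(n_mcmc):
            prop = np.sqrt(1.0 - beta**2) * x + beta * rng.standard_normal(x.shape)
            inside = np.abs(prop @ A - a) < delta
            x[inside] = prop[inside]
    return float(np.mean(f(x)))

A = np.ones(2)
estimate = smc_nd_indicator(lambda x: x @ A, 1.0, A, [2.0, 1.0, 0.5, 0.25, 0.1], P=2000)
failed = smc_nd_indicator(lambda x: x @ A, 100.0, A, [1.0, 0.1], P=100)
print(estimate, failed)
```

The second call illustrates the failure mode: the first tolerance is too small relative to the distance between the prior mass and the datum, so no particle survives and the estimate is set to zero. A more gradual schedule $\{\delta_{i}\}$ reduces the probability of this event.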

Let $\mathcal{X}_{\delta}^{a}=\{x\in\mathcal{X}:\|A(x)-a\|_{\mathcal{A}}<\delta\}$ . If there is some iteration $i$ at which, after applying the kernel $K_{i}$ to each particle, no particle lies within $\mathcal{X}_{\delta_{i}}^{a}$ , the algorithm fails. As a result it is critical to ensure that the distance between successive $\delta_{i}$ is small so that the probability of failure is controlled. This requirement is made formal next. To establish the approximation properties of the random measure $\mu^{a}_{\delta_{m},P}$ , two assumptions are required. These are intended to replace Assumptions S4.1, S4.2 and Assumption 4.6 from the main text:

Assumption S4.4.

For all $i=0,\dots,m-1$ and all $x^{i}\in\mathcal{X}^{a}_{\delta_{i}}$ , it holds that $K_{i+1}(x^{i},\mathcal{X}^{a}_{\delta_{i+1}})>0$ .

Assumption S4.5.

For all $i=0,\dots,m-1$ and all $x^{i},y^{i}\in\mathcal{X}^{a}_{\delta_{i}}$ , $K_{i+1}(x^{i},\cdot)\ll K_{i+1}(y^{i},\cdot)$ . Furthermore there exist constants $\epsilon_{i}>0$ such that the Radon–Nikodým derivative

[TABLE]

Assumption S4.4 requires that the probability of reaching $\mathcal{X}^{a}_{\delta_{i+1}}$ when starting in $\mathcal{X}^{a}_{\delta_{i}}$ and applying the transition kernel $K_{i+1}$ , is bounded away from zero. Assumption S4.5 ensures that, for fixed $i$ , transition kernels do not allocate arbitrarily large or small amounts of mass to different areas of the state space, as a function of their first argument.

Theorem S4.6.

For the alternative situation of an indicator function, it holds that for all $\delta\in\{\delta_{i}\}_{i=0}^{m}$ and fixed $p\geq 1$ ,

[TABLE]

for some constant $C_{p}$ independent of $P$ but dependent on $p$ and $\{\epsilon_{i}\}_{i=0}^{m-1}$ .

Cérou et al. [2012] proposed an algorithm similar to the one herein but focussed on approximation of the probability of a rare event rather than sampling from the rare event itself. In particular the theoretical results provided are in terms of these probabilities rather than how well the measure restricted to the rare event is approximated. Furthermore, many of the results therein focused upon an idealised version of the problem, in which it was assumed that the intermediate restricted measures can be sampled directly; this avoids the issues with vanishing potentials indicated in Del Moral [2004]. A similar algorithm was discussed in Ścibior et al. [2015] but was not shown to be theoretically sound.

The remainder of this Section establishes Theorem S4.6.

SMC-ND as a Feynman–Kac Model

The aim here is to demonstrate that Algorithm 2 fits into the framework of Section S4.2.1 for a specific McKean interpretation. This is analogous to the proof of Theorem S4.3.

A technical complication is that the potentials $G_{i}$ must take values in $(0,1]$ , which precludes the “obvious” choice of $E_{i}=\mathcal{X}$ and $G_{i}(x^{i})$ as indicator functions for the sets $\mathcal{X}_{\delta_{i}}^{a}$ . Instead, we associate $E_{i}=\mathcal{X}_{\delta_{i}}^{a}$ and $\mathcal{E}_{i}$ with the corresponding restriction of $\Sigma_{\mathcal{X}}$ . For the potentials we then take $G_{i}(x^{i})=1$ for all $x^{i}\in E_{i}$ , which clearly does not vanish and takes values in $(0,1]$ . For the Markov transitions $\Gamma_{i+1}$ from $E_{i}$ to $E_{i+1}$ we consider

[TABLE]

which is the restriction of $K_{i+1}$ to $E_{i+1}$ . For the latter to be well-defined it is required that the normalisation constant

$K_{i+1}(x^{i},E_{i+1})>0$

for all $x^{i}\in E_{i}$ , so that there is a positive probability of reaching $E_{i+1}$ from $E_{i}$ . This is the content of Assumption S4.4.

The Feynman–Kac measures associated with Algorithm 2 can be cast as a non-homogeneous Markov chain with transitions $\Lambda_{i+1,\eta}$ . Here $\Lambda_{i+1,\eta_{i}}$ acts on the current measure $\eta_{i}$ on $E_{i}$ by first propagating as $\eta_{i}K_{i+1}$ and then restricting this measure to $E_{i+1}$ . This procedure is seen to be identical to the Markov transition $\Gamma_{i+1}$ defined above and, since the potentials $G_{i}\equiv 1$ , it follows that

[TABLE]

This demonstrates that Algorithm 2 is the $P$ -particle model corresponding to the McKean interpretation $\Lambda_{i+1,\eta}$ of the Feynman–Kac triplet $(\eta_{0},G_{i},\Gamma_{i})$ . Thus the SMC-ND algorithm can be studied in the context of Section S4.2.1, which we report next.

Proof of Uniform Convergence Result for SMC-ND

It remains to verify the hypotheses of Theorem 7.4.4 in Del Moral [2004]. Condition $(G)$ is satisfied with no further assumption, since $G_{i}\equiv 1$ and we can take $\epsilon_{i}^{G}=1$ . Condition $(M_{1})$ requires that

[TABLE]

for all $x^{i},y^{i}\in E_{i}$ and $S\in\mathcal{E}_{i+1}$ . From construction this is equivalent to

[TABLE]

for all $x^{i},y^{i}\in E_{i}$ and $S\in\mathcal{E}_{i+1}$ . This is the content of Assumption S4.5.

Thus we have established the hypotheses of Theorem 7.4.4 in Del Moral [2004] for Algorithm 2 and in doing so have established Theorem S4.6.

S4.3 Parallel Tempering for Numerical Disintegration

Let $K_{i}$ , $\left\{\delta_{i}\right\}_{i=1}^{m}$ be as in Section S4.1. The PT algorithm [Geyer, 1991] for sampling from $\mu_{\delta_{m}}^{a}$ runs $m$ Markov chains in parallel, one for each temperature, by alternately applying $K_{i}$ , then randomly proposing to “swap” the current state of two of the chains. Commonly only swaps of adjacent chains are considered; to this end suppose at iteration $j$ an index $q\in\{0,\dots,m-1\}$ has been selected. Denote by $x^{q}$ the state of the chain with $\mu^{a}_{\delta_{q}}$ as its invariant measure. Then to ensure the correct invariant distribution of all chains is maintained, the swap of state $x^{q}$ and $x^{q+1}$ is accepted with probability

[TABLE]

where $\pi_{q}$ denotes the density of the target distribution $\mu^{a}_{\delta_{q}}$ with respect to a suitable reference measure. The density notation can be justified since in our experiments the sampler was applied to the finite-dimensional distributions $\mu^{a}_{\delta_{q},N}$ and so the reference measure can be taken to be the Lebesgue measure on $\mathbb{R}^{N}$ .
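For concreteness, the swap step can be sketched as below. The acceptance rule used is the standard parallel tempering ratio $\min\{1,\pi_{q}(x^{q+1})\pi_{q+1}(x^{q})/(\pi_{q}(x^{q})\pi_{q+1}(x^{q+1}))\}$ , which is assumed here rather than copied from the display above. The one-dimensional densities (prior $N(0,1)$ , $A(x)=x$ , $\phi(r)=\exp(-r^{2})$ ) and the function names are likewise hypothetical.

```python
import math
import random


def swap_log_ratio(log_pi_q, log_pi_q1, x_q, x_q1):
    """Log of the standard PT swap ratio for adjacent chains q and q + 1."""
    swapped = log_pi_q(x_q1) + log_pi_q1(x_q)
    current = log_pi_q(x_q) + log_pi_q1(x_q1)
    return swapped - current


def swap_accept_prob(log_pi_q, log_pi_q1, x_q, x_q1):
    """Acceptance probability min(1, ratio), computed on the log scale."""
    return math.exp(min(0.0, swap_log_ratio(log_pi_q, log_pi_q1, x_q, x_q1)))


def relaxed_log_density(delta, a):
    """Unnormalised log density of a toy relaxed posterior (assumed model)."""
    return lambda x: -0.5 * x * x - ((x - a) / delta) ** 2


def maybe_swap(states, q, log_pis, rng):
    """Propose swapping states[q] and states[q + 1]; return True if accepted."""
    prob = swap_accept_prob(log_pis[q], log_pis[q + 1], states[q], states[q + 1])
    if rng.random() < prob:
        states[q], states[q + 1] = states[q + 1], states[q]
        return True
    return False
```

Normalising constants cancel in the ratio, so unnormalised densities suffice. When the chain with the larger $\delta$ holds the state that fits the data better, the ratio exceeds one and the swap is always accepted.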

The PT algorithm for numerical disintegration is described in Algorithm 3. The samples $\{x_{j}^{m}\}_{j=1}^{P}$ are approximate draws from the distribution $\mu_{\delta_{m}}^{a}$ .

Algorithms 1 and 3 are each valid for sampling from a target measure $\mu_{\delta}^{a}$ . The choice of which algorithm to use is problem dependent, and each algorithm has been applied in the experiments in Section 6.

S4.4 Estimation of Model Evidence

The model evidence $p_{A}(a)$ was estimated as a by-product of the numerical disintegration algorithm developed. Attention is restricted to the specific relaxation function $\phi(r)=\exp(-r^{2})$ . Then the thermodynamic integral identity [Gelman and Meng, 1998] can be exploited to calculate the model evidence:

[TABLE]

where the parameterisation $\delta\mapsto\delta/\sqrt{t}$ is such that $t=0$ corresponds to the prior, while $t=1$ corresponds to the distribution $\mu^{a}_{\delta}$ .

To approximate this integral, the outer integral is first discretised. To this end, fix a sequence $\infty=\delta_{0}>\delta_{1}>\dots>\delta_{m}$ of relaxation parameters. For convenience this may be the same sequence as used to apply numerical disintegration. Then for $\delta_{m}$ small, and letting $\sqrt{t_{i}}=\delta_{m}/\delta_{i}$ :

[TABLE]

Thus we obtain a consistent approximation

[TABLE]

The terms $(*)$ were estimated via Monte Carlo, based on samples from the distributions $\mu_{\delta_{i}}^{a}$ obtained through numerical disintegration. Higher-order quadrature rules and variance reduction techniques can be used, but were not implemented for this work [Oates et al., 2016b].
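The discretisation step can be illustrated with a short sketch. It assumes a trapezoidal rule on the grid $t_{i}=(\delta_{m}/\delta_{i})^{2}$ , which may differ from the rule used for the experiments, and it targets the log normalising constant of the relaxed posterior, $\int_{0}^{1}\mathbb{E}_{t}[\log\phi]\,\mathrm{d}t$ . The Gaussian helper functions are hypothetical and exist only to check the quadrature against a closed form (prior $N(0,1)$ , $A(x)=x$ ).

```python
import math


def thermodynamic_log_z(ts, expected_log_phi):
    """Trapezoidal estimate of int_0^1 E_t[log phi] dt on the grid `ts`.

    `expected_log_phi[i]` is the expectation of -|A(x) - a|^2 / delta_m^2
    under the tempered distribution at temperature ts[i].
    """
    if len(ts) != len(expected_log_phi) or len(ts) < 2:
        raise ValueError("need matching grids with at least two points")
    total = 0.0
    for i in range(1, len(ts)):
        width = ts[i] - ts[i - 1]
        total += 0.5 * width * (expected_log_phi[i] + expected_log_phi[i - 1])
    return total


def gaussian_expected_log_phi(t, a, delta):
    """Exact E_t[-(x - a)^2 / delta^2] in the toy Gaussian model (check only)."""
    precision = 1.0 + 2.0 * t / delta ** 2
    second_moment = 1.0 / precision + (a / precision) ** 2
    return -second_moment / delta ** 2


def gaussian_exact_log_z(a, delta):
    """Closed-form log E_prior[exp(-(x - a)^2 / delta^2)] for prior N(0, 1)."""
    return (math.log(delta) - 0.5 * math.log(2.0 + delta ** 2)
            - a ** 2 / (2.0 + delta ** 2))
```

In practice the exact expectations would be replaced by Monte Carlo averages of $-\|A(x)-a\|_{\mathcal{A}}^{2}/\delta_{m}^{2}$ over the samples from each $\mu_{\delta_{i}}^{a}$ .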

S4.5 Monte Carlo Details for Painlevé Transcendental

Sampling of the posterior was performed for a temperature schedule of $m=1600$ steps, equally spaced on a logarithmic scale from $10$ to $10^{-4}$ , for an ensemble of $P=200$ particles.
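A schedule of this form can be generated as in the sketch below. Whether both endpoints are included among the $m$ values is an assumption, and the function name is hypothetical.

```python
import math


def log_spaced_schedule(delta_first, delta_last, m):
    """Return m values equally spaced on a log scale, endpoints included."""
    if m < 2:
        raise ValueError("m must be at least 2")
    log_first = math.log(delta_first)
    log_last = math.log(delta_last)
    step = (log_last - log_first) / (m - 1)
    return [math.exp(log_first + i * step) for i in range(m)]


# Schedule matching the description above: 1600 steps from 10 down to 1e-4.
schedule = log_spaced_schedule(10.0, 1e-4, 1600)
```

Equal spacing on a logarithmic scale keeps the ratio $\delta_{i+1}/\delta_{i}$ constant, so each step shrinks the relaxation parameter by the same factor.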

Specification of appropriate transition kernels $K_{i}$ for this problem was challenging due both to the high dimension and the empirical observation that, for small $\delta$ , mixing of the chains tends to be poor. This is likely due to the nonlinearity of the information operator, which leads to a highly complex posterior structure. For this reason a gradient-based sampler was used to construct the transition kernel: the Metropolis-adjusted Langevin algorithm (MALA) [Roberts and Tweedie, 1996].

Denote by $u^{k}$ the coefficients $[u^{k}_{j}]_{j=1}^{N}$ at iteration $k$ of MALA. Then, recall that MALA has proposals given by

[TABLE]

where $W$ is a standard Gaussian distribution and $\Gamma\in\mathbb{R}^{N\times N}$ is a positive definite preconditioning matrix. The $\tau_{i}$ were taken to be fixed for each kernel $K_{i}$ to a value found empirically to provide a reasonable acceptance rate. $\pi_{i}$ denotes the unnormalised target distribution for $K_{i}$ , here given by

[TABLE]

where $x^{N}=\sum_{i=0}^{N}u_{i}\phi_{i}$ and $q^{N}(\cdot)$ denotes the prior density of the coefficients $[u_{j}]_{j=1}^{N}$ .

To ensure proposals were scaled to match the decay of the prior for the coefficients, we took $\Gamma=\textrm{diag}(\gamma)$ , the diagonal matrix which has the coefficients $\gamma_{i}$ on its diagonal. Even with such a transition kernel, mixing is generally poor. To compensate, $k$ was taken to be large; for $n=12,17$ we took $k=10,000$ , while for $n=22$ we took $k=40,000$ . We note that such a large number of temperature levels and transitions makes computation expensive, highlighting the importance of future work toward methods for approximating the Bayesian posterior in a more computationally efficient manner.
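A single MALA transition with a diagonal preconditioner can be sketched as follows. The proposal used is the standard preconditioned form $u^{*}=u^{k}+\tau\Gamma\nabla\log\pi(u^{k})+\sqrt{2\tau}\,\Gamma^{1/2}W$ , which is an assumption about the parameterisation and is not taken from the display above. The function name and the Gaussian target in the usage note are hypothetical.

```python
import math
import random


def mala_step(u, log_pi, grad_log_pi, tau, gamma, rng):
    """One Metropolis-adjusted Langevin step with Gamma = diag(gamma).

    Returns (new_state, accepted). `log_pi` may be unnormalised.
    """
    def drift_mean(point):
        # Mean of the Langevin proposal started from `point`.
        grad = grad_log_pi(point)
        return [p + tau * g * d for p, g, d in zip(point, gamma, grad)]

    def log_q(target, mean):
        # Log proposal density up to a constant that cancels in the ratio.
        return -sum((t - m) ** 2 / (4.0 * tau * g)
                    for t, m, g in zip(target, mean, gamma))

    mean_fwd = drift_mean(u)
    proposal = [m + math.sqrt(2.0 * tau * g) * rng.gauss(0.0, 1.0)
                for m, g in zip(mean_fwd, gamma)]
    mean_rev = drift_mean(proposal)

    # Metropolis-Hastings correction for the asymmetric proposal.
    log_alpha = (log_pi(proposal) + log_q(u, mean_rev)
                 - log_pi(u) - log_q(proposal, mean_fwd))
    if log_alpha >= 0.0 or rng.random() < math.exp(log_alpha):
        return proposal, True
    return u, False
```

For a standard Gaussian target one might call `mala_step(u, lambda v: -0.5 * sum(x * x for x in v), lambda v: [-x for x in v], 0.5, [1.0, 1.0], random.Random(0))` repeatedly, tuning `tau` until the acceptance rate is reasonable.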

S4.6 Monte Carlo Details for Poisson Equation

The posterior distribution was obtained by use of the PT algorithm, for $m=20$ temperatures equally spaced on a logarithmic scale between $10^{-2}$ and $10^{-4}$ . The transition kernels $K_{i}$ were given by 10 iterations of a MALA sampler, with preconditioner as described earlier and parameter $\tau$ chosen to achieve a good acceptance rate. The number of iterations $P$ was taken to be $10^{6}$ when $n=25$ and $10^{7}$ when $n=25$ or $n=36$ .

S5 Truncation of the Prior Distribution (Proof of Theorem 4.8)

In this section we present the proof of Theorem 4.8 in the main text. We use a general result on the well-posedness of Bayesian inverse problems:

Theorem S5.1 (Theorem 4.6 in Sullivan [2016]).

Let $\mathcal{X}$ and $\mathcal{A}$ be separable quasi-Banach spaces over $\mathbb{R}$ . Suppose that

[TABLE]

where the potential function $\Phi_{\delta}$ satisfies:

S0

$\Phi_{\delta}(x;\cdot)$ is continuous for each $x\in\mathcal{X}$ , $\Phi_{\delta}(\cdot;a)$ is measurable for each $a\in\mathcal{A}$ , and for every $r>0$ , there exists $M_{0,r,\delta}\in\mathbb{R}$ such that, for all $(x,a)\in\mathcal{X}\times\mathcal{A}$ with $\|x\|_{\mathcal{X}}<r$ and $\|a\|_{\mathcal{A}}<r$ ,

[TABLE]

S1

For every $r>0$ , there exists a measurable $M_{1,r,\delta}\colon\mathbb{R}_{+}\to\mathbb{R}$ such that, for all $(x,a)\in\mathcal{X}\times\mathcal{A}$ with $\|a\|_{\mathcal{A}}<r$ ,

[TABLE]

S2

For every $r>0$ , there exists a measurable $M_{2,r,\delta}\colon\mathbb{R}_{+}\to\mathbb{R}_{+}$ such that, for all $(x,a,\tilde{a})\in\mathcal{X}\times\mathcal{A}\times\mathcal{A}$ with $\|a\|_{\mathcal{A}}<r$ , $\|\tilde{a}\|_{\mathcal{A}}<r$ ,

[TABLE]

Let $\Phi_{\delta,N}$ be an approximation to $\Phi_{\delta}$ that satisfies (S0-S2) with $M_{i,r,\delta}$ independent of $N$ , and such that

S3

$\Psi\colon\mathbb{N}\to\mathbb{R}_{+}$ is such that, for every $r>0$ , there exists a measurable $M_{3,r,\delta}\colon\mathbb{R}_{+}\to\mathbb{R}_{+}$ , such that, for all $(x,a)\in\mathcal{X}\times\mathcal{A}$ with $\|a\|_{\mathcal{A}}<r$ ,

[TABLE]

S4

For some $r>0$ ,

[TABLE]

Let $d_{\text{H}}$ denote the Hellinger distance on $\mathcal{P}_{\mathcal{X}}$ . Then there exists a constant $C_{\delta}$ , independent of $N$ , such that

$d_{\text{H}}\bigl(\mu_{\delta,N}^{a},\mu_{\delta}^{a}\bigr)\leq C_{\delta}\Psi(N),$

where $\mu_{\delta,N}^{a}$ is the posterior distribution based on the potential function $\Phi_{\delta,N}$ instead of $\Phi_{\delta}$ .

This allows us to establish conditions on $A$ and $\mu$ that guarantee stability under truncation of the prior:

Proof of Theorem 4.8.

Let $\varphi$ be as in Section 4.1, and let

[TABLE]

Our task is to check the conditions of Theorem S5.1 hold for $\Phi_{\delta}$ and $\Phi_{\delta,N}$ .

S0

First, note that $\Phi_{\delta}(x;\cdot)$ is continuous (since $\varphi$ is continuous from Assumption 4.1 and $\Phi_{\delta}(x;\cdot)$ is a composition of continuous functions) and that $\Phi_{\delta}(\cdot;a)$ is measurable (since $\phi$ is measurable and $\Phi_{\delta}(\cdot;a)$ is a composition of measurable functions). Second, note that $\varphi$ is a continuous bijection from $(0,\infty)$ to itself with $\varphi(0)=0$ . Thus $\varphi^{-1}$ exists and we can consider

[TABLE]

Thus we can take $M_{0,r,\delta}=\varphi(\frac{1}{\delta}\sup_{x\in\mathcal{X}}\|A(x)\|_{\mathcal{A}}+\frac{r}{\delta})$ .

S1

Since $\Phi_{\delta}(x;a)\geq 0$ we can take $M_{1,r,\delta}=0$ .

S2

Given $r>0$ let $R=\frac{1}{\delta}\sup_{x\in\mathcal{X}}\|A(x)\|_{\mathcal{A}}+\frac{r}{\delta}$ , which is finite by Assumption 4.6. The upper bound

[TABLE]

demonstrates that we can take $M_{2,r,\delta}=\max\{0,\log(\frac{C_{R}}{\delta})\}$ .

Minor variations on the above arguments show that S0-S2 also hold for $\Phi_{\delta,N}$ with the same constants $M_{i,r,\delta}$ .

S3

Let $C_{R}$ be defined as in S2. The upper bound

[TABLE]

demonstrates that we can take $M_{3,r,\delta}(\|x\|_{\mathcal{X}})=\max\{0,\log(\frac{C_{R}}{\delta})+m(\|x\|_{\mathcal{X}})\}$ .

S4

Let $C_{R}$ be defined as in S2. The upper bound

[TABLE]

establishes the last of the conditions for Theorem S5.1 to hold.

Thus from Theorem S5.1, $d_{\text{H}}\bigl(\mu_{\delta,N}^{a},\mu_{\delta}^{a}\bigr)\leq C_{\delta}\Psi(N)$ . The proof is completed since Assumption 4.7 implies that $d_{\mathcal{F}}\leq C_{\mathcal{F}}^{-1}d_{\text{TV}}$ where $d_{\text{TV}}$ is the total variation distance based on $\mathcal{F}=\{f:\|f\|_{\infty}\leq 1\}$ ; in turn it is a standard fact that $d_{\text{TV}}\leq\sqrt{2}d_{\text{H}}$ . ∎

Bibliography


1. Ackerman et al. [2017] N. L. Ackerman, C. E. Freer, and D. M. Roy. On computability and disintegration. Mathematical Structures in Computer Science, 2017. To appear.
2. Albert et al. [2012] I. Albert, S. Donnet, C. Guihenneuc-Jouyaux, S. Low-Choy, K. Mengersen, and J. Rousseau. Combining expert opinions in prior elicitation. Bayesian Anal., 7(3):503–531, 2012. 10.1214/12-BA717.
3. Anderson [2011] T. V. Anderson. Efficient, accurate, and non-gaussian error propagation through nonlinear, closed-form, analytical system models. Master’s thesis, Department of Mechanical Engineering, Brigham Young University, 2011.
4. Babuška and Söderlind [2016] I. Babuška and G. Söderlind. On round-off error growth in elliptic problems, 2016. In preparation.
5. Bartels and Hennig [2016] S. Bartels and P. Hennig. Probabilistic approximate least-squares. In Proceedings of Artificial Intelligence and Statistics (AISTATS), 2016.
6. Berger [1985] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer Series in Statistics. Springer-Verlag, New York, second edition, 1985. 10.1007/978-1-4757-4286-2.
7. Beskos et al. [2014] A. Beskos, D. Crisan, and A. Jasra. On the stability of sequential Monte Carlo methods in high dimensions. Ann. Appl. Probab., 24(4):1396–1445, 2014. 10.1214/13-AAP951.
8. Beskos et al. [2015] A. Beskos, A. Jasra, E. A. Muzaffer, and A. M. Stuart. Sequential Monte Carlo methods for Bayesian elliptic inverse problems. Stat. Comput., 25(4):727–737, 2015. 10.1007/s11222-015-9556-7.
