The Trio Identity for Quasi-Monte Carlo Error

Fred J. Hickernell

arXiv:1702.01487·math.NA·August 18, 2017

The Trio Identity for Quasi-Monte Carlo Error

Fred J. Hickernell

PDF

TL;DR

This paper introduces the trio identity for Quasi-Monte Carlo error, linking integrand variation, sampling discrepancy, and confounding to better understand and reduce integration errors.

Contribution

It formalizes the trio identity for Monte Carlo errors, highlighting the role of confounding and demonstrating how low discrepancy sampling improves accuracy.

Findings

01

Confounding is a key factor in cubature error.

02

Low discrepancy sampling reduces integration error.

03

Rewriting the integrand can further decrease error.

Abstract

Monte Carlo methods approximate integrals by sample averages of integrand values. The error of Monte Carlo methods may be expressed as a trio identity: the product of the variation of the integrand, the discrepancy of the sampling measure, and the confounding. The trio identity has different versions, depending on whether the integrand is deterministic or Bayesian and whether the sampling measure is deterministic or random. Although the variation and the discrepancy are common in the literature, the confounding is relatively unknown and under-appreciated. Theory and examples are used to show how the cubature error may be reduced by employing the low discrepancy sampling that defines quasi-Monte Carlo methods. The error may also be reduced by rewriting the integral in terms of a different integrand. Finally, the confounding explains why the cubature error might decay at a rate different…

Tables2

Table 1. Table 1: Different versions of the trio identity

	Sampling Measure, $\hat{ν}$
Integrand, $f$	Deterministic	Random
Deterministic	Deterministic $=^{D}$	Randomized $=^{R}$
Gaussian Process	Bayesian $=^{B}$	Bayesian Randomized $=^{B R}$

Table 2. Table 2: Confounding orders for deterministic randomized settings and two different sets of equi-weighted random sampling measures. Sufficient smoothness of the integrand is assumed. The order of the error equals the order of the discrepancy times the order of the confounding.

\begin{matrix} RMS & Typical & Unscr. & Typical \\ Deterministic setting & L^{2} ​ {- DSC}^{D} & \hat{ν} & Worst & IID & Sobol’ & Scr. Sobol’ \\ IID Sampling & 𝒪 ​ (n^{- 1 / 2}) & {CNF}^{D} & 𝒪 ​ (1) & 𝒪 ​ (n^{- 1 / 2 + ϵ}) & 𝒪 ​ (n^{- 1 + ϵ}) \\ Scr. Sobol’ Sampling & 𝒪 ​ (n^{- 1 + ϵ}) & {CNF}^{D} & 𝒪 ​ (1) & 𝒪 ​ (n^{- 1 / 2 + ϵ}) \\ μ - \hat{μ} & 𝒪 ​ (1) & 𝒪 ​ (n^{- 1 / 2}) & 𝒪 ​ (n^{- 1 + ϵ}) & 𝒪 ​ (n^{- 3 / 2 + ϵ}) \\ Randomized setting & {DSC}^{R} \\ IID Sampling & 𝒪 ​ (n^{- 1 / 2}) & {CNF}^{R} & 𝒪 ​ (n^{1 / 2}) & 𝒪 ​ (1) & 𝒪 ​ (n^{- 1 / 2 + ϵ}) & 𝒪 ​ (n^{- 1 + ϵ}) \\ Scr. Sobol’ Sampling & 𝒪 ​ (n^{- 3 / 2 + ϵ}) & {CNF}^{R} & 𝒪 ​ (n^{1 / 2 + ϵ}) & 𝒪 ​ (1) \end{matrix}

Equations130

μ = \int_{X} f (x) ν (d x),

μ = \int_{X} f (x) ν (d x),

μ = i = 1 \sum n f (x_{i}) w_{i} = \int_{X} f (x) ν (d x) .

μ = i = 1 \sum n f (x_{i}) w_{i} = \int_{X} f (x) ν (d x) .

M_{S} := {i = 1 \sum n w_{i} δ_{x_{i}} : w_{1}, \dots, w_{n} \in R, x_{1}, \dots, x_{n} \in X, n \in N},

M_{S} := {i = 1 \sum n w_{i} δ_{x_{i}} : w_{1}, \dots, w_{n} \in R, x_{1}, \dots, x_{n} \in X, n \in N},

μ - μ = CNF (f, ν - ν) DSC (ν - ν) VAR (f),

μ - μ = CNF (f, ν - ν) DSC (ν - ν) VAR (f),

VAR^{D} (f) := ∥ f - T (f) ∥_{F} \forall f \in F .

VAR^{D} (f) := ∥ f - T (f) ∥_{F} \forall f \in F .

M_{⊥} := {{η \in M : η (X) = 0}, M, {T (f) : f \in F} \neq = {0}, {T (f) : f \in F} = {0} .

M_{⊥} := {{η \in M : η (X) = 0}, M, {T (f) : f \in F} \neq = {0}, {T (f) : f \in F} = {0} .

\left\lVert\eta\right\rVert_{{\mathcal{M}}_{\bot}}:=\sup_{f\in{\mathcal{F}}:f\neq 0}\frac{\displaystyle\biggl{\lvert}\int_{{\mathcal{X}}}f({\bm{x}})\,\eta(\mathrm{d}{\bm{x}})\biggr{\rvert}}{\left\lVert f\right\rVert_{{\mathcal{F}}}},\qquad\textup{DSC}^{\textup{D}}(\nu-\widehat{\nu}):=\left\lVert\nu-\widehat{\nu}\right\rVert_{{\mathcal{M}}_{\bot}}.

\left\lVert\eta\right\rVert_{{\mathcal{M}}_{\bot}}:=\sup_{f\in{\mathcal{F}}:f\neq 0}\frac{\displaystyle\biggl{\lvert}\int_{{\mathcal{X}}}f({\bm{x}})\,\eta(\mathrm{d}{\bm{x}})\biggr{\rvert}}{\left\lVert f\right\rVert_{{\mathcal{F}}}},\qquad\textup{DSC}^{\textup{D}}(\nu-\widehat{\nu}):=\left\lVert\nu-\widehat{\nu}\right\rVert_{{\mathcal{M}}_{\bot}}.

CNF^{D} (f, ν - ν) := ⎩ ⎨ ⎧ \frac{\int _{X} f ( x ) ( ν - ν ) ( d x )}{VAR ^{D} ( f ) DSC ^{D} ( ν - ν )}, 0, VAR^{D} (f) DSC^{D} (ν - ν) \neq = 0, otherwise .

CNF^{D} (f, ν - ν) := ⎩ ⎨ ⎧ \frac{\int _{X} f ( x ) ( ν - ν ) ( d x )}{VAR ^{D} ( f ) DSC ^{D} ( ν - ν )}, 0, VAR^{D} (f) DSC^{D} (ν - ν) \neq = 0, otherwise .

μ - μ = CNF^{D} (f, ν - ν) DSC^{D} (ν - ν) VAR^{D} (f) .

μ - μ = CNF^{D} (f, ν - ν) DSC^{D} (ν - ν) VAR^{D} (f) .

μ - μ = \int_{X} f (x) (ν - ν) (d x) .

μ - μ = \int_{X} f (x) (ν - ν) (d x) .

\displaystyle\bigl{\lvert}\textup{CNF}(f,\nu-\widehat{\nu})\bigr{\rvert}

\displaystyle\bigl{\lvert}\textup{CNF}(f,\nu-\widehat{\nu})\bigr{\rvert}

\displaystyle=\frac{\biggl{\lvert}\displaystyle\int_{{\mathcal{X}}}[f({\bm{x}})-T(f)]\,(\nu-\widehat{\nu})(\mathrm{d}{\bm{x}})\biggr{\rvert}}{\left\lVert f-T(f)\right\rVert_{{\mathcal{F}}}\textup{DSC}^{\textup{D}}(\nu-\widehat{\nu})}\quad\text{by \eqref{FJH:eq:detvardef} and \eqref{FJH:eq:Mperpdef}}

\leq 1 by \eqref FJH:eq:detdiscdef,

K (\cdot, t) \in F and f (t) = ⟨ K (\cdot, t), f ⟩_{F} \forall f \in F, t \in X .

K (\cdot, t) \in F and f (t) = ⟨ K (\cdot, t), f ⟩_{F} \forall f \in F, t \in X .

η_{err} (t) = ⟨ K (\cdot, t), η_{err} ⟩_{F} = \int_{X} K (x, t) (ν - ν) (d x) .

η_{err} (t) = ⟨ K (\cdot, t), η_{err} ⟩_{F} = \int_{X} K (x, t) (ν - ν) (d x) .

μ - μ = ⟨ η_{err}, f ⟩_{F} = CNF^{D} (f, ν - ν) \frac{⟨ η _{err} , f ⟩ _{F}}{∥ f - T ( f ) ∥ _{F} ∥ η _{err} ∥ _{F}} DSC^{D} (ν - ν) ∥ η_{err} ∥_{F} VAR^{D} (f) ∥ f - T (f) ∥_{F}

μ - μ = ⟨ η_{err}, f ⟩_{F} = CNF^{D} (f, ν - ν) \frac{⟨ η _{err} , f ⟩ _{F}}{∥ f - T ( f ) ∥ _{F} ∥ η _{err} ∥ _{F}} DSC^{D} (ν - ν) ∥ η_{err} ∥_{F} VAR^{D} (f) ∥ f - T (f) ∥_{F}

T (f) [ν (X) - ν (X)] = 0.

T (f) [ν (X) - ν (X)] = 0.

[DSC^{D} (ν - ν)]^{2}

[DSC^{D} (ν - ν)]^{2}

= \int_{X \times X} K (x, t) (ν - ν) (d x) (ν - ν) (d t)

= \int_{X \times X} K (x, t) ν (d x) ν (d t)

- 2 i = 1 \sum n w_{i} \int_{X} K (x_{i}, t) ν (d t) + i, j = 1 \sum n w_{i} w_{j} K (x_{i}, x_{j}) .

\displaystyle{\bm{w}}=\bigl{(}w_{i}\bigr{)}_{i=1}^{n},\qquad k_{0}=\int_{{\mathcal{X}}}K({\bm{x}},{\bm{t}})\,\nu(\mathrm{d}{\bm{x}})\,\nu(\mathrm{d}{\bm{t}}),

\displaystyle{\bm{w}}=\bigl{(}w_{i}\bigr{)}_{i=1}^{n},\qquad k_{0}=\int_{{\mathcal{X}}}K({\bm{x}},{\bm{t}})\,\nu(\mathrm{d}{\bm{x}})\,\nu(\mathrm{d}{\bm{t}}),

\displaystyle{\bm{k}}=\biggl{(}\int_{{\mathcal{X}}}K({\bm{x}}_{i},{\bm{t}})\,\nu(\mathrm{d}{\bm{t}})\biggr{)}_{i=1}^{n},\qquad\mathsf{K}=\bigl{(}K({\bm{x}}_{i},{\bm{x}}_{j})\bigr{)}_{i,j=1}^{n},

[DSC^{D} (ν - ν)]^{2} = k_{0} - 2 k^{T} w + w^{T} K w .

K (x, t) = k = 1 \prod d [2 - max (x_{k}, t_{k})] .

K (x, t) = k = 1 \prod d [2 - max (x_{k}, t_{k})] .

\textup{VAR}^{\textup{D}}(f)=\left\lVert f-f({\bm{1}})\right\rVert_{\mathcal{F}}=\Bigl{\lVert}\bigl{(}\left\lVert\partial^{\mathfrak{u}}f\right\rVert_{L^{2}}\bigr{)}_{\emptyset\subsetneq{\mathfrak{u}}\subseteq 1:d}\Bigr{\rVert}_{\ell^{2}},\qquad\partial^{\mathfrak{u}}f:=\frac{\partial^{\lvert{\mathfrak{u}}\rvert}f}{\partial{\bm{x}}_{\mathfrak{u}}}\biggr{\rvert}_{{\bm{x}}_{\bar{{\mathfrak{u}}}}={\bm{1}}}.

\textup{VAR}^{\textup{D}}(f)=\left\lVert f-f({\bm{1}})\right\rVert_{\mathcal{F}}=\Bigl{\lVert}\bigl{(}\left\lVert\partial^{\mathfrak{u}}f\right\rVert_{L^{2}}\bigr{)}_{\emptyset\subsetneq{\mathfrak{u}}\subseteq 1:d}\Bigr{\rVert}_{\ell^{2}},\qquad\partial^{\mathfrak{u}}f:=\frac{\partial^{\lvert{\mathfrak{u}}\rvert}f}{\partial{\bm{x}}_{\mathfrak{u}}}\biggr{\rvert}_{{\bm{x}}_{\bar{{\mathfrak{u}}}}={\bm{1}}}.

[DSC^{D} (ν - ν)]^{2}

[DSC^{D} (ν - ν)]^{2}

\displaystyle=\Bigl{\lVert}\bigl{(}\left\lVert\nu([{\bm{0}},\cdot_{\mathfrak{u}}])-\widehat{\nu}([{\bm{0}},\cdot_{\mathfrak{u}}])\right\rVert_{L^{2}}\bigr{)}_{\emptyset\subsetneq{\mathfrak{u}}\subseteq 1:d}\Bigr{\rVert}_{\ell^{2}}.

{\mathbb{E}}\bigl{\{}[\textup{DSC}^{\textup{D}}(\nu-\widehat{\nu})]^{2}\bigr{\}}=\frac{1}{n}\left[\int_{{\mathcal{X}}}K({\bm{x}},{\bm{x}})\,\nu(\mathrm{d}{\bm{x}})-\int_{{\mathcal{X}}\times{\mathcal{X}}}K({\bm{x}},{\bm{t}})\,\nu(\mathrm{d}{\bm{x}})\,\nu(\mathrm{d}{\bm{t}})\right].

{\mathbb{E}}\bigl{\{}[\textup{DSC}^{\textup{D}}(\nu-\widehat{\nu})]^{2}\bigr{\}}=\frac{1}{n}\left[\int_{{\mathcal{X}}}K({\bm{x}},{\bm{x}})\,\nu(\mathrm{d}{\bm{x}})-\int_{{\mathcal{X}}\times{\mathcal{X}}}K({\bm{x}},{\bm{t}})\,\nu(\mathrm{d}{\bm{x}})\,\nu(\mathrm{d}{\bm{t}})\right].

{\mathbb{E}}\bigl{\{}[\textup{DSC}^{\textup{D}}(\nu-\widehat{\nu})]^{2}\bigr{\}}=\frac{1}{n}\left[\left(\frac{3}{2}\right)^{d}-\left(\frac{4}{3}\right)^{d}\right].

{\mathbb{E}}\bigl{\{}[\textup{DSC}^{\textup{D}}(\nu-\widehat{\nu})]^{2}\bigr{\}}=\frac{1}{n}\left[\left(\frac{3}{2}\right)^{d}-\left(\frac{4}{3}\right)^{d}\right].

\displaystyle{\mathbb{E}}_{\widehat{\nu}}\biggl{\lvert}\int_{{\mathcal{X}}}f({\bm{x}})\,\widehat{\nu}(\mathrm{d}{\bm{x}})\biggr{\rvert}^{2}<\infty\qquad\forall f\in{\mathcal{F}},

\displaystyle{\mathbb{E}}_{\widehat{\nu}}\biggl{\lvert}\int_{{\mathcal{X}}}f({\bm{x}})\,\widehat{\nu}(\mathrm{d}{\bm{x}})\biggr{\rvert}^{2}<\infty\qquad\forall f\in{\mathcal{F}},

{T (f) : f \in F} = {0} or ν (X) = ν (X) almost surely.

\textup{DSC}^{\textup{R}}(\nu-\widehat{\nu}):=\sup_{f\in{\mathcal{F}}:f\neq 0}\frac{\displaystyle\sqrt{{\mathbb{E}}_{\widehat{\nu}}\biggl{\lvert}\int_{{\mathcal{X}}}f({\bm{x}})\,(\nu-\widehat{\nu})(\mathrm{d}{\bm{x}})\biggr{\rvert}^{2}}}{\left\lVert f\right\rVert_{{\mathcal{F}}}}.

\textup{DSC}^{\textup{R}}(\nu-\widehat{\nu}):=\sup_{f\in{\mathcal{F}}:f\neq 0}\frac{\displaystyle\sqrt{{\mathbb{E}}_{\widehat{\nu}}\biggl{\lvert}\int_{{\mathcal{X}}}f({\bm{x}})\,(\nu-\widehat{\nu})(\mathrm{d}{\bm{x}})\biggr{\rvert}^{2}}}{\left\lVert f\right\rVert_{{\mathcal{F}}}}.

CNF^{R} (f, ν - ν) := ⎩ ⎨ ⎧ \frac{\int _{X} f ( x ) ( ν - ν ) ( d x )}{VAR ^{D} ( f ) DSC ^{R} ( ν - ν )}, 0, VAR^{D} (f) DSC^{R} (ν - ν) \neq = 0, otherwise .

CNF^{R} (f, ν - ν) := ⎩ ⎨ ⎧ \frac{\int _{X} f ( x ) ( ν - ν ) ( d x )}{VAR ^{D} ( f ) DSC ^{R} ( ν - ν )}, 0, VAR^{D} (f) DSC^{R} (ν - ν) \neq = 0, otherwise .

μ - μ = CNF^{R} (f, ν - ν) DSC^{R} (ν - ν) VAR^{D} (f) almost surely .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

∎

11institutetext: Fred J. Hickernell 22institutetext: Department of Applied Mathematics, Illinois Institute of Technology, 10 W. 32 ${}^{\text{nd}}$ Street, RE 208, Chicago, IL 60616, USA 22email: [email protected]

The Trio Identity for Quasi-Monte Carlo Error

Fred J. Hickernell

Abstract

Monte Carlo methods approximate integrals by sample averages of integrand values. The error of Monte Carlo methods may be expressed as a trio identity: the product of the variation of the integrand, the discrepancy of the sampling measure, and the confounding. The trio identity has different versions, depending on whether the integrand is deterministic or Bayesian and whether the sampling measure is deterministic or random. Although the variation and the discrepancy are common in the literature, the confounding is relatively unknown and under-appreciated. Theory and examples are used to show how the cubature error may be reduced by employing the low discrepancy sampling that defines quasi-Monte Carlo methods. The error may also be reduced by rewriting the integral in terms of a different integrand. Finally, the confounding explains why the cubature error might decay at a rate different from that of the discrepancy.

1 Introduction

Monte Carlo methods are used to approximate multivariate integrals that cannot be evaluated analytically, i.e., integrals of the form

[TABLE]

where $f:{\mathcal{X}}\to{\mathbb{R}}$ is a measurable function, ${\mathcal{X}}$ is a measurable set, and $\nu$ is a probability measure. Here, $\mu$ is the weighted average of the integrand. Also, $\mu={\mathbb{E}}[f({\bm{X}})]$ , where the random variable ${\bm{X}}$ has probability measure $\nu$ . Monte Carlo methods take the form of a weighted average of values of $f$ at a finite number of data sites, ${\bm{x}}_{1},\ldots,{\bm{x}}_{n}$ :

[TABLE]

The sampling measure, $\widehat{\nu}$ , assigns a weight $w_{i}$ to the function value at ${\bm{x}}_{i}$ and lies in the vector space

[TABLE]

where $\delta_{{\bm{t}}}$ denotes a Dirac measure concentrated at point ${\bm{t}}$ . The data sites, the weights, and the sample size may be deterministic or random. Later, we impose some constraints to facilitate the analysis.

We are particularly interested in sampling measures that choose the data sites more cleverly than independently and identically distributed (IID) with the aim of obtaining smaller errors for the same computational effort. Such sampling measures are the hallmark of quasi-Monte Carlo methods. It is common to choose $w_{1}=\cdots=w_{n}=1/n$ , in which case the sampling quality is determined solely by the choice of the data sites.

This tutorial describes how to characterize and analyze the cubature error, $\mu-\widehat{\mu}$ , as a trio identity:

[TABLE]

introduced by Xiao-Li Meng Men16a . Each term in this identity contributes to the error, and there are ways to decrease each.

$\textup{VAR}(f)$

measures the variation of the integrand from a typical value. The variation is positively homogeneous, i.e., $\textup{VAR}(cf)=\bigl{\lvert}c\bigr{\rvert}\textup{VAR}(f)$ . The variation is not the variance. Expressing $\mu$ in terms of a different integrand by means of a variable transformation may decrease the variation.

$\textup{DSC}(\nu-\widehat{\nu})$

measures the discrepancy of the sampling measure from the probability measure that defines the integral. The convergence rate of the discrepancy to zero as $n\to\infty$ characterizes the quality of the sampling measure.

$\textup{CNF}(f,\nu-\widehat{\nu})$

measures the confounding between the integrand and the difference between the measure defining the integral and the sampling measure. The magnitude of the confounding is bounded by one in some settings and has an expected square value of one in other settings. When the convergence rate of $\widehat{\mu}\to\mu$ differs from the convergence rate of $\textup{DSC}(\nu-\widehat{\nu})\to 0$ , the confounding is behaving unusually.

There are four versions of the trio identity corresponding to different models for the integrand and for the sampling measure as depicted in Table 1. The integrand may be an arbitrary (deterministic) element of a Banach space or it may be a Gaussian stochastic process. The sampling measure may be an arbitrary (deterministic) element of ${\mathcal{M}}_{\textup{S}}$ or chosen randomly. Here we derive and explain these four different versions of the trio identity and draw a baker’s dozen of key lessons, which are repeated at the end of this article.

Lesson 1

The trio identity (TRIO) decomposes the cubature error into a product of three factors: the variation of the integrand, the discrepancy of the sampling measure, and the confounding. This identity shows how the integrand and the sampling measure each contribute to the cubature error.

2 A Deterministic Trio Identity for Cubature Error

We start by generalizing the error bounds of Koksma Kok42 and Hlawka Hla61 . See also the monograph of Niederreiter Nie92 . Suppose that the integrand lies in some Banach space, $({\mathcal{F}},\left\lVert\cdot\right\rVert_{{\mathcal{F}}})$ , where function evaluation at any point in the domain, ${\mathcal{X}}$ , is a bounded, linear functional. This means that $\sup_{f\in{\mathcal{F}}}\bigl{\lvert}f({\bm{t}})\bigr{\rvert}/\left\lVert f\right\rVert_{{\mathcal{F}}}<\infty$ for all ${\bm{t}}\in{\mathcal{X}}$ and that $\int_{{\mathcal{X}}}f({\bm{x}})\,\delta_{{\bm{t}}}(\mathrm{d}{\bm{x}})=f({\bm{t}})$ for all $f\in{\mathcal{F}},\ {\bm{t}}\in{\mathcal{X}}$ . For example, one might choose ${\mathcal{F}}=C[0,1]^{d}$ , but ${\mathcal{F}}=L^{2}[0,1]^{d}$ is unacceptable. Let $T:{\mathcal{F}}\to{\mathbb{R}}$ be some bounded linear functional providing a typical value of $f$ , e.g., $T(f)=f({\bm{1}})$ or $T(f)=\int_{{\mathcal{X}}}f({\bm{x}})\,\nu(\mathrm{d}{\bm{x}})$ . If $\{T(f):f\in{\mathcal{F}}\}\neq\{0\}$ , then ${\mathcal{F}}$ is assumed to contain constant functions. The deterministic variation is a semi-norm that is defined as the norm of the function minus its typical value:

[TABLE]

Let ${\mathcal{M}}$ denote the vector space of signed measures for which integrands in ${\mathcal{F}}$ have finite integrals: ${\mathcal{M}}:=\bigl{\{}\text{signed measures }\eta:\bigl{\lvert}\int_{{\mathcal{X}}}f({\bm{x}})\,\eta(\mathrm{d}{\bm{x}})\bigr{\rvert}<\infty\ \ \forall f\in{\mathcal{F}}\bigr{\}}$ . We assume that our integral of interest is defined, so $\nu\in{\mathcal{M}}$ . Since function evaluation is bounded, ${\mathcal{M}}$ includes ${\mathcal{M}}_{\textup{S}}$ defined in (1) as well. Define the subspace

[TABLE]

For example, if $\widehat{\nu}({\mathcal{X}})=\nu({\mathcal{X}})$ , which is common, then $\nu-\widehat{\nu}$ is automatically in ${\mathcal{M}}_{\bot}$ . However, in some situations $\widehat{\nu}({\mathcal{X}})\neq\nu({\mathcal{X}})$ , as is noted in the discussion following (8) below. A semi-norm on ${\mathcal{M}}_{\bot}$ is induced by the norm on ${\mathcal{F}}$ , which provides the definition of discrepancy:

[TABLE]

Finally, define the confounding as

[TABLE]

The above definitions lead to the deterministic trio identity for cubature error.

Theorem 2.1 (Deterministic Trio Error Identity)

For the spaces of integrands and measures defined above, and for the above definitions of variation, discrepancy, and confounding, the following error identity holds for all $f\in{\mathcal{F}}$ and $\nu-\widehat{\nu}\in{\mathcal{M}}_{\bot}$ :

[TABLE]

Moreover, $\bigl{\lvert}\textup{CNF}^{\textup{D}}(f,\nu-\widehat{\nu})\bigr{\rvert}\leq 1$ .

Proof

The proof of this identity follows from the definitions above. It follows from (INT) and (MC) that for all $f\in{\mathcal{F}}$ and $\nu-\widehat{\nu}\in{\mathcal{M}}_{\bot}$ , the cubature error can be written as a single integral:

[TABLE]

If $\textup{VAR}^{\textup{D}}(f)=0$ , then $f=T(f)$ , and the integral above vanishes by the definition of ${\mathcal{M}}_{\bot}$ . If $\textup{DSC}^{\textup{D}}(\nu-\widehat{\nu})=0$ , then the integral above vanishes by (4). Thus, for $\textup{VAR}^{\textup{D}}(f)\textup{DSC}^{\textup{D}}(\nu-\widehat{\nu})=0$ the trio identity holds. If $\textup{VAR}^{\textup{D}}(f)\textup{DSC}^{\textup{D}}(\nu-\widehat{\nu})\neq 0$ , then the trio identity also holds by the definition of the confounding.

Next, we bound the magnitude of the confounding for $\textup{VAR}^{\textup{D}}(f)\textup{DSC}^{\textup{D}}(\nu-\widehat{\nu})\neq 0$ :

[TABLE]

since $\textup{VAR}^{\textup{D}}(f)\neq 0$ and so $f-T(f)\neq 0$ .

Because $\bigl{\lvert}\textup{CNF}^{\textup{D}}(f,\nu-\widehat{\nu})\bigr{\rvert}\leq 1$ , the deterministic trio identity implies a deterministic error bound: $\bigl{\lvert}\mu-\widehat{\mu}\bigr{\rvert}\leq\textup{DSC}^{\textup{D}}(\nu-\widehat{\nu})\,\textup{VAR}^{\textup{D}}(f)$ . However, there is value in keeping the confounding term as noted below in Lesson 5.

The error in approximating the integral of $cf$ is $c$ times that for approximating the integral of $f$ . This is reflected in the fact that $\textup{VAR}^{\textup{D}}(cf)=\lvert c\rvert\textup{VAR}^{\textup{D}}(f)$ and $\textup{CNF}(cf,\nu-\widehat{\nu})=\textup{sign}(c)\textup{CNF}(f,\nu-\widehat{\nu})$ , while $\textup{DSC}^{\textup{D}}(\nu-\widehat{\nu})$ does not depend on the integrand.

When ${\mathcal{F}}$ is a Hilbert space with reproducing kernel $K$ , the discrepancy has an explicit expression in terms of $K$ . The reproducing kernel is the unique function, $K:{\mathcal{X}}\times{\mathcal{X}}\to{\mathbb{R}}$ satisfying these two properties (Aro50, , Sec. 1):

[TABLE]

The Riesz Representation Theorem implies that the representer of cubature error is

[TABLE]

Thus, the deterministic trio identity for the reproducing kernel Hilbert space (RKHS) case is

[TABLE]

provided that

[TABLE]

The squared discrepancy takes the form Hic99a

[TABLE]

Assuming that the single integral and double integral of the reproducing kernel can be evaluated analytically, the computational cost to evaluate the discrepancy is ${\mathcal{O}}(n^{2})$ unless the kernel has a special form that speeds up the calculation of the double sum.

Lesson 2

The deterministic discrepancy when ${\mathcal{F}}$ is an RKHS has a simple, explicit form involving three terms.

In the RKHS case, the confounding corresponds to the cosine of the angle between $f-T(f)$ and the cubature error representer, $\eta_{\textup{err}}$ . This cosine is no greater than one in magnitude, as expected.

The square deterministic discrepancy for an RKHS may be expressed in terms of vectors and matrices:

[TABLE]

Given fixed data sites, the optimal cubature weights to minimize the discrepancy are ${\bm{w}}=\mathsf{K}^{-1}{\bm{k}}$ . If ${\bm{1}}^{T}\mathsf{K}^{-1}{\bm{k}}=1$ , which is possible but not automatic, then $\widehat{\nu}({\mathcal{X}})=\nu({\mathcal{X}})=1$ for these optimal weights, and (DTRIO) holds for general $T$ . Otherwise, one must define $T(f)=0$ for all $f\in{\mathcal{F}}$ to satisfy condition (7) for these optimal cubature weights.

A particular example of this RKHS setting corresponds to the uniform probability measure $\nu$ on the $d$ -dimensional unit cube, ${\mathcal{X}}=[0,1]^{d}$ , and the reproducing kernel defined by Hic97a

[TABLE]

In this example, $T(f)=f({\bm{1}})$ , and the variation is

[TABLE]

Here $1\!:\!d$ means $=\{1,\ldots,d\}$ , $x_{\mathfrak{u}}$ means $(x_{k})_{k\in{\mathfrak{u}}}$ , and $\bar{{\mathfrak{u}}}$ denotes the complement of ${\mathfrak{u}}$ . The square discrepancy for the equally weighted case with $w_{1}=\cdots=w_{n}=1/n$ is

[TABLE]

This discrepancy has a geometric interpretation: $\nu([{\bm{0}},{\bm{x}}_{\mathfrak{u}}])$ corresponds to the volume of the $\lvert{\mathfrak{u}}\rvert$ -dimensional box $[{\bm{0}},{\bm{x}}_{\mathfrak{u}}]$ , and $\widehat{\nu}([{\bm{0}},{\bm{x}}_{\mathfrak{u}}])$ corresponds to the proportion of data sites lying in the box $[{\bm{0}},{\bm{x}}_{\mathfrak{u}}]$ . The discrepancy in (10), which is called the $L^{2}$ -discrepancy, depends on difference between this volume and this proportion for all ${\bm{x}}\in[0,1]^{d}$ and for all $\emptyset\subsetneq{\mathfrak{u}}\subseteq 1\!:\!d$ .

If the data sites ${\bm{x}}_{1},\ldots,{\bm{x}}_{n}$ are chosen to be IID with probability measure $\nu$ , and $w_{1}=\cdots=w_{n}=1/n$ , then the mean square discrepancy for the RKHS case is

[TABLE]

For the $L^{2}$ -discrepancy in (10) this becomes

[TABLE]

Quasi-Monte Carlo methods generally employ sampling measures of the form $\widehat{\nu}=n^{-1}\sum_{i=1}^{n}\delta_{{\bm{x}}_{i}}$ , but choose the data sites $\{{\bm{x}}_{i}\}_{i=1}^{n}$ to be better than IID in the sense of discrepancy. For integration over ${\mathcal{X}}=[0,1]^{d}$ with respect to the uniform measure, these low discrepancy data sites may come from

•

a digital sequence DicPil10a , such as that proposed by Sobol’ Sob67 , Faure Fau82 , Niederreiter Nie88 , or Niederreiter and Xing NieXin96 , or

•

a sequence of node sets of an integration lattice SloJoe94 .

The constructions of such sets are described in the references above and L’Ecuyer’s tutorial in this volume. The $L^{2}$ -discrepancy defined in (10) and its relatives are ${\mathcal{O}}(n^{-1+\epsilon})$ as $n\to\infty$ for any positive $\epsilon$ for these low discrepancy data sites Nie92 .

Fig. 1 displays examples of IID and randomized low discrepancy data sites. Fig. 2 shows the rates of decay for the $L^{2}$ -discrepancy for various dimensions. The scaled discrepancy is the empirically computed root mean square discrepancy divided by its value for $n=1$ . Although the decay for the low discrepancy points is ${\mathcal{O}}(n^{-1+\epsilon})$ for large enough $n$ , the decay in Fig. 2 resembles ${\mathcal{O}}(n^{-1/2})$ for large dimensions and modest $n$ . The scaled discrepancy for IID samples in Fig. 2 does not exhibit a dimension dependence because it is masked by the scaling. The dimension dependence of the convergence of the discrepancy to zero is addressed later in Sec. 8.

Lesson 3

Quasi-Monte Carlo methods replace IID data sites by low discrepancy data sites, such as Sobol’ sequences and integration lattice nodeset sequences. The resulting sampling measures have discrepancies and cubature errors that decay to zero at a faster rate than in the case of IID sampling.

No sampling scheme can produce a faster convergence rate than ${\mathcal{O}}(n^{-1})$ for the $L^{2}$ -discrepancy. This is due to the limited smoothness of the reproducing kernel defined in (9) and the corresponding limited smoothness of the corresponding Hilbert space of integrands.

3 A Randomized Trio Identity for Cubature Error

For the randomized version of the trio identity, we again assume that the integrands lie in a Banach space, $({\mathcal{F}},\left\lVert\cdot\right\rVert_{{\mathcal{F}}})$ . This space is required to contain constant functions if $\{T(f):f\in{\mathcal{F}}\}\neq\{0\}$ . We assume that $\int_{{\mathcal{X}}}f({\bm{x}})\,\nu(\mathrm{d}{\bm{x}})$ is defined for all $f\in{\mathcal{F}}$ , however, we do not require function evaluation to be a bounded linear functional on ${\mathcal{F}}$ . The definitions of the bounded linear functional $T$ and the variation in the deterministic case in (2) apply here as well.

Now endow the vector space of all sampling measures, ${\mathcal{M}}_{\textup{S}}$ , with a probability distribution. This means that the placement of the data sites, the number of data sites, and the choice of the weights may all be random. We require the following two conditions:

[TABLE]

The first condition implies that $\int_{{\mathcal{X}}}f({\bm{x}})\,\widehat{\nu}(\mathrm{d}{\bm{x}})$ exists almost surely for every $f\in{\mathcal{F}}$ .

The randomized discrepancy is defined as the worst normalized root mean squared error:

[TABLE]

The randomized discrepancy does not depend on the particular instance of the sampling measure but on the distribution of the sampling measure.

Finally, define the confounding as

[TABLE]

Here, the confounding does depend on the particular instance of the sampling measure. The above definitions allow us to establish the randomized trio identity for cubature error.

Theorem 3.1 (Randomized Trio Error Identity)

For the spaces of integrands and measures defined above, and for the above definitions of variation, discrepancy, and confounding, the following error identity holds for all $f\in{\mathcal{F}}$ and $\widehat{\nu}\in{\mathcal{M}}_{\textup{S}}$ :

[TABLE]

Moreover, ${\mathbb{E}}_{\widehat{\nu}}\bigl{\lvert}\textup{CNF}^{\textup{R}}(f,\nu-\widehat{\nu})\bigr{\rvert}^{2}\leq 1$ for all $f\in{\mathcal{F}}$ .

Proof

For all $f\in{\mathcal{F}}$ and $\widehat{\nu}\in{\mathcal{M}}_{\textup{S}}$ , the error can be written as the single integral in (6) almost surely. If $\textup{VAR}^{\textup{D}}(f)=0$ , then $f=T(f)$ , and $\mu-\widehat{\mu}$ vanishes almost surely by (12). If $\textup{DSC}^{\textup{R}}(\nu-\widehat{\nu})=0$ , then $\mu-\widehat{\mu}$ vanishes almost surely by (13). Thus, for $\textup{VAR}^{\textup{D}}(f)\textup{DSC}^{\textup{R}}(\nu-\widehat{\nu})=0$ the trio identity holds. If $\textup{VAR}^{\textup{D}}(f)\textup{DSC}^{\textup{R}}(\nu-\widehat{\nu})\neq 0$ , then the trio identity also holds by the definition of the confounding.

Next, we analyze the magnitude of the confounding for $\textup{VAR}^{\textup{D}}(f)\textup{DSC}^{\textup{D}}(\nu-\widehat{\nu})\neq 0$ :

[TABLE]

since $\textup{VAR}^{\textup{D}}(f)\neq 0$ and so $f-T(f)\neq 0$ .

Consider simple Monte Carlo, where the approximation to the integral is an equally weighted average using IID sampling ${\bm{x}}_{1},{\bm{x}}_{2},\ldots\sim\nu$ . Let the sample size be fixed at $n$ . Let ${\mathcal{F}}=L^{2,\nu}({\mathcal{X}})$ , the space of functions that are square integrable with respect to the measure $\nu$ , and let $T(f)$ be the mean of $f$ . Then the variation of $f$ is just its standard deviation, $\textup{std}(f)$ . The randomized discrepancy is $1/\sqrt{n}$ . The randomized confounding is

[TABLE]

Unlike the deterministic setting, there is no simple expression for the randomized discrepancy under general sampling measures and RKHSs. The randomized discrepancy can sometimes be conveniently calculated or bounded for spaces of integrands that are represented by series expansions, and the randomized sampling measures for the bases of these expansions have special properties HeiHicYue02a ; HicWoz00a .

It is instructive to contrast the variation, discrepancy, and confounding in the deterministic and randomized settings. For some integrand, $f$ , and some sampling measure, $\widehat{\nu}$ , satisfying the conditions defining both (DTRIO) and (RTRIO):

•

the variation in both settings is the same,

•

the randomized discrepancy must be no greater than the deterministic discrepancy by definition, and thus

•

the randomized confounding must be no less than the deterministic confounding.

The deterministic confounding is never greater than one in magnitude. By contrast, the randomized confounding may be arbitrarily large. However, Markov’s inequality implies that it may be larger than $1/\sqrt{\alpha}$ with probability no greater than $\alpha$ . The next section illustrates the differences in the deterministic and randomized trio identities.

4 Multivariate Gaussian Probabilities

Consider the $d$ -variate integral corresponding to the probability of a ${\mathcal{N}}({\bm{0}},\mathsf{\Sigma})$ random variable lying inside the box $[{\bm{a}},{\bm{b}}]$ :

[TABLE]

where $\mathsf{\Sigma}=\mathsf{L}\mathsf{L}^{T}$ is the Cholesky decomposition of the covariance matrix, $\mathsf{L}=\bigl{(}l_{jk}\bigr{)}_{j,k=1}^{d}$ , is a lower triangular matrix, and

[TABLE]

Here, $\Phi$ represents the cumulative distribution function for a standard normal random variable. Genz Gen93 developed this clever transformation of variables above. Not only is the dimension decreased by one, but the integrand is typically made less peaky and more favorable to cubature methods.

The left plot of Fig. 3 shows the absolute errors in computing the multivariate Gaussian probability via the Genz transformation for

[TABLE]

by IID sampling, unscrambled Sobol’ sampling, and scrambled Sobol’ sampling Owe95 ; Owe96 ; Owe97 . Multiple random scramblings of a very large scrambled Sobol’ set were used to infer that $\mu\approx 0.6763373243578$ . For the two randomized sampling measures $100$ replications were taken. The marker denotes the median error and the top of the stem extending above the marker denotes the $90\%$ quantile of the error.

Empirically, the error for scrambled Sobol’ sampling appears to be tending towards a convergence rate of ${\mathcal{O}}(n^{-2})$ . This is a puzzle. It is unknown why this should be or whether this effect is only transient. In the discussion below we assume the expected rate of ${\mathcal{O}}(n^{-3/2+\epsilon})$ .

The orders of the discrepancy and confounding in Table 2 explain the rates of decay of the error and the benefits of randomization. Note that in all cases

[TABLE]

We consider equally weighted cubature rules for two kinds of random sampling measures, IID and scrambled Sobol’, and for both the deterministic and randomized settings. Here, ${\mathcal{F}}$ is assumed to be the RKHS used to define the $L^{2}$ -discrepancy.

For IID sampling both the root mean square $L^{2}$ -discrepancy and the randomized discrepancy are ${\mathcal{O}}(n^{-1/2})$ . The confounding for typical IID sampling is ${\mathcal{O}}(1)$ . In the randomized setting one may have an atypically poor instance of data sites that leads to an atypically high confounding of ${\mathcal{O}}(n^{1/2})$ . On the other hand, unscrambled Sobol’ sampling and scrambled Sobol’ sampling are atypically superior instances of data sites under an IID sampling measure that yield atypically small confoundings of ${\mathcal{O}}(n^{-1/2+\epsilon})$ and ${\mathcal{O}}(n^{-1+\epsilon})$ , respectively.

For scrambled Sobol’ sampling, the root mean square $L^{2}$ -discrepancy is now only ${\mathcal{O}}(n^{-1+\epsilon})$ , an improvement over IID sampling. However, the randomized discrepancy is an even smaller ${\mathcal{O}}(n^{-3/2+\epsilon})$ HeiHicYue02a ; Owe97 . In the deterministic setting, unscrambled Sobol’ sampling has a typical ${\mathcal{O}}(1)$ confounding, whereas typical scrambled Sobol’ sampling has an atypically low ${\mathcal{O}}(n^{-1/2})$ confounding. This is because scrambled Sobol’ sampling can take advantage of the additional smoothness of the given integrand, which is not reflected in the definition of ${\mathcal{F}}$ . In the randomized setting, unscrambled Sobol’ sampling has an atypically high ${\mathcal{O}}(n^{1/2})$ confounding. Thus, unscrambled Sobol’ sampling is among the awful minority of sampling measures under scrambled Sobol’ sampling.

Lesson 4

Randomizing the sampling measure may not only eliminate bias, but it may help improve accuracy by avoiding the awful minority of possible sampling measures.

Lesson 5

Although it has traditionally been ignored, the confounding helps explain why the cubature error may decay to zero much faster or more slowly than the discrepancy.

An alternative to the Genz transformation above is an affine transformation to compute the multivariate Gaussian probability:

[TABLE]

where $\circ$ denotes the Hadamard (term-by-term) product. The right plot in Fig. 3 shows that the error using the affine transformation is much worse than that using the Genz transformation even though the two convergence rates are the same. The difference in the magnitudes of the errors is primarily because $\textup{VAR}^{\textup{D}}(f_{\textup{aff}})$ is greater than $\textup{VAR}^{\textup{D}}(f_{\textup{Genz}})$ .

Lesson 6

Well-chosen variable transformations may reduce cubature error by producing an integrand with a smaller variation than otherwise.

5 Option Pricing

The prices of financial derivatives can often be modeled by high dimensional integrals. If the underlying asset is described in terms of a Brownian motion, $B$ , at times $t_{1},\ldots,t_{d}$ , then ${\bm{Z}}=(B(t_{1}),\ldots,B(t_{d}))\sim{\mathcal{N}}({\bm{0}},\mathsf{\Sigma})$ , where $\mathsf{\Sigma}=\bigl{(}\min(t_{j},t_{k})\bigr{)}_{j,k=1}^{d}$ , and the fair price of the option is

[TABLE]

where the function $\textup{payoff}(\cdot)$ describes the discounted payoff of the option,

[TABLE]

In this example, $\mathsf{L}$ may be any square matrix satisfying $\mathsf{\Sigma}=\mathsf{L}\mathsf{L}^{T}$ .

Fig. 4 shows the cubature error using IID sampling, unscrambled Sobol’ sampling, and scrambled Sobol’ sampling for the Asian arithmetic mean call option with the following parameters:

[TABLE]

The convergence rates for IID and unscrambled Sobol’ sampling are the same as in Fig. 3 for the previous example of multivariate probabilities. However, for this example scrambling the Sobol’ set improves the accuracy but not the convergence rate. The convergence rate for scrambled Sobol’ sampling, $\widehat{\nu}$ , is poorer than hoped for because $f$ is not smooth enough for $\textup{VAR}^{\textup{D}}(f)$ to be finite in the case where $\textup{DSC}^{\textup{R}}(\nu-\widehat{\nu})={\mathcal{O}}(n^{-3/2+\epsilon})$ .

Lesson 7

The benefits of sampling measures with asymptotically smaller discrepancies are limited to those integrands with finite variation.

The left plot in Fig. 4 chooses $\mathsf{L}=\mathsf{V}\mathsf{\Lambda}^{1/2}$ , where the columns of $\mathsf{V}$ are the normalized eigenvectors of $\mathsf{\Sigma}$ , and the diagonal elements of the diagonal matrix $\mathsf{\Lambda}$ are the eigenvalues of $\mathsf{\Sigma}$ . This is also called a principal component analysis (PCA) construction. The advantage is that the main part of the Brownian motion affecting the option payoff is concentrated in the smaller dimensions. The right plot of Fig. 4 contrasts the cubature error for two choices of $\mathsf{L}$ : one chosen by the PCA construction and the other coming from the Cholesky decomposition of $\mathsf{\Sigma}$ . This latter choice corresponds to constructing the Brownian motion by time differences. The Cholesky decomposition of $\mathsf{\Sigma}$ gives a poorer rate of convergence, illustrating again Lesson 6. The superiority of the PCA construction was observed in AcwBroGla97 .

6 A Bayesian Trio Identity for Cubature Error

An alternative to the deterministic integrand considered thus far is to assume that the integrand that is a stochastic process. Random input functions have been hypothesized by Diaconis Dia88a , O’Hagan OHa91a , Ritter Rit00a , Rasmussen and Ghahramani RasGha03a , and others. Specifically, suppose that $f\sim{\mathcal{G}}\!\!{\mathcal{P}}(0,s^{2}C_{{\bm{\theta}}})$ , a zero mean Gaussian process. The covariance of this Gaussian process is $s^{2}C_{{\bm{\theta}}}$ , where $s$ is a scale parameter, and $C_{{\bm{\theta}}}:{\mathcal{X}}\times{\mathcal{X}}\to{\mathbb{R}}$ is defined by a shape parameter ${\bm{\theta}}$ . The sample space for this Gaussian process, ${\mathcal{F}}$ , does not enter significantly into the analysis. Define the vector space of measures

[TABLE]

and let $C_{{\bm{\theta}}}$ be such that ${\mathcal{M}}$ contains both $\nu$ and the Dirac measures $\delta_{{\bm{t}}}$ for all ${\bm{t}}\in{\mathcal{X}}$ .

For a Gaussian process, all vectors of linear functionals of $f$ have a multivariate Gaussian distribution. It then follows that for a deterministic sampling measure, $\widehat{\nu}=\sum_{i=1}^{n}w_{i}\delta_{{\bm{x}}_{i}}$ , the cubature error, $\mu-\widehat{\mu}$ , is distributed as ${\mathcal{N}}\bigl{(}0,s^{2}(c_{0}-2{\bm{c}}^{T}{\bm{w}}+{\bm{w}}^{T}\mathsf{C}{\bm{w}})\bigr{)}$ , where

[TABLE]

The dependence of $c_{0}$ , ${\bm{c}}$ , and $\mathsf{C}$ on the shape parameter ${\bm{\theta}}$ is suppressed in the notation for simplicity. We define the Bayesian variation, discrepancy and confounding as

[TABLE]

Theorem 6.1 (Bayesian Trio Error Identity)

Let the integrand be an instance of a zero mean Gaussian process with covariance $s^{2}C_{{\bm{\theta}}}$ and that is drawn from a sample space ${\mathcal{F}}$ . For the variation, discrepancy, and confounding defined in (17), the following error identity holds:

[TABLE]

Moreover, $\textup{CNF}^{\textup{B}}(f,\nu-\widehat{\nu})\sim{\mathcal{N}}(0,1)$ .

Proof

Although $\int_{{\mathcal{X}}}f({\bm{x}})\,\nu(\mathrm{d}{\bm{x}})$ and $f({\bm{t}})=\int_{{\mathcal{X}}}f({\bm{x}})\,\delta_{\bm{t}}(\mathrm{d}{\bm{x}})$ may not exist for all $f\in{\mathcal{F}}$ , these two quantities exist almost surely because ${\mathbb{E}}_{f}[\int_{{\mathcal{X}}}f({\bm{x}})\,\nu(\mathrm{d}{\bm{x}})]^{2}=s^{2}c_{0}$ , and ${\mathbb{E}}_{f}[f({\bm{x}})]^{2}=s^{2}C_{{\bm{\theta}}}({\bm{x}},{\bm{x}})$ are both well-defined and finite. The proof of the Bayesian trio identity follows directly from the definitions above. The distribution of the confounding follows from the distribution of the cubature error.

The choice of cubature weights that minimizes the Bayesian discrepancy in (17a) is ${\bm{w}}=\mathsf{C}^{-1}{\bm{c}}$ , which results in $\textup{DSC}^{\textup{B}}(\nu-\widehat{\nu})=\sqrt{c_{0}-{\bm{c}}^{T}\mathsf{C}^{-1}{\bm{c}}}$ and $\mu-\widehat{\mu}\sim{\mathcal{N}}\bigl{(}0,s^{2}(c_{0}-{\bm{c}}^{T}\mathsf{C}^{-1}{\bm{c}})\bigr{)}$ . However, computing the weights requires ${\mathcal{O}}(n^{3})$ operations unless $\mathsf{C}$ has some special structure. This computational cost is significant and may be a deterrent to the use of optimal weights unless the weights are precomputed. For smoother covariance functions, $C_{\bm{\theta}}$ , there is often a challenge of $\mathsf{C}$ being ill-conditioned.

The conditional distribution of the cubature error, $\mu-\widehat{\mu}$ , given the observed data $\{f({\bm{x}}_{i})=y_{i}\}_{i=1}^{n}$ is ${\mathcal{N}}\bigl{(}{\bm{y}}^{T}(\mathsf{C}^{-1}{\bm{c}}-{\bm{w}}),s^{2}(c_{0}-{\bm{c}}^{T}\mathsf{C}^{-1}{\bm{c}})\bigr{)}$ . To remove the bias one should again choose ${\bm{w}}=\mathsf{C}^{-1}{\bm{c}}$ . This also makes the conditional distribution of the cubature error the same as the unconditional distribution of the cubature error.

Because the cubature error is a normal random variable, we may use function values to perform useful inference, namely,

[TABLE]

However, unlike our use of random sampling measures that are constructed via carefully crafted random number generators, there is no assurance that our integrand is actually drawn from a Gaussian process whose covariance we have assumed.

The covariance function, $s^{2}C_{{\bm{\theta}}}$ , should be estimated, and one way to do so is through maximum likelihood estimation (MLE), using the function values drawn for the purpose of estimating the integral. The log-likelihood function for the data $\{f({\bm{x}}_{i})=y_{i}\}_{i=1}^{n}$ is

[TABLE]

Maximizing with respect to $s^{2}$ , yields the MLE scale parameter:

[TABLE]

Plugging this into the log likelihood leads to the MLE shape parameter:

[TABLE]

which requires numerical optimization to evaluate. Using MLE estimates, the probabilistic error bound in (18) becomes

[TABLE]

Note that the value of ${\bm{\theta}}_{\textup{MLE}}$ and the above Bayesian cubature error bound is unchanged by replacing $C_{\bm{\theta}}$ by a positive multiple of itself.

Let’s revisit the multivariate normal probability example of Sec. 4, and perform Bayesian cubature with a covariance kernel with modest smoothness from the Matérn family:

[TABLE]

Using $100$ randomly scrambled Sobol’ samples, the Bayesian cubature method outlined above was used to compute the multivariate normal probability $\mu$ . We used MLE scale and shape parameters and optimal cubature weights ${\bm{w}}=\mathsf{C}_{\theta_{{\textup{MLE}}}}^{-1}{\bm{c}}_{\theta_{{\textup{MLE}}}}$ . The actual errors are plotted in Fig. 5, which also provides a contrast of the actual error and the probabilistic error bound. This bound was correct about $83\%$ of the time. Based on the smoothness of the integrand and the kernel, one might expect ${\mathcal{O}}(n^{-2})$ convergence of the answer, but this is not clear from the numerical computations.

Lesson 8

Bayesian cubature provides data-based probabilistic error bounds under the assumption that the integrand is a Gaussian process.

Bayesian cubature offers hope with a dose of caution. The theory is solid, but as this example shows, one cannot know if the actual integrand under consideration is a typical instance of the Gaussian process being assumed, even when using MLE to determine the parameters of the distribution. The success rate of the probabilistic error bound for this example is high, but not as high as the theory would suggest. One may ask whether a larger candidate family of Gaussian processes needs to be considered, but then this might increase the time required for estimation of the parameters. This example was carried out to only a rather modest sample size because of the ${\mathcal{O}}(n^{3})$ operations required to compute each $\widehat{\mu}$ . Efforts to reduce this operation count have been made by Anitescu, Chen, and Stein AniCheSte16a , Parker, Reich and Gotwalt ParEtal17a , and others. Probabilistic numerics, http://www.probabilistic-numerics.org, of which Bayesian cubature is an example, holds promise that deserves further exploration.

The formulas for the Bayesian trio identity are analogous to those for the deterministic trio identity for reproducing kernel Hilbert spaces when $T(f)=0$ for all $f\in{\mathcal{F}}$ . Suppose that the reproducing kernel $K_{\bm{\theta}}$ in the deterministic case is numerically equivalent to the covariance function $C_{\bm{\theta}}$ used in Bayesian cubature. The optimal cubature weights in the Bayesian case then mirror those in the deterministic case. Likewise, for these optimal weights $\textup{DSC}^{\textup{D}}(\nu-\widehat{\nu})$ is numerically the same as $\textup{DSC}^{\textup{B}}(\nu-\widehat{\nu})$ .

Lesson 9

The formula for the Bayesian discrepancy mimics that for the deterministic discrepancy with ${\mathcal{F}}$ an RKHS.

7 A Randomized Bayesian Trio Identity for Cubature Error

So far, we have presented three versions of the trio identity: a deterministic version in Theorem 2.1, a randomized version in Theorem 3.1, and a Bayesian version in Theorem 6.1. The fourth and final version is a randomized Bayesian trio identity. The variation remains unchanged from the Bayesian definition in (17a). The randomized Bayesian discrepancy and confounding are defined as follows:

[TABLE]

The proof of the randomized Bayesian trio error identity is similar to the proofs of the other trio identities and is omitted.

Theorem 7.1 (Randomized Bayesian Trio Error Identity)

Let the integrand be an instance of a zero mean Gaussian process with covariance $s^{2}C_{{\bm{\theta}}}$ and that is drawn from a sample space ${\mathcal{F}}$ . Let the sampling measure be drawn randomly from ${\mathcal{M}}_{\textup{S}}$ according to some probability distribution. For the variation defined in (17a), and the discrepancy and confounding defined in (21), the following error identity holds:

[TABLE]

Moreover, $\textup{CNF}^{\textup{R}\textup{B}}(f,\nu-\widehat{\nu})\sim{\mathcal{N}}(0,1)$ .

Lesson 10

The trio identity has four versions, (DTRIO), (RTRIO), (BTRIO), and (RBTRIO), depending on whether the integrand is deterministic or Bayesian and whether the sampling measure is deterministic or random.

8 Dimension Dependence of the Discrepancy, Cubature Error and

Computational Cost

The statements about the rates of decay of discrepancy and cubature error as the sample size increases have so far hidden the dependence on the dimension of the integration domain. Fig. 4 on the left shows a clear error decay rate of ${\mathcal{O}}(n^{-1+\epsilon})$ for low discrepancy sampling for the option pricing problem with dimension $12$ . However, Fig. 2 shows that the discrepancy for these scrambled Sobol’ points does not decay as quickly as ${\mathcal{O}}(n^{-1+\epsilon})$ for moderate $n$ .

There has been a tremendous effort to understand the effect of the dimension of the integration problem on the convergence rate. Sloan and Woźniakowski SloWoz97 pointed out how the sample size required to achieve a desired error tolerance could grow exponentially with dimension. Such problems are called intractable. This led to a search for settings where the sample size required to achieve a desired error tolerance only grows polynomially with dimension (tractable problems) or is independent of the dimension (strongly tractable problems). The three volume masterpiece by Novak and Woźniakowski NovWoz08a ; NovWoz10a ; NovWoz12a and the references cited therein contain necessary and sufficient conditions for tractability. The parallel idea of effective dimension was introduced by Caflisch, Morokoff, and Owen CafMorOwe97 and developed further in LiuOwe04a .

Here we provide a glimpse into those situations where the dimension of the problem does not have an adverse effect on the convergence rate of the cubature error and the discrepancy. Let’s generalize the reproducing kernel used to define the $L^{2}$ -discrepancy in (9), as well as the corresponding variation and the discrepancy for equi-weighted sampling measures by introducing coordinate weights $\gamma_{1},\gamma_{2},\ldots$ :

[TABLE]

For $\gamma_{1}=\cdots=\gamma_{d}=1$ , we recover the situation in Sec. 2, where the decay rate of the discrepancy is dimension dependent for moderate sample sizes. However if $\gamma_{k}^{2}=k^{-3}$ , then the discrepancies for randomly shifted lattice nodesets and scrambled Sobol’ sequences show only a slight dimension dependence, as shown in Fig. 6.

When the weights $\gamma_{k}$ decay with $k$ , the discrepancy depends less on how evenly the data sites appear in projections involving the higher numbered coordinates. On the other hand, the variation in this case gives heavier weight to the $\partial^{{\mathfrak{u}}}f$ with ${\mathfrak{u}}$ containing large $k$ . For the cubature error decay to mirror the decay of the discrepancy shown in Fig. 6, the integrand must depend only slightly on the coordinates with higher indices, so that the variation will be modest.

Lesson 11

The cubature error for high dimensional problems can often be reduced by arranging for the integrand to depend primarily on those coordinates with lower indices.

For some integration problems the dimension is infinite and so our problem (INT) becomes

[TABLE]

where ${\mathcal{X}}^{(d)}={\mathcal{X}}_{1}\times\cdots\times{\mathcal{X}}_{d}$ , $\nu^{(d)}$ is a measure on ${\mathcal{X}}^{(d)}$ with independent marginals $\nu_{k}$ on ${\mathcal{X}}_{k}$ , and $f^{(1)},f^{(2)},\ldots$ are approximations to an infinite-dimensional integrand. The discrepancy and cubature error analysis for $d\to\infty$ is similar to the large $d$ situation, but now the compuational cost of the approximate integrand is a concern CreuEtal08a ; HicMGRitNiu09a ; WanHic00b ; NiuHic09b .

One could approximate $\mu$ by $\widehat{\mu}^{(d)}$ , the approximation to $\mu^{(d)}$ , for some large $d$ . However, the computational cost of evaluating $f^{(d)}({\bm{x}})$ for a single ${\bm{x}}$ typically requires ${\mathcal{O}}(d)$ operations. So this approach would require a high computational cost of ${\mathcal{O}}(nd)$ operations to compute $\widehat{\mu}^{(d)}$ .

The often better alternative is to decompose the $f^{(d)}$ into pieces $f_{\mathfrak{u}}$ , for ${\mathfrak{u}}\subset 1\!\!:\!\!d$ , such that $f^{(d)}=\sum_{{\mathfrak{u}}\subseteq 1:d}f_{{\mathfrak{u}}}$ and the $f_{\mathfrak{u}}$ depend on ${\mathfrak{u}}$ but not on $d$ . Multi-level Monte Carlo approximates ( $\infty$ INT) by

[TABLE]

for some choice of $d_{l}$ with $d_{1}<\cdots<d_{L}$ . This works well when $\textup{VAR}\bigl{(}f^{(d_{l})}-f^{(d_{l-1})}\bigr{)}$ decreases as $l$ increases and when $\mu-\mu^{(d_{L})}$ is small Gil14a ; Gil15a ; Gil08b ; Hei01a ; HicMGRitNiu09a ; NiuHic09b . The computational cost of $\widehat{\mu}\bigl{(}f^{(d_{l})}-f^{(d_{l-1})}\bigr{)}$ is ${\mathcal{O}}(n_{l}d_{l})$ , and as $d_{l}$ increases, $n_{l}$ decreases, thus moderating the cost. There is bias, since $\mu-\mu^{(d_{L})}$ is not approximated at all, but this can be removed by a clever a randomized sampling method RheGly12a .

The Multivariate Decomposition Method approximates ( $\infty$ INT) by

[TABLE]

where the ${\mathfrak{u}}_{l}$ are the important sets of coordinate indices as judged by $\textup{VAR}^{\textup{D}}(f_{\mathfrak{u}})$ to ensure that $\mu-\sum_{{\mathfrak{u}}\notin\{{\mathfrak{u}}_{1},\ldots,{\mathfrak{u}}_{L}\}}\mu(f_{\mathfrak{u}})$ is small Was13b . The computational cost of each $\widehat{\mu}(f_{{\mathfrak{u}}_{l}})$ is ${\mathcal{O}}(n_{l}\lvert{\mathfrak{u}}_{l}\rvert)$ . If the important sets have small cardinality, $\lvert{\mathfrak{u}}_{l}\rvert$ , the computational cost is moderate.

Lesson 12

Infinite dimensional problems may be efficiently solved by multi-level methods or multivariate decomposition methods, which approximate the integral by a sum of finite dimensional integrals.

9 Automatic Stopping Criteria for Cubature

The trio identity decomposes the cubature error into three factors. By improving the sampling scheme, the discrepancy may be made smaller. By re-writing the integral, the variation of the integrand might be made smaller. For certain situations, we may find that the confounding is small. While the trio identity helps us understand what contributes to the cubature error, it does not directly answer the question of how many samples are required to achieve the desired accuracy, i.e., how to ensure that

[TABLE]

for some predetermined $\varepsilon$ .

Bayesian cubature, as described in Sec. 6, provides data-based cubature error bounds. These can be used to determine how large $n$ must be to satisfy (ErrCrit) with high probability.

For IID Monte Carlo the Central Limit Theorem may be used to construct an approximate confidence interval for $\mu$ , however, this approach relies on believing that $n$ is large enough to have i) reached the asymptotic limit, and ii) obtained a reliable upper bound on the standard deviation in terms of a sample standard deviation. There have been recent efforts to develop a more robust approach to fixed width confidence intervals BayEtal14a ; HicEtal14a ; Jia16a . An upper bound on the standard deviation may be computed by assuming an upper bound on the kurtosis or estimating the kurtosis from data. The standard deviation of an integrand can be confidently bounded in terms of the sample standard deviation if it lies in the cone of functions with a known bound on their kurtosis. A bound on the kurtosis also allows one to use a Berry-Esseen inequality, which is a finite sample version of the Central Limit Theorem, to determine a sufficient sample size for computing the integral with the desired accuracy.

For low discrepancy sampling, independent random replications may be used to estimate the error, but this approach lacks a rigorous justification. An alternative proposed by the author and his collaborators is to decompose the integrand into a Fourier series and estimate the decay rate of the Fourier coefficients that contribute to the error HicJim16a ; HicEtal17a ; JimHic16a . This approach may also be used to satisfy relative error criteria or error criteria involving a function of several integrals HicEtal17a . Our automatic stopping criteria have been implemented in the Guaranteed Automatic Integration Library (GAIL) ChoEtal15a .

Lesson 13

Automatic stopping criteria for (quasi-)Monte Carlo simulations have been developed for integrands that lie in a cone of functions that are not too wild.

10 Summary

To conclude, we repeat the lessons highlighted above. The order may be somewhat different.

The trio identity (TRIO) decomposes the cubature error into a product of three factors: the variation of the integrand, the discrepancy of the sampling measure, and the confounding. This identity shows how the integrand and the sampling measure each contribute to the cubature error. The trio identity has four versions, (DTRIO), (RTRIO), (BTRIO), and (RBTRIO), depending on whether the integrand is deterministic or Bayesian and whether the sampling measure is deterministic or random. The deterministic discrepancy when ${\mathcal{F}}$ is an RKHS has a simple, explicit form involving three terms. The formula for the Bayesian discrepancy mimics that for the deterministic discrepancy with ${\mathcal{F}}$ an RKHS. Although it has traditionally been ignored, the confounding helps explain why the cubature error may decay to zero much faster or more slowly than the discrepancy.

How do good sampling measures, $\widehat{\nu}$ , make the error smaller? Quasi-Monte Carlo methods replace IID data sites by low discrepancy data sites, such as Sobol’ sequences and integration lattice nodeset sequences. The resulting sampling measures have discrepancies and cubature errors that decay to zero at a faster rate than in the case of IID sampling. Randomizing the sampling measure may not only eliminate bias, but it may help improve accuracy by avoiding the awful minority of possible sampling measures. The benefits of sampling measures with asymptotically smaller discrepancies are limited to those integrands with finite variation.

*How can the error be decreased by re-casting the problem with a different integrand, $f$ ? * Well-chosen variable transformations may reduce cubature error by producing an integrand with a smaller variation than otherwise. The cubature error for high dimensional problems can often be reduced by arranging for the integrand to depend primarily on those coordinates with lower indices. Infinite dimensional problems may be efficiently solved by multi-level methods or multivariate decomposition methods, which approximate the integral by a sum of finite dimensional integrals.

*How many samples, $n$ , are required to meet a specified error tolerance? * Bayesian cubature provides data-based probabilistic error bounds under the assumption that the integrand is a Gaussian process. Automatic stopping criteria for (quasi-)Monte Carlo simulations have been developed for integrands that lie in a cone of functions that are not too wild.

Acknowledgements.

The author would like to thank the organizers of MCQMC 2016 for an exceptional conference. The author is indebted to his colleagues in the MCQMC community for all that he has learned from them. In particular, the author thanks Xiao-Li Meng for introducing the trio identity and for discussions related to its development. The author also thanks Lluís Antoni Jiménez Rugama for helpful comments in preparing this tutorial. This work is partially supported by the National Science Foundation grant DMS-1522687.

Bibliography49

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1(1) Acworth, P., Broadie, M., Glasserman, P.: A comparison of some Monte Carlo techniques for option pricing. In: H. Niederreiter, P. Hellekalek, G. Larcher, P. Zinterhof (eds.) Monte Carlo and quasi-Monte Carlo methods 1996, Lecture Notes in Statistics , vol. 127, pp. 1–18. Springer-Verlag, New York (1998)
2(2) Anitescu, M., Chen, J., Stein, M.: An inversion-free estimating equation approach for Gaussian process models. J. Comput. Graph. Statist. (2016)
3(3) Aronszajn, N.: Theory of reproducing kernels. Trans. Amer. Math. Soc. 68 , 337–404 (1950)
4(4) Bayer, C., Hoel, H., von Schwerin, E., Tempone, R.: On nonasymptotic optimal stopping criteria in Monte Carlo Simulations. SIAM J. Sci. Comput. 36 , A 869–A 885 (2014)
5(5) Caflisch, R.E., Morokoff, W., Owen, A.: Valuation of mortgage backed securities using Brownian bridges to reduce effective dimension. J. Comput. Finance 1 , 27–46 (1997)
6(6) Choi, S.C.T., Ding, Y., Hickernell, F.J., Jiang, L., Jiménez Rugama, Ll.A., Tong, X., Zhang, Y., Zhou, X.: GAIL: Guaranteed Automatic Integration Library (versions 1.0–2.1). MATLAB software (2013–2015). URL http://gailgithub.github.io/GAIL_Dev/
7(7) Creutzig, J., Dereich, S., Müller-Gronbach, T., Ritter, K.: Infinite-dimensional quadrature and approximation of distributions. Found. Comput. Math. 9 , 391–429 (2009)
8(8) Diaconis, P.: Bayesian numerical analysis. In: S.S. Gupta, J.O. Berger (eds.) Statistical decision theory and related topics IV, Papers from the 4th Purdue Symp., West Lafayette, Indiana 1986, vol. 1, pp. 163–175. Springer-Verlag, New York (1988)