Efficient estimation of divergence-based sensitivity indices with Gaussian process surrogates

A.W. Eggels; D.T. Crommelin

arXiv:1904.03859·math.ST·September 18, 2019

Open Access

TL;DR

This paper introduces a novel approach for estimating divergence-based sensitivity indices using Gaussian process surrogates to improve accuracy and efficiency, especially for complex models with limited evaluations.

Contribution

It proposes a new method combining GP surrogates with KDE and introduces direct sensitivity indices for dependent inputs, enhancing sensitivity analysis accuracy.

Findings

1. GP surrogates improve density estimation accuracy
2. New divergence-based sensitivity indices for dependent inputs
3. Enhanced estimation accuracy with fewer model evaluations

Abstract

We consider the estimation of sensitivity indices based on divergence measures such as Hellinger distance. For sensitivity analysis of complex models, these divergence-based indices can be estimated by Monte-Carlo sampling (MCS) in combination with kernel density estimation (KDE). In a direct approach, the complex model must be evaluated at every input point generated by MCS, resulting in samples in the input-output space that can be used for density estimation. However, if the computational cost of the complex model strongly limits the number of model evaluations, this direct method gives large errors. We propose to use Gaussian process (GP) surrogates to increase the number of samples in the combined input-output space. By enlarging this sample set, the KDE becomes more accurate, leading to improved estimates. To compare the GP surrogates, we use a surrogate constructed by samples…

Figures (24)

Tables (1)

Table 1: Input variables for the Piston function.

Symbol and range	Explanation
$M \in [30, 60]$	piston weight (kg)
$S \in [0.005, 0.020]$	piston surface area (m²)
$V \in [0.002, 0.010]$	initial gas volume (m³)
$k \in [1000, 5000]$	spring coefficient (N/m)
$P \in [90000, 110000]$	atmospheric pressure (N/m²)
$T_{a} \in [290, 296]$	ambient temperature (K)
$T_{0} \in [340, 360]$	filling gas temperature (K)

Equations (80)

S_{X^{k}} = \mathbb{E}\left[ d(Y, Y \mid X^{k}) \right],

d(Y, Y \mid X^{k}) = \left( \mathbb{E}(Y) - \mathbb{E}(Y \mid X^{k}) \right)^{2}.

d_{f}(Y, Y \mid X^{k}) = \int_{\mathbb{R}} f\left( \frac{p_{Y}(y)}{p_{Y \mid X^{k}}(y)} \right) p_{Y \mid X^{k}}(y) \, dy,

S_{X^{k}}^{f} = \iint_{\mathbb{R}^{2}} f\left( \frac{p_{Y}(y) \, p_{X^{k}}(x)}{p_{X^{k},Y}(x,y)} \right) p_{X^{k},Y}(x,y) \, dy \, dx.

f_{X^{k}}(x), \quad f_{Y}(y), \quad f_{X^{k},Y}(x,y)

\overline{H}_{X^{k},f}^{(J)} := \frac{1}{J} \sum_{j=1}^{J} f\left( \frac{f_{X^{k}}(x_{j}) \, f_{Y}(y_{j})}{f_{X^{k},Y}(x_{j},y_{j})} \right).

Y_{L^{+}} = \mu(X_{L^{+}}),

Y_{L^{+}}^{(s)} \sim N\left( \mu(X_{L^{+}}), \Sigma(X_{L^{+}}) \right),

H_{\alpha}(X) = \frac{1}{1-\alpha} \log\left( \int_{\Omega} (p(x))^{\alpha} \, dx \right),

\hat{H}_{\alpha}(X_{N}) = \frac{1}{1-\alpha} \left( \log\left( \frac{L_{\gamma}(X_{N})}{N^{\alpha}} \right) - \log \beta_{L,\gamma} \right) = \frac{1}{1-\alpha} \log\left( \frac{L_{\gamma}(X_{N})}{\beta_{L,\gamma} N^{\alpha}} \right),

L_{\gamma}(X_{N}) = \min_{T(X_{N})} \sum_{e \in T(X_{N})} |e|^{\gamma},

\beta_{L,\gamma} = \lim_{N \to \infty} L_{\gamma}(X_{N}) / N^{\alpha},

D_{\alpha}(f,g) = \frac{1}{\alpha-1} \log\left( \int_{\Omega} \left( \frac{f(x)}{g(x)} \right)^{\alpha} g(x) \, dx \right),

D_{1/2}(p_{XY}, p_{X} p_{Y}) = -2 \log\left( \int_{Y} \int_{X} \sqrt{p_{XY}(x,y) \, p_{X}(x) \, p_{Y}(y)} \, dx \, dy \right).

D_{1/2}(p_{XY}, p_{X} p_{Y}) = -2 \log\left( \int_{Y} \int_{X} \left( h_{xy}(x',y') \right)^{1/2} dx' \, dy' \right) = -H_{1/2}(h),

h(x,y) = \frac{p_{XY}(x,y)}{p_{X}(x) \, p_{Y}(y)}.

S_{X^{k}}^{H} = \iint_{\mathbb{R}^{2}} \left( \sqrt{\frac{p_{X^{k}}(x) \, p_{Y}(y)}{p_{X^{k},Y}(x,y)}} - 1 \right)^{2} p_{X^{k},Y}(x,y) \, dy \, dx,

S_{X^{k}}^{H} = 2 - 2 \iint_{\mathbb{R}^{2}} \sqrt{p_{X^{k}}(x) \, p_{Y}(y) \, p_{X^{k},Y}(x,y)} \, dy \, dx.

D_{1/2}(p_{X^{k}Y}, p_{X^{k}} p_{Y}) = -2 \log(I), \qquad S_{X^{k}}^{H} = 2 - 2I,

I = \iint_{\mathbb{R}^{2}} \sqrt{p_{X^{k}}(x) \, p_{Y}(y) \, p_{X^{k},Y}(x,y)} \, dy \, dx,

S_{X^{k}}^{H} = 2 - 2 \exp\left( \frac{-D_{1/2}(p_{X^{k},Y}, p_{X^{k}} p_{Y})}{2} \right).

\frac{L_{\gamma}(X^{k},Y)}{\beta N},

S_{X^{k}}^{H} = 2 - 2 \frac{L_{\gamma}}{\beta N}.

\begin{bmatrix} X \\ Y \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \right).

G(x,y,z \mid a,b) = (1 + b z^{4}) \sin(x) + a \sin^{2}(y)

Z = N\left( \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.8 & 0.5 \\ 0.8 & 1 & 0.8 \\ 0.5 & 0.8 & 1 \end{bmatrix} \right),

X = -\pi + 2\pi \cdot F(Z),

R^{2} = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\frac{1}{L} \sum_{l} (\hat{Y}_{l} - Y_{l})^{2}}{\frac{1}{L} \sum_{l} (Y_{l} - \bar{Y})^{2}},

Taxonomy

Topics: Probabilistic and Robust Engineering Design · Advanced Multi-Objective Optimization Algorithms · Optimal Experimental Design Methods

Full text

Efficient estimation of divergence-based sensitivity indices with Gaussian process surrogates

A.W. Eggels, D.T. Crommelin

Abstract

We consider the estimation of sensitivity indices based on divergence measures such as Hellinger distance. For sensitivity analysis of complex models, these divergence-based indices can be estimated by Monte-Carlo sampling (MCS) in combination with kernel density estimation (KDE). In a direct approach, the complex model must be evaluated at every input point generated by MCS, resulting in samples in the input-output space that can be used for density estimation. However, if the computational cost of the complex model strongly limits the number of model evaluations, this direct method gives large errors. We propose to use Gaussian process (GP) surrogates to increase the number of samples in the combined input-output space. By enlarging this sample set, the KDE becomes more accurate, leading to improved estimates. To compare the GP surrogates, we use a surrogate constructed by samples obtained with stochastic collocation, combined with Lagrange interpolation. Furthermore, we propose a new estimation method for these sensitivity indices based on minimum spanning trees. Finally, we also propose a new type of sensitivity indices based on divergence measures, namely direct sensitivity indices. These are useful when the input data is dependent.

1 Introduction

Sensitivity analysis is an essential part of uncertainty quantification and a very active research field [1, 2, 3]. Several types of sensitivity indices have been formulated, such as variance-based (including Sobol’s indices [4]), density-based [5], derivative-based [6] or divergence-based. Broadly speaking, divergence-based sensitivity indices quantify the difference between the joint probability distribution (or density) of model input and output on the one hand, and the product of their marginal distributions on the other hand. A variety of divergence-based indices can be brought into a common framework built on the notion of $f$-divergence [7], as was shown by Da Veiga [8]. The $f$-divergence is a generalization of several well-known divergences such as the Kullback-Leibler divergence [9] and the Hellinger distance [10].

In most cases, these sensitivity indices cannot be computed analytically because the distribution of the model output given the input is not known exactly. As an alternative, one can resort to Monte Carlo (MC) sampling combined with kernel density estimation: the input distribution is sampled using MC, the model is evaluated on all sampled input points, and from resulting input-output points the joint and marginal probability densities of input and output are estimated. However, when the number of available output points is low, for example because of high computational cost of the model, the estimated densities will generally be inaccurate, resulting in large errors in the estimated sensitivity indices.

In this study we first propose to increase the number of output samples by using a Gaussian process (GP) surrogate. The GP is constructed on the input-output points that are obtained with the (expensive) model. The main idea is that the additional output samples improve the kernel density estimates even though they introduce a bias due to the difference between the true model and its GP approximation. Our approach is based on both the development of divergence-based indices and the use of Gaussian processes in sensitivity analysis. Therefore, we briefly summarize some of the advancements in these areas. Auder & Iooss [11] presented two sensitivity analysis methods based on Shannon and Kullback-Leibler entropy, respectively, building on work in [12] and [13]. Da Veiga [8] introduced sensitivity indices based on the $f$-divergence. More recently, KDE has also appeared in estimators of mutual information measures [14], where $f$-divergences are computed between the joint distribution of two random variables and the product of their marginal distributions. In [15], $f$-divergence measures are computed with a $k$-nearest neighbor graph.

The use of GPs is discussed in Marrel et al. [16], together with the analytical expressions for Sobol indices that arise from them. To compute the indices, two approaches are considered: one in which the predictor of the GP is used and one in which the full GP is used. The latter approach is found to be superior in convergence and robustness. Furthermore, the modeling error of the GP is integrated through confidence intervals; it is reported that the bias due to the use of the GP is negligible [16]. In a related study, Svenson et al. [17] estimate Sobol indices with GPs, using specific compactly supported kernel functions. Furthermore, combining GPs with derivative-based indices has been investigated by [6] and [18]. In [19], predictions from a GP are used to rank the input variables based on their predictive relevance. Two methods for this are presented in [19], one based on Kullback-Leibler divergence and one based on the variance of the posterior mean.

Despite the developments sketched above, approaches that combine GP surrogate modeling and divergence-based sensitivity analysis have not been explored much yet, although [20] already applied this approach. The methodology proposed in this paper combines these two elements.

We note that for the approach proposed here it is not needed to assume that the inputs are mutually independent, nor does dependency of inputs make it more complicated. We present test cases with independent inputs as well as cases with dependent inputs. For the former, we compare with results obtained with stochastic collocation [21, 22]. In this method, an appropriate set of points, called collocation points, is obtained. These are usually chosen as the zeros of the orthogonal polynomials with respect to the marginal input probability distributions. Then, Lagrange interpolation is used to approximate the output function. For dependent inputs, this method might not be ideal, as [23] already showed.

Second, we propose a new estimation method for the divergence-based sensitivity indices as introduced before. Because the KDE method depends on the choice of both kernel and kernel bandwidth, we propose to use an estimator without parameters which is numerically fast as well. This estimator is based on the approximation of one of the integrals appearing in the sensitivity index by computing a minimum spanning tree [24].

As a third contribution, we propose a new set of sensitivity indices to complement the ones introduced before. This new set computes the direct sensitivity indices, which measure the sensitivity of the output with respect to one input variable only. This is beneficial for cases when the input variables are dependent, because these indices remove indirect effects caused by dependent input variables. To illustrate this, consider an example where $X=(X^{1},X^{2})$ follows a bivariate normal distribution with means [math], variances $1$ and covariance $\rho>0$ , while $u(x_{1},x_{2})=x_{1}$ . Then the direct effect of $X^{2}$ on the output is zero, while the original sensitivity index would be positive due to the dependence between $X^{1}$ and $X^{2}$ .

Section 2 describes the sensitivity indices central to this paper, their estimation method and the complications therein. It also contains our proposed method to enlarge the set of input and output data and the new estimation method. Section 3 applies these estimators to several test cases. Section 5 concludes.

2 Divergence-based sensitivity indices and their estimation

We start by introducing the sensitivity indices derived from the $f$ -divergences in Section 2.1. Section 2.2 discusses the complications in estimating them. Gaussian processes and the two estimators are given in Section 2.3.

2.1 Sensitivity indices from the $f$ -divergence

We consider the situation where a model takes a vector of inputs $(X^{1},...,X^{d})$ and returns a (scalar) output $Y$ . The input vector $X$ is random, and as a result the output $Y$ is a random variable as well. Da Veiga [8] proposed to perform global sensitivity analysis with dependence measures, especially $f$ -divergences (see also [25]). In this way, the impact of the $k$ th input variable $X^{k}$ on the output $Y$ is given by

$$S_{X^{k}}=\mathbb{E}\left[d(Y,Y\mid X^{k})\right],\tag{1}$$

where $d(\cdot,\cdot)$ denotes a dissimilarity measure. The unnormalized first-order Sobol indices can also be written in this framework, namely with

$$d(Y,Y\mid X^{k})=\left(\mathbb{E}(Y)-\mathbb{E}(Y\mid X^{k})\right)^{2}.$$

We will use the Csiszár $f$ -divergence [7], which is given by

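$$d_{f}(Y,Y\mid X^{k})=\int_{\mathbb{R}}f\left(\frac{p_{Y}(y)}{p_{Y\mid X^{k}}(y)}\right)p_{Y\mid X^{k}}(y)\,dy,\tag{2}$$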

with $f(\cdot)$ a convex function with $f(1)=0$ , and $p_{\cdot}(\cdot)$ denotes a probability distribution function. Some well-known choices for $f$ are $f(t)=-\log(t)$ (Kullback-Leibler divergence) and $f(t)=(\sqrt{t}-1)^{2}$ (Hellinger distance). Combining (1) and (2) with basic probability theory gives us

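$$S_{X^{k}}^{f}=\iint_{\mathbb{R}^{2}}f\left(\frac{p_{Y}(y)\,p_{X^{k}}(x)}{p_{X^{k},Y}(x,y)}\right)p_{X^{k},Y}(x,y)\,dy\,dx.\tag{3}$$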

These sensitivity indices are equal to zero for $X^{k}$ and $Y$ independent and positive otherwise. Furthermore, they are invariant with respect to smooth and uniquely invertible transformation of $X^{k}$ and $Y$ [26], in contrast to Sobol indices which are only invariant with respect to linear transformations. Moreover, it is easy to generalize (3) to multidimensional $X^{k,l}$ .

2.2 Difficulties for estimation

The main problem for computing $S^{f}_{X^{k}}$ is that the probability densities in (3) are not known. In order to estimate $S^{f}_{X^{k}}$ it is necessary to estimate $p_{Y}(\cdot)$ and $p_{X^{k},Y}(\cdot,\cdot)$ , and, depending on the type of input, $p_{X^{k}}(\cdot)$ as well. In [8] it is indicated that if samples $(X_{L},Y_{L})$ are available, only the ratio $r(x,y)=\frac{p_{Y}(y)p_{X^{k}}(x)}{p_{X^{k},Y}(x,y)}$ needs to be estimated.

The estimates of the densities can be obtained with kernel density estimation (as is also done in [8, 25]). To do so, one chooses a suitable kernel and a suitable value for the kernel bandwidth $h$. When the density of the input $X$ is known, this information can be used to determine $h$; otherwise, guidelines are available [27].

Clearly, the estimate of the density $p_{Y}$ will not be perfect, leading to an error in the estimation of $S^{f}_{X^{k}}$ . This is strongly related to the number of samples $(X_{L},Y_{L})$ available for density estimation. If high computational cost of the model limits this number, the estimation of $S^{f}_{X^{k}}$ can be improved by using a surrogate of the model to generate more samples. One possible way to do so is to use stochastic collocation (SC) [21, 22]. Herein, one chooses the samples $X_{L}$ as the collocation points, which are obtained as the zeros of the orthogonal polynomials with respect to the marginal input distributions. Then, at these collocation points, the corresponding output samples are obtained. Finally, an emulator is constructed by Lagrange interpolation on these samples.

As an alternative, we propose to use Gaussian processes [28] as a surrogate model to obtain the larger sample $(X_{+},Y_{+})=(X_{L}\cup X_{L^{+}},Y_{L}\cup Y_{L^{+}})$, in which $Y_{L^{+}}$ indicates the surrogate model output for the extra input samples $X_{L^{+}}$. For each data point in $X_{L^{+}}$, this $Y_{L^{+}}$ is a normal distribution in itself, and for each point in $X_{L}$ it is a degenerate normal distribution (i.e., it has zero variance). An additional advantage may be the availability of confidence intervals for $S^{f}_{X^{k}}$ at almost no extra computational cost. Unfortunately, these confidence intervals do not include the bias from approximating the output by a Gaussian process.

2.3 Estimation using Gaussian processes

We assume the input samples $X_{L}:=\{\mathbf{x}_{l}\}_{l=1}^{L}$ are already available, otherwise one can use Monte Carlo sampling (or Latin hypercube sampling in the case of independent uniform data) to select samples from the data $X$ . Although it may be tempting to use other sample selection methods, it is not guaranteed that they represent the distribution just as naive sampling would. Then, the corresponding output $Y_{L}:=\{y_{l}\}_{l=1}^{L}$ can be obtained as $Y_{L}=G(X_{L})$ with $G$ the process to generate output, which is either a function or a computational model. Then, one needs to fit a Gaussian process $\widetilde{G}_{\{X_{L},Y_{L}\}}(\mathbf{x})=N(\mu(\mathbf{x}),\Sigma(\mathbf{x}))$ to $(X_{L},Y_{L})$ , thereby choosing an appropriate kernel. This Gaussian process is now used to obtain output $Y_{L^{+}}=\widetilde{G}_{\{X_{L},Y_{L}\}}(X_{L^{+}})$ for other input samples $X_{L^{+}}$ . This leads to the augmented dataset $X_{+}=X_{L}\cup X_{L^{+}}$ of size $N=L+L_{+}$ with (partial) surrogate output $Y_{+}=Y_{L}\cup Y_{L^{+}}$ . Note that $Y_{L^{+}}$ does not consist of single values, but rather of multivariate normal distributions.
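The augmentation step can be sketched in a few lines. The following is a minimal illustration and not the authors' implementation: it assumes a one-dimensional input, a zero prior mean, a Gaussian (RBF) kernel with a fixed, hand-picked length scale, and it keeps only the prediction mean $\mu(\mathbf{x})$. The names `fit_gp_mean` and `augment` are hypothetical.

```python
import math

def rbf(a, b, length_scale):
    """Gaussian (RBF) kernel with unit process variance."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def cho_solve(L, b):
    """Solve (L L^T) x = b by forward and backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def fit_gp_mean(x_train, y_train, length_scale=1.0, jitter=1e-10):
    """Return the GP prediction mean mu(x) fitted to (x_train, y_train)."""
    n = len(x_train)
    # Kernel matrix with a small diagonal jitter for numerical stability.
    K = [[rbf(x_train[i], x_train[j], length_scale) + (jitter if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = cho_solve(cholesky(K), list(y_train))

    def mu(x):
        return sum(a * rbf(x, xi, length_scale) for a, xi in zip(alpha, x_train))

    return mu

def augment(x_L, y_L, x_extra, length_scale=1.0):
    """Build (X_+, Y_+): model outputs on x_L, surrogate means on x_extra."""
    mu = fit_gp_mean(x_L, y_L, length_scale)
    return list(x_L) + list(x_extra), list(y_L) + [mu(x) for x in x_extra]
```

In practice the length scale would be estimated from the data, and a multi-dimensional kernel would replace `rbf`; the sketch only shows how the expensive outputs and the surrogate outputs end up in one dataset.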

2.3.1 Kernel density estimation

We now explain how to compute the KDE on $(X_{+},Y_{+})$ and how it is used to approximate (3). Because $S_{X^{k}}^{f}$ is computed per input variable $X^{k}$ , it is here enough to consider one-dimensional kernel densities.

For each input variable $X^{k}$ and output variable $Y$ , the estimators for the kernel density are given by [25]:

[TABLE]

with $(x_{j},y_{j})$ the $j$ th sample of the input data $(X^{k},Y)$ and $J$ the size of the data. Note the input data $X=(X^{1},\ldots,X^{d})$ has to represent the distribution of $X$ . An extension to a higher-dimensional $X^{k}$ is easy to obtain. For our purpose, we either have $J=L$ and $(X,Y)=(X_{L},Y_{L})$ , or we have $J=N$ and $(X,Y)=(X_{+},Y_{+})$ . We choose the Gaussian kernel and $h_{X^{k}}=h_{Y}=h$ according to Scott’s rule [29, p. 152], which is optimized with respect to the normal distribution. Then, the estimator for $S^{f}_{X^{k}}$ as given by [25] is obtained:

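$$\overline{H}_{X^{k},f}^{(J)}:=\frac{1}{J}\sum_{j=1}^{J}f\left(\frac{f_{X^{k}}(x_{j})\,f_{Y}(y_{j})}{f_{X^{k},Y}(x_{j},y_{j})}\right).\tag{4}$$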

We note that this choice of $h$ may not be optimal. We also tried adapting the bandwidth $h$ to the ranges of $X$ and $Y$, but this gave worse results than a single bandwidth. Also, kernel density estimation may not be the best choice when the domain of a variable $X^{k}$ or $Y$ is bounded and this variable has nonzero density at the boundaries.
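As an illustration of estimator (4) for the Hellinger choice $f(t)=(\sqrt{t}-1)^{2}$, the sketch below evaluates Gaussian kernel density estimates at the sample points and averages $f$ of the density ratio. It is a plain-Python sketch under stated assumptions, not the authors' code: the function names are hypothetical, and the single bandwidth is taken as the average of the two sample standard deviations times $J^{-1/6}$, which is one possible reading of Scott's rule for two-dimensional data.

```python
import math
import random

def gauss(u):
    """Standard normal density, used as the kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def std(values):
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def scott_bandwidth(xs, ys):
    """Single bandwidth h for 2-D data (assumed form of Scott's rule)."""
    J = len(xs)
    return 0.5 * (std(xs) + std(ys)) * J ** (-1.0 / 6.0)

def hellinger_index_kde(xs, ys, h=None):
    """Estimator (4) with f(t) = (sqrt(t) - 1)^2 and Gaussian kernels."""
    J = len(xs)
    if h is None:
        h = scott_bandwidth(xs, ys)
    total = 0.0
    for xj, yj in zip(xs, ys):
        kx = [gauss((xj - xi) / h) for xi in xs]
        ky = [gauss((yj - yi) / h) for yi in ys]
        f_x = sum(kx) / (J * h)                                   # marginal density of X^k
        f_y = sum(ky) / (J * h)                                   # marginal density of Y
        f_xy = sum(a * b for a, b in zip(kx, ky)) / (J * h * h)   # joint density
        t = f_x * f_y / f_xy
        total += (math.sqrt(t) - 1.0) ** 2
    return total / J
```

The double loop costs $O(J^{2})$ kernel evaluations, which is one reason the KDE-based estimate becomes expensive for large augmented datasets.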

So far, we have ignored the fact that $Y_{L^{+}}$ is a multivariate normal random variable instead of a single value when $J=N$. There are therefore two options to obtain values for $Y_{L^{+}}$. The first option is to use the prediction mean $\mu(x)$ and get the resulting output samples

$$Y_{L^{+}}=\mu(X_{L^{+}}),\tag{5}$$

to be used in (4). The other is to sample from this normal distribution $n_{s}$ times. In that case, one gets the $n_{s}$ output sets

$$Y_{L^{+}}^{(s)}\sim N\left(\mu(X_{L^{+}}),\Sigma(X_{L^{+}})\right),$$

in which $\sim$ denotes “sampled from the distribution”, and thereby $n_{s}$ estimates of $\overline{H}_{X^{k},f}^{(N)}$ . Note that this also implies the kernel density estimates have to be computed $n_{s}$ times. Because the computation of the kernel density estimate is expensive, we choose not to include this option. We will indicate the estimator $\overline{H}_{X^{k},f}^{(J)}$ by $\widehat{S^{f}_{X^{k}}}$ in the results, where the value of $J$ is clear from the context.

2.3.2 Minimum spanning trees

Before we can explain how to use the minimum spanning trees, we first need to introduce the concept of Rényi entropy. This is a generalization of the continuous Shannon entropy (see e.g. [30]) and is given by

$$H_{\alpha}(X)=\frac{1}{1-\alpha}\log\left(\int_{\Omega}(p(x))^{\alpha}\,dx\right),$$

for $\alpha\in(0,\infty)$ . In the limit of $\alpha\rightarrow 1$ , the Rényi entropy converges to the continuous Shannon entropy. Hero & Michel [31, 24, 32] proposed a direct way to estimate the Rényi entropy for $\alpha\in(0,1)$ given a dataset $X_{N}$ consisting of $N$ samples of the probability distribution $X$ of dimension $d$ . Their estimator is

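$$\hat{H}_{\alpha}(X_{N})=\frac{1}{1-\alpha}\left(\log\left(\frac{L_{\gamma}(X_{N})}{N^{\alpha}}\right)-\log\beta_{L,\gamma}\right)=\frac{1}{1-\alpha}\log\left(\frac{L_{\gamma}(X_{N})}{\beta_{L,\gamma}N^{\alpha}}\right),\tag{7}$$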

in which $\gamma$ can be derived from the relation $\alpha=(d-\gamma)/d$ . The functional $L_{\gamma}(X_{N})$ is given by

$$L_{\gamma}(X_{N})=\min_{T(X_{N})}\sum_{e\in T(X_{N})}|e|^{\gamma},\tag{8}$$

in which $T(X_{N})$ denotes the set of spanning trees on $X_{N}$ and $e$ denotes an edge therein. The parameter $\gamma$ can be computed from the desired value of $\alpha$ and will be within the interval $(0,d)$ . The constant $\beta_{L,\gamma}$ is defined by

$$\beta_{L,\gamma}=\lim_{N\to\infty}L_{\gamma}(X_{N})/N^{\alpha},\tag{9}$$

for $X_{N}$ a sample of size $N$ of the uniform distribution in $d$ dimensions. In practice, however, we do not take the limit: we estimate the constant at finite $N$ by computing it for several repetitions of the sample $X_{N}$.

The estimator (7) is asymptotically unbiased and strongly consistent for $\alpha\in(0,1)$ [33]. We focus on the case $\alpha=1/2$ and $d=2$ wherein $|e|$ denotes the Euclidean distance, hence $\gamma=1$ . To see why we choose $\alpha=1/2$ , we give the following derivation. First, we need to introduce the Rényi divergence by

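$$D_{\alpha}(f,g)=\frac{1}{\alpha-1}\log\left(\int_{\Omega}\left(\frac{f(x)}{g(x)}\right)^{\alpha}g(x)\,dx\right),$$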

for the probability distribution functions $f(\cdot)$ and $g(\cdot)$ . We choose $f(\cdot)=p_{XY}(x,y)$ and $g(\cdot)=p_{X}(x)p_{Y}(y)$ , where $p_{XY}$ is the joint probability distribution function of $X$ and $Y$ and $p_{X}(x)$ and $p_{Y}(y)$ denote the marginal probability distribution functions. Then, their Rényi divergence is given by

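$$D_{1/2}(p_{XY},p_{X}p_{Y})=-2\log\left(\int_{Y}\int_{X}\sqrt{p_{XY}(x,y)\,p_{X}(x)\,p_{Y}(y)}\,dx\,dy\right).\tag{10}$$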

We also have

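$$D_{1/2}(p_{XY},p_{X}p_{Y})=-2\log\left(\int_{Y}\int_{X}\left(h_{xy}(x',y')\right)^{1/2}dx'\,dy'\right)=-H_{1/2}(h),\tag{11}$$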

with $h(\cdot)$ the (well-defined) probability distribution function given by

$$h(x,y)=\frac{p_{XY}(x,y)}{p_{X}(x)\,p_{Y}(y)}.$$

We also see that $S^{H}_{X^{k}}$ , with $H$ denoting the sensitivity index derived from the Hellinger distance, is given through (3) by

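$$S_{X^{k}}^{H}=\iint_{\mathbb{R}^{2}}\left(\sqrt{\frac{p_{X^{k}}(x)\,p_{Y}(y)}{p_{X^{k},Y}(x,y)}}-1\right)^{2}p_{X^{k},Y}(x,y)\,dy\,dx,$$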

which can be simplified to

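$$S_{X^{k}}^{H}=2-2\iint_{\mathbb{R}^{2}}\sqrt{p_{X^{k}}(x)\,p_{Y}(y)\,p_{X^{k},Y}(x,y)}\,dy\,dx.\tag{12}$$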

We now see the agreement between (10) and (12). In case the domain of $X^{k}$ and $Y$ is extended to $\mathbb{R}$ by zero density outside of the domain, it is possible to write

$$D_{1/2}(p_{X^{k}Y},p_{X^{k}}p_{Y})=-2\log(I),\qquad S_{X^{k}}^{H}=2-2I,$$

with

$$I=\iint_{\mathbb{R}^{2}}\sqrt{p_{X^{k}}(x)\,p_{Y}(y)\,p_{X^{k},Y}(x,y)}\,dy\,dx,$$

hence

$$S_{X^{k}}^{H}=2-2\exp\left(\frac{-D_{1/2}(p_{X^{k},Y},p_{X^{k}}p_{Y})}{2}\right).$$

We can compute $S^{H}_{X^{k}}$ via $D_{1/2}(p_{X^{k},Y},p_{X^{k}}p_{Y})=-H_{1/2}(h)$ (Equation 11). Therefore, we need to estimate $L_{\gamma}(X^{k},Y)$ (8) and $\beta$ (9). Because $I$ can be estimated as

$$\frac{L_{\gamma}(X^{k},Y)}{\beta N},$$

we estimate the sensitivity indices by

$$S_{X^{k}}^{H}=2-2\frac{L_{\gamma}}{\beta N}.\tag{13}$$
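The MST-based estimate can be sketched as follows. This is an illustration under explicit assumptions rather than the authors' implementation: the data are first rank-transformed to the unit square (my reading of the change of variables behind (11)), the MST is built with a simple $O(N^{2})$ Prim algorithm, and the normalization is estimated as the average MST length of several uniform samples of the same size $N$, which stands in for $\beta_{L,\gamma}N^{\alpha}$ in (7) with $\alpha=1/2$ and $\gamma=1$. All function names are hypothetical.

```python
import math
import random

def mst_length(points, gamma=1.0):
    """Sum of |e|^gamma over the Euclidean minimum spanning tree (Prim)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # shortest known edge from the tree to each point
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u] ** gamma
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total

def to_ranks(values):
    """Map a sample to (0, 1) through its ranks (empirical CDF transform)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = (r + 0.5) / len(values)
    return ranks

def hellinger_index_mst(xs, ys, n_ref=5, seed=0):
    """Estimate S^H = 2 - 2 I with I taken as a ratio of MST lengths."""
    n = len(xs)
    data = list(zip(to_ranks(xs), to_ranks(ys)))
    rng = random.Random(seed)
    ref = 0.0
    for _ in range(n_ref):
        # Reference MST length for uniform data of the same size.
        uniform = [(rng.random(), rng.random()) for _ in range(n)]
        ref += mst_length(uniform)
    ref /= n_ref
    return 2.0 - 2.0 * mst_length(data) / ref
```

Because the reference length is itself random, the estimate for independent data fluctuates around zero and can be slightly negative.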

3 Results

We test the estimators in several ways. The first test case is with regard to random input/output data and is described in Section 3.1. In this case, the estimates should be near zero. The second test case, in Section 3.2, is based on comparing analytic to numerical values of the sensitivity indices. In the third test case, the Ishigami function is used and tests are performed for both independent and dependent input data, of which the results can be found in Section 3.3. The last test case is higher-dimensional and considers the Piston function (Section 3.4). In these tests, we only use the Hellinger distance. All experiments have been performed $n_{r}=10$ times. The error bars in the upcoming figures indicate the minimum and maximum value found. The results are summarized in Section 3.5.

3.1 Random data

First, we check the behavior for random output, in which case the sensitivity indices should be zero. Both the input and output data are one-dimensional, uniformly distributed on $[0,1]$ and have size $N=10^{3}$ , while $L$ is varied from $L=10$ to $L=200$ based on [34]. The results are in Figure 1.

On the right, we show the sensitivity index as computed on the complete, i.e., $L=N$ , data by KDE and MST (blue circle and red pentagon). Herein, no Gaussian process is used. As expected, their mean is around zero. The spread for the KDE method is smaller than for the MST method. The estimates for $\widehat{S^{H}_{X^{k}}}$ based on $L$ samples (blue circles) are also around zero, although their spread is larger than for $L=N$ . Note that due to the numerical implementation, the sensitivity indices can become negative.

We continue with the estimates based on Gaussian processes. Here, the situation is a little different because the Gaussian process fits a function through the data while there is no functional relation between input and output. Hence, the sensitivity indices will most likely not be equal to zero. When fitting the Gaussian process, two cases appear, which have the same effect: the length scale and the process variance are either both small or both large. As a result, the predictions of the Gaussian process will be inaccurate. This can be seen in the figure for the KDE results (green diamonds) by their mean being away from zero and the large spread of their estimates (the outlier has a value of approximately $0.2$). For the MST-based estimator the effect is stronger: high values of $\widehat{S^{H}_{X^{k}}}$ are measured because the predicted output values are the values of the prediction mean function, which is a continuous function, so these predictions are located on a curve. Therefore, the values of $\widehat{S^{H}_{X^{k}}}$ for the MST-based estimator are too large to be visible in this plot for the chosen values of $L$, except for $L=10^{3}$, where no emulated output is used. The reference result, for which we computed the MST-based sensitivity index on the full data without emulator, is reasonable (red pentagon).

Similar results appear for estimates based on emulation by stochastic collocation (SC), where we used collocation samples based on the uniform distribution. A function is fit through the data while no functional relation between input and output exists. Therefore, high values of $\widehat{S^{H}_{X^{k}}}$ are measured, which are not shown in the plot.

We summarize these results as follows: when an emulator (either Gaussian process or SC) is used to augment the data for sensitivity analysis, positive values of $\widehat{S^{H}_{X^{k}}}$ are found because the emulator is designed to fit a functional relation between input and output. The “sample” method does not suffer from this problem. However, this is a very specific test case in which sample-based estimators are preferred over ones which use an augmented dataset.

3.2 Analytic test case

We consider a small test case in which we can compute the sensitivity index analytically. The idea behind this is to compare the KDE and the MST method in case no emulator is used. We have

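$$\begin{bmatrix}X\\Y\end{bmatrix}\sim N\left(\begin{bmatrix}0\\0\end{bmatrix},\begin{bmatrix}1&\rho\\\rho&1\end{bmatrix}\right).$$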

We took $N=10^{4}$ and repeated the experiment $n_{r}=10$ times. The results are in Figure 2.
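For a zero-mean bivariate normal with unit variances and correlation $\rho$, the Hellinger-based index has a closed form. The short derivation below is added for illustration and is presumably the kind of analytic reference meant here; it relies only on the standard expression for the Bhattacharyya coefficient between two Gaussians with equal means, applied to the joint density (covariance $\Sigma$) and the product of the marginals (covariance $\Sigma_{0}=\mathrm{diag}(1,1)$).

```latex
\begin{align*}
I &= \iint_{\mathbb{R}^{2}} \sqrt{p_{X,Y}(x,y)\,p_{X}(x)\,p_{Y}(y)}\,dy\,dx
   = \frac{\det(\Sigma)^{1/4}\,\det(\Sigma_{0})^{1/4}}
          {\det\!\left(\tfrac{1}{2}(\Sigma+\Sigma_{0})\right)^{1/2}}
   = \frac{(1-\rho^{2})^{1/4}}{(1-\rho^{2}/4)^{1/2}}, \\
S^{H}_{X} &= 2 - 2I
   = 2 - 2\,\frac{(1-\rho^{2})^{1/4}}{(1-\rho^{2}/4)^{1/2}} .
\end{align*}
```

As a check, $\rho=0$ gives $I=1$ and $S^{H}_{X}=0$, while $\rho\to 1$ gives $I\to 0$ and $S^{H}_{X}\to 2$; for $\rho=0.5$ the index is approximately $0.078$.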

Except for $\rho=0.98$ , the MST method outperforms the KDE method. Furthermore, the MST method is i) not dependent on parameter choices such as kernel and kernel bandwidth and ii) faster to compute. One also needs to take into account that the rule of thumb to choose the kernel bandwidth we used here is based on the assumption that the data comes from a normal distribution and, therefore, this kernel bandwidth is optimal in this test case. When the underlying distribution is not normal, this heuristic may not be optimal.

3.3 Ishigami function

We now continue to a non-trivial synthetic test case, of which the test function is from Ishigami & Homma [35]. This output function is defined by

$$G(x,y,z\mid a,b)=(1+bz^{4})\sin(x)+a\sin^{2}(y)$$

on the domain $[-\pi,\pi]^{3}$ (dimension $d=3$ ). We will use the well-known choice $a=7$ , $b=0.1$ in accordance with [36].

Two types of input data are constructed for this test case. One is uniformly distributed and consists of $N=10^{3}$ samples on the domain of the output function. The other is the empirical copula of a multivariate normal distribution on the same domain, which is given by

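$$Z=N\left(\begin{bmatrix}0\\0\\0\end{bmatrix},\begin{bmatrix}1&0.8&0.5\\0.8&1&0.8\\0.5&0.8&1\end{bmatrix}\right),$$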

such that

$$X=-\pi+2\pi\cdot F(Z),$$

with $F$ the cumulative distribution function of the marginal distributions (which is distributed as $N(0,1)$ ).
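The construction of the dependent inputs and the evaluation of the test function can be sketched as below. This is a hedged illustration, not the authors' code: it draws correlated standard normals through a Cholesky factor of the correlation matrix with entries $0.8$ and $0.5$, applies the standard normal CDF, and rescales to $[-\pi,\pi]$. The names `ishigami` and `sample_dependent_inputs` are hypothetical.

```python
import math
import random
from statistics import NormalDist

# Correlation matrix of the underlying multivariate normal Z.
CORR = [[1.0, 0.8, 0.5],
        [0.8, 1.0, 0.8],
        [0.5, 0.8, 1.0]]

def ishigami(x, y, z, a=7.0, b=0.1):
    """Ishigami test function G(x, y, z | a, b)."""
    return (1.0 + b * z ** 4) * math.sin(x) + a * math.sin(y) ** 2

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def sample_dependent_inputs(n, seed=0):
    """Draw n points X = -pi + 2 pi F(Z) with Z ~ N(0, CORR)."""
    rng = random.Random(seed)
    L = cholesky(CORR)
    cdf = NormalDist().cdf  # standard normal CDF F
    samples = []
    for _ in range(n):
        g = [rng.gauss(0.0, 1.0) for _ in range(3)]
        z = [sum(L[i][k] * g[k] for k in range(i + 1)) for i in range(3)]
        samples.append(tuple(-math.pi + 2.0 * math.pi * cdf(zi) for zi in z))
    return samples
```

Each marginal of the resulting sample is uniform on $[-\pi,\pi]$, while the dependence between the three variables is inherited from $Z$.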

For reference, we compute both the KDE-based and MST-based estimate on a larger dataset (with $N=10^{5}$ ) for comparison. Scott’s rule [27] is used for the kernel bandwidth.

In the numerical experiments, we first compute, depending on the dataset, a Latin hypercube sample (LHS) or Monte Carlo sample (MCS) of size $L\in\{30,50,100,200\}$ and combine it with KDE. For this data, we computed (4). Then, we fit a Gaussian process with Gaussian kernel to these samples, where the length scales have been estimated by maximum likelihood estimation. Now, we can proceed with KDE on $(X_{+},Y_{+})$, in which we include the choice $Y_{L^{+}}=\mu(X_{L^{+}})$ (Equation 5). We obtain one estimate for $\overline{H}_{X^{k},f}^{(L+L_{+})}$ for each repetition of the experiment and thereby one value of $|\overline{H}_{X^{k},f}^{(L+L_{+})}-\overline{H}_{X^{k},f}^{(N)}|\approx|\hat{S}_{X^{k}}-S_{X^{k}}|$, which is used as a measure of convergence. In a similar way, we can proceed with the MST method on $(X_{+},Y_{+})$ with $Y_{L^{+}}=\mu(X_{L^{+}})$. Finally, the SC method, based on the uniform distribution and combined with KDE, is used for comparison. Note that we showed earlier that KDE has a larger bias than MST, but we look mainly at the convergence here.

The computed reference values of the sensitivity indices are shown in Figure 3.

The values for independent and dependent data are close to each other for variables $1$ and $3$ , while they are far apart for variable $2$ , which is due to the dependency.

We will first show the results for the independent data, followed by the results for the dependent data. We start with determining the goodness-of-fit of the Gaussian process by performing $k$ -fold cross-validation (CV) with $k=10$ and compute the coefficient of determination

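$$R^{2}=1-\frac{SS_{res}}{SS_{tot}}=1-\frac{\frac{1}{L}\sum_{l}(\hat{Y}_{l}-Y_{l})^{2}}{\frac{1}{L}\sum_{l}(Y_{l}-\bar{Y})^{2}},$$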

where $\hat{Y}_{l}$ are the CV predictions for $Y_{l}$ and $\bar{Y}=\frac{1}{L}\sum_{l}Y_{l}$ .
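A small sketch of this goodness-of-fit measure is given below. The function names are hypothetical, and the fold assignment (every $k$-th index in one fold) is an assumption, since the text does not say how the folds are formed.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination 1 - SS_res / SS_tot."""
    L = len(y_true)
    y_bar = sum(y_true) / L
    ss_res = sum((yh - y) ** 2 for yh, y in zip(y_pred, y_true)) / L
    ss_tot = sum((y - y_bar) ** 2 for y in y_true) / L
    return 1.0 - ss_res / ss_tot

def k_fold_indices(L, k=10):
    """Split indices 0..L-1 into k interleaved folds of nearly equal size."""
    return [list(range(i, L, k)) for i in range(k)]
```

Predicting the sample mean everywhere gives $R^{2}=0$, and any predictor with $SS_{res}>SS_{tot}$ gives a negative $R^{2}$, which is the "worse than a constant function" situation.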

The values of $R^{2}$ for independent data are given in Figure 4, and we see its values are near zero for higher values of $L$. For $L=30$ and $L=50$, the fraction $SS_{res}/SS_{tot}$ can become larger than $1$, in which case the fit is worse than a constant function. Hence, for these small values of $L$ the Gaussian process is not fit well, whereas it is for the higher values of $L$.

Figure 5 shows the convergence of the estimates, where “sample” indicates the KDE is based on only $L$ samples, “SC” indicates stochastic collocation is used (combined with KDE), “GP-KDE” is based on (5) and “GP-MST” is based on (13). From left to right, variables 1 to 3 are shown. This will also be the case for all similar figures in this section.

In this figure, we see several trends. First, the estimates based on the $L$ samples alone perform worse than the methods that augment the dataset. Second, the results for SC are not robust and their errors do not, in general, decrease with increasing $L$ . Third, GP-MST shows generally decreasing errors for increasing $L$ .

The results for dependent data are shown in Figures 6 and 7. Because the data is dependent, LHS is not an appropriate sampling method, and Monte Carlo sampling is used instead. SC is also not entirely suitable here, because the input distribution is dependent. The results are similar to those of the previous experiments, although the cross-validation results imply that the Gaussian process has been fit better for this data. Another observation is that GP-MST outperforms the other methods for variables 2 and 3, while it is not markedly worse than GP-KDE for variable 1. Overall, the Gaussian process-based methods outperform the other methods.

3.4 Piston function

We also tested a higher-dimensional test case with independent uniformly distributed input variables. In this case, the output function is defined by the Piston function from [37]. The output here is the cycle time of a piston, as given by

[TABLE]

of which the input ranges are given in Table 1.

For numerical reasons, the data of size $N=10^{3}$ is generated and processed on the unit hypercube: it is only transformed to the input ranges to obtain the output values. The sensitivity indices as computed on a larger dataset of size $N=10^{5}$ are given in Figure 8. The values for KDE and MST differ; Section 3.2 indicates that the MST results are the more accurate ones.
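As an illustration of working on the unit hypercube and transforming only for the model evaluation, the following sketch implements the Piston function in the form in which it is commonly stated in the literature. The formula and the input ranges used here are assumptions in that sense (only the range of $M$ is visible in Table 1 above), and the names `scale_from_unit` and `piston_cycle_time` are hypothetical.

```python
import math

# Input ranges as commonly stated for the Piston function (assumed here).
PISTON_RANGES = [
    (30.0, 60.0),         # M: piston weight (kg)
    (0.005, 0.020),       # S: piston surface area (m^2)
    (0.002, 0.010),       # V0: initial gas volume (m^3)
    (1000.0, 5000.0),     # k: spring coefficient (N/m)
    (90000.0, 110000.0),  # P0: atmospheric pressure (N/m^2)
    (290.0, 296.0),       # Ta: ambient temperature (K)
    (340.0, 360.0),       # T0: filling gas temperature (K)
]


def scale_from_unit(u):
    """Map a point of the unit hypercube [0, 1]^7 to the physical input ranges."""
    return [lo + ui * (hi - lo) for ui, (lo, hi) in zip(u, PISTON_RANGES)]


def piston_cycle_time(M, S, V0, k, P0, Ta, T0):
    """Cycle time of the piston in seconds (commonly stated form of the function)."""
    A = P0 * S + 19.62 * M - k * V0 / S
    V = S / (2.0 * k) * (math.sqrt(A * A + 4.0 * k * (P0 * V0 / T0) * Ta) - A)
    return 2.0 * math.pi * math.sqrt(M / (k + S * S * (P0 * V0 / T0) * Ta / (V * V)))


def evaluate_on_unit_cube(u):
    """Evaluate the model for a sample stored on the unit hypercube."""
    return piston_cycle_time(*scale_from_unit(u))
```

Keeping the samples on the unit hypercube means that density estimates are not distorted by the very different scales of the inputs (for example $P_0$ versus $S$ ).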

The cross-validation results are in Figure 9. These last results show the Gaussian process has been fit well for $L\geq 50$ and therefore we can continue with the remaining results.

The results for the convergence are in Figure 10. From left to right, top to bottom, variables 1 to 7 are shown.

We see clear differences between variables 1-4 on the one hand and variables 5-7 on the other. For variables 5-7, $\widehat{S^{H}_{X^{k}}}$ is near zero and, as indicated in Section 3.1, methods that make use of a fit perform badly in this case. For variables 1-4, GP-MST clearly outperforms the others. The SC method has only been performed with $2$ collocation points per dimension, which leads to $L=2^{7}=128$ . Increasing this to $3$ would give $3^{7}=2187$ collocation points, which is more than the number of data points in the dataset used.

3.5 Recommendation

The Gaussian process-based methods in general outperform the sample-based method and stochastic collocation, except when the value of the sensitivity index is (near-)zero. However, one is usually interested in ordering the input variables by their sensitivity indices rather than in obtaining the values very precisely. Input variables whose sensitivity index is near zero are usually considered unimportant, so in that case it is not very important to estimate the near-zero value precisely. We therefore advise using the GP-MST method, in which the available samples $(X_{L},Y_{L})$ are augmented to $(X_{+},Y_{+})$ by a Gaussian process, after which the sensitivity index (3) is computed for the Hellinger distance by the minimum spanning tree method.

4 Direct sensitivity indices

We note that the sensitivity indices as described by [8] are total sensitivity indices, which include both direct and indirect effects. Direct effects measure the effect of one input variable only, while indirect effects contain the effect of the other variables due to possible dependencies in the input variables. The indirect effect is the difference between the total and the direct effect. To illustrate this, consider an example where $X=(X^{1},X^{2})$ follows a bivariate normal distribution with means [math], variances $1$ and covariance $\rho>0$ (so that $X^{1}$ and $X^{2}$ are dependent), while $u(x_{1},x_{2})=2x_{1}$ . Then the direct effect of $X^{2}$ is zero, while its total effect is positive (because $u(X^{1},X^{2})$ and $X^{2}$ are dependent through $X^{1}$ ). Hence, the indirect effect of $X^{2}$ is positive as well. Our goal is now to find a measure for the direct effects, i.e., without the effects of the mutual input dependencies.
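The bivariate normal example can be checked numerically with a short sketch. It assumes zero means (the means are not legible in this text) and uses the Pearson correlation between $u(X^{1},X^{2})$ and $X^{2}$ as a simple proxy for dependence; it does not compute the Hellinger-based index itself. All function names are hypothetical.

```python
import math
import random


def sample_bivariate_normal(n, rho, rng):
    """Draw n pairs with zero means (assumed), unit variances and covariance rho."""
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1 = z1
        x2 = rho * z1 + math.sqrt(1.0 - rho * rho) * z2
        pairs.append((x1, x2))
    return pairs


def u(x1, x2):
    """Model of the example: the output ignores x2 entirely."""
    return 2.0 * x1


def correlation(a, b):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def total_dependence_on_x2(n=20000, rho=0.6, seed=0):
    """Correlation between u(X) and X^2 when the inputs are dependent."""
    pairs = sample_bivariate_normal(n, rho, random.Random(seed))
    outputs = [u(x1, x2) for x1, x2 in pairs]
    return correlation(outputs, [x2 for _, x2 in pairs])
```

Varying $x_{2}$ with $x_{1}$ held fixed leaves the output unchanged (zero direct effect), while the sampled output and $X^{2}$ are correlated with a coefficient close to $\rho$ (positive total effect).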

Although useful, total sensitivity indices do not tell the complete story. While an input variable may be completely irrelevant for the value of the output, it may have a positive sensitivity index due to a dependency with a relevant input. The relevant input variable would then be called a confounder. An example of this is wave height in a computational model of offshore wind energy: although the waves have nothing to do with the power output, the two are linked via the wind speed, which is statistically dependent with the wave height. To get rid of this effect, we need to construct indices which measure the effect of only one input variable, without effects due to dependencies in the input. In this case, it is necessary to remove the dependencies from the input.

For variance-based sensitivity indices, a distinction is made between first-order, higher-order and total sensitivity indices [38]. In first-order indices, one only measures the effect of varying one variable alone, where in higher-order indices multiple variables are varied at the same time. Because the number of second-order sensitivity indices grows as $d(d-1)/2$ with $d$ the number of input variables, and the total number of sensitivity indices is $2^{d}-1$ , usually not all of them are computed. Instead, one computes the first-order and total sensitivity indices.
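The counting argument can be made concrete with a few lines of code. This is only an arithmetic illustration; the function name `index_counts` is hypothetical.

```python
from math import comb


def index_counts(d):
    """Return (first-order, second-order, all-orders) counts for d input variables."""
    first_order = d
    second_order = comb(d, 2)   # d(d-1)/2 unordered pairs of variables
    all_orders = 2 ** d - 1     # one index per non-empty subset of variables
    return first_order, second_order, all_orders
```

For the seven inputs of the Piston function this gives 7 first-order, 21 second-order and 127 indices in total, which shows why usually only first-order and total indices are reported.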

In a similar fashion to first-order indices, we define direct sensitivity indices, which measure the effect of varying one variable only. The direct indices then measure the direct effects, while the total indices measure the combination of direct and indirect effects, which also includes effects due to dependencies in the input.

4.1 Theory

The starting point of these new indices is the same divergence-based index as before, namely (12). We repeat (12) here for convenience,

[TABLE]

Now, note that

[TABLE]

with $u(\cdot)$ being the model used to obtain the output $Y$ , hence, both $p_{Y}(y)$ and $p_{X^{k},Y}(x,y)$ depend in theory on all input variables $X^{k}$ . Hence, if we remove the dependencies between the input variables, then these probability distributions change as well. This removal is done by applying a permutation operator $\Pi$ , which is defined on a dataset $X$ in such a way that

[TABLE]

with

[TABLE]

Hence, this operator keeps the marginal distributions the same, but it removes all dependencies ( $\perp$ here denotes statistical independence). The implementation of this operator is detailed at the end of this section.

Now, we create a permuted version of our dataset $X$ , being $\Pi(X)$ . For this dataset, we can define the direct sensitivity index by

[TABLE]

The output $Y=u(X)$ can be replaced by

[TABLE]

in which $\tilde{u}(\cdot)$ denotes the Gaussian process $\widetilde{G}_{\{X_{L},Y_{L}\}}(\cdot)$ constructed earlier. This leads to the estimator

[TABLE]

The problem now is how to define $\Pi(X)$ . A naive implementation could be one in which for each variable, a random permutation of the values is performed. This is fast, but does not guarantee independence of the input variables after transformation. Also, the indexing of the permutations leads to a Latin hypercube design (LHD): each value from $1$ to $N$ (for $N$ data points in the dataset) is used only once. However, this does not guarantee all dependencies are removed. In Latin hypercube sampling (LHS), a comparable problem exists as equally probable subspaces can end up with a different number of sampling points. This is solved by orthogonal sampling [39] or by using a maximin criterion [40].
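The naive implementation mentioned above can be sketched as follows, assuming the dataset is stored as a list of rows; the function name `naive_permute` is hypothetical. Each column is shuffled independently, so the marginal distributions are preserved, but, as noted, independence after the transformation is not guaranteed.

```python
import random


def naive_permute(data, rng):
    """Shuffle every column of the dataset independently of the others."""
    n, d = len(data), len(data[0])
    columns = []
    for j in range(d):
        order = list(range(n))
        rng.shuffle(order)  # random permutation of the indices 0..n-1
        columns.append([data[i][j] for i in order])
    # Reassemble rows from the independently shuffled columns.
    return [[columns[j][i] for j in range(d)] for i in range(n)]
```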

Inspired by this, we would like to generate an LHD of size $N$ in $d$ dimensions with the maximin criterion which puts the samples at the middle of each interval. This LHD is easily transformed to an indexing, which can be applied to the original data $X$ to obtain $\Pi(X)$ . However, obtaining such an LHD is computationally very expensive because it contains an optimization step and is therefore not feasible for the problem sizes we are looking at.

An alternative to Latin hypercube sampling is quasi-Monte Carlo sampling, which generates data points from low-discrepancy sequences such as Halton’s [41, Chapter 3] and Sobol’s [42]. In this way, we achieve the goal that the proportion of data points in a sequence falling into a subspace is nearly proportional to the probability measure of this subspace (the difference between them is the discrepancy). Hence, we achieve an approximately uniform distribution of data points over the unit hypercube, which means the dimensions are independent of each other. Furthermore, all values generated for a variable are unique, which means they can easily be transformed to the discrete hypercube $\{1,\ldots,N\}^{d}$ . The transformed values can be used as an indexing for $X$ to obtain $\Pi(X)$ . Because the data points generated by the sequence are uniform over the unit hypercube, they lead to an independent dataset when their transformed values are used as indexing. We use the Sobol sequences as described by [42].
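The indexing idea can be sketched in a dependency-free way. Note that this sketch swaps in a Halton sequence for the Sobol sequence used in the paper, purely to avoid external libraries, and it assumes that the indexing is applied to the sorted values of each variable, which is one possible reading of the description above. The names `halton_ranks` and `permute_by_halton` are hypothetical.

```python
import math

PRIMES = [2, 3, 5, 7, 11, 13, 17]  # one Halton base per dimension (up to 7 here)


def radical_inverse(i, base):
    """Van der Corput radical inverse of the integer i in the given base."""
    inverse, fraction = 0.0, 1.0 / base
    while i > 0:
        inverse += (i % base) * fraction
        i //= base
        fraction /= base
    return inverse


def halton_ranks(n, base):
    """Ranks (0..n-1) of the first n Halton values in one dimension."""
    values = [radical_inverse(i, base) for i in range(1, n + 1)]
    order = sorted(range(n), key=values.__getitem__)
    ranks = [0] * n
    for rank, i in enumerate(order):
        ranks[i] = rank
    return ranks


def permute_by_halton(data):
    """Re-pair the sorted values of each variable according to the Halton ranks."""
    n, d = len(data), len(data[0])
    out = [[None] * d for _ in range(n)]
    for j in range(d):
        sorted_column = sorted(row[j] for row in data)
        ranks = halton_ranks(n, PRIMES[j])
        for i in range(n):
            out[i][j] = sorted_column[ranks[i]]
    return out


def pearson(a, b):
    """Sample Pearson correlation, used only as a rough dependence diagnostic."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
```

Each column of the result is a rearrangement of the original column, so the marginals are unchanged, while the joint rank structure is inherited from the low-discrepancy points.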

4.2 Ishigami

We compute these direct sensitivity indices for the Ishigami test case of Section 3.3. We split the results out to the KDE and the MST estimates. For each of them, we show the estimates of both the independent and dependent direct sensitivity indices and the spread therein for increasing $L$ . We also compare them to the values of the total indices.

We start with the KDE estimates in Figure 11. On the left, we see that the estimates are relatively stable for increasing $L$ and the spread of the estimates decreases. On the right, we compare the estimates for $N=10^{3}$ for the direct indices to the estimates with $N=10^{5}$ for the total indices. We do not compute a reference value for the direct indices because of computational cost. For the dependent data, the indices work as expected: the total indices are larger than the direct sensitivity indices. For the independent data, this is not the case, as for variables 2 and 3 the direct sensitivity index is larger than the total sensitivity index. It is not immediately clear to us why this is the case, because for independent data, the total and direct sensitivity indices should give the same results.

Figure 12 shows on the left similar results for the stability and spread in the MST estimates as before with KDE. On the right, we see that the total sensitivity indices are larger than or equal to their direct counterparts. For the independent data, the differences between the direct and total sensitivity indices are small for variables 1 and 3 and nearly invisible for variable 2. Theoretically, this difference should be (numerically) zero. For the dependent data, we see that the difference between the direct and total sensitivity index is largest for variable 2, while variables 1 and 3 show a small difference. This is due to variable 2 being more strongly dependent with the other two variables.

5 Conclusion

We proposed to use Gaussian processes in order to improve the estimates of divergence-based sensitivity indices. This is advantageous in cases where the number of available input-output samples is small, for example if the computational cost of each model evaluation needed to compute the output is high.

We compared the use of Gaussian processes to the well-established method of stochastic collocation combined with Lagrange interpolation. This method has several disadvantages in practice and is outperformed by the Gaussian process-based methods in our experiments. The use of Gaussian processes also allowed us to propose (i) a new estimation method and (ii) a new type of sensitivity indices. This new estimation method for divergence-based sensitivity indices is based on minimum spanning trees and can be used in case the divergence used is the Hellinger distance. This estimation method has been used before to compute entropies and is numerically fast. The new type of sensitivity index, named direct sensitivity index, is especially useful when the input data is dependent.

Acknowledgments

This research is part of the EUROS programme, which is supported by NWO domain Applied and Engineering Sciences under grant number 14185 and partly funded by the Ministry of Economic Affairs.

Bibliography

[1] A. Saltelli, Global sensitivity analysis: an introduction, in: Proc. 4th International Conference on Sensitivity Analysis of Model Output (SAMO’04), 2004, pp. 27–43.
[2] J. E. Oakley, A. O’Hagan, Probabilistic sensitivity analysis of complex models: a Bayesian approach, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66 (3) (2004) 751–769. doi:10.1111/j.1467-9868.2004.05304.x.
[3] A. Saltelli, P. Annoni, I. Azzini, F. Campolongo, M. Ratto, S. Tarantola, Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index, Computer Physics Communications 181 (2) (2010) 259–270. doi:10.1016/j.cpc.2009.09.018.
[4] I. M. Sobol’, Sensitivity Estimates for Nonlinear Mathematical Models, Mathematical Modeling & Computational Experiments 1 (4) (1993) 407–414.
[5] E. Borgonovo, A new uncertainty importance measure, Reliability Engineering & System Safety 92 (6) (2007) 771–784.
[6] K. Blix, Sensitivity analysis of Gaussian process machine learning for chlorophyll prediction from optical remote sensing, Master’s thesis, UiT Norges arktiske universitet (2014).
[7] I. Csiszár, P. C. Shields, Information theory and statistics: A tutorial, Foundations and Trends® in Communications and Information Theory 1 (4) (2004) 417–528.
[8] S. Da Veiga, Global sensitivity analysis with dependence measures, Journal of Statistical Computation and Simulation 85 (7) (2014) 1283–1305. doi:10.1080/00949655.2014.945932. URL http://dx.doi.org/10.1080/00949655.2014.945932
