Sensitivity analysis beyond linearity

Manuele Leonelli

arXiv:1901.02062·math.ST·January 9, 2019·Int. J. Approx. Reason.

TL;DR

This paper extends sensitivity analysis methods to non-multilinear probabilistic graphical models, showing that sensitivity functions are polynomial and identifying optimal covariation schemes.

Contribution

It introduces a general approach for sensitivity analysis in non-multilinear models, relaxing the multilinearity assumption and deriving polynomial sensitivity functions.

Findings

01

Sensitivity functions are polynomial in non-multilinear models.

02

Proportional covariation is optimal under certain conditions.

03

Derived divergence and distance measures for various covariation schemes.

Tables

Table 1: Probability specifications for the staged trees in Section 2.2.

Multilinear staged tree: $\theta_{1}=0.5$, $\theta_{2}=0.5$, $\theta_{3}=0.2$, $\theta_{4}=0.7$, $\theta_{5}=0.1$, $\theta_{6}=0.35$, $\theta_{7}=0.65$, $\theta_{8}=0.1$, $\theta_{9}=0.5$, $\theta_{10}=0.4$

Non-multilinear staged tree: $\theta_{1}=0.5$, $\theta_{2}=0.5$, $\theta_{3}=0.15$, $\theta_{4}=0.6$, $\theta_{5}=0.25$, $\theta_{6}=0.35$, $\theta_{7}=0.65$

Equations

$$\operatorname{P}(y)=\prod_{i\in[n]}\prod_{j\in S_{i}}\theta_{j}^{A_{y,j}}=\prod_{i\in[n]}\theta_{S_{i}}^{A_{y,S_{i}}},$$

$$\operatorname{MM}(A,\theta,S)=\Big\{\operatorname{P}\in\Delta_{q-1}:\operatorname{P}(y)=\prod_{i\in[n]}\theta_{S_{i}}^{A_{y,S_{i}}}\text{ for }y\in\mathbb{Y}\text{ and }\theta\in\mathbb{R}^{k}_{>0}\Big\}$$

$$A=\begin{pmatrix}2&0\\1&1\\0&1\end{pmatrix}$$

$$A_{11}=\begin{pmatrix}
1&1&1&1&1&1&1&1&1&1\\
0&0&0&0&0&0&0&0&0&0\\
1&1&1&1&0&0&0&0&0&0\\
0&0&0&0&1&1&1&0&0&0\\
0&0&0&0&0&0&0&1&1&1\\
1&0&0&0&0&0&0&0&0&0\\
0&1&1&1&0&0&0&0&0&0\\
0&1&0&0&1&0&0&1&0&0\\
0&0&1&0&0&1&0&0&1&0\\
0&0&0&1&0&0&1&0&0&1
\end{pmatrix},\qquad
A_{12}=\begin{pmatrix}
0&0&0&0&0&0&0&0&0&0\\
1&1&1&1&1&1&1&1&1&1\\
1&1&1&1&0&0&0&0&0&0\\
0&0&0&0&1&1&1&0&0&0\\
0&0&0&0&0&0&0&1&1&1\\
1&0&0&0&0&0&0&0&0&0\\
0&1&1&1&0&0&0&0&0&0\\
0&1&0&0&1&0&0&1&0&0\\
0&0&1&0&0&1&0&0&1&0\\
0&0&0&1&0&0&1&0&0&1
\end{pmatrix}$$

$$A_{21}=\begin{pmatrix}
1&1&1&1&1&1&1&1&1&1\\
0&0&0&0&0&0&0&0&0&0\\
1&2&1&1&1&0&0&1&0&0\\
0&0&1&0&1&2&1&0&1&0\\
0&0&0&1&0&0&1&1&1&2\\
1&0&0&0&0&0&0&0&0&0\\
0&1&1&1&0&0&0&0&0&0
\end{pmatrix},\qquad
A_{22}=\begin{pmatrix}
0&0&0&0&0&0&0&0&0&0\\
1&1&1&1&1&1&1&1&1&1\\
1&2&1&1&1&0&0&1&0&0\\
0&0&1&0&1&2&1&0&1&0\\
0&0&0&1&0&0&1&1&1&2\\
1&0&0&0&0&0&0&0&0&0\\
0&1&1&1&0&0&0&0&0&0
\end{pmatrix}$$

$$\sigma:\bigtimes_{i\in[n]}\Delta_{\#S_{i}-1}\longmapsto\bigtimes_{i\in[n]}\Delta_{\#S_{i}-1},\qquad(\theta_{i},\theta_{S_{j}^{-i}},\theta_{-S_{j}})\longmapsto(\tilde{\theta}_{i},\tilde{\theta}_{S_{j}^{-i}},\theta_{-S_{j}})$$

$$\tilde{\theta}_{k}=\frac{1-\tilde{\theta}_{i}}{1-\theta_{i}}\theta_{k}\quad\text{for all }k\in S_{j}^{-i}.$$

$$\tilde{\theta}_{k}=\frac{1-\tilde{\theta}_{i}}{\#S_{j}-1}\quad\text{for all }k\in S_{j}^{-i}.$$

$$\tilde{\theta}_{k}=\gamma_{k}\tilde{\theta}_{i}+\delta_{k}\quad\text{for all }k\in S_{j}^{-i},$$

$$\sigma(\operatorname{P})(E)=\sum_{y\in E}\tilde{\theta}_{S_{j}}^{A_{y,S_{j}}}\theta_{-S_{j}}^{A_{y,-S_{j}}}$$

$$\sigma_{\operatorname{pro}}(\operatorname{P})(E)=\sum_{y\in E}\tilde{\theta}_{i}^{A_{y,i}}\left(\frac{1-\tilde{\theta}_{i}}{1-\theta_{i}}\right)^{|A_{y,S_{j}^{-i}}|}\theta_{S_{j}^{-i}}^{A_{y,S_{j}^{-i}}}\theta_{-S_{j}}^{A_{y,-S_{j}}}$$

$$\sigma_{\operatorname{uni}}(\operatorname{P})(E)=\sum_{y\in E}\tilde{\theta}_{i}^{A_{y,i}}\left(\frac{1-\tilde{\theta}_{i}}{\#S_{j}-1}\right)^{|A_{y,S_{j}^{-i}}|}\theta_{-S_{j}}^{A_{y,-S_{j}}}$$

$$\sigma_{\operatorname{lin}}(\operatorname{P})(E)=\sum_{y\in E}\tilde{\theta}_{i}^{A_{y,i}}\prod_{k\in S_{j}^{-i}}(\gamma_{k}\tilde{\theta}_{i}+\delta_{k})^{A_{y,k}}\theta_{-S_{j}}^{A_{y,-S_{j}}}$$

$$\sigma(\operatorname{P})(E)=\sum_{y\in E}\tilde{\theta}^{A_{y}}=\sum_{y\in E}\tilde{\theta}_{S_{j}}^{A_{y,S_{j}}}\tilde{\theta}_{-S_{j}}^{A_{y,-S_{j}}}=\sum_{y\in E}\tilde{\theta}_{S_{j}}^{A_{y,S_{j}}}\theta_{-S_{j}}^{A_{y,-S_{j}}}.$$

$$D_{\operatorname{CD}}(\tilde{\operatorname{P}},\operatorname{P})=\log\max_{y\in\mathbb{Y}}\frac{\tilde{\operatorname{P}}(y)}{\operatorname{P}(y)}-\log\min_{y\in\mathbb{Y}}\frac{\tilde{\operatorname{P}}(y)}{\operatorname{P}(y)}.$$

$$\theta_{i}=\operatorname{P}(Y_{1}=i)=\operatorname{P}(Y_{2}=i\mid Y_{1}=j),\quad i\in[3],\ j\in[2].$$

$$D_{\operatorname{CD}}(\sigma(\operatorname{P}),\operatorname{P})=\log\max_{y\in\mathbb{Y}_{S_{j}}^{\neq =}}\left(\frac{\tilde{\theta}_{S_{j}}}{\theta_{S_{j}}}\right)^{A_{y,S_{j}}}-\log\min_{y\in\mathbb{Y}_{S_{j}}^{\neq =}}\left(\frac{\tilde{\theta}_{S_{j}}}{\theta_{S_{j}}}\right)^{A_{y,S_{j}}}$$

$$D_{\operatorname{CD}}(\sigma_{\operatorname{pro}}(\operatorname{P}),\operatorname{P})=\log\max_{y\in\mathbb{Y}_{S_{j}}^{\neq =}}\left(\frac{\tilde{\theta}_{i}}{\theta_{i}}\right)^{A_{y,i}}\left(\frac{1-\tilde{\theta}_{i}}{1-\theta_{i}}\right)^{|A_{y,S_{j}^{-i}}|}-\log\min_{y\in\mathbb{Y}_{S_{j}}^{\neq =}}\left(\frac{\tilde{\theta}_{i}}{\theta_{i}}\right)^{A_{y,i}}\left(\frac{1-\tilde{\theta}_{i}}{1-\theta_{i}}\right)^{|A_{y,S_{j}^{-i}}|}$$

$$D_{\operatorname{CD}}(\sigma_{\operatorname{uni}}(\operatorname{P}),\operatorname{P})=\log\max_{y\in\mathbb{Y}_{S_{j}}^{\neq =}}\frac{\tilde{\theta}_{i}^{A_{y,i}}\left(\frac{1-\tilde{\theta}_{i}}{\#S_{j}-1}\right)^{|A_{y,S_{j}^{-i}}|}}{\theta_{S_{j}}^{A_{y,S_{j}}}}-\log\min_{y\in\mathbb{Y}_{S_{j}}^{\neq =}}\frac{\tilde{\theta}_{i}^{A_{y,i}}\left(\frac{1-\tilde{\theta}_{i}}{\#S_{j}-1}\right)^{|A_{y,S_{j}^{-i}}|}}{\theta_{S_{j}}^{A_{y,S_{j}}}}$$

$$D_{\operatorname{CD}}(\sigma_{\operatorname{lin}}(\operatorname{P}),\operatorname{P})=\log\max_{y\in\mathbb{Y}_{S_{j}}^{\neq =}}\left(\frac{\tilde{\theta}_{i}}{\theta_{i}}\right)^{A_{y,i}}\prod_{k\in S_{j}^{-i}}\left(\frac{\gamma_{k}\tilde{\theta}_{i}+\delta_{k}}{\theta_{k}}\right)^{A_{y,k}}-\log\min_{y\in\mathbb{Y}_{S_{j}}^{\neq =}}\left(\frac{\tilde{\theta}_{i}}{\theta_{i}}\right)^{A_{y,i}}\prod_{k\in S_{j}^{-i}}\left(\frac{\gamma_{k}\tilde{\theta}_{i}+\delta_{k}}{\theta_{k}}\right)^{A_{y,k}}$$

$$D_{\operatorname{CD}}(\sigma(\operatorname{P}),\operatorname{P})$$

$$\log\max_{i=3,4,5}\frac{\tilde{\theta}_{i}}{\theta_{i}}-\log\min_{i=3,4,5}\frac{\tilde{\theta}_{i}}{\theta_{i}}.$$

$$\log\max\left\{\frac{\tilde{\theta}_{3}}{\theta_{3}},\frac{\tilde{\theta}_{3}^{2}}{\theta_{3}^{2}},\frac{\tilde{\theta}_{4}^{2}}{\theta_{4}^{2}},\frac{\tilde{\theta}_{5}^{2}}{\theta_{5}^{2}},\frac{\tilde{\theta}_{3}\tilde{\theta}_{4}}{\theta_{3}\theta_{4}},\frac{\tilde{\theta}_{3}\tilde{\theta}_{5}}{\theta_{3}\theta_{5}},\frac{\tilde{\theta}_{4}\tilde{\theta}_{5}}{\theta_{4}\theta_{5}}\right\}-\log\min\left\{\frac{\tilde{\theta}_{3}}{\theta_{3}},\frac{\tilde{\theta}_{3}^{2}}{\theta_{3}^{2}},\frac{\tilde{\theta}_{4}^{2}}{\theta_{4}^{2}},\frac{\tilde{\theta}_{5}^{2}}{\theta_{5}^{2}},\frac{\tilde{\theta}_{3}\tilde{\theta}_{4}}{\theta_{3}\theta_{4}},\frac{\tilde{\theta}_{3}\tilde{\theta}_{5}}{\theta_{3}\theta_{5}},\frac{\tilde{\theta}_{4}\tilde{\theta}_{5}}{\theta_{4}\theta_{5}}\right\}.$$

$$\log\max\left\{\frac{\tilde{\theta}_{4}^{2}}{\theta_{4}^{2}},\frac{1-\tilde{\theta}_{4}}{1-\theta_{4}},\frac{(1-\tilde{\theta}_{4})^{2}}{(1-\theta_{4})^{2}},\frac{\tilde{\theta}_{4}(1-\tilde{\theta}_{4})}{\theta_{4}(1-\theta_{4})}\right\}-\log\min\left\{\frac{\tilde{\theta}_{4}^{2}}{\theta_{4}^{2}},\frac{1-\tilde{\theta}_{4}}{1-\theta_{4}},\frac{(1-\tilde{\theta}_{4})^{2}}{(1-\theta_{4})^{2}},\frac{\tilde{\theta}_{4}(1-\tilde{\theta}_{4})}{\theta_{4}(1-\theta_{4})}\right\},$$

$$D_{\operatorname{CD}}(\sigma(\operatorname{P}),\operatorname{P})=\log\max_{i\in S_{j}}\frac{\tilde{\theta}_{i}}{\theta_{i}}-\log\min_{i\in S_{j}}\frac{\tilde{\theta}_{i}}{\theta_{i}}$$

$$D_{\operatorname{CD}}(\sigma_{\operatorname{pro}}(\operatorname{P}),\operatorname{P})=\log\frac{\tilde{\theta}_{i}}{\theta_{i}}-\log\frac{1-\tilde{\theta}_{i}}{1-\theta_{i}}$$

$$D_{\operatorname{CD}}(\sigma_{\operatorname{uni}}(\operatorname{P}),\operatorname{P})=\log\max\left\{\frac{\tilde{\theta}_{i}}{\theta_{i}},\frac{1-\tilde{\theta}_{i}}{(\#S_{j}-1)\min_{k\in S_{j}^{-i}}\theta_{k}}\right\}-\log\min\left\{\frac{\tilde{\theta}_{i}}{\theta_{i}},\frac{1-\tilde{\theta}_{i}}{(\#S_{j}-1)\max_{k\in S_{j}^{-i}}\theta_{k}}\right\}$$

Full text

Sensitivity analysis beyond linearity

Manuele Leonelli

School of Mathematics and Statistics, University of Glasgow, UK.

Abstract

A wide array of graphical models can be parametrised to have atomic probabilities represented by monomial functions. Such monomial structure has proven very useful when studying robustness under the assumption of a multilinear model, where all monomials have exponents equal to either zero or one. Robustness in probabilistic graphical models is usually investigated by varying some of the input probabilities and observing the effects of these variations on output probabilities of interest. Here the assumption of multilinearity is relaxed and a general approach for sensitivity analysis in non-multilinear models is presented. It is shown that in non-multilinear models sensitivity functions have a polynomial form, in contrast to multilinear models where they are simply linear. The form of various divergences and distances under different covariation schemes is also formally derived. Proportional covariation is proven to be optimal in non-multilinear models under some specific choices of varied parameters. The methodology is illustrated throughout by an educational application.

Keywords: Covariation, Monomial models, Probabilistic graphical models, Sensitivity analysis, Staged trees

1 Introduction

Sensitivity methods have received great attention in the literature of probabilistic graphical models in the past twenty years. Sensitivity analysis is a fundamental part of any applied analysis, carried out to validate the construction of a probabilistic graphical model and investigate its robustness to misspecification of its probabilities. Such methods have been successfully used in a variety of applications (e.g. Nur et al., 2009; Oberguggenberger et al., 2009; Pollino et al., 2007; Uusitalo, 2007).

Research has mostly focused on Bayesian network (BN) models (Koller et al., 2009; Smith, 2010), although sensitivity results also exist for Markov networks (Chan and Darwiche, 2005b) and chain event graphs (Leonelli et al., 2017a). Sensitivity analysis in BNs usually consists of two phases: first, some parameters of the model are varied and the effect of these variations on output probabilities of interest is investigated; second, once parameter variations are identified, their effect is summarized by a distance or divergence measure between the original and the varied distributions underlying the BN. Although sensitivity methods exist for continuous random variables under the assumption of Gaussianity (e.g. Castillo and Kjærulff, 2003; Gómez-Villegas et al., 2013; Görgen and Leonelli, 2018), henceforth we focus on the most common case of discrete random variables only.

For the first phase of a sensitivity analysis, a simple mathematical function, usually termed sensitivity function, describes an output probability of interest as a function of the BN parameters. This is a (multi-) linear function of the varied parameters for marginal output probabilities (Castillo et al., 1997; Coupé and Van Der Gaag, 2002). Conversely, if the probability of interest is a conditional probability, then the sensitivity function is a ratio of (multi-) linear functions.

For the second phase, the Chan-Darwiche (CD) distance (Chan and Darwiche, 2005a), Kullback-Leibler divergence (Kullback and Leibler, 1951) and $\phi$-divergences (Ali and Silvey, 1966) are often used to measure the overall effect of parameter variations. One important line of research has focused on identifying parameter covariations, i.e. ways to adjust parameters so as to respect the sum-to-one condition after a parameter variation, that minimize such distances. Proportional covariation (Laskey, 1995; Renooij, 2014), which assigns the same proportion of residual probability mass to covarying parameters after a variation, is the gold-standard method since it has been shown to minimize the above-mentioned divergences in a variety of settings (Chan and Darwiche, 2002; Leonelli et al., 2017a), although not all (Leonelli and Riccomagno, 2018).

Most of the above-mentioned results, although specifically derived for BNs, hold for a variety of models whose atomic probabilities can be written as a multilinear polynomial (Leonelli et al., 2017a). The multilinear structure of atomic probabilities in BNs has been known for quite some time (Castillo et al., 1995; Darwiche, 2003), but other models entertain the same property under specific parametrisations, for instance stratified staged trees (Görgen et al., 2015), context-specific BNs (Boutilier et al., 1996) and influence diagrams (Leonelli et al., 2017b).

The development of sensitivity methods for models whose atomic probabilities cannot be written as multilinear polynomials has been limited. Results have been derived for dynamic Bayesian networks (DBNs) (Charitos and van der Gaag, 2006a, b), Markov chains (de Cooman et al., 2008) and hidden Markov models (Amsalu et al., 2017; Renooij, 2012). The atomic probabilities of all these model classes have a non-square-free polynomial representation, as demonstrated in Brandherm and Jameson (2004), since they all have a DBN characterisation. Non-multilinear atomic probabilities are often associated with models whose probabilities are recursively updated through time in a dynamic fashion, although this does not necessarily have to be the case, as demonstrated by the examples below.

This work presents a general framework for sensitivity analysis in models whose atomic probabilities have a non-multilinear structure and therefore can be applied to the already mentioned model classes of DBNs and hidden Markov models. The monomial representation of a statistical model introduced in Leonelli and Riccomagno (2018) is used here to encompass all classes of discrete models with non-multilinear atomic probabilities. For such models, the form of the sensitivity functions and their properties are derived. Furthermore, results about the computation of the CD distance and $\phi$ -divergences under various covariation schemes are derived. In particular, it is proven that, for specific choices of parameters to be varied, proportional covariation is optimal, in the sense that it minimizes the CD distance between the original and varied distributions amongst all possible ways to covary parameters. Therefore, this work extends the results of Leonelli et al. (2017a) for multilinear models to non-multilinear ones, as well as proposing sensitivity methods similar to those of Renooij (2012) and Charitos and van der Gaag (2006a) but which apply to a much more general class of models.

The paper is structured as follows. Section 2 reviews monomial models and shows that staged trees have in general a non-multilinear polynomial representation. This section further introduces a running example from an educational application. Section 3 reviews covariation methods for probabilities. Section 4 reports the derivations of the sensitivity functions for non-multilinear models, whilst Section 5 deals with divergences and their computation. The paper is concluded with a discussion.

2 Monomial models

A review of monomial models, in short MMs, as introduced in Leonelli and Riccomagno (2018) is given first. Let $\mathbb{Y}$ be a finite set with $\#\mathbb{Y}=q$ elements and $\operatorname{P}$ a strictly positive probability density function for $\mathbb{Y}$. Call $y\in\mathbb{Y}$ an atom and $\operatorname{P}(y)$ the atomic probability of $y$. The generic probability $\operatorname{P}$ can be seen as a point in the interior of the $(q-1)$-dimensional simplex, i.e. $\operatorname{P}\in\Delta_{q-1}$. Next, a particular class of parametric statistical models, called MMs, is associated to $\mathbb{Y}$.

Let $[k]=\{1,2,\ldots,k\}$ . A MM is defined by three elements: a $q\times k$ matrix $A$ with non-negative integer entries, $A\in\mathcal{M}_{q\times k}(\mathbb{Z}_{\geq 0})$ ; a $k$ -dimensional parameter vector $\theta$ with positive real entries, $\theta=(\theta_{i})_{i\in[k]}\in\mathbb{R}^{k}_{>0}$ ; and a partition $S=\{S_{1},\dots,S_{n}\}$ of $[k]$ . There is a row of $A$ for each atom $y$ and $A_{y}$ indicates the $y$ -th row of $A$ . The atomic probability of $y\in\mathbb{Y}$ given $\theta$ and $A$ is defined as $\operatorname{P}(y)=\prod_{i\in[k]}\theta_{i}^{A_{y,i}}=\theta^{A_{y}}$ . The partition $S$ of $[k]$ is such that $\theta_{S_{i}}=(\theta_{j})_{j\in S_{i}}\in\Delta_{\#S_{i}-1}$ . The atomic probability of $y\in\mathbb{Y}$ can then be written as

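$$\operatorname{P}(y)=\prod_{i\in[n]}\prod_{j\in S_{i}}\theta_{j}^{A_{y,j}}=\prod_{i\in[n]}\theta_{S_{i}}^{A_{y,S_{i}}},$$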

where $\theta_{S}^{A_{y,S}}=\prod_{i\in S}\theta_{i}^{A_{y,i}}$ denotes the monomial associated to an event $y\in\mathbb{Y}$ where only parameters $\theta_{i}$ for $i\in S$ can have non-zero exponent. For $A\in\mathcal{M}_{q\times k}(\mathbb{Z}_{\geq 0})$ , $B\subseteq[q]$ and $C\subseteq[k]$ , $A_{B,C}$ denotes the submatrix of $A$ with $B$ rows and $C$ columns.

Definition 1.

The MM over $\mathbb{Y}$ associated to $A$ , $\theta$ and $S$ , where $S$ is such that $\theta_{S_{i}}\in\Delta_{\#S_{i}-1}$ , is defined as

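$$\operatorname{MM}(A,\theta,S)=\Big\{\operatorname{P}\in\Delta_{q-1}:\operatorname{P}(y)=\prod_{i\in[n]}\theta_{S_{i}}^{A_{y,S_{i}}}\text{ for }y\in\mathbb{Y}\text{ and }\theta\in\mathbb{R}^{k}_{>0}\Big\}.$$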

A $\operatorname{MM}(A,\theta,S)$ is said to be multilinear if $A\in\mathcal{M}_{q\times k}(\{0,1\})$ .

A MM is multilinear if all its monomials are square free, i.e. the exponents of the parameters are either zero or one. Leonelli et al. (2017a) and Leonelli and Riccomagno (2018) give a thorough investigation of sensitivity analysis in multilinear MMs. Here conversely the focus is on models which are not necessarily multilinear.

Example 1.

Consider a simple coin toss game. The probability of head (H) is $\theta_{1}$ , whilst tail (T) has probability $\theta_{2}$ , where $\theta_{1}+\theta_{2}=1$ . If the result of the first toss is head, then the coin is tossed a second time. This situation can be represented by a MM with parameter vector $\theta=(\theta_{1},\theta_{2})$ , degenerate partition of $[2]$ including one element only, and matrix $A$ defined as

$$A=\begin{pmatrix}2&0\\1&1\\0&1\end{pmatrix},$$

where the first column of $A$ relates to $\theta_{1}$ and the second to $\theta_{2}$ . The model is such that $\operatorname{P}(HH)=\theta_{1}^{2}$ , $\operatorname{P}(HT)=\theta_{1}\theta_{2}$ and $\operatorname{P}(T)=\theta_{2}$ . This MM is non-multilinear since the matrix A includes an entry equal to 2.
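As a minimal sketch of how the exponent matrix determines the atomic probabilities through $\operatorname{P}(y)=\theta^{A_{y}}$, the snippet below evaluates the model of Example 1. The helper name `atomic_probabilities` and the value $\theta_{1}=0.6$ are illustrative assumptions, not taken from the paper.

```python
from math import prod

def atomic_probabilities(A, theta):
    """Return P(y) = prod_i theta_i ** A[y][i] for every row (atom) y of A."""
    return [prod(t ** a for t, a in zip(theta, row)) for row in A]

# Exponent matrix of Example 1: rows are the atoms HH, HT, T,
# columns are the parameters theta_1 (head) and theta_2 (tail).
A = [[2, 0],
     [1, 1],
     [0, 1]]
theta = [0.6, 0.4]  # illustrative values (assumed); they must sum to one

probs = atomic_probabilities(A, theta)

# A monomial model is multilinear when every exponent is zero or one.
is_multilinear = all(a in (0, 1) for row in A for a in row)
```

The three atomic probabilities sum to one for any $\theta_{1}+\theta_{2}=1$, since $\theta_{1}^{2}+\theta_{1}\theta_{2}+\theta_{2}=\theta_{1}(\theta_{1}+\theta_{2})+\theta_{2}$.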

Since DBNs have been already shown to have a non-multilinear monomial structure in Brandherm and Jameson (2004), here the focus is on staged trees, which are introduced next.

2.1 Staged trees

Graphical models represented by event trees $\mathcal{T}=(V,E)$ are considered here, which are directed rooted trees where each inner vertex $v\in V$ has at least two children. In this context, the sample space of the model corresponds to the set of root-to-leaf paths in the graph and each directed path, which is a sequence of edges $r=(e\mid e\in E(r))$, for $E(r)\subset E$, has a meaning in the modelling context. Each edge $e\in E$ is associated to a primitive probability $\theta_{e}\in(0,1)$ such that on each floret $\mathcal{F}(v)=(v,E(v))$, where $E(v)\subseteq E$ is the set of edges emanating from $v\in V$, the primitive probabilities sum to unity. The probability of an atom is then simply the product of the primitive probabilities along the edges of its path: $\operatorname{P}(r)=\prod_{e\in E(r)}\theta_{e}$.

Definition 2.

Let $\theta_{v}=(\theta_{e}\mid e\in E(v))$ be the vector of primitive probabilities associated to the floret $\mathcal{F}(v)$, $v\in V$, in an event tree $\mathcal{T}=(V,E)$. A staged tree is an event tree as above where, for some $v,w\in V$, the floret probabilities are identified, $\theta_{v}=\theta_{w}$. Then, $v,w\in V$ are in the same stage.

Two vertices are thus in the same stage if they have the same (conditional) distribution over their edges. When drawing a tree, vertices in the same stage are either framed using the same shape or equally colored in order to have a visual counterpart of that information. Setting floret probabilities equal can be thought of as representing conditional independence information. Staged trees are capable of representing all conditional independence hypotheses within discrete BNs, whilst at the same time being more flexible in expressing modifications of these (Görgen et al., 2015; Smith and Anderson, 2008).

Staged trees are MMs whose atomic probabilities can either be multilinear or not (Görgen et al., 2015). The following example gives a simple illustration of a non-multilinear staged tree.

Example 2.

The MM of Example 1 can be depicted as the staged tree in Figure 1, which has two inner-vertices, $v_{0}$ and $v_{1}$ , in the same stage. The tree has three root-to-leaf paths ending in the leaves $v_{3}$ (head and head), $v_{4}$ (head and tail) and $v_{2}$ (tail). The edges emanating from the inner-vertices $v_{0}$ and $v_{1}$ are associated to the primitive probabilities $\theta_{1}$ and $\theta_{2}$ representing the probability of head and tail respectively.

2.2 An example

To illustrate the construction of a staged tree and its monomial representation, an example from an educational application is considered. This example was first introduced in Freeman and Smith (2011).

In a one-year program students take components A and B, but not everyone in the same order: students are first allocated to study either module A or B for the first six months and then the other for the final six months. After the first six months students are examined on their allocated component and can be awarded a distinction (D), a pass (P) or a fail (F). If they fail, they can resit the exam with the possibility of passing and thus be allowed to proceed to the second component. Students who fail the resit are withdrawn from the program. For the second module students can again either fail, pass or be awarded a distinction, but with no possibility of resitting. With an obvious extension of the labeling, the process can be depicted by the tree in Figure 2.

Various hypotheses of conditional independence, corresponding to equal primitive probabilities of multiple florets, can be embedded in the above educational scenario. One set of such hypotheses was given in Freeman and Smith (2011) as:

1. The components A and B are equally hard: this corresponds to an equal framing of the vertices A and B in Figure 2.

2. The chances of passing the first module after a fail do not depend on the module taken: this is depicted by an equal colouring of $F_{1,A}$ and $F_{1,B}$ in Figure 2.

3. The distribution of grades for the last six months does not depend on the module taken nor on the results of the first part: this is depicted by framing $P_{R,A}$, $P_{1,A}$, $D_{1,A}$, $P_{R,B}$, $P_{1,B}$ and $D_{1,B}$ by a rectangle in Figure 2.

These hypotheses give the staged tree of Figure 2, which can be equally represented by a MM with parameter vector $(\theta_{1},\dots,\theta_{10})$ , matrix $A=(A_{11},A_{12})^{\textnormal{T}}$ , with

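$$A_{11}=\begin{pmatrix}
1&1&1&1&1&1&1&1&1&1\\
0&0&0&0&0&0&0&0&0&0\\
1&1&1&1&0&0&0&0&0&0\\
0&0&0&0&1&1&1&0&0&0\\
0&0&0&0&0&0&0&1&1&1\\
1&0&0&0&0&0&0&0&0&0\\
0&1&1&1&0&0&0&0&0&0\\
0&1&0&0&1&0&0&1&0&0\\
0&0&1&0&0&1&0&0&1&0\\
0&0&0&1&0&0&1&0&0&1
\end{pmatrix},\qquad
A_{12}=\begin{pmatrix}
0&0&0&0&0&0&0&0&0&0\\
1&1&1&1&1&1&1&1&1&1\\
1&1&1&1&0&0&0&0&0&0\\
0&0&0&0&1&1&1&0&0&0\\
0&0&0&0&0&0&0&1&1&1\\
1&0&0&0&0&0&0&0&0&0\\
0&1&1&1&0&0&0&0&0&0\\
0&1&0&0&1&0&0&1&0&0\\
0&0&1&0&0&1&0&0&1&0\\
0&0&0&1&0&0&1&0&0&1
\end{pmatrix}$$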

and partition $S=\{S_{1},S_{2},S_{3},S_{4}\}$ where $S_{1}=\{1,2\}$ , $S_{2}=\{3,4,5\}$ , $S_{3}=\{6,7\}$ and $S_{4}=\{8,9,10\}$ . This model is multilinear since all entries of $A$ are either zero or one. Graphically this could have also been deduced by noticing that no vertices along a root-to-leaf path are in the same stage.

A second set of hypotheses may embellish the first one by assuming that the distribution of grades of students not experiencing fails is the same in all components. This additional hypothesis gives the staged tree in Figure 3 where vertices $A$, $B$, $P_{R,A}$, $P_{1,A}$, $D_{1,A}$, $P_{R,B}$, $P_{1,B}$ and $D_{1,B}$ are now all in the same stage. This staged tree can be written as a MM with parameter $(\theta_{1},\dots,\theta_{7})$, matrix $A=(A_{21},A_{22})^{\textnormal{T}}$ with

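$$A_{21}=\begin{pmatrix}
1&1&1&1&1&1&1&1&1&1\\
0&0&0&0&0&0&0&0&0&0\\
1&2&1&1&1&0&0&1&0&0\\
0&0&1&0&1&2&1&0&1&0\\
0&0&0&1&0&0&1&1&1&2\\
1&0&0&0&0&0&0&0&0&0\\
0&1&1&1&0&0&0&0&0&0
\end{pmatrix},\qquad
A_{22}=\begin{pmatrix}
0&0&0&0&0&0&0&0&0&0\\
1&1&1&1&1&1&1&1&1&1\\
1&2&1&1&1&0&0&1&0&0\\
0&0&1&0&1&2&1&0&1&0\\
0&0&0&1&0&0&1&1&1&2\\
1&0&0&0&0&0&0&0&0&0\\
0&1&1&1&0&0&0&0&0&0
\end{pmatrix}$$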

and partition $S=\{S_{1},S_{2},S_{3}\}$ where $S_{1}=\{1,2\}$ , $S_{2}=\{3,4,5\}$ and $S_{3}=\{6,7\}$ . Under this additional hypothesis the staged tree does not entertain a multilinear monomial parametrization, but only a non-multilinear one. For such models there is currently no established sensitivity theory to investigate their robustness.
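The construction of the exponent matrix from root-to-leaf paths can be sketched as follows for the non-multilinear staged tree, using the Table 1 specifications. The labelling assumed here ($\theta_{1},\theta_{2}$ for starting with A or B; $\theta_{3},\theta_{4},\theta_{5}$ for fail, pass, distinction; $\theta_{6},\theta_{7}$ for failing or passing the resit), the ordering of the atoms, and the helper names `branch_paths` and `exponent_matrix` are assumptions made for illustration.

```python
from collections import Counter
from math import prod

def exponent_matrix(paths, k):
    """Row y counts how often each parameter 1..k appears along path y."""
    rows = []
    for path in paths:
        counts = Counter(path)
        rows.append([counts.get(i, 0) for i in range(1, k + 1)])
    return rows

def branch_paths(first):
    """Root-to-leaf paths, as lists of parameter indices, through the
    module taken first (1 = A, 2 = B). Labelling is an assumption."""
    grades = (3, 4, 5)                              # fail, pass, distinction
    paths = [[first, 3, 6]]                         # fail, then fail the resit
    paths += [[first, 3, 7, g] for g in grades]     # fail, pass resit, second module
    paths += [[first, g1, g2] for g1 in (4, 5) for g2 in grades]
    return paths

# Table 1, non-multilinear staged tree.
theta = [0.5, 0.5, 0.15, 0.6, 0.25, 0.35, 0.65]

A = exponent_matrix(branch_paths(1) + branch_paths(2), k=7)
probs = [prod(t ** a for t, a in zip(theta, row)) for row in A]
max_exponent = max(max(row) for row in A)
```

Under these assumptions the 20 atomic probabilities sum to one and the largest exponent is 2, which arises whenever the same grade is obtained in both modules, so the model is not multilinear.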

3 Covariation

The basic underlying idea of sensitivity analysis is to vary some of the model’s parameters and observe how such variations affect outputs of interest. However, when such variations are performed, then some of the remaining parameters need to be adjusted (or to covary) to respect the sum-to-one condition of probability measures. In the binary case when one of the two parameters is varied this is straightforward, since the second parameter will be equal to one minus the other. But in generic discrete finite cases there are multiple ways to covary parameters.

The theory of covariation from Renooij (2014) is reviewed next, with a particular focus on its specific characterization for MMs given in Leonelli and Riccomagno (2018). For a set $S$ and $i\in S$ , let $S^{-i}$ denote $S\setminus\{i\}$ , $-S_{j}$ denote the set $[k]\setminus S_{j}$ and let $|v|$ denote the sum of the elements of a vector $v$ .

Definition 3.

Let $\theta$ be the parameter vector of a MM and $\theta_{i}$ be the parameter varied where $i\in S_{j}$ . Let $\theta$ be partitioned as $\theta=(\theta_{i},\theta_{S^{-i}_{j}},\theta_{-S_{j}})$ and let $\tilde{\theta}_{i}\in(0,1)$ . A $\tilde{\theta}_{i}$ -covariation scheme is a function $\sigma:\bigtimes\limits_{k\in[n]}\Delta_{\#S_{k}-1}\rightarrow\bigtimes\limits_{k\in[n]}\Delta_{\#S_{k}-1}$ which fixes $\theta_{i}$ to $\tilde{\theta}_{i}$ and does not change $\theta_{-S_{j}}$ , i.e.

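$$\sigma:\bigtimes_{k\in[n]}\Delta_{\#S_{k}-1}\longrightarrow\bigtimes_{k\in[n]}\Delta_{\#S_{k}-1},\qquad(\theta_{i},\theta_{S_{j}^{-i}},\theta_{-S_{j}})\longmapsto(\tilde{\theta}_{i},\tilde{\theta}_{S_{j}^{-i}},\theta_{-S_{j}}).$$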

Thus $\theta_{S_{j}}$ denotes a vector of parameters that need to respect the sum to one condition, $\tilde{\theta}_{i}$ denotes the new numerical specification of the parameter varied and $\theta_{-S_{j}}$ the parameter vector which is not affected by the variation. Consider as an example a staged tree model. In a staged tree the sets $S_{k}$ , $k\in[n]$ , denote the conditional probability distributions of florets in different stages. Suppose one parameter from one stage is varied. Then the parameters associated to that same stage are covaried, whilst all others are held fixed.

Definition 4.

In the notation of Definition 3:

1. The $\tilde{\theta}_{i}$-proportional covariation scheme $\sigma_{\operatorname{pro}}(\theta)=(\tilde{\theta}_{i},\tilde{\theta}_{S^{-i}_{j}},\theta_{-S_{j}})$ is defined by setting

$$\tilde{\theta}_{k}=\frac{1-\tilde{\theta}_{i}}{1-\theta_{i}}\theta_{k}\quad\text{for all }k\in S_{j}^{-i}.$$

2. The $\tilde{\theta}_{i}$-uniform covariation scheme $\sigma_{\operatorname{uni}}(\theta)=(\tilde{\theta}_{i},\tilde{\theta}_{S_{j}^{-i}},\theta_{-S_{j}})$ is defined by setting

$$\tilde{\theta}_{k}=\frac{1-\tilde{\theta}_{i}}{\#S_{j}-1}\quad\text{for all }k\in S_{j}^{-i}.$$

3. The $\tilde{\theta}_{i}$-linear covariation scheme $\sigma_{\operatorname{lin}}(\theta)=(\tilde{\theta}_{i},\tilde{\theta}_{S_{j}^{-i}},\theta_{-S_{j}})$ is defined by setting

$$\tilde{\theta}_{k}=\gamma_{k}\tilde{\theta}_{i}+\delta_{k}\quad\text{for all }k\in S_{j}^{-i},$$

where $\gamma_{k}$ and $\delta_{k}$ need to be chosen so that $\tilde{\theta}_{i}+|\tilde{\theta}_{S_{j}^{-i}}|=1$.

Different covariation schemes may entertain different properties which, depending on the domain of application, might be more or less desirable (see Leonelli et al., 2017a; Renooij, 2014, for a list). Applying a linear covariation scheme is very natural: if for instance $\delta_{k}=-\gamma_{k}$ , then $\tilde{\theta}_{k}=\delta_{k}(1-\tilde{\theta}_{i})$ and the scheme assigns a proportion $\delta_{k}$ of the remaining probability mass to $\tilde{\theta}_{k}$ . Notice that uniform and proportional schemes are specific instances of linear covariations. Another used covariation scheme is the order-preserving one (see Renooij, 2014, for details).
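The three schemes can be sketched in a few lines, acting on a single block $\theta_{S_{j}}$. This is an illustrative sketch only: the function names, the 0-based position `i` of the varied parameter, and the new value 0.7 are assumptions, while the block $(\theta_{3},\theta_{4},\theta_{5})=(0.15,0.6,0.25)$ comes from Table 1.

```python
from math import isclose

def proportional(theta_S, i, new_value):
    """Rescale the other parameters by (1 - new) / (1 - old)."""
    scale = (1 - new_value) / (1 - theta_S[i])
    return [new_value if k == i else scale * t for k, t in enumerate(theta_S)]

def uniform(theta_S, i, new_value):
    """Share the residual mass 1 - new equally among the other parameters."""
    share = (1 - new_value) / (len(theta_S) - 1)
    return [new_value if k == i else share for k in range(len(theta_S))]

def linear(theta_S, i, new_value, gamma, delta):
    """Set theta_k = gamma[k] * new + delta[k] for k != i (entries at i unused)."""
    out = [new_value if k == i else gamma[k] * new_value + delta[k]
           for k in range(len(theta_S))]
    if not isclose(sum(out), 1.0):
        raise ValueError("gamma and delta must make the block sum to one")
    return out

theta_S2 = [0.15, 0.6, 0.25]            # (theta_3, theta_4, theta_5), Table 1
pro = proportional(theta_S2, 1, 0.7)    # vary theta_4 from 0.6 to 0.7
uni = uniform(theta_S2, 1, 0.7)

# Proportional covariation as a linear scheme:
# delta_k = -gamma_k = theta_k / (1 - theta_i).
delta = [t / (1 - theta_S2[1]) for t in theta_S2]
lin = linear(theta_S2, 1, 0.7, [-d for d in delta], delta)
```

Here `pro` and `lin` coincide, illustrating that the proportional scheme is the linear scheme with $\delta_{k}=-\gamma_{k}=\theta_{k}/(1-\theta_{i})$.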

4 Sensitivity functions

Sensitivity functions represent the functional relationship between a parameter being varied and the output probability of an event of interest. These are often used in practice since, for instance, the parameter specifications of a MM may imply event probabilities which appear unreasonable to a user, despite being a coherent consequence of their beliefs. Sensitivity functions depict the required change of a parameter that would give a reasonable event probability.

Consider a $MM(A,\theta,S)$ and an event $E\subset\mathbb{Y}$ of interest. Definition 5 gives the probability of an event $E$ as a function of a covariation scheme.

Definition 5.

Let $\sigma$ be a $\tilde{\theta}_{i}$-covariation scheme. For $\operatorname{P}\in\operatorname{MM}(A,\theta,S)$, the probability $\sigma(\operatorname{P})(E)$ read as a function of $\tilde{\theta}_{i}$ is called the sensitivity function associated to $\sigma$.

The following theorem derives the general form of sensitivity functions in non-multilinear MMs as well as their form for specific covariation schemes.

Theorem 1.

Let $\operatorname{P}\in MM(A,\theta,S)$ , $E\subset\mathbb{Y}$ and suppose the parameter $\theta_{i}$ is varied, where $i\in S_{j}$ . Then

1. for a generic $\theta_{i}$ -covariation scheme $\sigma$

[TABLE]

2. for proportional covariation $\sigma_{\operatorname{pro}}$

[TABLE]

3. for uniform covariation $\sigma_{\operatorname{uni}}$

[TABLE]

4. for linear covariation $\sigma_{\operatorname{lin}}$

[TABLE]

Proof.

For equation (1) notice that

[TABLE]

The form of the sensitivity function under different covariation schemes follows from equation (1) by plugging in their definitions given in Definition 3. ∎

From Theorem 1 it is then easy to deduce the polynomial properties of the sensitivity function in general MMs.

Corollary 1.

For proportional, uniform and linear $\tilde{\theta}_{i}$ -covariation schemes, the sensitivity function $\sigma(\operatorname{P})(E)$ is a polynomial in $\tilde{\theta}_{i}$ of degree $\max_{y\in E}|A_{y,S_{j}}|$ .

This follows from the form of the sensitivity functions given in equations (2)-(4).

Notice that, unlike in multilinear MMs, where the sensitivity function is linear for any linear covariation scheme, the sensitivity function is more generally polynomial in non-multilinear MMs. However, there are cases where sensitivity functions are simply linear, as formalized by the following corollary.

Corollary 2.

In the notation of Theorem 1, if $0\leq|A_{y,S_{j}}|\leq 1$ for all $y\in E$ , then $\sigma(\operatorname{P})(E)$ is a linear function of $\tilde{\theta}_{i}$ for any linear $\tilde{\theta}_{i}$ -covariation scheme.

This follows from Corollary 1 since if $0\leq|A_{y,S_{j}}|\leq 1$ then the sensitivity function is a polynomial of degree at most 1.
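Corollaries 1 and 2 can be checked numerically on a small example. The sketch below uses a hypothetical toy staged tree that is not taken from the paper: a single stage with parameters $(\theta_{0},\theta_{1},\theta_{2})$ is visited at the root and, after edge 1 only, visited a second time. It reads $|A_{y,S_{j}}|$ as the sum of the exponents of atom $y$ over the stage, which is an assumption about the notation, and it recovers the degree of each sensitivity function from exact finite differences.

```python
from fractions import Fraction
from math import prod

# Hypothetical toy tree: atoms are root-to-leaf paths, listed by edge index.
# Atomic probabilities: th0, th2, th1*th0, th1**2, th1*th2 (they sum to 1).
ATOMS = [(0,), (2,), (1, 0), (1, 1), (1, 2)]
A = {y: [y.count(k) for k in range(3)] for y in ATOMS}  # exponent rows


def proportional(theta, i, new_value):
    """Standard proportional covariation of the stage parameters."""
    scale = (1 - new_value) / (1 - theta[i])
    return [new_value if k == i else scale * t for k, t in enumerate(theta)]


def sensitivity(theta, event, i, new_value, scheme=proportional):
    """Probability of `event` after theta_i is set to `new_value`."""
    varied = scheme(theta, i, new_value)
    return sum(prod(t ** a for t, a in zip(varied, A[y])) for y in event)


def degree_bound(event):
    """max over atoms in the event of the total exponent of the stage."""
    return max(sum(A[y]) for y in event)


def poly_degree(f, n_points=8):
    """Degree of a polynomial f (degree < n_points - 1) via exact differences."""
    values = [f(Fraction(k, 10)) for k in range(1, n_points + 1)]
    degree = -1
    while any(v != 0 for v in values):
        values = [b - a for a, b in zip(values, values[1:])]
        degree += 1
    return degree


theta = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
linear_event = [(0,), (2,)]   # every atom has total exponent 1
quadratic_event = [(1, 1)]    # total exponent 2

for event in (linear_event, quadratic_event):
    f = lambda x, e=event: sensitivity(theta, e, 0, x)
    print(event, degree_bound(event), poly_degree(f))
```

For the first event the bound and the observed degree are both 1, as Corollary 2 predicts; for the second they are both 2, even though the varied parameter $\theta_{0}$ does not itself appear in the monomial, because the covaried $\tilde{\theta}_{1}$ is linear in $\tilde{\theta}_{0}$ .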

The previous results formalize the form of sensitivity functions for marginal probabilities. Conditional sensitivity functions represent the functional relationship between a conditional probability and a varied parameter.

Corollary 3.

The conditional sensitivity function $\sigma(\operatorname{P})(E\,|\,C)$ is the ratio of sensitivity functions $\sigma(\operatorname{P})(E\cap C)/\sigma(\operatorname{P})(C)$ , where each of these has the properties formalized in Theorem 1, Corollary 1 and Corollary 2.

This result easily follows from the definition of conditional probability.
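As a worked illustration, suppose a proportional, uniform or linear scheme is used and write $d=\max_{y\in C}|A_{y,S_{j}}|$ . By Corollary 1 both marginal sensitivity functions are polynomials of degree at most $d$ , so with hypothetical coefficients $a_{k}$ and $b_{k}$ (introduced here only for illustration) the conditional sensitivity function takes the form

```latex
\sigma(\operatorname{P})(E\,|\,C)
  = \frac{\sigma(\operatorname{P})(E\cap C)}{\sigma(\operatorname{P})(C)}
  = \frac{\sum_{k=0}^{d} a_{k}\,\tilde{\theta}_{i}^{\,k}}
         {\sum_{k=0}^{d} b_{k}\,\tilde{\theta}_{i}^{\,k}} .
```

When $d=1$ this reduces to $(a_{0}+a_{1}\tilde{\theta}_{i})/(b_{0}+b_{1}\tilde{\theta}_{i})$ , the quotient of two linear functions familiar from BNs, which is in general not linear in $\tilde{\theta}_{i}$ .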

Example 3.

To illustrate the different form of sensitivity functions in multilinear and non-multilinear models, consider the staged trees from the educational example of Section 2.2. The two staged tree structures are embellished by the probability specifications given in Table 1. For ease of comparison the probability distributions from the stages $\{v_{0}\}$ and $\{F_{1,A},F_{1,B}\}$ are equally defined in the two trees. The distribution of the stage $\{A,B,P_{1,A},D_{1,A},P_{1,B},D_{1,B},P_{R,A},P_{R,B}\}$ in the non-multilinear staged tree of Figure 3 is such that the parameters $\theta_{3}$ , $\theta_{4}$ and $\theta_{5}$ are chosen from the probabilities underlying the tree in Figure 2 as $(\theta_{3}+\theta_{8})/2$ , $(\theta_{4}+\theta_{9})/2$ and $(\theta_{5}+\theta_{10})/2$ , respectively. Suppose the parameter $\theta_{4}$ is varied in both cases: notice that for the first tree this is the probability of passing the exam in the second semester, whilst for the tree in Figure 3 this is the probability of passing an exam at any point.

The probabilities of four events are considered here. First, the sensitivity function for a $\theta_{4}$ variation of not being admitted to the second semester is for both trees $\theta_{1}\theta_{3}\tilde{\theta}_{6}+\theta_{2}\theta_{3}\tilde{\theta}_{6}$ , where $\tilde{\theta}_{6}$ depends on the covariation scheme used. Thus in both models this function is simply linear whenever the covariation scheme is linear, even though the second tree is a non-multilinear model. These sensitivity functions are reported in Figure 4(a). Under uniform covariation, the sensitivity function is the same for the two trees, whilst under proportional covariation they differ.

The second event considered is failing the exam in the second semester. For the multilinear tree the associated sensitivity function can be written as $\theta_{8}(\theta_{1}+\theta_{2})(\tilde{\theta}_{3}\theta_{7}+\tilde{\theta}_{4}+\tilde{\theta}_{5})$ , whilst for the non-multilinear tree this is $(\theta_{1}+\theta_{2})(\tilde{\theta}_{3}^{2}\theta_{7}+\tilde{\theta}_{4}\tilde{\theta}_{3}+\tilde{\theta}_{5}\tilde{\theta}_{3})$ . Thus in this case the sensitivity function is a non-linear function of the varied parameter, as reported in Figure 4(b), but for both trees the sensitivity function is decreasing.

For the event of passing both exams with distinction the sensitivity functions for the two trees are markedly different, as reported in Figure 4(c). For the multilinear tree, the sensitivity function is slightly increasing and almost identical for uniform and proportional covariation. Conversely, for the non-multilinear tree this is decreasing non-linearly. Formally, for the multilinear tree the sensitivity function is $(\theta_{1}+\theta_{2})\tilde{\theta}_{5}\theta_{10}$ , whilst for the non-multilinear tree this is $(\theta_{1}+\theta_{2})\tilde{\theta}_{5}^{2}$ .

Lastly, the conditional probability of obtaining a distinction in the first semester given that a distinction was given in the second one is computed. In this case, the sensitivity function is a ratio of polynomials and as such is not linear even for multilinear models. This is shown in Figure 4(d). As for the first event considered, the sensitivity functions under uniform covariation are equal for the two trees.

5 Divergence quantification

Once viable parameter variations have been identified, via the study of sensitivity functions as illustrated in Section 4, the overall effect that these would have on the model’s distribution is studied. This is carried out by computing various distances and divergences between the original and the varied distributions.

5.1 The CD distance in non-multilinear models

The measure of dissimilarity which is most commonly used in sensitivity analysis in graphical models is the so-called CD distance (Chan and Darwiche, 2005a).

Definition 6.

The CD distance between two probability distributions $\tilde{\operatorname{P}}$ and $\operatorname{P}$ over a discrete sample space $\mathbb{Y}$ is

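$$\mathcal{D}_{\operatorname{CD}}(\tilde{\operatorname{P}},\operatorname{P})=\log\max_{y\in\mathbb{Y}}\frac{\tilde{\operatorname{P}}(y)}{\operatorname{P}(y)}-\log\min_{y\in\mathbb{Y}}\frac{\tilde{\operatorname{P}}(y)}{\operatorname{P}(y)}.$$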

For single and specific multi-way parameter variations, proportional covariation minimizes the CD distance in BN models, as well as in any multilinear MM (Chan and Darwiche, 2002; Leonelli et al., 2017a). However, in non-multilinear models, even for single parameter variations, proportional covariation does not minimize the CD distance in general, as shown by the following example.

Example 4.

Consider two random variables $Y_{1}$ and $Y_{2}$ and suppose $\mathbb{Y}_{1}=\mathbb{Y}_{2}=[3]$ . Suppose also

[TABLE]

and $\theta_{i+3}=\operatorname{P}(Y_{2}=i\,|\,Y_{1}=3).$ The atomic probabilities of this model are clearly non-multilinear. Suppose $\theta_{1}$ is varied and $\theta_{2}$ and $\theta_{3}$ are covaried. Suppose $\theta_{1}=0.33$ , $\theta_{2}=0.33$ , $\theta_{3}=0.34$ and let $\theta_{1}$ be varied to $0.4$ (the values of $\theta_{4}$ , $\theta_{5}$ and $\theta_{6}$ do not affect the CD distance). In this situation the CD distance under a proportional scheme is $2.52$ , whilst under a uniform scheme it equals $2.50$ . For this parameter variation, the uniform scheme would then be preferred to a proportional one if a user wishes to minimize the CD distance. Conversely, if $\theta_{1}$ is set to $0.2$ the distance is smaller under the proportional scheme $(2.89)$ than under the uniform one $(2.92)$ .
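The comparison in this example can be explored with a short Python sketch. Two assumptions are made and should be read as such: the CD distance is taken in the usual Chan and Darwiche form (log of the largest probability ratio minus log of the smallest), and the joint specification is a guess in which $\theta_{i}$ is both $\operatorname{P}(Y_{1}=i)$ and $\operatorname{P}(Y_{2}=i\,|\,Y_{1}\in\{1,2\})$ . All function names are illustrative. Under these assumptions the ordering of the two schemes agrees with the example for both variations, but the magnitudes are not expected to match the quoted values.

```python
from math import log


def cd_distance(p_tilde, p):
    """log(max ratio) - log(min ratio) over atoms with positive probability."""
    ratios = [q / r for q, r in zip(p_tilde, p)]
    return log(max(ratios)) - log(min(ratios))


def proportional(theta, i, new_value):
    scale = (1.0 - new_value) / (1.0 - theta[i])
    return [new_value if k == i else scale * t for k, t in enumerate(theta)]


def uniform(theta, i, new_value):
    share = (1.0 - new_value) / (len(theta) - 1)
    return [new_value if k == i else share for k in range(len(theta))]


def atoms(first, third):
    """ASSUMED model: Y1 ~ first; Y2 | Y1 in {1, 2} ~ first; Y2 | Y1 = 3 ~ third."""
    probs = []
    for a in range(3):
        second = third if a == 2 else first
        probs.extend(first[a] * second[b] for b in range(3))
    return probs


theta = [0.33, 0.33, 0.34]   # theta_1, theta_2, theta_3 from the example
theta_3 = [0.2, 0.3, 0.5]    # arbitrary theta_4..theta_6; cancels in every ratio

base = atoms(theta, theta_3)
for new_value in (0.4, 0.2):
    pro = cd_distance(atoms(proportional(theta, 0, new_value), theta_3), base)
    uni = cd_distance(atoms(uniform(theta, 0, new_value), theta_3), base)
    print(new_value, round(pro, 4), round(uni, 4))
```

Changing `theta_3` leaves both distances unchanged, in line with the remark that $\theta_{4}$ , $\theta_{5}$ and $\theta_{6}$ do not affect the CD distance.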

Next the form of the CD distance in MMs is derived in general and for specific covariation schemes. For all $\emptyset\neq H\subset[k]$ define $\mathbb{Y}_{H}^{=}=\{y\in\mathbb{Y}:A_{y,i}=0\mbox{ for all }i\in H\}$ and let $\mathbb{Y}_{H}^{\neq}=\mathbb{Y}\setminus\mathbb{Y}_{H}^{=}$ . The set $\mathbb{Y}_{H}^{\neq}$ includes the events for which at least one parameter with index in $H$ has a non-zero exponent.

Theorem 2.

Let $\operatorname{P}\in MM(A,\theta,S)$ and suppose the parameter $\theta_{i}$ is varied, where $i\in S_{j}$ . Then

1. for a generic $\theta_{i}$ -covariation scheme $\sigma$

[TABLE]

2. for proportional covariation $\sigma_{\operatorname{pro}}$

[TABLE]

3. for uniform covariation $\sigma_{\operatorname{uni}}$

[TABLE]

4. for linear covariation $\sigma_{\operatorname{lin}}$

[TABLE]

Proof.

For equation (5) notice that

[TABLE]

where the last equality holds since, for all $y\in\mathbb{Y}_{S_{j}}^{=}$ , $(\tilde{\theta}_{S_{j}}/\theta_{S_{j}})^{A_{y,S_{j}}}=1$ and there are always both larger and smaller ratios between varied and original parameters.

The form of the CD distance under different covariation schemes follows from equation (5) by plugging in their definitions given in Definition 3. ∎

One of the reasons why the CD distance is commonly used for sensitivity analysis in BNs is that, for a single parameter variation, the distance between the BN distributions equals the distance between the single conditional probability distributions associated to the varied parameters (Chan and Darwiche, 2002). Theorem 2 demonstrates that this is true in general for non-multilinear models since the distance only depends on the parameter $\theta_{S_{j}}$ .

Example 5.

As in Example 3, suppose the parameter $\theta_{4}$ is varied in the two staged trees from the educational example of Section 2.2. From the results of Leonelli et al. (2017a), it can be deduced that for the multilinear tree, the CD distance between the original and varied distributions is simply

[TABLE]

Conversely, using Theorem 2, for the non-multilinear staged tree this equals

[TABLE]

The specific form of the CD distance for uniform covariation can be deduced from equation (7) by simply substituting $\tilde{\theta}_{3}$ and $\tilde{\theta}_{5}$ with $(1-\tilde{\theta}_{4})/2$ . For proportional covariation the CD distance greatly simplifies and can be written as

[TABLE]

which, as formalized by Theorem 2, only depends on the original and varied values of $\theta_{4}$ .

The CD distances for proportional and uniform covariation and any possible varied value of $\theta_{4}$ are reported in Figure 5. Although the shapes of the distances are similar for the two trees, the CD distance is larger for the non-multilinear tree. Notice that although for this application the CD distance for proportional covariation is always smaller than for uniform covariation, Example 4 above gives an illustration where this is not the case.

Theorem 2 and Example 5 show that for single parameter variations the CD distance in non-multilinear models does not simply correspond to the distance between distributions defined over one element of the partition $S$ (as in equation (6) for the multilinear staged tree). However, there are parameter variations in non-multilinear models where this is the case, as formalized by Corollary 4.

Corollary 4.

In the notation of Theorem 2, suppose $0\leq|A_{y,S_{j}}|\leq 1$ for all $y\in\mathbb{Y}_{S_{j}}^{\neq}$ . Then

1. for a generic $\theta_{i}$ -covariation scheme $\sigma$

[TABLE]

2. for proportional covariation $\sigma_{\operatorname{pro}}$

[TABLE]

3. for uniform covariation $\sigma_{\operatorname{uni}}$

[TABLE]

4. for linear covariation $\sigma_{\operatorname{lin}}$ , where $\delta_{k}=-\gamma_{k}$ for all $k\in S_{j}^{-i}$ ,

[TABLE]

Proof.

Equation (8) follows from equation (5) by imposing the condition $0\leq|A_{y,S_{j}}|\leq 1$ . Equation (8) then coincides with the CD distance between one conditional probability distribution in BNs and its varied version, and the specific form of the distance under different covariation schemes can be derived as in Renooij (2014). ∎

Corollary 4 generalizes the results of Renooij (2014), which derive the specific form of the sensitivity function for various covariation schemes in BNs, to the case of non-multilinear models for specific choices of varied parameter. Importantly, the form of the CD distance derived in Corollary 4 implies that for some varied parameters proportional covariation can be shown to be optimal.

Theorem 3.

Under the conditions of Corollary 4, proportional covariation minimizes the CD distance between the original and varied distribution amongst all possible covariation schemes.

Proof.

The theorem follows from equation (8) which is the CD distance between one conditional probability distribution in BNs and its varied version. As proven in Chan and Darwiche (2002) this distance is minimized by proportional covariation. ∎

Theorem 3 therefore extends the results of Chan and Darwiche (2002) and Leonelli et al. (2017a) which prove the optimality of proportional covariation for BNs and multilinear MMs to specific sensitivity analyses in non-multilinear models.

Example 6.

For the non-multilinear staged tree in Figure 3, consider the stage $\{F_{1,A},F_{1,B}\}$ . Suppose there is an additional edge coming out of this stage ending in a leaf (for example by splitting the fail result into badly failed and moderately failed). Then one could show that the columns associated to the parameters of the stage probability distribution in the $A$ matrix have only zero or one entries. This can also be seen graphically since $F_{1,A}$ and $F_{1,B}$ are not along the same root-to-leaf path. Therefore, by Theorem 3, if one probability from this stage distribution is varied then by proportionally covarying the remaining parameters the CD distance between the original staged tree distribution and the new one is minimized.
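The condition of Corollary 4 is easy to test mechanically once the exponent matrix is available. The following Python sketch is illustrative only: the function name and the small matrix are hypothetical and do not encode the tree of Figure 3, and $|A_{y,S_{j}}|$ is read as the sum of the exponents of atom $y$ over the columns in $S_{j}$ .

```python
from typing import Iterable, Sequence


def total_exponent_at_most_one(A: Sequence[Sequence[int]], stage: Iterable[int]) -> bool:
    """True if, for every atom (row of A), the stage's exponents sum to 0 or 1."""
    columns = list(stage)
    return all(sum(row[k] for k in columns) <= 1 for row in A)


# Hypothetical exponent matrix with columns (th0, th1 | th2, th3):
# stage {0, 1} is met once on every path, stage {2, 3} twice on some paths.
# Atomic probabilities: th0, th1*th2, th1*th2*th3, th1*th3**2.
A = [
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 1, 0, 2],
]

print(total_exponent_at_most_one(A, [0, 1]))  # condition holds for this stage
print(total_exponent_at_most_one(A, [2, 3]))  # condition fails for this stage
```

When the check succeeds for the stage of the varied parameter, Theorem 3 guarantees that proportional covariation minimizes the CD distance; when it fails, no such guarantee is available from the results above.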

5.2 $\phi$ -divergences in non-multilinear models

Another class of divergences which is often used in practice is the so-called $\phi$ -divergence (Ali and Silvey, 1966).

Definition 7.

The $\phi$ -divergence from $\tilde{\operatorname{P}}$ to $\operatorname{P}$ over a discrete sample space $\mathbb{Y}$ is

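$$\mathcal{D}_{\phi}(\tilde{\operatorname{P}},\operatorname{P})=\sum_{y\in\mathbb{Y}}\operatorname{P}(y)\,\phi\!\left(\frac{\tilde{\operatorname{P}}(y)}{\operatorname{P}(y)}\right),\qquad\phi\in\Phi,$$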

where $\Phi$ is the class of convex functions $\phi(x)$ , $x\geq 0$ , such that $\phi(1)=0$ , $0\phi(0/0)=0$ and $0\phi(x/0)=\lim_{x\rightarrow+\infty}\phi(x)/x$ .

By definition, and unlike CD distances, $\phi$ -divergences are not symmetric in general, i.e. $\mathcal{D}_{\phi}(\tilde{\operatorname{P}},\operatorname{P})\neq\mathcal{D}_{\phi}(\operatorname{P},\tilde{\operatorname{P}})$ . Notice that this class includes a large number of commonly used divergences, most notably the Kullback-Leibler divergence (Kullback and Leibler, 1951) for $\phi(x)=x\log(x)$ and the inverse Kullback-Leibler divergence for $\phi(x)=-\log(x)$ .
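A minimal Python sketch of these two divergences follows. It assumes the usual Ali and Silvey form $\sum_{y}\operatorname{P}(y)\phi(\tilde{\operatorname{P}}(y)/\operatorname{P}(y))$ and strictly positive $\operatorname{P}$ , so the limiting convention for $0\phi(x/0)$ is not implemented. The function names and the two small distributions are hypothetical.

```python
from math import log


def phi_divergence(p_tilde, p, phi):
    """sum_y P(y) * phi(P~(y) / P(y)); assumes every P(y) is strictly positive."""
    return sum(r * phi(q / r) for q, r in zip(p_tilde, p))


def kl_phi(x):
    """phi(x) = x log x, extended by continuity with phi(0) = 0."""
    return x * log(x) if x > 0 else 0.0


def inverse_kl_phi(x):
    """phi(x) = -log x (requires x > 0)."""
    return -log(x)


p = [0.5, 0.3, 0.2]          # original distribution (illustrative)
p_tilde = [0.4, 0.4, 0.2]    # varied distribution (illustrative)

forward = phi_divergence(p_tilde, p, kl_phi)
backward = phi_divergence(p, p_tilde, kl_phi)
inverse = phi_divergence(p_tilde, p, inverse_kl_phi)
print(forward, backward, inverse)
```

The first two printed values differ, which illustrates the lack of symmetry, while the third coincides with the second: the inverse Kullback-Leibler divergence is the Kullback-Leibler divergence with its two arguments exchanged.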

Proposition 1.

Let $\operatorname{P}\in MM(A,\theta,S)$ and suppose the parameter $\theta_{i}$ is varied, where $i\in S_{j}$ . Then for a generic $\theta_{i}$ -covariation scheme $\sigma$

[TABLE]

Proof.

Notice that

[TABLE]

where the last equality follows by noting that for all $y\in\mathbb{Y}_{S_{j}}^{=}$ the term in the summation is $0\phi(0/0)$ which by definition is equal to zero. ∎

Notice that as for BNs and multilinear MMs, $\phi$ -divergences do not depend on the parameter vector $\theta_{S_{j}}$ of the varied parameter only, but on the full $\theta$ . Therefore, their computation in practice is more expensive than for CD distances. Furthermore, due to this extra complexity, $\phi$ -divergences do not simplify greatly for specific covariation schemes. To see this, the $\phi$ -divergence under proportional covariation can be written as

[TABLE]

which still depends on the full parameter vector $\theta$ . The specific form of the $\phi$ -divergence under other covariation schemes can be easily deduced by plugging their definitions into equation (9).

Example 7.

The Kullback-Leibler divergences for proportional and uniform covariation and any possible varied value of $\theta_{4}$ in the trees of Section 2.2 are reported in Figure 5. The form and the value of the divergences for the two trees are similar. Notice that for this example the Kullback-Leibler divergence is always smaller for proportional covariation than uniform covariation, although there is no theoretical guarantee that this is always the case.

6 Discussion

The representation of probabilistic graphical models in terms of the defining atomic monomial probabilities has proven useful in sensitivity analysis. Here a general approach for this type of analyses in models whose atomic probabilities are non-multilinear, including DBNs, hidden Markov models and staged trees, is introduced. The form of the sensitivity functions and various distances/divergences is derived here for a variety of covariation schemes, and their properties studied. In general these are different to their counterparts in multilinear MMs and exhibit a more complex structure. One optimality result for proportional covariation is also presented, giving an even stronger justification for the use of this scheme in practice.

The examples presented suggest that proportional covariation minimizes both CD distances and $\phi$ -divergences under much milder conditions than the ones given in Theorem 3. However, it is currently unknown under which conditions proportional covariation is optimal in general. General conditions of optimality in multilinear models have been derived only recently in Leonelli and Riccomagno (2018). The identification of these in the more general non-multilinear case is the subject of ongoing research.

Software for carrying out sensitivity analysis in practice is still very limited (see samIam for a notable exception). A package for sensitivity analysis in BNs, and more generally for MMs, in the open-source R software (R Core Team, 2018) is currently under development. The development of such a package is critical and could be of great benefit for the whole AI community.

Acknowledgements

The author kindly thanks Christiane Görgen and Jim Q. Smith for comments on previous versions of the manuscript.

References

Ali and Silvey (1966)

S. M. Ali and S. D. Silvey.

A general class of coefficients of divergence of one distribution from another.

Journal of the Royal Statistical Society Series B, 28:131–142, 1966.

Amsalu et al. (2017)

S. B. Amsalu, A. Homaifar, and A. C. Esterline.

A simplified matrix formulation for sensitivity analysis of hidden Markov models.

Algorithms, 10(3):97, 2017.

Boutilier et al. (1996)

C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller.

Context-specific independence in Bayesian networks.

In Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence, pages 115–123, 1996.

Brandherm and Jameson (2004)

B. Brandherm and A. Jameson.

An extension of the differential approach for Bayesian network inference to dynamic Bayesian networks.

International Journal of Intelligent Systems, 19(8):727–748, 2004.

Castillo and Kjærulff (2003)

E. Castillo and U. Kjærulff.

Sensitivity analysis in Gaussian Bayesian networks using a symbolic-numerical technique.

Reliability Engineering & System Safety, 79(2):139–148, 2003.

Castillo et al. (1995)

E. Castillo, J. M. Gutiérrez, and A. S. Hadi.

Parametric structure of probabilities in Bayesian networks.

In European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty, pages 89–98. Springer, 1995.

Castillo et al. (1997)

E. Castillo, J. M. Gutiérrez, and A. S. Hadi.

Sensitivity analysis in discrete Bayesian networks.

IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 27(4):412–423, 1997.

Chan and Darwiche (2002)

H. Chan and A. Darwiche.

When do numbers really matter?

Journal of Artificial Intelligence Research, 17:265–287, 2002.

Chan and Darwiche (2005a)

H. Chan and A. Darwiche.

A distance measure for bounding probabilistic belief change.

International Journal of Approximate Reasoning, 38:149–174, 2005a.

Chan and Darwiche (2005b)

H. Chan and A. Darwiche.

Sensitivity analysis in Markov networks.

In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI), pages 1300–1305, 2005b.

Charitos and van der Gaag (2006a)

T. Charitos and L. C. van der Gaag.

Sensitivity analysis of Markovian models.

In Proceedings of the 19th International Florida Artificial Intelligence Research Society Conference, pages 806–811, 2006a.

Charitos and van der Gaag (2006b)

T. Charitos and L. C. van der Gaag.

Sensitivity analysis for threshold decision making with DBNs.

In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, pages 72–79, 2006b.

Coupé and Van Der Gaag (2002)

V. M. H. Coupé and L. C. Van Der Gaag.

Properties of sensitivity analysis of Bayesian belief networks.

Annals of Mathematics and Artificial Intelligence, 36(4):323–356, 2002.

Darwiche (2003)

A. Darwiche.

A differential approach to inference in Bayesian networks.

Journal of the ACM, 50(3):280–305, 2003.

de Cooman et al. (2008)

G. de Cooman, F. Hermans, and E. Quaeghebeur.

Sensitivity analysis for finite Markov chains in discrete time.

In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, pages 129–136, 2008.

Freeman and Smith (2011)

G. Freeman and J. Q. Smith.

Bayesian MAP model selection of chain event graphs.

Journal of Multivariate Analysis, 102:1152–1165, 2011.

Gómez-Villegas et al. (2013)

M. A. Gómez-Villegas, P. Main, and R. Susi.

The effect of block parameter perturbations in Gaussian Bayesian networks: sensitivity and robustness.

Information Sciences, 222:429–458, 2013.

Görgen and Leonelli (2018)

C. Görgen and M. Leonelli.

Model-preserving sensitivity analysis for families of Gaussian distributions.

arXiv:1809.10794, 2018.

Görgen et al. (2015)

C. Görgen, M. Leonelli, and J. Q. Smith.

A differential approach for staged trees.

In Symbolic and Quantitative Approaches to Reasoning with Uncertainty, pages 346–355. Springer, 2015.

Koller et al. (2009)

D. Koller, N. Friedman, and F. Bach.

Probabilistic graphical models: principles and techniques.

MIT press, 2009.

Kullback and Leibler (1951)

S. Kullback and R. A. Leibler.

On information and sufficiency.

The Annals of Mathematical Statistics, 22:79–86, 1951.

Laskey (1995)

K. B. Laskey.

Sensitivity analysis for probability assessments in Bayesian networks.

IEEE Transactions on Systems, Man, and Cybernetics, 25(6):901–909, 1995.

Leonelli and Riccomagno (2018)

M. Leonelli and E. Riccomagno.

A geometric characterization of sensitivity analysis in monomial models.

arXiv:, 2018.

Leonelli et al. (2017a)

M. Leonelli, C. Görgen, and J. Q. Smith.

Sensitivity analysis in multilinear probabilistic models.

Information Sciences, 411:84–97, 2017a.

Leonelli et al. (2017b)

M. Leonelli, E. Riccomagno, and J. Q. Smith.

A symbolic algebra for the computation of expected utilities in multiplicative influence diagrams.

Annals of Mathematics and Artificial Intelligence, 81(3-4):273–313, 2017b.

Nur et al. (2009)

D. Nur, D. Allingham, J. Rousseau, K. L. Mengersen, and R. McVinish.

Bayesian hidden Markov model for DNA sequence segmentation: A prior sensitivity analysis.

Computational Statistics & Data Analysis, 53(5):1873–1882, 2009.

Oberguggenberger et al. (2009)

M. Oberguggenberger, J. King, and B. Schmelzer.

Classical and imprecise probability methods for sensitivity analysis in engineering: A case study.

International Journal of Approximate Reasoning, 50(4):680–693, 2009.

Pollino et al. (2007)

C. A. Pollino, O. Woodberry, A. Nicholson, K. Korb, and B. T. Hart.

Parameterisation and evaluation of a Bayesian network for use in an ecological risk assessment.

Environmental Modelling & Software, 22(8):1140–1152, 2007.

R Core Team (2018)

R Core Team.

R: A Language and Environment for Statistical Computing.

R Foundation for Statistical Computing, Vienna, Austria, 2018.

URL https://www.R-project.org/.

Renooij (2012)

S. Renooij.

Efficient sensitivity analysis in hidden Markov models.

International Journal of Approximate Reasoning, 53(9):1397–1414, 2012.

Renooij (2014)

S. Renooij.

Co-variation for sensitivity analysis in Bayesian networks: properties, consequences and alternatives.

International Journal of Approximate Reasoning, 55:1022–1042, 2014.

(32)

samIam.

Sensitivity analysis, modeling, inference and more.

URL http://reasoning.cs.ucla.edu/samiam/.

Smith (2010)

J. Q. Smith.

Bayesian decision analysis: principles and practice.

Cambridge University Press, 2010.

Smith and Anderson (2008)

J. Q. Smith and P. E. Anderson.

Conditional independence and chain event graphs.

Artificial Intelligence, 172:42–68, 2008.

Uusitalo (2007)

L. Uusitalo.

Advantages and challenges of Bayesian networks in environmental modelling.

Ecological Modelling, 203(3-4):312–318, 2007.
