Bounding Causes of Effects with Mediators

Philip Dawid; Macartan Humphreys; Monica Musio

arXiv:1907.00399·math.ST·July 2, 2019


TL;DR

This paper develops a method to bound the probability of causation in binary exposure-outcome scenarios by leveraging the structure of mediators, providing formulas and bounds for various causal processes.

Contribution

It introduces a general formula for bounding the probability of causation using mediator structures, improving causal inference in complex causal processes.

Findings

1. Bounds on probability of causation can be derived from mediator data.

2. Maximum and minimum bounds are achieved with processes of at most two steps.

3. Probability of causation can be zero with negative data, but not one even with extensive positive mediator data.


Tables

Table 1: $\Pr(Y_{0}=y_{0},Y_{1}=y_{1})$

	$Y_{1} = 0$	$Y_{1} = 1$	Total
$Y_{0} = 0$	$\frac{1}{2}(1 - \rho - \xi)$	$\frac{1}{2}(\xi + \tau)$	$\frac{1}{2}(1 + \tau - \rho)$
$Y_{0} = 1$	$\frac{1}{2}(\xi - \tau)$	$\frac{1}{2}(1 + \rho - \xi)$	$\frac{1}{2}(1 - \tau + \rho)$
Total	$\frac{1}{2}(1 - \tau - \rho)$	$\frac{1}{2}(1 + \tau + \rho)$	1

Table 2: Largest and smallest achievable upper and lower bounds from decompositions of any length, given that no mediators are observed, positive evidence is observed for all mediators, or mixed evidence is observed. (*) indicates that PC can be identified.

		No evidence	Positive evidence	Mixed evidence
Largest	Upper	$\overline{u\mathrm{UB}} = \frac{1 + \tau - |\rho|}{1 + \tau + \rho}$	$\overline{o\mathrm{UB}} = \min\{1, 1 - \rho\}$	$\overline{m\mathrm{UB}} = 1$
	Lower	$\overline{u\mathrm{LB}} = \frac{2\tau}{1 + \tau + \rho}$	$\overline{o\mathrm{LB}} = \frac{1 + \tau - \rho}{2}$	$\overline{m\mathrm{LB}} = 0$
Smallest	Upper	$\underline{u\mathrm{UB}} = \frac{2\tau}{1 + \tau + \rho}$ (*)	$\underline{o\mathrm{UB}} = \frac{2\tau}{1 + \tau + \rho}$ (*)	$\underline{m\mathrm{UB}} = 0$ (*)
	Lower	$\underline{u\mathrm{LB}} = \frac{2\tau}{1 + \tau + \rho}$	$\underline{o\mathrm{LB}} = \frac{2\tau}{1 + \tau + \rho}$	$\underline{m\mathrm{LB}} = 0$

Equations

$\tau$

$\rho$

$P=P(\tau,\rho):=\begin{pmatrix}\frac{1}{2}(1+\tau-\rho)&\frac{1}{2}(1-\tau+\rho)\\ \frac{1}{2}(1-\tau-\rho)&\frac{1}{2}(1+\tau+\rho)\end{pmatrix}.$

$|\rho| + |\tau| \leq 1.$

$\sigma := \frac{\rho}{1 - \tau},$

$\mathrm{PC}_{xy} = \Pr(C \mid X = x, Y = y) = \Pr(C_{xy} \mid X = x, Y_{x} = y).$

$\xi = \Pr(Y_{0} = 0, Y_{1} = 1) + \Pr(Y_{0} = 1, Y_{1} = 0) = \Pr(C),$

$|\tau| \leq \xi \leq 1 - |\rho|.$

$\Pr(C_{00}) = \Pr(C_{11})$

$\Pr(C_{01}) = \Pr(C_{10})$

$\max\{0, \tau\} \leq \Pr(C_{00}) = \Pr(C_{11})$

$\max\{0, -\tau\} \leq \Pr(C_{01}) = \Pr(C_{10})$

$\mathrm{PC}_{xy}$

$s\mathrm{LB}_{00} := \frac{\max\{0, \tau\}}{\Pr(Y = 0 \mid X \leftarrow 0)} \leq \mathrm{PC}_{00}$

$s\mathrm{LB}_{10} := \frac{\max\{0, -\tau\}}{\Pr(Y = 0 \mid X \leftarrow 1)} \leq \mathrm{PC}_{10}$

$s\mathrm{LB}_{01} := \frac{\max\{0, -\tau\}}{\Pr(Y = 1 \mid X \leftarrow 0)} \leq \mathrm{PC}_{01}$

$s\mathrm{LB}_{11} := \frac{\max\{0, \tau\}}{\Pr(Y = 1 \mid X \leftarrow 1)} \leq \mathrm{PC}_{11}$

$\gamma$

$\delta$

$s\mathrm{UB}_{00}$

$s\mathrm{UB}_{01}$

$s\mathrm{UB}_{10}$

$s\mathrm{UB}_{11}$

$\mathrm{PC} = \frac{\xi + \tau}{2\Pr(Y = 1 \mid X \leftarrow 1)},$

$s\mathrm{LB}=\frac{2\tau}{1+\tau+\rho}\leq\mathrm{PC}\leq s\mathrm{UB}=\begin{cases}\delta&(\rho\geq 0)\\ 1&(\rho<0)\end{cases}$

$\Pr(M_{i+1} = m_{i+1} \mid M_{j} \leftarrow m_{j},\ j = 0, \dots, i) = \Pr(M_{i+1} = m_{i+1} \mid M_{i} \leftarrow m_{i}), \quad (i = 0, \dots, n - 1).$

$\Pr(M_{i+1} = m_{i+1} \mid M_{j} = m_{j},\ j = 0, \dots, i) = \Pr(M_{i+1} = m_{i+1} \mid M_{i} \leftarrow m_{i}), \quad (i = 0, \dots, n - 1).$

$P = P_{1} \mid P_{2} \mid \cdots \mid P_{n}$

$\tau = \tau^{(n)}$

$\rho = \rho^{(n)}$

$\rho = \rho_{1}\tau_{2} + \rho_{2}.$

$\mathbf{M}_{i}:=(M_{i0},M_{i1})$

$|\tau| \leq \xi \leq \prod_{i=1}^{n}(1 - |\rho_{i}|).$

Full text

Bounding Causes of Effects with Mediators

Philip Dawid, University of Cambridge

Macartan Humphreys, Columbia University & WZB Berlin

Monica Musio, Università degli Studi di Cagliari

Abstract

Suppose $X$ and $Y$ are binary exposure and outcome variables, and we have full knowledge of the distribution of $Y$ , given application of $X$ . From this we know the average causal effect of $X$ on $Y$ . We are now interested in assessing, for a case that was exposed and exhibited a positive outcome, whether it was the exposure that caused the outcome. The relevant “probability of causation”, PC, typically is not identified by the distribution of $Y$ given $X$ , but bounds can be placed on it, and these bounds can be improved if we have further information about the causal process. Here we consider cases where we know the probabilistic structure for a sequence of complete mediators between $X$ and $Y$ . We derive a general formula for calculating bounds on PC for any pattern of data on the mediators (including the case with no data). We show that the largest and smallest upper and lower bounds that can result from any complete mediation process can be obtained in processes with at most two steps. We also consider homogeneous processes with many mediators. PC can sometimes be identified as 0 with negative data, but it cannot be identified at 1 even with positive data on an infinite set of mediators. The results have implications for learning about causation from knowledge of general processes and of data on cases.

1 Introduction

Even the best possible evidence regarding the effects of a treatment on an outcome is generally not enough to identify the probability that the outcome was caused by the treatment.

For instance, researchers conducting randomised controlled trials may determine that providing a medicine to school children increases the overall probability of good health from one third to two thirds. This information, no matter how precise, is not enough to answer the following question: Is Ann healthy because she took the medicine? It is not even enough to answer the question probabilistically. The reason is that, consistent with these results, it may be that the medicine makes a positive change for 2 out of 3 students, but an adverse change for the remainder: in that case the medicine certainly helped Ann. But it might alternatively be that the medicine makes a positive change for 1 in 3 children but no change for the others. In that case the chances it helped Ann are just 1 in 2. Of the children taking the medicine, two thirds are healthy. Half of these are healthy because of the medicine, whereas the other half would have been healthy anyway.

Put differently, the experimental data identifies the “effects of causes,” (EoC) but we are interested in the reverse problem, of quantifying “causes of effects” (CoE). The CoE task of defining and assessing the probability of causation (Robins and Greenland, 1989) in an individual case has been considered by Tian and Pearl (2000); Dawid (2011); Yamamoto (2012); Pearl (2015); Dawid, Musio and Fienberg (2016); Dawid, Murtas and Musio (2016); Dawid, Musio and Murtas (2017); Murtas, Dawid and Musio (2017). Note that this is distinct from the “reverse causal question” of Gelman and Imbens (2013), which is an EoC task aimed at ascertaining which causes have an effect on an outcome.

To understand causes of effects better, we might seek additional evidence along causal pathways. For example, researchers evaluating development programs specify “theories of change” and seek evidence for intermediate outcomes along a pathway linking treatment to outcomes: most simply, Was the treatment received? Was the medicine ingested? Van Evera (1997) describes various tests that might be implemented using such ancillary evidence. A “smoking gun test” searches for evidence that, though unlikely to be found, would give great confidence in a claim if it were to be found; a “hoop test” is a search for evidence that we expect to find, but which, if found to be absent, would provide compelling evidence against a proposition (as if the proposition were asked to jump through a hoop).

Sometimes many points along a causal pathway are investigated. An intervention might be to provide citizens with information on political corruption, in the hope that this will lead to ultimate changes in politicians’ behavior. Researchers might then check many points along a chain of intermediate outcomes. Was the political message delivered? Was it understood? Was it believed? Did it induce a change in behavior by citizens? Did this in turn produce a change in behavior by politicians?

Seeing positive evidence at many points along such a causal chain would appear to give confidence that the final outcome is indeed due to the conjectured cause. This is the core premise of “process tracing,” as deployed by qualitative political scientists (Collier, 2011), as well as of mixed methods research as used in development evaluation (White, 2009). In the most optimistic accounts it is assumed that, as one gets close enough to a process, by observing more and more links in a chain, the link between any two steps becomes less questionable and eventually the causal process reveals itself (Mahoney, 2012, 581).

We here provide a comprehensive treatment of the scope for inferences of this form from knowledge of causal chains. We obtain a general formula for calculating bounds on the probability of causation, for an arbitrary pattern of data along chains of binary variables. We derive implications of this formula, and calculate the largest and smallest upper and lower bounds achievable from any causal chain consistent with the known relation between $X$ and $Y$ . We give special attention to what might appear to be the best possible conditions: those in which causal processes really do follow a simple causal chain, in which researchers have complete experimental evidence about the probabilistic relationship between any two consecutive nodes in the chain, in which the chain is arbitrarily long, in which the causal effect of each intermediate variable on its successor climbs to 1, and in which researchers observe outcomes consistent with positive effects at every point on the chain. We show that such information does indeed increase confidence that an outcome can be attributed to a cause and, for homogeneous chains at least, that the longer the chain the better. However, we find that even under these ideal conditions our ability to narrow the bounds for the probability of causation can be modest. In the example of attributing Ann’s health to good medicine, a homogeneous process with arbitrarily many positive intermediate steps observed might only tighten the bounds from $[.5,1]$ to $[.58,1]$ .

In contrast, we show that non-homogeneous processes can tighten the bounds considerably. For example, suppose Ann was prescribed the medicine and recovered. If we know that being prescribed the medicine is the only way in which Ann could have obtained and taken the medicine, and that taking the medicine helps anyone who would otherwise be sick, then with positive evidence on a single intermediate point on the causal chain—that Ann did indeed take the medicine—we can identify the probability that prescribing the medicine caused Ann’s recovery at $2/3$ . (We are still short of 1, because it is possible that Ann would have recovered even without the medicine.) A process like this, in which we observe a “necessary condition for a sufficient condition”, provides the largest possible lower bound on the probability of causation available from any observations on any chain. At this point we have done the best possible and more data along the chain will not help.

Although achieving identification of the probability of causation at 1 is generally elusive, negative data can yield identification at 0, either in two steps from a heterogeneous process, or from alternating data along an infinite homogeneous chain. In this sense, information on mediators can support “hoop” tests but not “smoking gun” tests.

1.1 Plan of paper

Existing results (Dawid, Murtas and Musio, 2016) have considered the case of a single unobserved mediator. We generalize this in two ways. First, we consider situations with chains of arbitrary length. Secondly, we calculate bounds for general data, that is, for situations in which the values of none, some or all the mediators are observed.

We proceed as follows. Section 2 introduces the set-up, and provides general formulae for bounding the probability of causation for a simple one-step process. In § 3 we extend these results to cases in which we know the structure of a complete mediation process. We consider various degrees of knowledge of the values of the mediators for the individual case at hand: all unobserved, all observed, or just some observed. Our main result is Theorem 4, which provides a general formula applicable to all cases.

Section 4 draws out the detailed implications of this result in a variety of contexts. In § 4.1 we investigate the largest achievable lower and upper bounds from any sequence, and find that these can be achieved by heterogeneous two-step processes. Section 4.2 examines the case of homogeneous processes of arbitrary length. We show that an alternating pattern for the values at all intermediate points can lead to a limiting value of 0 for the probability of causation. However, it is not generally possible for even the most positive evidence to identify the probability of causation (and a fortiori not possible to identify it at 1) even in the limit of infinitely many steps. Section 4.3 considers implications of our results for gathering data on mediators. In § 5 we compare the bounds based on knowledge of mediator processes with those achievable from knowledge of covariates, which can be much tighter. We summarise our findings in § 6. Various technical details for the proofs in the paper are elaborated in three appendices.

2 Preliminaries

We consider a binary treatment variable $X$ and binary outcome variable $Y$ . We suppose we have access to experimental (or unconfounded observational) data supplying values for $\Pr(Y=y\mid X\leftarrow x)$ , where we use the notation $X\leftarrow x$ to denote a regime in which $X$ is set to value $x$ by external intervention.

Define

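$$\tau := \Pr(Y=1\mid X\leftarrow 1)-\Pr(Y=1\mid X\leftarrow 0),\qquad \rho := \Pr(Y=1\mid X\leftarrow 1)-\Pr(Y=0\mid X\leftarrow 0).$$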

Then $\tau$ is the average causal effect of $X$ on $Y$ , while $\rho$ is a measure of how common $Y=1$ is.

The transition matrix from $X$ to $Y$ (where the row and column labels of any such matrix are implicitly $0$ and $1$ in that order) can be written:

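$$P=P(\tau,\rho):=\begin{pmatrix}\frac{1}{2}(1+\tau-\rho)&\frac{1}{2}(1-\tau+\rho)\\ \frac{1}{2}(1-\tau-\rho)&\frac{1}{2}(1+\tau+\rho)\end{pmatrix}. \tag{1}$$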

All entries of $P$ must be non-negative: this holds if and only if

$$|\rho| + |\tau| \leq 1. \tag{2}$$

We have equality in (2) if and only if one of the entries of (1) is 1, in which case we term $P$ degenerate. For $\tau\geq 0$ , this will happen if either $\rho=1-\tau$ , in which case $\Pr(Y=1\mid X=1)=1$ and $X=1$ can be thought of as a sufficient condition for $Y=1$ ; or $\rho=\tau-1$ , in which case $\Pr(Y=1\mid X=0)=0$ , and $X=1$ can be thought of as a necessary condition for $Y=1$ . Defining, for $\tau\geq 0$ ,

$$\sigma := \frac{\rho}{1-\tau}, \tag{3}$$

we might thus regard $\sigma\in[-1,1]$ as measuring the relative sufficiency of $X=1$ for $Y=1$. (Footnote 1: Although we do not focus on it, for $\tau<0$ the analogous quantity $\frac{-\rho}{1+\tau}$ can be interpreted as the relative sufficiency of $X=1$ for $Y=0$.)
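As a rough illustration of this parametrisation, the following Python sketch builds $P(\tau,\rho)$, checks the non-negativity condition $|\rho|+|\tau|\leq 1$, and evaluates the relative sufficiency $\sigma$. The function names are hypothetical helpers introduced here for illustration; they do not come from the paper.

```python
def transition_matrix(tau, rho):
    """P(tau, rho) as nested lists: rows index x in (0, 1), columns index y in (0, 1)."""
    return [
        [0.5 * (1 + tau - rho), 0.5 * (1 - tau + rho)],
        [0.5 * (1 - tau - rho), 0.5 * (1 + tau + rho)],
    ]


def is_valid(tau, rho, tol=1e-12):
    """All entries of P are non-negative exactly when |rho| + |tau| <= 1."""
    return abs(rho) + abs(tau) <= 1 + tol


def relative_sufficiency(tau, rho):
    """sigma = rho / (1 - tau), meaningful for 0 <= tau < 1."""
    return rho / (1 - tau)


# The school-medicine example from the introduction: tau = 1/3, rho = 0.
P = transition_matrix(1 / 3, 0.0)
print(P, is_valid(1 / 3, 0.0), relative_sufficiency(1 / 3, 0.0))
```

With $\tau=1/3$ and $\rho=0$ the second row of `P` gives the probabilities of bad and good health under treatment, matching the one-third to two-thirds example in the introduction.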

2.1 Potential outcomes and causes of effects

While knowledge of the transition matrix $P$, and in particular the “average causal effect” $\tau$, is directly relevant for EoC (“effects of causes”) analysis, it is not enough to support CoE (“causes of effects”) analysis. For this we need to introduce the pair of potential outcomes, $\mathbf{Y}=(Y_{0},Y_{1})$, where we conceive of $Y_{x}$ as the value $Y$ would take, if $X\leftarrow x$. We regard both $Y_{0}$ and $Y_{1}$ as existing simultaneously, even prior to setting the value of $X$, and as having a bivariate probability distribution.

We can now define the following events in terms of $\mathbf{Y}$ (where $\overline{x}$ denotes $1-x$ , the value distinct from $x$ , etc.):

General causation

$C^{(X,Y)}$ := “ $Y_{1}\neq Y_{0}$ ”.

That is, changing the value of $X$ will result in a change to the value of $Y$ . We can also describe this as “ $X$ affects $Y$ .”

When the relevant variables $X$ and $Y$ are clear from the context we will simplify the notation to $C$ .

Specific causation

$C^{(X,Y)}_{xy}$ := “ $Y_{x}=y,Y_{\overline{x}}=\overline{y}$ ” (for $x,y=0$ or $1$ ).

That is, changing the value of $X$ from $x$ to $\overline{x}$ would change the value of $Y$ from $y$ to $\overline{y}$ . We can also describe this as “ $X=x$ causes $Y=y$ .” When the relevant variables $X$ and $Y$ are clear from the context we will simplify the notation to $C_{xy}$ .

We note that $C_{xy}=C_{\overline{x}\overline{y}}$ .

Probability of Causation.

In cases of interest we will have observed $X=x,Y=y$ , and want to know the probability that $X$ caused $Y$ , given this information. We denote this quantity by $\mbox{\rm PC}_{xy}^{(X,Y)}$ , or $\mbox{\rm PC}_{xy}$ when the relevant variables $X$ and $Y$ are clear from the context. Thus

$$\mathrm{PC}_{xy} = \Pr(C \mid X = x, Y = y) = \Pr(C_{xy} \mid X = x, Y_{x} = y). \tag{4}$$

The joint distribution for $\mathbf{Y}$ , while constrained by knowledge of the transition matrix $P$ , is in general not fully determined by it. Rather, we can only deduce that it has the form of Table 1, where the marginal probabilities agree with (1) according to $\Pr(Y_{x}=y)=\Pr(Y=y\mid X\leftarrow x)$ .

However, the internal entries of Table 1 are not determined by $P$ , but have one degree of freedom, expressed by the “slack” quantity $\xi$ = $\xi(P)$ . We see that

$$\xi = \Pr(Y_{0} = 0, Y_{1} = 1) + \Pr(Y_{0} = 1, Y_{1} = 0) = \Pr(C), \tag{5}$$

the probability of general causation.

The only constraints on $\xi$ are that all internal entries of Table 1 must be non-negative, which holds if and only if

$$|\tau| \leq \xi \leq 1 - |\rho|. \tag{6}$$

In particular $\xi$, and thus the bivariate distribution of $(Y_{0},Y_{1})$ in Table 1, is uniquely determined by $P$ if and only if $P$ is degenerate.
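To make the one degree of freedom concrete, here is a minimal Python sketch that fills in Table 1 for a chosen value of the slack $\xi$ and rejects values outside $|\tau|\leq\xi\leq 1-|\rho|$. The function name `joint_potential_outcomes` is an assumption made for this illustration.

```python
def joint_potential_outcomes(tau, rho, xi, tol=1e-12):
    """Table 1 as joint[y0][y1] = Pr(Y0 = y0, Y1 = y1) for a given slack xi."""
    lo, hi = abs(tau), 1 - abs(rho)
    if not (lo - tol <= xi <= hi + tol):
        raise ValueError(f"xi must lie in [{lo}, {hi}]")
    return [
        [0.5 * (1 - rho - xi), 0.5 * (xi + tau)],
        [0.5 * (xi - tau), 0.5 * (1 + rho - xi)],
    ]


# Two joint distributions compatible with the same P(1/3, 0):
helps_one_third = joint_potential_outcomes(1 / 3, 0.0, xi=1 / 3)  # no one is harmed
helps_two_thirds = joint_potential_outcomes(1 / 3, 0.0, xi=1.0)   # everyone is affected
for joint in (helps_one_third, helps_two_thirds):
    pr_y1_treated = joint[0][1] + joint[1][1]   # Pr(Y1 = 1)
    pr_y1_control = joint[1][0] + joint[1][1]   # Pr(Y0 = 1)
    print(joint, pr_y1_treated, pr_y1_control)
```

Both choices of $\xi$ reproduce the same margins, which is why the experimental data alone cannot distinguish the two scenarios described in the introduction.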

We further note

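$$\Pr(C_{00}) = \Pr(C_{11}) = \tfrac{1}{2}(\xi + \tau), \tag{7}$$

$$\Pr(C_{01}) = \Pr(C_{10}) = \tfrac{1}{2}(\xi - \tau), \tag{8}$$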

whence, by (6),

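$$\max\{0, \tau\} \leq \Pr(C_{00}) = \Pr(C_{11}) \leq \tfrac{1}{2}(1 + \tau - |\rho|), \tag{9}$$

$$\max\{0, -\tau\} \leq \Pr(C_{01}) = \Pr(C_{10}) \leq \tfrac{1}{2}(1 - \tau - |\rho|). \tag{10}$$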

Throughout this article we shall assume no confounding, expressed mathematically as $X \perp\!\!\!\perp \mathbf{Y}$. Then

[TABLE]

which is thus subject to the interval bounds, given by (9) or (10), as appropriate, divided by the known entry $\Pr(Y=y\mid X\leftarrow x)$ of the transition matrix $P$ .

This analysis delivers the following lower and upper bounds (prefix “s” for “simple”):

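$$\begin{aligned} s\mathrm{LB}_{00} &:= \frac{\max\{0, \tau\}}{\Pr(Y = 0 \mid X \leftarrow 0)} \leq \mathrm{PC}_{00}, & s\mathrm{LB}_{10} &:= \frac{\max\{0, -\tau\}}{\Pr(Y = 0 \mid X \leftarrow 1)} \leq \mathrm{PC}_{10}, \\ s\mathrm{LB}_{01} &:= \frac{\max\{0, -\tau\}}{\Pr(Y = 1 \mid X \leftarrow 0)} \leq \mathrm{PC}_{01}, & s\mathrm{LB}_{11} &:= \frac{\max\{0, \tau\}}{\Pr(Y = 1 \mid X \leftarrow 1)} \leq \mathrm{PC}_{11}. \end{aligned} \tag{11–14}$$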

In the absence of additional information, the above bounds constitute the best available inference regarding the probability of causation.

Specifically, when $\tau\geq 0$ , on defining

[TABLE]

we have the following upper bounds:

For $\rho\geq 0$ :

[TABLE]

For $\rho<0$ :

[TABLE]

2.2 Special case

A particular interest is in cases where $\tau>0$ (so the overall effect of $X$ on $Y$ is positive) and we observe positive outcomes, $X=1$, $Y=1$. In this case we omit the subscript $11$. We have

$$\mathrm{PC} = \frac{\xi + \tau}{2\Pr(Y = 1 \mid X \leftarrow 1)}, \tag{25}$$

and interval bounds given by

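$$s\mathrm{LB}=\frac{2\tau}{1+\tau+\rho}\leq\mathrm{PC}\leq s\mathrm{UB}=\begin{cases}\delta&(\rho\geq 0)\\ 1&(\rho<0).\end{cases} \tag{26}$$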

This result agrees with (Tian and Pearl, 2000; Dawid, 2011; Dawid, Musio and Murtas, 2017).

PC is identified (i.e., the interval in (26) reduces to a single point) if and only if $|\rho|=1-\tau$ , which holds when $P$ is degenerate with either the lower left or upper right element of $P$ being 0. In the former case $\mbox{\rm PC}=\tau$ , while in the latter case $\mbox{\rm PC}=1$ .

More generally, we have $\mbox{{\it{s}}\rm{LB}}=\tau/\Pr(Y=1\mid X\leftarrow 1)\geq\tau$ , so $\mbox{\rm PC}\geq\tau$ .
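The simple bounds for the case $X=1$, $Y=1$, $\tau>0$ can be sketched numerically as follows. The upper bound is obtained here by substituting the largest admissible slack $\xi=1-|\rho|$ into $\mathrm{PC}=(\xi+\tau)/(2\Pr(Y=1\mid X\leftarrow 1))$, which matches the "no evidence" largest upper bound in Table 2; the function name `simple_pc_bounds` is a hypothetical helper.

```python
def simple_pc_bounds(tau, rho):
    """(sLB, sUB) for PC = PC_11, assuming tau > 0 and |rho| + tau <= 1."""
    p11 = 0.5 * (1 + tau + rho)                 # Pr(Y = 1 | X <- 1)
    lower = tau / p11                           # xi at its minimum, |tau|
    upper = 0.5 * (1 - abs(rho) + tau) / p11    # xi at its maximum, 1 - |rho|
    return lower, upper


print(simple_pc_bounds(1 / 3, 0.0))    # Ann's example from the introduction
print(simple_pc_bounds(0.5, 0.5))      # degenerate: lower left entry of P is 0
print(simple_pc_bounds(0.5, -0.5))     # degenerate: upper right entry of P is 0
```

In the two degenerate cases the interval collapses to a point, consistent with the identification condition $|\rho|=1-\tau$ stated above.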

3 Bounds from mediation

We now suppose that, in addition to $X$ and $Y$ , we can gather data on one or more binary mediator variables $M_{1},\ldots,M_{n-1}$ . We also define $M_{0}\equiv X$ and $M_{n}\equiv Y$ . We are interested in assessing the probability that $X=x$ caused $Y=y$ for a new case where we have information on the values of some or all of the mediators $M_{1},\dots,M_{n-1}$ .

We assume that the data are based on experiments, or in any case are such as to allow us to determine the one-step interventional probabilities $\Pr(M_{i+1}=m_{i+1}\mid M_{i}\leftarrow m_{i})$ , $i=0,\ldots,n-1$ . We shall here confine attention to the case of a complete mediation sequence, where

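$$\Pr(M_{i+1} = m_{i+1} \mid M_{j} \leftarrow m_{j},\ j = 0, \dots, i) = \Pr(M_{i+1} = m_{i+1} \mid M_{i} \leftarrow m_{i}), \quad (i = 0, \dots, n - 1).$$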

We shall further suppose that, for any new case considered, there is no confounding at every step, so that

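$$\Pr(M_{i+1} = m_{i+1} \mid M_{j} = m_{j},\ j = 0, \dots, i) = \Pr(M_{i+1} = m_{i+1} \mid M_{i} \leftarrow m_{i}), \quad (i = 0, \dots, n - 1).$$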

In this case the sequence of observations $(X\equiv M_{0},\ldots,M_{n}\equiv Y)$ on a new case will form a (generally non-stationary) Markov chain. This is an empirically testable consequence of our assumptions, assumptions which would therefore be falsified if the Markov property is found to fail (although those assumptions are not guaranteed to be valid when it is found to hold.)

Let the transition matrix from $M_{i-1}$ to $M_{i}$ be $P_{i}=P(\tau_{i},\rho_{i})$ , and the overall transition matrix from $X$ to $Y$ be $P=P(\tau,\rho)$ . We shall write

$$P = P_{1} \mid P_{2} \mid \cdots \mid P_{n} \tag{27}$$

to indicate that we are assuming the above mediation sequence, and refer to (27) as a decomposition of the matrix $P$ . In particular we then have $P=P^{(n)}:=\prod_{i=1}^{n}P_{i}$ .

We can readily show by induction that

[TABLE]

In particular, for the case $n=2$ , (29) becomes

$$\rho = \rho_{1}\tau_{2} + \rho_{2}.$$

On account of (28) we have the following result:

Theorem 1

The average causal effect of $X$ on $Y$ is the product of the successive average causal effects of each variable in the sequence on the following one.
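The two-step composition rule can be checked numerically. The sketch below multiplies two transition matrices and reads $\tau$ and $\rho$ back off the product; the helper names and the particular step parameters are assumptions chosen for illustration (they correspond to a "necessary" first step and a "sufficient" second step).

```python
def transition_matrix(tau, rho):
    """P(tau, rho): rows index the parent value, columns the child value."""
    return [
        [0.5 * (1 + tau - rho), 0.5 * (1 - tau + rho)],
        [0.5 * (1 - tau - rho), 0.5 * (1 + tau + rho)],
    ]


def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def params(p):
    """Recover (tau, rho) from a 2x2 transition matrix of the form P(tau, rho)."""
    return p[1][1] - p[0][1], p[1][1] - p[0][0]


tau1, rho1 = 0.5, -0.5       # hypothetical first step
tau2, rho2 = 2 / 3, 1 / 3    # hypothetical second step
tau, rho = params(matmul2(transition_matrix(tau1, rho1),
                          transition_matrix(tau2, rho2)))
print(tau, tau1 * tau2)                 # average causal effects multiply
print(rho, rho1 * tau2 + rho2)          # rho composes as rho1 * tau2 + rho2
```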

Again, to conduct CoE rather than EoC analysis, we introduce, for $i\geq 1$ , bivariate variables

$$\mathbf{M}_{i}:=(M_{i0},M_{i1}),$$

where $M_{im}$ denotes the potential value of $M_{i}$ under $M_{i-1}\leftarrow m$, supposed unaffected by values of previous $M$’s. We further assume that the variable $\mathbf{M}_{i}$ is common to all the various worlds, whether actual or counterfactual, under consideration. The actually realised values $(M_{i})$ satisfy $M_{i}=M_{i,{M_{i-1}}}$.

As the expression of our “no confounding” assumptions, we impose mutual independence between $X$, $\mathbf{M}_{1}$, …, $\mathbf{M}_{n}$.

Theorem 2

$C^{(X,Y)}=\bigcap_{i=0}^{n-1}\,C^{(M_{i},M_{i+1})}$. That is to say, $M_{0}\equiv X$ affects $M_{n}\equiv Y$ if and only if each $M_{i}$ affects the next.

**Proof.** Suppose first that each variable affects the next. Then changing the value of $X$ will change that of $M_{1}$, which in turn will change that of $M_{2}$, and so on until the value of $Y$ is changed, so showing that $X$ affects $Y$. Conversely, if, for some $j<n$, $M_{j}$ does not affect $M_{j+1}$, then, whether or not $M_{j}$ has been changed, the value of $M_{j+1}$ will be unchanged, whence so too will that of $M_{j+2}$, and so on until the value of $Y$ is unchanged, whence $X$ does not affect $Y$. $\Box$

Corollary 1

(i) $\Pr(C^{(X,Y)})=\prod_{i=1}^{n}\,\Pr(C^{(M_{i-1},M_{i})})$.

(ii) $\xi(P)=\prod_{i=1}^{n}\,\xi(P_{i})$.

(iii) Given the detailed information on the decomposition (27), the constraints on $\xi=\xi(P)$ are now:

$$|\tau| \leq \xi \leq \prod_{i=1}^{n}(1 - |\rho_{i}|). \tag{31}$$

**Proof.**

(i) By the assumed mutual independence of the $(\mathbf{M}_{i})$.

(ii) By (5).

(iii) By (ii), (6) for each $P_{i}$, and (28).

$\Box$

On account of (i) we have:

Corollary 2

For any decomposition, the probability that $X$ affects $Y$ is the product of the probabilities that each variable in the sequence from $X$ to $Y$ affects the next in the sequence.

On comparing (31) with (6), we see that detailed knowledge of the mediation process has not changed the lower bound for $\xi$ . However, the upper bound is typically reduced:

Theorem 3

The upper bound of (31), which takes into account the decomposition (27), does not exceed the upper bound of (6), which ignores the decomposition. It will be strictly less if all the $P_{i}$ are non-degenerate with $\rho_{i}\neq 0$ .

**Proof.** Consider first the case $n=2$. Then

[TABLE]

It follows that

[TABLE]

Moreover, we shall have strict inequality in (33), and hence also in (34), if $P_{2}$ is non-degenerate and $\rho_{1}\neq 0$ .

The result for general $n$ follows easily by induction. $\Box$

We note that the above condition for strict inequality in (34), while sufficient, is not necessary. For example, in the case $n=2$ it will also hold if $\rho_{1}\tau_{2}$ and $\rho_{2}$ have different signs, since then we would have strict inequality in (32).

It follows from (31) and (34) that collapsing two mediators into a single one can only increase the upper bound for $\xi$ :

Corollary 3

Consider two decompositions $P=P_{1}\mid P_{2}\ldots\mid P_{n}$ and $P=P_{1}\mid\ldots\mid P_{i}\mid Q\mid P_{i+2}\mid\ldots\mid P_{n}$ , where $Q=P_{i}P_{i+1}$ . Then the upper bound for $\xi$ for the former does not exceed that for the latter.

3.1 Bounds when mediators are unobserved

Suppose first that, for the new case, we have observed $X=x,Y=y$ , but the values of the mediators are not observed. Even in this case, as shown for the two-term decomposition in Dawid, Murtas and Musio (2016), knowledge of the decomposition (27) of $P$ can alter the bounds for PC.

Indeed, in this case (4) still applies, where $\Pr(C_{xy})$ is given by (7) or (8) as appropriate, but now with $\xi$ subject to the revised bounds of (31). In each case the lower bound is unaffected, but, by Theorem 3, the upper bound is reduced.

This analysis delivers the following revised bounds (prefix “u” for “unobserved mediators”):

[TABLE]

3.2 Special case

In particular, for the case $\tau>0$ , where we observe $X=1$ , $Y=1$ (but the values of mediators are not observed), we have revised bounds

[TABLE]

For $n=2$ this agrees with the analysis of Dawid, Murtas and Musio (2016).
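As a hedged sketch of how knowledge of the decomposition tightens the upper bound when no mediator values are observed, the code below composes the steps, caps the slack at $\prod_i(1-|\rho_i|)$, and substitutes into $\mathrm{PC}=(\xi+\tau)/(2\Pr(Y=1\mid X\leftarrow 1))$. This is a reading of the argument in this section rather than a transcription of the paper's displayed formula; the function name and the example step parameters are assumptions.

```python
from math import prod


def unobserved_pc_bounds(steps):
    """(uLB, uUB) for PC_11 given steps = [(tau_i, rho_i), ...] with every tau_i > 0."""
    tau, rho = 1.0, 0.0                       # parameters of the identity matrix
    for t, r in steps:
        tau, rho = tau * t, rho * t + r       # compose one further step
    xi_max = prod(1 - abs(r) for _, r in steps)
    p11 = 0.5 * (1 + tau + rho)               # Pr(Y = 1 | X <- 1)
    return tau / p11, 0.5 * (xi_max + tau) / p11


# Hypothetical two-step chain with overall tau = 0.3 and rho = 0.
print(unobserved_pc_bounds([(0.6, 0.2), (0.5, -0.1)]))
# Same overall P treated as a single step: the upper bound is looser.
print(unobserved_pc_bounds([(0.3, 0.0)]))
```

The lower bound is the same in both calls, while the two-step call gives a smaller upper bound, in line with Theorem 3.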

3.3 Bounds when some or all mediators are observed

Now suppose that, in addition to $X=x$, $Y=y$, we also observe data on $k$ mediators ($0\leq k\leq n-1$) for the new case. In particular we observe $M_{i_{r}}=m_{i_{r}}$, for $0<i_{1}<\ldots<i_{r}<\ldots<i_{k}<n$. For notational simplicity we write $\widetilde{M}_{r}$ for $M_{i_{r}}$, $\widetilde{m}_{r}$ for $m_{i_{r}}$. We also identify $\widetilde{M}_{0}\equiv X$ and $\widetilde{M}_{k+1}\equiv Y$ (so $\widetilde{m}_{0}=x$, $\widetilde{m}_{k+1}=y$).

The relevant probability of causation is now

[TABLE]

Note that in contrast to the difference between (35)–(38) on the one hand and (11)–(14) on the other hand, which relate to the same quantity $\mbox{\rm PC}_{xy}$ but express different conclusions about it, $\widetilde{\mbox{PC}}_{xy}$ is a genuinely different quantity from $\mbox{\rm PC}_{xy}$, since it conditions on different information about the new case.

Theorem 4

Given observations on $X,\widetilde{M}_{1},\ldots,\widetilde{M}_{k},Y$ , the probability that $X$ caused $Y$ is given by the product of the probabilities that each observed term in the sequence caused the next observed term:

[TABLE]

**Proof.** From Theorem 2 we have

[TABLE]

whence, using the “no-confounding” independence properties,

[TABLE]

$\Box$

Now since we have the decomposition information about the mediators (if any) occurring between $\widetilde{M}_{r}\equiv M_{i_{r}}$ and $\widetilde{M}_{r+1}\equiv M_{i_{r+1}}$, but not their values for the new case, the bounds on any factor in (40) will, mutatis mutandis, have the form of the relevant expressions for $\mbox{{\it{u}}\rm{LB}}_{xy}$ and $\mbox{{\it{u}}\rm{UB}}_{xy}$, as displayed in (35)–(38). Then the overall lower [resp., upper] bound on $\widetilde{\mbox{PC}}_{xy}$ will be the product of these lower [resp., upper] bounds, across all terms. This procedure supplies a complete recipe for determining the appropriate bounds on $\widetilde{\mbox{PC}}_{xy}$ in the knowledge of the full decomposition of $P$ and the values of the observed mediators for the new case.
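The recipe just described can be sketched in Python. Each run of steps between consecutive observed variables is collapsed into one segment, bounded as in the unobserved-mediator case, and the segment bounds are multiplied. The segment formulas are derived here from Table 1 and the slack constraint $|\tau|\leq\xi\leq\prod_i(1-|\rho_i|)$, so they should be read as an interpretation under those assumptions; the function names and the example chain are hypothetical.

```python
from math import prod


def segment_bounds(steps, x, y):
    """Bounds on Pr(start = x caused end = y) for a run of steps with unobserved interior."""
    tau, rho = 1.0, 0.0
    for t, r in steps:
        tau, rho = tau * t, rho * t + r
    xi_max = prod(1 - abs(r) for _, r in steps)
    sign = 1 if x == y else -1                       # C_00, C_11 vs C_01, C_10
    p_xy = 0.5 * (1 + sign * tau + (1 if y == 1 else -1) * rho)
    if p_xy <= 0:
        raise ValueError("observed transition has probability zero")
    return max(0.0, sign * tau) / p_xy, 0.5 * (xi_max + sign * tau) / p_xy


def pc_bounds(steps, observed, x=1, y=1):
    """Product of segment bounds; observed maps mediator index (1..n-1) to its value."""
    values = {0: x, len(steps): y, **observed}
    idx = sorted(values)
    lower = upper = 1.0
    for a, b in zip(idx, idx[1:]):
        lo, hi = segment_bounds(steps[a:b], values[a], values[b])
        lower, upper = lower * lo, upper * hi
    return lower, upper


# Hypothetical chain: prescription is necessary for taking the medicine,
# and taking it is sufficient for recovery (overall tau = 1/3, rho = 0).
chain = [(0.5, -0.5), (2 / 3, 1 / 3)]
print(pc_bounds(chain, {1: 1}))   # medicine was taken
print(pc_bounds(chain, {1: 0}))   # medicine was not taken
```

With the mediator observed at 1 both bounds come out at $2/3$, the value quoted in the introduction for this example; with the mediator observed at 0 both bounds are 0.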

3.4 Special cases

Again consider the case $\tau>0$, $X=Y=1$. On account of (28) we can, after possibly switching the labels $0$ and $1$ for some of the $M_{i}$’s, take $\tau_{i}>0$, all $i$. We assume henceforth that this is the case. The above procedure then delivers lower bound $0$ unless $\widetilde{m}_{i}=\widetilde{m}_{i-1}$, all $i$, so that $m_{i}=1$, all $i$. In that case we obtain lower bound (with prefix “o” for “observed mediators”):

[TABLE]

It is easy to see that this lower bound can only increase if we introduce further observed mediators. It follows that the smallest lower bound occurs when there are no observed mediators, when it reduces to $\mbox{{\it{u}}\rm{LB}}=\mbox{{\it{s}}\rm{LB}}$ as in (39) and (26); while the largest lower bound occurs when all mediators are observed (all taking value 1), that is to say, when there is positive evidence for every link in the mediation chain.
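A small numerical sketch shows how slowly the positive-evidence lower bound grows along a homogeneous chain. It assumes every step has $\rho_i=0$ and $\tau_i=\tau^{1/n}$ (so the overall matrix is $P(\tau,0)$) and takes the lower bound as the product of the one-step simple lower bounds $2\tau_i/(1+\tau_i+\rho_i)$; the function name is a hypothetical helper.

```python
def positive_evidence_lower_bound(tau, n):
    """Lower bound on PC with all n - 1 mediators observed at 1, homogeneous chain, rho_i = 0."""
    t = tau ** (1.0 / n)               # per-step average causal effect
    return (2 * t / (1 + t)) ** n      # product of n identical one-step lower bounds


for n in (1, 2, 10, 1000):
    print(n, round(positive_evidence_lower_bound(1 / 3, n), 4))
```

For $\tau=1/3$ the values start at $0.5$ and increase towards roughly $0.58$, in line with the tightening from $[.5,1]$ to $[.58,1]$ mentioned in the introduction.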

In the remainder of this paper we shall give special attention to this case, and write simply $\widetilde{\mbox{PC}}$ for $\widetilde{\mbox{PC}}_{11}$, etc. The bounds for $\widetilde{\mbox{PC}}$ are then:

[TABLE]

The following result follows directly from the above considerations:

Lemma 1

*The lower bound oLB of (42) is at least as large as the lower bound sLB of (26).*

It is not, however, always the case that $\mbox{{\it{o}}\rm{UB}}\leq\mbox{{\it{s}}\rm{UB}}$ : see (45) below.

4 Implications

Equation (40) provides a general formula for calculating bounds on the probability of causation for any pattern of data observed on mediating variables (including no data).

We now derive implications from this analysis.

4.1 Largest and smallest upper and lower bounds

Consider an arbitrary decomposition of $P$ :

[TABLE]

with $P=P(\tau,\rho)$ , $P_{i}=P(\tau_{i},\rho_{i})$ . We restrict attention to the case $\tau>0$ and assume that variables are labeled so that each $\tau_{i}>0$ .

We investigate the smallest and largest achievable values for $\mbox{{\it{u}}\rm{LB}},\mbox{{\it{u}}\rm{UB}},\mbox{{\it{o}}\rm{LB}},\mbox{{\it{o}}\rm{UB}},\mbox{{\it{m}}\rm{LB}}$ , mUB (prefix $m$ for mixed evidence) and show that in each case these are achievable by decompositions involving at most one mediator.

Theorem 5

Let the (known, fixed) transition matrix from $X$ to $Y$ be $P=P(\tau,\rho)$ , with $\tau>0$ and $|\rho|<1-\tau$ . The largest and smallest upper and lower bounds from any complete mediation process for the case with mediators unobserved, for the case with positive outcomes on all mediators observed, and for mixed cases, that include some negative evidence on the mediators, are as given in Table 2.

These can all be achieved by decompositions of length 1 or 2.

**Proof.** See Appendix A. $\Box$

The largest upper bound with mediators unobserved, $\overline{\mbox{{\it{u}}\rm{UB}}}$ , can be achieved without any mediators. Since unobserved mediators do not alter the lower bound we have $\overline{\mbox{{\it{u}}\rm{LB}}}=\underline{\mbox{{\it{u}}\rm{LB}}}={\mbox{{\it{s}}\rm{LB}}}$ . In addition we have $\underline{\mbox{{\it{u}}\rm{UB}}}={\mbox{{\it{s}}\rm{LB}}}$ , which is achievable, for example, from the following decomposition:

[TABLE]

Note that with this decomposition PC is identified via two degenerate transition matrices: $X=1$ is a sufficient condition for $M=1$ , while $M=1$ is a necessary condition for $Y=1$ .

The smallest upper and lower bounds available when mediators are observed agree with the simple lower bound. Positive evidence cannot reduce the lower bound, but it can reduce the upper bound to the lower bound, at which point $\widetilde{\mbox{PC}}$ is identified. This can be achieved by the same decomposition given in (44).

The largest upper bound with positive evidence on mediators, $\overline{\mbox{{\it{o}}\rm{UB}}}$ , can exceed the simple upper bound when $\rho>0$ . It is achieved by the following two-term decomposition, involving a single mediator:

[TABLE]

The lower bound can be raised with positive information on mediators, and takes its largest value with the following degenerate two-term decomposition $P=P_{1}\mid P_{2}$ , involving a single mediator:

[TABLE]

With this decomposition $\widetilde{\mbox{PC}}$ is identified via two degenerate transition matrices: in this case $X=1$ is a necessary condition for $M=1$ , while $M=1$ is a sufficient condition for $Y=1$ . The largest lower bound with positive evidence from this decomposition is $\frac{1+\tau-\rho}{2}$ which can fall far short of 1, implying that in general mediators cannot provide “smoking gun” evidence that $X=1$ caused $Y=1$ .
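As a rough numerical companion to Theorem 5, the following sketch evaluates the closed-form extremes quoted in the text for the case $X=Y=1$: the simple bounds $2\tau/(1+\tau+\rho)$ and $\min\{1,(1+\tau-\rho)/(1+\tau+\rho)\}$, the largest lower bound $(1+\tau-\rho)/2$, and the largest upper bound $\min\{1,1-\rho\}$ under positive evidence. The function name and dictionary keys are illustrative choices, not notation from the paper, and the formulas are copied rather than re-derived.

```python
def extreme_bounds(tau, rho):
    """Closed-form extremes for X = Y = 1, with tau > 0 and |rho| <= 1 - tau.

    Hypothetical helper: the formulas are those quoted in the text.
    """
    if not (0 < tau <= 1 and abs(rho) <= 1 - tau):
        raise ValueError("need 0 < tau <= 1 and |rho| <= 1 - tau")
    s_lb = 2 * tau / (1 + tau + rho)                      # simple lower bound
    s_ub = min(1.0, (1 + tau - rho) / (1 + tau + rho))    # simple upper bound
    return {
        "sLB": s_lb,
        "sUB": s_ub,
        "min_oLB": s_lb,                  # positive evidence cannot lower the bound
        "min_oUB": s_lb,                  # upper bound can fall to the lower bound
        "max_oLB": (1 + tau - rho) / 2,   # largest lower bound, positive evidence
        "max_oUB": min(1.0, 1 - rho),     # largest upper bound, positive evidence
    }


bounds = extreme_bounds(tau=0.5, rho=0.0)
for name, value in bounds.items():
    print(f"{name}: {value:.4f}")
```

For any admissible pair the largest lower bound stays inside the simple interval, which is one way to see why positive mediator evidence alone does not push the bound to 1.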

For the case with mixed evidence on the mediators the lower bound is always 0. The smallest upper bound is also 0, which can be achieved by the decomposition (46) above, with the single mediator observed at 0 (the key feature of this decomposition is that $Y=1$ can not be caused by $M=0$ ). In this case $\widetilde{\mbox{PC}}$ is identified at 0, showing that it is possible for negative data on mediators to provide “hoop” evidence that $X=1$ did not cause $Y=1$ . The highest upper bound, $\mbox{{\it{m}}\rm{UB}}=1$ , can be achieved by a two-step decomposition $P(\tau,\rho)=P(\tau_{1},\rho_{1})\mid P(\tau_{2},\rho_{2})$ , with the mediator taking value 0. For $\rho\leq 0$ this occurs with the decomposition with parameters

[TABLE]

For $\rho\geq 0$ it occurs with decomposition parameterized by

[TABLE]

4.2 Homogeneous transitions

Throughout this section we confine attention to the special case $\tau>0$ , $X=Y=1$ . We specialize further to the case of a constant one-step transition matrix, $P_{i}=P^{\prime}=P(\tau^{\prime},\rho^{\prime})$ for all $i$ . We define $\sigma^{\prime}$ , $\gamma^{\prime}$ , $\delta^{\prime}$ in terms of $\tau^{\prime}$ and $\rho^{\prime}$ in parallel to (3), (15) and (16).

In this case, by (28) and (29), we have

[TABLE]

In particular, we note that the relative sufficiency of $X$ for $Y$ is preserved at each intermediate step: $\sigma^{\prime}={\rho^{\prime}}/(1-\tau^{\prime})={\rho}/(1-\tau)=\sigma$ . It follows that $\gamma^{\prime}=\gamma$ .

We have

[TABLE]

Note that, for large $n$ , $\tau^{\prime}$ must be close to 1 and $\rho^{\prime}$ close to 0, with the same sign as $\rho$ .
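The relation between the one-step and overall parameters can be checked numerically. The sketch below assumes the parametrisation in which $P(\tau,\rho)$ has first row $\bigl((1+\tau+\rho)/2,\,(1-\tau-\rho)/2\bigr)$ and second row $\bigl((1-\tau+\rho)/2,\,(1+\tau-\rho)/2\bigr)$ (inferred here from the margins of Table 1), and takes $\tau=\tau^{\prime n}$, $\rho=\rho^{\prime}(1-\tau^{\prime n})/(1-\tau^{\prime})$ as the content of (51) and (52). All function names are hypothetical.

```python
def P(tau, rho):
    """2x2 transition matrix; rows are input = 1, 0 and columns are output = 1, 0.

    Assumed parametrisation, inferred from the margins of Table 1.
    """
    return [[(1 + tau + rho) / 2, (1 - tau - rho) / 2],
            [(1 - tau + rho) / 2, (1 + tau - rho) / 2]]


def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def one_step(tau, rho, n):
    """Per-step (tau', rho') of an n-step homogeneous chain with overall (tau, rho)."""
    tau1 = tau ** (1.0 / n)
    rho1 = rho * (1 - tau1) / (1 - tau)   # requires tau < 1
    return tau1, rho1


def n_step(tau1, rho1, n):
    """n-fold product of the one-step matrix P(tau1, rho1)."""
    step = P(tau1, rho1)
    out = step
    for _ in range(n - 1):
        out = matmul(out, step)
    return out


tau, rho, n = 0.3, 0.2, 5
tau1, rho1 = one_step(tau, rho, n)
composed = n_step(tau1, rho1, n)
target = P(tau, rho)
max_err = max(abs(composed[i][j] - target[i][j])
              for i in range(2) for j in range(2))
print(tau1, rho1, max_err)
```

The printed discrepancy should be at rounding level, and the ratio `rho1 / (1 - tau1)` reproduces `rho / (1 - tau)`, which is the preserved relative sufficiency noted above.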

Using (51) and (52) in (39) and (42) yields the following bounds for a homogeneous process:

[TABLE]

In particular, for the degenerate cases $|\rho|=1-\tau$, so that $|\rho^{\prime}|=1-\tau^{\prime}$, we see that, for all $n$, PC and $\widetilde{\mbox{PC}}$ are both identified: at $\tau$ when $\rho=1-\tau$, and at 1 when $\rho=\tau-1$. The existence of the mediators is irrelevant in these cases.

Mixed evidence

Here we assume the process is non-degenerate.

For the case with some negative evidence the lower bound, $\mbox{{\it{m}}\rm{LB}}_{n}$ say, is always 0, as noted in Section 3.4. The upper bound, however, depends on the particular pattern of positive and negative evidence. For any sequence $s$ of observations on consecutive mediators (allowing $M_{0}\equiv X$ and $M_{n}\equiv Y$, both required to take value 1), denote the associated upper bound by $\mbox{\rm UB}(s)$. Let $\mathbf{s}$ denote a full sequence of observations (i.e., on all $n+1$ mediators). We search for a full sequence ${\mbox{$ \mathbf{s} $}}_{0}$ yielding the minimum value, $\mbox{{\it{m}}\rm{UB}}_{n}$ say, of $\mbox{\rm UB}({\mbox{$ \mathbf{s} $}})$.

Theorem 6

For large enough $n$ , we have

[TABLE]

The optimal sequence ${\mbox{$ \mathbf{s} $}}_{0}$ alternates $10101\ldots$ , except, if $n$ is odd, for the final 2 symbols.

**Proof.** See Appendix B. $\Box$

For $\rho=0$ the smallest possible upper bound is $1$ for all $n$. Otherwise, $\mbox{{\it{m}}\rm{UB}}_{n}\rightarrow 0$ as $n\rightarrow\infty$. Then with alternating evidence on many mediators the associated probability of causation, $\widehat{\mbox{PC}}$ say, is effectively identified as $0$.

Figure 1 plots the intervals $[\mbox{{\it{u}}\rm{LB}}_{n},\mbox{{\it{u}}\rm{UB}}_{n}]$ , $[\mbox{{\it{o}}\rm{LB}}_{n},\mbox{{\it{o}}\rm{UB}}_{n}]$ and $[\mbox{{\it{m}}\rm{LB}}_{n},\mbox{{\it{m}}\rm{UB}}_{n}]$ for a range of cases. It highlights how modest are the gains from repeated observation of homogeneous mediators and how alternating evidence can tighten bounds as long as $\rho\neq 0$ .

4.2.1 Unboundedly many mediators

We now consider the behaviour of the bounds when we have a potentially unlimited sequence of variables directly mediating between $X$ and $Y$ —still assuming identical one-step transition matrices. Our results are given in Theorem 7.

Theorem 7

[TABLE]

**Proof.** See Appendix C. $\Box$

In particular, for $\rho=0$ we have

[TABLE]

and

[TABLE]

Proposition 1

For $|\rho|<1-\tau$ , $\mbox{{\it{o}}\rm{LB}}_{n}$ is a concave strictly increasing function of $n$ , and $\mbox{{\it{u}}\rm{UB}}_{n}$ and (for $\rho>0$ ) $\mbox{{\it{o}}\rm{UB}}_{n}$ are both convex strictly decreasing functions of $n$ .

We do not have a full proof of Proposition 1. Supporting evidence is given by numerous plots of $\mbox{{\it{o}}\rm{LB}}_{n}$ and $\mbox{{\it{o}}\rm{UB}}_{n}$ against $n$ for various $(\tau,\rho)$ pairs, and the following two results, which are proved in Appendix C.

Lemma 2

If $|\rho|<1-\tau$ , then $\mbox{{\it{o}}\rm{LB}}_{n}$ is a concave increasing function of $n$ , and $\mbox{{\it{u}}\rm{UB}}_{n}$ and (for $\rho>0$ ) $\mbox{{\it{o}}\rm{UB}}_{n}$ are convex strictly decreasing functions of $n$ , for $n$ sufficiently large.

Lemma 3

For the non-degenerate case $|\rho|<1-\tau$, $\mbox{{\it{u}}\rm{UB}}_{2n}<\mbox{{\it{u}}\rm{UB}}_{n}$, $\mbox{{\it{o}}\rm{LB}}_{2n}>\mbox{{\it{o}}\rm{LB}}_{n}$, and (for $\rho>0$) $\mbox{{\it{o}}\rm{UB}}_{2n}<\mbox{{\it{o}}\rm{UB}}_{n}$.
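In the spirit of the plots mentioned above, the monotonicity claims for fully observed mediators can be probed numerically. This is a minimal sketch, assuming that with every mediator observed at 1 the bounds are products of the one-step simple bounds, so that $\mbox{{\it{o}}\rm{LB}}_{n}=(2\tau^{\prime}/(1+\tau^{\prime}+\rho^{\prime}))^{n}$ and $\mbox{{\it{o}}\rm{UB}}_{n}=\min\{1,(1+\tau^{\prime}-\rho^{\prime})/(1+\tau^{\prime}+\rho^{\prime})\}^{n}$. The parameter values and the function name are arbitrary illustrations, and a finite scan is evidence rather than proof.

```python
def observed_bounds(tau, rho, n):
    """(oLB_n, oUB_n) for an n-step homogeneous chain, all mediators observed at 1.

    Assumes each factor is the one-step simple bound; requires 0 < tau < 1.
    """
    tau1 = tau ** (1.0 / n)
    rho1 = rho * (1 - tau1) / (1 - tau)
    s_lb = 2 * tau1 / (1 + tau1 + rho1)
    s_ub = min(1.0, (1 + tau1 - rho1) / (1 + tau1 + rho1))
    return s_lb ** n, s_ub ** n


tau, rho = 0.25, 0.3            # illustrative values with 0 < rho < 1 - tau
rows = [observed_bounds(tau, rho, n) for n in range(1, 21)]
lbs = [lb for lb, _ in rows]
ubs = [ub for _, ub in rows]

for n, (lb, ub) in enumerate(rows, start=1):
    print(f"n={n:2d}  oLB={lb:.5f}  oUB={ub:.5f}")
```

Second differences of `lbs` and `ubs` can be inspected in the same way to look at concavity and convexity.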

4.3 Implications for data gathering

Our results have focused on improving the bounds on PC by learning about general mediating processes together with values for prespecified mediators for the case at hand. Our results can also be used to suggest which mediators researchers might most fruitfully seek to observe for the case at hand.

Thus consider a homogeneous process with $n$ steps ($n$ even) and suppose that researchers can observe the value of just one mediator $M_{i}$. In this case we can show that the lower bound $\widetilde{\mbox{LB}}$ on $\widetilde{\mbox{PC}}$, if we were to observe $M_{i}=1$, is maximized if the central mediator in the sequence is observed. To see this, note that from (28), (29) and Theorem 4, the lower bound $\widetilde{\mbox{LB}}$ from observation of mediator $M_{k}=1$ is given by the product of the lower bound for the probability that $X=1$ caused $M_{k}=1$ and the lower bound for the probability that $M_{k}=1$ caused $Y=1$:

[TABLE]

where $\tau^{\prime}$ and $\rho^{\prime}$ are given by (51) and (52). This expression has the form ${c}/{f(k)f(n-k)}$, where $f(k)$ is decreasing and convex in $k$: this holds since $\Delta_{k+1}:=f(k+1)-f(k)=\tau^{\prime k+1}-\tau^{\prime k}+\rho^{\prime}\tau^{\prime k}=\tau^{\prime k}(\tau^{\prime}+\rho^{\prime}-1)<0$, and $\Delta_{k+1}-\Delta_{k}=(\tau^{\prime k}-\tau^{\prime k-1})(\tau^{\prime}+\rho^{\prime}-1)>0$. Hence the denominator is minimised, and so $\widetilde{\mbox{LB}}$ is maximised, when $k=n-k$, i.e. $k=n/2$.

As an illustration, suppose 121 dominoes stand in a row. The fall of any domino increases the chance that its neighbor will fall from 0.005 to 0.995. You know that the first domino was knocked and fell, that the last is also down, and want the probability that the fall of the first one caused the fall of the last one. A lower bound above 50% would secure a conviction of domino 1.

With no further information, the lower bound is 0.461—not enough to convict. But now suppose you can seek information on the status of just one other domino in the sequence: which should you choose? It is better to choose in the middle than at the edges.

If for example you were to seek information on the status of domino 2 and found that it had fallen, you would find $\widetilde{\mbox{LB}}=0.463$, a modest gain, reflecting the fact that you fully expected domino 2 to have fallen, given that domino 1 was knocked. However, you are less sure you will find domino 61 down. If you do, you find $\widetilde{\mbox{LB}}=0.501$, enough to convict domino 1.

Note that in all cases the lower bound would be 0 if the intermediate domino were found to be standing. Taking both possible outcomes into account, the expected lower bound is always $0.461$ . But the second strategy does better than the first, in allowing the possibility to obtain a larger lower bound (albeit with a smaller probability), and so secure a conviction.
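The domino illustration can be reproduced approximately with a few lines of code. The sketch below assumes 121 dominoes correspond to $n=120$ identical transitions with $\tau^{\prime}=0.995-0.005$ and $\rho^{\prime}=0.995+0.005-1$, that $\Pr(\text{output}=1\mid\text{input}=1)=(1+\tau+\rho)/2$ (a parametrisation inferred from Table 1), and that the bound after observing a single fallen domino is the product of two simple lower bounds, as described above. Function names are hypothetical, and the printed values should be close to, though not necessarily identical in the last digit to, the figures quoted in the text.

```python
def chain_params(tau1, rho1, steps):
    """(tau, rho) after `steps` identical transitions P(tau1, rho1), tau1 < 1."""
    t = tau1 ** steps
    r = rho1 * (1 - t) / (1 - tau1)
    return t, r


def simple_lb(t, r):
    """Simple lower bound 2*tau / (1 + tau + rho)."""
    return 2 * t / (1 + t + r)


def p_pos(t, r):
    """Pr(output = 1 | input = 1) under the assumed parametrisation."""
    return (1 + t + r) / 2


def lb_if_observed_down(tau1, rho1, n, k):
    """Lower bound when the single mediator M_k (0 < k < n) is observed at 1."""
    return (simple_lb(*chain_params(tau1, rho1, k))
            * simple_lb(*chain_params(tau1, rho1, n - k)))


def expected_lb(tau1, rho1, n, k):
    """Average bound over M_k = 1 (bound above) and M_k = 0 (bound 0),
    weighting by Pr(M_k = 1 | X = 1, Y = 1) for a Markov chain."""
    prob_down = (p_pos(*chain_params(tau1, rho1, k))
                 * p_pos(*chain_params(tau1, rho1, n - k))
                 / p_pos(*chain_params(tau1, rho1, n)))
    return prob_down * lb_if_observed_down(tau1, rho1, n, k)


n = 120                      # 121 dominoes, hence 120 transitions
tau1 = 0.995 - 0.005         # one-step effect
rho1 = 0.995 + 0.005 - 1     # one-step rho (zero here)

lb_none = simple_lb(*chain_params(tau1, rho1, n))
lb_domino_2 = lb_if_observed_down(tau1, rho1, n, 1)
lb_domino_61 = lb_if_observed_down(tau1, rho1, n, 60)
best_k = max(range(1, n), key=lambda k: lb_if_observed_down(tau1, rho1, n, k))

print(round(lb_none, 3), round(lb_domino_2, 3), round(lb_domino_61, 3), best_k)
```

Under these assumptions the scan over `k` selects the central domino, and `expected_lb` returns the no-information bound for every `k`, matching the remark that the expected lower bound does not depend on which domino is inspected.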

5 Comparisons with other bounds

Although knowledge of mediators can narrow bounds, we have seen that this narrowing can be modest, even with access to an infinite sequence of positive evidence along a causal path. To put our results in context, we compare them with bounds that can be achieved from monotonicity, and from covariate information. Knowledge of the bounds achievable by different strategies provides some guidance as to whether a strategy would be worth pursuing.

Monotonicity

Suppose that we somehow knew that there are no cases for which the exposure would prevent the outcome, i.e., such that $Y_{0}=1,Y_{1}=0$ . From Table 1 this is equivalent to $\xi=\tau$ , its lower limit, which in turn implies that PC, given by (25), is identified at its lower limit, $\mbox{{\it{s}}\rm{LB}}=(2\tau)/(1+\tau+\rho).$

However, since monotonicity is an attribute of the typically unidentifiable joint distribution of $(Y_{0},Y_{1})$ , it is not easy to justify without additional knowledge. One case where this works is when we know the existence of a mediation process with decomposition (44).

Observed covariate

Suppose that, in addition to $X$ and $Y$ , we can observe a binary covariate $C$ , which can affect the dependence of $Y$ on $X$ . Let $\pi=\Pr(C=1)$ , and let $P_{i}$ be the transition matrix from $X$ to $Y$ , conditional on $C=i$ ; for consistency with the known $P=P(\tau,\rho)$ we must have $P=\pi P_{1}+(1-\pi)P_{0}$ .

In particular, it could be the case that $\pi=(1+\tau-\rho)/2$ , and

[TABLE]

In this case knowledge that an individual with $X=Y=1$ also has $C=1$ is enough to identify PC at 1.

Unobserved covariate

As shown in Dawid (2011), knowledge of covariates can improve bounds, even if their values are not observed for the case at hand. In particular, this can let us identify PC at the upper bound, $\mbox{{\it{s}}\rm{UB}}=\min\{1,\frac{1+\tau-\rho}{1+\tau+\rho}\}$ . For this to be possible, however, the average treatment effect must be negative for some value of $C$ .

Thus suppose $\pi=\frac{1+\tau+\rho}{2}$ , and the conditional transition matrices are:

For $\rho<0$ ,

[TABLE]

For $\rho\geq 0$ ,

[TABLE]

In either case, knowledge that $X=Y=1$ is sufficient to infer that $C=1$ . This identifies the probability of causation: $\mbox{\rm PC}=1$ for $\rho<0$ , $\mbox{\rm PC}=\frac{1+\tau-\rho}{1+\tau+\rho}$ for $\rho\geq 0$ . In both cases we hit the upper bound.

Comparisons

Figure 2 compares the bounds obtained, for a range of values of $\tau$ and $\rho$ . It illustrates how, in general, lower bounds rise with $\tau$ and fall with $\rho$ . For homogeneous processes the lower bounds improve on the simple bounds, although the gain from unlimited steps is not a striking improvement on that for just two steps. The best gains from non-homogeneous decompositions are substantial, as are the gains from knowledge of covariates, especially when $\rho$ is small.

6 Conclusion

We close with some comments, which may help to guide the collection of ancillary evidence to improve the bounds on the probability of causation. These are based on our general results, as exemplified in Figure 2.

1. Knowledge of mediation processes, and of positive values for some mediators in a particular case, can raise the lower bound on the probability of causation, thus providing some evidence against a sceptic who doubts that the outcome in the case can be attributed to the putative cause. However, it may well not raise the bound enough to convince her. In contrast, for some processes, observing negative evidence on mediators can effectively convince the sceptic that the outcome is not the result of the exposure.

2. Observing positive data on homogeneous mediation processes can improve the bounds, but there are diminishing returns, and full identification is not achieved, even with infinite data.

3. For a homogeneous process, observation in the middle of the process is more informative than nearer the edges.

4. Heterogeneous mediation processes can sometimes yield identification with minimal auxiliary data gathering:

• A process where $X$ is a necessary condition for a sufficient condition for $Y$ yields the largest possible lower bound, and identifies the probability of causation. For example, if it is known that the effect of delivering a deworming medicine passes uniquely through ingestion, and ingestion is sufficient for effective deworming, then evidence of ingestion raises the lower bound and identifies the probability of causation.

• A process in which $X$ is a sufficient condition for a necessary condition for $Y$ yields identification, and there is no gain from gathering data on the mediator. For instance, if ingesting medicine is a sufficient condition for good health, and good health is a necessary condition for good school performance, then observing ingestion and good school performance is sufficient to achieve identification. There are no additional gains from measuring health, since good health is already implied by good performance.

5. Potential gains from knowledge of mediation processes are typically weaker than potential gains from knowledge of conditions under which interventions are more or less effective. Even when covariates are unobserved for the case at hand, knowledge of the general effect of covariates can tighten the bounds when some subgroups exhibit adverse effects. On this basis researchers might be able to assess whether a search for a suitable covariate could lead to improved bounds, and perhaps even identification of the probability of causation.

Appendix A Proof of Theorem 5

A.1 Mediators unobserved

Lower bounds:

uLB is unchanged by knowledge of the mediation process alone and so the largest and smallest values of uLB are $\underline{\mbox{{\it{u}}\rm{LB}}}=\overline{\mbox{{\it{u}}\rm{LB}}}=\mbox{{\it{s}}\rm{LB}}$ .

Smallest upper bound:

From (39) we can see that for a degenerate two-term decomposition with $|\rho_{1}|=1-\tau_{1}$ and $|\rho_{2}|=1-\tau_{2}$ , $\mbox{{\it{u}}\rm{UB}}=\mbox{{\it{u}}\rm{LB}}=\mbox{{\it{s}}\rm{LB}}$ . In this case PC is identified.

Largest upper bound:

It follows from Corollary 3 and (25) that this is achieved when there are no mediators, and is thus sUB.

A.2 Positive data observed at every step

We now consider the case where mediators are observed. Then, for the decomposition (43),

[TABLE]

Smallest lower bound

It follows from Lemma 1 that the smallest achievable lower bound is

[TABLE]

which does not require any mediators.

Smallest upper bound

Trivially we must have $\mbox{{\it{o}}\rm{UB}}\geq\mbox{{\it{o}}\rm{LB}}\geq\mbox{{\it{s}}\rm{LB}}$ .

Note now that the decomposition (44) identifies $\mbox{\rm$ \widetilde{\mbox{PC}} $}=\mbox{{\it{s}}\rm{LB}}$ , whence in particular $\mbox{{\it{o}}\rm{UB}}=\mbox{{\it{s}}\rm{LB}}$ , the smallest possible value.

Largest lower bound

Lemma 4

[TABLE]

**Proof.** This holds since

[TABLE]

$\Box$

Lemma 5

Let $P=P_{1}\times P_{2}$ . Then

[TABLE]

**Proof.** Follows from matrix multiplication, on noting that each term is the leading entry of its associated transition matrix. $\Box$

Corollary 4

Let $P=P_{1}\times\ldots\times P_{n}$ . Then

[TABLE]

From (42), Lemma 4 and Corollary 4 we deduce:

Corollary 5

Let $P=P_{1}\mid\ldots\mid P_{n}$ . Then $\mbox{{\it{o}}\rm{LB}}\leq(1+\tau-\rho)/2$ .

However the value $\mbox{{\it{o}}\rm{LB}}=(1+\tau-\rho)/2$ can be achieved, specifically for the degenerate two-term decomposition (46), so this is indeed the largest lower bound. And in this case we have identification: $\mbox{\rm$ \widetilde{\mbox{PC}} $}=(1+\tau-\rho)/2$ .

We note that, since $\rho\leq 1-\tau$ , the largest lower bound, $(1+\tau-\rho)/2$ , can not exceed the simple upper bound $\mbox{{\it{s}}\rm{UB}}=(1+\tau-\rho)/(1+\tau+\rho)$ . Thus any lower bound must lie in the simple interval $[\mbox{{\it{s}}\rm{LB}},\mbox{{\it{s}}\rm{UB}}]$ .

Largest upper bound

Lemma 6

[TABLE]

**Proof.** Trivial if $\rho\leq 0$. Otherwise follows from $(1-\rho)(1+\tau+\rho)-(1+\tau-\rho)=\rho(1-\tau-\rho)\geq 0$. $\Box$
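The identity used in this proof can be confirmed by direct expansion. The working below is only a verification of the stated algebra; the last remark assumes that Lemma 6 asserts $\mbox{{\it{s}}\rm{UB}}\leq\min\{1,1-\rho\}$, as its later use in this appendix suggests.

```latex
\begin{align*}
(1-\rho)(1+\tau+\rho) - (1+\tau-\rho)
  &= (1+\tau+\rho) - (1+\tau-\rho) - \rho\,(1+\tau+\rho) \\
  &= 2\rho - \rho - \rho\tau - \rho^{2} \\
  &= \rho\,(1-\tau-\rho) \;\geq\; 0
  \qquad \text{for } 0 < \rho \leq 1-\tau .
\end{align*}
```

Dividing through by $1+\tau+\rho>0$ then gives $(1+\tau-\rho)/(1+\tau+\rho)\leq 1-\rho$ whenever $\rho>0$.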

Lemma 7

Let $P=P_{1}\times P_{2}$ . Then

[TABLE]

**Proof.** Trivial if both $\rho_{1}$ and $\rho_{2}$ (and hence, by (30) and the fact that $\tau_{2}>0$, also $\rho$) are negative.

If $\rho_{1}\leq 0$ , $\rho_{2}\geq 0$ , we have to show $\rho_{2}\geq\rho$ . This follows from (30). Similarly if $\rho_{1}\geq 0$ , $\rho_{2}\leq 0$ (using $\tau_{2}\leq 1)$ .

Finally, if $\rho_{1}>0$ , $\rho_{2}>0$ (and so also $\rho>0)$ , the result follows from (34). $\Box$

Corollary 6

Let $P=P_{1}\times\ldots\times P_{n}$ . Then

[TABLE]

From Lemma 6 and (42) we deduce:

Corollary 7

For decomposition $P=P_{1}\mid\ldots\mid P_{n}$ , $\mbox{{\it{o}}\rm{UB}}\leq\min\{1,1-\rho\}$ .

However the value $\mbox{\rm UB}=\min\{1,1-\rho\}$ can be achieved. If $\rho\leq 0$ , no mediators are required. If $\rho>0$ , the value $\mbox{{\it{o}}\rm{UB}}=1-\rho$ is achieved by the two-term decomposition (45). By Lemma 6, this largest upper bound $1-\rho$ is at least as large as the simple upper bound sUB of (26).

Since we know that $\mbox{{\it{o}}\rm{LB}}\leq\mbox{{\it{s}}\rm{UB}}$ , we cannot have identification of $\widetilde{\mbox{PC}}$ in this case unless these inequalities become equalities, which only holds when $\rho=1-\tau$ . In fact for the decomposition (45) we have $\mbox{{\it{o}}\rm{LB}}=\tau$ .

A.3 Negative data observed at some steps

The lower bounds at 0 are immediate from Equation (40). It is easy to verify that the lowest upper bound at 0 is achievable by the decomposition (46), and the highest upper bound at 1 is achievable from the decompositions in (47) and (48). Since these bounds are at 0 and 1 they are the extreme values obtainable from any process involving some negative data.

Appendix B Proof of Theorem 6

Using (17)–(20) and (21)–(24), we have the following upper bounds for a single step.

For $\rho\geq 0$ :

[TABLE]

For $\rho<0$ :

[TABLE]

When $\rho=0$ , all these upper bounds are 1, and the upper bound for any evidence sequence is 1.

Otherwise, $\gamma<1$ does not depend on $n$ , while $\delta^{\prime}\rightarrow 1$ as $n\rightarrow\infty$ . So there exists $N\geq 2$ such that $n>N\Rightarrow\gamma^{\prime}<(\delta^{\prime})^{2}$ . Henceforth we suppose $n>N$ . Then we have:

Lemma 8

[TABLE]

**Proof.** We have $\mbox{\rm UB}(111)=\{\mbox{\rm UB}(11)\}^{2}=(\delta^{\prime})^{2}$ if $\rho\geq 0$, or 1 if $\rho<0$, while $\mbox{\rm UB}(000)=\{\mbox{\rm UB}(00)\}^{2}=1$ if $\rho\geq 0$, or $(\delta^{\prime})^{2}$ if $\rho<0$. In all cases $\mbox{\rm UB}(101)=\mbox{\rm UB}(010)=\mbox{\rm UB}(10)\mbox{\rm UB}(01)=\gamma$, while $\mbox{\rm UB}(111)$ and $\mbox{\rm UB}(000)\geq(\delta^{\prime})^{2}>\gamma$. $\Box$

Corollary 8

The optimal sequence ${\mbox{$ \mathbf{s} $}}_{0}$ can not contain any subsequence of repeated values of length greater than 2.

We now consider separately the cases of positive and negative $\rho$ .

1. $\rho>0$ .

Suppose ${\mbox{$ \mathbf{s} $}}_{0}$ contains a subsequence $00$. It must then be followed by a $1$, so ${\mbox{$ \mathbf{s} $}}_{0}$ contains a subsequence $001$. Since $\mbox{\rm UB}(001)=\gamma$, while $\mbox{\rm UB}(011)=\gamma\delta^{\prime}<\gamma$, on replacing this subsequence by $011$ we would achieve a smaller upper bound. This contradiction shows that ${\mbox{$ \mathbf{s} $}}_{0}$ cannot contain any successive repeated $0$’s.

Now suppose ${\mbox{$ \mathbf{s} $}}_{0}$ contains a subsequence $11$ . Consider the first appearance of this. If not at the very end, it must be followed by $01$ . Now replace this subsequence $1101$ by $1011$ . Since $\mbox{\rm UB}(1101)=\mbox{\rm UB}(1011)=\gamma\delta^{\prime}$ , we have not changed $\mbox{\rm UB}({\mbox{$ \mathbf{s} $}}_{0})$ , but have postponed the first occurrence of $11$ . We can thus assume that the first such occurrence (if any) is at the very end.

If now $n$ is even, the first $n-1$ values must be the alternating sequence $1010\ldots 1$ . But this can not be followed by $11$ , since that would produce a subsequence $111$ . We deduce that ${\mbox{$ \mathbf{s} $}}_{0}$ must be the full alternating sequence $1010\ldots 1$ . The smallest possible upper bound with mixed evidence is thus $\mbox{{\it{m}}\rm{UB}}_{n}=\mbox{\rm UB}({\mbox{$ \mathbf{s} $}}_{0})=\gamma^{n/2}$ .

If $n$ is odd, there must be at least one appearance of $11$. The above argument now delivers as ${\mbox{$ \mathbf{s} $}}_{0}$ the alternating sequence $1010\ldots 1$ of length $n$, followed by the final $1$. We now have $\mbox{{\it{m}}\rm{UB}}_{n}=\mbox{\rm UB}({\mbox{$ \mathbf{s} $}}_{0})=\gamma^{(n-1)/2}\delta^{\prime}$.

2. $\rho<0$ .

The argument here is almost, but not quite, a mirror image of that above.

Suppose that ${\mbox{$ \mathbf{s} $}}_{0}$ contains a subsequence $11$. If not at the very end, it must be followed by a $0$, so that ${\mbox{$ \mathbf{s} $}}_{0}$ contains a subsequence $110$. Now, since $\mbox{\rm UB}(110)=\gamma$ and $\mbox{\rm UB}(100)=\gamma\delta^{\prime}<\gamma$, ${\mbox{$ \mathbf{s} $}}_{0}$ cannot contain any internal successive repeated $1$’s. Also, ${\mbox{$ \mathbf{s} $}}_{0}$ cannot end with $11$, and hence with $011$, since $\mbox{\rm UB}(001)=\delta^{\prime}<\mbox{\rm UB}(011)=1$. So there can be no repeated $1$’s.

Now suppose ${\mbox{$ \mathbf{s} $}}_{0}$ contains a subsequence $00$. Consider the first appearance of this. If not just before the final $1$, it must be followed by $10$. Now replace this subsequence $0010$ by $0100$. Since $\mbox{\rm UB}(0010)=\mbox{\rm UB}(0100)=\gamma\delta^{\prime}$, we have not changed $\mbox{\rm UB}({\mbox{$ \mathbf{s} $}}_{0})$, but have postponed the first occurrence of $00$. So the only possibility for two successive $0$s is if ${\mbox{$ \mathbf{s} $}}_{0}$ ends with $001$.

Before the end, we must have an alternating sequence $101\ldots 01$ .

If $n$ is even, we can not then conclude with $001$ , so in this case we must have the full alternating sequence $1010\ldots 1$ . The smallest possible upper bound with mixed evidence is, again, $\mbox{{\it{m}}\rm{UB}}_{n}=\mbox{\rm UB}({\mbox{$ \mathbf{s} $}}_{0})=\gamma^{n/2}$ .

If $n$ is odd we must have the alternating sequence $1010\ldots 1$ of length $n-2$ , followed by $001$ . We again find $\mbox{{\it{m}}\rm{UB}}_{n}=\mbox{\rm UB}({\mbox{$ \mathbf{s} $}}_{0})=\gamma^{(n-1)/2}\delta^{\prime}$ .
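The case analysis above can be cross-checked by brute force for small $n$. The sketch below treats $\gamma$ and $\delta^{\prime}$ as given numbers with $\gamma<(\delta^{\prime})^{2}<1$ (their definitions in (15) and (16) are not used), and assigns one-step upper bounds as read off from the values quoted in this proof: for $\rho>0$, $\mbox{\rm UB}(11)=\delta^{\prime}$, $\mbox{\rm UB}(00)=1$, $\mbox{\rm UB}(01)=\gamma$, $\mbox{\rm UB}(10)=1$; for $\rho<0$, $\mbox{\rm UB}(11)=1$, $\mbox{\rm UB}(00)=\delta^{\prime}$, $\mbox{\rm UB}(10)=\gamma$, $\mbox{\rm UB}(01)=1$. This allocation between the two mixed steps is an inference from the three-symbol values in the text, and all names are hypothetical.

```python
from itertools import product


def step_ub(a, b, gamma, delta1, rho_positive):
    """One-step upper bound UB(ab), as inferred from the values in Appendix B."""
    if rho_positive:
        table = {(1, 1): delta1, (0, 0): 1.0, (0, 1): gamma, (1, 0): 1.0}
    else:
        table = {(1, 1): 1.0, (0, 0): delta1, (1, 0): gamma, (0, 1): 1.0}
    return table[(a, b)]


def ub(seq, gamma, delta1, rho_positive):
    """Upper bound for a full sequence: product of the one-step bounds."""
    out = 1.0
    for a, b in zip(seq, seq[1:]):
        out *= step_ub(a, b, gamma, delta1, rho_positive)
    return out


def brute_force_min(n, gamma, delta1, rho_positive):
    """Minimise UB over sequences M_0..M_n with M_0 = M_n = 1 and at least one 0."""
    best = None
    for mid in product((0, 1), repeat=n - 1):
        if 0 not in mid:
            continue  # mixed evidence needs some negative observation
        seq = (1,) + mid + (1,)
        val = ub(seq, gamma, delta1, rho_positive)
        if best is None or val < best[0]:
            best = (val, seq)
    return best


def closed_form(n, gamma, delta1):
    """gamma^(n/2) for even n, gamma^((n-1)/2) * delta' for odd n."""
    return gamma ** (n // 2) * (delta1 if n % 2 else 1.0)


gamma, delta1 = 0.6, 0.9     # illustrative values with gamma < delta1**2
for n in range(2, 11):
    for rho_positive in (True, False):
        val, seq = brute_force_min(n, gamma, delta1, rho_positive)
        print(n, rho_positive, round(val, 6), round(closed_form(n, gamma, delta1), 6))
```

For even $n$ the minimiser returned is the fully alternating sequence; for odd $n$ several sequences can tie, so only the minimum value is compared.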

Appendix C Proofs for § 4.2.1

**Proof of Theorem 7.** Using Mathematica (Wolfram Research, Inc., 2018), we obtain expansions

[TABLE]

with

[TABLE]

The expression for $\mbox{{\it{u}}\rm{UB}}_{\infty}$ is obtained similarly.

Finally, since $\mbox{{\it{u}}\rm{LB}}_{n}=\mbox{{\it{s}}\rm{LB}}$ for all $n$ , we trivially have $\mbox{{\it{u}}\rm{LB}}_{\infty}=\mbox{{\it{s}}\rm{LB}}$ . $\Box$

**Proof of Lemma 2.** In (A.1) and (A.2), Mathematica gives

[TABLE]

In particular $k<0$ , $q>0$ . Thus

[TABLE]

where the leading term is positive, so $\mbox{{\it{o}}\rm{LB}}_{n}$ is eventually increasing. Similarly

[TABLE]

with negative leading term, so $\mbox{{\it{o}}\rm{LB}}_{n}$ is eventually concave in $n$ . A similar argument shows that, for $\rho>0$ , $\mbox{{\it{o}}\rm{UB}}_{n}$ is eventually decreasing and convex in $n$ . We note that the convergence of $\mbox{{\it{o}}\rm{UB}}_{n}$ to its limit is at a faster rate than for $\mbox{{\it{o}}\rm{LB}}_{n}$ .

The behaviour of $\mbox{{\it{u}}\rm{UB}}_{n}$ is obtained similarly (the limit being approached at rate $1/n$ ). $\Box$

**Proof of Lemma 3.** Consider the $n$-part homogeneous decomposition $P=P_{1}\mid\ldots\mid P_{n}$. Now replace each $P_{i}$ by its homogeneous 2-part decomposition $P_{i}=Q_{i1}\mid Q_{i2}$, so creating the $2n$-part homogeneous decomposition $P=Q_{11}\mid Q_{12}\mid\ldots\mid Q_{n1}\mid Q_{n2}$.

By Corollary 3 and (25) we see that uUB decreases on making these replacements.

The argument of § 3.4 shows that oLB is increased by these replacements.

To show the result for oUB it is enough to show that $\mbox{{\it{o}}\rm{UB}}<\mbox{{\it{s}}\rm{UB}}$ for a two-term homogeneous decomposition with $\rho>0$ . That is to say,

[TABLE]

or equivalently

[TABLE]

Noting that $\rho=\rho^{\prime}(1+\tau^{\prime})$ and $\tau=\tau^{\prime 2}$ , this becomes

[TABLE]

equivalent to $\rho^{\prime 2}<(1-\tau^{\prime})^{2}$ , which holds since, by (51) and (52), $\rho^{\prime}/(1-\tau^{\prime})=\rho/(1-\tau)\in(0,1)$ by assumption.

$\Box$
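For readers who want the intermediate algebra behind the last step, here is one way to carry it out. It assumes that, for $\rho>0$, the bound $\mbox{{\it{o}}\rm{UB}}$ of the two-term homogeneous decomposition is the square of the one-step simple upper bound, and it uses the shorthand $a=1+\tau^{\prime}$ and $c=1+\tau^{\prime 2}$, which is introduced here and is not the paper's notation. With $\tau=\tau^{\prime 2}$ and $\rho=a\rho^{\prime}$, the claim $\mbox{{\it{o}}\rm{UB}}<\mbox{{\it{s}}\rm{UB}}$ reads:

```latex
\begin{align*}
\left(\frac{a-\rho'}{a+\rho'}\right)^{2} &< \frac{c-a\rho'}{c+a\rho'}
  \quad\Longleftrightarrow\quad
  (a+\rho')^{2}(c-a\rho') - (a-\rho')^{2}(c+a\rho') > 0, \\[4pt]
(a+\rho')^{2}(c-a\rho') - (a-\rho')^{2}(c+a\rho')
  &= c\left[(a+\rho')^{2}-(a-\rho')^{2}\right]
     - a\rho'\left[(a+\rho')^{2}+(a-\rho')^{2}\right] \\
  &= 4a\rho' c - 2a\rho'\,(a^{2}+\rho'^{2}) \\
  &= 2a\rho'\left(2c-a^{2}-\rho'^{2}\right)
   = 2a\rho'\left((1-\tau')^{2}-\rho'^{2}\right),
\end{align*}
```

using $2c-a^{2}=2+2\tau^{\prime 2}-(1+\tau^{\prime})^{2}=(1-\tau^{\prime})^{2}$. Since $a>0$ and $\rho^{\prime}>0$, the expression is positive exactly when $\rho^{\prime 2}<(1-\tau^{\prime})^{2}$.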

Bibliography

Collier, David. 2011. “Understanding Process Tracing.” PS: Political Science & Politics 44(4):823–830.
Dawid, A. Philip, Monica Musio and Stephen E. Fienberg. 2016. “From Statistical Evidence to Evidence of Causality.” Bayesian Analysis 11:725–752.
Dawid, Alexander Philip. 2011. The Rôle of Scientific and Statistical Evidence in Assessing Causality. In Perspectives on Causation, ed. Richard Goldberg. Oxford: Hart Publishing pp. 133–147.
Dawid, Alexander Philip, Monica Musio and Rossella Murtas. 2017. “The Probability of Causation.” Law, Probability and Risk 16:163–179.
Dawid, Alexander Philip, Rossella Murtas and Monica Musio. 2016. Bounding the Probability of Causation in Mediation Analysis. In Topics on Methodological and Applied Statistical Inference. Springer pp. 75–84.
Gelman, Andrew and Guido Imbens. 2013. Why Ask Why? Forward Causal Inference and Reverse Causal Questions. Working Paper 19614 National Bureau of Economic Research. https://www.nber.org/papers/w19614
Mahoney, James. 2012. “The Logic of Process Tracing Tests in the Social Sciences.” Sociological Methods & Research 41(4):570–597.