Which practical interventions does the do-operator refer to in causal inference? Illustration on the example of obesity and cancer
Lola Etievant, Vivian Viallon

TL;DR
This paper examines the interpretation of the do-operator in causal inference, especially for exposures like obesity, by analyzing how interventions on causes of the exposure relate to the hypothetical intervention effect.
Contribution
It clarifies the conditions under which the effect of do(X=x) aligns with interventions on causes of X within structural causal models.
Findings
Effect of do(X=x) equals intervention on causes of X affecting outcome only through X
Interventions on causes affecting outcome through other pathways only partly captured by do(X=x)
In simple models, do(X=x) represents an indirect effect of interventions on causes W
Which practical interventions does the do-operator refer to in causal inference? Illustration on the example of obesity and cancer.
Lola Etievant (1) and Vivian Viallon (2)
(1) Univ Lyon, Université Claude Bernard Lyon 1, CNRS UMR 5208, Institut Camille Jordan, 43 boulevard du 11 novembre 1918, F-69622 Villeurbanne, France.
(2) International Agency for Research on Cancer, Nutritional Methodology and Biostatistics Group, Lyon, France.
Abstract
For exposures $X$ like obesity, no precise and unambiguous definition exists for the hypothetical intervention $do(X=x)$. This has raised concerns about the relevance of causal effects estimated from observational studies for such exposures. Under the framework of structural causal models, we study how the effect of $do(X=x)$ relates to the effect of interventions on causes of $X$. We show that for interventions focusing on causes of $X$ that affect the outcome through $X$ only, the effect of $do(X=x)$ equals the effect of the considered intervention. On the other hand, for interventions on causes $W$ of $X$ that affect the outcome not only through $X$, we show that the effect of $do(X=x)$ only partly captures the effect of the intervention. In particular, under simple causal models (e.g., linear models with no interaction), the effect of $do(X=x)$ can be seen as an indirect effect of the intervention on $W$.
1 Introduction
Because most epidemiological results are derived from observational data, their causal interpretation has always been a central concern [1]. Causal inference theory, which has attracted a lot of interest in the last few decades, has proved useful to formally describe conditions ensuring the causal validity of results derived from observational data [2, 3, 4, 5, 6, 7]. For example, several sets of sufficient conditions have been established for the identifiability of causal effects in the presence of confounding or non-random selection. Under the so-called Structural Causal Models (SCMs) [3, 6], and further assuming that the structure of the underlying Directed Acyclic Graph (DAG) is known, a key condition for the identifiability of the causal effect is exchangeability, or ignorability [3, 6, 7]. In particular, exchangeability has been shown to hold conditionally on any set of variables satisfying the back-door criterion [3, 6]. In addition, a variety of statistical approaches have been proposed for the estimation of causal effects under increasingly complex settings, including time-varying confounding and failure time data. These include the parametric g-formula, inverse probability weighting approaches, g-estimation and doubly robust procedures [3, 7, 8].
Even if their use has been controversial [9], counterfactual variables, or potential outcomes, are key to most causal inference theories commonly considered nowadays in epidemiology, social science, statistics and computer science. The $do$-calculus that accompanies SCMs allows precise definitions of these variables and their joint distribution [3, 6]. Here, we will use the notation $Y^{X=x}$ to denote the counterfactual variable representing the outcome that would have been observed in the counterfactual world that would have followed the hypothetical intervention $do(X=x)$, where $X$ is the exposure of interest and $x$ is any potential value for this exposure [3]. For simplicity, we will focus on binary outcomes, and we let $\mathbb{P}(Y^{X=x}=1)$ denote the probability of observing the outcome in this counterfactual world.
For some exposures, the lack of a precise and unambiguous definition for the intervention $do(X=x)$ has raised some concerns in the literature [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]. For example, consider the case where $X$ stands for a binary variable indicating obesity status at 20 years of age. In a population of lean teenagers, or even newborns, the hypothetical intervention $do(X=x)$, for $x=0$ (or $x=1$), could then correspond to a typically adaptive and dynamic intervention that would ensure that individuals stay lean (or get obese) by the age of 20. However, these interventions are not well-defined, in the sense that different “versions” may lead to the same obesity value at 20 years old. For instance, in the “stay lean” arm ($x=0$), individuals may be asked to do 45 minutes of physical exercise a day, or 72 minutes of physical exercise a day. They could also be asked to adhere to a healthy diet, etc. In addition, some of the versions ensuring that $X=x$ at 20 years old may be impossible to apply in practice, such as those involving genetic factors.
More generally, this situation of a treatment with different versions, or compound treatment, violates the “no-multiple-versions-of-treatment assumption”, which is part of the “Stable Unit Treatment Value Assumption” (SUTVA) [20, 16]. This has led to some debate around the relevance, for public health matters, of the causal effects estimated from observational studies in such cases. Interestingly, most arguments have considered the situation where “treatment precedes versions of that treatment”, while situations where “versions precede treatment” were only briefly mentioned, if at all [11, 12, 16]. Here, we consider the situations where versions precede treatment, in which case these versions can be seen as particular levels for the causes of $X$. Then, focusing on situations where direct interventions on $X$ are impractical, we inspect how the effect of the hypothetical intervention $do(X=x)$ relates to the effects of interventions on causes of $X$. We show that the effect of the hypothetical intervention $do(X=x)$ equals the effect of particular interventions on causes of $X$ that are causes of $Y$ through $X$ only, as expected. However, for causes $W$ that influence $Y$ not only through $X$, the causal effect of $do(X=x)$ differs from the causal effect of interventions on $W$. For example, in the particular case of obesity and cancer occurrence, the effect of $do(X=x)$ is different from the effects of interventions on diet or physical activity, except for cancers whose risk is not directly associated with diet and/or physical activity.
To make our illustrative example even more concrete, we assume throughout that we intend to estimate the causal effect of obesity at 20 years of age on the occurrence of cancer by the age of 50. A typical prospective cohort study would sample individuals who are cancer-free at the age of 20, record information regarding their obesity status and other variables (potential confounders, etc.) at inclusion, follow these individuals over the age interval 20-50 and finally record cancer occurrence by the age of 50. Denote by $X$ and $Y$ the binary variables representing obesity at 20 and cancer occurrence between 20 and 50. For simplicity, we further assume the absence of competing events and censoring.
The rest of the article is organized as follows. Even if this is highly unlikely in our illustrative example, we start in Section 2 by considering the unconfounded setting where all causes of $X$ are causes of $Y$ through $X$ only. Then, in Section 3, we consider a more realistic setting where confounders are present. We stress that this second setting is still an over-simplified version of the causal model in our illustrative example (see the Discussion). Yet, we believe it is instructive to describe the relationship between the intervention $do(X=x)$ and its multiple versions. Under both settings, we consider the situation where some causes are modifiable, while others are not. Section 4 presents some concluding remarks and discussion. Proofs of our main results are presented in the Appendix.
2 The unconfounded case
Because exposure is not randomized in our prospective cohort study, identifiability of the causal effect of $X$ on $Y$ is generally not guaranteed. A particular situation when this causal effect is identifiable is when all causes of $X$, denoted by $Z$ in this simple case, are causes of $Y$ through $X$ only. Even if this absence of confounders is highly unlikely in our illustrative example, it is instructive to consider this simple situation as a starting point. The more general situation where confounding is present is deferred to Section 3.
2.1 Preliminary derivations
Consider that the data available in our cohort study are generated by a causal model with associated DAG and structural equations as presented in Figure 1a. The causes of $X$, denoted by $Z$, and the remaining causes of $Y$ are assumed to be independent of each other, and both may include purely random components. Given the structural equations attached to this simple causal model, we have $Y = Y^{X=x}$ whenever $X = x$, so that consistency holds. Moreover, under this causal model, the ignorability condition $Y^{X=x} \perp\!\!\!\perp X$ holds. Then, whenever the positivity condition further holds ($\mathbb{P}(X=x) > 0$), we have
$$\mathbb{P}(Y^{X=x}=1) = \mathbb{P}(Y^{X=x}=1 \mid X=x) = \mathbb{P}(Y=1 \mid X=x),$$
and the causal effect of $X$ on $Y$ is identifiable. But, when direct interventions on $X$ are impractical, and only interventions on the causes of $X$ are practical, a natural question is the meaning of the hypothetical intervention $do(X=x)$. Consider the structural equation pertaining to exposure, $X = f_X(Z)$, and set $\mathcal{Z}_x = \{z : f_X(z) = x\}$. Of course, setting $Z$ to any $z \in \mathcal{Z}_x$ yields $X = x$. As a result, for any $z \in \mathcal{Z}_x$, $\mathbb{P}(Y^{Z=z}=1) = \mathbb{P}(Y^{X=x}=1)$; see Appendix A. In this simple case, all interventions on the causes of $X$ which would yield $X=x$ share the same effect on $Y$: versions are irrelevant [11, 16], and the causal effect estimated on the cohort is an estimate of this shared effect.
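The irrelevance of versions in the unconfounded case can be checked on a toy model by exact enumeration. The Python sketch below is only an illustration: the four levels of the causes of exposure, the uniform distributions, the noise variable `u`, and the functions `f_x` and `f_y` are all assumptions made for the example and do not come from the paper.

```python
from fractions import Fraction
from itertools import product

Z_VALUES = [0, 1, 2, 3]   # hypothetical levels of the causes Z of X (uniform)
U_VALUES = range(10)      # hypothetical noise on Y, independent of Z (uniform)


def f_x(z):
    """Assumed structural equation for X: obese (1) when z >= 2."""
    return int(z >= 2)


def f_y(x, u):
    """Assumed structural equation for Y: Z enters only through X."""
    return int(u < 2 + 3 * x)


def p_obs(x):
    """Observational P(Y=1 | X=x), by exact enumeration of (z, u)."""
    worlds = [(z, u) for z, u in product(Z_VALUES, U_VALUES) if f_x(z) == x]
    return Fraction(sum(f_y(x, u) for _, u in worlds), len(worlds))


def p_do_x(x):
    """P(Y=1) under the hypothetical intervention do(X=x)."""
    return Fraction(sum(f_y(x, u) for u in U_VALUES), len(U_VALUES))


def p_do_z(z):
    """P(Y=1) under do(Z=z): X then follows its structural equation."""
    return p_do_x(f_x(z))


for x in (0, 1):
    versions = [z for z in Z_VALUES if f_x(z) == x]
    print(x, p_obs(x), p_do_x(x), [p_do_z(z) for z in versions])
```

In this toy model every version `z` leading to the same exposure value gives the same risk, and that risk coincides with both the interventional and the observational quantity.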
2.2 Distinguishing modifiable and non-modifiable causes
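To gain insight from a practical standpoint, the previous analysis can be slightly refined by decomposing causes of $X$ as $Z = (Z_m, Z_{nm})$, where $Z_m$ and $Z_{nm}$ correspond to sets of modifiable and non-modifiable causes of $X$, respectively. See Figure 1b. Because non-modifiable causes may affect modifiable ones, while the former are unlikely to be affected by the latter, we do not consider the possibility of an arrow pointing from $Z_m$ to $Z_{nm}$ in Figure 1b. Causes $Z_{nm}$ are non-modifiable and the only interventions that could be practically set up are those on $Z_m$. Denote the set of possible values of $Z_{nm}$ by $\mathcal{Z}_{nm}$ and write the structural equation for exposure as $X = f_X(Z_m, Z_{nm})$. Then, for any $x$ and $z_{nm} \in \mathcal{Z}_{nm}$, set $\mathcal{Z}_m(x, z_{nm}) = \{z_m : f_X(z_m, z_{nm}) = x\}$. First assume that this set is non-empty for any $x$ and $z_{nm}$: in other words, first assume that, for any $x$, and for any value $z_{nm}$ for the non-modifiable factors $Z_{nm}$, there exists some value $z_m$ of the modifiable factors $Z_m$ such that $f_X(z_m, z_{nm}) = x$. Now, for individuals such that $Z_{nm} = z_{nm}$, for any $z_m \in \mathcal{Z}_m(x, z_{nm})$, we have $X^{Z_m = z_m} = x$. Therefore $\mathbb{P}(Y^{Z_m=z_m}=1 \mid Z_{nm}=z_{nm}) = \mathbb{P}(Y^{X=x}=1)$ for any $z_m \in \mathcal{Z}_m(x, z_{nm})$. Denote by $I_x$ any intervention which sets, for all individuals in the population, the value of $Z_m$ according to the value of $Z_{nm}$, in such a way that for any individual with $Z_{nm} = z_{nm}$, the intervention sets $Z_m$ to some $z_m \in \mathcal{Z}_m(x, z_{nm})$. Then, we have $\mathbb{P}(Y^{I_x}=1) = \mathbb{P}(Y^{X=x}=1)$. In other words, versions are again irrelevant and any such intervention has the same effect on $Y$, which is $\mathbb{P}(Y^{X=x}=1)$.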
Of course, unless there exists at least one value $z_m$ of the modifiable causes that yields $X=x$ whatever the value $z_{nm}$ of the non-modifiable causes, only a dynamic, i.e. individual-specific, treatment can be adopted to attain this effect. For instance, consider the “stay lean” arm of the clinical trial mentioned in the Introduction. Because individuals may be more or less genetically predisposed to obesity, some individuals will have to make little effort to stay lean by the age of 20, while others will have to adopt a drastic diet and/or have intense physical activity, etc. We stress that this heterogeneity among individuals is at the core of personalized (preventive) medicine and needs to be acknowledged, rather than discarded, in causal inference. Similarly, our cohort reflects this heterogeneity: individuals sharing the same obesity status $X=x$, for $x \in \{0,1\}$, can differ regarding $Z_m$ and $Z_{nm}$. More precisely, for $x \in \{0,1\}$, let $\mathcal{Z}_x$ denote the set of values $(z_m, z_{nm})$ of $(Z_m, Z_{nm})$ that yield $X=x$. The lean and obese groups in our cohort are sampled from
[TABLE]
for $x=0$ and $x=1$, respectively. Again, if the model of Figure 1b is correct, versions of the compound treatment obesity are not relevant [11, 16]. Therefore, how the levels of the causes of “obesity at 20 years of age” are mixed up in the group of obese, or lean, individuals in our cohort is not relevant either: our cohort would return unbiased estimates for the quantity $\mathbb{P}(Y^{X=x}=1)$, just as the clinical trial would. Then, the effect of the intervention $do(X=x)$ can again be interpreted as the effect of any intervention on the causes of $X$ ensuring $X=x$.
If, for some $x$, there exist some values $z_{nm}$ of the non-modifiable variables such that the set $\mathcal{Z}_m(x, z_{nm})$ of values of the modifiable causes yielding $X=x$ is empty, the intervention $do(X=x)$ is purely theoretical for individuals such that $Z_{nm} = z_{nm}$, since no practical intervention could yield $X=x$ for them. However, under the assumptions of SCMs, and if the DAG of Figure 1b is correct, the effect of the hypothetical intervention $do(X=x)$ can still be estimated from our cohort study even if no practical intervention ensuring $X=x$ exists for individuals with $Z_{nm} = z_{nm}$. Indeed, we have $\mathbb{P}(Y^{X=x}=1) = \mathbb{P}(Y=1 \mid X=x)$.
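The role of dynamic (stratum-specific) interventions on modifiable causes can be illustrated by enumerating every policy that guarantees a given exposure value. This is a minimal sketch under assumed ingredients: a binary non-modifiable cause, a three-level modifiable cause, and hypothetical structural functions `f_x` and `f_y`; none of these values are taken from the paper.

```python
from fractions import Fraction
from itertools import product

ZNM_VALUES = [0, 1]      # non-modifiable cause, e.g. genetic predisposition
ZM_VALUES = [0, 1, 2]    # modifiable cause, e.g. exercise level
U_VALUES = range(10)     # noise on Y, independent of the causes of X


def f_x(zm, znm):
    """Assumed: obese (1) when exercise is too low given predisposition."""
    return int(zm < 1 + znm)


def f_y(x, u):
    """Assumed: the causes of X affect Y through X only."""
    return int(u < 2 + 3 * x)


def versions(x, znm):
    """Values of the modifiable cause yielding X=x in stratum Z_nm=znm."""
    return [zm for zm in ZM_VALUES if f_x(zm, znm) == x]


def p_policy(policy):
    """P(Y=1) when Z_m is set to policy[znm] in each stratum (Z_nm uniform)."""
    worlds = list(product(ZNM_VALUES, U_VALUES))
    hits = sum(f_y(f_x(policy[znm], znm), u) for znm, u in worlds)
    return Fraction(hits, len(worlds))


def p_do_x(x):
    """P(Y=1) under the hypothetical intervention do(X=x)."""
    return Fraction(sum(f_y(x, u) for u in U_VALUES), len(U_VALUES))


def all_policies(x):
    """Every dynamic intervention guaranteeing X=x for all individuals."""
    choices = [versions(x, znm) for znm in ZNM_VALUES]
    return [dict(zip(ZNM_VALUES, combo)) for combo in product(*choices)]


for x in (0, 1):
    print(x, p_do_x(x), [(p, p_policy(p)) for p in all_policies(x)])
```

In this toy setting, predisposed individuals need the highest exercise level to stay lean while others have two options, yet every admissible policy returns the same risk as the hypothetical intervention on exposure.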
3 The more standard case with confounders
3.1 Preliminary analyses
We now turn our attention to the more common situation where confounding is present. Without loss of generality, assume that causes of $X$ are grouped in two sets, $Z$ and $W$. Here, and as above, causes in $Z$ are assumed to have an effect on $Y$ through $X$ only, while $W$ is the set of common causes of $X$ and $Y$, that is the set of confounders in the $X$-$Y$ relationship. In our illustrative example, $W$ could include gender, physical activity and dietary habit, while $Z$ might include genetic predisposition to obesity. Figure 2a depicts the corresponding causal model. Assume for ease of notation that the set of possible values for $W$ is discrete. Further recall that consistency still holds, and assume that $\mathbb{P}(X=x \mid W=w) > 0$ for all $w$ such that $\mathbb{P}(W=w) > 0$. Then, because $Y^{X=x} \perp\!\!\!\perp X \mid W$ under the model depicted in Figure 2a, the causal effect of $X$ on $Y$ is identifiable. More precisely, we have
$$\mathbb{P}(Y^{X=x}=1) = \sum_{w} \mathbb{P}(Y=1 \mid X=x, W=w)\,\mathbb{P}(W=w).$$
But, again, a natural question is how the hypothetical intervention $do(X=x)$ relates to interventions on causes of $X$. Neglecting for now issues related to the possibility to apply these interventions in practice, these interventions can concern either $Z$ only, $W$ only, or both $Z$ and $W$.
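Identification by adjustment on the confounders can be checked numerically on a small model. The sketch below is an assumption-laden illustration: the binary confounder, the three-level cause with no direct effect on the outcome, the uniform distributions and the functions `f_x` and `f_y` are all hypothetical. It compares the naive conditional risk, the standard back-door adjustment formula and the interventional risk computed directly from the structural equations.

```python
from fractions import Fraction
from itertools import product

# Equally weighted "worlds" (w, z, u): W, Z and U independent and uniform.
WORLDS = list(product([0, 1], [0, 1, 2], range(10)))


def f_x(w, z):
    """Assumed structural equation for exposure."""
    return int(w + z >= 2)


def f_y(x, w, u):
    """Assumed: W affects Y directly as well as through X (confounding)."""
    return int(u < 1 + 2 * x + 4 * w)


def prob(event, given=lambda w, z, u: True):
    """Exact P(event | given) over the enumerated worlds."""
    kept = [s for s in WORLDS if given(*s)]
    return Fraction(sum(event(*s) for s in kept), len(kept))


def y_obs(w, z, u):
    """Outcome observed when nothing is intervened on."""
    return f_y(f_x(w, z), w, u)


def naive(x):
    """P(Y=1 | X=x): not a causal quantity in this model."""
    return prob(y_obs, lambda w, z, u: f_x(w, z) == x)


def adjusted(x):
    """Back-door formula: sum over w of P(Y=1 | X=x, W=w) P(W=w)."""
    return sum(
        prob(y_obs, lambda w, z, u, w0=w0: f_x(w, z) == x and w == w0)
        * prob(lambda w, z, u, w0=w0: w == w0)
        for w0 in (0, 1)
    )


def truth(x):
    """P(Y=1) under do(X=x), from the structural equations."""
    return prob(lambda w, z, u: f_y(x, w, u))


for x in (0, 1):
    print(x, naive(x), adjusted(x), truth(x))
```

Exact fractions are used instead of simulation so that the equality between the adjusted and interventional quantities, and the gap with the naive one, are not blurred by sampling noise.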
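First consider interventions on $Z$ only. Writing the structural equation for exposure as $X = f_X(W, Z)$, set, for any $x$ and $w$, $\mathcal{Z}_x(w) = \{z : f_X(w, z) = x\}$. For any $z \in \mathcal{Z}_x(w)$, we have $X^{Z=z} = x$ for individuals belonging to stratum $\{W=w\}$. Then, assume that $\mathcal{Z}_x(w)$ is non-empty for all $w$ and denote by $I^Z_x$ any intervention setting $Z$ to any value $z \in \mathcal{Z}_x(w)$ for individuals in stratum $\{W=w\}$, for all $w$. Arguing as in Section 2.2, we get $\mathbb{P}(Y^{I^Z_x}=1) = \mathbb{P}(Y^{X=x}=1)$; see Section B.1 in the Appendix. Again, versions are irrelevant, and any such intervention has the same effect on $Y$, which is $\mathbb{P}(Y^{X=x}=1)$.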
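Now consider interventions on $W$ only and set, for any $x$ and $z$, $\mathcal{W}_x(z) = \{w : f_X(w, z) = x\}$, where $f_X$ denotes the structural function of $X$. Then, assume that $\mathcal{W}_x(z)$ is non-empty for every $z$, and for any $z$, denote by $w_x(z)$ one given element of $\mathcal{W}_x(z)$. Given this particular collection of values $(w_x(z))_z$, denote by $I^W_x$ the intervention which sets $W$ to $w_x(z)$ for individuals in stratum $\{Z=z\}$, for all $z$. Arguing as before, it comes that $\mathbb{P}(Y^{I^W_x}=1) = \sum_z \mathbb{P}(Y^{X=x, W=w_x(z)}=1 \mid Z=z)\,\mathbb{P}(Z=z)$, which generally differs from $\mathbb{P}(Y^{X=x}=1)$. The intervention $I^W_x$ does entail $X=x$ for all individuals, but because $W$ has an effect on $Y$ not only through $X$, the effect of $I^W_x$ is not entirely captured by that of $do(X=x)$. Actually, $X$ can be seen as a mediator in the $W$-$Y$ relationship, and, under simple models, in particular in the absence of interaction between $X$ and $W$, the effect of $do(X=x)$ is actually related to the indirect effect of the intervention $I^W_x$, through $X$; see Section B.3 in the Appendix. It is also important to note that $\mathbb{P}(Y^{I^W_x}=1)$ depends on the collection of values $(w_x(z))_z$. If $w$ and $w'$ are two distinct elements of $\mathcal{W}_x(z)$ for some $z$, then $X^{W=w} = X^{W=w'} = x$ in stratum $\{Z=z\}$, while $\mathbb{P}(Y^{W=w}=1 \mid Z=z) \neq \mathbb{P}(Y^{W=w'}=1 \mid Z=z)$ in general. The difference between these two quantities is related to the direct effect of $W$, and reflects the fact that two interventions on $W$ sharing the same effect on $X$ do not necessarily have the same effects on $Y$ when $W$ has a direct effect on $Y$: in this case, versions of the compound treatment are relevant.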
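That versions become relevant when the intervened cause is a confounder can be seen on a toy model. The following sketch rests on assumptions made only for illustration: a three-level modifiable confounder, a binary cause with no direct effect on the outcome, independence and uniformity of all variables, and hypothetical functions `f_x` and `f_y`. Two different "stay lean" policies on the confounder both guarantee a lean exposure status, yet give different risks, neither equal to the risk under the hypothetical intervention on exposure.

```python
from fractions import Fraction
from itertools import product

W_VALUES = [0, 1, 2]   # modifiable confounder, e.g. an unhealthy-diet score
Z_VALUES = [0, 1]      # cause of X with no direct effect on Y
U_VALUES = range(10)   # noise on Y


def f_x(w, z):
    """Assumed structural equation for exposure."""
    return int(w + z >= 2)


def f_y(x, w, u):
    """Assumed: W has a direct effect on Y, with no X-by-W interaction."""
    return int(u < 1 + 2 * x + 2 * w)


def versions(x, z):
    """Values w such that setting W=w yields X=x in stratum Z=z."""
    return [w for w in W_VALUES if f_x(w, z) == x]


def p_do_x(x):
    """P(Y=1) under do(X=x); W keeps its natural (uniform) distribution."""
    worlds = list(product(W_VALUES, U_VALUES))
    return Fraction(sum(f_y(x, w, u) for w, u in worlds), len(worlds))


def p_policy(policy):
    """P(Y=1) when W is set to policy[z] in each stratum Z=z (Z uniform)."""
    worlds = list(product(Z_VALUES, U_VALUES))
    hits = sum(f_y(f_x(policy[z], z), policy[z], u) for z, u in worlds)
    return Fraction(hits, len(worlds))


stay_lean_a = {0: 0, 1: 0}   # each entry belongs to versions(0, z)
stay_lean_b = {0: 1, 1: 0}   # another version that also guarantees X=0

print(p_policy(stay_lean_a), p_policy(stay_lean_b), p_do_x(0))
```

The design choice here is to keep the outcome model additive in exposure and confounder, so that the discrepancy between policies is due only to the direct effect of the confounder.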
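Now, if the set $\mathcal{Z}_x(w)$ of values of $Z$ yielding $X=x$ in stratum $\{W=w\}$ is empty for some $w$, then no intervention on $Z$ only can ensure $X=x$ for individuals in stratum $\{W=w\}$. Similarly, if the set $\mathcal{W}_x(z)$ of values of $W$ yielding $X=x$ in stratum $\{Z=z\}$ is empty for some pair $(x, z)$, then no intervention on $W$ only can ensure $X=x$ for individuals in stratum $\{Z=z\}$. Then, consider interventions on both $Z$ and $W$, and let $\mathcal{V}_x$ denote the set of pairs $(w, z)$ such that setting $(W, Z)$ to $(w, z)$ yields $X=x$. For any $(w, z) \in \mathcal{V}_x$, it is easy to show that $\mathbb{P}(Y^{W=w, Z=z}=1) = \mathbb{P}(Y^{X=x, W=w}=1)$. Therefore, interventions on both $Z$ and $W$ that ensure $X=x$ are similar to interventions on $W$ only: their effects are generally not uniquely defined (they depend on the particular pair of values $(w, z)$) and only partly capture the effect of interventions on $X$.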
3.2 Distinguishing modifiable and non-modifiable causes
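All the analyses above can be refined by acknowledging that some causes in $Z$ and $W$ are modifiable, while others are not, and by considering interventions on modifiable causes only. See Figure 2b. Compared to Section 3.1, notation becomes a little more complex, but conclusions remain mostly similar. For instance, consider interventions on both $Z_m$ and $W_m$, where $Z_m$ is a modifiable cause of $X$ with no direct effect on $Y$, while $W_m$ is a modifiable confounder in the $X$-$Y$ relationship. For any $x$ and any potential values $z_{nm}$ and $w_{nm}$ for non-modifiable causes $Z_{nm}$ and $W_{nm}$, assume that the set of pairs $(z_m, w_m)$ yielding $X=x$ in stratum $\{Z_{nm}=z_{nm}, W_{nm}=w_{nm}\}$ is non-empty, and denote by $(z_m^*, w_m^*)$ one given element in this set. Then denote by $I_x$ the intervention setting $Z_m$ to $z_m^*$ and $W_m$ to $w_m^*$ for any individuals in stratum $\{Z_{nm}=z_{nm}, W_{nm}=w_{nm}\}$, for all $(z_{nm}, w_{nm})$. Arguing as before, it can be shown that $\mathbb{P}(Y^{I_x}=1) = \sum_{z_{nm}, w_{nm}} \mathbb{P}(Y^{X=x, W_m=w_m^*}=1 \mid Z_{nm}=z_{nm}, W_{nm}=w_{nm})\,\mathbb{P}(Z_{nm}=z_{nm}, W_{nm}=w_{nm})$, where $w_m^*$ depends on the stratum. This quantity generally differs from $\mathbb{P}(Y^{X=x}=1)$ and the reason again is that the intervention not only ensures that $X=x$, but it also has a direct effect on $Y$ through the intervention on $W_m$.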
4 Conclusion-Discussion
In this article, we showed how the hypothetical intervention $do(X=x)$, when impossible to apply in practice, relates to interventions on causes of $X$. Basing our arguments on structural causal models, our conclusions are in line with those of Petersen [12]: the DAG which represents our assumptions on the causal model under study is basically sufficient (and necessary) to precisely understand how $do(X=x)$ can be interpreted. When interventions on causes of $X$ that are causes of $Y$ through $X$ only exist, the effect of $do(X=x)$ captures the effect of such interventions. However, for causes of $X$, say $W$, that cause $Y$ not only through $X$, the effect of $do(X=x)$ only partly captures the effect of interventions on $W$. Under simple causal models, the effect of $do(X=x)$ is related to the indirect effect of interventions on $W$.
Taking the example of obesity (at 20 years old) and the risk of cancer (by the age of 50), our results confirm concerns raised by several authors [16, 19, 11]: because most modifiable causes of obesity can be regarded as confounders in the obesity-cancer relationship, the effect of obesity estimated from observational data likely differs from the effect of interventions on these causes, which could be estimated through clinical trials. At this point, however, we stress that, if all modifiable causes of obesity are confounders in the obesity-cancer relationship, then clinical trials would not yield an estimate of the effect of obesity on cancer. Instead, a clinical trial would return an estimate of the causal effect of the considered intervention on cancer, and this effect would only partly capture the effect of obesity. Consider again the clinical trial sketched in the Introduction. More precisely, consider a randomized clinical trial where the study population, corresponding, e.g., to lean teenagers, is randomly assigned to two arms. Denote by $W$ and $Z$ the other, possibly non-modifiable, causes of $X$, with $W$ corresponding to common causes of $X$ and $Y$, and $Z$ corresponding to causes of $Y$ through $X$ only.
In this setting, observe that $Y^{X=x} \not\perp\!\!\!\perp W$ while $Y^{X=x} \perp\!\!\!\perp (W,Z)$ in general. Denote by $\mathcal{W}$ and $\mathcal{Z}$ the sets of possible values for $W$ and $Z$, respectively. Then, an “ideal” clinical trial would consist in randomly assigning individuals to one of the following two groups: those for whom would be set to and those for whom would be set to , for two given collections of values and , where and ensure that and , respectively, for individuals with $W=w$ and $Z=z$. Assuming complete compliance, and arguing as in Section 3, it is easy to show that the comparison of these two groups would return an estimate of the effect of this particular intervention on $Y$, not that of $X$. Comparisons should be made between groups of individuals sharing the same value for and to obtain a valid estimate of the effect of obesity, within strata defined by and .
In other words, under this ideal clinical trial setting, non-modifiable confounders in the $X$-$Y$ relationship would still have to be measured and controlled for to unbiasedly estimate the causal effect of obesity, within strata defined by and . When a sufficient set of confounders is controlled for, analyses based on observational studies can be used to derive unbiased estimates of these same effects.
There are a number of subtleties that we neglected for the sake of simplicity. First, a clinical trial whose objective is to prevent obesity by the age of 20 would typically not only be dynamic, but also adaptive, i.e. the intervention is not only subject-specific, but it is also time-dependent. A good example is the Feeding Dynamic Intervention, to prevent childhood obesity (https://clinicaltrials.gov/ct2/show/NCT01515254). Similarly, although we focused on time-fixed exposure and confounders, they are all time-varying in the population. For instance, physical activity and food intakes vary with age, and the corresponding variables are all potential confounders in the relationship between obesity at 20 years old and cancer occurrence before 50 years old. Another important time-varying cause of obesity at 20 years old is obesity at earlier ages. Consequently, individuals in the two groups of our cohort, obese and lean at 20 years old, do not only differ because of their status regarding obesity at 20 years of age, they also typically differ with respect to their histories regarding obesity, physical activity and dietary habits. This can lead to biases if these histories are not appropriately accounted for in the analysis [21]. Second, selection bias may also be at play in our cohort study since only individuals who are cancer-free at 20 can be included. This selection bias will be more severe if cancer risk before 20 years old is associated with levels of obesity, physical activity and dietary habits over the age interval [0, 19]. This selection bias due to prevalent exposure and depletion of susceptibles has been put forward as one of the reasons explaining the discrepancies between results obtained through observational and interventional data when studying, for instance, the association between hormone replacement therapy and coronary heart disease [22].
Appendix A Proof in the unconfounded case
Under the model depicted in Figure 1a, we have
[TABLE]
Appendix B Proof in the confounded case
B.1 Interventions on $Z$ only
Assume that is non-empty for any . Then, under the model depicted in Figure 2a, we have, for any
[TABLE]
where the last equality follows from rule 2 of the do-calculus [3].
Moreover,
[TABLE]
B.2 Interventions on $W$ only
Assume that is non-empty for any . Then, under the model depicted in Figure 2a, we have, for any
[TABLE]
B.3 Relationship with indirect effects
Denote by two given collections of values such that and . Further let and denote two given interventions setting to and , respectively, for individuals in stratum , for all . We have
[TABLE]
The term can be regarded as an indirect effect since the level of is held fixed and only the value of changes from to which, for individuals in stratum , equal and respectively. More precisely, we have
[TABLE]
Under the model depicted in Figure 2a, recall we have
[TABLE]
Under simple causal models, for instance when , the two quantities, and , coincide and equal . However, under more complex models, these two quantities are typically different. Even under linear models, if interaction terms of the form are present in function , these two terms are typically different and would actually depend on the collection of values .
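The role of interaction in the relationship between the effect of the hypothetical intervention on exposure and the indirect effect of interventions on a confounder can be sketched numerically. Everything in the snippet below is an assumption made for illustration: the toy distributions, the outcome function returned by `make_f_y`, and its `interaction` parameter, which scales a hypothetical product term between exposure and confounder. The indirect effect is computed by holding the confounder at a stratum-specific reference value and switching exposure from 0 to 1.

```python
from fractions import Fraction
from itertools import product

W_VALUES, Z_VALUES, U_VALUES = [0, 1, 2], [0, 1], range(10)


def make_f_y(interaction):
    """Assumed outcome equation; `interaction` scales a hypothetical x*w term."""
    return lambda x, w, u: int(u < 1 + 2 * x + 2 * w + interaction * x * w)


def p_do_x(f_y, x):
    """P(Y=1) under do(X=x), with W uniform on W_VALUES."""
    worlds = list(product(W_VALUES, U_VALUES))
    return Fraction(sum(f_y(x, w, u) for w, u in worlds), len(worlds))


def p_nested(f_y, x_by_z, w_by_z):
    """P(Y=1) when, in stratum Z=z, X is set to x_by_z[z] and W to w_by_z[z]."""
    worlds = list(product(Z_VALUES, U_VALUES))
    hits = sum(f_y(x_by_z[z], w_by_z[z], u) for z, u in worlds)
    return Fraction(hits, len(worlds))


def indirect_effect(f_y, w_ref):
    """Switch X from 0 to 1 in every stratum while holding W at w_ref[z]."""
    all_one, all_zero = {0: 1, 1: 1}, {0: 0, 1: 0}
    return p_nested(f_y, all_one, w_ref) - p_nested(f_y, all_zero, w_ref)


def do_effect(f_y):
    """Risk difference between do(X=1) and do(X=0)."""
    return p_do_x(f_y, 1) - p_do_x(f_y, 0)


for k in (0, 1):
    f_y = make_f_y(k)
    print(k, do_effect(f_y),
          indirect_effect(f_y, {0: 0, 1: 0}),
          indirect_effect(f_y, {0: 1, 1: 0}))
```

Without the product term, the indirect effect does not depend on the reference values and matches the effect of the intervention on exposure; with it, the two quantities differ and the indirect effect varies with the chosen collection of reference values.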
References
- [1] K. J. Rothman, S. Greenland, and T. L. Lash, Modern Epidemiology. Lippincott Williams & Wilkins, 2008.
- [2] D. B. Rubin, “Estimating causal effects of treatments in randomized and nonrandomized studies,” Journal of Educational Psychology, vol. 66, no. 5, pp. 688–701, 1974.
- [3] J. Pearl, Causality: Models, Reasoning, and Inference. Cambridge, U.K.; New York: Cambridge University Press, 2000.
- [4] K. J. Rothman and S. Greenland, “Causation and causal inference in epidemiology,” American Journal of Public Health, vol. 95, no. S1, pp. S144–S150, 2005.
- [5] M. Glymour and S. Greenland, “Causal diagrams,” in Modern Epidemiology, 3rd ed., pp. 183–209, Lippincott Williams & Wilkins, 2008.
- [6] J. Pearl, “Causal inference in statistics: An overview,” Statistics Surveys, vol. 3, pp. 96–146, 2009.
- [7] M. A. Hernan and J. M. Robins, Causal Inference. Boca Raton: Chapman & Hall/CRC, forthcoming.
- [8] J. K. Lunceford and M. Davidian, “Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study,” Statistics in Medicine, vol. 23, no. 19, pp. 2937–2960, 2004.
