The Impact of Confounder Selection in Propensity Scores for Rare Events Data - with Applications to Birth Defects
Ronghui Xu, Jue Hou, Christina D. Chambers

TL;DR
This study examines how different confounder selection methods affect propensity score analyses in rare event settings, such as birth defects, revealing that certain approaches lead to more stable and reliable estimates.
Contribution
It compares various confounder selection strategies in propensity score methods for rare events, highlighting the impact on variance and estimate stability through simulation and empirical data.
Findings
IPW without confounder selection yields high variance in estimates.
Selection based on univariate association improves IPW performance.
Regression adjustment remains stable regardless of confounder selection method.
Table 2. Setup of the four simulation scenarios and number of outcome events across simulation runs.

| Scenario | Sample size | Potential confounders | Number of events, average (SD) | Runs with no events in unexposed | Runs with ≤ 5 events in unexposed |
|---|---|---|---|---|---|
| I | 600 | 30 | 31.1 (5.5) | 0.8% | 66% |
| II | 600 | 30 | 30.8 (5.4) | 0.7% | 63% |
| III | 439 | 37 | 34.1 (5.5) | 0.6% | 59% |
| IV | 439 | 37 | 34.0 (5.4) | 0.7% | 63% |

Table 3. Confounder selection results using CIE and the univariate p-value.

| Scenario | CIE: True Pos. | CIE: False Pos. | CIE: Incl. | CIE: Exact | p-value: True Pos. | p-value: False Pos. | p-value: Incl. | p-value: Exact |
|---|---|---|---|---|---|---|---|---|
| I | 1.60 | 0.69 | 63% | 30% | 1.30 | 2.84 | 42% | 0.7% |
| II | 1.53 | 0.63 | 57% | 28% | 1.27 | 2.80 | 41% | 0.8% |
| III | 1.55 | 0.88 | 60% | 18% | 1.83 | 1.94 | 84% | 12.5% |
| IV | 1.54 | 0.90 | 59% | 17% | 1.82 | 1.91 | 82% | 13.0% |

Table 4. Top five selected confounders, with selection frequencies, for each scenario and selection method.

| Scenario | Method | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| I | CIE | 89.35% | 69.19% | 28.27% | 22.31% | 7.35% |
| I | PVAL | 81.96% | 70.88% | 59.35% | 58.15% | 7.61% |
| II | CIE | 87.85% | 64.18% | 24.81% | 18.59% | 7.35% |
| II | PVAL | 81.40% | 69.53% | 59.26% | 58.49% | 7.14% |
| III | CIE | Asthma (80.11%) | Mat_height (73.15%) | Referral (62.34%) | State (3.56%) | Education (2.82%) |
| III | PVAL | Asthma (93.38%) | Mat_height (87.49%) | SES (11.87%) | IVF (10.81%) | Mat_weight (9.61%) |
| IV | CIE | Asthma (80.87%) | Mat_height (70.95%) | Referral (61.78%) | State (3.63%) | Education (3.06%) |
| IV | PVAL | Asthma (93.65%) | Mat_height (86.05%) | SES (11.43%) | IVF (10.42%) | Referral (10.11%) |
Ronghui Xu1,2∗, Jue Hou2 and Christina D. Chambers3,1
1Department of Family Medicine and Public Health,
2Department of Mathematics, 3Department of Pediatrics, University of California, San Diego.
*Correspondence: 9500 Gilman Drive, La Jolla, CA 92093-0112, [email protected].
Abstract
Our work was motivated by a recent study on birth defects of infants born to pregnant women exposed to a certain medication for treating chronic diseases. Outcomes such as birth defects are rare events in the general population, which often translate to very small numbers of events in the unexposed group. As drug safety studies in pregnancy are typically observational in nature, we control for confounding in this rare events setting using propensity scores (PS). Using our empirical data, we noticed that the estimated odds ratio for birth defects due to exposure varied drastically depending on the specific approach used. The commonly used approaches with PS are matching, stratification, inverse probability weighting (IPW) and regression adjustment. The extremely rare events setting renders the matching or stratification infeasible. In addition, the PS itself may be formed via different approaches to select confounders from a relatively long list of potential confounders. We carried out simulation experiments to compare different combinations of approaches: IPW or regression adjustment, with 1) including all potential confounders without selection, 2) selection based on univariate association between the candidate variable and the outcome, 3) selection based on change in effects (CIE). The simulation showed that IPW without selection leads to extremely large variances in the estimated odds ratio, which help to explain the empirical data analysis results that we had observed. The simulation also showed that IPW with selection based on univariate association with the outcome is preferred over IPW with CIE. Regression adjustment has small variances of the estimated odds ratio regardless of the selection methods used.
Key words: change-in-estimate; inverse probability weighting; regression adjustment; variable selection.
1 Introduction
Our work was motivated by research carried out by the Research Center for the Organization of Teratology Information Specialists (OTIS), which is a North American network of university or hospital based teratology services that counsel between 70,000 and 100,000 pregnant women every year. Research subjects are referred to the Research Center from the Teratology Information Services and through other methods of recruitment, where women are consented and the mothers and their babies are followed prospectively over time. Phone interviews are conducted through the length of the pregnancy along with pregnancy diaries recorded by the mother. An outcome telephone interview is conducted shortly after the pregnancy ends. If the pregnancy results in a live birth, a dysmorphology exam is performed within the first year of life and with further follow-ups at one year and possibly later dates.
The birth prevalence of major birth defects in the general population is about 3%, according to the Centers for Disease Control and Prevention (CDC) Metropolitan Atlanta Congenital Defects Program (MACDP), a population-based birth defects surveillance program and population-based references for secondary endpoints Rynn et al. (2008). As pregnant women exposed to a specific medication or other substance in a given recruitment time period are often limited in number, sample sizes in these safety research studies are often limited to as few as 200 subjects in each exposure group, and are powered to detect an odds ratio (OR) of 3 or larger Chambers et al. (2001). When there is no increased risk of birth defects, this often results in fewer than 10 events in each group.
In a recent study conducted to evaluate the safety in pregnancy of a specific medication used to treat certain chronic maternal diseases, we had 319 pregnant women who were exposed to the medication and whose pregnancies ended in live birth, and 144 pregnant women who had the underlying diseases but were not exposed to the medication and whose pregnancies also ended in live birth. Among these there were 30 major birth defects in the exposed group and 5 major birth defects in the unexposed group.
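As a small worked illustration of why so few unexposed events matter, the sketch below computes the crude odds ratio from the counts just quoted, together with the usual large-sample (Woolf) standard error of its logarithm. It uses all 463 live births, so it need not match any table entry that was computed on a reduced sample; the variable names are ours.

```python
import math

# Counts reported in the text: live births and major birth defects per group.
exposed_n, exposed_events = 319, 30
unexposed_n, unexposed_events = 144, 5

a = exposed_events                    # exposed, event
b = exposed_n - exposed_events        # exposed, no event
c = unexposed_events                  # unexposed, event
d = unexposed_n - unexposed_events    # unexposed, no event

crude_or = (a * d) / (b * c)
# Woolf (large-sample) standard error of the log odds ratio.
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
# Share of the variance of the log OR that comes from the unexposed events alone.
share_from_unexposed_events = (1 / c) / se_log_or**2

print(round(crude_or, 2), round(se_log_or, 3), round(share_from_unexposed_events, 2))
```

The crude odds ratio is about 2.89, and roughly four fifths of the variance of its logarithm comes from the 1/5 term, i.e. from the five events in the unexposed group.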
In observational studies with such rare events, propensity score (PS) methods are well established in the literature to account for potential confounding Braitman and Rosenbaum (2002); Cepeda et al. (2003); Patorno et al. (2014). These PS-based methods generally include matching, stratification, regression adjustment and inverse probability of treatment weighting (IPTW, or IPW in general). Because events are extremely rare in our case, matching and stratification become impractical. IPW, on the other hand, has become popular at least partly due to its ease of implementation, since most regression software allows weights as an option. In the following we consider regression adjustment and IPW using PS.
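To make the IPW step concrete, here is a minimal sketch, assuming the propensity scores have already been estimated. It forms stabilized weights (marginal probability of the received exposure divided by its conditional probability) and computes the odds ratio from the weighted 2x2 table, which coincides with the estimate from a weighted logistic regression that has exposure as its only covariate. The function names and the toy inputs are ours, not the authors'.

```python
def stabilized_weights(exposure, propensity):
    """Stabilized IPW weights: P(A = a) / P(A = a | covariates)."""
    p_exposed = sum(exposure) / len(exposure)
    weights = []
    for a, e in zip(exposure, propensity):
        if a == 1:
            weights.append(p_exposed / e)
        else:
            weights.append((1 - p_exposed) / (1 - e))
    return weights


def weighted_odds_ratio(exposure, outcome, weights):
    """Odds ratio from the weighted 2x2 table of exposure by outcome."""
    cell = {(1, 1): 0.0, (1, 0): 0.0, (0, 1): 0.0, (0, 0): 0.0}
    for a, y, w in zip(exposure, outcome, weights):
        cell[(a, y)] += w
    denominator = cell[(1, 0)] * cell[(0, 1)]
    if denominator == 0:
        # For example, no events in the unexposed group.
        return float("inf")
    return (cell[(1, 1)] * cell[(0, 0)]) / denominator


# Toy inputs (hypothetical): the fifth subject is unexposed despite a high
# estimated propensity, so it receives a large weight.
exposure = [1, 1, 1, 1, 0, 0]
outcome = [1, 0, 0, 1, 1, 0]
propensity = [0.5, 0.5, 0.5, 0.5, 0.95, 0.5]
weights = stabilized_weights(exposure, propensity)
print(weights)
print(weighted_odds_ratio(exposure, outcome, weights))
```

An unexposed subject with an estimated propensity close to one gets a weight far above the others, which is the mechanism by which a handful of observations can dominate the weighted estimate.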
Our main concern is to what extent we should perform variable selection in computing the PS. In practice we do not know whether an observed variable is truly a confounder Greenland et al. (1999), and different methods have been used to assess confounding. Two common approaches in practice are: 1) change-in-estimate (CIE; Mickey and Greenland, 1989), which indirectly assesses the association of the candidate variable with both the exposure and the outcome, since a confounder should be a common cause of both; 2) significance testing of the association between the candidate variable and the outcome only, which was recommended by Rubin (1997) in order to reduce the variance of the estimated exposure effect. A third approach is to include all potential confounders. While it has been shown that variables that are only weakly associated with the outcome should not be included in the PS for small studies Brookhart et al. (2006), this does not appear to be widely known and confusion persists in practice Rotnitzky et al. (2010).
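The significance-testing approach can be sketched as follows. This is an illustration under our own assumptions: the candidate is a single numeric column, the test is a two-sided Wald test from a univariate logistic regression fitted by plain Newton-Raphson, and the data are simulated with hypothetical coefficients. The source does not state which test statistic was used.

```python
import math
import numpy as np


def univariate_p_value(y, x, n_iter=25):
    """Two-sided Wald p-value for x in a logistic regression of y on x alone."""
    design = np.column_stack([np.ones(len(x)), np.asarray(x, dtype=float)])
    y = np.asarray(y, dtype=float)
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        hessian = design.T @ (design * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, design.T @ (y - p))
    # Standard error from the inverse Fisher information at the fitted values.
    p = 1.0 / (1.0 + np.exp(-design @ beta))
    information = design.T @ (design * (p * (1.0 - p))[:, None])
    se = math.sqrt(np.linalg.inv(information)[1, 1])
    z = beta[1] / se
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical data: a binary candidate that strongly predicts the outcome.
rng = np.random.default_rng(0)
n = 2000
candidate = rng.binomial(1, 0.5, n)
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + 1.5 * candidate))))
p_strong = univariate_p_value(y, candidate)
print(p_strong < 0.05)
```

A candidate would be kept when its p-value falls below the chosen cutoff. A categorical candidate would need one dummy column per non-reference level and a joint test, which this sketch does not cover.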
Table 1 shows the results of analyses using either regression adjustment or IPW with stabilized weights Robins et al. (2000); Hernan et al. (2009), with propensity scores formed by CIE to confirm actual confounders or by simply including all potential confounders collected in the study without any selection or confirmation. The list of all potential confounders is provided in Supplement Table 5. When all potential confounders are included, the sample size is slightly reduced due to missing values, leading to a slightly different crude (i.e. unadjusted) odds ratio (OR) between exposure to the medication and the outcome of major birth defects. It is clear that the IPW approach using all potential confounders gives an OR of 6.45, which is very different from the other estimated ORs. In the following we carry out simulation experiments to study the behavior of different approaches that are aimed at estimating causal effects of exposure using propensity scores.
2 Simulation Setup
Here we restrict our attention to a binary outcome, and the effect measure commonly used in practice is the OR. As logistic regression is commonly used and will be used to generate data here, we briefly discuss the non-collapsibility of logistic regression Greenland et al. (1999). This can be summarized as the discrepancy between the ‘population averaged’ effect and the ‘conditional’ effect under the logistic regression model given other covariates. Let the exposure indicator equal 1 for the exposed group and 0 for the unexposed group. The logistic regression model for the binary outcome is
[TABLE]
where the remaining terms are the additional covariates. The exposure coefficient in the data generating model, i.e. the conditional exposure effect given the covariates, is often used as the ‘true’ effect in simulation studies for assessing bias and estimation error in general Brookhart et al. (2006); Pirracchio et al. (2012). While this coefficient might be a reasonable target for the regression adjustment approach, we note that it is not the probability limit to which the IPW estimator converges. In the Appendix we show that the IPW estimator converges to the logarithm of the marginal odds ratio between exposure and outcome. This quantity does not generally have a closed-form formula based on the logistic regression model (1) for a given distribution of the covariates, but can be approximated using a very large Monte Carlo sample.
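The Monte Carlo approximation mentioned above can be sketched as follows, under our own simplifying assumptions: a single covariate drawn from U(0,1) and hypothetical coefficients (an intercept of -4, a conditional log odds ratio of 1, and a covariate coefficient of 4). The risk is averaged over the covariate distribution once with everyone exposed and once with everyone unexposed, and the two averaged risks give the marginal log odds ratio.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def marginal_log_or(beta0, beta_exposure, beta_cov, n_draws=200_000, seed=1):
    """Monte Carlo approximation of the marginal log odds ratio.

    Averages P(Y=1 | exposure, X) over draws of one covariate X ~ U(0,1),
    once with exposure set to 1 and once with exposure set to 0.
    """
    rng = random.Random(seed)
    p1 = p0 = 0.0
    for _ in range(n_draws):
        x = rng.random()
        p1 += expit(beta0 + beta_exposure + beta_cov * x)
        p0 += expit(beta0 + beta_cov * x)
    p1 /= n_draws
    p0 /= n_draws
    return math.log(p1 / (1 - p1)) - math.log(p0 / (1 - p0))


# Hypothetical coefficients; the conditional log OR is 1.0 by construction.
marginal = marginal_log_or(beta0=-4.0, beta_exposure=1.0, beta_cov=4.0)
print(marginal)
```

With a non-zero covariate effect the marginal value falls between zero and the conditional log odds ratio, which is the non-collapsibility gap discussed in the text; with a covariate coefficient of zero the two coincide.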
For each simulation scenario below, we will compare the following estimates of the log odds ratio of exposure on outcome: crude, ignoring any covariate information; regression adjustment using PS; IPW using PS; and fitting the multivariate logistic regression model with the true confounders but without the unobserved variables (see below). For both regression adjustment and IPW, we consider four different ways of selecting confounders: 1) oracle, i.e. using the true confounders; 2) CIE, using at least 10% change as criterion in the estimated OR when adjusting for the potential confounder as compared to the crude OR; 3) significance testing, using a p-value less than 0.05 as criterion in assessing the univariate association of the potential confounder with the outcome in a logistic regression model; 4) including all potential confounders without any selection. The above gives a total of ten different estimators for each simulation scenario.
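The CIE rule described above (a change of at least 10% in the estimated OR relative to the crude OR) can be sketched as below. This is an illustration under our own assumptions: each candidate is a single numeric column, the logistic fits use plain Newton-Raphson, and the demonstration data are large, non-rare and generated from hypothetical coefficients so that the outcome is clear-cut. The function names are ours.

```python
import numpy as np


def fit_logistic(design, y, n_iter=25):
    """Plain Newton-Raphson logistic fit; `design` must include an intercept column."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        gradient = design.T @ (y - p)
        hessian = design.T @ (design * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


def exposure_or(y, exposure, covariate=None):
    """Odds ratio for exposure, optionally adjusted for one candidate covariate."""
    columns = [np.ones(len(exposure)), exposure.astype(float)]
    if covariate is not None:
        columns.append(covariate.astype(float))
    beta = fit_logistic(np.column_stack(columns), y.astype(float))
    return float(np.exp(beta[1]))


def select_by_cie(y, exposure, candidates, threshold=0.10):
    """Keep candidates whose adjustment changes the crude OR by at least `threshold`."""
    crude = exposure_or(y, exposure)
    selected = []
    for name, values in candidates.items():
        adjusted = exposure_or(y, exposure, values)
        if abs(adjusted - crude) / crude >= threshold:
            selected.append(name)
    return selected


# Hypothetical data: one strong binary confounder and one pure-noise variable.
rng = np.random.default_rng(0)
n = 20_000
confounder = rng.binomial(1, 0.5, n)
noise = rng.binomial(1, 0.5, n)
exposure = rng.binomial(1, 1 / (1 + np.exp(-(-1 + 2 * confounder))))
y = rng.binomial(1, 1 / (1 + np.exp(-(-2 + 0.5 * exposure + 2 * confounder))))
print(select_by_cie(y, exposure, {"confounder": confounder, "noise": noise}))
```

In this toy setting only the confounder passes the 10% rule. With a few dozen events, as in the scenarios studied here, the estimated change is much noisier, so the rule can both miss true confounders and pick up other variables.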
We consider two types of setups for simulation below: a general one and one based on OTIS data. For each setup, we consider two scenarios: with or without unobserved variables that contribute to the outcome. For the potential confounders, we consider continuous (uniform), binary, as well as categorical distributions. Notice that categorical variables are associated with multiple coefficients that should be grouped together in any variable selection process Yuan and Lin (2006); Meier et al. (2008). All variables are generated independently. Table 2 summarizes the setup of the four scenarios that are detailed in the following. Each scenario was repeated with 10,000 simulation runs. As scenarios I and II are designed to mimic the rare events structure in the real data, it is not surprising that they are similar to scenarios III and IV in Table 2. In general, the average number of total outcome events was between 30 and 35, and well over half of the simulation runs had five or fewer events in the unexposed group. Fewer than 1% of the runs had no events in that group.
2.1 A general setup
In scenario I, depicted in Figure 1, one U(0,1) variable and one Bernoulli(0.5) variable are the true confounders. We also generate four additional variables: two that are U(0,1), and two that are categorical with 3 levels and equal probabilities of 1/3. The true propensity model is given by
[TABLE]
and the true outcome model is
[TABLE]
where the last two terms in the models above show that level 1 of both categorical variables is used as reference. The coefficients in the models are chosen so that the desired number of events and proportion of exposed subjects are achieved, as summarized in Table 2. In addition, we include in the list of potential confounders 7 U(0,1) variables, 10 Bernoulli(0.5) variables, 2 categorical variables with 3 levels, 3 categorical variables with 4 levels, and 2 categorical variables with 5 levels, giving a total of 30 potential confounders for selection purposes. All the categorical variables have equal probabilities for each level.
From Table 2 we see that for scenario I there are on average 31.1 events among the 600 subjects. By design the event rate is lower in the unexposed group. About 0.8% of the 10,000 simulation runs generated 0 events in the unexposed group; these are the runs that give an estimated odds ratio of infinity. Also, about 66% of the 10,000 runs generated no more than 5 events in the unexposed group. The ratio of exposed versus unexposed numbers of subjects is about 2:1, as in the real data.
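The event-count summaries of the kind reported in Table 2 can be reproduced in spirit with a sketch like the one below. All coefficients here are hypothetical placeholders chosen by us to give roughly two exposed subjects per unexposed subject and a rare outcome; they are not the coefficients of the paper's models, and only two confounders are generated.

```python
import numpy as np


def simulate_event_counts(n=600, n_runs=2000, seed=0):
    """Summarize outcome event counts per run under a toy rare-events design.

    All coefficients below are hypothetical placeholders, not the paper's.
    """
    rng = np.random.default_rng(seed)
    total, unexposed = [], []
    for _ in range(n_runs):
        x1 = rng.uniform(0, 1, n)       # continuous confounder
        x2 = rng.binomial(1, 0.5, n)    # binary confounder
        ps = 1 / (1 + np.exp(-(0.2 + 0.5 * x1 + 0.5 * x2)))
        exposure = rng.binomial(1, ps)
        risk = 1 / (1 + np.exp(-(-4.0 + 1.0 * exposure + 0.5 * x1 + 0.5 * x2)))
        y = rng.binomial(1, risk)
        total.append(int(y.sum()))
        unexposed.append(int(y[exposure == 0].sum()))
    total, unexposed = np.array(total), np.array(unexposed)
    return {
        "mean_events": float(total.mean()),
        "sd_events": float(total.std(ddof=1)),
        "frac_none_unexposed": float(np.mean(unexposed == 0)),
        "frac_le5_unexposed": float(np.mean(unexposed <= 5)),
    }


print(simulate_event_counts())
```

The last two entries correspond to the shares of runs with no events and with at most five events in the unexposed group.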
Next we add unobserved variables for scenario II, as depicted in Figure 2. The setup is otherwise the same as in scenario I, with the same regression coefficients as in (2) and (3), but with the addition of unobserved variables distributed as N(0, 0.25), each with a coefficient of one. From Table 2 we see that the distribution of the number of events is similar to scenario I.
2.2 Setup based on OTIS data
In this setup we consider the 37 potential confounders from the OTIS study that are given in the Supplement Table 5 for the 439 subjects. Based on the final analysis of the original dataset, asthma and maternal height were selected using CIE as the confirmed confounders. In addition, referral source was found to be relatively strongly correlated with exposure (generalized R² of Cox and Snell, 1989, in Supplement Table 5). When the models based on Figure 3 are fitted to the original data, we have
[TABLE]
where the six levels of referral sources are: 0 - health-care professional, 1 - internet, 2 - other, 3 - patient support group, 4 - pharmaceutical company / sponsor, and 5 - TIS; and
[TABLE]
The propensity model above and the outcome model (5) are then used in scenario III as the true propensity and the true outcome model to generate simulated data. The number of events under this scenario is again summarized in Table 2.
Finally, for scenario IV we consider an unobserved variable that contributes to the outcome. This is motivated by the fact that the generalized R² for the above outcome model (5) is about 0.16, indicating that about 84% of the variation in the outcome remains unexplained. Fitting a logistic regression model with a normally distributed random intercept to account for the unobserved heterogeneity, we have
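For reference, a sketch of the likelihood-based generalized R² usually attributed to Cox and Snell (1989), which we assume is the measure meant here. The notation is ours: \(L(0)\) is the likelihood of the intercept-only model, \(L(\hat\beta)\) that of the fitted model, \(\ell\) the corresponding log-likelihoods, and \(n\) the sample size.

```latex
R^2 \;=\; 1 - \left\{ \frac{L(0)}{L(\hat\beta)} \right\}^{2/n}
    \;=\; 1 - \exp\!\left[ -\frac{2}{n}\left\{ \ell(\hat\beta) - \ell(0) \right\} \right].
```

Under this definition, a statement that about 84% of the variation remains unexplained corresponds to \(R^2 \approx 0.16\).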
[TABLE]
where the random intercept is distributed as N(0, 3.9). We note that the estimated variance of the random intercept in this case is quite large, and the estimated exposure effect has increased substantially from 1.03 (SE = 0.514) to 1.48 (SE = 2.569). Although the estimated exposure effect is no longer significantly different from zero, we still use model (6) with the fitted point estimates as the true outcome model to generate data as depicted in Figure 4. Despite the very different coefficients in the outcome models (5) and (6), Table 2 once again shows similar numbers of outcome events as for the previous scenarios, perhaps reflecting the fact that the same original data was used to create both outcome models.
3 Simulation Results
Table 3 summarizes the results of confounder selection, using both CIE and significance testing. The CIE selected roughly 2.2 to 2.4 potential confounders on average (i.e. true positives plus false positives). The univariate p-value selected a few more, between 3.7 and 4.1. In terms of accuracy, both methods had a reasonable chance, about 40% or more, of including all true confounders in their selection (column ‘Incl.’). However, they tended to choose non-confounding variables as well. Their chances of selecting exactly the set of true confounders were 30% or lower (column ‘Exact’). As expected, the univariate p-value rule had a lower rate of exact capture since it had a larger average number of false positives.
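The four summaries reported in Table 3 can be computed from the selected sets of each simulation run as in the sketch below. The variable names and the four toy runs are hypothetical.

```python
def selection_summary(selections, true_confounders):
    """Summarize selected variable sets across simulation runs.

    true_pos / false_pos: average number of true / non-true confounders selected.
    incl: share of runs whose selection contains every true confounder.
    exact: share of runs whose selection equals the set of true confounders.
    """
    truth = set(true_confounders)
    n_runs = len(selections)
    true_pos = sum(len(set(s) & truth) for s in selections) / n_runs
    false_pos = sum(len(set(s) - truth) for s in selections) / n_runs
    incl = sum(truth <= set(s) for s in selections) / n_runs
    exact = sum(set(s) == truth for s in selections) / n_runs
    return {"true_pos": true_pos, "false_pos": false_pos, "incl": incl, "exact": exact}


# Four hypothetical runs with true confounders x1 and x2.
runs = [["x1", "x2"], ["x1", "x2", "x5"], ["x1"], ["x2", "x7"]]
summary = selection_summary(runs, ["x1", "x2"])
print(summary)
```

For these four runs the averages are 1.5 true positives and 0.5 false positives, with all true confounders included in half of the runs and an exact match in a quarter of them.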
Table 4 lists the top five selected confounders for each scenario and each selection method, i.e. CIE or PVAL (for univariate p-value). Compared to the diagrams of each scenario, it is clear that in addition to the true confounders, CIE had a tendency to select the ‘instrumental variables’ (variables that affect the outcome only through their effects on exposure) in scenarios I and II, and referral source in scenarios III and IV. In contrast, significance testing tended to select those variables that contributed to the outcome, even though they were not associated with exposure.
Figure 5 shows the distribution of the ten different estimators described earlier for each scenario. Common to all four scenarios is that the IPW approach using all potential confounders without any selection had the largest spread among the ten, followed by IPW using CIE to select confounders. This is also confirmed in Figure 6 for the tail probabilities, i.e. one minus the empirical cumulative distribution function. Note that the tail probabilities flattened out at the frequency of simulation runs with no events in the unexposed group (Table 2), in which case all ten estimates were infinite.
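The flattening of the tail probabilities can be seen in a small sketch. The ten estimates below are made up for illustration; two of them are infinite, standing in for runs with no events in the unexposed group.

```python
import math


def tail_probability(estimates, threshold):
    """One minus the empirical CDF: share of estimates strictly above `threshold`."""
    return sum(est > threshold for est in estimates) / len(estimates)


# Hypothetical log odds ratio estimates from ten runs; two runs are infinite.
log_or_estimates = [0.4, 1.1, 0.9, math.inf, 2.3, 1.6, math.inf, 0.7, 1.2, 3.8]
for threshold in (1.0, 2.0, 5.0, 50.0):
    print(threshold, tail_probability(log_or_estimates, threshold))
```

Beyond the largest finite estimate the tail probability stays at 0.2, the share of infinite estimates, no matter how large the threshold becomes.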
As discussed earlier, even with the same estimated PS, the regression adjustment and the IPW approach estimate different quantities, one conditional and one marginal (vertical lines in Figure 5). It is known that when the conditional logistic model is true, which is the case by design of the simulations, the marginal effect ignoring covariates is typically biased towards zero Robinson and Jewell (1991). This discrepancy is particularly pronounced in scenario IV, and it is interesting to observe that when the unobserved variable is ignored in the PS, even the regression adjustment estimates seem to be centered closer to the marginal effect.
4 Discussion
A confounder is a covariate that affects the quantity of interest, such as a population mean of the outcome, and that differs between the exposed (i.e. treated) and unexposed populations, such that this difference between the populations has led to confounding of an association measure for the effect of interest Greenland et al. (1999). As we stated earlier, whether a potential confounder is a true confounder is unknown in practice. Causal knowledge should be a prerequisite for confounder assessment Hernan et al. (2002), and is used to create our list of potential confounders. Given this list, additional criteria are needed to assess the more mathematical aspects of confounding. In our medication and vaccine safety studies, CIE together with the assessment of correlation between a potential confounder and both the exposure and the outcome variables are routinely used to identify confounders. Indeed, CIE alone does not imply confounding, especially without causal knowledge. Another concern about CIE is the non-collapsibility discussed earlier. However, in our experience the change in estimates due to non-collapsibility tends to be under 10%, which is our criterion cutoff; this was also confirmed in our simulations when there were no unobserved confounders, so that the marginal and the conditional effects are sufficiently close Pirracchio et al. (2012).
Our simulation results clearly show that the IPW approach using all potential confounders without any selection has the greatest variability. This should help to explain the extremely large estimated OR of 6.45 observed in Table 1. In addition, and in particular in scenarios III and IV, we see that if IPW is used, then the univariate assessment of the correlation between a potential confounder and the outcome is preferred over the CIE, at least in the rare events situations considered in this paper. The simulation results also show that CIE has some tendency to select what are referred to as instrumental variables in the literature, i.e. variables that affect the outcome only through their effects on the exposure, which are known to be variables that should not be included in the propensity scores. Finally, regression adjustment appears to have small variances of the estimated odds ratio regardless of the selection methods used.
The weighted approach was initially proposed in Horvitz and Thompson (1952) and has continued to be studied in the survey research literature Gelman (2007). As Kang and Schafer (2007) pointed out, surveys are usually designed to ensure that IPW estimates are acceptably precise, but in the more general missing data problems it has been known since at least the 1980s that IPW methods can assign relatively large weights to certain observations leading to large variances of the effect estimates. In our case it was those 5 unexposed subjects who had a major birth defect outcome that received lower than usual (stabilized) weights. As we have illustrated here, even using the stabilized weights does not solve the large variance problem. For the IPW approach using PS, Rotnitzky et al. (2010) showed asymptotically that adjusting for a covariate is efficient if the covariate is independent of the exposure, while not adjusting is efficient if the covariate is independent of the outcome given the exposure level. Similar conclusions have also been reached via empirical investigations Brookhart et al. (2006).
A main concern for the regression adjustment approach is that the regression model for the outcome might be wrong; however, Vansteelandt and Daniel (2014) showed that the standard test of the null hypothesis of no exposure effect (using robust variance estimators), as well as particular standardized effects obtained from such adjusted regression models, are robust against misspecification of the outcome model as long as the PS model is correctly specified. We note that the correct PS model is required for all PS-based methods to be valid. For rare events like in our settings, Xu et al. (2014) recommended the regression adjustment approach.
For outcomes with typically rare events in the population, such as major birth defects, the numbers of events in the unexposed groups are likely very small, as seen in the drug safety study that motivated this paper. In the future we might consider approaches to increase the sample sizes of the unexposed groups. This, however, might be limited by the feasibility to recruit in a given time period pregnant women with a certain disease and without exposure to the medication under study Chambers et al. (2010). The inclusion of historical controls, on the other hand, might bring in additional confounding that needs to be controlled for.
Acknowledgement
We appreciate our discussion with the US Food and Drug Administration (FDA) statisticians regarding the drug safety study that motivated this work, and their encouragement for us to publish the research results.
APPENDIX
Write w_i as the weight for the i-th subject. Straightforward algebra shows that the weighted score equations which the IPW estimator solves can be written as
[TABLE]
Therefore
[TABLE]
Using stabilized weights
[TABLE]
where the conditional probability of exposure given the covariates is specified under the propensity score model, and the marginal probability of exposure is the empirically estimated proportion of exposed subjects. Assuming correct specification of the PS model, it can be seen that the IPW estimator converges to the following population averaged quantity:
[TABLE]
which is the logarithm of the marginal OR between exposure and outcome.
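The argument of this appendix can be sketched in the following notation, which is our own assumption rather than the authors': \(A_i\) is the exposure indicator, \(Y_i\) the binary outcome, \(X_i\) the covariates, \(e(X_i) = P(A_i = 1 \mid X_i)\) the propensity score, \(\hat p\) the observed proportion exposed, and the weighted model is \(\operatorname{logit} P(Y = 1 \mid A) = \alpha + \beta A\).

```latex
% Weighted score equations for (\alpha, \beta):
\sum_i w_i \{ Y_i - \operatorname{expit}(\alpha + \beta A_i) \} = 0,
\qquad
\sum_i w_i A_i \{ Y_i - \operatorname{expit}(\alpha + \beta A_i) \} = 0 .

% Solving them gives weighted event proportions in each exposure group:
\operatorname{expit}(\hat\alpha + \hat\beta) = \frac{\sum_i w_i A_i Y_i}{\sum_i w_i A_i},
\qquad
\operatorname{expit}(\hat\alpha) = \frac{\sum_i w_i (1 - A_i) Y_i}{\sum_i w_i (1 - A_i)} .

% Stabilized weights (the factors \hat p and 1 - \hat p cancel in the ratios above):
w_i = A_i \, \frac{\hat p}{e(X_i)} + (1 - A_i) \, \frac{1 - \hat p}{1 - e(X_i)} .

% With a correctly specified propensity score, by the law of large numbers,
\frac{\sum_i A_i Y_i / e(X_i)}{\sum_i A_i / e(X_i)}
  \;\to\; \mu_1 = E_X\{ P(Y = 1 \mid A = 1, X) \},
\qquad
\frac{\sum_i (1 - A_i) Y_i / \{1 - e(X_i)\}}{\sum_i (1 - A_i) / \{1 - e(X_i)\}}
  \;\to\; \mu_0 = E_X\{ P(Y = 1 \mid A = 0, X) \},

% so that
\hat\beta \;\to\; \log \frac{\mu_1 / (1 - \mu_1)}{\mu_0 / (1 - \mu_0)} .
```

The limit is the log odds ratio formed from the two covariate-averaged risks, that is, a marginal rather than a conditional quantity.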
References
- Braitman, L. E. and Rosenbaum, P. R. (2002). Rare outcomes, common treatments: Analytic strategies using propensity scores. Annals of Internal Medicine, 137, 693–695.
- Brookhart, M. A., Schneeweiss, S., Rothman, K. J., Glynn, R. J., Avorn, J., and Stürmer, T. (2006). Variable selection for propensity score models. American Journal of Epidemiology, 163(12), 1149–1156.
- Cepeda, M. S., Boston, R., Farrar, J. T., and Strom, B. L. (2003). Comparison of logistic regression versus propensity score when the number of events is low and there are multiple confounders. American Journal of Epidemiology, 158(3), 280–287.
- Chambers, C. D., Braddock, S. R., Briggs, G. G., Einarson, A., Johnson, Y. R., Miller, R. K., Polifka, J. E., Robinson, L. K., Stepanuk, and Jones, K. L. (2001). Postmarketing surveillance for human teratogenicity: a model approach. Teratology, 64, 252–261.
- Chambers, C. D., Johnson, D. L., Robinson, L. K., Braddock, S., Xu, R., Jimenez, J., Mirrasoul, N., Salas, E., Luo, Y., Jin, S., and Jones, K. L. (2010). Birth outcomes in pregnant women taking leflunomide. Arthritis and Rheumatism, 62, 1494–1503.
- Cox, D. R. and Snell, E. J. (1989). The Analysis of Binary Data (2nd ed.). Chapman and Hall.
- Gelman, A. (2007). Struggles with survey weighting and regression modeling (with discussion). Statistical Science, 22, 153–164.
- Greenland, S., Robins, J. M., and Pearl, J. (1999). Confounding and collapsibility in causal inference. Statistical Science, 14(1), 29–46.
