Propensity score estimation using classification and regression trees in the presence of missing covariate data
Bas B.L. Penning de Vries, Maarten van Smeden, Rolf H.H. Groenwold

TL;DR
This study evaluates classification and regression trees (CART) for propensity score estimation when covariate data are missing, and finds that multiple imputation combined with CART outperforms applying CART directly to the incomplete data.
Contribution
The paper demonstrates that automatic missing data handling by CART can introduce bias, and shows that multiple imputation improves propensity score estimation accuracy.
Findings
Applying CART directly to incomplete data can introduce bias.
Multiple imputation combined with CART yields better estimates than applying CART directly to the incomplete data.
On theoretical grounds, the automatic handling of missing data by CART may not be appropriate.
Abstract
Data mining and machine learning techniques such as classification and regression trees (CART) represent a promising alternative to conventional logistic regression for propensity score estimation. Whereas incomplete data preclude the fitting of a logistic regression on all subjects, CART is appealing in part because some implementations allow for incomplete records to be incorporated in the tree fitting and provide propensity score estimates for all subjects. Based on theoretical considerations, we argue that the automatic handling of missing data by CART may, however, not be appropriate. Using a series of simulation experiments, we examined the performance of different approaches to handling missing covariate data: (i) applying the CART algorithm directly to the (partially) incomplete data, (ii) complete case analysis, and (iii) multiple imputation. Performance was assessed in terms of…
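The three strategies compared in the abstract can be illustrated with a toy sketch. It is not the authors' simulation code and not a real CART implementation. It assumes a single binary covariate, so a one-split tree reduces to the share of treated subjects in each leaf, and it routes missing values to their own leaf, which is only one of several ways tree software may handle missing data. The imputation step is deliberately simplified: it draws the covariate from its observed distribution given treatment only, ignoring the outcome and parameter uncertainty. All function names and the data are hypothetical.

```python
import random


def leaf_propensity(rows):
    """One-split 'tree' on covariate x: the propensity estimate in each
    leaf is the share of treated records (a == 1) that fall into it."""
    counts = {}
    for x, a in rows:
        n, treated = counts.get(x, (0, 0))
        counts[x] = (n + 1, treated + a)
    return {x: treated / n for x, (n, treated) in counts.items()}


def direct_with_missing_leaf(rows):
    # (i) Keep incomplete records; a missing x (None) forms its own leaf.
    # Assumption: this stands in for "automatic" handling by tree software.
    fit = leaf_propensity(rows)
    return [fit[x] for x, _ in rows]


def complete_case(rows):
    # (ii) Fit on complete records only; incomplete records get no score.
    fit = leaf_propensity([(x, a) for x, a in rows if x is not None])
    return [fit[x] if x is not None else None for x, _ in rows]


def multiple_imputation(rows, m=20, seed=0):
    # (iii) Impute missing x m times, fit once per imputed data set, and
    # average each subject's score across imputations (one pooling option).
    rng = random.Random(seed)
    p_x1_given_a = {}
    for a_val in (0, 1):
        xs = [x for x, a in rows if a == a_val and x is not None]
        p_x1_given_a[a_val] = sum(xs) / len(xs)
    totals = [0.0] * len(rows)
    for _ in range(m):
        imputed = [
            (x if x is not None else int(rng.random() < p_x1_given_a[a]), a)
            for x, a in rows
        ]
        fit = leaf_propensity(imputed)
        for i, (x, _) in enumerate(imputed):
            totals[i] += fit[x]
    return [t / m for t in totals]


# Hypothetical data: (covariate x, treatment a); None marks a missing x.
rows = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (None, 1), (None, 0)]

ps_direct = direct_with_missing_leaf(rows)
ps_cc = complete_case(rows)
ps_mi = multiple_imputation(rows)
```

In this toy data the direct approach gives every incomplete record the treated share among incomplete records, complete case analysis gives those records no score at all, and the imputation approach gives each of them a score averaged over the imputed data sets.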
Taxonomy
Methods: Logistic Regression
