High-dimensional multiple imputation (HDMI) for partially observed confounders including natural language processing-derived auxiliary covariates
Janick Weberpals, Pamela A. Shaw, Kueiyu Joshua Lin, Richard Wyss, Joseph M. Plasek, Li Zhou, Kerry Ngan, Thomas DeRamus, Sudha R. Raman, Bradley G. Hammill, Hana Lee, Sengwee Toh, John G. Connolly, Kimberly J. Dandreo, Fang Tian, Wei Liu, Jie Li, José J. Hernández-Muñoz

TL;DR
This study develops and compares high-dimensional multiple imputation methods that incorporate NLP-derived auxiliary covariates, with the aim of reducing bias in studies with partially observed confounders, especially when missingness depends on unobserved factors.
Contribution
It introduces and compares HDMI approaches using structured and NLP-derived covariates, demonstrating their effectiveness in reducing bias in high-dimensional, partially observed confounder settings.
Findings
HDMI using claims data showed the lowest bias (0.072)
Combining claims and sentence embeddings improved efficiency (RMSE 0.173)
NLP-derived covariates alone did not outperform baseline MI
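The bias and RMSE figures above summarize treatment effect estimates across simulated cohorts. A minimal sketch of how such metrics are commonly computed follows. It assumes estimates are on the log hazard ratio scale, so that the null treatment effect corresponds to a true value of 0; the scale used in the paper is not stated in this summary. The function name `bias_and_rmse` and the example numbers are hypothetical and are not the study's results.

```python
import math

def bias_and_rmse(estimates, truth=0.0):
    """Return (bias, RMSE) of point estimates relative to a known true value.

    Assumption: `estimates` holds one treatment effect estimate per simulated
    cohort, and `truth` is the simulated true effect (0.0 for a null effect
    on the log hazard ratio scale).
    """
    n = len(estimates)
    errors = [e - truth for e in estimates]
    bias = sum(errors) / n                                   # mean signed error
    rmse = math.sqrt(sum(err * err for err in errors) / n)   # root mean squared error
    return bias, rmse

# Hypothetical estimates from four simulated cohorts (illustration only).
example_estimates = [0.10, -0.05, 0.20, 0.00]
bias, rmse = bias_and_rmse(example_estimates)
```

Bias captures systematic deviation from the truth, while RMSE also reflects the variability of the estimates, which is why RMSE is often read as a measure of efficiency when bias is similar across methods.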
Abstract
Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from opioid vs. non-steroidal anti-inflammatory drug (NSAID) initiators (X) with observed serum creatinine labs (Z2) and time-to-acute kidney injury as outcome. We simulated 100 cohorts with a null treatment effect, including X, Z2, atrial fibrillation (U), and 13 other investigator-derived confounders (Z1) in the outcome generation. We then imposed missingness (MZ2) on 50% of Z2 measurements as a function of Z2 and U and created different HDMI candidate AC using structured and NLP-derived features.…
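The missingness step described in the abstract, where about 50% of Z2 values are set to missing as a function of Z2 and U, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes a logistic missingness model, a standardized Z2, a binary U, and hypothetical coefficients `beta_z2` and `beta_u`. The intercept is calibrated by bisection so that the average missingness probability matches the target proportion.

```python
import math
import random

def expit(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def calibrate_intercept(z2, u, beta_z2, beta_u, target=0.5, tol=1e-8):
    """Bisection search for the intercept that yields the target mean
    missingness probability (the mean is monotone in the intercept)."""
    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        mean_p = sum(expit(mid + beta_z2 * z + beta_u * a)
                     for z, a in zip(z2, u)) / len(z2)
        if mean_p < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def impose_missingness(z2, u, beta_z2=0.5, beta_u=1.0, target=0.5, seed=0):
    """Set Z2 to None with probability depending on Z2 itself and on U.

    beta_z2 and beta_u are hypothetical values chosen for illustration.
    Returns the partially observed Z2, the missingness indicator M_Z2,
    and the per-subject missingness probabilities.
    """
    rng = random.Random(seed)
    b0 = calibrate_intercept(z2, u, beta_z2, beta_u, target)
    probs = [expit(b0 + beta_z2 * z + beta_u * a) for z, a in zip(z2, u)]
    m = [1 if rng.random() < p else 0 for p in probs]
    z2_obs = [None if mi else z for z, mi in zip(z2, m)]
    return z2_obs, m, probs

# Toy data (hypothetical): standardized lab value Z2 and binary U.
data_rng = random.Random(42)
n = 5000
u = [1 if data_rng.random() < 0.1 else 0 for _ in range(n)]
z2 = [data_rng.gauss(0.0, 1.0) for _ in range(n)]
z2_obs, m, probs = impose_missingness(z2, u)
```

Because the missingness probability depends on the value of Z2 itself and on U, the mechanism in this sketch is not missing at random given the remaining covariates, which is the setting where auxiliary covariates that carry information about Z2 or U could plausibly help an imputation model.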
Taxonomy
Topics: Natural Language Processing Techniques
