High-dimensional multiple imputation (HDMI) for partially observed confounders including natural language processing-derived auxiliary covariates

Janick Weberpals; Pamela A. Shaw; Kueiyu Joshua Lin; Richard Wyss; Joseph M Plasek; Li Zhou; Kerry Ngan; Thomas DeRamus; Sudha R. Raman; Bradley G. Hammill; Hana Lee; Sengwee Toh; John G. Connolly; Kimberly J. Dandreo; Fang Tian; Wei Liu; Jie Li; José J. Hernández-Muñoz; Sebastian Schneeweiss; Rishi J. Desai

arXiv:2405.10925·stat.ME·May 20, 2024

Open Access

TL;DR

This study develops high-dimensional multiple imputation methods incorporating NLP-derived auxiliary covariates to improve bias reduction in studies with partially observed confounders, especially when missingness depends on unobserved factors.

Contribution

It introduces and compares HDMI approaches using structured and NLP-derived covariates, demonstrating their effectiveness in reducing bias in high-dimensional, partially observed confounder settings.

Findings

1. Claims data HDMI showed lowest bias (0.072)
2. Combining claims and sentence embeddings improved efficiency (RMSE 0.173)
3. NLP-derived covariates alone did not outperform baseline MI

Abstract

Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from opioid vs. non-steroidal anti-inflammatory drug (NSAID) initiators (X) with observed serum creatinine labs (Z2) and time-to-acute kidney injury as outcome. We simulated 100 cohorts with a null treatment effect, including X, Z2, atrial fibrillation (U), and 13 other investigator-derived confounders (Z1) in the outcome generation. We then imposed missingness (MZ2) on 50% of Z2 measurements as a function of Z2 and U and created different HDMI candidate AC using structured and NLP-derived features.…
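The missingness step described in the abstract (removing about 50% of Z2 values as a function of Z2 and U) can be illustrated with a minimal sketch. This is not the authors' plasmode code: the logistic form, the coefficients `beta_z2` and `beta_u`, the simulated distributions, and the function name `impose_missingness` are all assumptions chosen for illustration. The intercept is found by bisection so that the average missingness probability matches the target proportion.

```python
import numpy as np

def impose_missingness(z2, u, beta_z2=0.5, beta_u=1.0, target=0.5, seed=0):
    """Set roughly `target` share of z2 to NaN, with probability depending on z2 and u.

    Assumed (hypothetical) model: logit P(missing) = a + beta_z2 * z2_std + beta_u * u.
    """
    rng = np.random.default_rng(seed)
    z2_std = (z2 - z2.mean()) / z2.std()
    lin = beta_z2 * z2_std + beta_u * u

    # Bisection on the intercept `a` so that the mean missingness
    # probability equals the target proportion.
    lo, hi = -20.0, 20.0
    for _ in range(60):
        a = (lo + hi) / 2.0
        p = 1.0 / (1.0 + np.exp(-(a + lin)))
        if p.mean() < target:
            lo = a
        else:
            hi = a

    p = 1.0 / (1.0 + np.exp(-(a + lin)))
    missing = rng.random(z2.shape[0]) < p      # Bernoulli draw per patient
    z2_obs = np.where(missing, np.nan, z2)     # partially observed confounder
    return z2_obs, missing

# Toy data standing in for the cohort (all values are made up).
rng = np.random.default_rng(42)
n = 10_000
u = rng.binomial(1, 0.2, size=n)               # binary factor, e.g. atrial fibrillation
z2 = rng.normal(1.0 + 0.3 * u, 0.3, size=n)    # lab value, e.g. serum creatinine
z2_obs, missing = impose_missingness(z2, u)
```

Because the missingness probability here depends on the value of Z2 itself and on U, a standard imputation model that ignores both would not be expected to recover the missing values without bias, which is the setting the auxiliary covariates are meant to help with.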

Taxonomy

Topics: Natural Language Processing Techniques