Prediction-based Inference in Electronic Health Record (EHR)-linked Biobanks with Clinically Informative Outcomes
Xingran Chen, Cheng-Han Yang, Zhenke Wu, Bhramar Mukherjee

TL;DR
This paper evaluates prediction-based (PB) inference methods for GWAS in EHR-linked biobanks, focusing on how they handle missing biomarker data and how they perform under different missingness mechanisms.
Contribution
It provides a comprehensive evaluation of nine methods (four PB and five traditional approaches) under various outcome observation processes.
Findings
PB methods improve power when correctly specified
Misspecification affects the efficiency gains of PB methods
GWAS in All of Us (AoU) data shows PB methods replicate known associations efficiently
Abstract
Electronic health record (EHR)-linked biobank data hold tremendous promise for large-scale discoveries via genome-wide association study (GWAS) on diverse phenotypic traits and biomarkers routinely captured in the EHR. However, heterogeneous missingness in biomarkers compromises the validity and efficiency of statistical analyses. Prediction-based (PB) inference methods meet this challenge by using external machine learning (ML) predictions to impute missing biomarker outcomes, thereby improving statistical power and estimation accuracy in association analyses. Yet, their suitability remains unclear when outcomes are subject to clinically informative observation processes, that is, when laboratory tests are ordered based on both measured and unmeasured patient- and health system-level characteristics. In this paper, we review the statistical underpinnings of popular PB methods and then…
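The core idea in the abstract, filling in unmeasured biomarker outcomes with ML predictions, can be illustrated with a minimal sketch for a single-variant linear regression. The sketch uses a generic prediction-powered-style estimator (a fit on predictions plus a correction estimated from the measured subset); it is not necessarily any of the nine methods the paper evaluates. All names (`ols`, `pb_ols`, `simulate`), the simulated data, and the effect sizes are hypothetical, and the outcome is assumed to be missing completely at random, which is exactly the assumption that a clinically informative observation process would violate.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for y ~ X (X already holds an intercept column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def pb_ols(X, y, y_pred, observed):
    """Prediction-based OLS sketch (assumes outcomes are missing completely at random).

    Step 1: regress the ML predictions on X where the outcome is unmeasured.
    Step 2: regress the prediction error (y - y_pred) on X where it is measured.
    The sum removes the bias that the predictions alone would introduce.
    """
    from_predictions = ols(X[~observed], y_pred[~observed])
    correction = ols(X[observed], y[observed] - y_pred[observed])
    return from_predictions + correction


def simulate(n=20000, beta=0.3, obs_rate=0.2, seed=0):
    """Hypothetical data: genotype g, biomarker y, and an imperfect prediction of y."""
    rng = np.random.default_rng(seed)
    g = rng.binomial(2, 0.3, size=n).astype(float)   # allele count 0/1/2
    z = rng.normal(size=n)                           # non-genetic driver of y
    X = np.column_stack([np.ones(n), g])
    y = 1.0 + beta * g + z + rng.normal(size=n)
    # The prediction captures only half of the genetic effect, so it is biased.
    y_pred = 0.5 + 0.5 * beta * g + 0.8 * z
    observed = rng.random(n) < obs_rate              # MCAR observation indicator
    return X, y, y_pred, observed


if __name__ == "__main__":
    X, y, y_pred, observed = simulate()
    print("predictions only :", ols(X, y_pred)[1])
    print("measured only    :", ols(X[observed], y[observed])[1])
    print("prediction-based :", pb_ols(X, y, y_pred, observed)[1])
```

In this toy setup the predictions-only fit understates the genetic effect, while the corrected estimate recovers it. If the probability of measurement depended on unmeasured patient characteristics, the correction term would be estimated on a non-representative subset and this guarantee would no longer hold.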
