Prediction-based Inference in Electronic Health Record (EHR)-linked Biobanks with Clinically Informative Outcomes

Xingran Chen; Cheng-Han Yang; Zhenke Wu; Bhramar Mukherjee

arXiv:2603.14356·stat.AP·April 14, 2026

TL;DR

This paper evaluates prediction-based (PB) inference methods for GWAS in EHR-linked biobanks, focusing on how they handle missing biomarker data and how they perform under different missingness mechanisms.

Contribution

The paper provides a comprehensive evaluation of nine methods (four PB and five traditional approaches) under various outcome observation processes.

Findings

01. PB methods improve power when correctly specified.

02. Misspecification affects the efficiency gains of PB methods.

03. GWAS in All of Us (AoU) data shows that PB methods replicate known associations efficiently.

Abstract

Electronic health record (EHR)-linked biobank data hold tremendous promise for large-scale discoveries via genome-wide association study (GWAS) on diverse phenotypic traits and biomarkers routinely captured in the EHR. However, heterogeneous missingness in biomarkers compromises the validity and efficiency of statistical analyses. Prediction-based (PB) inference methods meet this challenge by using external machine learning (ML) predictions to impute missing biomarker outcomes, thereby improving statistical power and estimation accuracy in association analyses. Yet, their suitability remains unclear when outcomes are subject to clinically informative observation processes, that is, when laboratory tests are ordered based on both measured and unmeasured patient- and health system-level characteristics. In this paper, we review the statistical underpinnings of popular PB methods and then…
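
To make the idea of PB inference in the abstract concrete, here is a minimal Python sketch of one well-known PB-style estimator for a linear association: fit the regression on ML-predicted outcomes where the biomarker is unobserved, then correct it with the difference between the fits on true and predicted outcomes where the biomarker is observed. This is an illustration only and is not claimed to be one of the nine methods the paper evaluates. The function names (`ols`, `pb_coef`), the simulated data, and the stand-in "ML prediction" are all hypothetical. The correction is justified only if outcomes are observed completely at random, which is the assumption that clinically informative observation processes violate.

```python
import numpy as np


def ols(g, y):
    """Least-squares fit of y on an intercept and a single covariate g."""
    design = np.column_stack([np.ones(len(g)), g])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # [intercept, slope]


def pb_coef(g_obs, y_obs, pred_obs, g_unobs, pred_unobs):
    """Prediction-based coefficient estimate (illustrative sketch).

    g_obs, y_obs, pred_obs : genotype, measured biomarker, and ML prediction
                             for participants whose biomarker was observed.
    g_unobs, pred_unobs    : genotype and ML prediction for participants
                             whose biomarker is missing.
    Assumes observation is completely at random (hypothetical setting).
    """
    # Fit on predictions alone: precise but biased if predictions are off.
    naive = ols(g_unobs, pred_unobs)
    # Bias correction estimated where both truth and prediction are available.
    correction = ols(g_obs, y_obs) - ols(g_obs, pred_obs)
    return naive + correction


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_obs, n_unobs = 500, 5000
    g = rng.binomial(2, 0.3, size=n_obs + n_unobs).astype(float)
    y = 0.2 * g + rng.normal(size=g.size)               # true slope is 0.2
    pred = 0.5 * y + rng.normal(scale=0.5, size=g.size)  # imperfect stand-in prediction
    obs = np.zeros(g.size, dtype=bool)
    obs[:n_obs] = True                                   # observed completely at random
    print("PB estimate [intercept, slope]:",
          pb_coef(g[obs], y[obs], pred[obs], g[~obs], pred[~obs]))
    print("Predictions only:", ols(g[~obs], pred[~obs]))
```

In this simulation the predictions-only fit is attenuated because the stand-in prediction shrinks the outcome, and the correction term is what removes that bias. If the observed subset were selected based on patient characteristics related to the biomarker, the correction term would no longer be guaranteed to estimate the bias in the unobserved group.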
