Imputation of Unknown Missingness in Sparse Electronic Health Records

Jun Han; Josue Nassar; Sanjit Singh Batra; Aldo Cordova-Palomera; Vijay Nori; Robert E. Tillman

arXiv:2602.20442 · cs.LG · February 25, 2026

TL;DR

This paper introduces a transformer-based denoising neural network to recover unknown missing data in sparse EHRs, improving accuracy and downstream task performance over existing methods.

Contribution

The paper presents a novel transformer-based algorithm specifically designed to address unknown unknowns in EHR data, enhancing imputation accuracy for medical codes.

Findings

1. Improved accuracy in denoising medical codes in real EHR data.
2. Significant performance gains in hospital readmission prediction.
3. Outperforms existing imputation techniques in handling unknown missingness.

Abstract

Machine learning holds great promise for advancing the field of medicine, with electronic health records (EHRs) serving as a primary data source. However, EHRs are often sparse and contain missing data due to various challenges and limitations in data collection and sharing between healthcare providers. Existing techniques for imputing missing values predominantly focus on known unknowns, such as missing or unavailable values of lab test results; most do not explicitly address situations where it is difficult to distinguish what is missing. For instance, a missing diagnosis code in an EHR could signify either that the patient has not been diagnosed with the condition or that a diagnosis was made but not shared by a provider. Such situations fall into the paradigm of unknown unknowns. To address this challenge, we develop a general-purpose algorithm for denoising data to recover unknown…
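The distinction the abstract draws between known unknowns and unknown unknowns can be made concrete with a toy example. The sketch below is illustrative only and is not the paper's algorithm: the patient matrix, the lab values, the `DROP_RATE` constant, and the `hide_codes` helper are all hypothetical. It shows that a missing lab result can be flagged explicitly, while a diagnosis that was made but never shared is recorded as 0 and becomes indistinguishable from a true absence.

```python
import math
import random

random.seed(0)  # fixed seed so the toy example is repeatable

# Hypothetical ground truth: rows are patients, columns are diagnosis codes.
# 1 = the patient truly has the diagnosis, 0 = the patient truly does not.
true_codes = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
]

# Known unknown: an unmeasured lab value is marked explicitly (NaN),
# so we can tell exactly which entries are missing.
lab_values = [5.4, float("nan"), 6.1]
known_missing = [math.isnan(v) for v in lab_values]

# Assumed probability that a true diagnosis is never shared by a provider.
DROP_RATE = 0.5


def hide_codes(codes, drop_rate):
    """Simulate unknown missingness: silently turn some true 1s into 0s."""
    observed, dropped = [], []
    for row in codes:
        obs_row, drop_row = [], []
        for value in row:
            is_dropped = value == 1 and random.random() < drop_rate
            obs_row.append(0 if is_dropped else value)
            drop_row.append(is_dropped)
        observed.append(obs_row)
        dropped.append(drop_row)
    return observed, dropped


observed_codes, dropped = hide_codes(true_codes, DROP_RATE)

# Unknown unknown: the observed zeros mix true absences with hidden positives,
# and nothing in observed_codes marks which is which.
zeros_observed = sum(v == 0 for row in observed_codes for v in row)
hidden_positives = sum(flag for row in dropped for flag in row)

print("lab entries flagged as missing:", known_missing)
print("zeros in observed codes:", zeros_observed)
print("of which are hidden diagnoses:", hidden_positives)
```

In this framing, recovering unknown missingness amounts to estimating which observed zeros are hidden positives using `observed_codes` alone. The `dropped` mask exists here only because the data is simulated; it would not be available in a real EHR.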

Taxonomy

Topics: Machine Learning in Healthcare · Statistical Methods and Inference · ECG Monitoring and Analysis