Risk Prediction with Imperfect Survival Outcome Information from Electronic Health Records
Stephanie F. Chan, Jue Hou, Xuan Wang, and Tianxi Cai

TL;DR
This paper introduces a semi-supervised risk prediction method that leverages limited labeled data and abundant imperfect proxy data from electronic health records to accurately predict disease onset times.
Contribution
The paper develops a semisupervised approach that combines proxy data with limited labeled data under a flexible measurement error model, with consistency and asymptotic properties established.
Findings
Performs well in finite sample simulations
Effective in predicting obesity onset from EHR data
Provides a resampling-based interval estimation method
Abstract
Readily available proxies for the time of disease onset, such as the time of the first diagnostic code, can lead to substantial risk prediction error when analyses are based on poor proxies. Due to the lack of detailed documentation and the labor intensiveness of manual annotation, it is often only feasible to ascertain, for a small subset, the current status of the disease by a follow-up time rather than the exact onset time. In this paper, we aim to develop risk prediction models for the onset time that efficiently leverage both a small number of labels on current status and a large number of unlabeled observations on imperfect proxies. Under a semiparametric transformation model for onset and a highly flexible measurement error model for the proxy onset time, we propose a semisupervised risk prediction method that combines information from proxies and limited labels efficiently. From an initial estimator…
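To make the data setting in the abstract concrete, the following is a minimal simulation sketch, not the authors' method or code. It assumes a proportional hazards model for onset (one member of the semiparametric transformation family) and a simple log-normal multiplicative error for the proxy; the paper's measurement error model is described as far more flexible. The function name, field names, and all parameter values are hypothetical.

```python
import math
import random


def simulate_ehr_cohort(n_total=1000, n_labeled=50, beta=0.5, seed=0):
    """Toy cohort: proxy onset times for everyone, current-status labels
    for a small random subset. All modeling choices are illustrative."""
    rng = random.Random(seed)
    labeled_ids = set(rng.sample(range(n_total), n_labeled))
    cohort = []
    for i in range(n_total):
        z = rng.gauss(0.0, 1.0)  # baseline risk factor
        # Assumed onset model: exponential with rate exp(beta * z),
        # i.e. proportional hazards with a constant baseline hazard.
        onset = rng.expovariate(math.exp(beta * z))
        follow_up = rng.uniform(0.5, 3.0)  # end of observation
        # Assumed proxy: true onset distorted by log-normal noise,
        # mimicking an early or delayed first diagnostic code.
        proxy = onset * math.exp(rng.gauss(0.2, 0.5))
        record = {
            "z": z,
            "follow_up": follow_up,
            "proxy_time": min(proxy, follow_up),  # censored at follow-up
            "proxy_event": int(proxy <= follow_up),
            # Current-status label: only whether onset occurred by
            # follow-up, and only for the annotated subset.
            "status": int(onset <= follow_up) if i in labeled_ids else None,
        }
        cohort.append(record)  # the exact onset time is never stored
    return cohort


if __name__ == "__main__":
    cohort = simulate_ehr_cohort()
    labeled = [r for r in cohort if r["status"] is not None]
    print(len(cohort), "patients,", len(labeled), "with current-status labels")
```

The sketch never records the exact onset time: labeled patients contribute only a binary status at follow-up, while every patient contributes a possibly biased proxy time. That is the information structure the semisupervised estimator is described as combining.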
Taxonomy
Topics: Machine Learning in Healthcare · Genetic Associations and Epidemiology · Artificial Intelligence in Healthcare
