Semi-supervised Method for Risk Prediction with Doubly Censored EHR Data
Jie Zhou, Enhao Wang, Xuan Wang

TL;DR
This paper introduces a semi-supervised learning framework for risk prediction using EHR data with double censoring, combining limited gold-standard labels and surrogate outcomes to improve estimation accuracy.
Contribution
It develops a novel SSL method that handles double censoring in EHR data, providing theoretical validation and demonstrating improved efficiency over existing methods.
Findings
The proposed method improves estimation efficiency in simulations.
The method is applied to EHR data to study risk factors for type 2 diabetes (T2D).
The proposed estimator is shown to be theoretically valid.
Abstract
The rapid expansion of large-scale electronic health record (EHR) data offers unique opportunities to improve the accuracy and efficiency of clinical risk estimation. Yet, because clinical events may occur outside the recording health system, clinical event outcomes are frequently subject to double censoring (both left and right). In addition, gold-standard event times can often be ascertained only through labor-intensive manual chart review, which yields labels for only a small subset of patients. Relying on this small labeled set alone is statistically inefficient, whereas widely available surrogate outcomes, such as the time to first diagnostic code or first disease mention, are error-prone and can yield biased estimates if used directly. Semi-supervised learning (SSL) methods provide a principled way to integrate labeled and unlabeled data, and prior work has demonstrated their advantages in…
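To make the double-censoring setup concrete, the following is a minimal illustrative sketch and not the paper's method. It assumes each patient has a hypothetical observation window during which the health system records events: an event before the window is left-censored, an event after it is right-censored, and only events inside the window are observed exactly. The function names, the exponential event-time distribution, and the window lengths are all assumptions chosen for illustration.

```python
import random


def doubly_censor(event_time, window_start, window_end):
    """Return (observed_time, status) for one patient.

    Hypothetical helper: the true event time is seen exactly only if it
    falls inside the observation window [window_start, window_end].
    """
    if event_time < window_start:
        # Event happened before records began: we only know T <= window_start.
        return window_start, "left"
    if event_time > window_end:
        # Event had not happened by the end of follow-up: we only know T > window_end.
        return window_end, "right"
    return event_time, "exact"


def simulate(n, seed=0):
    """Simulate n doubly censored records under assumed toy distributions."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        event_time = rng.expovariate(1 / 10.0)  # assumed mean of 10 time units
        window_start = rng.uniform(0, 5)        # assumed entry into the system
        window_end = window_start + rng.uniform(5, 15)  # assumed follow-up length
        records.append(doubly_censor(event_time, window_start, window_end))
    return records


if __name__ == "__main__":
    for observed_time, status in simulate(5):
        print(f"{observed_time:.2f}\t{status}")
```

In this toy setup, a gold-standard label would correspond to knowing the status and time from chart review, while a surrogate such as the time to first diagnostic code would be a noisy proxy for the true event time.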
