Semi-supervised Method for Risk Prediction with Doubly Censored EHR Data

Jie Zhou; Enhao Wang; Xuan Wang

arXiv:2605.08046·stat.ME·May 11, 2026

TL;DR

This paper introduces a semi-supervised learning framework for risk prediction using EHR data with double censoring, combining limited gold-standard labels and surrogate outcomes to improve estimation accuracy.

Contribution

The paper develops a semi-supervised learning (SSL) method that handles double censoring in EHR data, establishes its theoretical validity, and demonstrates improved efficiency over existing methods.

Findings

01

The method improves estimation efficiency in simulations.

02

The method is applied to EHR data to study risk factors for type 2 diabetes (T2D).

03

The proposed estimator is shown to be theoretically valid.

Abstract

The rapid expansion of large-scale electronic health record (EHR) data offers unique opportunities to improve the accuracy and efficiency of clinical risk estimation. Yet, because clinical events may occur outside the recording health system, clinical event outcomes are frequently subject to double censoring (both left and right). Moreover, gold-standard event times can often be ascertained only through labor-intensive manual chart review, yielding labels for only a small subset of patients. Relying on this small labeled set alone is inefficient, whereas widely available surrogate outcomes, such as the time to first diagnostic code or first disease mention, are error-prone and can yield biased estimates if used directly. Semi-supervised learning (SSL) methods provide a principled way to integrate labeled and unlabeled data, and prior work has demonstrated their advantages in…
