Semi-supervised Clustering Through Representation Learning of Large-scale EHR Data
Linshanshan Wang, Mengyan Li, Zongqi Xia, Molei Liu, Tianxi Cai

TL;DR
This paper introduces SCORE, a semi-supervised learning framework for EHR data that improves patient phenotyping and disease prediction by leveraging large-scale, high-dimensional, and partially labeled health records.
Contribution
SCORE combines a novel Poisson-Adapted Latent factor Mixture (PALM) model with a hybrid Expectation-Maximization (EM) and Gaussian Variational Approximation (GVA) algorithm, providing theoretical guarantees and improved accuracy over existing methods.
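To make the idea of a Poisson latent factor mixture concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's PALM specification or its fitting algorithm. It assumes that each patient has a latent phenotype label and a Gaussian embedding centered on that phenotype, and that code counts follow a log-linear Poisson model driven by fixed code embeddings. The function names (`simulate_palm_like`, `responsibilities`), the intercept, and all scales are hypothetical. The second function shows one simple way partial labels could enter an EM-style E-step, by clamping the responsibilities of labeled patients.

```python
import numpy as np

def simulate_palm_like(n=200, p=50, d=5, k=2, seed=0):
    """Simulate counts from a toy Poisson latent factor mixture (assumed form)."""
    rng = np.random.default_rng(seed)
    code_emb = rng.normal(scale=0.3, size=(p, d))   # stand-in for pre-trained code embeddings
    centers = rng.normal(scale=1.0, size=(k, d))    # phenotype means in embedding space
    z = rng.integers(0, k, size=n)                  # latent phenotype label per patient
    u = centers[z] + 0.5 * rng.normal(size=(n, d))  # patient embeddings
    log_rate = -1.0 + u @ code_emb.T                # log-linear intensity, hypothetical intercept
    counts = rng.poisson(np.exp(log_rate))          # n x p matrix of code counts
    return counts, u, z, centers

def responsibilities(u, centers, sigma=0.5, labels=None):
    """Posterior cluster weights given embeddings; known labels (>= 0) are clamped."""
    d2 = ((u[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    logp = -0.5 * d2 / sigma**2
    logp -= logp.max(axis=1, keepdims=True)         # stabilize before exponentiating
    r = np.exp(logp)
    r /= r.sum(axis=1, keepdims=True)
    if labels is not None:
        known = labels >= 0
        r[known] = np.eye(centers.shape[0])[labels[known]]  # one-hot for labeled patients
    return r

counts, u, z, centers = simulate_palm_like()
labels = np.full(z.shape[0], -1)   # -1 marks an unlabeled patient
labels[:20] = z[:20]               # pretend only the first 20 patients are labeled
r = responsibilities(u, centers, labels=labels)
```

In this toy setup the embeddings are observed directly. In a model like the one described above they would be latent and would have to be inferred from the counts, which is where an approximation such as GVA would come in.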
Findings
SCORE outperforms existing methods in simulations.
SCORE improves accuracy when labeled data are limited.
SCORE produces more informative patient embeddings for multiple sclerosis (MS) prediction.
Abstract
Electronic Health Records (EHR) offer rich real-world data for personalized medicine, providing insights into disease progression, treatment responses, and patient outcomes. However, their sparsity, heterogeneity, and high dimensionality make them difficult to model, while the lack of standardized ground truth further complicates predictive modeling. To address these challenges, we propose SCORE, a semi-supervised representation learning framework that captures multi-domain disease profiles through patient embeddings. SCORE employs a Poisson-Adapted Latent factor Mixture (PALM) Model with pre-trained code embeddings to characterize codified features and extract meaningful patient phenotypes and embeddings. To handle the computational challenges of large-scale data, it introduces a hybrid Expectation-Maximization (EM) and Gaussian Variational Approximation (GVA) algorithm, leveraging…
Taxonomy
Topics: Traditional Chinese Medicine Studies
