Unsupervised Machine Learning for the Discovery of Latent Disease Clusters and Patient Subgroups Using Electronic Health Records
Yanshan Wang, Yiqing Zhao, Terry M. Therneau, Elizabeth J. Atkinson, Ahmad P. Tafti, Nan Zhang, Shreyasee Amin, Andrew H. Limper, Hongfang Liu

TL;DR
This paper explores unsupervised machine learning models, including a novel Poisson Dirichlet Model, to identify latent disease clusters and patient subgroups from electronic health records, aiding epidemiological research.
Contribution
It introduces the Poisson Dirichlet Model (PDM), extending LDA with a Poisson distribution to better account for age and sex factors in disease clustering.
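The idea of weighing observed diagnoses against expected ones can be illustrated with a minimal sketch. This is not the authors' PDM: it only assumes that the expected count for a patient is the mean count within that patient's age and sex stratum, and it scores the observed count under a Poisson distribution with that mean. The field names (`age_group`, `sex`, `diagnoses`), the helper functions, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def expected_counts(patients, disease):
    """Expected count of `disease` per patient, taken as the mean count
    in the patient's (age_group, sex) stratum. Assumption, not the PDM."""
    totals = defaultdict(int)  # diagnoses of `disease` per stratum
    sizes = defaultdict(int)   # number of patients per stratum
    for p in patients:
        key = (p["age_group"], p["sex"])
        totals[key] += p["diagnoses"].get(disease, 0)
        sizes[key] += 1
    return [
        totals[(p["age_group"], p["sex"])] / sizes[(p["age_group"], p["sex"])]
        for p in patients
    ]


def poisson_logpmf(observed, expected):
    """Log-probability of `observed` events under Poisson(mean=expected)."""
    return observed * math.log(expected) - expected - math.lgamma(observed + 1)


# Hypothetical toy cohort: two strata, one disease.
patients = [
    {"age_group": "60+", "sex": "F", "diagnoses": {"osteoporosis": 3}},
    {"age_group": "60+", "sex": "F", "diagnoses": {"osteoporosis": 1}},
    {"age_group": "18-39", "sex": "M", "diagnoses": {"osteoporosis": 2}},
    {"age_group": "18-39", "sex": "M", "diagnoses": {}},
]

observed = [p["diagnoses"].get("osteoporosis", 0) for p in patients]
expected = expected_counts(patients, "osteoporosis")
# Observed-to-expected ratio: values above 1 mean more diagnoses than
# the patient's age/sex stratum would suggest.
ratios = [o / e for o, e in zip(observed, expected)]
```

In this toy cohort the younger male patient with two diagnoses has a higher observed-to-expected ratio than the older female patient with three, which is the kind of age and sex adjustment the summary describes.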
Findings
PDM effectively identifies disease clusters by reducing age and sex bias.
LDA yields patient subgroups that are more clearly differentiated in survival analysis.
Both models are useful for discovering patient subgroups with different research focuses.
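One common way to compare survival between patient subgroups is the Kaplan-Meier estimator. The summary does not say which survival method the authors used, so the sketch below is only an assumption about what such a comparison could look like. The function name, the subgroup labels, and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve.

    times:  follow-up time for each patient
    events: 1 if the event (e.g. death) was observed, 0 if censored
    Returns a list of (time, survival_probability) at each event time.
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at time t.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set after t.
        at_risk -= leaving
    return curve


# Hypothetical follow-up data (years) for two model-derived subgroups.
subgroups = {
    "subgroup_a": ([2, 3, 3, 5, 8], [1, 1, 0, 1, 0]),
    "subgroup_b": ([4, 6, 7, 9, 10], [0, 1, 0, 0, 0]),
}
curves = {name: kaplan_meier(t, e) for name, (t, e) in subgroups.items()}
```

Curves that separate clearly across subgroups would suggest the clustering captured clinically distinct groups; a formal comparison would still need a statistical test, which this sketch omits.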
Abstract
Machine learning has become ubiquitous and a key technology for mining electronic health records (EHRs) to facilitate clinical research and practice. Unsupervised machine learning, as opposed to supervised learning, has shown promise in identifying novel patterns and relations from EHRs without using human-created labels. In this paper, we investigate the application of unsupervised machine learning models in discovering latent disease clusters and patient subgroups based on EHRs. We utilized Latent Dirichlet Allocation (LDA), a generative probabilistic model, and proposed a novel model named Poisson Dirichlet Model (PDM), which extends the LDA approach using a Poisson distribution to model patients' disease diagnoses and to alleviate age and sex factors by considering both observed and expected observations. In the empirical experiments, we evaluated LDA and PDM on three patient…
Taxonomy
Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare · Topic Modeling
Methods: Latent Dirichlet Allocation
