An Unsupervised Homogenization Pipeline for Clustering Similar Patients using Electronic Health Record Data
Alvaro Ulloa, Anna Basile, Gregory J. Wehner, Linyuan Jing, Marylyn D. Ritchie, Brett Beaulieu-Jones, Christopher M. Haggerty, Brandon K. Fornwalt

TL;DR
This paper evaluates unsupervised data homogenization pipelines for clustering patients from EHR data, with the aim of making complex, heterogeneous medical records easier to analyze.
Contribution
The authors present it as the first study to assess unsupervised homogenization pipelines designed specifically for EHR clustering, and they identify two optimal pipelines by testing accuracy on simulated data.
Findings
Two optimal pipelines were identified: MICE with Local Linear Embedding, and MICE with Z-scoring and Deep Autoencoders.
These pipelines were selected for their clustering accuracy on simulated data with varying amounts of redundancy, heterogeneity, and missingness.
The study provides a foundation for better patient stratification using unsupervised methods.
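As a rough illustration of the first pipeline, the sketch below chains a chained-equations imputer, Locally Linear Embedding (the usual name for what the abstract calls Local Linear Embedding), and a clustering step using scikit-learn. Everything specific here is an assumption made for illustration: the library, the use of `IterativeImputer` (a MICE-style imputer that returns a single imputation rather than multiple ones), k-means as the clustering algorithm, the toy data, and all hyperparameters. The summary does not describe the authors' implementation.

```python
import numpy as np
# IterativeImputer is still experimental in scikit-learn and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Toy stand-in for EHR features (assumption): two patient groups, 8 numeric features.
n_per_group, n_features = 100, 8
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_per_group, n_features)),
    rng.normal(3.0, 1.0, size=(n_per_group, n_features)),
])
y_true = np.repeat([0, 1], n_per_group)

# Remove 10% of the entries at random to mimic missing measurements.
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.10] = np.nan

# Impute -> embed -> cluster. Hyperparameters are illustrative, not tuned.
pipeline = make_pipeline(
    IterativeImputer(max_iter=10, random_state=0),
    LocallyLinearEmbedding(n_neighbors=15, n_components=2,
                           eigen_solver="dense", random_state=0),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X_missing)

# Agreement between recovered clusters and the simulated groups (1.0 = perfect).
ari = adjusted_rand_score(y_true, labels)
print("adjusted Rand index:", ari)
```

The second pipeline would insert a z-scoring step (for example `StandardScaler`) after imputation and replace the embedding with an autoencoder, which needs a neural-network library and is omitted here.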
Abstract
Electronic health records (EHR) contain a large variety of information on the clinical history of patients such as vital signs, demographics, diagnostic codes and imaging data. The enormous potential for discovery in this rich dataset is hampered by its complexity and heterogeneity. We present the first study to assess unsupervised homogenization pipelines designed for EHR clustering. To identify the optimal pipeline, we tested accuracy on simulated data with varying amounts of redundancy, heterogeneity, and missingness. We identified two optimal pipelines: 1) Multiple Imputation by Chained Equations (MICE) combined with Local Linear Embedding; and 2) MICE, Z-scoring, and Deep Autoencoders.
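The abstract names three properties that were varied in the simulated data: redundancy, heterogeneity, and missingness. The sketch below shows one hypothetical way to generate such data with NumPy. The function name `simulate_ehr`, its parameters, and the specific mechanisms are assumptions and not the authors' simulation design: redundancy is modeled as noisy copies of informative columns, heterogeneity as per-feature scales and offsets plus one binarized column, and missingness as values removed completely at random.

```python
import numpy as np

def simulate_ehr(n_patients=300, n_clusters=3, n_informative=5,
                 n_redundant=5, missing_rate=0.2, seed=0):
    """Hypothetical generator of clustered data with EHR-like defects."""
    rng = np.random.default_rng(seed)

    # Ground-truth cluster membership and cluster-specific feature means.
    labels = rng.integers(0, n_clusters, size=n_patients)
    centers = rng.normal(0.0, 3.0, size=(n_clusters, n_informative))
    informative = centers[labels] + rng.normal(size=(n_patients, n_informative))

    # Redundancy: noisy copies of randomly chosen informative columns.
    src = rng.integers(0, n_informative, size=n_redundant)
    redundant = informative[:, src] + rng.normal(
        0.0, 0.1, size=(n_patients, n_redundant))
    X = np.hstack([informative, redundant])

    # Heterogeneity: each feature gets its own scale and offset,
    # and the last column is binarized at its median.
    scales = rng.uniform(0.1, 100.0, size=X.shape[1])
    offsets = rng.uniform(-50.0, 50.0, size=X.shape[1])
    X = X * scales + offsets
    X[:, -1] = (X[:, -1] > np.median(X[:, -1])).astype(float)

    # Missingness: entries removed completely at random.
    X[rng.random(X.shape) < missing_rate] = np.nan
    return X, labels

X, labels = simulate_ehr()
missing_fraction = float(np.isnan(X).mean())
print(X.shape, missing_fraction)
```

Sweeping `n_redundant` and `missing_rate` over a grid and scoring each candidate pipeline against `labels` would give the kind of accuracy comparison the abstract describes.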
Taxonomy
Topics: Machine Learning in Healthcare · Bayesian Methods and Mixture Models · Topic Modeling
