An Unsupervised Homogenization Pipeline for Clustering Similar Patients using Electronic Health Record Data

Alvaro Ulloa; Anna Basile; Gregory J. Wehner; Linyuan Jing; Marylyn D. Ritchie; Brett Beaulieu-Jones; Christopher M. Haggerty; Brandon K. Fornwalt

arXiv:1801.00065·q-bio.QM·March 22, 2018·5 cites

Open Access

TL;DR

This paper assesses unsupervised data homogenization pipelines for clustering patients using EHR data, whose complexity and heterogeneity otherwise hamper analysis.

Contribution

It is the first to evaluate unsupervised homogenization pipelines specifically for EHR clustering, identifying two optimal methods through simulation testing.

Findings

01

Two optimal pipelines identified: MICE with Local Linear Embedding and MICE with Z-scoring and Autoencoders.

02

The pipelines were selected by clustering accuracy on simulated data with varying amounts of redundancy, heterogeneity, and missingness.

03

The study provides a foundation for better patient stratification using unsupervised methods.

Abstract

Electronic health records (EHR) contain a large variety of information on the clinical history of patients such as vital signs, demographics, diagnostic codes and imaging data. The enormous potential for discovery in this rich dataset is hampered by its complexity and heterogeneity. We present the first study to assess unsupervised homogenization pipelines designed for EHR clustering. To identify the optimal pipeline, we tested accuracy on simulated data with varying amounts of redundancy, heterogeneity, and missingness. We identified two optimal pipelines: 1) Multiple Imputation by Chained Equations (MICE) combined with Local Linear Embedding; and 2) MICE, Z-scoring, and Deep Autoencoders.
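To make the first pipeline concrete, the following is a minimal sketch, not the authors' implementation. It assumes scikit-learn's `IterativeImputer` as a MICE-style imputer and `LocallyLinearEmbedding` for the embedding step. The simulated cohort, the 20% missing-at-random rate, the number of neighbors and components, and the use of k-means with the adjusted Rand index for scoring are all illustrative assumptions that do not come from the paper. Only redundancy and missingness are simulated here; heterogeneity is left out.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Hypothetical simulated cohort: 300 "patients" drawn from 3 latent groups.
X_true, y_true = make_blobs(
    n_samples=300, n_features=6, centers=3, cluster_std=1.0, random_state=0
)

# Redundancy (assumed form): append 4 noisy linear combinations of the features.
mix = rng.normal(size=(6, 4))
noise = 0.1 * rng.normal(size=(300, 4))
X_redundant = np.hstack([X_true, X_true @ mix + noise])

# Missingness (assumed form): blank out 20% of entries completely at random.
X_obs = X_redundant.copy()
X_obs[rng.random(X_obs.shape) < 0.2] = np.nan

# Step 1: MICE-style imputation (each feature regressed on the others, iterated).
X_imputed = IterativeImputer(max_iter=10, random_state=0).fit_transform(X_obs)

# Step 2: nonlinear dimensionality reduction with locally linear embedding.
lle = LocallyLinearEmbedding(
    n_neighbors=15, n_components=3, eigen_solver="dense", random_state=0
)
X_embedded = lle.fit_transform(X_imputed)

# Step 3: cluster in the embedded space (k-means is an assumption of this sketch).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_embedded)

# Score agreement with the simulated group labels.
ari = adjusted_rand_score(y_true, labels)
print(f"Adjusted Rand index vs. simulated groups: {ari:.2f}")
```

The second pipeline would follow the same outline, with a z-scoring step (for example `sklearn.preprocessing.StandardScaler`) after imputation and a deep autoencoder in place of the embedding step. The autoencoder is omitted here because its architecture is not described in this summary.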

Taxonomy

Topics: Machine Learning in Healthcare · Bayesian Methods and Mixture Models · Topic Modeling