A latent topic model for mining heterogenous non-randomly missing electronic health records data
Yue Li, Manolis Kellis

TL;DR
This paper introduces mixEHR, a novel unsupervised model that effectively analyzes heterogeneous and biased electronic health records to uncover disease patterns, impute missing data, and predict patient outcomes.
Contribution
The paper presents mixEHR, a new generative model combining collaborative filtering and latent topic modeling for EHR data analysis, addressing heterogeneity and bias.
Findings
mixEHR outperforms previous methods in simulations and real data
It reveals meaningful multi-disease insights from EHR data
The model improves data imputation and mortality prediction
Abstract
Electronic health records (EHR) are rich heterogeneous collections of patient health information, whose broad adoption provides great opportunities for systematic health data mining. However, heterogeneous EHR data types and biased ascertainment impose computational challenges. Here, we present mixEHR, an unsupervised generative model integrating collaborative filtering and latent topic models, which jointly models the discrete distributions of data observation bias and actual data using latent disease-topic distributions. We apply mixEHR to 12.8 million phenotypic observations from the MIMIC dataset, and use it to reveal latent disease topics, interpret EHR results, impute missing data, and predict mortality in intensive care units. Using both simulation and real data, we show that mixEHR outperforms previous methods and reveals meaningful multi-disease insights.
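To make the idea of jointly modeling observation bias and actual data more concrete, here is a minimal toy simulation. It is not the mixEHR model or its inference procedure, only an illustrative sketch under assumptions of our own: two data types (diagnosis codes and lab tests), a Dirichlet prior over patient topic mixtures, and Beta-distributed topic-specific probabilities that a lab is ordered and that an ordered lab is abnormal. The function name `simulate_ehr`, all parameter names, and all hyperparameter values are hypothetical.

```python
import numpy as np


def simulate_ehr(n_patients=200, n_topics=3, n_codes=8, n_labs=5,
                 tokens_per_patient=20, seed=0):
    """Toy generative sketch (an assumption, not the authors' model)."""
    rng = np.random.default_rng(seed)

    # Patient-level mixtures over latent disease topics.
    theta = rng.dirichlet(np.full(n_topics, 0.3), size=n_patients)
    # Topic-specific distributions over diagnosis codes (first data type).
    phi_codes = rng.dirichlet(np.full(n_codes, 0.5), size=n_topics)
    # Observation model: topic-specific probability that each lab is ordered.
    p_ordered = rng.beta(1.0, 3.0, size=(n_topics, n_labs))
    # Data model: topic-specific probability that an ordered lab is abnormal.
    p_abnormal = rng.beta(2.0, 2.0, size=(n_topics, n_labs))

    codes = np.zeros((n_patients, n_codes), dtype=int)
    lab_observed = np.zeros((n_patients, n_labs), dtype=bool)
    lab_abnormal = np.zeros((n_patients, n_labs), dtype=bool)
    labs = np.arange(n_labs)

    for i in range(n_patients):
        # Each diagnosis-code token gets its own topic assignment.
        z_codes = rng.choice(n_topics, size=tokens_per_patient, p=theta[i])
        for k in z_codes:
            codes[i, rng.choice(n_codes, p=phi_codes[k])] += 1

        # Each lab test gets a topic assignment that drives BOTH whether
        # the test is ordered and what its result is, so missingness is
        # informative rather than random.
        z_labs = rng.choice(n_topics, size=n_labs, p=theta[i])
        lab_observed[i] = rng.random(n_labs) < p_ordered[z_labs, labs]
        lab_abnormal[i] = lab_observed[i] & (
            rng.random(n_labs) < p_abnormal[z_labs, labs]
        )

    return theta, codes, lab_observed, lab_abnormal


if __name__ == "__main__":
    theta, codes, lab_observed, lab_abnormal = simulate_ehr()
    print("fraction of lab tests ordered:", lab_observed.mean())
```

In this sketch the same latent topic influences both whether a lab test is recorded and its result, which is the sense in which the missing entries are non-random. A model that ignores `lab_observed` would discard that signal.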
Taxonomy
Topics: Machine Learning in Healthcare · Data-Driven Disease Surveillance · Bayesian Methods and Mixture Models
