MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III
Shirly Wang, Matthew B. A. McDermott, Geeticka Chauhan, Michael C. Hughes, Tristan Naumann, Marzyeh Ghassemi

TL;DR
MIMIC-Extract is an open-source pipeline that standardizes and preprocesses complex EHR data from MIMIC-III, making it accessible for machine learning and addressing reproducibility and usability challenges in healthcare AI research.
Contribution
It introduces a comprehensive, extensible framework for transforming raw EHR data into usable formats, facilitating reproducible and scalable machine learning in healthcare.
Findings
Implements standardized data processing functions
Preserves time series data for clinical prediction tasks
Demonstrates utility with benchmark tasks and baseline results
Abstract
Robust machine learning relies on access to data that can be used with standardized frameworks in important tasks and the ability to develop models whose performance can be reasonably reproduced. In machine learning for healthcare, the community faces reproducibility challenges due to a lack of publicly accessible data and a lack of standardized data processing frameworks. We present MIMIC-Extract, an open-source pipeline for transforming raw electronic health record (EHR) data for critical care patients contained in the publicly-available MIMIC-III database into dataframes that are directly usable in common machine learning pipelines. MIMIC-Extract addresses three primary challenges in making complex health records data accessible to the broader machine learning community. First, it provides standardized data processing functions, including unit conversion, outlier detection, and…
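The standardized processing functions named in the abstract (unit conversion and outlier detection) can be illustrated with a minimal sketch. This is not MIMIC-Extract's actual code: the function names, the threshold values, and the rule of dropping implausible values while clamping borderline ones are all assumptions chosen for illustration.

```python
from typing import Optional

# Hypothetical thresholds in degrees Celsius, for illustration only.
# Values outside the "outlier" bounds are treated as recording errors;
# values outside the "valid" bounds but inside the outlier bounds are clamped.
TEMP_C_LIMITS = {
    "outlier_low": 10.0,
    "valid_low": 25.0,
    "valid_high": 45.0,
    "outlier_high": 50.0,
}


def fahrenheit_to_celsius(value_f: float) -> float:
    """Convert a temperature from Fahrenheit to Celsius."""
    return (value_f - 32.0) * 5.0 / 9.0


def clean_value(value: Optional[float], limits: dict) -> Optional[float]:
    """Return None for implausible values, otherwise clamp to the valid range."""
    if value is None:
        return None
    if value < limits["outlier_low"] or value > limits["outlier_high"]:
        return None  # treat as missing rather than guessing a correction
    return min(max(value, limits["valid_low"]), limits["valid_high"])


def standardize_temperature(value: Optional[float], unit: str) -> Optional[float]:
    """Convert a raw temperature reading to Celsius, then apply outlier handling."""
    if value is None:
        return None
    if unit == "F":
        celsius = fahrenheit_to_celsius(value)
    elif unit == "C":
        celsius = value
    else:
        raise ValueError(f"unsupported unit: {unit!r}")
    return clean_value(celsius, TEMP_C_LIMITS)
```

Under these assumed thresholds, a reading of 98.6 °F is converted to roughly 37 °C, a reading of 46 °C is clamped to 45 °C, and a reading of 500 °C is returned as missing.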
Taxonomy
Topics: Machine Learning in Healthcare · Sepsis Diagnosis and Treatment · Time Series Analysis and Forecasting
