A Multimodal Data Processing Pipeline for MIMIC-IV Dataset
Farzana Islam Adiba, Varsha Danduri, Fahmida Liza Piya, Ali Abbasi, Mehak Gupta, Rahmatollah Beheshti

TL;DR
This paper introduces a comprehensive, customizable multimodal data processing pipeline for the MIMIC-IV dataset, streamlining integration of diverse data types to facilitate clinical machine learning research.
Contribution
It presents a new pipeline that automates and standardizes multimodal data integration from MIMIC-IV, improving efficiency and reproducibility over existing methods.
Findings
Reduces multimodal processing time significantly.
Supports arbitrary downstream applications.
Enhances reproducibility of MIMIC-based studies.
Abstract
The MIMIC-IV dataset is a large, publicly available electronic health record (EHR) resource widely used for clinical machine learning research. It comprises multiple modalities, including structured data, clinical notes, waveforms, and imaging data. Working with these disjointed modalities requires extensive manual effort to preprocess and align them for downstream analysis. While several pipelines for MIMIC-IV data extraction are available, they target a small subset of modalities or do not fully support arbitrary downstream applications. In this work, we greatly expand our prior popular unimodal pipeline and present a comprehensive and customizable multimodal pipeline that can significantly reduce multimodal processing time and enhance the reproducibility of MIMIC-based studies. Our pipeline systematically integrates the listed modalities, enabling automated cohort selection,…
Taxonomy
Topics: Machine Learning in Healthcare · Healthcare Technology and Patient Monitoring · Electronic Health Records Systems
