MIMIC-IV-Ext-22MCTS: A 22 Millions-Event Temporal Clinical Time-Series Dataset with Relative Timestamp for Risk Prediction

Jing Wang; Xing Niu; Tong Zhang; Jie Shen; Juyong Kim; Jeremy C. Weiss

arXiv:2505.00827·cs.AI·November 19, 2025

TL;DR

This paper introduces MIMIC-IV-Ext-22MCTS, a large-scale clinical time-series dataset of 22,588,586 events extracted from MIMIC-IV-Note discharge summaries. The extraction framework combines chunking, contextual BM25 and semantic search, and Llama-3.1-8B prompting, with the goal of improving healthcare risk prediction models.

Contribution

The paper presents a new large-scale clinical dataset with a unique framework for extracting temporal information from unstructured discharge summaries, enhancing model performance in healthcare tasks.

Findings

01

BERT fine-tuned on the dataset achieves a 10% improvement in accuracy on medical question answering.

02

BERT fine-tuned on the dataset improves clinical trial matching by 3%.

03

The dataset enables significant improvements in healthcare risk prediction models.

Abstract

A crucial component of developing a reliable clinical risk prediction model is collecting high-quality time series of clinical events. In this work, we release such a dataset, consisting of 22,588,586 clinical time-series events, which we term MIMIC-IV-Ext-22MCTS. Our source data are discharge summaries selected from the well-known yet unstructured MIMIC-IV-Note \cite{Johnson2023-pg}. The general-purpose MIMIC-IV-Note poses specific challenges for our work: the discharge summaries are too lengthy for typical natural language models to process, and the clinical events of interest are often not accompanied by explicit timestamps. Therefore, we propose a new framework that works as follows: 1) we break each discharge summary into manageably small text chunks; 2) we apply contextual BM25 and contextual semantic search to retrieve chunks that have a…
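The first two steps of the framework (chunking, then lexical retrieval) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Okapi BM25 rather than the contextual variant named in the abstract, it omits the semantic search and Llama-3.1-8B prompting stages, and the function names (`chunk_words`, `bm25_scores`, `top_chunks`), the word-window chunking rule, and the parameter values (`size`, `overlap`, `k1`, `b`) are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=8, overlap=2):
    """Split text into windows of `size` words; neighbours share `overlap` words.

    The window size and overlap are illustrative assumptions, not values
    taken from the paper.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def tokenize(text):
    """Lowercase and keep alphanumeric runs only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score every chunk against the query with plain Okapi BM25."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    if n == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of chunks containing each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_chunks(query, chunks, k=3):
    """Return the k highest-scoring chunks, best first."""
    scores = bm25_scores(query, chunks)
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in ranked[:k]]
```

In a pipeline of this shape, the retrieved chunks would then be passed to a language model that extracts events and their timestamps; that stage is left out here because the abstract above is truncated before describing it.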
