CXR-TFT: Multi-Modal Temporal Fusion Transformer for Predicting Chest X-ray Trajectories

Mehak Arora; Ayman Ali; Kaiyuan Wu; Carolyn Davis; Takashi Shimazui; Mahmoud Alwakeel; Victor Moas; Philip Yang; Annette Esper; Rishikesan Kamaleswaran

arXiv:2507.14766 · cs.LG · July 22, 2025


TL;DR

CXR-TFT is a multi-modal transformer framework that predicts chest X-ray trajectories in ICU patients by integrating sparse imaging, reports, and high-frequency clinical data, enabling early detection of radiographic abnormalities.

Contribution

This work introduces a novel multi-modal transformer model that combines diverse clinical data sources to forecast CXR findings over time, addressing the limitations of cross-sectional analysis.

Findings

01 High accuracy in predicting abnormal CXR findings up to 12 hours early.

02 Effective integration of imaging, reports, and clinical data for trajectory forecasting.

03 Potential to improve early intervention in critical care settings.

Abstract

In intensive care units (ICUs), patients with complex clinical conditions require vigilant monitoring and prompt interventions. Chest X-rays (CXRs) are a vital diagnostic tool, providing insights into clinical trajectories, but their irregular acquisition limits their utility. Existing tools for CXR interpretation are constrained by cross-sectional analysis, failing to capture temporal dynamics. To address this, we introduce CXR-TFT, a novel multi-modal framework that integrates temporally sparse CXR imaging and radiology reports with high-frequency clinical data, such as vital signs, laboratory values, and respiratory flow sheets, to predict the trajectory of CXR findings in critically ill patients. CXR-TFT leverages latent embeddings from a vision encoder that are temporally aligned with hourly clinical data through interpolation. A transformer model is then trained to predict CXR…
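The alignment step described in the abstract, where sparse CXR embeddings are interpolated onto the hourly clinical timeline, can be illustrated with a minimal sketch. It assumes linear interpolation per embedding dimension; the abstract does not state which interpolation scheme the authors use. The function name `interpolate_embeddings`, the toy 2-dimensional embeddings, and the zero-filled clinical features are all hypothetical and chosen only for illustration.

```python
import numpy as np

def interpolate_embeddings(cxr_hours, cxr_embeddings, hourly_grid):
    """Linearly interpolate sparse CXR embeddings onto an hourly grid.

    cxr_hours:      (n_cxr,) acquisition times in hours, increasing
    cxr_embeddings: (n_cxr, d) one latent vector per CXR
    hourly_grid:    (n_hours,) times at which clinical data is sampled
    Returns an (n_hours, d) array. Outside the observed time range,
    np.interp holds the nearest endpoint value (an assumption of this
    sketch, not something stated in the paper).
    """
    cxr_hours = np.asarray(cxr_hours, dtype=float)
    cxr_embeddings = np.asarray(cxr_embeddings, dtype=float)
    hourly_grid = np.asarray(hourly_grid, dtype=float)
    # Interpolate each embedding dimension independently over time.
    columns = [
        np.interp(hourly_grid, cxr_hours, cxr_embeddings[:, j])
        for j in range(cxr_embeddings.shape[1])
    ]
    return np.stack(columns, axis=1)

# Toy example: two CXRs taken at hour 0 and hour 4, 2-dim embeddings.
cxr_hours = [0.0, 4.0]
cxr_embeddings = [[0.0, 1.0], [4.0, 3.0]]
hourly_grid = np.arange(0, 6)  # hours 0..5

aligned = interpolate_embeddings(cxr_hours, cxr_embeddings, hourly_grid)

# Hypothetical hourly clinical features (e.g. vitals, labs), 3 per hour.
clinical = np.zeros((len(hourly_grid), 3))

# One fused feature vector per hour, ready to feed a sequence model.
fused = np.concatenate([aligned, clinical], axis=1)
```

Each row of `fused` then pairs an interpolated image embedding with that hour's clinical measurements. How the actual model combines the two modalities may differ from this simple concatenation.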
