Cardiac-CLIP: A Vision-Language Foundation Model for 3D Cardiac CT Images

Yutao Hu; Ying Zheng; Shumei Miao; Xiaolei Zhang; Jiahao Xia; Yaolei Qi; Yiyang Zhang; Yuting He; Qian Chen; Jing Ye; Hongyan Qiao; Xiuhua Hu; Lei Xu; Jiayin Zhang; Hui Liu; Minwen Zheng; Yining Wang; Daimin Zhang; Ji Zhang; Wenqi Shao; Yun Liu; Longjiang Zhang; Guanyu Yang

arXiv:2507.22024·eess.IV·July 30, 2025

TL;DR

Cardiac-CLIP is a novel multi-modal foundation model for 3D cardiac CT images that leverages self-supervised and contrastive learning to improve cardiovascular diagnostics and clinical task performance.

Contribution

The paper introduces Cardiac-CLIP, a two-stage pre-training framework: a 3D masked autoencoder (MAE) first learns visual representations from unlabeled volumes, and contrastive learning then aligns those visual representations with text.
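The contrastive alignment stage can be illustrated with a generic CLIP-style symmetric loss. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, and the use of plain in-batch negatives are all illustrative choices that the page does not specify.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each embedding to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style loss: row i of image_emb is assumed to be paired with row i of text_emb.

    The temperature value is an assumption for illustration only.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])            # matching pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scans = rng.normal(size=(4, 16))                       # stand-in for CT volume embeddings
    reports = scans + 0.01 * rng.normal(size=(4, 16))      # stand-in for paired report embeddings
    print(symmetric_contrastive_loss(scans, reports))
```

The loss is small when each scan embedding is closest to its own report embedding and grows when the pairing is broken, which is the property the alignment stage relies on.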

Findings

01 Achieves state-of-the-art results in cardiovascular abnormality classification.

02 Effectively supports clinical tasks such as acute coronary syndrome prediction.

03 Demonstrates strong generalization across multiple datasets.

Abstract

Foundation models have demonstrated remarkable potential in the medical domain. However, their application to complex cardiovascular diagnostics remains underexplored. In this paper, we present Cardiac-CLIP, a multi-modal foundation model designed for 3D cardiac CT images. Cardiac-CLIP is developed through a two-stage pre-training strategy. The first stage employs a 3D masked autoencoder (MAE) to perform self-supervised representation learning from large-scale unlabeled volumetric data, enabling the visual encoder to capture rich anatomical and contextual features. In the second stage, contrastive learning is introduced to align visual and textual representations, facilitating cross-modal understanding. To support the pre-training, we collect 16,641 real clinical CT scans, supplemented by 114k publicly available data samples. Meanwhile, we standardize free-text radiology reports into unified…
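The first-stage idea of masked autoencoding on volumes can be sketched as follows: split a volume into non-overlapping 3D patches, hide a random subset, and score reconstruction only on the hidden patches. This is a hedged NumPy illustration of the general MAE recipe, not the paper's code. The helper names, the cubic patch size, and the 0.75 mask ratio are assumptions; the encoder and decoder networks are omitted.

```python
import numpy as np


def patchify_3d(volume, patch):
    """Split a (D, H, W) volume into non-overlapping cubic patches of side `patch`.

    Returns an array of shape (num_patches, patch**3).
    """
    d, h, w = volume.shape
    p = patch
    if d % p or h % p or w % p:
        raise ValueError("volume dimensions must be divisible by the patch size")
    x = volume.reshape(d // p, p, h // p, p, w // p, p)
    x = x.transpose(0, 2, 4, 1, 3, 5)           # group the three patch-grid axes first
    return x.reshape(-1, p ** 3)


def random_mask(num_patches, mask_ratio, rng):
    """Pick patches to keep; returns (kept indices, boolean mask with True = hidden)."""
    num_keep = int(round(num_patches * (1.0 - mask_ratio)))
    keep = np.sort(rng.permutation(num_patches)[:num_keep])
    mask = np.ones(num_patches, dtype=bool)
    mask[keep] = False
    return keep, mask


def masked_mse(pred, target, mask):
    """Mean squared error averaged over hidden patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return per_patch[mask].mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    volume = rng.normal(size=(8, 8, 8))          # toy stand-in for a CT volume
    patches = patchify_3d(volume, patch=4)       # 8 patches of 64 voxels each
    keep, mask = random_mask(len(patches), mask_ratio=0.75, rng=rng)  # ratio is an assumption
    visible = patches[keep]                      # what an encoder would see
    prediction = np.zeros_like(patches)          # placeholder for a decoder output
    print(visible.shape, masked_mse(prediction, patches, mask))
```

In a full model, only `visible` would pass through the encoder, and a lightweight decoder would predict the hidden patches that `masked_mse` scores.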
