CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios

Jingyang Lin; Yingda Xia; Jianpeng Zhang; Ke Yan; Kai Cao; Le Lu; Jiebo Luo; Ling Zhang

arXiv:2404.15272 · cs.CV · December 3, 2025 · 3 cites

TL;DR

CT-GLIP introduces a 3D grounded language-image pretraining approach that enhances fine-grained alignment between CT scans and radiology reports, significantly improving performance on zero-shot medical diagnosis tasks.

Contribution

The paper proposes a novel grounded cross-modal contrastive learning method for 3D medical VL pretraining, addressing the limitations of global VL alignment in existing models.

Findings

01. Outperforms existing methods that rely on global VL alignment across multiple tasks.

02. Achieves a 15.1% F1 score improvement in zero-shot abnormality detection.

03. Demonstrates effective organ and abnormality identification in a zero-shot setting.
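
As a rough illustration of what zero-shot identification means here, the sketch below labels a region embedding by picking the text prompt with the highest cosine similarity. This is only a minimal sketch under assumptions: the function name `zero_shot_label`, the prompt labels, and the 3-dimensional toy vectors are all hypothetical, and a real system would obtain the embeddings from pretrained image and text encoders, which this page does not describe.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_label(region_embedding, prompt_embeddings):
    # Score the region against every text prompt and return the best match.
    # No classifier is trained: the label set is defined only by the prompts.
    scores = {
        label: cosine(region_embedding, emb)
        for label, emb in prompt_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical toy embeddings standing in for encoder outputs.
prompts = {
    "liver": [1.0, 0.1, 0.0],
    "kidney": [0.0, 1.0, 0.2],
    "spleen": [0.1, 0.0, 1.0],
}
region = [0.1, 0.9, 0.3]

label, scores = zero_shot_label(region, prompts)
print(label)
```

Because the candidate labels come from text prompts rather than a fixed output layer, new organs or abnormalities could in principle be queried by adding prompts, which is the property the finding above refers to.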

Abstract

3D medical vision-language (VL) pretraining has shown potential in radiology by leveraging large-scale multimodal datasets with CT-report pairs. However, existing methods primarily rely on a global VL alignment directly adapted from 2D scenarios. The entire 3D image is transformed into one global embedding, resulting in a loss of sparse but critical semantics essential for accurately aligning with the corresponding diagnosis. To address this limitation, we propose CT-GLIP, a 3D Grounded Language-Image Pretrained model that constructs fine-grained CT-report pairs to enhance grounded cross-modal contrastive learning, effectively aligning grounded visual features with precise textual descriptions. Leveraging the grounded cross-modal alignment, CT-GLIP improves performance across diverse downstream tasks and can even identify organs and abnormalities in a zero-shot manner using…
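
The abstract contrasts one global embedding per scan with contrastive learning over fine-grained pairs. The sketch below shows a generic symmetric InfoNCE-style loss over region-level (for example, organ-level) visual features and their matching text embeddings. It is an illustration under assumptions, not the paper's loss: the exact objective is not given in this excerpt, and the function names, the temperature of 0.07, and the toy vectors are all hypothetical.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_logits(visual, text, temperature=0.07):
    # Pairwise similarity matrix: row i = visual region i, column j = text j.
    v = [l2_normalize(row) for row in visual]
    t = [l2_normalize(row) for row in text]
    return [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def symmetric_info_nce(visual, text, temperature=0.07):
    # Average of the image-to-text and text-to-image contrastive terms.
    logits = cosine_logits(visual, text, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed)
    )

# Hypothetical toy features: one visual vector and one text vector per region.
organ_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
organ_text = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

aligned = symmetric_info_nce(organ_features, organ_text)
# Rotating the text rows mismatches every pair, so the loss should rise.
shuffled = symmetric_info_nce(organ_features, organ_text[1:] + organ_text[:1])
print(aligned < shuffled)
```

The point of the design is that each positive pair is a specific region and the sentence describing it, so a small finding contributes its own term to the loss instead of being averaged into a single scan-level vector.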

Taxonomy

Topics: Radiomics and Machine Learning in Medical Imaging · Radiology practices and education · Lung Cancer Diagnosis and Treatment

Methods: Sparse Evolutionary Training · Focus · Contrastive Learning · Contrastive Language-Image Pre-training