CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios
Jingyang Lin, Yingda Xia, Jianpeng Zhang, Ke Yan, Kai Cao, Le Lu, Jiebo Luo, Ling Zhang

TL;DR
CT-GLIP introduces a 3D grounded language-image pretraining approach that strengthens fine-grained alignment between CT scans and radiology reports, significantly improving performance on zero-shot medical diagnosis tasks.
Contribution
The paper proposes a grounded cross-modal contrastive learning method for 3D medical vision-language (VL) pretraining, addressing the limitations of the global VL alignment used by existing models.
Findings
Outperforms existing methods that rely on global VL alignment across multiple tasks.
Achieves a 15.1% F1 score improvement in zero-shot abnormality detection.
Demonstrates effective organ and abnormality identification in a zero-shot setting.
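A minimal sketch of how zero-shot identification of this kind is commonly done in contrastively pretrained models: a visual feature is compared against text embeddings of candidate labels, and the most similar label wins. The summary does not describe CT-GLIP's actual inference procedure, so everything here is an assumption for illustration: the function names (`cosine`, `zero_shot_label`), the label set, and the toy 3-dimensional embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_label(visual_feature, text_embeddings):
    """Pick the label whose text embedding is closest to the visual feature.

    text_embeddings maps a label name to its (hypothetical) text embedding.
    Returns the best label and the full score table.
    """
    scores = {
        label: cosine(visual_feature, emb)
        for label, emb in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings standing in for the outputs of a text encoder (assumed).
text_embeddings = {
    "liver": [1.0, 0.0, 0.0],
    "kidney": [0.0, 1.0, 0.0],
    "spleen": [0.0, 0.0, 1.0],
}

# Toy feature standing in for a pooled visual feature of one CT region (assumed).
region_feature = [0.1, 0.9, 0.2]

predicted, scores = zero_shot_label(region_feature, text_embeddings)
print(predicted)
```

No labeled training examples are used at inference time in this scheme; adding a new organ or abnormality only requires a new text prompt and its embedding.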
Abstract
3D medical vision-language (VL) pretraining has shown potential in radiology by leveraging large-scale multimodal datasets with CT-report pairs. However, existing methods primarily rely on a global VL alignment directly adapted from 2D scenarios. The entire 3D image is transformed into one global embedding, resulting in a loss of sparse but critical semantics essential for accurately aligning with the corresponding diagnosis. To address this limitation, we propose CT-GLIP, a 3D Grounded Language-Image Pretrained model that constructs fine-grained CT-report pairs to enhance grounded cross-modal contrastive learning, effectively aligning grounded visual features with precise textual descriptions. Leveraging the grounded cross-modal alignment, CT-GLIP improves performance across diverse downstream tasks and can even identify organs and abnormalities in a zero-shot manner using…
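To make the contrast between global and grounded alignment concrete, here is a rough sketch of a symmetric contrastive (InfoNCE-style) loss applied to region-level pairs rather than to one embedding per scan. This is the generic CLIP-style formulation, not the loss from the paper, whose exact form the abstract does not give. The function names, the temperature of 0.07, and the toy features are assumptions.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def grounded_contrastive_loss(region_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over N (region feature, text feature) pairs.

    Pair i is the positive for row i; all other pairings are negatives.
    The temperature value is an assumed default, not taken from the paper.
    """
    v = [normalize(x) for x in region_feats]
    t = [normalize(x) for x in text_feats]
    n = len(v)
    # Similarity matrix: rows are regions, columns are texts.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Region-to-text direction: each region should select its own text.
    r2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-region direction: each text should select its own region.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2r = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (r2t + t2r)


# Toy features: three regions (for example, organs) and their descriptions.
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled_texts = list(reversed(aligned_texts))

aligned_loss = grounded_contrastive_loss(regions, aligned_texts)
shuffled_loss = grounded_contrastive_loss(regions, shuffled_texts)
print(aligned_loss < shuffled_loss)
```

With global alignment there would be a single row per scan; in this sketch each row corresponds to one localized region and its matching description, so sparse findings are not averaged away into one scan-level embedding.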
Taxonomy
Topics: Radiomics and Machine Learning in Medical Imaging · Radiology practices and education · Lung Cancer Diagnosis and Treatment
Methods: Sparse Evolutionary Training · Focus · Contrastive Learning · Contrastive Language-Image Pre-training
