CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling
Shivika, Kartik Bose, Pankaj Gupta

TL;DR
This study examines how batch composition and data scaling affect contrastive learning for aligning 3D abdominal CT volumes with radiology reports, and finds that the diversity of random sampling outperforms explicit class balancing.
Contribution
The paper reproduces Merlin, a dual-encoder model for CT-report alignment, and systematically investigates how batch composition and data scaling affect model performance.
Findings
The unbalanced (randomly sampled) baseline outperforms all three balanced configurations (25:75, 50:50, and 75:25 normal-to-abnormal) by 2.4 to 2.8 macro F1 points.
Performance improves sub-linearly with increased data size.
Explicit class balancing degrades performance regardless of dataset size.
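To make the batch-composition comparison concrete, here is a minimal sketch of drawing one training batch either at random (the unbalanced baseline) or with a fixed normal-to-abnormal ratio. It is an illustration under stated assumptions, not the authors' sampler: the function name `sample_batch`, its parameters, and the use of a single study-level normal/abnormal flag are hypothetical, whereas the paper describes section-level balanced sampling.

```python
import numpy as np

def sample_batch(is_normal, batch_size, normal_fraction=None, rng=None):
    """Return indices for one batch (hypothetical helper, not from the paper).

    is_normal:        boolean array, True where a study is labeled normal.
    normal_fraction:  None for plain random sampling (unbalanced baseline),
                      or e.g. 0.25 / 0.50 / 0.75 for a fixed ratio.
    """
    rng = np.random.default_rng() if rng is None else rng
    is_normal = np.asarray(is_normal, dtype=bool)

    if normal_fraction is None:
        # Unbalanced baseline: the batch mirrors the dataset's natural mix.
        return rng.choice(is_normal.size, size=batch_size, replace=False)

    # Balanced variant: fix how many normal and abnormal items are drawn.
    n_normal = int(round(batch_size * normal_fraction))
    normal_idx = np.flatnonzero(is_normal)
    abnormal_idx = np.flatnonzero(~is_normal)
    chosen = np.concatenate([
        rng.choice(normal_idx, size=n_normal, replace=False),
        rng.choice(abnormal_idx, size=batch_size - n_normal, replace=False),
    ])
    rng.shuffle(chosen)  # avoid a fixed normal-then-abnormal ordering
    return chosen
```

For example, `sample_batch(labels, 8, normal_fraction=0.75)` would draw six normal and two abnormal items, while `sample_batch(labels, 8)` leaves the mix to chance.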
Abstract
Vision-language models trained with contrastive learning on paired medical images and reports show strong zero-shot diagnostic capabilities, yet the effect of training batch composition on learned representations remains unexplored for 3D medical imaging. We reproduce Merlin, a dual-encoder model that aligns 3D abdominal CT volumes with radiology reports using symmetric InfoNCE loss, achieving a zero-shot macro F1 of 74.45% across 30 findings (original: 73.00%). We then investigate two axes of variation. First, we control the normal-to-abnormal ratio within training batches at 25:75, 50:50, and 75:25 using section-level balanced sampling on the full dataset. All three configurations underperform the unbalanced baseline by 2.4 to 2.8 points, with 75:25 achieving the best result (72.02%) among balanced variants. Second, we conduct data scaling ablations on a 4,362-study subset, training…
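The symmetric InfoNCE loss mentioned in the abstract can be sketched in a few lines of NumPy. This is a generic CLIP-style formulation, not code from the paper: the function name, the temperature value of 0.07, and the assumption that row i of each embedding matrix comes from the same CT-report pair are all illustrative choices.

```python
import numpy as np

def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE over a batch of paired embeddings.

    image_emb, text_emb: arrays of shape (batch, dim); row i of each is
    assumed to come from the same CT-report pair. The temperature value
    is an assumption, not taken from the paper.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix

    def diagonal_cross_entropy(l):
        # Cross-entropy where the correct class for row i is column i.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits.T))
```

Every other item in the batch serves as a negative for each pair, which is why the mix of normal and abnormal studies within a batch can influence what the encoders learn.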
