Representation geometry shapes task performance in vision-language modeling for CT enterography
Cristian Minoccheri, Emily Wittrup, Kayvan Najarian, Ryan Stidham

TL;DR
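This study examines how representation choices in vision-language models affect performance on CT enterography analysis. The pooling method determines which task benefits (mean pooling for classification, attention pooling for retrieval), and per-slice tissue contrast matters more than broader spatial coverage.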
Contribution
It provides the first analysis of vision-language transfer learning for CT enterography, highlighting the effects of pooling strategies and tissue contrast on model performance.
Findings
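Mean pooling of slice embeddings gives better disease classification than attention pooling (59.2% three-class accuracy).
Attention pooling gives better cross-modal retrieval than mean pooling (0.235 text-to-image MRR).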
Multi-window RGB encoding outperforms multiplanar sampling.
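The two aggregators compared in the findings above can be illustrated with a minimal sketch. It assumes each CT slice has already been encoded into a fixed-length embedding, and it assumes attention pooling takes the form of a softmax over scaled dot products with a single query vector. The paper's actual attention module is not described here, so the names `mean_pool`, `attention_pool`, and `query` are hypothetical.

```python
import math


def mean_pool(slice_embeddings):
    """Average per-slice embeddings into one volume-level vector."""
    n = len(slice_embeddings)
    dim = len(slice_embeddings[0])
    return [sum(e[d] for e in slice_embeddings) / n for d in range(dim)]


def attention_pool(slice_embeddings, query):
    """Weighted average of slice embeddings.

    Assumption: weights are a softmax over scaled dot products between
    each slice embedding and a query vector (learned in a real model,
    passed in by hand here).
    """
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, e)) / math.sqrt(dim)
        for e in slice_embeddings
    ]
    # Subtract the maximum score for a numerically stable softmax.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [v / total for v in exps]
    pooled = [
        sum(w * e[d] for w, e in zip(weights, slice_embeddings))
        for d in range(dim)
    ]
    return pooled, weights


# Toy volume: three slices with two-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
volume_mean = mean_pool(embeddings)
volume_attn, slice_weights = attention_pool(embeddings, query=[4.0, 0.0])
```

With a zero query every slice receives the same weight, so attention pooling reduces to mean pooling. A non-zero query shifts weight toward the slices that align with it, which is one plausible reason the two aggregators could emphasize different properties of the same representation.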
Abstract
Computed tomography (CT) enterography is a primary imaging modality for assessing inflammatory bowel disease (IBD), yet the representational choices that best support automated analysis of this modality are unknown. We present the first study of vision-language transfer learning on abdominal CT enterography and identify two main findings. First, mean pooling of slice embeddings gives better categorical disease assessment (59.2% three-class accuracy), whereas attention pooling gives better cross-modal retrieval (0.235 text-to-image MRR). This pattern holds across all LoRA configurations tested and suggests that the two aggregators emphasize different properties of the learned representation. Second, per-slice tissue contrast matters more than broader spatial coverage: multi-window RGB encoding, which maps complementary Hounsfield Unit windows to RGB channels, outperforms all strategies…
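As a rough illustration of multi-window RGB encoding, the sketch below clips a slice of Hounsfield Unit values to three windows and writes each normalized result to one color channel. The window centers and widths in `WINDOWS` are placeholder values chosen for this example, not the settings used in the paper, and the function names are hypothetical. Plain Python lists keep the example dependency-free; a real pipeline would more likely use an array library.

```python
def apply_window(hu, center, width):
    """Clip one HU value to a window and rescale it to [0, 1]."""
    lo = center - width / 2
    hi = center + width / 2
    clipped = min(max(hu, lo), hi)
    return (clipped - lo) / (hi - lo)


# Hypothetical (center, width) pairs for the R, G, and B channels.
# These are illustrative assumptions, not values taken from the paper.
WINDOWS = ((40, 400), (50, 150), (400, 1800))


def encode_slice(hu_slice, windows=WINDOWS):
    """Map a 2D grid of HU values to a 2D grid of (r, g, b) tuples."""
    return [
        [tuple(apply_window(hu, c, w) for c, w in windows) for hu in row]
        for row in hu_slice
    ]


# Toy 2x2 slice: air, water, soft tissue, and dense bone.
toy_slice = [[-1000, 0], [40, 1500]]
rgb = encode_slice(toy_slice)
```

Each channel then carries a different contrast range of the same slice, so an image encoder that expects three-channel input sees three complementary views of the tissue at once.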
