Representation geometry shapes task performance in vision-language modeling for CT enterography

Cristian Minoccheri; Emily Wittrup; Kayvan Najarian; Ryan Stidham

arXiv:2604.13021·cs.CV·April 15, 2026

TL;DR

This study examines how representation choices in vision-language models affect CT enterography analysis. Mean pooling of slice embeddings favors disease classification, attention pooling favors cross-modal retrieval, and per-slice tissue contrast matters more than broader spatial coverage.

Contribution

The paper presents the first study of vision-language transfer learning on abdominal CT enterography, showing how the pooling strategy and per-slice tissue contrast affect model performance.

Findings

01. Mean pooling of slice embeddings gives better categorical disease assessment (59.2% three-class accuracy).

02. Attention pooling gives better cross-modal retrieval (0.235 text-to-image MRR).

03. Multi-window RGB encoding outperforms multiplanar sampling.
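
Multi-window RGB encoding maps complementary Hounsfield Unit (HU) windows of the same slice to the three color channels. The following is a minimal sketch of that idea in dependency-free Python, not the authors' implementation. The function names are hypothetical, and the window centers and widths in `EXAMPLE_WINDOWS` are illustrative assumptions, since the settings used in the paper are not given here.

```python
def apply_window(hu, center, width):
    """Clip one HU value to a window and rescale it linearly to [0, 1]."""
    low = center - width / 2.0
    high = center + width / 2.0
    clipped = min(max(hu, low), high)
    return (clipped - low) / (high - low)


# ASSUMPTION: illustrative (center, width) pairs in HU, one per color channel.
# These are NOT the paper's settings.
EXAMPLE_WINDOWS = ((40, 400), (50, 150), (300, 1500))


def multi_window_rgb(slice_hu, windows=EXAMPLE_WINDOWS):
    """Encode a 2D slice (rows of HU values) as rows of (R, G, B) tuples.

    Each channel shows the same pixels under a different HU window, so the
    three channels carry complementary tissue contrast.
    """
    if len(windows) != 3:
        raise ValueError("exactly three windows are needed for R, G and B")
    return [
        [tuple(apply_window(v, c, w) for c, w in windows) for v in row]
        for row in slice_hu
    ]


# Usage: a tiny 2x2 "slice" containing air, soft tissue, and dense values.
rgb = multi_window_rgb([[-1000, 40], [120, 1200]])
```

A narrow window spreads a small HU range over the full channel intensity, while a wide window keeps dense structures from saturating, which is one plausible reading of why the channels complement each other.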

Abstract

Computed tomography (CT) enterography is a primary imaging modality for assessing inflammatory bowel disease (IBD), yet the representational choices that best support automated analysis of this modality are unknown. We present the first study of vision-language transfer learning on abdominal CT enterography and identify two main findings. First, mean pooling of slice embeddings gives better categorical disease assessment (59.2% three-class accuracy), whereas attention pooling gives better cross-modal retrieval (0.235 text-to-image MRR). This pattern holds across all LoRA configurations tested and suggests that the two aggregators emphasize different properties of the learned representation. Second, per-slice tissue contrast matters more than broader spatial coverage: multi-window RGB encoding, which maps complementary Hounsfield Unit windows to RGB channels, outperforms all strategies…
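
The two aggregators contrasted in the abstract can be sketched as follows. This is a hedged illustration in dependency-free Python, not the authors' code. The abstract does not specify the form of attention pooling, so the version here assumes a single query vector with scaled dot-product scores and a softmax over slices. The names `mean_pool`, `attention_pool`, and `query` are hypothetical.

```python
import math


def mean_pool(slice_embeddings):
    """Average per-slice embeddings into one study-level vector."""
    n = len(slice_embeddings)
    dim = len(slice_embeddings[0])
    return [sum(e[d] for e in slice_embeddings) / n for d in range(dim)]


def attention_pool(slice_embeddings, query):
    """Weight slices by softmax(query . embedding / sqrt(dim)), then sum.

    ASSUMPTION: a single query vector (learned in practice, fixed here);
    the paper's actual attention pooling may differ.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * x for q, x in zip(query, e)) / scale for e in slice_embeddings
    ]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(slice_embeddings[0])
    pooled = [
        sum(w * e[d] for w, e in zip(weights, slice_embeddings))
        for d in range(dim)
    ]
    return pooled, weights


# Usage: two toy 2-dimensional slice embeddings.
slices = [[1.0, 0.0], [0.0, 1.0]]
uniform = mean_pool(slices)
focused, weights = attention_pool(slices, query=[10.0, 0.0])
```

Mean pooling treats every slice equally, whereas attention pooling can concentrate on a few slices. With an all-zero query the attention weights become uniform and the two aggregators coincide.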
