Vision-Language Modeling in PET/CT for Visual Grounding of Positive Findings
Zachary Huemann, Samuel Church, Joshua D. Warner, Daniel Tran, Xin Tie, Alan B. McMillan, Junjie Hu, Steve Y. Cho, Meghan Lubner, Tyler J. Bradshaw

TL;DR
This study presents a weakly supervised pipeline to generate annotated PET/CT datasets and trains a 3D vision-language model that outperforms other automated methods but still lags behind expert physicians in localizing findings.
Contribution
We developed an automated weak-labeling pipeline for PET/CT reports and trained a novel 3D vision-language model, improving automated lesion localization in medical imaging.
Findings
The weak-labeling pipeline achieved 98% accuracy in lesion localization.
ConTEXTual Net 3D outperformed other models with an F1 score of 0.80.
Model performance was consistent across lesion sizes but varied by radiotracer.
Abstract
Vision-language models can connect the text description of an object to its specific location in an image through visual grounding. This has potential applications in enhanced radiology reporting. However, these models require large annotated image-text datasets, which are lacking for PET/CT. We developed an automated pipeline to generate weak labels linking PET/CT report descriptions to their image locations and used it to train a 3D vision-language visual grounding model. Our pipeline finds positive findings in PET/CT reports by identifying mentions of SUVmax and axial slice numbers. From 25,578 PET/CT exams, we extracted 11,356 sentence-label pairs. Using this data, we trained ConTEXTual Net 3D, which integrates text embeddings from a large language model with a 3D nnU-Net via token-level cross-attention. The model's performance was compared against LLMSeg, a 2.5D version of…
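As a rough illustration of the report-mining step described in the abstract, the sketch below scans a report sentence for a SUVmax value and an axial slice number and keeps the sentence only when both are present. The regular expressions, the `WeakLabel` type, and the `extract_weak_label` helper are assumptions made for this example; the abstract does not say how the authors' pipeline phrases or parses these mentions.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical patterns: the abstract only states that the pipeline looks for
# SUVmax mentions and axial slice numbers, not how reports word them.
SUV_RE = re.compile(r"SUV\s*max\s*(?:of|=|:)?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)
SLICE_RE = re.compile(r"(?:axial\s+)?(?:slice|image)\s*#?\s*(\d+)", re.IGNORECASE)


class WeakLabel(NamedTuple):
    """A report sentence paired with the values that point to its image location."""
    sentence: str
    suv_max: float
    axial_slice: int


def extract_weak_label(sentence: str) -> Optional[WeakLabel]:
    """Return a WeakLabel if the sentence mentions both SUVmax and a slice number."""
    suv = SUV_RE.search(sentence)
    slc = SLICE_RE.search(sentence)
    if suv is None or slc is None:
        # Sentences without both cues are not treated as localizable findings.
        return None
    return WeakLabel(sentence, float(suv.group(1)), int(slc.group(1)))


if __name__ == "__main__":
    example = "Hypermetabolic right hilar lymph node with SUVmax of 8.4 (axial slice 112)."
    print(extract_weak_label(example))
    print(extract_weak_label("No abnormal uptake in the liver."))
```

A sentence that yields a `WeakLabel` would be one candidate sentence-label pair. How the extracted SUVmax and slice number are converted into an image-space label is not covered by the text shown here, so that step is left out of the sketch.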
Taxonomy
Topics: Multimodal Machine Learning Applications
