Anatomy-VLM: A Fine-grained Vision-Language Model for Medical Interpretation
Difei Gu, Yunhe Gao, Mu Zhou, Dimitris Metaxas

TL;DR
Anatomy-VLM is a novel fine-grained vision-language model designed for medical interpretation, integrating anatomical details and structured knowledge to improve disease diagnosis and enable zero-shot interpretation.
Contribution
It introduces a multi-scale, anatomy-aware model that localizes key features and aligns medical information for enhanced interpretability and diagnostic accuracy.
Findings
Achieves high performance on in- and out-of-distribution datasets.
Improves performance on downstream image segmentation tasks.
Enables zero-shot anatomy-wise interpretation.
Abstract
Accurate disease interpretation from radiology remains challenging due to imaging heterogeneity. Achieving expert-level diagnostic decisions requires integrating subtle image features with clinical knowledge. Yet major vision-language models (VLMs) treat images as holistic entities and overlook the fine-grained image details that are vital for disease diagnosis. Clinicians analyze images by drawing on prior medical knowledge and identifying anatomical structures as important regions of interest (ROIs). Inspired by this human-centric workflow, we introduce Anatomy-VLM, a fine-grained vision-language model that incorporates multi-scale information. First, we design a model encoder to localize key anatomical features from entire medical images. Second, these regions are enriched with structured knowledge for contextually aware interpretation. Finally, the model encoder aligns…
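To make the region-level idea concrete, here is a minimal sketch of how anatomy-wise zero-shot labeling could work in principle: crop each region of interest, embed it, and pick the text prompt whose embedding is most similar. It is not the authors' implementation. The function names, the toy image, the bounding boxes, the two-dimensional "embeddings", and the prompt strings are all hypothetical placeholders standing in for learned encoders. The knowledge-enrichment step and the alignment step (which the truncated abstract does not spell out) are not modeled.

```python
import math

def crop(image, box):
    """Return the sub-grid of `image` inside box = (top, left, bottom, right), end-exclusive."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]

def embed_region(region, threshold=5):
    """Toy stand-in for a learned region encoder (assumption):
    returns [fraction of bright pixels, fraction of dark pixels]."""
    pixels = [p for row in region for p in row]
    bright = sum(1 for p in pixels if p > threshold) / len(pixels)
    return [bright, 1.0 - bright]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def zero_shot_label(region_vec, prompt_vecs):
    """Pick the prompt whose embedding is most similar to the region embedding."""
    return max(prompt_vecs, key=lambda name: cosine(region_vec, prompt_vecs[name]))

# Toy 4x4 "image": a bright patch in the upper-left corner, dark elsewhere.
image = [
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]

# Hypothetical anatomical bounding boxes; a real system would predict these.
boxes = {"region_a": (0, 0, 2, 2), "region_b": (2, 2, 4, 4)}

# Hypothetical text-prompt embeddings living in the same 2-D toy space.
prompts = {"opacity present": [0.9, 0.1], "no finding": [0.1, 0.9]}

labels = {
    name: zero_shot_label(embed_region(crop(image, box)), prompts)
    for name, box in boxes.items()
}
print(labels)
```

The point of the sketch is the per-region loop: each anatomical box receives its own label, whereas a holistic model would assign a single label to the whole image.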
Taxonomy
Topics: Multimodal Machine Learning Applications · COVID-19 diagnosis using AI · Machine Learning in Healthcare
