Describe Anything in Medical Images

Xi Xiao; Yunbei Zhang; Thanh-Huy Nguyen; Ba-Thinh Lam; Janet Wang; Lin Zhao; Jihun Hamm; Tianyang Wang; Xingjian Li; Xiao Wang; Hao Xu; Tianming Liu; Min Xu

arXiv:2505.05804 · cs.CV · May 27, 2025 · 2 cites

TL;DR

This paper introduces MedDAM, a novel framework that applies large vision-language models to generate region-specific descriptions in medical images, addressing the gap in specialized domain understanding.

Contribution

MedDAM is the first comprehensive medical imaging framework leveraging large vision-language models with expert prompts and a new evaluation benchmark for clinical factuality.

Findings

01. MedDAM outperforms existing models on multiple medical datasets.

02. Region-level semantic alignment improves medical image understanding.

03. The benchmark effectively evaluates clinical factuality without ground-truth captions.

Abstract

Localized image captioning has made significant progress with models like the Describe Anything Model (DAM), which can generate detailed region-specific descriptions without explicit region-text supervision. However, such capabilities have yet to be widely applied to specialized domains like medical imaging, where diagnostic interpretation relies on subtle regional findings rather than global understanding. To mitigate this gap, we propose MedDAM, the first comprehensive framework leveraging large vision-language models for region-specific captioning in medical images. MedDAM employs medical expert-designed prompts tailored to specific imaging modalities and establishes a robust evaluation benchmark comprising a customized assessment protocol, data pre-processing pipeline, and specialized QA template library. This benchmark evaluates both MedDAM and other adaptable large vision-language…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling