Can Generalist Vision Language Models (VLMs) Rival Specialist Medical VLMs? Benchmarking and Strategic Insights
Yuan Zhong, Ruinan Jin, Qi Dou, Xiaoxiao Li

TL;DR
This paper compares generalist and specialist medical Vision Language Models, showing that efficiently fine-tuned generalist models can match or outperform specialists on many clinical tasks, especially when transferring to unseen or rare modalities.
Contribution
The paper provides a benchmarking analysis showing that generalist VLMs perform competitively against specialist VLMs on medical imaging tasks.
Findings
Generalist VLMs can achieve comparable or better performance than specialist models.
Fine-tuned generalist models excel in unseen or rare medical modalities.
Specialist models remain valuable for modality-specific tasks.
Abstract
Vision Language Models (VLMs) have shown promise in automating image diagnosis and interpretation in clinical settings. However, developing specialist medical VLMs requires substantial computational resources and carefully curated datasets, and it remains unclear under which conditions generalist and specialist medical VLMs each perform best. This study highlights the complementary strengths of specialist medical and generalist VLMs. Specialists remain valuable in modality-aligned use cases, but we find that efficiently fine-tuned generalist VLMs can achieve comparable or even superior performance in most tasks, particularly when transferring to unseen or rare out-of-distribution (OOD) medical modalities. These results suggest that generalist VLMs, rather than being constrained by their lack of specialist medical pretraining, may offer a scalable and cost-effective pathway for advancing clinical AI…
