MedFM-Robust: Benchmarking Robustness of Medical Foundation Models
Xiangxiang Cui, Tianjin Huang, Yifang Wang, Lijie Hu, Lu Yin

TL;DR
This paper introduces MedFM-Robust, a benchmark designed to evaluate the reliability of medical foundation models across vision-language and segmentation tasks in real-world clinical scenarios.
Contribution
It provides a comprehensive benchmark for assessing the robustness of MedFMs, covering both Med-VLMs and segmentation foundation models, under practical clinical conditions.
Findings
MedFMs show variable robustness across tasks and conditions.
The benchmark reveals strengths and weaknesses of current models in clinical settings.
The results can guide future improvements toward reliable medical AI deployment.
Abstract
Medical foundation models (MedFMs) have emerged as transformative tools in healthcare, demonstrating capabilities across diverse clinical applications. These models can be broadly categorized into two paradigms: Medical Vision-Language Models (Med-VLMs) and segmentation foundation models. Med-VLMs range from medical-specialized models such as LLaVA-Med and MedGemma, to general-purpose models like GPT-4o and Gemini, all capable of medical image understanding tasks including visual question answering (VQA), report generation, and visual grounding. Concurrently, the Segment Anything Model (SAM) has catalyzed a new generation of medical segmentation models, with adaptations like SAM-Med2D and MedSAM. The widespread clinical deployment of these models thus necessitates rigorous evaluation of their reliability under real-world conditions.
