MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models
Mohammad Shahab Sepehri, Zalan Fabian, Maryam Soltanolkotabi, Mahdi Soltanolkotabi

TL;DR
This paper introduces MediConfusion, a challenging benchmark dataset for medical multimodal models, revealing their vulnerabilities and exposing significant reliability issues in current state-of-the-art models used in healthcare.
Contribution
The paper presents MediConfusion, a novel VQA benchmark that systematically uncovers failure modes of medical MLLMs, highlighting their unreliability and guiding future improvements.
Findings
Models perform below random chance on MediConfusion
State-of-the-art models are easily confused by visually dissimilar images
Common failure patterns identified to improve model trustworthiness
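The "below random chance" finding is easier to read with a concrete scoring rule in mind. The sketch below is illustrative only. It assumes that each question is asked about a pair of images, that every question has two answer options, that the correct option differs between the two images, and that a pair counts as correct only when both images are answered correctly. None of these details is stated in this summary, and the function names are hypothetical.

```python
from itertools import product


def pair_accuracy(predictions, answers):
    """Fraction of image pairs where BOTH images are answered correctly.

    Assumption (not from the summary): a pair only scores when both
    of its answers are right.
    """
    correct = sum(
        1
        for (p1, p2), (a1, a2) in zip(predictions, answers)
        if p1 == a1 and p2 == a2
    )
    return correct / len(answers)


def individual_accuracy(predictions, answers):
    """Fraction of single image-question items answered correctly."""
    hits = sum(
        (p1 == a1) + (p2 == a2)
        for (p1, p2), (a1, a2) in zip(predictions, answers)
    )
    return hits / (2 * len(answers))


def chance_pair_accuracy(num_options=2):
    """Pair accuracy of uniform random guessing, by enumeration.

    Assumes the two images in a pair have different correct options,
    here arbitrarily labelled 0 and 1.
    """
    guesses = list(product(range(num_options), repeat=2))
    target = (0, 1)
    return sum(g == target for g in guesses) / len(guesses)


# Hypothetical toy data: two pairs, the second answered "A" for both images.
preds = [("A", "B"), ("A", "A")]
truth = [("A", "B"), ("A", "B")]
print(pair_accuracy(preds, truth))        # 1 of 2 pairs fully correct
print(individual_accuracy(preds, truth))  # 3 of 4 single items correct
print(chance_pair_accuracy())             # 1 of 4 guess combinations matches
```

Under these assumptions, a model that gives the same answer for both images in a pair can never score on that pair, which is one way a model could fall below the random-guessing baseline on pair-level accuracy while still looking reasonable on single-item accuracy.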
Abstract
Multimodal Large Language Models (MLLMs) have tremendous potential to improve the accuracy, availability, and cost-effectiveness of healthcare by providing automated solutions or serving as aids to medical professionals. Despite promising first steps in developing medical MLLMs in the past few years, their capabilities and limitations are not well understood. Recently, many benchmark datasets have been proposed that test the general medical knowledge of such models across a variety of medical areas. However, the systematic failure modes and vulnerabilities of such models are severely underexplored, with most medical benchmarks failing to expose the shortcomings of existing models in this safety-critical domain. In this paper, we introduce MediConfusion, a challenging medical Visual Question Answering (VQA) benchmark dataset that probes the failure modes of medical MLLMs from a vision…
Peer Reviews
Decision: ICLR 2025 Poster
Originality: The paper addresses a critical gap in the assessment of medical MLLMs by introducing MediConfusion, a novel benchmark explicitly designed to test the reliability of these models in healthcare. While prior benchmarks focus on overall performance in typical medical scenarios, this work innovatively emphasizes systematic failure modes by identifying “confusing image pairs” that challenge models with subtle, clinically important distinctions. The benchmark not only probes weaknesses but
Limited Dataset Size: While the MediConfusion benchmark is rigorous, the dataset (352 questions across 9 categories) may be too small to capture the full spectrum of medical image complexity. Expanding it with a broader range of confusing image pairs, potentially covering additional anatomic regions and diagnostic subtleties, could improve its generalizability and make it more comprehensive for model evaluation across diverse medical contexts.
1. The paper addresses a well-motivated and underexplored problem in the medical domain.
2. The benchmark creation process is robust, with manual evaluation by a radiologist enhancing its credibility.
3. The experiments are comprehensive, demonstrating that MediConfusion presents substantial challenges to current VLMs, with promising potential for inspiring future research.
While this work provides valuable insights, it appears to closely follow the Multimodal Visual Patterns (MMVP) benchmark framework, potentially limiting its novelty. Many findings align with known challenges in general VLMs.
- The idea of using the same question with similar images could offer an interesting way to probe model weaknesses.
- With the help of an expert-in-the-loop pipeline, the paper thoroughly examines why models struggle to differentiate between visually similar medical images.
- The evaluation metrics are comprehensive, particularly in accommodating models unable to directly answer multiple-choice questions.
- The motivation is not clear. The authors claimed that "existing benchmark datasets are focused on evaluating the medical knowledge of MLLMs across large evaluation sets, heavily biased towards common or typical scenarios." What are the common and typical scenarios? Could you provide a more detailed discussion about this limitation of existing benchmarks and how the proposed benchmark in this paper can overcome the limitations?
- The results show that all available models (open-source or propri
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Radiology practices and education
