MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models

Mohammad Shahab Sepehri; Zalan Fabian; Maryam Soltanolkotabi; Mahdi Soltanolkotabi

arXiv:2409.15477·cs.CV·May 22, 2025·2 cites

TL;DR

This paper introduces MediConfusion, a challenging benchmark dataset for medical multimodal models that exposes significant reliability issues in current state-of-the-art models intended for healthcare use.

Contribution

The paper presents MediConfusion, a novel VQA benchmark that systematically uncovers failure modes of medical MLLMs, highlighting their unreliability and guiding future improvements.

Findings

01

All evaluated models, open-source and proprietary, perform below random guessing on MediConfusion

02

State-of-the-art models are easily confused by image pairs that are visually dissimilar

03

Common failure patterns identified to improve model trustworthiness
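
A score below random guessing is easier to interpret with a small worked example of pair-level scoring. The sketch below assumes that the same two-option question is asked of both images in a confusing pair, that the correct option differs between the two images, and that a pair counts as correct only when both answers are right. These assumptions, the data layout, and all function names are illustrative and are not taken from the paper or its repository.

```python
from itertools import product

# Each pair is ((pred_1, gold_1), (pred_2, gold_2)): one entry per image
# in a confusing pair. This layout is a hypothetical choice for the sketch.

def individual_accuracy(pairs):
    """Fraction of single-image answers that match their gold label."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    hits = sum(pred == gold for pair in pairs for pred, gold in pair)
    return hits / (2 * len(pairs))

def pair_accuracy(pairs):
    """Fraction of pairs in which BOTH images are answered correctly."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    hits = sum(all(pred == gold for pred, gold in pair) for pair in pairs)
    return hits / len(pairs)

def random_guess_baseline(options=("A", "B")):
    """Expected (pair, individual) accuracy of uniform random guessing.

    Enumerates every equally likely pair of guesses for one pair whose
    gold answers are assumed to be the first and second option.
    """
    gold = (options[0], options[1])
    outcomes = list(product(options, repeat=2))
    pair = sum(o == gold for o in outcomes) / len(outcomes)
    indiv = sum((o[0] == gold[0]) + (o[1] == gold[1]) for o in outcomes)
    return pair, indiv / (2 * len(outcomes))

# A hypothetical model that gives the same answer for both images.
always_a = [(("A", "A"), ("A", "B"))] * 4
print(random_guess_baseline())        # chance level under the assumptions
print(individual_accuracy(always_a))  # per-image score
print(pair_accuracy(always_a))        # pair-level score
```

Under these assumptions, a model that cannot tell the two images apart and answers both identically gets half of the individual questions right but no pairs right, which places it below the pair-level chance baseline.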

Abstract

Multimodal Large Language Models (MLLMs) have tremendous potential to improve the accuracy, availability, and cost-effectiveness of healthcare by providing automated solutions or serving as aids to medical professionals. Despite promising first steps in developing medical MLLMs in the past few years, their capabilities and limitations are not well-understood. Recently, many benchmark datasets have been proposed that test the general medical knowledge of such models across a variety of medical areas. However, the systematic failure modes and vulnerabilities of such models are severely underexplored with most medical benchmarks failing to expose the shortcomings of existing models in this safety-critical domain. In this paper, we introduce MediConfusion, a challenging medical Visual Question Answering (VQA) benchmark dataset, that probes the failure modes of medical MLLMs from a vision…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

Originality: The paper addresses a critical gap in the assessment of medical MLLMs by introducing MediConfusion, a novel benchmark explicitly designed to test the reliability of these models in healthcare. While prior benchmarks focus on overall performance in typical medical scenarios, this work innovatively emphasizes systematic failure modes by identifying “confusing image pairs” that challenge models with subtle, clinically important distinctions. The benchmark not only probes weaknesses but

Weaknesses

Limited Dataset Size: While the MediConfusion benchmark is rigorous, the dataset (352 questions across 9 categories) may be too small to capture the full spectrum of medical image complexity. Expanding the dataset with a broader range of confusing image pairs, potentially covering additional anatomic regions and diagnostic subtleties, could improve its generalizability and make it more comprehensive for model evaluation across diverse medical contexts.

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1. The paper addresses a well-motivated and underexplored problem in the medical domain.
2. The benchmark creation process is robust, with manual evaluation by a radiologist enhancing its credibility.
3. The experiments are comprehensive, demonstrating that MediConfusion presents substantial challenges to current VLMs, with promising potential for inspiring future research.

Weaknesses

While this work provides valuable insights, it appears to closely follow the Multimodal Visual Patterns (MMVP) benchmark framework, potentially limiting its novelty. Many findings align with known challenges in general VLMs.

Reviewer 03 · Rating 5 · Confidence 4

Strengths

- The idea of using the same question with similar images could offer an interesting way to probe model weaknesses.
- With the help of an expert-in-the-loop pipeline, the paper thoroughly examines why models struggle to differentiate between visually similar medical images.
- The evaluation metrics are comprehensive, particularly accommodating models unable to directly answer multiple-choice questions.

Weaknesses

- The motivation is not clear. The authors claimed that "existing benchmark datasets are focused on evaluating the medical knowledge of MLLMs across large evaluation sets, heavily biased towards common or typical scenarios." What are the common and typical scenarios? Could you provide a more detailed discussion about this limitation of existing benchmarks and how the proposed benchmark in this paper can overcome the limitations?
- The results show that all available models (open-source or propri

Code & Models

Datasets

shahab7899/MediConfusion
dataset· 115 dl

Videos

MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models· slideslive

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Radiology practices and education