Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA
Qianqi Yan, Xuehai He, Xiang Yue, Xin Eric Wang

TL;DR
This paper critically evaluates large multimodal models on medical visual question answering and shows that they often perform worse than random guessing on diagnostic questions, which exposes significant limits on their reliability for medical applications.
Contribution
The study introduces the ProbMed probing dataset and evaluation methods to rigorously assess LMM performance in medical diagnosis tasks, exposing their weaknesses.
Findings
State-of-the-art models perform worse than random on diagnostic questions.
Models struggle with fine-grained medical inquiries and general questions.
Transferability of expertise varies across modalities and organs.
Abstract
Large Multimodal Models (LMMs) have shown remarkable progress in medical Visual Question Answering (Med-VQA), achieving high accuracy on existing benchmarks. However, their reliability under robust evaluation is questionable. This study reveals that when subjected to simple probing evaluation, state-of-the-art models perform worse than random guessing on medical diagnosis questions. To address this critical evaluation problem, we introduce the Probing Evaluation for Medical Diagnosis (ProbMed) dataset to rigorously assess LMM performance in medical imaging through probing evaluation and procedural diagnosis. Particularly, probing evaluation features pairing original questions with negation questions with hallucinated attributes, while procedural diagnosis requires reasoning across various diagnostic dimensions for each image, including modality recognition, organ identification,…
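As a minimal sketch of how paired probing could be scored: the abstract as shown here does not spell out the metric, so the rule below (a pair counts as correct only when both the original question and its adversarial counterpart are answered correctly) is an assumption, and the `ProbePair` schema and function names are hypothetical. Under that assumption, guessing uniformly on two independent yes/no questions gives a 25% paired baseline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbePair:
    """One image with an original question and an adversarial counterpart.

    Hypothetical schema for illustration; not the ProbMed file format.
    """
    original_answer: str      # ground truth for the original question, e.g. "yes"
    adversarial_answer: str   # ground truth for the hallucinated-attribute question, e.g. "no"
    pred_original: str        # model prediction for the original question
    pred_adversarial: str     # model prediction for the adversarial question


def single_accuracy(pairs: List[ProbePair]) -> float:
    """Accuracy on the original questions only (no probing)."""
    if not pairs:
        return 0.0
    hits = sum(p.pred_original == p.original_answer for p in pairs)
    return hits / len(pairs)


def paired_accuracy(pairs: List[ProbePair]) -> float:
    """Assumed probing metric: both questions in a pair must be answered correctly."""
    if not pairs:
        return 0.0
    hits = sum(
        p.pred_original == p.original_answer
        and p.pred_adversarial == p.adversarial_answer
        for p in pairs
    )
    return hits / len(pairs)


# Uniform guessing on two independent binary questions: 0.5 * 0.5.
RANDOM_PAIRED_BASELINE = 0.5 * 0.5

if __name__ == "__main__":
    # A model that answers "yes" to everything looks perfect on the original
    # questions but fails every adversarial one.
    always_yes = [ProbePair("yes", "no", "yes", "yes") for _ in range(4)]
    print(single_accuracy(always_yes))   # accuracy without probing
    print(paired_accuracy(always_yes))   # accuracy with paired probing
    print(RANDOM_PAIRED_BASELINE)
```

The always-"yes" example illustrates how a model with a strong affirmative bias can score well on unpaired questions yet fall below the random paired baseline once hallucinated-attribute questions are added.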
Peer Reviews
Decision·Submitted to ICLR 2025
1. By introducing adversarial negative samples, the dataset tests model robustness, offering a more challenging evaluation standard. 2. ProbMed includes various diagnostic tasks across different imaging modalities and body organs, providing an evaluation setting for models. 3. The model’s performance improvement strategies are validated through methods like chain-of-thought reasoning and visual description enhancement, providing a foundation for future advancements.
1. The dataset has a significantly higher number of chest X-ray images than other types, which may lead to poorer model performance on other organs. Has the model's performance across different modalities been experimentally verified? 2. How does the study prevent hallucinations when using GPT-4 for caption analysis, abnormality identification, and positional descriptions through few-shot prompting? 3. ProbMed uses closed-ended questions and lacks open-ended tasks, limiting the dataset's compreh
It is a well-written paper with many experiments to evaluate the models on the proposed dataset. Several popular models have been studied and also authors propose mitigation strategies to improve the models' performance. Overall, the paper targets important questions and has the potential to be a good contribution to the field.
After reading the paper, I have a few questions and concerns as follows: 1. (major) If I understood it correctly, the dataset has been curated using publicly available data on the internet. My main concern is possible data contamination in larger closed or open-source models. When some models have been trained on the data and some have not, evaluation using this dataset loses its fairness. 2. If I have gotten it right, all the adversarial questions are in the form of negated questions a
- A new dataset, ProbMed, was curated for medical VQA benchmarking, which contains adversarial question pairs - Comprehensive experiments on multiple VLMs - Insightful findings: SOTA VLMs perform worse than random guessing on specialized diagnostic questions
- Some questions are too trivial in the dataset. The ultimate goal of multimodal models is deployment in real clinical practice when the accuracy is good enough. However, no clinician will ask questions on basic modalities (e.g., CT, MR) or organs because they are too trivial. Arguments such as “CheXagent, trained exclusively on chest X-rays, achieved the highest accuracy in determining abnormalities and conditions. However, its performance in general tasks like identifying image modality and or
Taxonomy
Topics: Acute Ischemic Stroke Management
