SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks
Kim-Celine Kahl, Selen Erkan, Jeremias Traub, Carsten T. Lüth, Klaus Maier-Hein, Lena Maier-Hein, Paul F. Jaeger

TL;DR
This paper introduces SURE-VQA, a comprehensive framework for evaluating the robustness of medical VQA models using real-world shifts, semantic-aware metrics, and meaningful baselines, revealing insights into model stability and performance.
Contribution
The paper presents SURE-VQA, a novel evaluation framework that addresses current limitations by focusing on real-world shifts, semantic evaluation with LLMs, and baseline reporting for medical VQA robustness analysis.
Findings
No fine-tuning method consistently outperforms others in robustness.
Robustness trends are more stable across fine-tuning methods than across distribution shifts.
Simple baselines without image data can perform surprisingly well.
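To make the last finding concrete, here is a minimal sketch of what a baseline without image data could look like: it answers each question with the most frequent training answer for that question text and never looks at the image. This is an illustrative assumption, not necessarily the baseline used in the paper; the function names (`fit_question_only_baseline`, `predict`) and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def fit_question_only_baseline(train_pairs):
    """Learn the most frequent answer per question text, ignoring images.

    train_pairs: iterable of (question, answer) strings.
    Returns (table, fallback), where fallback is the overall most
    frequent answer, used for questions not seen during training.
    """
    per_question = defaultdict(Counter)
    overall = Counter()
    for question, answer in train_pairs:
        q = question.strip().lower()
        a = answer.strip().lower()
        per_question[q][a] += 1
        overall[a] += 1
    fallback = overall.most_common(1)[0][0]
    table = {q: counts.most_common(1)[0][0] for q, counts in per_question.items()}
    return table, fallback


def predict(table, fallback, question):
    """Answer from the question text alone."""
    return table.get(question.strip().lower(), fallback)


# Hypothetical toy data, for illustration only.
train = [
    ("Is there a fracture?", "no"),
    ("Is there a fracture?", "no"),
    ("Is there a fracture?", "yes"),
    ("Which organ is shown?", "lung"),
]
table, fallback = fit_question_only_baseline(train)
seen_answer = predict(table, fallback, "Is there a fracture?")
unseen_answer = predict(table, fallback, "Is the heart enlarged?")
```

If a model that does see the image barely beats such a baseline, its reported accuracy says little about visual understanding, which is why reporting a baseline of this kind alongside robustness numbers is informative.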
Abstract
Vision-Language Models (VLMs) have great potential in medical tasks, like Visual Question Answering (VQA), where they could act as interactive assistants for both patients and clinicians. Yet their robustness to distribution shifts on unseen data remains a key concern for safe deployment. Evaluating such robustness requires a controlled experimental setup that allows for systematic insights into the model's behavior. However, we demonstrate that current setups fail to offer sufficiently thorough evaluations. To address this gap, we introduce a novel framework, called SURE-VQA, centered around three key requirements to overcome current pitfalls and systematically analyze VLM robustness: 1) Since robustness on synthetic shifts does not necessarily translate to real-world shifts, it should be measured on real-world shifts that are inherent to the VQA data; 2) Traditional token-matching…
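The abstract is cut off at its point about token-matching metrics, but the contrast with semantic evaluation can be illustrated with a small sketch. The code below implements two generic token-matching scores, exact match and token-level F1, and applies them to hypothetical answers. These are standard textbook definitions chosen for illustration, not necessarily the metrics used in the paper.

```python
from collections import Counter


def exact_match(prediction, reference):
    """True only if the normalized strings are identical."""
    return prediction.strip().lower() == reference.strip().lower()


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical answers, for illustration only.
# Same meaning, different wording: the verbose answer agrees with "no".
same_meaning = token_f1("there is no fracture visible", "no")
# Opposite meaning, similar wording: left vs. right.
opposite_meaning = token_f1("the right lung", "the left lung")
```

In this toy case the clinically wrong answer ("the right lung" for "the left lung") scores higher than the correct but verbose one, and exact match rejects both. Cases like this are the motivation for a semantic-aware judge.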
Peer Reviews
Decision·Submitted to ICLR 2025
1. The paper is well-structured and easy to follow, with clear figures and visualizations that enhance comprehension. 2. The three requirements proposed address key concerns in the field of OOD robustness. 3. The codebase is open-sourced, which may help future researchers evaluate other VLMs.
1. Only LLaVA-Med is evaluated in the experiments, which restricts the generalizability of the findings. Testing additional models, especially state-of-the-art general-purpose VLMs, would provide a more comprehensive view of robustness across different architectures. 2. The experimental section focuses heavily on PEFT methods, which may divert focus from the broader aim of evaluating overall robustness. This narrow focus could limit insights into robustness aspects that might arise from other f
1. The paper proposes a new evaluation framework (SURE-VQA) that addresses issues present in current evaluation methods for visual question answering (VQA) tasks, particularly in dealing with real-world distribution shifts. This framework may incorporate new evaluation metrics or methods, enhancing the evaluation standards for VQA models. 2. It utilizes abundant and representative datasets and proposes a comprehensive taxonomy that simulates four types of real-world data distribution shifts. Th
1. The article briefly mentions the scope of robustness discussed in the Introduction and R1 sections, but it lacks a more explicit and comprehensive definition of the robustness of VLMs in Med VQA tasks. It may be necessary to compare further with existing work on evaluating VLM generalization ability to illustrate the novel contributions. 2. Based on the descriptions in the article, the focus primarily lies on the interference resistance of VLMs against data distribution shifts. The aut
SURE-VQA’s focus on clinically relevant shifts (e.g., gender and ethnicity) is valuable for medical VQA applications, making it more practical for real-world use. The code is open-sourced.
The authors present different types of shifts in the dataset in Figure 2, and this work evaluates the robustness of VLMs in medical VQA tasks. How do you account for potential shifts between the pre-training distribution of the VLMs and the dataset distribution? That is, distribution shifts may indeed exist in the dataset while the dataset distribution is entirely covered by the pre-training data of the VLMs, in which case evaluating the robustness of the VLMs would not be meaningful. I think a defini
Taxonomy
Topics: Quality and Safety in Healthcare
