Knowing or Guessing? Robust Medical Visual Question Answering via Joint Consistency and Contrastive Learning

Songtao Jiang; Yuxi Chen; Sibo Song; Yan Zhang; Yeying Jin; Yang Feng; Jian Wu; Zuozhu Liu

arXiv:2508.18687 · cs.CL · August 27, 2025

TL;DR

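This paper shows that medical visual question answering models give unstable answers to semantically equivalent rephrasings of the same question, attributes this to insufficient alignment of medical concepts and to biases in the training data, and proposes a joint consistency and contrastive learning approach to improve reliability and answer stability.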

Contribution

It introduces RoMed, a new dataset for evaluating robustness, and proposes CCL, a novel training method combining consistency and contrastive learning to enhance model robustness.

Findings

01

State-of-the-art (SOTA) models show significant performance drops on the RoMed dataset.

02

CCL improves answer consistency by 50% on RoMed.

03

The proposed approach achieves state-of-the-art results on multiple VQA benchmarks.

Abstract

In high-stakes medical applications, consistent answering across diverse question phrasings is essential for reliable diagnosis. However, we reveal that current Medical Vision-Language Models (Med-VLMs) exhibit concerning fragility in Medical Visual Question Answering, as their answers fluctuate significantly when faced with semantically equivalent rephrasings of medical questions. We attribute this to two limitations: (1) insufficient alignment of medical concepts, leading to divergent reasoning patterns, and (2) hidden biases in training data that prioritize syntactic shortcuts over semantic understanding. To address these challenges, we construct RoMed, a dataset built upon original VQA datasets containing 144k questions with variations spanning word-level, sentence-level, and semantic-level perturbations. When evaluating state-of-the-art (SOTA) models like LLaVA-Med on RoMed, we…
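The fragility described in the abstract, where answers change under semantically equivalent rephrasings, can be made concrete with a simple consistency score. The sketch below is illustrative only: this page does not give the paper's exact metric, so the function names (`normalize`, `group_consistency`, `strict_consistency_rate`), the normalization rule, and the toy answers are all assumptions. It takes groups of model answers, where each group holds the answers to several rephrasings of one question about one image, and reports how often the model agrees with itself.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Hypothetical normalization: lowercase, trim, drop a trailing period,
    and collapse internal whitespace."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def group_consistency(answers: list[str]) -> float:
    """Fraction of answers in one group that match the group's most common answer."""
    if not answers:
        raise ValueError("answers must not be empty")
    counts = Counter(normalize(a) for a in answers)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(answers)


def strict_consistency_rate(groups: list[list[str]]) -> float:
    """Share of groups in which every rephrasing received the same answer."""
    if not groups:
        raise ValueError("groups must not be empty")
    consistent = sum(1 for g in groups if len({normalize(a) for a in g}) == 1)
    return consistent / len(groups)


# Toy data (invented for illustration): answers to three rephrasings
# of each of two questions.
groups = [
    ["Yes", "yes.", " YES "],                    # stable under rephrasing
    ["left lung", "right lung", "left lung"],    # answer flips once
]
strict_rate = strict_consistency_rate(groups)
per_group = [group_consistency(g) for g in groups]
```

The strict rate counts a group only when every rephrasing gets the same answer, which makes it a harsh measure; the per-group score is softer and shows how far an unstable group is from agreement. Neither checks whether the agreed answer is correct, so a score like this would be read alongside accuracy.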
