TL;DR
This paper introduces MedThinkVQA, a benchmark for medical reasoning with multiple images, revealing current models' limitations in multi-view evidence integration and emphasizing the need for reliable grounding mechanisms.
Contribution
The paper contributes MedThinkVQA, a dense multi-image medical benchmark with expert annotations, together with an analysis of model performance that highlights the core challenges in multi-view reasoning.
Findings
Models perform poorly on multi-image medical reasoning tasks: on the test set, the best closed-source models reach only 54.9% to 57.2% accuracy.
Grounded multi-image reasoning is the main bottleneck for current models.
Providing expert cues improves model performance, while self-generated intermediates can reduce accuracy.
Abstract
Large language models perform well on many medical QA benchmarks, but real clinical reasoning often requires integrating evidence across multiple images rather than interpreting a single view. We introduce MedThinkVQA, an expert-annotated benchmark for thinking with multiple images, where models must interpret each image, combine cross-view evidence, and answer diagnostic questions with intermediate supervision and step-level evaluation. The dataset contains 8,067 cases, including 720 test cases, with an average of 6.62 images per case, substantially denser than prior work, whose expert-level benchmarks use at most 1.43 images per case. On the test set, the best closed-source models, Claude-4.6-Opus, Gemini-3-Pro, and GPT-5.2-xhigh, reach only 57.2%, 55.3%, and 54.9% accuracy, while GPT-5-mini and GPT-5-nano reach 39.7% and 30.8%. Strong open-source models lag behind, led by…