Medical thinking with multiple images

Zonghai Yao; Benlu Wang; Yifan Zhang; Junda Wang; Iris Xia; Zhipeng Tang; Shuo Han; Feiyun Ouyang; Zhichao Yang; Arman Cohan; Hong Yu

arXiv:2604.16506 · cs.CV · May 5, 2026

TL;DR

This paper introduces MedThinkVQA, a benchmark for medical reasoning with multiple images, revealing current models' limitations in multi-view evidence integration and emphasizing the need for reliable grounding mechanisms.

Contribution

MedThinkVQA, a dense multi-image medical benchmark with expert annotations, together with an analysis of model performance that highlights the core challenges in multi-view reasoning.

Findings

01. Models perform poorly on multi-image medical reasoning tasks: the best closed-source model reaches only 57.2% accuracy on the test set.

02. Grounded multi-image reasoning is the main bottleneck for current models.

03. Providing expert cues improves model performance, while self-generated intermediates can reduce accuracy.

Abstract

Large language models perform well on many medical QA benchmarks, but real clinical reasoning often requires integrating evidence across multiple images rather than interpreting a single view. We introduce MedThinkVQA, an expert-annotated benchmark for thinking with multiple images, where models must interpret each image, combine cross-view evidence, and answer diagnostic questions with intermediate supervision and step-level evaluation. The dataset contains 8,067 cases, including 720 test cases, with an average of 6.62 images per case, substantially denser than prior work, whose expert-level benchmarks use at most 1.43 images per case. On the test set, the best closed-source models, Claude-4.6-Opus, Gemini-3-Pro, and GPT-5.2-xhigh, reach only 57.2%, 55.3%, and 54.9% accuracy, while GPT-5-mini and GPT-5-nano reach 39.7% and 30.8%. Strong open-source models lag behind, led by…
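The dataset figures quoted in the abstract can be put side by side with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical, and the image totals are rough estimates that assume the reported 6.62 average applies both to the full dataset and to the test split, which the abstract does not state.

```python
# Figures quoted in the abstract; all names here are illustrative, not from the paper.
TOTAL_CASES = 8_067
TEST_CASES = 720
AVG_IMAGES_PER_CASE = 6.62        # reported average for MedThinkVQA
PRIOR_MAX_IMAGES_PER_CASE = 1.43  # reported maximum among prior expert-level benchmarks


def density_ratio(avg: float, prior: float) -> float:
    """How many times denser (images per case) one benchmark is than another."""
    return avg / prior


def approx_image_count(cases: int, avg_images: float) -> int:
    """Rough image total implied by a case count and a rounded per-case average."""
    return round(cases * avg_images)


# Cases outside the test split; the abstract does not say how these are divided.
non_test_cases = TOTAL_CASES - TEST_CASES

ratio = density_ratio(AVG_IMAGES_PER_CASE, PRIOR_MAX_IMAGES_PER_CASE)

# ASSUMPTION: the 6.62 average holds for both the full dataset and the test split.
approx_total_images = approx_image_count(TOTAL_CASES, AVG_IMAGES_PER_CASE)
approx_test_images = approx_image_count(TEST_CASES, AVG_IMAGES_PER_CASE)

print(f"cases outside test split: {non_test_cases}")
print(f"density ratio vs. prior work: {ratio:.2f}x")
print(f"approx. images overall: {approx_total_images}")
print(f"approx. images in test split: {approx_test_images}")
```

Under these assumptions the benchmark is roughly 4.6 times denser than the prior maximum and implies on the order of 53,000 images in total. Because the per-case average is rounded, the totals are estimates rather than exact counts.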

Videos

Medical thinking with multiple images · SlidesLive