MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning
Suhao Yu, Haojin Wang, Juncheng Wu, Luyang Luo, Jingshen Wang, Cihang Xie, Pranav Rajpurkar, Carl Yang, Yang Yang, Kang Wang, Yannan Yu, Yuyin Zhou

TL;DR
MedFrameQA is a multi-image medical VQA benchmark that challenges models to perform complex clinical reasoning across multiple images, highlighting current deficiencies in the multi-image understanding and reasoning of AI models.
Contribution
This paper presents MedFrameQA, the first benchmark for multi-image medical VQA using educational sequences, with a scalable pipeline for dataset creation and comprehensive evaluation of advanced models.
Findings
Most of the 11 evaluated MLLMs score below 50% accuracy on multi-image reasoning tasks, and their accuracy is unstable across varying image counts.
Current models often treat images as isolated instances, failing to track disease progression.
Significant room for improvement remains in multi-image clinical reasoning.
Abstract
Real-world clinical practice demands multi-image comparative reasoning, yet current medical benchmarks remain limited to single-frame interpretation. We present MedFrameQA, the first benchmark explicitly designed to test multi-image medical VQA through educationally-validated diagnostic sequences. To construct this dataset, we develop a scalable pipeline that leverages narrative transcripts from medical education videos to align visual frames with textual concepts, automatically producing 2,851 high-quality multi-image VQA pairs with explicit, transcript-grounded reasoning chains. Our evaluation of 11 advanced MLLMs (including reasoning models) exposes severe deficiencies in multi-image synthesis, where accuracies mostly fall below 50% and exhibit instability across varying image counts. Error analysis demonstrates that models often treat images as isolated instances, failing to track…
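The instability across image counts mentioned above can be checked by stratifying accuracy by the number of images attached to each question. The following is a minimal sketch, not the authors' evaluation code: the record keys `num_images`, `predicted`, and `answer` are hypothetical names chosen for illustration, and the records are toy data, not benchmark results.

```python
from collections import defaultdict


def accuracy_by_image_count(records):
    """Return {image_count: accuracy} for multiple-choice predictions.

    Each record is a dict with hypothetical keys:
      'num_images' - number of images shown with the question
      'predicted'  - the option letter chosen by the model
      'answer'     - the ground-truth option letter
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        n = record["num_images"]
        totals[n] += 1
        if record["predicted"] == record["answer"]:
            correct[n] += 1
    # Sort keys so the result reads in order of increasing image count.
    return {n: correct[n] / totals[n] for n in sorted(totals)}


def accuracy_spread(per_count):
    """Gap between the best and worst per-image-count accuracy."""
    values = list(per_count.values())
    return max(values) - min(values)


# Toy records for illustration only (not MedFrameQA data).
toy_records = [
    {"num_images": 2, "predicted": "A", "answer": "A"},
    {"num_images": 2, "predicted": "B", "answer": "C"},
    {"num_images": 3, "predicted": "D", "answer": "D"},
    {"num_images": 4, "predicted": "A", "answer": "B"},
]

per_count = accuracy_by_image_count(toy_records)
print(per_count)
print(accuracy_spread(per_count))
```

A large spread between image-count buckets would indicate the kind of instability the abstract describes, although small buckets make the per-bucket accuracies noisy.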
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The task is clearly defined and close to clinical practice. It explicitly requires cross-frame evidence aggregation (multi-view, multi-timepoint, cross-modality) within a single clinical case, rather than treating multiple images as loose, unrelated inputs. The comparison with single-image/non-video datasets like VQA-RAD, SLAKE, PMC-VQA, and OmniMedVQA is clear. Results demonstrate that even SOTA models perform poorly overall and vary widely across subsets on this benchmark. The data pipeline is al
Many steps rely on models like GPT-4o. Even with human checks, this creates a risk of distribution shift. Adding multi-source expert review or cross-model verification could help. The paper mentions human evaluation (including filtering items “devoid of significant visual medical content”), but it isn’t clear whether there was systematic review by radiology/clinical experts or agreement metrics. The final version should report who the annotators were, their process, and inter-rater agreement. Cu
- The benchmark is diverse and focuses on multi-image clinical reasoning, which is currently underexplored in medical AI evaluations.
- The proposed pipeline automates the extraction of captions and the generation of multiple-choice questions with clinical reasoning. The authors also evaluate several open- and closed-source models, showing that they underperform in this complex setting.
- This benchmark is built by curating relevant frames from YouTube videos with medical explanations. The images provided to the MLLMs are often slides containing screenshots of medical images with borders and annotations. This is not very indicative of real-world scenarios, where high-resolution medical images are ingested by the MLLMs. For this reason, optimizing models for this benchmark may not translate well to practical clinical use cases.
- The multiple-choice questions, answers and reasonin
- The paper is clearly motivated: in medical practice, experts often look at multiple images of various imaging modalities and draw conclusions based on information integrated across multiple views, a key capability so far overlooked in single-image benchmarks.
- The paper calls attention to the shortcomings of current models in leveraging multiple images in medical decision making, highlighting important directions for future work.
- The dataset synthesis pipeline desperately needs expert evaluation. Currently, GPT-4o is used to judge the quality and utility of extracted frames, and to correct and combine captions across multiple views. While this is an acceptable strategy to generate candidate VQAs, medical professionals are necessary to review such candidates, and correct mistakes. Otherwise, we are relying on a dataset generated by models with the same shortcomings that we are attempting to evaluate. The only step where
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis
MethodsFocus
