MedVision: Dataset and Benchmark for Quantitative Medical Image Analysis
Yongcheng Yao, Yongshuo Zong, Raman Dutt, Yongxin Yang, Sotirios A Tsaftaris, Timothy Hospedales

TL;DR
MedVision introduces a large-scale dataset and benchmark to evaluate and improve vision-language models' ability to perform quantitative analysis in medical imaging, addressing a critical gap in current models.
Contribution
The paper presents MedVision, a comprehensive dataset and benchmark for quantitative medical image analysis, enabling development of models with robust quantitative reasoning.
Findings
Current VLMs perform poorly on quantitative tasks.
Supervised fine-tuning improves VLMs' accuracy and precision.
MedVision facilitates progress in quantitative medical imaging analysis.
Abstract
Current vision-language models (VLMs) in medicine are primarily designed for categorical question answering (e.g., "Is this normal or abnormal?") or qualitative descriptive tasks. However, clinical decision-making often relies on quantitative assessments, such as measuring the size of a tumor or the angle of a joint, from which physicians draw their own diagnostic conclusions. This quantitative reasoning capability remains underexplored and poorly supported in existing VLMs. In this work, we introduce MedVision, a large-scale dataset and benchmark specifically designed to evaluate and improve VLMs on quantitative medical image analysis. MedVision spans 22 public datasets covering diverse anatomies and modalities, with 30.8 million image-annotation pairs. We focus on three representative quantitative tasks: (1) detection of anatomical structures and abnormalities, (2) tumor/lesion (T/L)…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
MEDVISION integrates a diverse collection of publicly available datasets, which could serve as a valuable resource for the medical vision-language community. The authors conducted extensive experiments evaluating a wide range of VLMs, providing useful baselines for future research.
While MEDVISION aggregates a large number of datasets, the curated annotations do not represent a significant improvement over the original data — in some cases, they are even of lower quality. For instance, the annotation process excluded 2D slices containing multiple bounding boxes, which may limit the dataset's applicability and realism. The authors justify this by stating that the study focuses on single-instance detection; however, this raises the question of clinical relevance — what is th
The paper is easy to follow, with well-structured descriptions of datasets and experiments. The benchmark design is systematic, providing a valuable foundation for evaluating quantitative reasoning in medical VLMs.
However, the following concerns remain. The contribution is limited, as the work mainly consolidates existing datasets and automatically extracts bounding boxes and bidirectional tumor/lesion sizes. The dataset construction process is more an engineering integration than a methodological or conceptual advancement. The question design in Section 3 centers on dataset annotation rather than clinical reasoning. It does not clearly connect to diagnostic decision-making or show whe
1) The paper addresses an important area (i.e., medical image analysis) that could facilitate the development of large medical VLMs and drive better clinical decision-making. Although the data is not new, the authors did manually re-label many different image pairs. 2) The quantitative evaluation of size and angle is new and important. The results also demonstrate that current models struggle on these tasks. 3) The paper flows well and is easy to read and understand.
1) Failure analysis needs improvement. The authors should isolate the effect of geometric reasoning from visual perception. The authors discuss failure modes like small-object detection and angle/value collapse, but it is still unclear if the model simply fails at perception or fails at reasoning (unit handling, geometry). One experiment you can try is to (on a small set, of course) feed the GT bounding box to the model to re-measure size / angle / distance; this would give a clear picture of exactly w
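To make the perception/reasoning split in this review concrete, the sketch below shows the purely geometric step that remains once landmark locations are supplied: converting pixel coordinates to physical units and computing a distance or an angle. It is an illustration only, not the paper's evaluation code. The function names, the (row, col) point convention, and the `spacing` argument (millimetres per pixel along rows and columns) are assumptions introduced here.

```python
import math


def physical_distance(p, q, spacing):
    """Distance in mm between two (row, col) pixel points.

    `spacing` is an assumed (row_mm, col_mm) pixel size; it is not a
    parameter name taken from the paper.
    """
    dr = (p[0] - q[0]) * spacing[0]
    dc = (p[1] - q[1]) * spacing[1]
    return math.hypot(dr, dc)


def angle_at_vertex(a, vertex, b, spacing):
    """Angle in degrees at `vertex` between the rays towards `a` and `b`."""
    # Scale both rays to physical units first, so anisotropic pixels
    # do not distort the angle.
    v1 = ((a[0] - vertex[0]) * spacing[0], (a[1] - vertex[1]) * spacing[1])
    v2 = ((b[0] - vertex[0]) * spacing[0], (b[1] - vertex[1]) * spacing[1])
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        raise ValueError("a and b must differ from the vertex")
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))
```

If a model given ground-truth landmarks still reports values far from what these functions return, the failure would lie in unit handling or geometry rather than in perception.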
1. The research motivation is well aligned with practical clinical needs, and the designed tasks are highly valuable and relevant to real-world medical applications. 2. The collection of data containing physical measurements to construct the dataset is a reasonable and meaningful approach.
1. It is unclear whether quantitative assessment is an intrinsic capability that VLMs should possess. Would it be more practical to perform these tasks using a segmentation model combined with rule-based computation, especially since the current dataset is constructed from segmentation data? Perhaps VLMs would be better suited to handling such tasks in an agent-based paradigm. 2. Open-ended VQA may not be the most appropriate evaluation setting, as quantitative reasoning requires structured info
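The "segmentation model combined with rule-based computation" alternative raised in this review can be sketched in a few lines. The example assumes a 2D binary mask (a list of rows of 0/1 values) and a known pixel spacing in millimetres. It reports the axis-aligned bounding box and its physical extent, which is a simplification and not the bidirectional tumor/lesion measurement the paper extracts. `bbox_and_size_mm` is a hypothetical helper name.

```python
def bbox_and_size_mm(mask, spacing):
    """Return ((r0, c0, r1, c1), (height_mm, width_mm)) for a binary mask.

    `mask` is a list of equal-length rows of 0/1 values and `spacing` is an
    assumed (row_mm, col_mm) pixel size. Returns None for an empty mask.
    """
    # Rows that contain at least one foreground pixel.
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    # Columns that contain at least one foreground pixel.
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    r0, r1 = rows[0], rows[-1]
    c0, c1 = cols[0], cols[-1]
    # Inclusive pixel counts converted to physical units.
    height_mm = (r1 - r0 + 1) * spacing[0]
    width_mm = (c1 - c0 + 1) * spacing[1]
    return (r0, c0, r1, c1), (height_mm, width_mm)


example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
box, size = bbox_and_size_mm(example_mask, (0.5, 2.0))
```

Because every step here is deterministic, any measurement error in such a pipeline would come from the segmentation itself, which is the contrast the reviewer draws with end-to-end VLM prediction.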
Code & Models
- 🤗YongchengYAO/MedVision__SFT-m__qwen25vl-7b__TLmodel· 70 dl
- 🤗YongchengYAO/MedVision__SFT-m__qwen25vl-7b__detectmodel· 63 dl
- 🤗YongchengYAO/MedVision__SFT-m__qwen25vl-32b__ADmodel· 12 dl
- 🤗YongchengYAO/MedVision__SFT-m__qwen25vl-7b__ADmodel· 11 dl
- 🤗YongchengYAO/MedVision__SFT-m__qwen25vl-32b__TLmodel· 90 dl
- 🤗YongchengYAO/MedVision__SFT-m__qwen25vl-32b__detectmodel· 99 dl
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
