DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

Yixiong Chen; Wenjie Xiao; Pedro R. A. S. Bassi; Boyan Wang; Liang He; Xinze Zhou; Sezgin Er; Ibrahim Ethem Hamamci; Zongwei Zhou; Alan Yuille

arXiv:2605.09679 · cs.CV · May 12, 2026

TL;DR

DeepTumorVQA introduces a hierarchical 3D CT benchmark for evaluating medical vision-language models and AI agents across multiple reasoning stages, emphasizing tool integration and step-wise diagnosis.

Contribution

It presents a novel multi-stage benchmark with tool interaction environments, enabling detailed analysis of model capabilities and challenges in medical image reasoning.

Findings

01 Reliable measurement is the main bottleneck for model performance.

02 Tool augmentation significantly improves reasoning accuracy.

03 Ground-truth step traces help supervise and reduce reasoning failures.

Abstract

Medical vision-language models (VLMs) and AI agents have made significant progress in learning to analyze and reason about clinical images. However, existing medical visual question answering (VQA) benchmarks collapse model capabilities into a single accuracy score, obscuring where and why models fail. We propose DeepTumorVQA, a hierarchical benchmark that follows the multi-stage evidence chain in tumor diagnosis and decomposes 3D CT reasoning into four stages: recognition, measurement, visual reasoning, and medical reasoning. Higher-level questions remain independently scorable, while their ground-truth evidence chains are defined over lower-level primitives. The benchmark contains 476K questions across 42 clinical subtypes on 9,262 3D CT volumes. In addition to a direct reasoning mode for VLMs, DeepTumorVQA provides tool-interaction environments for agent evaluation, where a model can…
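
To illustrate why a single accuracy score can hide where a model fails, here is a minimal sketch of stage-wise scoring over the four stages named in the abstract. It is not the benchmark's evaluation code: the record layout (`stage`, `correct`), the function name `stage_wise_accuracy`, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict

# The four stages named in the abstract, from lowest to highest level.
STAGES = ("recognition", "measurement", "visual reasoning", "medical reasoning")


def stage_wise_accuracy(records):
    """Return (overall_accuracy, {stage: accuracy}) for already-scored answers.

    Each record is a hypothetical dict {"stage": str, "correct": bool};
    this layout is an assumption, not the dataset's actual schema.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        stage = rec["stage"]
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage!r}")
        totals[stage] += 1
        hits[stage] += int(rec["correct"])
    # Report only stages that actually have questions.
    per_stage = {s: hits[s] / totals[s] for s in STAGES if totals[s]}
    n = sum(totals.values())
    overall = sum(hits.values()) / n if n else 0.0
    return overall, per_stage


# Toy data (invented): the collapsed score looks mediocre, while the
# per-stage view shows that every measurement question was missed.
toy = [
    {"stage": "recognition", "correct": True},
    {"stage": "recognition", "correct": True},
    {"stage": "measurement", "correct": False},
    {"stage": "measurement", "correct": False},
    {"stage": "visual reasoning", "correct": True},
    {"stage": "medical reasoning", "correct": False},
]
overall, per_stage = stage_wise_accuracy(toy)
print(f"overall: {overall:.2f}")
for stage, acc in per_stage.items():
    print(f"{stage}: {acc:.2f}")
```

On the toy data the collapsed score is 0.50, which gives no hint that the measurement stage scored 0.00 while recognition scored 1.00; the per-stage breakdown makes that visible.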

Code & Models

Repositories

Schuture/DeepTumorVQA
github

Datasets

tumor-vqa/DeepTumorVQA_2.0
dataset · 1.0k dl