Med-StepBench: A Hierarchical Reasoning Framework for Evaluating Hallucinations in Medical Vision-Language Models

Minh Khoi Nguyen; Dai Lam Le; Amir Reza Jafari; Tuan Dung Nguyen; Mai Hong Son; Mai Huy Thong; Quang Huy Nguyen; Thanh Trung Nguyen; Reza Farahbakhsh; Noel Crespi; Phi Le Nguyen

arXiv:2605.10002 · cs.CV · May 12, 2026

TL;DR

Med-StepBench is a benchmark for step-wise hallucination detection in vision-language models applied to 3D medical imaging (oncological PET/CT). It reveals critical limitations in current models' grounding and reasoning capabilities.

Contribution

It introduces the first large-scale, step-wise hallucination benchmark for 3D medical imaging, with clinician-verified annotations and detailed evaluation of model failure modes.

Findings

01. Current VLMs are highly susceptible to adversarial explanations.

02. Systematic failure modes are revealed through step-level evaluation.

03. Grounding multi-step reasoning remains a fundamental challenge.

Abstract

Large vision-language models (VLMs) demonstrate strong performance in medical image understanding but frequently generate clinically plausible yet incorrect statements, raising significant safety concerns. Existing medical hallucination benchmarks primarily focus on 2D imaging with one-shot diagnostic questions. They offer limited insight into whether predictions are grounded in correct localization and abnormality identification, so critical reasoning errors can remain hidden behind seemingly correct diagnoses. We introduce Med-StepBench, the first large-scale benchmark for step-wise hallucination detection in 3D oncological PET/CT. It comprises over 12,000 images and more than 1,000,000 image-statement pairs across volumetric and multi-view 2D data, and decomposes clinical reasoning into four expert-designed diagnostic stages. Using clinician-verified annotations, we perform the…
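
To make the idea of step-wise evaluation concrete, here is a minimal Python sketch of how per-stage scores and a "fully grounded" rate could be computed from image-statement judgments. It is an illustration only, not the benchmark's actual protocol or metrics: the `Judgment` record, its field names, the function names, and the numbering of stages as 1 to 4 are all assumptions, since this excerpt does not name the four stages or define the scoring.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One image-statement pair (hypothetical schema, not from the paper)."""
    case_id: str     # assumed identifier for one PET/CT study
    stage: int       # 1..4, standing in for the four diagnostic stages
    label: bool      # clinician-verified: True if the statement is correct
    predicted: bool  # the model's verdict on the same statement


def per_stage_accuracy(judgments):
    """Fraction of statements the model judged correctly at each stage."""
    hits, totals = defaultdict(int), defaultdict(int)
    for j in judgments:
        totals[j.stage] += 1
        hits[j.stage] += int(j.predicted == j.label)
    return {stage: hits[stage] / totals[stage] for stage in sorted(totals)}


def fully_grounded_rate(judgments):
    """Fraction of cases in which every stage was judged correctly.

    A case with a correct final answer but an error at an earlier
    stage does not count as grounded.
    """
    by_case = defaultdict(list)
    for j in judgments:
        by_case[j.case_id].append(j)
    if not by_case:
        return 0.0
    grounded = sum(
        all(j.predicted == j.label for j in case)
        for case in by_case.values()
    )
    return grounded / len(by_case)


# Toy data: case "a" gets the final stage right but slips at stage 2.
demo = [
    Judgment("a", 1, True, True),
    Judgment("a", 2, False, True),
    Judgment("a", 3, True, True),
    Judgment("a", 4, True, True),
    Judgment("b", 1, True, True),
    Judgment("b", 2, True, True),
    Judgment("b", 3, False, False),
    Judgment("b", 4, True, True),
]

print(per_stage_accuracy(demo))   # {1: 1.0, 2: 0.5, 3: 1.0, 4: 1.0}
print(fully_grounded_rate(demo))  # 0.5
```

In the toy data, accuracy at the final stage is 100%, yet only half the cases are correct at every stage. This is the kind of hidden reasoning error that a one-shot diagnostic question would not surface.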
