Unveiling and Bridging the Functional Perception Gap in MLLMs: Atomic Visual Alignment and Hierarchical Evaluation via PET-Bench

Zanting Ye; Xiaolong Niu; Xuanbin Wu; Xu Han; Shengyuan Liu; Jing Hao; Zhihao Peng; Hao Sun; Jieqin Lv; Fanghu Wang; Yanchao Huang; Hubing Wu; Yixuan Yuan; Habib Zaidi; Arman Rahmim; Yefeng Zheng; and Lijun Lu

arXiv:2601.02737 · cs.CV · January 16, 2026

TL;DR

This paper identifies a perception gap in multimodal models for functional imaging, introduces PET-Bench for evaluation, and proposes Atomic Visual Alignment to improve diagnostic accuracy and reduce hallucinations in PET analysis.

Contribution

The paper introduces PET-Bench, a large-scale functional imaging benchmark, and proposes Atomic Visual Alignment to enhance MLLMs' understanding of functional PET data.

Findings

01. Standard Chain-of-Thought prompting causes hallucinations in PET diagnosis.
02. Atomic Visual Alignment significantly improves diagnostic accuracy.
03. The approach bridges the perception gap in functional imaging understanding.

Abstract

While Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in tasks such as abnormality detection and report generation for anatomical modalities, their capability in functional imaging remains largely unexplored. In this work, we identify and quantify a fundamental functional perception gap: the inability of current vision encoders to decode functional tracer biodistribution independent of morphological priors. Identifying Positron Emission Tomography (PET) as the quintessential modality to investigate this disconnect, we introduce PET-Bench, the first large-scale functional imaging benchmark comprising 52,308 hierarchical QA pairs from 9,732 multi-site, multi-tracer PET studies. Extensive evaluation of 19 state-of-the-art MLLMs reveals a critical safety hazard termed the Chain-of-Thought (CoT) hallucination trap. We observe that standard CoT prompting,…

Code & Models

Datasets

TZT21999/PET-Bench
dataset · 3 dl

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Radiology practices and education · Topological and Geometric Data Analysis