Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models

Shaonan Liu; Guo Yu; Xiaoling Luo; Shiyi Zheng; Wenting Chen; Jie Liu; Linlin Shen

arXiv:2601.06750 · cs.CV · January 13, 2026

Open Access

TL;DR

This paper introduces MedGaze-Bench, a novel benchmark for evaluating egocentric clinical intent understanding in medical multimodal large language models using clinician gaze data across various clinical scenarios.

Contribution

It presents the first benchmark leveraging clinician gaze as a Cognitive Cursor to assess intent understanding, addressing key challenges in medical multimodal AI evaluation.

Findings

01

Current MLLMs struggle with egocentric intent understanding.

02

Models tend to over-rely on global features, causing hallucinations.

03

Stress-testing reveals issues with reliability and safety in models.

Abstract

Medical Multimodal Large Language Models (Med-MLLMs) require egocentric clinical intent understanding for real-world deployment, yet existing benchmarks fail to evaluate this critical capability. To address these challenges, we introduce MedGaze-Bench, the first benchmark leveraging clinician gaze as a Cognitive Cursor to assess intent understanding across surgery, emergency simulation, and diagnostic interpretation. Our benchmark addresses three fundamental challenges: visual homogeneity of anatomical structures, strict temporal-causal dependencies in clinical workflows, and implicit adherence to safety protocols. We propose a Three-Dimensional Clinical Intent Framework evaluating: (1) Spatial Intent: discriminating precise targets amid visual noise, (2) Temporal Intent: inferring causal rationale through retrospective and prospective reasoning, and (3) Standard Intent: verifying…

Taxonomy

Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Multimodal Machine Learning Applications