Thinking with Gaze: Sequential Eye-Tracking as Visual Reasoning Supervision for Medical VLMs
Yiwei Li, Zihao Wu, Yanjun Lv, Hanqi Jiang, Weihang You, Zhengliang Liu, Dajiang Zhu, Xiang Li, Quanzheng Li, Tianming Liu, Lin Zhao

TL;DR
This paper uses radiologists' eye-tracking data as supervision for medical vision-language models: dedicated gaze tokens are trained to predict gaze-selected image patches in temporal order, which encourages more human-like visual reasoning and improves performance on radiology tasks.
Contribution
It adds a small set of dedicated gaze tokens to a VLM and trains them to predict gaze-selected image patch indices in temporal order, so that the model's reasoning follows the sequential visual search radiologists use when reading medical images.
Findings
Achieves state-of-the-art in-domain performance on MIMIC-EYE
Improves out-of-domain robustness on multiple external zero-shot benchmarks
Demonstrates the effectiveness of gaze supervision in visual reasoning
Abstract
Vision-language models (VLMs) process images as visual tokens, yet their intermediate reasoning is often carried out in text, which can be suboptimal for visually grounded radiology tasks. Radiologists instead diagnose via sequential visual search; eye-tracking captures this process as time-ordered gaze trajectories that reveal how evidence is acquired over time. We use eye-gaze as supervision to guide VLM reasoning by introducing a small set of dedicated gaze tokens. These tokens are trained to predict gaze-selected image patch indices in temporal order, encouraging the model to follow human-like evidence acquisition and integration. Experiments on MIMIC-EYE and multiple external zero-shot benchmarks show consistent gains over baselines, achieving state-of-the-art in-domain performance and improved out-of-domain robustness. These results highlight temporally ordered gaze as an…
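Illustrative sketch
The abstract describes gaze tokens that predict gaze-selected patch indices in temporal order. The following is a minimal sketch of how such targets and a loss could be built; it is not the authors' implementation. The square patch grid, the merging of consecutive fixations on the same patch, the cross-entropy objective, and the names `fixations_to_patch_indices` and `gaze_token_loss` are all assumptions made for illustration.

```python
import math


def fixations_to_patch_indices(fixations, grid_size):
    """Map time-stamped fixations to patch indices in temporal order.

    fixations: list of (t, x, y), with x and y normalized to [0, 1].
    grid_size: patches per image side (assumed square grid).
    """
    ordered = sorted(fixations, key=lambda f: f[0])  # sort by timestamp
    indices = []
    for _, x, y in ordered:
        # Clamp so that x == 1.0 or y == 1.0 still lands in the last patch.
        col = min(int(x * grid_size), grid_size - 1)
        row = min(int(y * grid_size), grid_size - 1)
        idx = row * grid_size + col  # row-major patch index
        # Assumption: merge consecutive fixations that hit the same patch.
        if not indices or indices[-1] != idx:
            indices.append(idx)
    return indices


def gaze_token_loss(logits, targets):
    """Mean cross-entropy over gaze tokens (assumed objective).

    logits: one list of per-patch scores for each gaze token.
    targets: the patch index each gaze token should predict.
    """
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)
        # Numerically stable log-sum-exp over all patches.
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]
    return total / len(targets)


# Hypothetical fixations given out of order: (time in s, x, y).
fixations = [(0.5, 0.9, 0.9), (0.1, 0.1, 0.1), (0.3, 0.6, 0.1)]
targets = fixations_to_patch_indices(fixations, grid_size=4)
uniform_logits = [[0.0] * 16 for _ in targets]
loss = gaze_token_loss(uniform_logits, targets)
```

With these inputs, `targets` is `[0, 2, 15]`, and a model that scores all 16 patches equally incurs a loss of `log(16)`. Because the k-th gaze token is paired with the k-th fixated patch, the ordering of the gaze trajectory is part of the supervision signal.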
Taxonomy
Topics: Multimodal Machine Learning Applications · Radiology practices and education · Domain Adaptation and Few-Shot Learning
