Thinking with Gaze: Sequential Eye-Tracking as Visual Reasoning Supervision for Medical VLMs

Yiwei Li; Zihao Wu; Yanjun Lv; Hanqi Jiang; Weihang You; Zhengliang Liu; Dajiang Zhu; Xiang Li; Quanzheng Li; Tianming Liu; Lin Zhao

arXiv:2603.06697 · cs.CV · March 10, 2026

TL;DR

This paper uses radiologists' time-ordered eye-tracking trajectories as supervision for medical vision-language models, steering their visual reasoning toward human-like evidence acquisition and improving performance on radiology tasks.

Contribution

It adds a small set of dedicated gaze tokens to a VLM and trains them to predict gaze-selected image patch indices in temporal order, so that the model's reasoning follows the sequential visual search that radiologists perform.

Findings

01. Achieves state-of-the-art in-domain performance on MIMIC-EYE

02. Improves out-of-domain robustness on multiple external zero-shot benchmarks

03. Shows that supervision from temporally ordered gaze benefits visual reasoning in medical VLMs

Abstract

Vision-language models (VLMs) process images as visual tokens, yet their intermediate reasoning is often carried out in text, which can be suboptimal for visually grounded radiology tasks. Radiologists instead diagnose via sequential visual search; eye-tracking captures this process as time-ordered gaze trajectories that reveal how evidence is acquired over time. We use eye-gaze as supervision to guide VLM reasoning by introducing a small set of dedicated gaze tokens. These tokens are trained to predict gaze-selected image patch indices in temporal order, encouraging the model to follow human-like evidence acquisition and integration. Experiments on MIMIC-EYE and multiple external zero-shot benchmarks show consistent gains over baselines, achieving state-of-the-art in-domain performance and improved out-of-domain robustness. These results highlight temporally ordered gaze as an…
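
The abstract describes gaze tokens that predict gaze-selected image patch indices in temporal order. As a rough illustration of what such targets could look like, the sketch below maps raw fixations onto a row-major patch grid and orders them by time. It is not the authors' pipeline: the function name, the (x, y, timestamp) fixation format, the patch size, the merging of consecutive repeats, and the truncation to a fixed number of gaze tokens are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical fixation format: (x_pixels, y_pixels, timestamp_seconds).
Fixation = Tuple[float, float, float]


def fixations_to_patch_sequence(
    fixations: List[Fixation],
    image_width: int,
    image_height: int,
    patch_size: int,
    num_gaze_tokens: int,
) -> List[int]:
    """Return row-major patch indices visited by gaze, in temporal order.

    Assumptions (not taken from the paper): square non-overlapping patches,
    consecutive fixations on the same patch are merged, and the sequence is
    truncated to `num_gaze_tokens` entries.
    """
    cols = image_width // patch_size
    rows = image_height // patch_size
    sequence: List[int] = []
    # Sort by timestamp so the targets follow the order of visual search.
    for x, y, _t in sorted(fixations, key=lambda f: f[2]):
        # Drop fixations that fall outside the patch grid.
        if not (0 <= x < cols * patch_size and 0 <= y < rows * patch_size):
            continue
        index = int(y // patch_size) * cols + int(x // patch_size)
        # Merge consecutive fixations that land on the same patch.
        if not sequence or sequence[-1] != index:
            sequence.append(index)
    return sequence[:num_gaze_tokens]


if __name__ == "__main__":
    example = [
        (10, 10, 0.5),    # patch 0
        (20, 5, 0.1),     # patch 1, earliest fixation
        (100, 50, 0.9),   # patch 48 on a 14-column grid
        (105, 55, 1.0),   # same patch as the previous fixation, merged
        (300, 10, 1.2),   # outside a 224-pixel-wide image, dropped
    ]
    print(fixations_to_patch_sequence(example, 224, 224, 16, 8))
    # -> [1, 0, 48]
```

In a setup like this, each gaze token would receive one entry of the returned sequence as a classification target over patch indices; how the paper actually selects patches and defines the loss is not stated in the abstract.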

Taxonomy

Topics: Multimodal Machine Learning Applications · Radiology practices and education · Domain Adaptation and Few-Shot Learning