Gaze-VLM: Bridging Gaze and VLMs through Attention Regularization for Egocentric Understanding

Anupam Pani; Yanchao Yang

arXiv:2510.21356·cs.CV·March 25, 2026

TL;DR

Gaze-VLM applies a gaze-regularized attention mechanism during training to improve vision-language models on egocentric understanding tasks, specifically future event prediction and current activity understanding.

Contribution

The paper presents a novel gaze-regularized training framework that aligns model attention with human gaze, improving VLM performance in egocentric tasks without using gaze at inference.

Findings

01 Up to 11% improvement in semantic prediction scores for future event prediction

02 Approximately 7% improvement in semantic prediction scores for current activity understanding

03 Gaze-guided training boosts model robustness and accuracy

Abstract

Eye gaze offers valuable cues about attention, short-term intent, and future actions, making it a powerful signal for modeling egocentric behavior. In this work, we propose a gaze-regularized framework that enhances VLMs for two key egocentric understanding tasks: fine-grained future event prediction and current activity understanding. Unlike prior approaches that rely solely on visual inputs or use gaze as an auxiliary input signal, our method uses gaze only during training. We introduce a gaze-regularized attention mechanism that aligns model focus with human visual gaze. This design is flexible and modular, allowing it to generalize across multiple VLM architectures that utilize attention. Experimental results show that our approach improves semantic prediction scores by up to 11% for future event prediction and around 7% for current activity understanding, compared to the…
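
The idea of aligning model attention with human gaze during training can be illustrated with a small sketch. The abstract does not state the form of the regularizer, so everything below is an assumption made for illustration: the penalty is taken to be a KL divergence between a gaze heatmap and the model's attention over image patches, and the names `normalize`, `gaze_regularizer`, `total_loss`, and the weight `lam` are hypothetical, not the authors' API.

```python
import numpy as np

def normalize(p, eps=1e-8):
    """Flatten a non-negative map and rescale it to sum to 1."""
    p = np.asarray(p, dtype=np.float64).ravel() + eps  # eps avoids log(0)
    return p / p.sum()

def gaze_regularizer(attention, gaze_heatmap):
    """KL(gaze || attention) over patches. Assumed form, not from the paper."""
    g = normalize(gaze_heatmap)
    a = normalize(attention)
    return float(np.sum(g * (np.log(g) - np.log(a))))

def total_loss(task_loss, attention, gaze_heatmap, lam=0.1):
    """Task loss plus a weighted gaze-alignment penalty (training only)."""
    return task_loss + lam * gaze_regularizer(attention, gaze_heatmap)

# Toy 2x2 patch grid: gaze is concentrated on the bottom-right patch.
gaze = np.array([[0.0, 0.1], [0.1, 0.8]])
aligned = gaze.copy()              # attention that matches the gaze map
diffuse = np.full((2, 2), 0.25)    # attention spread evenly over patches

print(gaze_regularizer(aligned, gaze), gaze_regularizer(diffuse, gaze))
```

In this sketch the penalty is zero when attention matches the gaze map and positive when attention is spread evenly, so minimizing it pulls attention toward the gazed region. Because the term enters only the training objective, no gaze input is required at inference, which is consistent with the abstract's description.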
