Gaze-Regularized VLMs for Ego-Centric Behavior Understanding

Anupam Pani; Yanchao Yang

arXiv:2603.23190 · cs.CV · March 25, 2026

TL;DR

This paper presents a gaze-regularized framework that enhances vision-language models for egocentric behavior understanding by integrating gaze data, leading to improved future event prediction accuracy.

Contribution

The study introduces a novel method for incorporating gaze information into VLMs during training, improving their ability to predict future actions in egocentric scenarios.

Findings

1. 13% improvement in semantic scores over baseline models

2. Effective alignment of model attention with human gaze patterns

3. Enhanced prediction of future events in egocentric videos

Abstract

Eye gaze, encompassing fixations and saccades, provides critical insights into human intentions and future actions. This study introduces a gaze-regularized framework that enhances Vision Language Models (VLMs) for egocentric behavior understanding. Unlike existing methods that rely solely on visual data and overlook gaze information, our approach directly incorporates gaze information into the VLM architecture during training. By generating gaze-based queries, the model dynamically focuses on gaze-highlighted regions, while a gaze-regularization mechanism ensures the alignment of model attention with human attention patterns. To better understand how gaze can be effectively integrated into VLMs, we conducted extensive experiments exploring various strategies for incorporating gaze data. These innovations enable the prediction of future events with detailed action descriptions.…
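
The abstract does not give the form of the gaze-regularization term, so the following is only a minimal sketch of one common way such an alignment could be expressed: a KL divergence between a gaze-derived distribution over image patches and the model's attention distribution over the same patches. The function names (`gaze_heatmap`, `gaze_regularizer`), the Gaussian heatmap, the patch grid, and the `sigma` parameter are all assumptions made for illustration and are not taken from the paper.

```python
import math


def gaze_heatmap(gaze_xy, grid_h, grid_w, sigma=1.0):
    """Hypothetical helper: Gaussian heatmap centred on a gaze point.

    gaze_xy is (column, row) in patch-grid coordinates. The result is a
    flat, row-major list of patch weights that sums to 1.
    """
    gx, gy = gaze_xy
    weights = []
    for row in range(grid_h):
        for col in range(grid_w):
            d2 = (col - gx) ** 2 + (row - gy) ** 2
            weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    total = sum(weights)
    return [w / total for w in weights]


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gaze_regularizer(attn_scores, gaze_dist, eps=1e-12):
    """Assumed regularizer: KL(gaze || attention) over patches.

    attn_scores are unnormalized attention logits, one per patch.
    The value is near 0 when attention matches the gaze distribution
    and grows as the two diverge.
    """
    attn = softmax(attn_scores)
    return sum(
        g * (math.log(g + eps) - math.log(a + eps))
        for g, a in zip(gaze_dist, attn)
    )


if __name__ == "__main__":
    gaze = gaze_heatmap((1, 1), grid_h=3, grid_w=3)
    uniform_scores = [0.0] * 9                    # attention spread evenly
    aligned_scores = [math.log(g) for g in gaze]  # attention matching gaze
    print("uniform attention penalty:", gaze_regularizer(uniform_scores, gaze))
    print("aligned attention penalty:", gaze_regularizer(aligned_scores, gaze))
```

In a training loop, a term like this would typically be added to the task loss with a weighting coefficient. Both that weighting and the choice of KL divergence over other distance measures are assumptions here.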

Taxonomy

Topics: Multimodal Machine Learning Applications · Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection