Gaze-Regularized VLMs for Ego-Centric Behavior Understanding
Anupam Pani, Yanchao Yang

TL;DR
This paper presents a gaze-regularized framework that enhances vision-language models for egocentric behavior understanding by integrating gaze data, leading to improved future event prediction accuracy.
Contribution
The study introduces a novel method for incorporating gaze information into VLMs during training, improving their ability to predict future actions in egocentric scenarios.
Findings
13% improvement in semantic scores over baseline models
Effective alignment of model attention with human gaze patterns
Enhanced prediction of future events in egocentric videos
Abstract
Eye gaze, encompassing fixations and saccades, provides critical insights into human intentions and future actions. This study introduces a gaze-regularized framework that enhances Vision Language Models (VLMs) for egocentric behavior understanding. Unlike existing methods that rely solely on visual data and overlook gaze information, our approach directly incorporates gaze information into the VLM architecture during training. By generating gaze-based queries, the model dynamically focuses on gaze-highlighted regions, while a gaze-regularization mechanism ensures the alignment of model attention with human attention patterns. To better understand how gaze can be effectively integrated into VLMs, we conducted extensive experiments exploring various strategies for incorporating gaze data. These innovations enable the prediction of future events with detailed action descriptions.…
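The abstract describes a gaze-regularization mechanism that aligns model attention with human attention, but it does not give the form of the loss. The sketch below is one plausible reading, not the authors' implementation. It assumes that a gaze point is rendered as a Gaussian heatmap over image patches, that the model exposes one attention logit per patch, and that alignment is measured with a KL divergence added to the task loss. All names (`gaze_heatmap`, `gaze_regularizer`, `total_loss`) and the parameters `sigma` and `lam` are hypothetical.

```python
import numpy as np

def gaze_heatmap(gaze_xy, grid=(4, 4), sigma=1.0):
    """Turn one normalized gaze point (x, y in [0, 1]) into a
    probability distribution over an h x w grid of image patches.
    The Gaussian form and sigma are assumptions for illustration."""
    h, w = grid
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    gx, gy = gaze_xy[0] * (w - 1), gaze_xy[1] * (h - 1)
    d2 = (xs - gx) ** 2 + (ys - gy) ** 2
    heat = np.exp(-d2 / (2.0 * sigma ** 2))
    return (heat / heat.sum()).ravel()

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def gaze_regularizer(attn_logits, gaze_dist, eps=1e-8):
    """KL(gaze || attention): small when the model's attention over
    patches matches the human gaze distribution (assumed loss form)."""
    attn = softmax(attn_logits)
    return float(np.sum(gaze_dist * (np.log(gaze_dist + eps) - np.log(attn + eps))))

def total_loss(task_loss, attn_logits, gaze_dist, lam=0.1):
    """Task loss plus the weighted gaze-alignment penalty.
    lam is a hypothetical weighting hyperparameter."""
    return task_loss + lam * gaze_regularizer(attn_logits, gaze_dist)

# Usage: uniform attention is penalized; gaze-matched attention is not.
gaze = gaze_heatmap((0.25, 0.75))
uniform_penalty = gaze_regularizer(np.zeros(16), gaze)
matched_penalty = gaze_regularizer(np.log(gaze), gaze)
```

Under these assumptions the penalty falls toward zero as the attention distribution approaches the gaze heatmap, so minimizing the combined loss pulls attention toward gaze-highlighted regions during training only; no gaze input is needed to compute the attention itself.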
Taxonomy
Topics: Multimodal Machine Learning Applications · Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection
