ViTGaze: Gaze Following with Interaction Features in Vision Transformers
Yuehao Song, Xinggang Wang, Jingfeng Yao, Wenyu Liu, Jinglin Zhang, and Xiangmin Xu

TL;DR
ViTGaze introduces a novel single-modality gaze following framework based on vision transformers, leveraging self-attention for human-scene interaction understanding, achieving state-of-the-art results with fewer parameters.
Contribution
The paper proposes a new ViTGaze framework that uses self-attention in vision transformers to model human-scene interactions for gaze following, reducing complexity and improving performance.
Findings
Achieves 3.4% higher AUC than previous single-modality methods.
Attains 5.1% higher average precision (AP) than previous single-modality methods.
Uses 59% fewer parameters than multi-modality approaches.
Abstract
Gaze following aims to interpret human-scene interactions by predicting the person's focal point of gaze. Prevailing approaches often adopt a two-stage framework, whereby multi-modality information is extracted in the initial stage for gaze target prediction. Consequently, the efficacy of these methods highly depends on the precision of the preceding modality extraction. Others use a single-modality approach with complex decoders, increasing network computational load. Inspired by the remarkable success of pre-trained plain vision transformers (ViTs), we introduce a novel single-modality gaze following framework called ViTGaze. In contrast to previous methods, it creates a novel gaze following framework based mainly on powerful encoders (relative decoder parameters less than 1%). Our principal insight is that the inter-token interactions within self-attention can be transferred to…
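The abstract's central idea, that inter-token interactions within self-attention can carry over to human-scene interactions, can be illustrated with a toy sketch. This is not the ViTGaze implementation: the token values, the `head_token_ids` argument, and the helper names are hypothetical, and the tokens are used directly as queries and keys (identity projections) where a real ViT would apply learned projections. It only shows how the attention rows of tokens covering a person's head could be read as a map over scene patches.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Scaled dot-product attention weights: row-wise softmax(Q K^T / sqrt(d))."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]

def person_to_scene_map(weights, head_token_ids):
    """Average the attention rows of the (hypothetical) head-region tokens.

    The result has one weight per patch token and sums to 1.
    """
    rows = [weights[i] for i in head_token_ids]
    n = len(weights[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(n)]

# Toy input: 4 patch tokens with made-up 2-dimensional embeddings.
# Assumption: token 0 covers the person's head.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.2]]
weights = attention_weights(tokens, tokens)
interaction = person_to_scene_map(weights, head_token_ids=[0])
```

In this toy setup, patches whose embeddings align with the head token receive more weight, which is the sense in which attention between tokens can stand in for an interaction between a person and regions of the scene.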
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection · Teleoperation and Haptic Systems
