ViTGaze: Gaze Following with Interaction Features in Vision Transformers

Yuehao Song; Xinggang Wang; Jingfeng Yao; Wenyu Liu; Jinglin Zhang; Xiangmin Xu

arXiv:2403.12778 · cs.CV · November 15, 2024 · 1 citation

TL;DR

ViTGaze is a single-modality gaze following framework built on pre-trained plain vision transformers. It uses the inter-token interactions in self-attention to model human-scene interactions, and it achieves state-of-the-art results among single-modality methods while using 59% fewer parameters than multi-modality approaches.

Contribution

The paper proposes ViTGaze, a framework that uses self-attention in vision transformers to model human-scene interactions for gaze following. It relies mainly on the encoder: the decoder accounts for less than 1% of the parameters.

Findings

01. Achieves 3.4% higher AUC than previous single-modality methods.

02. Attains 5.1% higher average precision (AP) than previous single-modality methods.

03. Uses 59% fewer parameters than multi-modality approaches.

Abstract

Gaze following aims to interpret human-scene interactions by predicting the person's focal point of gaze. Prevailing approaches often adopt a two-stage framework, whereby multi-modality information is extracted in the initial stage for gaze target prediction. Consequently, the efficacy of these methods highly depends on the precision of the preceding modality extraction. Others use a single-modality approach with complex decoders, increasing network computational load. Inspired by the remarkable success of pre-trained plain vision transformers (ViTs), we introduce a novel single-modality gaze following framework called ViTGaze. In contrast to previous methods, it creates a novel gaze following framework based mainly on powerful encoders (relative decoder parameters less than 1%). Our principal insight is that the inter-token interactions within self-attention can be transferred to…
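
The inter-token interactions mentioned in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it computes plain scaled dot-product attention weights over random patch tokens, then averages the attention rows of the tokens assumed to cover a person's head into a map over the scene. The function names, the 4x4 patch grid, the choice of head tokens, and the use of raw tokens as both queries and keys (no learned projections, no pre-trained ViT) are all assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Scaled dot-product attention weights, softmax(Q K^T / sqrt(d)), one row per query."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    weights = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights.append(softmax(scores))
    return weights


def head_to_scene_map(weights, head_token_ids, grid_h, grid_w):
    """Average the attention rows of the (hypothetical) head-region tokens
    and reshape the result into a grid_h x grid_w map over all patch tokens."""
    num_tokens = grid_h * grid_w
    mean_row = [
        sum(weights[t][j] for t in head_token_ids) / len(head_token_ids)
        for j in range(num_tokens)
    ]
    return [mean_row[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy setup: a 4x4 patch grid with 8-dimensional random token features.
random.seed(0)
GRID_H, GRID_W, DIM = 4, 4, 8
tokens = [
    [random.gauss(0.0, 1.0) for _ in range(DIM)]
    for _ in range(GRID_H * GRID_W)
]

# Assumption: queries and keys are the raw tokens (no learned projections).
weights = attention_weights(tokens, tokens)

# Assumption: patches 0 and 1 cover the person's head.
interaction_map = head_to_scene_map(weights, [0, 1], GRID_H, GRID_W)
```

Each row of `weights` is a probability distribution over all patch tokens, so `interaction_map` also sums to 1 and can be read as how strongly the assumed head region attends to each scene patch. In a real model these weights would come from the attention layers of a pre-trained transformer rather than from random features.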

Code & Models

Repositories

hustvl/vitgaze (PyTorch, official)

Models

yhsong/ViTGaze (Hugging Face model)

Taxonomy

Topics: Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection · Teleoperation and Haptic Systems