Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention
Sounak Mondal, Zhibo Yang, Seoyoung Ahn, Dimitris Samaras, Gregory Zelinsky, Minh Hoai

TL;DR
Gazeformer is a transformer-based model that predicts human gaze for unseen objects using natural language encoding, achieving superior accuracy and speed in goal-directed attention tasks, especially in zero-shot scenarios.
Contribution
The paper introduces Gazeformer, a novel zero-shot gaze prediction model that encodes targets via language, overcoming scalability issues of previous detector-based methods.
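The intuition behind encoding targets via language can be sketched with a toy example: if target names live in a shared embedding space, a never-before-searched target can be related to targets that do have gaze data by vector similarity. The sketch below is an illustration only, not Gazeformer's architecture. The embedding values, the `TOY_EMBEDDINGS` table, and the `nearest_seen_target` helper are all hypothetical; a real system would obtain embeddings from a pretrained language model.

```python
import math

# Hypothetical 3-d "language embeddings" for target names.
# The numbers are invented for illustration; they do not come from the paper.
TOY_EMBEDDINGS = {
    "fork": [0.9, 0.1, 0.0],
    "knife": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_seen_target(unseen, seen):
    """Return the seen target whose embedding is closest to the unseen one."""
    query = TOY_EMBEDDINGS[unseen]
    return max(seen, key=lambda name: cosine_similarity(query, TOY_EMBEDDINGS[name]))


if __name__ == "__main__":
    # "fork" is treated as an unseen target; "knife" and "car" as seen ones.
    print(nearest_seen_target("fork", ["knife", "car"]))
```

A detector-based approach has no comparable way to relate a new category to trained ones, because each detector is trained per object. With a shared language embedding, the unseen name already has a position relative to the seen names.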
Findings
Gazeformer outperforms existing models in ZeroGaze tasks.
It surpasses target-detection models on standard gaze prediction.
Gazeformer is over five times faster than previous models.
Abstract
Predicting human gaze is important in Human-Computer Interaction (HCI). However, to practically serve HCI applications, gaze prediction models must be scalable, fast, and accurate in their spatial and temporal gaze predictions. Recent scanpath prediction models focus on goal-directed attention (search). Such models are limited in their application due to a common approach relying on trained target detectors for all possible objects, and the availability of human gaze data for their training (both not scalable). In response, we pose a new task called ZeroGaze, a new variant of zero-shot learning where gaze is predicted for never-before-searched objects, and we develop a novel model, Gazeformer, to solve the ZeroGaze problem. In contrast to existing methods using object detector modules, Gazeformer encodes the target using a natural language model, thus leveraging semantic similarities in…
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection · Neonatal and fetal brain pathology
