OAT: Object-Level Attention Transformer for Gaze Scanpath Prediction
Yini Fang, Jingling Yu, Haozheng Zhang, Ralf van der Lans, Bertram Shi

TL;DR
This paper introduces OAT, a transformer-based model that predicts human gaze scanpaths during visual search by focusing on objects rather than pixels, improving alignment with human attention patterns.
Contribution
The paper presents a novel object-level attention transformer architecture with a new positional encoding for predicting human scanpaths in cluttered scenes.
Findings
OAT outperforms pixel-based models on established metrics.
OAT generalizes well to unseen layouts and objects.
OAT's predictions closely match human gaze behavior.
Abstract
Visual search is important in our daily life. The efficient allocation of visual attention is critical to effectively complete visual search tasks. Prior research has predominantly modelled the spatial allocation of visual attention in images at the pixel level, e.g. using a saliency map. However, emerging evidence shows that visual attention is guided by objects rather than pixel intensities. This paper introduces the Object-level Attention Transformer (OAT), which predicts human scanpaths as they search for a target object within a cluttered scene of distractors. OAT uses an encoder-decoder architecture. The encoder captures information about the position and appearance of the objects within an image and about the target. The decoder predicts the gaze scanpath as a sequence of object fixations, by integrating output features from both the encoder and decoder. We also propose a new…
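The abstract's central idea, a scanpath expressed as a sequence of object fixations rather than pixel coordinates, can be illustrated with a toy sketch. This is not OAT's transformer: the function names, the feature vectors, and the query rule (target feature minus the mean of already-fixated object features, a crude inhibition of revisits) are all assumptions made for illustration. It only shows how scaled dot-product scores over per-object features yield a distribution over objects, from which a greedy scanpath of object indices is read off.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def next_fixation_probs(object_feats, target_feat, fixated):
    """Return a probability for each object being fixated next.

    object_feats: per-object feature vectors (stand-in for encoder outputs)
    target_feat:  feature vector of the search target
    fixated:      indices of objects fixated so far (stand-in for decoder input)
    """
    d = len(target_feat)
    # Hypothetical query: target feature minus the mean of fixated features,
    # so objects similar to those already visited score lower.
    query = list(target_feat)
    if fixated:
        for j in range(d):
            query[j] -= sum(object_feats[i][j] for i in fixated) / len(fixated)
    # Scaled dot-product score between the query and every object.
    scores = [dot(query, f) / math.sqrt(d) for f in object_feats]
    return softmax(scores)


def greedy_scanpath(object_feats, target_feat, target_index, max_len=5):
    """Pick the most probable object at each step until the target is fixated."""
    scanpath = []
    for _ in range(max_len):
        probs = next_fixation_probs(object_feats, target_feat, scanpath)
        choice = max(range(len(probs)), key=probs.__getitem__)
        scanpath.append(choice)
        if choice == target_index:
            break
    return scanpath


# Three hypothetical objects with 2-D appearance features; object 1 is the target.
feats = [[0.9, 0.9], [0.0, 1.0], [1.0, 0.0]]
print(greedy_scanpath(feats, target_feat=[0.2, 1.0], target_index=1))
```

In this toy setup the output is a list of object indices, which is the representation the abstract describes; a learned model would replace the hand-written query rule and features with trained encoder and decoder outputs.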
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Video Surveillance and Tracking Methods · Brain Tumor Detection and Classification
Methods: Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Adam · Dropout · Multi-Head Attention · Dense Connections · Softmax
