OAT: Object-Level Attention Transformer for Gaze Scanpath Prediction

Yini Fang; Jingling Yu; Haozheng Zhang; Ralf van der Lans; Bertram Shi

arXiv:2407.13335·cs.CV·July 19, 2024

TL;DR

This paper introduces OAT, a transformer-based model that predicts human gaze scanpaths during visual search by focusing on objects rather than pixels, improving alignment with human attention patterns.

Contribution

The paper presents a novel object-level attention transformer architecture with a new positional encoding for predicting human scanpaths in cluttered scenes.

Findings

01 OAT outperforms pixel-based models on established metrics.

02 OAT generalizes well to unseen layouts and objects.

03 OAT's predictions closely match human gaze behavior.

Abstract

Visual search is important in our daily life. The efficient allocation of visual attention is critical to effectively complete visual search tasks. Prior research has predominantly modelled the spatial allocation of visual attention in images at the pixel level, e.g. using a saliency map. However, emerging evidence shows that visual attention is guided by objects rather than pixel intensities. This paper introduces the Object-level Attention Transformer (OAT), which predicts human scanpaths as they search for a target object within a cluttered scene of distractors. OAT uses an encoder-decoder architecture. The encoder captures information about the position and appearance of the objects within an image and about the target. The decoder predicts the gaze scanpath as a sequence of object fixations, by integrating output features from both the encoder and decoder. We also propose a new…
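
To make the phrase "a sequence of object fixations" concrete, here is a minimal toy sketch in plain Python. It is not the authors' architecture: the function names, the dot-product scoring, and the revisit penalty are all assumptions made for illustration. What it shows is only the output format the abstract describes, where each predicted fixation is an index into the list of objects in the scene rather than a pixel coordinate.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def predict_scanpath(object_feats, target_feat, max_steps=5, revisit_penalty=2.0):
    """Greedy toy decoder (hypothetical, not the OAT decoder).

    object_feats: one feature vector per object in the scene.
    target_feat:  feature vector describing the search target.
    Returns a list of object indices, one per predicted fixation.
    """
    scale = math.sqrt(len(target_feat))
    visits = [0] * len(object_feats)
    scanpath = []
    for _ in range(max_steps):
        # Scaled dot-product similarity to the target, minus an assumed
        # penalty for objects that were already fixated.
        scores = [
            dot(feat, target_feat) / scale - revisit_penalty * visits[i]
            for i, feat in enumerate(object_feats)
        ]
        probs = softmax(scores)
        next_obj = max(range(len(probs)), key=probs.__getitem__)
        scanpath.append(next_obj)
        visits[next_obj] += 1
    return scanpath


# Three objects with made-up 2-D features; the target resembles object 0.
objects = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
target = [1.0, 0.0]
print(predict_scanpath(objects, target, max_steps=3))
```

In the paper's model the scores come from a trained transformer encoder-decoder; the hand-written similarity here only stands in for that step so the object-level output is easy to see.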

Code & Models

Repositories

hkust-nisl/oat_eccv24 (PyTorch, official)

Taxonomy

Topics: Gaze Tracking and Assistive Technology · Video Surveillance and Tracking Methods · Brain Tumor Detection and Classification

Methods: Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Adam · Dropout · Multi-Head Attention · Dense Connections · Softmax