Unified Attention Modeling for Efficient Free-Viewing and Visual Search via Shared Representations
Fatma Youssef Mohammed, Kostas Alexis

TL;DR
This paper proposes a shared neural network architecture that models human attention for both free-viewing and visual search, enabling efficient transfer of learned representations with minimal performance loss and significant computational savings.
Contribution
It introduces a unified attention model, built on the Human Attention Transformer (HAT), and uses it to show that free-viewing and visual search can share a common representation, which reduces training cost while largely preserving scanpath prediction accuracy.
Findings
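A model trained on free-viewing transfers to visual search with only a 3.86% drop in semantic sequence score (SemSS).
The transfer reduces computational cost by 92.29% in GFLOPs and by 31.23% in trainable parameters.
After transfer, predicted fixation scanpaths remain close to human scanpaths as measured by SemSS.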
Abstract
Computational human attention modeling in free-viewing and task-specific settings is often studied separately, with limited exploration of whether a common representation exists between them. This work investigates this question and proposes a neural network architecture that builds upon the Human Attention Transformer (HAT) to test the hypothesis. Our results demonstrate that free-viewing and visual search can efficiently share a common representation, allowing a model trained in free-viewing attention to transfer its knowledge to task-driven visual search with a performance drop of only 3.86% in the predicted fixation scanpaths, measured by the semantic sequence score (SemSS) metric, which reflects the similarity between predicted and human scanpaths. This transfer reduces computational costs by 92.29% in terms of GFLOPs and 31.23% in terms of trainable parameters.
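The percentages in the abstract read most naturally as relative changes against a fully trained baseline, although the abstract does not spell out the formula. A minimal sketch of that arithmetic, under that assumption, is given here. The function name and the example numbers are hypothetical and are not taken from the paper.

```python
def relative_reduction(baseline: float, reduced: float) -> float:
    """Return the percentage by which `reduced` falls below `baseline`.

    Assumption: reported savings are computed as
    (baseline - reduced) / baseline * 100.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - reduced) / baseline * 100.0


if __name__ == "__main__":
    # Hypothetical values chosen only to illustrate the arithmetic;
    # they are NOT the paper's measured GFLOPs.
    full_training_gflops = 100.0
    transfer_gflops = 7.71
    saving = relative_reduction(full_training_gflops, transfer_gflops)
    print(f"{saving:.2f}% fewer GFLOPs")
```

The same function applies to the SemSS drop and to the trainable-parameter count, with the fully trained model's score or parameter count as the baseline.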
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection · Image Retrieval and Classification Techniques
