Contrastive Language-Image Pretrained Models are Zero-Shot Human Scanpath Predictors
Dario Zanca, Andrea Zugarini, Simon Dietz, Thomas R. Altstidl, Mark A. Turban Ndjeuha, Leo Schwinn, Bjoern Eskofier

TL;DR
This paper introduces NevaClip, a zero-shot model that combines CLIP with neural visual attention (NeVA) to predict human-like scanpaths. It is evaluated on CapMIT1003, a new dataset released with the paper, and advances the understanding of task-driven visual exploration.
Contribution
The paper presents NevaClip, a novel zero-shot method for scanpath prediction that integrates language and visual models, and introduces CapMIT1003, a new dataset for studying attention during captioning tasks.
Findings
NevaClip outperforms existing unsupervised computational models of human visual attention in scanpath plausibility, on both captioning and free-viewing tasks.
Caption guidance significantly influences scanpath behavior.
Incorrect captions lead to more random scanpaths.
Abstract
Understanding the mechanisms underlying human attention is a fundamental challenge for both vision science and artificial intelligence. While numerous computational models of free-viewing have been proposed, less is known about the mechanisms underlying task-driven image exploration. To address this gap, we present CapMIT1003, a database of captions and click-contingent image explorations collected during captioning tasks. CapMIT1003 is based on the same stimuli from the well-known MIT1003 benchmark, for which eye-tracking data under free-viewing conditions is available, which offers a promising opportunity to concurrently study human attention under both tasks. We make this dataset publicly available to facilitate future research in this field. In addition, we introduce NevaClip, a novel zero-shot method for predicting visual scanpaths that combines contrastive language-image…
Taxonomy
Topics: Visual Attention and Saliency Detection · Multimodal Machine Learning Applications · Retinal Imaging and Analysis
