Contrastive Language-Image Pretrained Models are Zero-Shot Human Scanpath Predictors

Dario Zanca; Andrea Zugarini; Simon Dietz; Thomas R. Altstidl; Mark A. Turban Ndjeuha; Leo Schwinn; Bjoern Eskofier

arXiv:2305.12380 · cs.CV · May 24, 2023 · 1 citation

TL;DR

This paper introduces NevaClip, a zero-shot method that combines CLIP with a neural visual attention model to predict human-like scanpaths. It is evaluated on CapMIT1003, a new dataset of image explorations collected during captioning tasks, and helps characterize task-driven visual exploration.

Contribution

The paper presents NevaClip, a novel zero-shot method for scanpath prediction that integrates language and visual models, and introduces CapMIT1003, a new dataset for studying attention during captioning tasks.

Findings

01. NevaClip outperforms existing models in scanpath plausibility.

02. Caption guidance significantly influences scanpath behavior.

03. Incorrect captions lead to more random scanpaths.

Abstract

Understanding the mechanisms underlying human attention is a fundamental challenge for both vision science and artificial intelligence. While numerous computational models of free-viewing have been proposed, less is known about the mechanisms underlying task-driven image exploration. To address this gap, we present CapMIT1003, a database of captions and click-contingent image explorations collected during captioning tasks. CapMIT1003 is based on the same stimuli from the well-known MIT1003 benchmark, for which eye-tracking data under free-viewing conditions is available, which offers a promising opportunity to concurrently study human attention under both tasks. We make this dataset publicly available to facilitate future research in this field. In addition, we introduce NevaClip, a novel zero-shot method for predicting visual scanpaths that combines contrastive language-image…
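
To make the idea of caption-guided scanpath prediction concrete, here is a minimal toy sketch in Python. It is not the paper's implementation. It assumes, for illustration only, that a scanpath can be built greedily by choosing, at each step, the fixation whose foveated view of the image is most similar (by cosine similarity) to a caption embedding. The names `greedy_scanpath`, `fovea_mask`, and `toy_encoder` are hypothetical, the candidate grid is arbitrary, and `toy_encoder` simply flattens pixels as a stand-in for a real CLIP image encoder.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return float(a @ b / (na * nb))

def fovea_mask(shape, center, radius):
    """Boolean mask of pixels within `radius` of `center` (crude foveation)."""
    ys, xs = np.ogrid[:shape[0], :shape[1]]
    return (ys - center[0]) ** 2 + (xs - center[1]) ** 2 <= radius ** 2

def greedy_scanpath(image, caption_emb, encode_image, candidates, n_fix, radius):
    """Greedily pick fixations that best align the seen image with the caption.

    ASSUMPTION: greedy search over a fixed candidate list is an illustrative
    simplification, not the optimization used in the paper.
    """
    seen = np.zeros(image.shape, dtype=bool)   # pixels revealed so far
    remaining = list(candidates)
    path = []
    for _ in range(min(n_fix, len(remaining))):
        scores = []
        for c in remaining:
            view = image * (seen | fovea_mask(image.shape, c, radius))
            scores.append(cosine(encode_image(view), caption_emb))
        best = remaining.pop(int(np.argmax(scores)))
        seen |= fovea_mask(image.shape, best, radius)
        path.append(best)
    return path

def toy_encoder(img):
    """HYPOTHETICAL stand-in for an image encoder: flattened pixels."""
    return img.ravel()

# Toy scene with two bright objects, A (top-left) and B (bottom-right).
image = np.zeros((8, 8))
image[1:3, 1:3] = 1.0   # object A
image[5:7, 5:7] = 1.0   # object B

# Pretend the caption describes only object A; its "embedding" is faked
# by encoding an image that contains A alone.
target_a = np.zeros((8, 8))
target_a[1:3, 1:3] = 1.0
caption_emb = toy_encoder(target_a)

candidates = [(1, 1), (1, 5), (5, 1), (5, 5)]
path = greedy_scanpath(image, caption_emb, toy_encoder, candidates, n_fix=2, radius=2)
print(path)
```

In this toy setup the first fixation lands on object A, the one the fake caption describes. Swapping in a caption embedding for object B moves the first fixation to B, which loosely mirrors the finding that caption guidance shapes the scanpath.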

Code & Models

Datasets

azugarini/CapMIT1003 (dataset, 20 downloads)

Taxonomy

Topics: Visual Attention and Saliency Detection · Multimodal Machine Learning Applications · Retinal Imaging and Analysis