Caption-Driven Explorations: Aligning Image and Text Embeddings through Human-Inspired Foveated Vision

Dario Zanca; Andrea Zugarini; Simon Dietz; Thomas R. Altstidl; Mark A. Turban Ndjeuha; Leo Schwinn; Bjoern Eskofier

arXiv:2408.09948·cs.CV·August 20, 2024

Open Access

TL;DR

This paper introduces a new dataset and a zero-shot model for predicting human-like visual scanpaths during image captioning, improving understanding of attention in vision and AI.

Contribution

It presents the CapMIT1003 dataset and the NevaClip model, which combines CLIP models with NeVA algorithms for task-driven scanpath prediction, a novel approach in attention modeling.

Findings

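01. NevaClip's simulated scanpaths outperform existing human attention models in plausibility on both captioning and free-viewing tasks.

02. CapMIT1003 enables the study of human attention during the captioning task.

03. The zero-shot approach reduces the need for training data.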

Abstract

Understanding human attention is crucial for vision science and AI. While many models exist for free-viewing, less is known about task-driven image exploration. To address this, we introduce CapMIT1003, a dataset with captions and click-contingent image explorations, to study human attention during the captioning task. We also present NevaClip, a zero-shot method for predicting visual scanpaths by combining CLIP models with NeVA algorithms. NevaClip generates fixations to align the representations of foveated visual stimuli and captions. The simulated scanpaths outperform existing human attention models in plausibility for captioning and free-viewing tasks. This research enhances the understanding of human attention and advances scanpath prediction models.
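The idea of generating fixations that align a foveated image embedding with a caption embedding can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the encoder is a random linear projection standing in for CLIP, the foveation is a simple blend of a sharp and a box-blurred image, and fixations are picked by greedy grid search rather than by whatever optimization NeVA actually uses. The names `box_blur`, `foveate`, and `greedy_scanpath` are hypothetical.

```python
import numpy as np

def box_blur(img, k=7):
    """Crude k x k mean filter, a stand-in for blurry peripheral vision."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def foveate(img, blurred, fixations, sigma=6.0):
    """Keep the image sharp near every fixation so far, blurred elsewhere."""
    ys, xs = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    mask = np.zeros(img.shape, dtype=float)
    for fy, fx in fixations:
        g = np.exp(-((ys - fy) ** 2 + (xs - fx) ** 2) / (2 * sigma ** 2))
        mask = np.maximum(mask, g)
    return mask * img + (1.0 - mask) * blurred

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def greedy_scanpath(img, target_emb, embed, n_fix=3, stride=8):
    """Pick fixations one at a time; each maximizes similarity between the
    embedding of the foveated image and the target (caption) embedding."""
    blurred = box_blur(img)
    candidates = [(y, x)
                  for y in range(0, img.shape[0], stride)
                  for x in range(0, img.shape[1], stride)]
    scanpath = []
    for _ in range(n_fix):
        best = max(
            candidates,
            key=lambda c: cosine(embed(foveate(img, blurred, scanpath + [c])),
                                 target_emb),
        )
        scanpath.append(best)
    return scanpath

# Toy setup (all assumed): random grayscale image, random linear "encoder".
rng = np.random.default_rng(0)
img = rng.random((32, 32))
W = rng.standard_normal((16, 32 * 32))
embed = lambda x: W @ x.ravel()   # stand-in for a CLIP image encoder
caption_emb = embed(img)          # stand-in for a CLIP text embedding
path = greedy_scanpath(img, caption_emb, embed)
print(path)
```

In a more realistic setting, `embed` and `caption_emb` would come from a pretrained image and text encoder pair, and the target embedding would be computed from the caption rather than from the image itself.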

Taxonomy

Topics: Subtitles and Audiovisual Media · Language, Metaphor, and Cognition · Multimodal Machine Learning Applications

Methods: Softmax · Attention Is All You Need · ALIGN · Contrastive Language-Image Pre-training