Caption-Driven Explorations: Aligning Image and Text Embeddings through Human-Inspired Foveated Vision
Dario Zanca, Andrea Zugarini, Simon Dietz, Thomas R. Altstidl, Mark A. Turban Ndjeuha, Leo Schwinn, Bjoern Eskofier

TL;DR
This paper introduces a new dataset and a zero-shot model for predicting human-like visual scanpaths during image captioning, improving understanding of attention in vision and AI.
Contribution
It presents the CapMIT1003 dataset and the NevaClip model, which combines CLIP models with NeVA algorithms for task-driven scanpath prediction, a novel approach in attention modeling.
Findings
NevaClip's simulated scanpaths are more plausible than those of existing human attention models on both captioning and free-viewing tasks
CapMIT1003 enables studying attention during captioning
Zero-shot approach reduces need for training data
Abstract
Understanding human attention is crucial for vision science and AI. While many models exist for free-viewing, less is known about task-driven image exploration. To address this, we introduce CapMIT1003, a dataset with captions and click-contingent image explorations, to study human attention during the captioning task. We also present NevaClip, a zero-shot method for predicting visual scanpaths by combining CLIP models with NeVA algorithms. NevaClip generates fixations to align the representations of foveated visual stimuli and captions. The simulated scanpaths outperform existing human attention models in plausibility for captioning and free-viewing tasks. This research enhances the understanding of human attention and advances scanpath prediction models.
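The idea of choosing fixations so that a foveated view of the image aligns with the caption can be illustrated with a toy sketch. It is not the authors' implementation: the abstract says NevaClip combines CLIP models with NeVA algorithms, whereas this sketch assumes a random linear projection as a stand-in image encoder, a flat mean-valued periphery as a stand-in for blur, and a greedy grid search as a stand-in for the fixation optimization. The names `foveate`, `cosine`, `greedy_scanpath`, `embed_image`, and `caption_embedding` are hypothetical.

```python
import numpy as np

def foveate(image, fixations, sigma=6.0):
    """Keep the image sharp near each fixation (row, col) and flatten the periphery."""
    image = np.asarray(image, dtype=float)
    rows, cols = np.mgrid[0:image.shape[0], 0:image.shape[1]]
    mask = np.zeros_like(image)
    for r, c in fixations:
        dist2 = (rows - r) ** 2 + (cols - c) ** 2
        # Gaussian window: 1 at the fixation, decaying towards 0 far away.
        mask = np.maximum(mask, np.exp(-dist2 / (2.0 * sigma ** 2)))
    # Assumption: a constant mean-valued image stands in for a blurred periphery.
    periphery = np.full_like(image, image.mean())
    return mask * image + (1.0 - mask) * periphery

def cosine(a, b):
    """Cosine similarity between two vectors (small epsilon avoids division by zero)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def greedy_scanpath(image, caption_embedding, embed_image, n_fixations=3, stride=8):
    """Greedily add the grid point whose foveated view best matches the caption.

    n_fixations must not exceed the number of grid candidates.
    """
    h, w = np.asarray(image).shape
    candidates = [(r, c)
                  for r in range(stride // 2, h, stride)
                  for c in range(stride // 2, w, stride)]
    scanpath = []
    for _ in range(n_fixations):
        remaining = [p for p in candidates if p not in scanpath]
        best = max(
            remaining,
            key=lambda p: cosine(embed_image(foveate(image, scanpath + [p])),
                                 caption_embedding),
        )
        scanpath.append(best)
    return scanpath

# Toy usage with random data; a real setup would use an actual image and text encoder.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
projection = rng.standard_normal((16, 32 * 32))  # hypothetical stand-in encoder
embed_image = lambda img: projection @ img.ravel()
caption_embedding = rng.standard_normal(16)      # hypothetical caption embedding
print(greedy_scanpath(image, caption_embedding, embed_image))
```

The accumulated mask means each new fixation is judged on what it adds to the regions already seen, which is one simple way to make later fixations depend on earlier ones.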
Taxonomy
Topics: Subtitles and Audiovisual Media · Language, Metaphor, and Cognition · Multimodal Machine Learning Applications
Methods: Softmax · Attention Is All You Need · ALIGN · Contrastive Language-Image Pre-training
