TL;DR
This paper introduces a sequential cross-modal alignment model for image captioning that leverages human gaze data to produce more natural and speaker-aligned descriptions, highlighting the importance of temporal gaze information in visual language tasks.
Contribution
It presents the first sequential gaze-driven image captioning model, demonstrating improved description quality by integrating gaze data with a recurrent attention mechanism.
Findings
Gaze-driven models produce more natural and diverse descriptions.
Sequential processing of gaze data enhances alignment with human descriptions.
Gaze encoding with recurrent components improves caption quality.
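To make the findings above concrete, here is a minimal sketch of why processing gaze sequentially preserves information that a time-collapsed gaze map loses. It is not the paper's model. The function names, the toy region features, the aggregated comparison, and the elementwise tanh recurrence are all assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def gaze_weighted_features(region_feats: List[Vector], gaze_weights: Vector) -> Vector:
    """Average the region features, weighted by one gaze map (hypothetical helper)."""
    total = sum(gaze_weights)
    if total <= 0:
        # No fixation recorded: fall back to uniform weights (an assumption).
        weights = [1.0 / len(gaze_weights)] * len(gaze_weights)
    else:
        weights = [w / total for w in gaze_weights]
    dim = len(region_feats[0])
    return [sum(w * feat[d] for w, feat in zip(weights, region_feats)) for d in range(dim)]


def sequential_gaze_features(region_feats: List[Vector], gaze_sequence: List[Vector]) -> List[Vector]:
    """One gaze-weighted feature vector per time step, in fixation order."""
    return [gaze_weighted_features(region_feats, g) for g in gaze_sequence]


def aggregated_gaze_features(region_feats: List[Vector], gaze_sequence: List[Vector]) -> Vector:
    """Collapse all gaze maps over time into a single weighting (order is lost)."""
    summed = [sum(g[i] for g in gaze_sequence) for i in range(len(region_feats))]
    return gaze_weighted_features(region_feats, summed)


def recurrent_encode(inputs: List[Vector], w_in: float = 1.0, w_rec: float = 0.5) -> List[Vector]:
    """Toy recurrence h_t = tanh(w_in * x_t + w_rec * h_{t-1}), applied elementwise.

    This stands in for a learned recurrent gaze encoder; the scalar weights are
    placeholders, not trained parameters.
    """
    h = [0.0] * len(inputs[0])
    states = []
    for x in inputs:
        h = [math.tanh(w_in * xi + w_rec * hi) for xi, hi in zip(x, h)]
        states.append(h)
    return states


# Two image regions with toy 2-d features, and two fixation orders.
regions = [[1.0, 0.0], [0.0, 1.0]]
gaze_a = [[1.0, 0.0], [0.0, 1.0]]  # look at region 0, then region 1
gaze_b = [[0.0, 1.0], [1.0, 0.0]]  # look at region 1, then region 0

final_a = recurrent_encode(sequential_gaze_features(regions, gaze_a))[-1]
final_b = recurrent_encode(sequential_gaze_features(regions, gaze_b))[-1]
```

In this toy setup the aggregated features are identical for both fixation orders, while the final recurrent states differ. That is the kind of order information a sequential gaze encoder can pass on to the caption decoder.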
Abstract
When speakers describe an image, they tend to look at objects before mentioning them. In this paper, we investigate such sequential cross-modal alignment by modelling the image description generation process computationally. We take as our starting point a state-of-the-art image captioning system and develop several model variants that exploit information from human gaze patterns recorded during language production. In particular, we propose the first approach to image description generation where visual processing is modelled sequentially. Our experiments and analyses confirm that better descriptions can be obtained by exploiting gaze-driven attention and shed light on human cognitive processes by comparing different ways of aligning the gaze modality with language production. We find that processing gaze data sequentially leads to descriptions that are better aligned to…
