Seeing with Humans: Gaze-Assisted Neural Image Captioning
Yusuke Sugano, Andreas Bulling

TL;DR
This paper explores how human gaze data can enhance neural image captioning by integrating gaze into attention mechanisms, leading to improved captioning performance and better scene understanding.
Contribution
It introduces a novel split attention model that incorporates human gaze into neural captioning, demonstrating the benefit of gaze data for scene-centric tasks.
Findings
Gaze data improves image captioning accuracy.
Gaze complements machine attention in scene understanding.
The proposed model outperforms baseline captioning methods.
Abstract
Gaze reflects how humans process visual scenes and is therefore increasingly used in computer vision systems. Previous works demonstrated the potential of gaze for object-centric tasks, such as object localization and recognition, but it remains unclear if gaze can also be beneficial for scene-centric tasks, such as image captioning. We present a new perspective on gaze-assisted image captioning by studying the interplay between human gaze and the attention mechanism of deep neural networks. Using a public large-scale gaze dataset, we first assess the relationship between state-of-the-art object and scene recognition models, bottom-up visual saliency, and human gaze. We then propose a novel split attention model for image captioning. Our model integrates human gaze information into an attention-based long short-term memory architecture, and allows the algorithm to allocate attention…
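The abstract is cut off before it describes how the split attention works, so the following is only a minimal sketch of one way a gaze map could steer attention over image regions, not the authors' formulation. Everything in it is an assumption made for illustration: the function names `split_attention` and `attend`, the use of two separate score sets (one for gazed regions, one for the rest), and the mixing rule that weights them by a per-region gaze value in [0, 1].

```python
import math


def split_attention(scores_gazed, scores_other, gaze):
    """Blend two sets of attention scores using a gaze map.

    Hypothetical mixing rule (an assumption, not taken from the paper):
    weight_i is proportional to
        gaze_i * exp(scores_gazed_i) + (1 - gaze_i) * exp(scores_other_i)
    where gaze_i in [0, 1] says how strongly region i was fixated.
    """
    if not (len(scores_gazed) == len(scores_other) == len(gaze)):
        raise ValueError("all inputs must have one entry per image region")
    # Subtract the largest score before exponentiating, for numerical stability.
    m = max(max(scores_gazed), max(scores_other))
    unnorm = [
        g * math.exp(sg - m) + (1.0 - g) * math.exp(so - m)
        for sg, so, g in zip(scores_gazed, scores_other, gaze)
    ]
    total = sum(unnorm)
    # Normalize so the weights form a distribution over regions.
    return [u / total for u in unnorm]


def attend(features, weights):
    """Context vector: attention-weighted sum of per-region feature vectors."""
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


# Three image regions with 2-D features; the first region was fixated.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gaze = [1.0, 0.0, 0.0]
weights = split_attention([2.0, 0.0, 0.0], [0.0, 0.0, 0.0], gaze)
context = attend(features, weights)
```

In an attention-based LSTM captioner, a context vector like `context` would be recomputed at every decoding step and fed to the recurrent unit together with the previous word. Keeping separate scores for gazed and non-gazed regions lets the model still attend to regions that humans did not fixate.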
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Gaze Tracking and Assistive Technology
