Seeing with Humans: Gaze-Assisted Neural Image Captioning

Yusuke Sugano; Andreas Bulling

arXiv:1608.05203 · cs.CV · August 19, 2016 · 49 citations

Open Access

TL;DR

This paper explores how human gaze data can enhance neural image captioning by integrating gaze into attention mechanisms, leading to improved captioning performance and better scene understanding.

Contribution

It introduces a novel split attention model that incorporates human gaze into neural captioning, demonstrating the benefit of gaze data for scene-centric tasks.

Findings

01. Gaze data improves image captioning accuracy.

02. Gaze complements machine attention in scene understanding.

03. The proposed model outperforms baseline captioning methods.

Abstract

Gaze reflects how humans process visual scenes and is therefore increasingly used in computer vision systems. Previous works demonstrated the potential of gaze for object-centric tasks, such as object localization and recognition, but it remains unclear if gaze can also be beneficial for scene-centric tasks, such as image captioning. We present a new perspective on gaze-assisted image captioning by studying the interplay between human gaze and the attention mechanism of deep neural networks. Using a public large-scale gaze dataset, we first assess the relationship between state-of-the-art object and scene recognition models, bottom-up visual saliency, and human gaze. We then propose a novel split attention model for image captioning. Our model integrates human gaze information into an attention-based long short-term memory architecture, and allows the algorithm to allocate attention…
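
The abstract is cut off above, so the exact formulation of the split attention model is not available in this text. As a rough illustration only, the following is a minimal sketch of one way a soft attention step could treat gazed and non-gazed image regions separately. It is not the authors' implementation: the function name `gaze_split_attention`, the linear scoring vectors `w_gazed` and `w_other`, and the per-region gaze weights in [0, 1] are all assumptions made for this example.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def gaze_split_attention(features, hidden, gaze, w_gazed, w_other):
    """One hypothetical gaze-split soft attention step.

    features: (L, D) array, one feature vector per image region.
    hidden:   (H,) array, the current LSTM hidden state.
    gaze:     (L,) array in [0, 1], how strongly each region was fixated.
    w_gazed, w_other: (D + H,) scoring vectors (assumed, for illustration)
                      used for fixated and non-fixated regions respectively.
    Returns the attention weights (L,) and the context vector (D,).
    """
    num_regions = features.shape[0]
    # Pair every region feature with the same hidden state.
    joint = np.concatenate(
        [features, np.tile(hidden, (num_regions, 1))], axis=1
    )
    # Score each region with both scorers, then blend by the gaze weight.
    score_gazed = joint @ w_gazed
    score_other = joint @ w_other
    scores = gaze * score_gazed + (1.0 - gaze) * score_other
    # Normalize over regions and pool the features.
    alpha = softmax(scores)
    context = alpha @ features
    return alpha, context
```

In this sketch, a region with gaze weight 1 is scored only by `w_gazed` and a region with gaze weight 0 only by `w_other`, so the model can still attend to regions that the human viewers did not look at. The resulting context vector would be fed to the LSTM at each decoding step in the usual attention-based captioning setup.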

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Gaze Tracking and Assistive Technology