Exploring Overall Contextual Information for Image Captioning in Human-Like Cognitive Style

Hongwei Ge; Zehang Yan; Kai Zhang; Mingde Zhao; Liang Sun

arXiv:1910.06475 · cs.CV · October 16, 2019 · 6 cites

TL;DR

This paper introduces a human-like cognitive approach to image captioning using a novel bidirectional LSTM network and cross-modal attention, leading to improved caption quality and state-of-the-art results on COCO.

Contribution

The paper proposes a mutual-aid network structure with bidirectional LSTMs (MaBi-LSTMs) to capture overall contextual information, together with a cross-modal attention mechanism, to improve image captioning performance.

Findings

01. Achieves state-of-the-art results on the Microsoft COCO dataset.
02. Improves encoder-decoder models with bidirectional context understanding.
03. Enhances caption quality by fusing forward and backward sentence information.
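
To make the idea of fusing forward and backward sentence information concrete, here is a minimal sketch of a generic bidirectional LSTM encoder in NumPy. It is not the paper's MaBi-LSTMs: the mutual-aid connections and the cross-modal attention are not reproduced, and every name here (`lstm_step`, `run_lstm`, `bidirectional_states`), the parameter layout, and the random inputs are assumptions made for illustration only. It shows one common way to give each word position access to both preceding and succeeding context, namely concatenating the hidden states of two LSTMs that read the sequence in opposite directions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h, c, W, b):
    """One standard LSTM step (hypothetical layout: W is (4*H, D+H), b is (4*H,))."""
    H = h.shape[0]
    z = W @ np.concatenate([x, h]) + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2 * H])        # forget gate
    o = sigmoid(z[2 * H:3 * H])    # output gate
    g = np.tanh(z[3 * H:4 * H])    # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new


def run_lstm(xs, W, b, hidden):
    """Run an LSTM over xs (shape (T, D)) from a zero state; return all hidden states (T, H)."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    out = []
    for x in xs:
        h, c = lstm_step(x, h, c, W, b)
        out.append(h)
    return np.stack(out)


def bidirectional_states(xs, params_fwd, params_bwd, hidden):
    """Concatenate forward and backward hidden states per time step (illustrative fusion)."""
    fwd = run_lstm(xs, *params_fwd, hidden)
    # Read the sequence right-to-left, then flip back so index t lines up with word t.
    bwd = run_lstm(xs[::-1], *params_bwd, hidden)[::-1]
    return np.concatenate([fwd, bwd], axis=1)


# Toy usage with random stand-ins for word embeddings (assumed sizes).
rng = np.random.default_rng(0)
T, D, H = 5, 8, 4


def make():
    return (rng.normal(scale=0.1, size=(4 * H, D + H)), np.zeros(4 * H))


xs = rng.normal(size=(T, D))
states = bidirectional_states(xs, make(), make(), H)
print(states.shape)  # (5, 8)
```

Concatenation is only one possible fusion; summation or a learned projection would be equally plausible choices in a sketch like this.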

Abstract

Image captioning is a research hotspot where encoder-decoder models combining convolutional neural network (CNN) and long short-term memory (LSTM) achieve promising results. Despite significant progress, these models generate sentences differently from human cognitive styles. Existing models often generate a complete sentence from the first word to the end, without considering the influence of the following words on the whole sentence generation. In this paper, we explore the utilization of a human-like cognitive style, i.e., building overall cognition for the image to be described and the sentence to be constructed, for enhancing computer image understanding. This paper first proposes a Mutual-aid network structure with Bidirectional LSTMs (MaBi-LSTMs) for acquiring overall contextual information. In the training process, the forward and backward LSTMs encode the succeeding and…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory