Exploring Overall Contextual Information for Image Captioning in Human-Like Cognitive Style
Hongwei Ge, Zehang Yan, Kai Zhang, Mingde Zhao, Liang Sun

TL;DR
This paper introduces a human-like cognitive approach to image captioning using a novel bidirectional LSTM network and cross-modal attention, leading to improved caption quality and state-of-the-art results on COCO.
Contribution
It proposes MaBi-LSTMs for capturing overall contextual information and a cross-modal attention mechanism to enhance image captioning performance.
Findings
Achieves state-of-the-art results on the Microsoft COCO dataset.
Improves encoder-decoder models by adding bidirectional context understanding.
Enhances caption quality by fusing forward and backward sentence information.
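To make the idea of fusing forward and backward sentence information concrete, the sketch below weights two per-step hidden states with softmax attention scores and sums them. It is only an illustration under stated assumptions, not the paper's MaBi-LSTM or its cross-modal attention: the function name `fuse_directions`, the `query` vector (standing in for some image-derived feature), and the dot-product scoring are all hypothetical choices made here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_directions(forward_states, backward_states, query):
    """Fuse per-step forward and backward hidden states.

    Hypothetical scheme (an assumption, not taken from the paper):
    each direction is scored by its dot product with `query`, the two
    scores are normalized with softmax, and the states are averaged
    using those weights.
    """
    fused = []
    for h_fwd, h_bwd in zip(forward_states, backward_states):
        w_fwd, w_bwd = softmax([dot(query, h_fwd), dot(query, h_bwd)])
        fused.append([w_fwd * f + w_bwd * b for f, b in zip(h_fwd, h_bwd)])
    return fused


if __name__ == "__main__":
    forward = [[1.0, 0.0], [0.5, 0.5]]   # toy forward hidden states
    backward = [[0.0, 1.0], [0.2, 0.8]]  # toy backward hidden states
    print(fuse_directions(forward, backward, query=[1.0, 0.0]))
```

With a zero query both directions receive equal weight, so the fusion reduces to a plain average; a query aligned with one direction shifts the weight toward it.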
Abstract
Image captioning is a research hotspot where encoder-decoder models combining convolutional neural network (CNN) and long short-term memory (LSTM) achieve promising results. Despite significant progress, these models generate sentences differently from human cognitive styles. Existing models often generate a complete sentence from the first word to the end, without considering the influence of the following words on the whole sentence generation. In this paper, we explore the utilization of a human-like cognitive style, i.e., building overall cognition for the image to be described and the sentence to be constructed, for enhancing computer image understanding. This paper first proposes a Mutual-aid network structure with Bidirectional LSTMs (MaBi-LSTMs) for acquiring overall contextual information. In the training process, the forward and backward LSTMs encode the succeeding and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
