Guiding Long-Short Term Memory for Image Caption Generation
Xu Jia, Efstratios Gavves, Basura Fernando, Tinne Tuytelaars

TL;DR
This paper introduces gLSTM, an extension of the LSTM for image captioning that feeds semantic information extracted from the image into every LSTM unit and applies length normalization during beam search, achieving results on par with or better than the state of the art on Flickr8K, Flickr30K and MS COCO.
Contribution
The paper presents a novel gLSTM model that integrates semantic image features into each LSTM unit for better caption generation.
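The idea of feeding image semantics into each unit can be sketched as a standard LSTM step in which every gate and the cell candidate receive one extra term computed from a guidance vector. The following is a minimal illustration, not the authors' implementation: the parameter names (`Wx`, `Wm`, `Wg`), the omission of biases, the use of `tanh` on the cell output, and the random initialization are all assumptions made for this example.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_params(input_dim, hidden_dim, guide_dim, seed=0):
    """Random weights for the input, forget and output gates and the cell candidate."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        name: {
            "Wx": mat(hidden_dim, input_dim),   # acts on the current word embedding
            "Wm": mat(hidden_dim, hidden_dim),  # acts on the previous hidden state
            "Wg": mat(hidden_dim, guide_dim),   # acts on the guidance vector (the extra input)
        }
        for name in ("i", "f", "o", "c")
    }


def glstm_step(params, x, m_prev, c_prev, g):
    """One time step. g is the same image-derived vector at every step (assumed)."""

    def pre(name):
        # Pre-activation: word term + recurrent term + guidance term.
        p = params[name]
        terms = zip(matvec(p["Wx"], x), matvec(p["Wm"], m_prev), matvec(p["Wg"], g))
        return [u + v + w for u, v, w in terms]

    i_gate = [sigmoid(z) for z in pre("i")]
    f_gate = [sigmoid(z) for z in pre("f")]
    o_gate = [sigmoid(z) for z in pre("o")]
    cand = [math.tanh(z) for z in pre("c")]

    # New cell state: keep part of the old state, add part of the candidate.
    c_new = [f * c + i * a for f, c, i, a in zip(f_gate, c_prev, i_gate, cand)]
    # New hidden state: gated view of the cell state.
    m_new = [o * math.tanh(c) for o, c in zip(o_gate, c_new)]
    return m_new, c_new
```

With `g` set to all zeros the guidance terms vanish and the step reduces to an ordinary LSTM update, which makes the extra input easy to ablate in this sketch.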
Findings
gLSTM outperforms a standard LSTM on Flickr8K, Flickr30K and MS COCO
Semantic guidance ties generated captions more closely to the image content
Length normalization keeps beam search from favoring short sentences
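To see why length normalization matters, note that beam search usually scores a caption by the sum of its token log-probabilities, so every extra word can only lower the score. The sketch below shows one simple normalization, dividing by the length raised to a power `alpha`. It is an illustrative assumption and not necessarily one of the strategies evaluated in the paper; the function names and the toy probabilities are hypothetical.

```python
import math


def normalized_score(log_probs, alpha=1.0):
    """Sum of token log-probabilities divided by length**alpha.

    alpha=0 gives the raw sum; alpha=1 gives the per-token average.
    """
    return sum(log_probs) / (len(log_probs) ** alpha)


def rerank(candidates, alpha=1.0):
    """Sort (caption, log_probs) pairs from best to worst normalized score."""
    return sorted(
        candidates,
        key=lambda cand: normalized_score(cand[1], alpha),
        reverse=True,
    )


# Toy candidates: a short caption with lower per-token confidence
# and a longer caption with higher per-token confidence.
short_cap = ("a dog", [math.log(0.5)] * 2)
long_cap = ("a dog runs on the grass", [math.log(0.6)] * 6)

best_raw = rerank([short_cap, long_cap], alpha=0.0)[0][0]
best_normalized = rerank([short_cap, long_cap], alpha=1.0)[0][0]
```

In this toy case the raw sum prefers the two-word caption, while the per-token average prefers the six-word caption because each of its tokens is more probable.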
Abstract
In this work we focus on the problem of image caption generation. We propose an extension of the long short-term memory (LSTM) model, which we coin gLSTM for short. In particular, we add semantic information extracted from the image as extra input to each unit of the LSTM block, with the aim of guiding the model towards solutions that are more tightly coupled to the image content. Additionally, we explore different length normalization strategies for beam search to avoid favoring short sentences. On various benchmark datasets such as Flickr8K, Flickr30K and MS COCO, we obtain results that are on par with or even outperform the current state-of-the-art.
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
