Boosting Image Captioning with Attributes
Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, Tao Mei

TL;DR
This paper introduces LSTM-A, an innovative image captioning model that integrates attributes with CNNs and RNNs, significantly improving captioning accuracy on the COCO dataset.
Contribution
The paper presents a novel architecture that end-to-end incorporates attributes into CNN-RNN frameworks for image captioning, enhancing performance.
Findings
Achieved state-of-the-art METEOR and CIDEr-D scores on COCO dataset.
Outperformed existing deep models in image captioning tasks.
Obtained top-1 performance on COCO captioning leaderboard.
Abstract
Automatically describing an image with a natural language has been an emerging challenge in both fields of computer vision and natural language processing. In this paper, we present Long Short-Term Memory with Attributes (LSTM-A) - a novel architecture that integrates attributes into the successful Convolutional Neural Networks (CNNs) plus Recurrent Neural Networks (RNNs) image captioning framework, by training them in an end-to-end manner. To incorporate attributes, we construct variants of architectures by feeding image representations and attributes into RNNs in different ways to explore the mutual but also fuzzy relationship between them. Extensive experiments are conducted on COCO image captioning dataset and our framework achieves superior results when compared to state-of-the-art deep models. Most remarkably, we obtain METEOR/CIDEr-D of 25.2%/98.6% on testing data of widely used…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods1x1 Convolution · Convolution · Average Pooling · Local Response Normalization · Auxiliary Classifier · Inception Module · *Communicated@Fast*How Do I Communicate to Expedia? · Dropout · Dense Connections · Max Pooling
