CNN+CNN: Convolutional Decoders for Image Captioning
Qingzhong Wang, Antoni B. Chan

TL;DR
This paper introduces a convolutional neural network framework for image captioning, achieving faster training and comparable or better performance than traditional RNN/LSTM-based models by leveraging parallel computation.
Contribution
The authors propose a CNN-only model for image captioning that outperforms LSTM-based models in training speed and achieves competitive or superior captioning metrics.
Findings
CNN-based model trains 3 times faster than LSTM-based NIC.
The model achieves comparable BLEU and METEOR scores, higher CIDEr scores.
Outperforms hierarchical LSTMs on paragraph annotation tasks.
Abstract
Image captioning is a challenging task that combines the field of computer vision and natural language processing. A variety of approaches have been proposed to achieve the goal of automatically describing an image, and recurrent neural network (RNN) or long-short term memory (LSTM) based models dominate this field. However, RNNs or LSTMs cannot be calculated in parallel and ignore the underlying hierarchical structure of a sentence. In this paper, we propose a framework that only employs convolutional neural networks (CNNs) to generate captions. Owing to parallel computing, our basic model is around 3 times faster than NIC (an LSTM-based model) during training time, while also providing better results. We conduct extensive experiments on MSCOCO and investigate the influence of the model width and depth. Compared with LSTM-based models that apply similar attention mechanisms, our…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques
