Analysis of Convolutional Decoder for Image Caption Generation
Sulabh Katiyar, Samir Kumar Borgohain

TL;DR
This paper investigates convolutional neural network decoders for image captioning and finds that they model long sentences poorly and gain little from added network depth, data augmentation, or attention mechanisms.
Contribution
The study provides a comprehensive analysis of convolutional decoders for image captioning, highlighting their performance constraints compared to recurrent decoders.
Findings
Convolutional decoders do not benefit from increased network depth.
Data augmentation offers limited improvements for convolutional decoders.
Convolutional decoders perform well only on shorter sentences of up to 15 words.
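The sentence-length finding concerns which captions are used for training. As a minimal sketch of capping caption length, assuming plain whitespace tokenization and a hypothetical helper named `filter_captions_by_length` (neither is taken from the paper), the filtering step might look like this:

```python
def filter_captions_by_length(captions, max_words=15):
    """Keep captions with at most `max_words` whitespace-separated tokens.

    Hypothetical helper for illustration only; the paper's own
    preprocessing may tokenize or truncate captions differently.
    """
    return [c for c in captions if len(c.split()) <= max_words]


# Made-up example captions, not drawn from Flickr8k or Flickr30k.
captions = [
    "a dog runs across the grass",
    "a man in a red jacket is standing on a snowy mountain while holding "
    "two ski poles and looking at the camera",
]
short_only = filter_captions_by_length(captions, max_words=15)
```

Here the second caption has more than 15 words and is dropped, so only the first one remains in `short_only`.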
Abstract
Recently, Convolutional Neural Networks have been proposed for Sequence Modelling tasks such as Image Caption Generation. However, unlike that of Recurrent Neural Networks, the performance of Convolutional Neural Networks as Decoders for Image Caption Generation has not been extensively studied. In this work, we analyse how various aspects of Convolutional Neural Network based Decoders, such as Network complexity and depth, use of Data Augmentation, Attention mechanism, and length of sentences used during training, affect the performance of the model. We perform experiments using the Flickr8k and Flickr30k image captioning datasets and observe that, unlike a Recurrent Neural Network based Decoder, a Convolutional Decoder for Image Captioning does not generally benefit from an increase in network depth, in the form of stacked Convolutional Layers, or from the use of Data Augmentation techniques. In addition, use of…
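To make "network depth, in the form of stacked Convolutional Layers" concrete, the following is a rough sketch of a stack of causal 1-D convolutions, the usual building block of a convolutional sequence decoder. It is not the paper's architecture: scalar features, hand-set kernel weights, ReLU between layers, and the function names `causal_conv1d`, `conv_decoder_stack`, and `receptive_field` are all assumptions made for illustration.

```python
def causal_conv1d(seq, kernel):
    """Causal 1-D convolution: output[t] depends only on seq[0..t]."""
    k = len(kernel)
    # Left-pad with zeros so that no future position is ever visible.
    padded = [0.0] * (k - 1) + list(seq)
    return [
        sum(w * x for w, x in zip(kernel, padded[t:t + k]))
        for t in range(len(seq))
    ]


def conv_decoder_stack(seq, kernels):
    """Apply one causal convolution per kernel, with ReLU after each layer."""
    out = list(seq)
    for kernel in kernels:
        out = [max(0.0, v) for v in causal_conv1d(out, kernel)]
    return out


def receptive_field(kernels):
    """Positions (current plus past) visible to the top layer of the stack,
    assuming stride 1 and no dilation."""
    return 1 + sum(len(kernel) - 1 for kernel in kernels)


# Illustrative depth of three layers with kernel width 3 (assumed values).
kernels = [[0.5, 0.25, 0.25]] * 3
hidden = conv_decoder_stack([1.0, 2.0, 3.0, 4.0], kernels)
context = receptive_field(kernels)
```

In this sketch each added layer widens the context by the kernel width minus one, which is the mechanism by which depth could help with longer sentences; the abstract reports that, in practice, the convolutional decoder does not generally benefit from it.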
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
