TL;DR
This paper systematically compares 17 CNN architectures to determine their effectiveness in image caption generation, revealing that higher complexity and object recognition accuracy do not always lead to better captioning performance.
Contribution
It evaluates 17 CNN architectures as visual feature extractors on two image captioning frameworks, addressing the lack of a systematic comparison of their relative efficacy.
Findings
Model complexity does not correlate with captioning performance.
Object recognition accuracy is not a reliable predictor of captioning quality.
Certain CNN architectures outperform others regardless of size or recognition accuracy.
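One way to check a claim such as "model complexity does not correlate with captioning performance" is a rank correlation between each encoder's parameter count and its caption score. The sketch below is a minimal, standard-library Spearman correlation. It is an assumption about how such a check could be done, not the paper's analysis; the names `ranks`, `pearson`, and `spearman` are illustrative, and the numbers in the usage example are hypothetical placeholders, not results from the paper.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # HYPOTHETICAL placeholder values, NOT taken from the paper.
    params_millions = [5.0, 25.0, 60.0, 140.0]
    caption_scores = [0.90, 1.00, 0.80, 0.95]
    print(f"Spearman rho: {spearman(params_millions, caption_scores):.3f}")
```

A coefficient near zero would be consistent with the finding above, while values near +1 or -1 would indicate a strong monotonic relationship. The same function could be applied to object recognition accuracy in place of parameter count.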
Abstract
Aided by recent advances in Deep Learning, Image Caption Generation has seen tremendous progress over the last few years. Most methods use transfer learning to extract visual information, in the form of image features, with the help of pre-trained Convolutional Neural Network models followed by transformation of the visual information using a Caption Generator module to generate the output sentences. Different methods have used different Convolutional Neural Network Architectures and, to the best of our knowledge, there is no systematic study which compares the relative efficacy of different Convolutional Neural Network architectures for extracting the visual information. In this work, we have evaluated 17 different Convolutional Neural Networks on two popular Image Caption Generation frameworks: the first based on Neural Image Caption (NIC) generation model and the second based on…
