Image Representations and New Domains in Neural Image Captioning
Jack Hessel, Nicolas Savva, Michael J. Wilber

TL;DR
This paper investigates whether recent advances in neural image captioning are driven primarily by language models rather than by image representations. It shows that a captioning model can still produce quality captions from poor image features, and it examines how dataset properties, such as the number of captions per image, affect captioning.
Contribution
It demonstrates that current captioning models rely heavily on language models and that dataset characteristics like multiple captions per image influence performance.
Findings
Caption quality remains high with poor image representations.
Having multiple captions per image is beneficial, but not an absolute requirement.
Dataset choice significantly impacts captioning results.
Abstract
We examine the possibility that recent promising results in automatic caption generation are due primarily to language models. By varying image representation quality produced by a convolutional neural network, we find that a state-of-the-art neural captioning algorithm is able to produce quality captions even when provided with surprisingly poor image representations. We replicate this result in a new, fine-grained, transfer learned captioning domain, consisting of 66K recipe image/title pairs. We also provide some experiments regarding the appropriateness of datasets for automatic captioning, and find that having multiple captions per image is beneficial, but not an absolute requirement.
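The abstract does not say how the image representation quality was varied. As a rough illustration only, the sketch below shows one hypothetical way to degrade a feature vector: blending it with Gaussian noise and checking how far the result drifts from the original. The function names, the `noise_level` parameter, and the 512-dimensional stand-in vector are assumptions made for this example, not details from the paper.

```python
import math
import random


def degrade_features(features, noise_level, rng):
    """Blend a feature vector with Gaussian noise.

    noise_level = 0.0 keeps the original values; noise_level = 1.0 gives pure noise.
    Hypothetical degradation scheme, not the one used in the paper.
    """
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be in [0, 1]")
    return [
        (1.0 - noise_level) * x + noise_level * rng.gauss(0.0, 1.0)
        for x in features
    ]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


rng = random.Random(0)
# Stand-in for a CNN feature vector; a real one would come from a trained network.
clean = [rng.gauss(0.0, 1.0) for _ in range(512)]

for level in (0.0, 0.25, 0.5, 0.75, 1.0):
    noisy = degrade_features(clean, level, rng)
    print(f"noise_level={level:.2f}  cosine={cosine_similarity(clean, noisy):.3f}")
```

In an experiment of this kind, the degraded vectors would be fed to the captioning model in place of the clean ones, and caption quality would be compared across noise levels.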
