TL;DR
This paper combines constrained beam search with fixed, pretrained word embeddings so that existing captioning models can use image-tagger output at test time, improving out-of-domain captioning of novel scenes and objects without retraining.
Contribution
The authors propose a test-time approach: constrained beam search forces selected image-tag words into the caption, and fixed, pretrained word embeddings extend the vocabulary to tag words unseen during training, so existing captioning models need no retraining.
Findings
Achieved state-of-the-art out-of-domain captioning results on MSCOCO, with improved in-domain results as well.
Significantly outperformed approaches that incorporate the same tag predictions into the learning algorithm.
Improved ImageNet caption quality by leveraging ground-truth labels.
Abstract
Existing image captioning models do not generalize well to out-of-domain images containing novel scenes or objects. This limitation severely hinders the use of these models in real-world applications dealing with images in the wild. We address this problem using a flexible approach that enables existing deep captioning architectures to take advantage of image taggers at test time, without re-training. Our method uses constrained beam search to force the inclusion of selected tag words in the output, and fixed, pretrained word embeddings to facilitate vocabulary expansion to previously unseen tag words. Using this approach we achieve state-of-the-art results for out-of-domain captioning on MSCOCO (and improved results for in-domain captioning). Perhaps surprisingly, our results significantly outperform approaches that incorporate the same tag predictions into the learning algorithm. We…
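To make the idea of forcing tag words into the output concrete, here is a minimal sketch of constrained beam search in Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy bigram table, and the rule that a constraint counts as satisfied when its exact word is emitted are all hypothetical. The sketch keeps one beam per set of satisfied constraints and only accepts a caption that ends after every required word has appeared.

```python
import heapq
import math
from collections import defaultdict
from typing import Callable, Dict, FrozenSet, List, Optional, Sequence, Tuple

Hypothesis = Tuple[float, Tuple[str, ...]]  # (log-probability, tokens so far)


def constrained_beam_search(
    next_log_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    constraints: Sequence[str],
    beam_size: int = 3,
    max_len: int = 8,
    eos: str = "<eos>",
) -> Optional[Hypothesis]:
    """Return the best finished hypothesis that contains every constraint word."""
    required: FrozenSet[str] = frozenset(constraints)
    # One beam per state, where a state is the set of constraints met so far.
    beams: Dict[FrozenSet[str], List[Hypothesis]] = {frozenset(): [(0.0, ())]}
    finished: List[Hypothesis] = []

    for _ in range(max_len):
        candidates: Dict[FrozenSet[str], List[Hypothesis]] = defaultdict(list)
        for state, hyps in beams.items():
            for score, tokens in hyps:
                for word, log_p in next_log_probs(tokens).items():
                    new_state = state | {word} if word in required else state
                    new_hyp = (score + log_p, tokens + (word,))
                    if word == eos:
                        # Accept an ending only once all constraints are met.
                        if new_state == required:
                            finished.append(new_hyp)
                        continue
                    candidates[new_state].append(new_hyp)
        # Prune each state's beam independently to the top `beam_size`.
        beams = {s: heapq.nlargest(beam_size, hs) for s, hs in candidates.items()}
        if not beams:
            break

    return max(finished) if finished else None


# Hypothetical toy "captioner": next-word probabilities depend on the last word.
TOY_BIGRAMS = {
    None: {"a": 0.95, "zebra": 0.05},
    "a": {"dog": 0.8, "zebra": 0.2},
    "dog": {"sits": 0.9, "<eos>": 0.1},
    "zebra": {"sits": 0.9, "<eos>": 0.1},
    "sits": {"<eos>": 1.0},
}


def toy_next_log_probs(tokens: Tuple[str, ...]) -> Dict[str, float]:
    last = tokens[-1] if tokens else None
    return {w: math.log(p) for w, p in TOY_BIGRAMS[last].items()}


score, tokens = constrained_beam_search(toy_next_log_probs, ["zebra"], beam_size=2)
print(" ".join(tokens))  # a zebra sits <eos>
```

Without constraints the toy model prefers "a dog sits"; with the tag "zebra" required, the search returns the most probable caption that includes it. In the setting the abstract describes, the constraint words would come from an image tagger and the scoring function from a trained captioning model, neither of which is reproduced here.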
