Towards Retrieval-Augmented Architectures for Image Captioning
Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Alessandro Nicolosi, Rita Cucchiara

TL;DR
This paper introduces retrieval-augmented image captioning models that leverage external visual memory to enhance caption quality, demonstrating significant improvements on COCO and nocaps datasets.
Contribution
It presents a novel retrieval-augmented architecture with external memory and a knowledge retriever, advancing image captioning methods beyond traditional deep learning models.
Findings
External memory improves caption quality significantly.
A larger retrieval corpus yields better results.
The proposed models outperform the baseline on the COCO and nocaps datasets.
Abstract
The objective of image captioning models is to bridge the gap between the visual and linguistic modalities by generating natural language descriptions that accurately reflect the content of input images. In recent years, researchers have leveraged deep learning-based models and made advances in the extraction of visual features and the design of multimodal connections to tackle this task. This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process. Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities, a differentiable encoder to represent input images, and a kNN-augmented language model to predict tokens based on contextual cues and text retrieved from the external memory. We experimentally validate our approach on COCO and…
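The retrieval step mentioned in the abstract (looking up text in an external memory by visual similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity scoring, the function names `cosine_similarity` and `retrieve_captions`, the three-dimensional toy embeddings, and the example captions are all assumptions made for illustration. In practice the embeddings would come from a visual encoder and the search would likely use an approximate kNN index.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_captions(
    query: Sequence[float],
    memory: List[Tuple[Sequence[float], str]],
    k: int = 2,
) -> List[str]:
    """Return the captions of the k memory entries most similar to the query.

    `memory` is a hypothetical external memory: a list of
    (image_embedding, caption) pairs.
    """
    scored = [(cosine_similarity(query, emb), cap) for emb, cap in memory]
    # Highest similarity first (exhaustive search; fine for a toy memory).
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cap for _, cap in scored[:k]]


# Toy memory with made-up embeddings and captions (illustrative only).
memory = [
    ([1.0, 0.0, 0.0], "a dog running on grass"),
    ([0.9, 0.1, 0.0], "a puppy playing in a park"),
    ([0.0, 1.0, 0.0], "a plate of pasta on a table"),
]
query = [1.0, 0.05, 0.0]  # stand-in for the encoded input image
retrieved = retrieve_captions(query, memory, k=2)
print(retrieved)
```

In a retrieval-augmented captioner, the retrieved captions would then be passed to the language model as additional context when predicting tokens. How that conditioning is done is specific to each model variant and is not covered by this sketch.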
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
