LMCap: Few-shot Multilingual Image Captioning by Retrieval Augmented Language Model Prompting
Rita Ramos, Bruno Martins, Desmond Elliott

TL;DR
LMCap introduces a retrieval-augmented, few-shot multilingual image captioning approach that leverages language models and retrieved captions, eliminating the need for large-scale multilingual caption datasets.
Contribution
It proposes a novel image-blind captioning method using retrieval and prompting, bypassing traditional training on multilingual caption data.
Findings
Competitive with fully-supervised models
No need for captioning training data
Effective across diverse geographic images
Abstract
Multilingual image captioning has recently been tackled by training with large-scale machine translated data, which is an expensive, noisy, and time-consuming process. Without requiring any multilingual caption data, we propose LMCap, an image-blind few-shot multilingual captioning model that works by prompting a language model with retrieved captions. Specifically, instead of following the standard encoder-decoder paradigm, given an image, LMCap first retrieves the captions of similar images using a multilingual CLIP encoder. These captions are then combined into a prompt for an XGLM decoder, in order to generate captions in the desired language. In other words, the generation model does not directly process the image, instead processing retrieved captions. Experiments on the XM3600 dataset of geographically diverse images show that our model is competitive with fully-supervised…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques
MethodsContrastive Language-Image Pre-training
