LMCap: Few-shot Multilingual Image Captioning by Retrieval Augmented Language Model Prompting

Rita Ramos; Bruno Martins; Desmond Elliott

arXiv:2305.19821 · cs.CL · June 1, 2023 · 1 citation

TL;DR

LMCap introduces a retrieval-augmented, few-shot multilingual image captioning approach that leverages language models and retrieved captions, eliminating the need for large-scale multilingual caption datasets.

Contribution

It proposes a novel image-blind captioning method using retrieval and prompting, bypassing traditional training on multilingual caption data.

Findings

01 Competitive with fully-supervised models on XM3600

02 No multilingual caption training data required

03 Evaluated on geographically diverse images

Abstract

Multilingual image captioning has recently been tackled by training with large-scale machine translated data, which is an expensive, noisy, and time-consuming process. Without requiring any multilingual caption data, we propose LMCap, an image-blind few-shot multilingual captioning model that works by prompting a language model with retrieved captions. Specifically, instead of following the standard encoder-decoder paradigm, given an image, LMCap first retrieves the captions of similar images using a multilingual CLIP encoder. These captions are then combined into a prompt for an XGLM decoder, in order to generate captions in the desired language. In other words, the generation model does not directly process the image, instead processing retrieved captions. Experiments on the XM3600 dataset of geographically diverse images show that our model is competitive with fully-supervised…
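The retrieve-then-prompt flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 3-dimensional vectors stand in for multilingual CLIP embeddings, the datastore contents are invented, and the function names and prompt wording are hypothetical. No language model is called; the resulting string is what would be handed to a decoder such as XGLM.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_captions(image_vec, datastore, k=2):
    """Return the captions of the k datastore entries most similar to the image.

    `datastore` is a list of (embedding, caption) pairs. In the paper the
    embeddings come from a multilingual CLIP encoder; here they are toy values.
    """
    ranked = sorted(datastore, key=lambda item: cosine(image_vec, item[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]

def build_prompt(retrieved, language):
    """Combine retrieved captions into a text prompt (hypothetical template)."""
    lines = ["Similar images show:"]
    lines += [f"- {caption}" for caption in retrieved]
    lines.append(f"A caption for this image in {language} is:")
    return "\n".join(lines)

# Invented datastore: (toy embedding, caption) pairs.
datastore = [
    ([0.9, 0.1, 0.0], "a dog running on a beach"),
    ([0.8, 0.3, 0.1], "a puppy playing in the sand"),
    ([0.0, 1.0, 0.0], "a plate of pasta on a table"),
]

image_vec = [1.0, 0.0, 0.0]  # toy stand-in for the query image embedding
retrieved = retrieve_captions(image_vec, datastore, k=2)
prompt = build_prompt(retrieved, "Spanish")
print(prompt)
```

Note that the generator only ever sees text: the image influences the output solely through which captions are retrieved, which is what the abstract means by "image-blind".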

Code & Models

Repositories

ritaramo/lmcap
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training