Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning

Zhuolin Yang; Wei Ping; Zihan Liu; Vijay Korthikanti; Weili Nie; De-An Huang; Linxi Fan; Zhiding Yu; Shiyi Lan; Bo Li; Ming-Yu Liu; Yuke Zhu; Mohammad Shoeybi; Bryan Catanzaro; Chaowei Xiao; Anima Anandkumar

arXiv:2302.04858·cs.CV·October 24, 2023·1 cites

Open Access

TL;DR

Re-ViLM is a retrieval-augmented visual language model that enhances zero and few-shot image captioning by retrieving knowledge from external databases, reducing model size and improving adaptability to new data.

Contribution

This work introduces Re-ViLM, a novel retrieval-augmented model that supports efficient zero and few-shot image captioning with fewer parameters and easier data updates.

Findings

01

Re-ViLM outperforms baseline models in zero-shot and few-shot image captioning.

02

Re-ViLM uses 4 times fewer parameters than comparable methods.

03

The model demonstrates strong performance in out-of-domain settings.

Abstract

Augmenting pretrained language models (LMs) with a vision encoder (e.g., Flamingo) has achieved state-of-the-art results in image-to-text generation. However, these models store all their knowledge within their parameters, and thus often require an enormous number of parameters to model the abundant visual concepts and rich textual descriptions. They are also inefficient at incorporating new data, which requires a computationally expensive fine-tuning process. In this work, we introduce a Retrieval-augmented Visual Language Model, Re-ViLM, built upon Flamingo, that supports retrieving relevant knowledge from an external database for zero-shot and in-context few-shot image-to-text generation. By storing certain knowledge explicitly in the external database, our approach reduces the number of model parameters and can easily accommodate new data during evaluation by simply updating…
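
The retrieval idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the toy three-dimensional embeddings, the cosine-similarity scoring, and the names `DATABASE`, `retrieve_captions`, and `add_entry` are all assumptions made for illustration. The sketch only shows how captions stored in an external database might be looked up for a query image embedding, and how new data could be accommodated by appending to the database rather than retraining.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical entry type: (image embedding, caption text).
Entry = Tuple[List[float], str]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_captions(query: Sequence[float],
                      database: List[Entry],
                      k: int = 2) -> List[str]:
    """Return the captions of the k entries most similar to the query."""
    ranked = sorted(database,
                    key=lambda entry: cosine(query, entry[0]),
                    reverse=True)
    return [caption for _, caption in ranked[:k]]


def add_entry(database: List[Entry],
              embedding: Sequence[float],
              caption: str) -> None:
    """Accommodate new data by extending the database; no model update."""
    database.append((list(embedding), caption))


# Toy database with made-up embeddings (assumption, not real features).
DATABASE: List[Entry] = [
    ([1.0, 0.0, 0.0], "a dog running on grass"),
    ([0.9, 0.1, 0.0], "a puppy playing in a park"),
    ([0.0, 1.0, 0.0], "a plate of pasta on a table"),
]

if __name__ == "__main__":
    # The retrieved captions would be passed to the captioning model as context.
    print(retrieve_captions([1.0, 0.05, 0.0], DATABASE, k=2))
```

In a real system the embeddings would come from an image encoder and the lookup would likely use an approximate nearest-neighbor index; both are outside the scope of this sketch.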

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques