Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning
Zhuolin Yang, Wei Ping, Zihan Liu, Vijay Korthikanti, Weili Nie, De-An Huang, Linxi Fan, Zhiding Yu, Shiyi Lan, Bo Li, Ming-Yu Liu, Yuke Zhu, Mohammad Shoeybi, Bryan Catanzaro, Chaowei Xiao, Anima Anandkumar

TL;DR
Re-ViLM is a retrieval-augmented visual language model that enhances zero and few-shot image captioning by retrieving knowledge from external databases, reducing model size and improving adaptability to new data.
Contribution
This work introduces Re-ViLM, a novel retrieval-augmented model that supports efficient zero and few-shot image captioning with fewer parameters and easier data updates.
Findings
Re-ViLM outperforms baseline models in zero-shot and few-shot image captioning.
Re-ViLM uses 4 times fewer parameters than comparable methods.
The model demonstrates strong performance in out-of-domain settings.
Abstract
Augmenting pretrained language models (LMs) with a vision encoder (e.g., Flamingo) has achieved state-of-the-art results in image-to-text generation. However, these models store all of their knowledge within their parameters, and therefore often require an enormous number of parameters to model the abundant visual concepts and rich textual descriptions. Additionally, they are inefficient at incorporating new data, requiring a computationally expensive fine-tuning process. In this work, we introduce a Retrieval-augmented Visual Language Model, Re-ViLM, built upon Flamingo, that supports retrieving relevant knowledge from an external database for zero-shot and in-context few-shot image-to-text generation. By storing certain knowledge explicitly in the external database, our approach reduces the number of model parameters and can easily accommodate new data during evaluation by simply updating…
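The retrieval idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the summary here does not specify the retriever, the embedding model, or the database format, so the `CaptionDatabase` class, the `cosine` helper, the three-dimensional embeddings, and the captions below are all hypothetical. The sketch only shows the general pattern of looking up the captions whose image embeddings are nearest to a query, and of adding new data by appending to the database rather than fine-tuning.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class CaptionDatabase:
    """Hypothetical external memory of (image embedding, caption) pairs."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, caption):
        # Adding knowledge is just an append; no model parameters change.
        self.entries.append((list(embedding), caption))

    def retrieve(self, query, k=2):
        # Rank stored entries by similarity to the query embedding.
        ranked = sorted(
            self.entries, key=lambda e: cosine(query, e[0]), reverse=True
        )
        return [caption for _, caption in ranked[:k]]


# Made-up embeddings standing in for the output of an image encoder.
db = CaptionDatabase()
db.add([1.0, 0.0, 0.0], "a dog running on grass")
db.add([0.0, 1.0, 0.0], "a plate of pasta")
db.add([0.9, 0.1, 0.0], "a puppy playing in a park")

# Retrieved captions would be passed to the language model as extra context.
retrieved = db.retrieve([1.0, 0.05, 0.0], k=2)

# Accommodating new data: update the database, then retrieve again.
db.add([0.0, 0.0, 1.0], "a red sailboat at sea")
updated = db.retrieve([0.0, 0.1, 1.0], k=1)
```

A real system would likely replace the linear scan with an approximate nearest-neighbor index and use embeddings from a pretrained image encoder; the exhaustive `sorted` call is used here only to keep the example dependency-free.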
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
