Retrieval-Augmented Multimodal Language Modeling

Michihiro Yasunaga; Armen Aghajanyan; Weijia Shi; Rich James; Jure Leskovec; Percy Liang; Mike Lewis; Luke Zettlemoyer; Wen-tau Yih

arXiv:2211.12561 · cs.CV · June 7, 2023 · 29 cites

TL;DR

This paper introduces RA-CM3, a retrieval-augmented multimodal model that fetches relevant external information to improve text and image generation, outperforming prior models like DALL-E with less training compute.

Contribution

The paper presents RA-CM3, which the authors describe as the first multimodal model that can both retrieve and generate text and images. Retrieval lets the model draw on knowledge held in external memory rather than storing all of it in its parameters.

Findings

01. RA-CM3 outperforms DALL-E and CM3 on image and caption generation tasks.

02. RA-CM3 requires less training compute (<30% of DALL-E).

03. RA-CM3 demonstrates novel capabilities like faithful image generation and in-context learning.

Abstract

Recent multimodal models such as DALL-E and CM3 have achieved remarkable progress in text-to-image and image-to-text generation. However, these models store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the model parameters, requiring increasingly larger models and training data to capture more knowledge. To integrate knowledge in a more scalable and modular way, we propose a retrieval-augmented multimodal model, which enables a base multimodal model (generator) to refer to relevant text and images fetched by a retriever from external memory (e.g., documents on the web). Specifically, for the retriever, we use a pretrained CLIP, and for the generator, we train a CM3 Transformer on the LAION dataset. Our resulting model, named Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can retrieve and generate both text and images. We show that RA-CM3…
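The retrieve-then-generate flow described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the hand-written three-dimensional vectors stand in for CLIP embeddings, the `memory` entries are invented, and the names `cosine`, `retrieve`, and `build_generator_input`, as well as the `[doc]`/`[query]` prompt layout, are hypothetical. The paper's actual retrieval strategy and generator input format are not given in this listing.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, memory, k=2):
    """Return the k memory entries most similar to the query embedding."""
    ranked = sorted(
        memory,
        key=lambda doc: cosine(query_vec, doc["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_generator_input(query_text, retrieved):
    """Prepend retrieved documents so the generator can refer to them.

    The layout below is a hypothetical placeholder, not the paper's format.
    """
    context = [f"[doc] {doc['caption']} <image:{doc['image_id']}>" for doc in retrieved]
    return "\n".join(context + [f"[query] {query_text}"])


# Toy external memory: each entry pairs a caption with an image reference.
# The embeddings are made up; a real system would compute them with a
# pretrained encoder such as CLIP.
memory = [
    {"caption": "Eiffel Tower at night", "image_id": "img_001", "embedding": [0.9, 0.1, 0.0]},
    {"caption": "A bowl of ramen", "image_id": "img_002", "embedding": [0.0, 0.2, 0.9]},
    {"caption": "Paris skyline", "image_id": "img_003", "embedding": [0.8, 0.3, 0.1]},
]

query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedded query
top = retrieve(query_vec, memory, k=2)
prompt = build_generator_input("A photo of the Eiffel Tower", top)
print(prompt)
```

In this sketch the knowledge about the Eiffel Tower lives in `memory`, so it could be updated or extended without retraining the generator, which is the modularity argument the abstract makes.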

Videos

Retrieval-Augmented Multimodal Language Modeling · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Adam · Absolute Position Encodings · Linear Layer · Dense Connections · Residual Connection · Byte Pair Encoding · Position-Wise Feed-Forward Layer