Emu: Generative Pretraining in Multimodality
Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang

TL;DR
Emu is a versatile Transformer-based multimodal foundation model capable of generating and understanding images and texts across various data types, demonstrating superior performance on multiple zero-shot and few-shot tasks.
Contribution
Introduces Emu, a unified autoregressive multimodal model that processes any combination of visual and textual data, enabling broad applications and improved performance.
Findings
Outperforms state-of-the-art large multimodal models on various tasks.
Supports diverse data sources like videos, webpages, and web-scale image-text pairs.
Demonstrates capabilities as a multimodal assistant with instruction tuning.
Abstract
We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist…
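The unified objective described in the abstract (classify the next text token, or regress the next visual embedding) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `("text", …)` / `("visual", …)` target encoding, the equal weighting of the two terms, and the averaging over positions are all assumptions made for illustration.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def l2_loss(predicted, target):
    """Squared Euclidean distance between predicted and target embeddings."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def unified_loss(predictions, targets):
    """Average next-element loss over an interleaved multimodal sequence.

    predictions[i]: vocabulary logits if the next element is a text token,
                    or a predicted embedding if it is a visual embedding.
    targets[i]:     ("text", token_id) or ("visual", embedding).
    Equal weighting of the two terms is an assumption of this sketch.
    """
    total = 0.0
    for pred, (kind, target) in zip(predictions, targets):
        if kind == "text":
            total += cross_entropy(pred, target)   # classification term
        elif kind == "visual":
            total += l2_loss(pred, target)         # regression term
        else:
            raise ValueError(f"unknown target kind: {kind}")
    return total / len(targets)


# Hypothetical toy sequence: one text position followed by one visual position.
toy_predictions = [[0.0, 0.0], [1.0, 0.0]]
toy_targets = [("text", 0), ("visual", [0.0, 0.0])]
toy_loss = unified_loss(toy_predictions, toy_targets)
```

In a real model both terms would be computed on batched tensors and backpropagated through the shared Transformer; the sketch only shows how a single sequence mixes the two loss types.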
Peer Reviews
Decision·ICLR 2024 poster
+ I agree that the importance of video as a data source for learning large multimodal models has been overlooked so far. Leveraging videos as interleaved data will definitely provide much more diverse supervision signals and facilitate scaling up training data.
+ The writing and presentation of the paper are really good. The paper is overall well-written and easy to read. In particular, the authors describe all the details of the model architecture and training/inference procedures.
+ The experimenta
Emu applies the regression loss to latent embeddings computed by the Causal Transformer, whose parameters are randomly initialized and also learned during pretraining. I was surprised that the training went well with the proposed objective, because I think that without additional constraints, the model may easily fall into a degenerate case, like the Causal Transformer always outputting constant vectors. Please elaborate on the mechanism of the proposed l2 regression loss.
The model shows the ability to do versatile generation and a strong in-context learning capability.
Certain model details are not clear:
1. How does the Causal Transformer convert an image into multiple visual tokens? In Section 2, is {z_1, z_2, ... z_N} the same as g(I)?
2. Are the N visual embeddings for the image decoder the same as the visual embeddings after the Causal Transformer?
Potential data issues may impact the "zero-shot" results: Emu is trained with LAION-COCO, which has image captions in the style of COCO. How does that impact the results in Table 1? Are the zero-shot results as good if removing
- The paper is well-organized and the problem is clearly defined. The authors provide a comprehensive introduction to the problem and the proposed solution, Emu, which appears to be novel and well thought out.
- The unified objective for both text and visual data seems to be a promising approach to handle multimodal tasks, and the autoregressive training process is well justified.
- The authors have undertaken extensive evaluations including zero-shot, few-shot, and in-the-wild evaluations, sh
- The paper misses some key and very relevant comparative works such as CM3Leon (https://arxiv.org/abs/2309.02591) and AnyMAL (https://arxiv.org/abs/2309.16058). The authors should compare against these papers and explain how their work differs from them.
- It would be beneficial to see a discussion on the scalability of Emu with respect to the size and diversity of training data, and how the model might perform with fewer resources or less diverse data.
- Autoregressive models are known to be
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques
