Jointly Training Large Autoregressive Multimodal Models
Emanuele Aiello, Lili Yu, Yixin Nie, Armen Aghajanyan, Barlas Oguz

TL;DR
This paper introduces the JAM framework, a modular approach for jointly training large autoregressive models that generate high-quality multimodal outputs, combining text and image generation capabilities efficiently.
Contribution
The paper presents the first model explicitly designed for seamless multimodal generation, integrating existing models with a novel instruction-tuning strategy.
Findings
Unparalleled performance in multimodal output quality
Effective fusion of text and image generation models
Data-efficient instruction-tuning strategy
Abstract
In recent years, advances in the large-scale pretraining of language and text-to-image models have revolutionized the field of machine learning. Yet, integrating these two modalities into a single, robust model capable of generating seamless multimodal outputs remains a significant challenge. To address this gap, we present the Joint Autoregressive Mixture (JAM) framework, a modular approach that systematically fuses existing text and image generation models. We also introduce a specialized, data-efficient instruction-tuning strategy, tailored for mixed-modal generation tasks. Our final instruct-tuned model demonstrates unparalleled performance in generating high-quality multimodal outputs and represents the first model explicitly designed for this purpose.
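As an illustration of what fusing two existing models can mean in the simplest case, the sketch below interpolates the parameters of two models element by element. It is a minimal sketch, not the authors' implementation: the function name `average_state_dicts`, the flat-list parameter layout, and the `alpha` weight are all hypothetical, and we assume this corresponds roughly to the JAM-Uniform variant named in the reviews below. It also shows why such a merge needs two models of identical architecture: every parameter must have a same-shaped counterpart.

```python
def average_state_dicts(text_params, image_params, alpha=0.5):
    """Interpolate two parameter dicts element-wise.

    Hypothetical layout: each dict maps a parameter name to a flat list of
    floats. alpha is the weight given to the text model's parameters.
    """
    if text_params.keys() != image_params.keys():
        raise ValueError("architectures differ: parameter names do not match")
    merged = {}
    for name, w_text in text_params.items():
        w_image = image_params[name]
        if len(w_text) != len(w_image):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two weight vectors.
        merged[name] = [alpha * a + (1.0 - alpha) * b
                        for a, b in zip(w_text, w_image)]
    return merged


text_model = {"layer0.weight": [1.0, 3.0]}
image_model = {"layer0.weight": [3.0, 5.0]}
merged_model = average_state_dicts(text_model, image_model)
```

With `alpha=0.5` this is a plain average; other values would bias the merged model toward one parent, which is an extension we introduce here for illustration only.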
Peer Reviews
Decision: ICLR 2024 poster
This work proposes a novel approach to bridge two separate image2text and text2image models into a new unified model that can generate both images and text. The resulting model demonstrates a remarkable ability to generate interleaved image-text sequences, supported by an adequate qualitative comparison with GILL.
**Method is restricted.**
1. The method is not general and is nearly impossible for the community to follow: it requires image2text and text2image models with identical architectures, but nearly all available image-to-text and text-to-image models differ in architecture.
2. Thus, the only way for the community to test the method's effectiveness is to first pretrain two **identical** separate models, which I think is rather inefficient and prohibitive.

**Experiments seem too casual.**
To the best of my knowledge, this work is, if not the earliest, among the pioneering works that explore fusing two decoder-only models of different modalities into one to enable seamless image+text generation. Previous works have considered merging the token spaces of different modalities and using a single decoder to generate both modalities (like AudioPaLM); however, the idea of fusing two decoders and arming the new decoder with the capability of generating high-quality multimodal ou
On Reasoning/Understanding: One weak point is that the JAM models (JAM-Uniform, JAM-Width, and JAM-Cross alike) are all much weaker than LLaMA and GPT-3, even though LLaMA is smaller than JAM-Width and JAM-Cross. This is acceptable, as this work does not focus on reasoning and understanding.
On Scaling up: Compared to GPT-3, the JAM model is still pretty small. Scaling up the model size could potentially help with the performance in terms of common sense reasoning and a
- The idea of bi-directional cross-attention layers between two generative backbones is an elegant approach to model merging. Indeed, the authors obtained SOTA results with JAM-Cross on MS-COCO (147.6 PPL), also showcasing the positive influence of the pre-trained joint text decoder.
- Comparing the three ways of performing model merging, and demonstrating the superiority of JAM-Cross among the three, sheds light on the best way to perform such an operation.
- While the experimental section is fair, pointing to a new state of the art over CM3leon, the improvement is only 1.4 PPL points while the model size more than doubles (19B vs. 7B for CM3leon) and training time increases. The authors suggest that minimal performance degradation post-merging should be studied in the evaluation part. While I agree minimal degradation is a necessary condition, it is not sufficient: given that the final model performs the same tasks as CM3leon, improv
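
The bi-directional cross-attention that the reviewers describe for JAM-Cross can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's layer: it omits learned query/key/value projections, multiple heads, and normalization, and it assumes both backbones process the same token sequence so that a causal mask over positions applies in both directions. The function names `causal_cross_attend` and `bidirectional_cross_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_cross_attend(queries, memory):
    """Position i of `queries` attends to positions 0..i of `memory`.

    Assumption: `memory` rows serve as both keys and values
    (no learned projections in this sketch).
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    contexts = []
    for pos, q in enumerate(queries):
        visible = memory[: pos + 1]  # causal mask: no access to future positions
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in visible]
        weights = softmax(scores)
        contexts.append([sum(w * v[i] for w, v in zip(weights, visible))
                         for i in range(dim)])
    return contexts


def bidirectional_cross_attention(text_hidden, image_hidden):
    """Each stream reads from the other, then adds the result residually."""
    if len(text_hidden) != len(image_hidden):
        raise ValueError("both streams must cover the same sequence positions")
    text_ctx = causal_cross_attend(text_hidden, image_hidden)
    image_ctx = causal_cross_attend(image_hidden, text_hidden)
    new_text = [[h + c for h, c in zip(row, ctx)]
                for row, ctx in zip(text_hidden, text_ctx)]
    new_image = [[h + c for h, c in zip(row, ctx)]
                 for row, ctx in zip(image_hidden, image_ctx)]
    return new_text, new_image


# Toy hidden states: 2 positions, hidden size 2, one list per backbone.
text_states = [[1.0, 0.0], [0.0, 1.0]]
image_states = [[0.0, 1.0], [1.0, 0.0]]
fused_text, fused_image = bidirectional_cross_attention(text_states, image_states)
```

The residual addition leaves each backbone's own hidden state in place and only adds information from the other stream, which is one plausible reason such a merge could limit degradation of the original models.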
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
