Beyond Language Modeling: An Exploration of Multimodal Pretraining

Shengbang Tong; David Fan; John Nguyen; Ellis Brown; Gaoyue Zhou; Shengyi Qian; Boyang Zheng; Théophane Vallaeys; Junlin Han; Rob Fergus; Naila Murray; Marjan Ghazvininejad; Mike Lewis; Nicolas Ballas; Amir Bar; Michael Rabbat; Jakob Verbeek; Luke Zettlemoyer; Koustuv Sinha; Yann LeCun; Saining Xie

arXiv:2603.03276·cs.CV·March 4, 2026

Open Access

TL;DR

This paper investigates the design and scaling of multimodal foundation models using controlled pretraining experiments, revealing key insights into representation, synergy, world modeling, and efficient scaling with Mixture-of-Experts.

Contribution

The paper presents a comprehensive empirical analysis of multimodal pretraining, highlighting the effectiveness of the Representation Autoencoder (RAE) and Mixture-of-Experts (MoE) for scalable, unified multimodal models.

Findings

01

RAE offers an optimal unified visual representation, excelling at both visual understanding and generation.

02

Visual and language data complement each other, enhancing downstream tasks.

03

Vision is more data-hungry than language, and MoE helps balance this asymmetry.
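
For readers unfamiliar with Mixture-of-Experts, the following is a minimal sketch of generic top-k expert routing, not the paper's implementation. The router design, the number of experts, and the value of k are not given on this page, so the names `top_k_route` and `moe_forward`, the scalar toy experts, and the default `k=2` are all illustrative assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts for one token.

    Returns a list of (expert_index, weight) pairs whose weights are
    renormalized to sum to 1. Hypothetical helper, for illustration only.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


def moe_forward(x, experts, router_logits, k=2):
    """Combine the outputs of the selected experts, weighted by the router.

    Only the k selected experts are evaluated, which is why an MoE layer
    can hold many parameters while keeping per-token compute small.
    """
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))
```

In this sketch each token activates only k experts, so capacity can be added for one modality without raising the compute spent on every token. Whether the paper's models route this way is an assumption here.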

Abstract

The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal…
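
The abstract describes a single model trained with next-token prediction on language and diffusion on vision. A minimal sketch of how two such losses might be combined into one objective is given below. It is a toy illustration under stated assumptions, not the authors' code: a plain mean-squared-error regression stands in for the diffusion objective (the exact parameterization is not stated on this page), and `vision_weight` is a hypothetical balancing coefficient.

```python
import math


def cross_entropy(logits, target):
    """Next-token loss at one position: -log softmax(logits)[target]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def denoising_mse(prediction, target):
    """Regression loss for one visual latent (stand-in for a diffusion loss)."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


def combined_loss(text_logits, text_targets, vis_preds, vis_targets,
                  vision_weight=1.0):
    """Mean language loss plus weighted mean vision loss.

    vision_weight is a hypothetical knob; the value used in the paper,
    if any, is not given in this excerpt.
    """
    lm = sum(cross_entropy(l, t)
             for l, t in zip(text_logits, text_targets)) / len(text_targets)
    vis = sum(denoising_mse(p, t)
              for p, t in zip(vis_preds, vis_targets)) / len(vis_targets)
    return lm + vision_weight * vis
```

For example, `combined_loss([[0.0, 0.0]], [0], [[1.0, 2.0]], [[0.0, 0.0]], vision_weight=0.5)` adds a language term of ln 2 to half of a vision term of 2.5.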

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning