Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Weixin Liang; Lili Yu; Liang Luo; Srinivasan Iyer; Ning Dong; Chunting Zhou; Gargi Ghosh; Mike Lewis; Wen-tau Yih; Luke Zettlemoyer; Xi Victoria Lin

arXiv:2411.04996·cs.CL·May 9, 2025

TL;DR

Mixture-of-Transformers (MoT) is a sparse, scalable multi-modal architecture that reduces training costs while maintaining performance across text, image, and speech tasks, enabling efficient large-scale multi-modal models.

Contribution

We introduce MoT, a novel sparse multi-modal transformer architecture that decouples modality-specific parameters, significantly reducing computational costs for training large multi-modal models.

Findings

01. MoT matches dense-baseline performance using only 55.8% of the FLOPs in text-and-image tasks.

02. MoT achieves speech performance comparable to dense models using only 37.2% of the FLOPs.

03. MoT outperforms dense models on image generation metrics at reduced computational cost.
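
As a rough reading of the FLOPs figures above, and assuming each percentage is the share of the dense baseline's compute that MoT needs to reach comparable quality, the implied savings work out as follows (simple arithmetic on the quoted numbers, not additional results from the paper):

```latex
\begin{align*}
\text{text-and-image:}\quad & 1 - 0.558 = 0.442 \;\;(44.2\%\ \text{fewer FLOPs}), & \tfrac{1}{0.558} &\approx 1.79\times \\
\text{speech:}\quad & 1 - 0.372 = 0.628 \;\;(62.8\%\ \text{fewer FLOPs}), & \tfrac{1}{0.372} &\approx 2.69\times
\end{align*}
```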

Abstract

The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality -- including feed-forward networks, attention matrices, and layer normalization -- enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of…
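
The decoupling described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`init_params`, `route`, `mot_block`), the single attention head, the pre-norm residual layout, the ReLU feed-forward network, and the causal mask are all assumptions made for illustration. What it aims to show is the core idea only: each token uses the layer norm, attention projections, and feed-forward weights of its own modality, while the attention itself runs over the full mixed-modality sequence.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gain + bias

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def init_params(n_modalities, d, d_ff, rng, scale=0.1):
    """One independent parameter set per modality (hypothetical layout)."""
    def w(*shape):
        return rng.normal(0.0, scale, shape)
    return [
        dict(wq=w(d, d), wk=w(d, d), wv=w(d, d), wo=w(d, d),
             w1=w(d, d_ff), w2=w(d_ff, d),
             g1=np.ones(d), b1=np.zeros(d), g2=np.ones(d), b2=np.zeros(d))
        for _ in range(n_modalities)
    ]

def route(x, modality, params, fn):
    """Apply fn to each token using the parameters of that token's modality."""
    out = np.zeros_like(x)
    for m, p in enumerate(params):
        mask = modality == m
        if mask.any():
            out[mask] = fn(x[mask], p)
    return out

def mot_block(x, modality, params, causal=True):
    """x: (T, d) token states; modality: (T,) integer modality id per token."""
    T, d = x.shape
    # Modality-specific layer norm and Q/K/V projections.
    h = route(x, modality, params, lambda t, p: layer_norm(t, p["g1"], p["b1"]))
    q = route(h, modality, params, lambda t, p: t @ p["wq"])
    k = route(h, modality, params, lambda t, p: t @ p["wk"])
    v = route(h, modality, params, lambda t, p: t @ p["wv"])
    # Global self-attention: scores are computed across all modalities at once.
    scores = q @ k.T / np.sqrt(d)
    if causal:
        scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    attn = softmax(scores) @ v
    # Modality-specific output projection, then modality-specific FFN.
    x = x + route(attn, modality, params, lambda t, p: t @ p["wo"])
    h = route(x, modality, params, lambda t, p: layer_norm(t, p["g2"], p["b2"]))
    x = x + route(h, modality, params,
                  lambda t, p: np.maximum(t @ p["w1"], 0.0) @ p["w2"])
    return x

# Usage: five tokens, two modalities (0 = text, 1 = image; labels are illustrative).
rng = np.random.default_rng(0)
params = init_params(n_modalities=2, d=8, d_ff=16, rng=rng)
x = rng.normal(size=(5, 8))
modality = np.array([0, 0, 1, 1, 0])
y = mot_block(x, modality, params)
print(y.shape)  # (5, 8)
```

Because every token passes through exactly one modality's parameters, the per-token compute matches a dense block of the same width, while the total parameter count grows with the number of modalities. That is the sense in which the architecture is sparse.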

Code & Models

Repositories

allenzren/open-pi-zero
pytorch

Taxonomy

Topics: BIM and Construction Integration · Modular Robots and Swarm Intelligence

Methods: Softmax · Attention Is All You Need · Layer Normalization