Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin

TL;DR
Mixture-of-Transformers (MoT) is a sparse, scalable multi-modal architecture that reduces training costs while maintaining performance across text, image, and speech tasks, enabling efficient large-scale multi-modal models.
Contribution
We introduce MoT, a novel sparse multi-modal transformer architecture that decouples modality-specific parameters, significantly reducing computational costs for training large multi-modal models.
Findings
MoT matches dense baseline performance using only 55.8% of the FLOPs in text-and-image tasks.
MoT achieves speech performance comparable to dense models using only 37.2% of the FLOPs.
MoT outperforms dense models in image generation metrics at reduced computational costs.
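To put the reported FLOPs fractions in more familiar terms, the short Python sketch below converts each fraction into a percentage saved and a dense-to-MoT compute ratio. The function name `flops_summary` and the dictionary keys are hypothetical, chosen only for illustration. The ratio describes training compute, not wall-clock time, which the fractions alone do not determine.

```python
def flops_summary(fraction_of_dense):
    """Return (percent_saved, dense_to_mot_ratio) for a FLOPs fraction in (0, 1]."""
    if not 0.0 < fraction_of_dense <= 1.0:
        raise ValueError("fraction_of_dense must lie in (0, 1]")
    percent_saved = (1.0 - fraction_of_dense) * 100.0
    dense_to_mot_ratio = 1.0 / fraction_of_dense
    return percent_saved, dense_to_mot_ratio


# FLOPs fractions as reported in the findings above; the labels are illustrative.
reported = {"text-and-image": 0.558, "speech": 0.372}

for setting, fraction in reported.items():
    saved, ratio = flops_summary(fraction)
    print(f"{setting}: {saved:.1f}% fewer FLOPs, dense/MoT compute ratio {ratio:.2f}x")
```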
Abstract
The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality -- including feed-forward networks, attention matrices, and layer normalization -- enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of…
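The decoupling described in the abstract can be illustrated with a minimal pure-Python sketch of one block: each token uses the layer-norm gains, attention projections, and feed-forward weights of its own modality, while attention scores are computed over the whole mixed-modality sequence. This is not the authors' implementation. The single attention head, causal mask, ReLU feed-forward layer, pre-norm residual layout, random weights, and all function and parameter names (`mot_block`, `init_modality_params`, `w_q`, and so on) are assumptions made for illustration.

```python
import math
import random


def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def layer_norm(x, gain, eps=1e-5):
    """Normalize x to zero mean and unit variance, then scale by gain."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) for g, v in zip(gain, x)]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_modality_params(d_model, d_ff, rng):
    """Create one independent set of non-embedding parameters (hypothetical layout)."""
    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "ln_attn": [1.0] * d_model, "ln_ffn": [1.0] * d_model,
        "w_q": matrix(d_model, d_model), "w_k": matrix(d_model, d_model),
        "w_v": matrix(d_model, d_model), "w_o": matrix(d_model, d_model),
        "w_1": matrix(d_ff, d_model), "w_2": matrix(d_model, d_ff),
    }


def mot_block(tokens, modalities, params):
    """Apply one block: modality-specific weights, attention over the full sequence."""
    d_model = len(tokens[0])

    # Modality-specific layer norm and Q/K/V projections.
    queries, keys, values = [], [], []
    for x, modality in zip(tokens, modalities):
        p = params[modality]
        normed = layer_norm(x, p["ln_attn"])
        queries.append(matvec(p["w_q"], normed))
        keys.append(matvec(p["w_k"], normed))
        values.append(matvec(p["w_v"], normed))

    # Global self-attention: every token attends to all earlier tokens,
    # regardless of modality (the causal mask is an assumption of this sketch).
    attended = []
    for i, query in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(query, keys[j])) / math.sqrt(d_model)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * values[j][c] for j, w in enumerate(weights)) for c in range(d_model)]
        )

    # Modality-specific output projection, layer norm, and feed-forward network.
    outputs = []
    for x, mixed, modality in zip(tokens, attended, modalities):
        p = params[modality]
        hidden = [a + b for a, b in zip(x, matvec(p["w_o"], mixed))]
        normed = layer_norm(hidden, p["ln_ffn"])
        activated = [max(0.0, u) for u in matvec(p["w_1"], normed)]
        ffn_out = matvec(p["w_2"], activated)
        outputs.append([a + b for a, b in zip(hidden, ffn_out)])
    return outputs


# Toy usage: five tokens of width 4 in an interleaved text/image sequence.
rng = random.Random(0)
params = {m: init_modality_params(4, 8, rng) for m in ("text", "image")}
tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
modalities = ["text", "text", "image", "image", "text"]
outputs = mot_block(tokens, modalities, params)
```

In this sketch, every token passes through exactly one modality's weights, so per-token compute matches a dense block of the same width even though the total parameter count grows with the number of modalities. The only cross-modal interaction is in the attention step, where text queries can read image keys and values and vice versa.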
Taxonomy
Topics: BIM and Construction Integration · Modular Robots and Swarm Intelligence
Methods: Softmax · Attention Is All You Need · Layer Normalization
