CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, Longyin Wen

TL;DR
CuMo introduces a scalable multimodal LLM architecture that integrates mixture-of-experts in both vision and language components, achieving superior performance with minimal inference overhead by leveraging pre-training and expert balancing techniques.
Contribution
The paper proposes CuMo, a multimodal LLM framework that incorporates co-upcycled Top-K sparsely-gated mixture-of-experts blocks into the vision encoder and the MLP connector, improving scalability and performance with minimal additional activated parameters during inference.
Findings
Outperforms state-of-the-art multimodal LLMs on VQA and visual instruction benchmarks.
Uses open-source datasets for training, ensuring accessibility and reproducibility.
Achieves high performance with minimal additional inference costs.
Abstract
Recent advancements in Multimodal Large Language Models (LLMs) have focused primarily on scaling by increasing text-image pair data and enhancing LLMs to improve performance on multimodal tasks. However, these scaling approaches are computationally expensive and overlook the significance of improving model capabilities from the vision side. Inspired by the successful applications of Mixture-of-Experts (MoE) in LLMs, which improves model scalability during training while keeping inference costs similar to those of smaller models, we propose CuMo. CuMo incorporates Co-upcycled Top-K sparsely-gated Mixture-of-experts blocks into both the vision encoder and the MLP connector, thereby enhancing the multimodal LLMs with minimal additional activated parameters during inference. CuMo first pre-trains the MLP blocks and then initializes each expert in the MoE block from the pre-trained MLP block…
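The upcycling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-layer ReLU MLP, the dictionary layout, the function names (`mlp_forward`, `upcycle`, `top_k_moe`), the random router, and the choice to apply the softmax over only the selected Top-K logits are all assumptions made for illustration. Under those assumptions, a freshly upcycled MoE block reproduces the dense MLP exactly, because every expert starts as a copy of the same pre-trained weights and the gate weights sum to 1.

```python
import copy
import math
import random


def mlp_forward(mlp, x):
    """Two-layer MLP with ReLU: x -> W2 @ relu(W1 @ x)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in mlp["w1"]]
    return [sum(w * h for w, h in zip(row, hidden)) for row in mlp["w2"]]


def upcycle(mlp, num_experts):
    """Initialize every expert as an independent copy of the pre-trained MLP."""
    return [copy.deepcopy(mlp) for _ in range(num_experts)]


def top_k_moe(x, experts, router, k):
    """Route x to the k highest-scoring experts and mix their outputs."""
    # One router logit per expert (a linear router is an assumption here).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k gates sum to 1.
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    gates = [e / total for e in exps]
    out = [0.0] * len(experts[0]["w2"])
    for gate, i in zip(gates, top):
        y = mlp_forward(experts[i], x)  # only k experts are evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out, top


# Hypothetical sizes and random weights standing in for a pre-trained MLP.
rng = random.Random(0)
dim, hidden_dim, num_experts, k = 4, 8, 4, 2
mlp = {
    "w1": [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(hidden_dim)],
    "w2": [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(dim)],
}
router = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
x = [rng.gauss(0, 1) for _ in range(dim)]

experts = upcycle(mlp, num_experts)
dense_out = mlp_forward(mlp, x)
sparse_out, chosen = top_k_moe(x, experts, router, k)
```

In this sketch only `k` of the `num_experts` experts run per input, which is why the activated parameter count stays close to that of the dense block even though the total parameter count grows. The experts diverge from one another only once training continues after upcycling.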
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Mixture of Experts
