CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts

Jiachen Li; Xinyao Wang; Sijie Zhu; Chia-Wen Kuo; Lu Xu; Fan Chen; Jitesh Jain; Humphrey Shi; Longyin Wen

arXiv:2405.05949 · cs.CV · May 10, 2024 · 2 cites

TL;DR

CuMo scales multimodal LLMs by adding co-upcycled, Top-K sparsely-gated mixture-of-experts blocks to the vision encoder and the MLP connector. Each expert is initialized from a pre-trained MLP block and expert load is balanced during training, so performance improves with minimal additional activated parameters at inference.

Contribution

The paper proposes CuMo, a multimodal LLM framework that adds co-upcycled Top-K sparsely-gated mixture-of-experts blocks to the vision encoder and the MLP connector, scaling model capacity during training while adding few activated parameters at inference.

Findings

01 Outperforms state-of-the-art multimodal LLMs on VQA and visual instruction benchmarks.

02 Uses open-source datasets for training, ensuring accessibility and reproducibility.

03 Achieves high performance with minimal additional inference costs.

Abstract

Recent advancements in Multimodal Large Language Models (LLMs) have focused primarily on scaling by increasing text-image pair data and enhancing LLMs to improve performance on multimodal tasks. However, these scaling approaches are computationally expensive and overlook the significance of improving model capabilities from the vision side. Inspired by the successful applications of Mixture-of-Experts (MoE) in LLMs, which improves model scalability during training while keeping inference costs similar to those of smaller models, we propose CuMo. CuMo incorporates Co-upcycled Top-K sparsely-gated Mixture-of-experts blocks into both the vision encoder and the MLP connector, thereby enhancing the multimodal LLMs with minimal additional activated parameters during inference. CuMo first pre-trains the MLP blocks and then initializes each expert in the MoE block from the pre-trained MLP block…
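The upcycling step described in the abstract (pre-train a dense MLP, then initialize every expert of a Top-K sparsely-gated MoE block from it) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation, which lives in the PyTorch repository listed under Code & Models. The names `MLP` and `UpcycledTopKMoE`, the ReLU activation, the random router initialization, and the choice to apply softmax over only the selected Top-K router logits are all assumptions made for illustration.

```python
import copy
import math
import random


class MLP:
    """Tiny two-layer MLP on plain Python lists (hypothetical stand-in
    for a pre-trained MLP block; ReLU is an assumption)."""

    def __init__(self, dim, hidden, rng):
        self.w1 = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)]
        self.w2 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]

    def __call__(self, x):
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in self.w1]
        return [sum(w * hi for w, hi in zip(row, h)) for row in self.w2]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class UpcycledTopKMoE:
    """Top-K sparsely-gated MoE whose experts are copies of a dense MLP."""

    def __init__(self, pretrained_mlp, num_experts, top_k, dim, rng):
        # Upcycling: every expert starts as an independent copy of the
        # pre-trained MLP, so the experts can diverge during later training.
        self.experts = [copy.deepcopy(pretrained_mlp) for _ in range(num_experts)]
        # The router has no pre-trained counterpart; random init is an assumption.
        self.router = [[rng.gauss(0, 0.1) for _ in range(dim)]
                       for _ in range(num_experts)]
        self.top_k = top_k

    def route(self, x):
        """Return the indices of the Top-K experts and their mixing weights."""
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in self.router]
        top = sorted(range(len(logits)), key=lambda i: logits[i],
                     reverse=True)[: self.top_k]
        # Assumption: normalize over the selected experts only.
        weights = softmax([logits[i] for i in top])
        return top, weights

    def __call__(self, x):
        top, weights = self.route(x)
        out = [0.0] * len(x)
        # Only the Top-K experts run, so activated parameters stay small.
        for idx, w in zip(top, weights):
            y = self.experts[idx](x)
            out = [o + w * yi for o, yi in zip(out, y)]
        return out
```

With, for example, `dense = MLP(4, 8, random.Random(0))` and `moe = UpcycledTopKMoE(dense, num_experts=4, top_k=2, dim=4, rng=random.Random(1))`, the MoE output for any input matches the dense MLP output immediately after upcycling, because all experts are identical and the mixing weights sum to one. The experts only diverge once further training updates them separately.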

Code & Models

Repositories

shi-labs/cumo (PyTorch, official)

Videos

CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Mixture of Experts