Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts

Jialin Wu; Xia Hu; Yaqing Wang; Bo Pang; Radu Soricut

arXiv:2312.00968 · cs.CV · April 4, 2024 · 2 cites

TL;DR

Omni-SMoLA introduces a Soft Mixture of Low-rank Experts architecture for large multimodal models, enhancing their performance across diverse vision-and-language tasks without significantly increasing parameters.

Contribution

It proposes a parameter-efficient MoE approach that improves generalist multimodal model performance by residually learning specialized knowledge with lightweight experts.

Findings

01. Achieves state-of-the-art generalist performance on vision-language tasks.

02. Matches or surpasses specialized models in various benchmarks.

03. Improves task performance with minimal additional parameters.

Abstract

Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks. However, generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks. Recent research suggests that Mixture of Experts (MoE) architectures are useful for instruction tuning, but for LMMs of parameter size around O(50-100B), the prohibitive cost of replicating and storing the expert models severely limits the number of experts we can use. We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to (softly) mix many multimodal low-rank experts, and avoids introducing a significant number of new parameters compared to conventional MoE models. The core intuition here is that the large model provides a foundational backbone, while different lightweight experts residually learn specialized knowledge, either per-modality or multimodally. Extensive…
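The core idea in the abstract (a frozen backbone layer whose output is adjusted by a softly weighted sum of low-rank residual experts) can be illustrated with a small sketch. Everything below is an assumption made for illustration: the class name `SoftLowRankMoELinear`, the per-token softmax gate, the zero initialisation of the up-projections, and the toy dimensions are hypothetical and are not taken from the paper. In particular, the per-token softmax gate is a simplification and may differ from the routing the authors actually use. The sketch uses only the Python standard library.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix W (a list of rows) by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Random Gaussian matrix with the given shape."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class SoftLowRankMoELinear:
    """Hypothetical layer: frozen dense weight plus softly mixed low-rank experts."""

    def __init__(self, d_in, d_out, num_experts, rank, seed=0):
        rng = random.Random(seed)
        # Backbone weight (d_out x d_in); assumed frozen during expert tuning.
        self.W = rand_matrix(d_out, d_in, rng)
        # Router (num_experts x d_in): produces one logit per expert.
        self.router = rand_matrix(num_experts, d_in, rng)
        # Each expert is a low-rank pair: down (rank x d_in), up (d_out x rank).
        self.down = [rand_matrix(rank, d_in, rng) for _ in range(num_experts)]
        # Assumption: up-projections start at zero, so the layer initially
        # reproduces the backbone output exactly.
        self.up = [[[0.0] * rank for _ in range(d_out)] for _ in range(num_experts)]

    def gates(self, x):
        """Soft mixing weights over experts; they are positive and sum to 1."""
        return softmax(matvec(self.router, x))

    def forward(self, x):
        """Backbone output plus the gate-weighted sum of low-rank residuals."""
        y = matvec(self.W, x)
        for g, down, up in zip(self.gates(x), self.down, self.up):
            delta = matvec(up, matvec(down, x))  # low-rank residual update
            y = [yi + g * di for yi, di in zip(y, delta)]
        return y

    def added_params(self):
        """Parameters added on top of the backbone (experts plus router)."""
        num_experts, rank = len(self.down), len(self.down[0])
        d_out, d_in = len(self.W), len(self.W[0])
        return num_experts * rank * (d_in + d_out) + num_experts * d_in

    def replicated_params(self):
        """Parameters needed if each expert were a full copy of the dense weight."""
        return len(self.down) * len(self.W) * len(self.W[0])


if __name__ == "__main__":
    layer = SoftLowRankMoELinear(d_in=64, d_out=64, num_experts=4, rank=2)
    token = [0.5] * 64
    print("gates:", layer.gates(token))
    print("added parameters:", layer.added_params())
    print("full-replication parameters:", layer.replicated_params())
```

In this sketch the added parameter count grows with `rank * (d_in + d_out)` per expert rather than `d_in * d_out`, which is the sense in which low-rank experts avoid the storage cost of replicating a full expert model.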

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Speech and dialogue systems