Scaling Vision-Language Models with Sparse Mixture of Experts
Sheng Shen, Zhewei Yao, Chunyuan Li, Trevor Darrell, Kurt Keutzer, Yuxiong He

TL;DR
This paper investigates the use of sparsely-gated mixture-of-experts techniques to scale vision-language models, achieving state-of-the-art results while addressing training stability, interpretability, and computational trade-offs.
Contribution
The paper demonstrates the effectiveness of MoE in scaling VLMs and provides insights into training stability, interpretability, and performance trade-offs compared to dense models.
Findings
MoE models outperform dense models of similar computational cost.
Training stability of MoE models can be improved with specific techniques.
MoE enhances interpretability of vision-language models.
Abstract
The field of natural language processing (NLP) has made significant strides in recent years, particularly in the development of large-scale vision-language models (VLMs). These models aim to bridge the gap between text and visual information, enabling a more comprehensive understanding of multimedia data. However, as these models become larger and more complex, they also become more challenging to train and deploy. One approach to addressing this challenge is the use of sparsely-gated mixture-of-experts (MoE) techniques, which divide the model into smaller, specialized sub-models that can jointly solve a task. In this paper, we explore the effectiveness of MoE in scaling vision-language models, demonstrating its potential to achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost. Our research offers valuable insights into…
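To make the phrase "sparsely-gated mixture-of-experts" concrete, here is a minimal sketch of a generic top-k gated MoE layer in pure Python. It is not the paper's implementation. The class name `SparseMoE`, the layer sizes, the random initialization, and the choice to renormalize the gate weights over only the selected experts are all assumptions made for illustration. Each expert is reduced to a single linear map to keep the example short.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


class SparseMoE:
    """Hypothetical top-k gated MoE layer (illustrative, not from the paper)."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one row of weights per expert, producing one logit each.
        self.router = [[rng.gauss(0, 0.1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Experts: each is a dim x dim linear map (assumed for brevity).
        self.experts = [[[rng.gauss(0, 0.1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]

    def route(self, token):
        """Return the indices of the top-k experts and their gate weights."""
        logits = matvec(self.router, token)
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = order[:self.top_k]
        # Assumption: gate weights are a softmax over the selected logits only.
        weights = softmax([logits[i] for i in chosen])
        return chosen, weights

    def forward(self, token):
        """Run only the selected experts and mix their outputs."""
        chosen, weights = self.route(token)
        out = [0.0] * len(token)
        for idx, w in zip(chosen, weights):
            y = matvec(self.experts[idx], token)
            out = [o + w * yi for o, yi in zip(out, y)]
        return out, chosen, weights


layer = SparseMoE(dim=4, num_experts=8, top_k=2)
out, chosen, weights = layer.forward([1.0, 0.5, -0.5, 2.0])
```

In this sketch only 2 of the 8 experts run for the token, so per-token compute stays close to that of a single small sub-model while the total parameter count grows with the number of experts. This is the general trade-off the abstract refers to when comparing against dense models of equivalent computational cost.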
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
