MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Bin Lin; Zhenyu Tang; Yang Ye; Jinfa Huang; Junwu Zhang; Yatian Pang; Peng Jin; Munan Ning; Jiebo Luo; Li Yuan

arXiv:2401.15947 · cs.CV · December 24, 2024 · 33 cites

Open Access · 3 Repos · 10 Models · 1 Dataset

TL;DR

This paper introduces MoE-LLaVA, a sparse vision-language model using Mixture of Experts that achieves high performance with significantly fewer active parameters, reducing costs while maintaining or surpassing larger models.

Contribution

The paper proposes MoE-Tuning and MoE-LLaVA, novel methods for constructing sparse, efficient large vision-language models with competitive performance.

Findings

01

MoE-LLaVA achieves comparable performance to larger models with only 3B active parameters.

02

MoE-LLaVA surpasses larger models in object hallucination benchmarks.

03

The approach reduces training and inference costs significantly.

Abstract

Recent advances demonstrate that scaling Large Vision-Language Models (LVLMs) effectively improves downstream task performances. However, existing scaling methods enable all model parameters to be active for each token in the calculation, which brings massive training and inferring costs. In this work, we propose a simple yet effective training strategy MoE-Tuning for LVLMs. This strategy innovatively addresses the common issue of performance degradation in multi-modal sparsity learning, consequently constructing a sparse model with an outrageous number of parameters but a constant computational cost. Furthermore, we present the MoE-LLaVA, a MoE-based sparse LVLM architecture, which uniquely activates only the top-k experts through routers during deployment, keeping the remaining experts inactive. Extensive experiments show the significant performance of MoE-LLaVA in a variety of visual…
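The top-k routing idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`softmax`, `top_k_moe`), the use of a single linear map per expert, and the renormalisation of the selected gate weights are all assumptions made for illustration. It only shows how a router can send each token to k experts while the remaining experts do no work for that token.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def top_k_moe(tokens, router_w, expert_ws, k=2):
    """Hypothetical sparse MoE layer (illustrative only).

    tokens:    (n, d) token representations
    router_w:  (d, E) router weights, one column per expert
    expert_ws: (E, d, d) one linear map per expert (assumption: real
               experts would typically be feed-forward networks)
    """
    probs = softmax(tokens @ router_w)                 # (n, E) routing probabilities
    top_idx = np.argsort(-probs, axis=-1)[:, :k]       # (n, k) chosen experts per token
    top_p = np.take_along_axis(probs, top_idx, axis=-1)
    # Assumption: renormalise the k selected weights so they sum to 1.
    gates = top_p / top_p.sum(axis=-1, keepdims=True)

    out = np.zeros_like(tokens)
    for e in range(expert_ws.shape[0]):
        # Rows (tokens) and slots where expert e was selected.
        token_ids, slot = np.nonzero(top_idx == e)
        if token_ids.size == 0:
            continue  # expert e stays inactive for this batch
        expert_out = tokens[token_ids] @ expert_ws[e]
        out[token_ids] += gates[token_ids, slot][:, None] * expert_out
    return out, top_idx


rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, 8))       # 5 tokens, hidden size 8
router_w = rng.standard_normal((8, 4))     # 4 experts
expert_ws = rng.standard_normal((4, 8, 8))
out, chosen = top_k_moe(tokens, router_w, expert_ws, k=2)
```

In this sketch each token touches only k of the E expert matrices, so per-token compute depends on k rather than on the total number of experts, which is the sense in which total parameters can grow while computational cost stays roughly constant.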

Code & Models

Repositories

Models

Datasets

LanguageBind/MoE-LLaVA
dataset · 420 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques