MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Bin Lin, Zhenyu Tang, Yang Ye, Jinfa Huang, Junwu Zhang, Yatian Pang, Peng Jin, Munan Ning, Jiebo Luo, Li Yuan

TL;DR
This paper introduces MoE-LLaVA, a sparse vision-language model using Mixture of Experts that achieves high performance with significantly fewer active parameters, reducing costs while maintaining or surpassing larger models.
Contribution
The paper proposes MoE-Tuning, a training strategy, and MoE-LLaVA, a MoE-based sparse architecture, for constructing sparse, efficient large vision-language models with competitive performance.
Findings
With approximately 3B sparsely activated parameters, MoE-LLaVA performs comparably to LLaVA-1.5-7B on various visual understanding datasets.
MoE-LLaVA surpasses LLaVA-1.5-13B on the object hallucination benchmark.
The approach reduces training and inference costs significantly.
Abstract
Recent advances demonstrate that scaling Large Vision-Language Models (LVLMs) effectively improves downstream task performances. However, existing scaling methods enable all model parameters to be active for each token in the calculation, which brings massive training and inferring costs. In this work, we propose a simple yet effective training strategy MoE-Tuning for LVLMs. This strategy innovatively addresses the common issue of performance degradation in multi-modal sparsity learning, consequently constructing a sparse model with an outrageous number of parameters but a constant computational cost. Furthermore, we present the MoE-LLaVA, a MoE-based sparse LVLM architecture, which uniquely activates only the top-k experts through routers during deployment, keeping the remaining experts inactive. Extensive experiments show the significant performance of MoE-LLaVA in a variety of visual…
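The top-k routing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`top_k_route`, `moe_layer`), the toy experts, and the choice to renormalize the gate weights over the selected experts are all assumptions made for illustration. The sketch only shows the general idea that a router scores every expert for a token but only the k best-scoring experts are actually run.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Assumption: weights are renormalized over the selected experts so they
    sum to 1. This is one common convention, not necessarily the paper's.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, experts, router_weights, k=2):
    """Apply a sparse MoE layer to one token (a list of floats).

    router_weights holds one row per expert (a hypothetical linear router).
    Only the k selected experts are evaluated; the rest stay inactive.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    routes = top_k_route(logits, k)
    out = [0.0] * len(token)
    for idx, weight in routes:
        expert_out = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out, routes


# Toy usage: four hypothetical experts that simply scale their input.
experts = [lambda t, s=s: [s * v for v in t] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [-1.0, 0.0]]
output, routes = moe_layer([1.0, 2.0], experts, router_weights, k=2)
```

In this toy setup the per-token compute depends on k rather than on the total number of experts, which is the property the abstract refers to as a constant computational cost.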
Code & Models
- 🤗 LanguageBind/MoE-LLaVA-StableLM-1.6B-4e · model · 61 dl · ♡ 8
- 🤗 LanguageBind/MoE-LLaVA-Phi2-2.7B-4e · model · 139 dl · ♡ 40
- 🤗 LanguageBind/MoE-LLaVA-Qwen-1.8B-4e · model · 303 dl · ♡ 15
- 🤗 LanguageBind/MoE-LLaVA-Phi2-2.7B-4e-384 · model · 27 dl · ♡ 32
- 🤗 LanguageBind/MoE-LLaVA-Phi2-Pretrain · model · 10 dl
- 🤗 LanguageBind/MoE-LLaVA-Qwen-Pretrain · model · 9 dl · ♡ 1
- 🤗 LanguageBind/MoE-LLaVA-StableLM-Pretrain · model · 9 dl
- 🤗 LanguageBind/MoE-LLaVA-Phi2-384-Pretrain · model · 8 dl
- 🤗 LanguageBind/MoE-LLaVA-OpenChat-7B-4e · model · 13 dl · ♡ 1
- 🤗 LanguageBind/MoE-LLaVA-StableLM-1.6B-4e-384 · model · 12 dl · ♡ 8
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
