LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training

Xiaoye Qu; Daize Dong; Xuyang Hu; Tong Zhu; Weigao Sun; Yu Cheng

arXiv:2411.15708 · cs.CL · November 26, 2024

TL;DR

This paper investigates the sparsity of LLaMA models by constructing Mixture-of-Experts (MoE) modules in transformer blocks, evaluating their performance across domains, and proposing a post-training strategy to mitigate sparsity-induced performance loss.

Contribution

The paper presents a comprehensive analysis of sparsifying LLaMA with MoE, compares expert construction methods, and proposes a two-stage post-training strategy to recover performance lost to sparsification.

Findings

01. Sparsified LLaMA models maintain competitive performance across domains.

02. Different expert construction methods impact sparsity and effectiveness.

03. Post-training strategies can recover performance degraded by sparsity.

Abstract

Recently, inspired by the concept of sparsity, Mixture-of-Experts (MoE) models have gained increasing popularity for scaling model size while keeping the number of activated parameters constant. In this study, we thoroughly investigate the sparsity of the dense LLaMA model by constructing MoE for both the attention (i.e., Attention MoE) and MLP (i.e., MLP MoE) modules in the transformer blocks. Specifically, we investigate different expert construction methods and granularities under the same activation conditions to analyze the impact of sparsifying the model. Additionally, to comprehensively evaluate the model's capabilities across various domains (e.g., conversation, code, math) after sparsification, we apply sparsity to the instructed large language models (LLMs) and construct instructed MoE models. To counteract the performance degradation resulting from increased sparsity, we…
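To make the idea of constructing an MLP MoE from a dense model concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a LLaMA-style gated MLP (down projection applied to SiLU(gate) times up), splits the intermediate neurons into equally sized experts by a random partition, and routes with a randomly initialized linear router. The function names, the random partition, and the unweighted sum over selected experts are all illustrative assumptions; the paper's construction methods and gating may differ.

```python
import numpy as np

def silu(z):
    # SiLU activation: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def dense_mlp(x, w_gate, w_up, w_down):
    # Gated MLP: w_gate, w_up are (d_ff, d_model); w_down is (d_model, d_ff)
    return w_down @ (silu(w_gate @ x) * (w_up @ x))

def split_into_experts(w_gate, w_up, w_down, num_experts, rng):
    # Hypothetical construction: randomly partition the d_ff intermediate
    # neurons into num_experts disjoint groups, one group per expert.
    d_ff = w_gate.shape[0]
    groups = np.array_split(rng.permutation(d_ff), num_experts)
    return [(w_gate[idx], w_up[idx], w_down[:, idx]) for idx in groups]

def moe_mlp(x, experts, router_w, top_k):
    # Assumed routing: pick the top_k experts by linear router score and
    # sum their outputs without gate weighting.
    scores = router_w @ x
    chosen = np.argsort(scores)[-top_k:]
    out = np.zeros(experts[0][2].shape[0])
    for e in chosen:
        g, u, d = experts[e]
        out += dense_mlp(x, g, u, d)
    return out, chosen

rng = np.random.default_rng(0)
d_model, d_ff, num_experts = 8, 32, 4
w_gate = rng.normal(size=(d_ff, d_model))
w_up = rng.normal(size=(d_ff, d_model))
w_down = rng.normal(size=(d_model, d_ff))
router_w = rng.normal(size=(num_experts, d_model))  # untrained, illustrative
x = rng.normal(size=d_model)

experts = split_into_experts(w_gate, w_up, w_down, num_experts, rng)
dense_out = dense_mlp(x, w_gate, w_up, w_down)
full_out, _ = moe_mlp(x, experts, router_w, top_k=num_experts)
sparse_out, chosen = moe_mlp(x, experts, router_w, top_k=2)
```

Because the gated MLP acts element-wise on the intermediate dimension, activating every expert reproduces the dense output, while `top_k=2` uses only half of the intermediate neurons. Changing `num_experts` and `top_k` together (for example 2 of 4 versus 4 of 8) varies granularity while keeping the activated fraction fixed.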

Code & Models

Repositories

opensparsellms/llama-moe-v2 (PyTorch, official)

Taxonomy

Topics: Machine Learning and Algorithms · Computability, Logic, AI Algorithms · Topic Modeling

Methods: Softmax · Attention Is All You Need · Mixture of Experts · LLaMA