DualSparse-MoE: Coordinating Tensor/Neuron-Level Sparsity with Expert Partition and Reconstruction
Weilin Cai, Le Qin, Shwai He, Junwei Cui, Ang Li, Jiayi Huang

TL;DR
DualSparse-MoE improves the efficiency of Mixture of Experts (MoE) models by combining tensor-level and neuron-level sparsity. It partitions experts after training and drops computation dynamically at inference, giving speedups roughly proportional to the computation dropped with only a small accuracy loss.
Contribution
The paper induces dual sparsity in pre-trained MoE models through post-training expert partitioning and dynamic computation dropping, improving efficiency without retraining.
Findings
Dropping 25% of the computation reduces accuracy by only 0.08%-0.28%.
At nearly all dropping degrees, the speedup is proportional to the amount of computation dropped.
Load-imbalance-aware expert parallelism achieves a 1.41x speedup with 0.5% accuracy loss.
Abstract
Mixture of Experts (MoE) has become a mainstream architecture for building Large Language Models (LLMs) by reducing per-token computation while enabling model scaling. It can be viewed as partitioning a large Feed-Forward Network (FFN) at the tensor level into fine-grained sub-FFNs, or experts, and activating only a sparse subset for each input. While this sparsity improves efficiency, MoE models still face substantial challenges due to their massive computational scale and unpredictable activation patterns. To enable efficient MoE deployment, we identify dual sparsity at the tensor and neuron levels in pre-trained MoE modules as a key factor for both accuracy and efficiency. Unlike prior work that increases tensor-level sparsity through finer-grained expert design during pre-training, we introduce post-training expert partitioning to induce such sparsity without retraining. This preserves…
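The view of an expert as a partitionable FFN can be illustrated with a minimal sketch. It assumes a plain two-layer FFN with a ReLU activation and splits the hidden neurons into contiguous groups; both choices are assumptions made for illustration, and the paper's actual partitioning criterion and its handling of routing weights are not described in this excerpt. The helper names (ffn, partition_expert) are hypothetical. Because the activation acts on each hidden neuron independently, the outputs of the sub-experts sum to the output of the original expert, which is why an expert can be split after training without changing its result.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def ffn(x, w_in, w_out):
    """Two-layer FFN with one row of w_in and one row of w_out per hidden neuron.

    Both w_in and w_out are lists of rows of length d_model.
    """
    hidden = relu([sum(w * xi for w, xi in zip(row, x)) for row in w_in])
    return [
        sum(h * row[j] for h, row in zip(hidden, w_out))
        for j in range(len(x))
    ]


def partition_expert(w_in, w_out, num_parts):
    """Split an expert's hidden neurons into contiguous groups.

    Hypothetical helper: contiguous splitting is an illustrative assumption,
    not necessarily how the paper groups neurons.
    """
    d_hidden = len(w_in)
    if d_hidden % num_parts != 0:
        raise ValueError("hidden size must be divisible by num_parts")
    size = d_hidden // num_parts
    return [
        (w_in[i * size:(i + 1) * size], w_out[i * size:(i + 1) * size])
        for i in range(num_parts)
    ]


random.seed(0)
d_model, d_hidden = 4, 8
x = [random.gauss(0, 1) for _ in range(d_model)]
w_in = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)]
w_out = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)]

# Output of the original (unpartitioned) expert.
full = ffn(x, w_in, w_out)

# Outputs of the two sub-experts, summed element-wise.
parts = partition_expert(w_in, w_out, num_parts=2)
outputs = [ffn(x, sub_in, sub_out) for sub_in, sub_out in parts]
summed = [sum(column) for column in zip(*outputs)]

# The partition is exact up to floating-point rounding.
assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(full, summed))
```

Once an expert is split this way, each sub-expert is a separate unit of work, so skipping one of them removes a fixed fraction of that expert's computation. How such skipping decisions are made is outside the scope of this sketch.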
