DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs
Minxuan Lv, Zhenpeng Su, Leiyu Pan, Yizhe Xiong, Zijia Lin, Hui Chen, Wei Zhou, Jungong Han, Guiguang Ding, Cheng Luo, Di Zhang, Kun Gai, Songlin Hu

TL;DR
DSMoE introduces a dynamic, partitioned expert routing method for dense LLMs that improves efficiency and performance by adaptively allocating computation based on input complexity.
Contribution
The paper presents a matrix-partitioned expert architecture with dynamic routing and a sparsity loss, improving efficiency without the loss of model knowledge that parameter removal risks.
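To make the matrix-partitioning idea concrete, here is a minimal NumPy sketch. It assumes a LLaMA-style gated FFN and assumes the split runs along the intermediate dimension; the function names (`partition_ffn`, `partitioned_ffn`) and the toy sizes are illustrative and not taken from the paper. The point it shows is that when every block is active, the partitioned layer reproduces the dense FFN exactly, so no parameters are discarded.

```python
import numpy as np

def silu(z):
    # SiLU activation: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def dense_ffn(x, w_gate, w_up, w_down):
    # LLaMA-style gated FFN: (silu(x W_gate) * (x W_up)) W_down
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def partition_ffn(w_gate, w_up, w_down, num_experts):
    # Assumed partitioning: split the intermediate dimension into equal blocks
    # (columns of w_gate and w_up, rows of w_down).
    gates = np.split(w_gate, num_experts, axis=1)
    ups = np.split(w_up, num_experts, axis=1)
    downs = np.split(w_down, num_experts, axis=0)
    return list(zip(gates, ups, downs))

def partitioned_ffn(x, experts, mask):
    # mask has shape (tokens, num_experts) with 0/1 entries per token and expert.
    out = np.zeros((x.shape[0], experts[0][2].shape[1]))
    for e, (g, u, d) in enumerate(experts):
        out += mask[:, e:e + 1] * dense_ffn(x, g, u, d)
    return out

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_ff, num_experts, tokens = 8, 32, 4, 5
x = rng.standard_normal((tokens, d_model))
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))

experts = partition_ffn(w_gate, w_up, w_down, num_experts)
full = dense_ffn(x, w_gate, w_up, w_down)
all_on = partitioned_ffn(x, experts, np.ones((tokens, num_experts)))
print(np.allclose(full, all_on))
```

The equivalence holds because the activation is elementwise, so each slice of the intermediate dimension contributes an independent term to the output sum. Switching a block off for a token simply drops that term.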
Findings
Outperforms existing pruning and MoE methods under similar computational budgets.
Learns distinctive layerwise activation patterns that inform future MoE designs.
Excels particularly in language generation tasks.
Abstract
As large language models continue to scale, computational costs and resource consumption have emerged as significant challenges. While existing sparsification methods like pruning reduce computational overhead, they risk losing model knowledge through parameter removal. This paper proposes DSMoE (Dynamic Sparse Mixture-of-Experts), a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge based on input complexity. Additionally, we introduce a sparsity loss term to balance performance and computational efficiency. Extensive experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing…
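The routing described in the abstract can be sketched as follows. This is a hedged illustration, not the authors' implementation: the 0.5 threshold, the router weight `w_router`, the helper names, and the mean-probability form of the sparsity penalty are all assumptions made for the example. It shows the two ingredients the abstract names: independent sigmoid gates per expert (so the number of active experts can differ per token) and a straight-through estimator, where the forward pass uses a hard 0/1 mask and the backward pass uses the gradient of the sigmoid instead.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def route(x, w_router, threshold=0.5):
    # Each expert gets its own sigmoid gate, so gates do not compete
    # as they would under a softmax. The threshold is an assumption.
    probs = sigmoid(x @ w_router)
    mask = (probs > threshold).astype(x.dtype)  # hard 0/1 decision used in the forward pass
    return probs, mask

def ste_grad_logits(probs, grad_mask):
    # Straight-through estimator: thresholding has zero gradient almost everywhere,
    # so the backward pass treats the hard mask as if it were probs.
    # d(sigmoid)/d(logit) = p * (1 - p).
    return grad_mask * probs * (1.0 - probs)

def sparsity_penalty(probs):
    # Hypothetical sparsity term: the mean gate probability.
    # Minimizing it pushes gates toward "off"; the paper's exact loss may differ.
    return probs.mean()

rng = np.random.default_rng(0)
tokens, d_model, num_experts = 5, 8, 4
x = rng.standard_normal((tokens, d_model))
w_router = rng.standard_normal((d_model, num_experts))

probs, mask = route(x, w_router)
active_per_token = mask.sum(axis=1)  # can vary from token to token
grad = ste_grad_logits(probs, np.ones_like(probs))
print(active_per_token, sparsity_penalty(probs))
```

In training, a term like `sparsity_penalty` would be added to the language-modeling loss with a weighting coefficient, which is where the trade-off between performance and compute would be set.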
Taxonomy
Topics: Machine Learning and Data Classification · Machine Learning and Algorithms · Data Mining Algorithms and Applications
Methods: Mixture of Experts · LLaMA · Pruning · Sigmoid Activation
