ToMoE: Converting Dense Large Language Models to Mixture-of-Experts through Dynamic Structural Pruning

Shangqian Gao; Ting Hua; Reza Shirkavand; Chi-Heng Lin; Zheng Tang; Zhengao Li; Longge Yuan; Fangyi Li; Zeyu Zhang; Alireza Ganjdanesh; Lou Qian; Xu Jie; Yen-Chang Hsu

arXiv:2501.15316·cs.LG·January 6, 2026

TL;DR

This paper introduces ToMoE, a differentiable dynamic pruning technique that converts dense large language models into Mixture-of-Experts architectures, reducing active parameters without permanent removal, thereby improving efficiency while maintaining performance.

Contribution

The paper presents a differentiable dynamic pruning method that converts the MLP layers of dense models into MoE architectures and, even without fine-tuning, outperforms previous structural pruning approaches across multiple model families.

Findings

1. Outperforms previous structural pruning methods.

2. Works effectively across diverse model families.

3. Maintains performance without fine-tuning.

Abstract

Large Language Models (LLMs) have demonstrated remarkable abilities in tackling a wide range of complex tasks. However, their huge computational and memory costs make it challenging to deploy them on resource-constrained devices or to serve them efficiently. Prior approaches have attempted to alleviate these problems by permanently removing less important model structures, yet the permanent deletion of model parameters often results in substantial performance degradation. In this work, we try to mitigate this issue by reducing the number of active parameters without permanently removing them. Specifically, we introduce a differentiable dynamic pruning method that pushes dense models to maintain a fixed number of active parameters by converting their MLP layers into a Mixture of Experts (MoE) architecture. Our method, even without fine-tuning,…
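The core idea in the abstract, keeping all MLP weights but activating only a fixed-size subset per token, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the hard argmax router, the randomly drawn expert masks (standing in for masks that the paper learns differentiably), and all names such as `masked_moe_mlp` and `expert_masks` are assumptions made for illustration only.

```python
import numpy as np

def dense_mlp(x, w_in, w_out):
    """Plain two-layer ReLU MLP: every hidden unit is active for every token."""
    return np.maximum(x @ w_in, 0.0) @ w_out

def masked_moe_mlp(x, w_in, w_out, w_router, expert_masks):
    """Route each token to one expert; an expert is a 0/1 mask over hidden units.

    No weights are deleted: the dense w_in and w_out are shared by all experts,
    and each expert only decides which hidden units are active.
    """
    expert_ids = np.argmax(x @ w_router, axis=-1)   # shape (tokens,)
    masks = expert_masks[expert_ids]                # shape (tokens, hidden)
    hidden = np.maximum(x @ w_in, 0.0) * masks      # inactive units are zeroed
    return hidden @ w_out, expert_ids

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts, n_active, n_tokens = 8, 32, 4, 16, 5

w_in = rng.normal(size=(d_model, d_hidden))
w_out = rng.normal(size=(d_hidden, d_model))
w_router = rng.normal(size=(d_model, n_experts))   # hypothetical router weights

# Assumption: random masks with exactly n_active units each, used here as a
# placeholder for learned masks.
expert_masks = np.zeros((n_experts, d_hidden))
for e in range(n_experts):
    keep = rng.choice(d_hidden, size=n_active, replace=False)
    expert_masks[e, keep] = 1.0

x = rng.normal(size=(n_tokens, d_model))
y, expert_ids = masked_moe_mlp(x, w_in, w_out, w_router, expert_masks)

# Every token uses the same fixed number of active hidden units.
active_per_token = expert_masks[expert_ids].sum(axis=1)
print(y.shape, active_per_token)
```

For clarity the sketch computes the full hidden layer and then zeroes the inactive units; an efficient version would instead slice the selected columns of `w_in` and rows of `w_out` so that only the active parameters are touched. With all-ones masks the function reduces to `dense_mlp`, which is what "without permanently removing" parameters means in this toy setting.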

Taxonomy

Topics: Expert finding and Q&A systems · Speech and dialogue systems · Traffic Prediction and Management Techniques

Methods: Pruning