Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning

Mingyu Cao; Gen Li; Jie Ji; Jiaqi Zhang; Ajay Jaiswal; Li Shen; Xiaolong Ma; Shiwei Liu; Lu Yin

arXiv:2412.00069·cs.LG·April 21, 2026

TL;DR

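This paper introduces ConDense-MoE (CD-MoE), which condenses a large, sparse MoE layer into a smaller, denser layer instead of dropping the layer entirely. The result uses less memory and runs faster at inference while retaining most of the original accuracy, and lightweight fine-tuning recovers much of the remaining gap.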

Contribution

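The paper proposes a layer-condensation technique for fine-grained MoE models with shared experts. Each condensed layer keeps only a few experts, which are activated for all tokens, so performance is largely preserved while memory and inference efficiency improve. Results are reported on DeepSeekMoE-16B.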

Findings

01  Maintains 90% of the original accuracy while reducing memory usage by 27.5% on DeepSeekMoE-16B.

02  Increases inference speed by a factor of 1.26.

03  Recovers 98% of the original performance after 5 hours of lightweight fine-tuning.

Abstract

Mixture-of-Experts (MoE) has garnered significant attention for its ability to scale up neural networks while utilizing the same or even fewer active parameters. However, MoE does not alleviate the massive memory requirements of networks, which limits their practicality in real-world applications, especially in the era of large language models (LLMs). While recent work explores the possibility of removing entire layers of MoE to reduce memory, the performance degradation is still notable. In this paper, we propose ConDense-MoE (CD-MoE), which, instead of dropping the entire MoE layer, condenses the large, sparse MoE layer into a smaller, denser layer with only a few experts activated for all tokens, while maintaining hardware friendliness. Our approach is specifically designed for fine-grained MoE with shared experts, where Feed-Forward Networks are split into many small experts, with…
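The condensation idea described in the abstract can be illustrated with a toy sketch. The code below is not the authors' implementation. The expert-selection rule (keep the routed experts chosen most often on a small calibration set) and the fixed gate weights (each kept expert's mean gate value) are assumptions made purely for illustration, as are all names such as `condense` and `condensed_forward`. Experts here are single random matrices rather than real feed-forward networks.

```python
import math
import random
from collections import Counter

def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def top_k_routing(router, x, k):
    """Softmax over router logits; return [(expert_id, gate)] for the top k."""
    logits = matvec(router, x)
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]

def sparse_forward(x, shared, routed, router, k):
    """Fine-grained MoE with shared experts: shared experts always run,
    and each token is additionally routed to its own top-k routed experts."""
    out = [0.0] * len(x)
    for w in shared:
        out = [o + y for o, y in zip(out, matvec(w, x))]
    for i, g in top_k_routing(router, x, k):
        out = [o + g * y for o, y in zip(out, matvec(routed[i], x))]
    return out

def condense(routed, router, k, calib_tokens, n_keep):
    """ASSUMED selection rule (not taken from the paper): keep the n_keep
    routed experts selected most often on calibration tokens, and freeze
    each one's gate at its mean gate value over those selections."""
    counts, gate_sum = Counter(), Counter()
    for x in calib_tokens:
        for i, g in top_k_routing(router, x, k):
            counts[i] += 1
            gate_sum[i] += g
    keep_ids = [i for i, _ in counts.most_common(n_keep)]
    kept = [(routed[i], gate_sum[i] / counts[i]) for i in keep_ids]
    return kept, keep_ids

def condensed_forward(x, shared, kept):
    """Condensed layer: no router; the same few experts run for every token."""
    out = [0.0] * len(x)
    for w in shared:
        out = [o + y for o, y in zip(out, matvec(w, x))]
    for w, g in kept:
        out = [o + g * y for o, y in zip(out, matvec(w, x))]
    return out

# Toy sizes, chosen arbitrarily for the illustration.
DIM, N_SHARED, N_ROUTED, TOP_K, N_KEEP = 8, 2, 16, 4, 4
rng = random.Random(0)
shared = [rand_matrix(DIM, DIM, rng) for _ in range(N_SHARED)]
routed = [rand_matrix(DIM, DIM, rng) for _ in range(N_ROUTED)]
router = rand_matrix(N_ROUTED, DIM, rng)
calib = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(200)]

kept, keep_ids = condense(routed, router, TOP_K, calib, N_KEEP)
token = calib[0]
y_sparse = sparse_forward(token, shared, routed, router, TOP_K)
y_dense = condensed_forward(token, shared, kept)
print("kept routed experts:", keep_ids)
print("expert matrices before/after:",
      N_SHARED + N_ROUTED, "/", N_SHARED + len(kept))
```

In this toy setting the router and the unselected routed experts are no longer needed after condensation, which is where a memory saving would come from. Every token passes through the same small set of experts, so the condensed layer behaves like a dense one.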

Code & Models

Repositories

GitHub: duterscmy/CD-MoE/tree/main
