TL;DR
This paper introduces ConDense-MoE, a method to condense sparse MoE layers into smaller dense layers, reducing memory and increasing speed while maintaining high accuracy, with minimal fine-tuning.
Contribution
It proposes a novel layer condensation technique for fine-grained MoE models that preserves performance and improves efficiency, demonstrated on large language models.
Findings
Retains 90% of the original model's accuracy while cutting memory use by 27.5% on DeepSeekMoE-16B.
Raises inference speed to 1.26 times that of the original model.
Recovers 98% of original performance with 5 hours of lightweight fine-tuning.
Abstract
Mixture-of-Experts (MoE) has garnered significant attention for its ability to scale up neural networks while utilizing the same or even fewer active parameters. However, MoE does not alleviate the massive memory requirements of networks, which limits their practicality in real-world applications, especially in the era of large language models (LLMs). While recent work explores the possibility of removing entire layers of MoE to reduce memory, the performance degradation is still notable. In this paper, we propose ConDense-MoE (CD-MoE), which, instead of dropping the entire MoE layer, condenses the large, sparse MoE layer into a smaller, denser layer with only a few experts activated for all tokens, while maintaining hardware friendliness. Our approach is specifically designed for fine-grained MoE with shared experts, where Feed-Forward Networks are split into many small experts, with…
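The condensation idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the sizes, the helper names (`route`, `pick_experts`, `condensed_layer`), the selection criterion (keep the routed experts chosen most often on calibration tokens), and the unweighted sum over kept experts are all assumptions made for illustration. It only shows the structural change: a router picking per-token experts is replaced by a fixed, small set of experts that runs for every token, alongside the shared expert.

```python
import math
import random
from collections import Counter

DIM, N_ROUTED, TOP_K, KEEP = 4, 8, 2, 2  # toy sizes, not the paper's settings

def make_linear(rows, cols, rng):
    """Random rows x cols matrix; stands in for a small expert FFN or a router."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matvec(w, x):
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def vadd(a, b):
    return [p + q for p, q in zip(a, b)]

def route(router_w, x, top_k):
    """Softmax gate over routed experts; return the top_k (index, weight) pairs."""
    logits = matvec(router_w, x)
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    return [(i, probs[i]) for i in top]

def sparse_moe(x, shared_w, routed_ws, router_w, top_k):
    """Shared expert always runs; each token also picks its own top_k routed experts."""
    out = matvec(shared_w, x)
    for i, gate in route(router_w, x, top_k):
        out = vadd(out, [gate * v for v in matvec(routed_ws[i], x)])
    return out

def pick_experts(calib, router_w, top_k, keep):
    """Hypothetical criterion: keep the experts chosen most often on calibration tokens."""
    counts = Counter(i for x in calib for i, _ in route(router_w, x, top_k))
    return sorted(i for i, _ in counts.most_common(keep))

def condensed_layer(x, shared_w, kept_ws):
    """No router: the shared expert and the same kept experts run for every token."""
    out = matvec(shared_w, x)
    for w in kept_ws:
        out = vadd(out, matvec(w, x))  # unweighted sum is an assumption
    return out

rng = random.Random(0)
shared_w = make_linear(DIM, DIM, rng)
routed_ws = [make_linear(DIM, DIM, rng) for _ in range(N_ROUTED)]
router_w = make_linear(N_ROUTED, DIM, rng)
calib = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(32)]

kept = pick_experts(calib, router_w, TOP_K, KEEP)
kept_ws = [routed_ws[i] for i in kept]  # remaining routed experts and the router are dropped

token = calib[0]
y_sparse = sparse_moe(token, shared_w, routed_ws, router_w, TOP_K)
y_dense = condensed_layer(token, shared_w, kept_ws)
print(f"kept experts: {kept}; expert matrices stored: {1 + N_ROUTED} -> {1 + len(kept_ws)}")
```

In this toy setup the memory saving comes from discarding the unselected routed experts and the router, and the condensed layer is dense in the sense that every token passes through the same experts. How closely the condensed output tracks the sparse one depends on the selection and weighting scheme, which this sketch does not attempt to reproduce.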
