TL;DR
This paper introduces an analytical post-training method to efficiently convert dense feed-forward networks into sparse Mixture-of-Experts architectures using minimal data, significantly reducing inference costs.
Contribution
It presents a novel activation pattern analysis framework that enables rapid FFN-to-MoE restructuring without extensive retraining or large datasets.
Findings
Achieves up to 1.17x speedup in compute-bound scenarios.
Requires only minutes of processing, plus 2,000 samples for the optional fine-tuning.
Outperforms existing FFN-to-MoE conversion methods that require far more data and compute.
Abstract
Scaling large language models (LLMs) improves performance but significantly increases inference costs, with feed-forward networks (FFNs) consuming the majority of computational resources. While Mixture-of-Experts (MoE) architectures can reduce this cost through sparse activation, restructuring existing dense models into MoEs typically requires extensive retraining on hundreds of billions of tokens. We propose an analytical post-training framework that rapidly restructures FFNs into sparse MoE architectures using only a small calibration dataset. The method analyzes neuron activation patterns to partition neurons into always-active shared experts and conditionally activated routed experts, then constructs a router analytically from representative neuron statistics, enabling immediate deployment or optional lightweight fine-tuning. This approach applies both to dense models and…
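The partition-then-route idea described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the firing-rate ranking, the round-robin grouping of routed neurons, the choice of the most frequently firing neuron as each expert's representative, and all function names (`partition_neurons`, `build_router`, `route`) are assumptions made for illustration only. The toy FFN uses ReLU and random weights in pure Python.

```python
import random

def relu(v):
    return v if v > 0.0 else 0.0

def hidden_activations(w_in, x):
    # Hidden layer of a toy FFN without bias: ReLU(W_in @ x).
    return [relu(sum(w * v for w, v in zip(row, x))) for row in w_in]

def partition_neurons(acts, n_shared, n_experts, threshold=0.0):
    # acts[s][j] is the activation of neuron j on calibration sample s.
    n_samples = len(acts)
    n_neurons = len(acts[0])
    # Firing rate: fraction of calibration samples on which a neuron is active.
    rate = [sum(1 for row in acts if row[j] > threshold) / n_samples
            for j in range(n_neurons)]
    order = sorted(range(n_neurons), key=lambda j: rate[j], reverse=True)
    # Assumption: the most frequently firing neurons form the shared expert.
    shared = order[:n_shared]
    rest = order[n_shared:]
    # Assumption: round-robin grouping stands in for a real clustering step.
    experts = [rest[e::n_experts] for e in range(n_experts)]
    return shared, experts, rate

def build_router(w_in, experts, rate):
    # Assumption: each expert's representative is its highest-rate neuron,
    # and the router row is that neuron's input weight vector (no training).
    reps = [max(group, key=lambda j: rate[j]) for group in experts]
    return [list(w_in[j]) for j in reps]

def route(router, x, top_k):
    # Score each routed expert by its representative's pre-activation.
    scores = [sum(w * v for w, v in zip(row, x)) for row in router]
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:top_k]

# Toy calibration run with made-up sizes.
random.seed(0)
d_model, n_neurons, n_calib = 8, 32, 200
w_in = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_neurons)]
calib = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_calib)]
acts = [hidden_activations(w_in, x) for x in calib]

shared, experts, rate = partition_neurons(acts, n_shared=8, n_experts=4)
router = build_router(w_in, experts, rate)
active = route(router, calib[0], top_k=2)  # indices of routed experts to run
```

At inference time, a layer built this way would always evaluate the neurons in `shared` and only the neurons belonging to the experts listed in `active`, which is where the compute saving would come from.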
