MoEfication: Transformer Feed-forward Layers are Mixtures of Experts
Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, Jie Zhou

TL;DR
This paper introduces MoEfication, a method to convert Transformer feed-forward layers into Mixtures of Experts, reducing computation while preserving performance and revealing functional partitions within FFNs.
Contribution
It proposes a novel approach to transform FFNs into MoE structures, enabling efficient inference and deeper understanding of FFN mechanisms.
Findings
MoEfication activates only 10-30% of FFN parameters per input.
The converted model maintains over 95% of the original model's performance.
It achieves a 2x inference speedup when using 25% of FFN parameters.
Abstract
Recent work has shown that feed-forward networks (FFNs) in pre-trained Transformers are a key component, storing various linguistic and factual knowledge. However, the computational patterns of FFNs are still unclear. In this work, we study the computational patterns of FFNs and observe that most inputs only activate a tiny ratio of neurons of FFNs. This phenomenon is similar to the sparsity of the human brain, which drives research on functional partitions of the human brain. To verify whether functional partitions also emerge in FFNs, we propose to convert a model into its MoE version with the same parameters, namely MoEfication. Specifically, MoEfication consists of two phases: (1) splitting the parameters of FFNs into multiple functional partitions as experts, and (2) building expert routers to decide which experts will be used for each input. Experimental results show that…
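The two phases described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how neurons are grouped or how routers are built, so everything below is an assumption made for illustration: the contiguous block split, the activation-based router (which computes every pre-activation and therefore saves no computation itself), and all function names are hypothetical and are not the authors' method. The sketch only shows that a ReLU FFN can be partitioned into experts whose outputs sum back to the dense output, and that selecting a subset of experts gives an approximation.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def dense_ffn(x, w1, w2):
    """Dense FFN: relu(x @ w1) @ w2, with w1 of shape d_model x d_ff and w2 of shape d_ff x d_model."""
    d_ff, d_model = len(w2), len(w2[0])
    h = relu([sum(x[i] * w1[i][j] for i in range(len(x))) for j in range(d_ff)])
    return [sum(h[j] * w2[j][k] for j in range(d_ff)) for k in range(d_model)]


def split_into_experts(d_ff, num_experts):
    """Phase 1 (assumed, simplified): assign hidden neurons to experts in contiguous blocks."""
    assert d_ff % num_experts == 0, "d_ff must be divisible by num_experts"
    size = d_ff // num_experts
    return [list(range(e * size, (e + 1) * size)) for e in range(num_experts)]


def expert_scores(x, w1, experts):
    """Phase 2 (hypothetical router): score an expert by the summed ReLU activation of its neurons."""
    scores = []
    for neurons in experts:
        total = 0.0
        for j in neurons:
            pre = sum(x[i] * w1[i][j] for i in range(len(x)))
            total += max(0.0, pre)
        scores.append(total)
    return scores


def moe_ffn(x, w1, w2, experts, top_n):
    """Evaluate only the top_n highest-scoring experts and sum their contributions."""
    scores = expert_scores(x, w1, experts)
    ranked = sorted(range(len(experts)), key=lambda e: scores[e], reverse=True)
    chosen = ranked[:top_n]
    out = [0.0] * len(w2[0])
    for e in chosen:
        for j in experts[e]:
            h = max(0.0, sum(x[i] * w1[i][j] for i in range(len(x))))
            for k in range(len(out)):
                out[k] += h * w2[j][k]
    return out, chosen


# Toy sizes and random weights, chosen only for demonstration.
random.seed(0)
d_model, d_ff, num_experts = 4, 8, 4
w1 = [[random.gauss(0, 1) for _ in range(d_ff)] for _ in range(d_model)]
w2 = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
x = [random.gauss(0, 1) for _ in range(d_model)]

experts = split_into_experts(d_ff, num_experts)
dense = dense_ffn(x, w1, w2)
full, _ = moe_ffn(x, w1, w2, experts, top_n=num_experts)  # all experts: same as dense
sparse, chosen = moe_ffn(x, w1, w2, experts, top_n=2)     # half the experts: an approximation
```

With every expert selected, `full` matches `dense` up to floating-point rounding, which reflects the abstract's statement that the MoE version keeps the same parameters. A practical router would have to be cheaper than computing all pre-activations; this sketch does not attempt that.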
Code & Models
- 🤗 SparseLLM/ReluLLaMA-7B · model · 206 dl · ♡ 11
- 🤗 SparseLLM/ReluFalcon-40B · model · 12 dl · ♡ 4
- 🤗 SparseLLM/ReluLLaMA-13B · model · 30 dl · ♡ 4
- 🤗 SparseLLM/ReluLLaMA-70B · model · 29 dl · ♡ 7
- 🤗 Tiiny/Bamboo-base-v0_1 · model · 15 dl · ♡ 21
- 🤗 Tiiny/Bamboo-DPO-v0_1 · model · 4 dl · ♡ 7
- 🤗 QuantFactory/Bamboo-base-v0.1-GGUF · model · 39 dl
- 🤗 RichardErkhov/SparseLLM_-_ReluLLaMA-13B-gguf · model · 20 dl
Taxonomy
Topics: Topic Modeling · Machine Learning in Bioinformatics · Ferroelectric and Negative Capacitance Devices
