MoEfication: Transformer Feed-forward Layers are Mixtures of Experts

Zhengyan Zhang; Yankai Lin; Zhiyuan Liu; Peng Li; Maosong Sun; Jie Zhou

arXiv:2110.01786 · cs.CL · April 6, 2022 · 1 citation

TL;DR

This paper introduces MoEfication, a method to convert Transformer feed-forward layers into Mixtures of Experts, reducing computation while preserving performance and revealing functional partitions within FFNs.

Contribution

It proposes a novel approach to transform FFNs into MoE structures, enabling efficient inference and deeper understanding of FFN mechanisms.

Findings

01. MoEfication activates only 10-30% of FFN parameters per input.

02. The converted model maintains over 95% of the original model's performance.

03. It achieves a 2x inference speedup while using 25% of FFN parameters.

Abstract

Recent work has shown that feed-forward networks (FFNs) in pre-trained Transformers are a key component, storing various linguistic and factual knowledge. However, the computational patterns of FFNs are still unclear. In this work, we study the computational patterns of FFNs and observe that most inputs only activate a tiny ratio of neurons of FFNs. This phenomenon is similar to the sparsity of the human brain, which drives research on functional partitions of the human brain. To verify whether functional partitions also emerge in FFNs, we propose to convert a model into its MoE version with the same parameters, namely MoEfication. Specifically, MoEfication consists of two phases: (1) splitting the parameters of FFNs into multiple functional partitions as experts, and (2) building expert routers to decide which experts will be used for each input. Experimental results show that…
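The two phases named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how neurons are grouped or how routers are built, so the contiguous split in `split_into_experts` and the mean-weight scoring in `route` are placeholder assumptions, and all function names here are hypothetical. The sketch only shows the structural idea that an FFN's hidden neurons can be partitioned into experts, and that computing a subset of experts reuses the original parameters unchanged.

```python
import numpy as np

def dense_ffn(x, W1, b1, W2, b2):
    """Standard FFN for one token: ReLU(x W1 + b1) W2 + b2."""
    h = np.maximum(x @ W1 + b1, 0.0)
    return h @ W2 + b2

def split_into_experts(d_ff, num_experts):
    """Phase 1 (placeholder): partition hidden-neuron indices into
    contiguous groups. This grouping is an assumption for illustration."""
    return np.array_split(np.arange(d_ff), num_experts)

def route(x, W1, experts, top_k):
    """Phase 2 (placeholder): score each expert by the dot product of x
    with the mean of that expert's input weights, then keep the top_k."""
    scores = np.array([x @ W1[:, idx].mean(axis=1) for idx in experts])
    return np.argsort(scores)[::-1][:top_k]

def moe_ffn(x, W1, b1, W2, b2, experts, top_k):
    """Compute the FFN using only the neurons of the selected experts."""
    chosen = route(x, W1, experts, top_k)
    idx = np.concatenate([experts[e] for e in chosen])
    h = np.maximum(x @ W1[:, idx] + b1[idx], 0.0)
    return h @ W2[idx, :] + b2

# Toy sizes and random weights, chosen only for the demonstration.
rng = np.random.default_rng(0)
d_model, d_ff, num_experts = 8, 32, 4
W1 = rng.normal(size=(d_model, d_ff))
b1 = rng.normal(size=d_ff)
W2 = rng.normal(size=(d_ff, d_model))
b2 = rng.normal(size=d_model)
x = rng.normal(size=d_model)

experts = split_into_experts(d_ff, num_experts)
full = moe_ffn(x, W1, b1, W2, b2, experts, top_k=num_experts)
sparse = moe_ffn(x, W1, b1, W2, b2, experts, top_k=1)
print("matches dense FFN with all experts:",
      np.allclose(full, dense_ffn(x, W1, b1, W2, b2)))
print("approximation error with 1 of 4 experts:",
      float(np.linalg.norm(sparse - dense_ffn(x, W1, b1, W2, b2))))
```

Selecting every expert reproduces the dense output, which is the sense in which the MoE version has "the same parameters". With random weights like these, dropping experts gives a poor approximation; the abstract's premise is that in pre-trained models most inputs activate only a small share of neurons, so a good partition and router can skip the rest.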

Code & Models

Repositories

thunlp/moefication (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Machine Learning in Bioinformatics · Ferroelectric and Negative Capacitance Devices