Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules
Zhuocheng Gong, Ang Lv, Jian Guan, Junxi Yan, Wei Wu, Huishuai Zhang, Minlie Huang, Dongyan Zhao, Rui Yan

TL;DR
This paper introduces a novel mixture-of-modules architecture that allows dynamic assembly of transformer modules, enabling flexible computation paths and improved efficiency, outperforming traditional transformers on benchmarks.
Contribution
The paper proposes a new architecture called mixture-of-modules (MoM) that breaks the depth-ordered computation in transformers by dynamically selecting modules, enhancing flexibility and efficiency.
Findings
MoMs outperform vanilla transformers on GLUE and XSUM.
In one configuration, MoM-large increases the depth of the computation graph by over 38% and achieves better performance.
In another configuration, MoM-large reduces depth by over 60%, saving TFLOPs and memory.
Abstract
Is it always necessary to compute tokens from shallow to deep layers in Transformers? The continued success of vanilla Transformers and their variants suggests an undoubted "yes". In this work, however, we attempt to break the depth-ordered convention by proposing a novel architecture dubbed mixture-of-modules (MoM), which is motivated by an intuition that any layer, regardless of its position, can be used to compute a token as long as it possesses the needed processing capabilities. The construction of MoM starts from a finite set of modules defined by multi-head attention and feed-forward networks, each distinguished by its unique parameterization. Two routers then iteratively select attention modules and feed-forward modules from the set to process a token. The selection dynamically expands the computation graph in the forward pass of the token, culminating in an assembly of modules.…
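The routing loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the placeholder modules (a residual tanh update standing in for both attention and feed-forward modules), the dot-product routers with top-1 selection, the fixed number of assembly steps, and every name below are assumptions made for illustration only.

```python
import math
import random

def make_matrix(rows, cols, rng):
    """Random rows x cols weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(w, x):
    """Multiply matrix w by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def make_module(dim, rng):
    """Hypothetical stand-in for an attention or feed-forward module:
    a residual update x + tanh(W x) with its own parameters W."""
    w = make_matrix(dim, dim, rng)
    def module(x):
        return [xi + math.tanh(hi) for xi, hi in zip(x, matvec(w, x))]
    return module

def route(router_w, x):
    """Assumed router: score every module from the token state, pick the top one."""
    scores = matvec(router_w, x)
    return max(range(len(scores)), key=scores.__getitem__)

def assemble(x, attn_modules, ffn_modules, attn_router, ffn_router, steps):
    """Iteratively select one attention module and one feed-forward module
    per step, recording the path the token takes through the module set."""
    path = []
    for _ in range(steps):
        a = route(attn_router, x)
        x = attn_modules[a](x)
        f = route(ffn_router, x)
        x = ffn_modules[f](x)
        path.append((a, f))
    return x, path

rng = random.Random(0)
dim, num_modules, steps = 4, 3, 5
attn_modules = [make_module(dim, rng) for _ in range(num_modules)]
ffn_modules = [make_module(dim, rng) for _ in range(num_modules)]
attn_router = make_matrix(num_modules, dim, rng)
ffn_router = make_matrix(num_modules, dim, rng)

token = [0.1, -0.2, 0.3, 0.4]
out, path = assemble(token, attn_modules, ffn_modules, attn_router, ffn_router, steps)
print(path)  # sequence of (attention index, feed-forward index) pairs
```

Because a module can be picked at any step regardless of its index, the same module may appear several times in `path` or not at all, which is the sense in which the computation graph is assembled dynamically rather than fixed by layer order.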
Taxonomy
Topics: Manufacturing Process and Optimization
Methods: Sparse Evolutionary Training · Attention Is All You Need · Softmax · Byte Pair Encoding · Layer Normalization · Linear Layer · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam
