Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules

Zhuocheng Gong; Ang Lv; Jian Guan; Junxi Yan; Wei Wu; Huishuai Zhang; Minlie Huang; Dongyan Zhao; Rui Yan

arXiv:2407.06677·cs.CL·July 10, 2024

TL;DR

This paper introduces mixture-of-modules (MoM), an architecture that dynamically assembles Transformer modules into a computation path for each token instead of following a fixed layer order. MoMs outperform vanilla Transformers on GLUE and XSUM.

Contribution

The paper proposes a new architecture called mixture-of-modules (MoM) that breaks the depth-ordered computation in transformers by dynamically selecting modules, enhancing flexibility and efficiency.

Findings

01

MoMs outperform vanilla transformers on GLUE and XSUM.

02

In one configuration, MoM-large increases the effective depth of its computation graph by over 38% and achieves better performance.

03

In a separate, shallower configuration, MoM-large reduces computation-graph depth by over 60%, saving TFLOPs and memory.

Abstract

Is it always necessary to compute tokens from shallow to deep layers in Transformers? The continued success of vanilla Transformers and their variants suggests an undoubted "yes". In this work, however, we attempt to break the depth-ordered convention by proposing a novel architecture dubbed mixture-of-modules (MoM), which is motivated by an intuition that any layer, regardless of its position, can be used to compute a token as long as it possesses the needed processing capabilities. The construction of MoM starts from a finite set of modules defined by multi-head attention and feed-forward networks, each distinguished by its unique parameterization. Two routers then iteratively select attention modules and feed-forward modules from the set to process a token. The selection dynamically expands the computation graph in the forward pass of the token, culminating in an assembly of modules.…
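The routing loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes top-1 routing, uses plain linear maps as stand-ins for the attention and feed-forward modules (real attention mixes information across tokens), and every name below (`make_linear`, `route`, `mom_forward`) is hypothetical. It only shows how two routers can repeatedly pick modules from a finite set, so that the path a token takes is assembled during the forward pass rather than fixed by layer order.

```python
import math
import random


def make_linear(rows, cols, rng):
    """Random rows x cols weight matrix (a stand-in for learned parameters)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def route(router_w, x):
    """Score every module for hidden state x and return the best index (top-1)."""
    scores = matvec(router_w, x)
    return max(range(len(scores)), key=scores.__getitem__)


def mom_forward(x, attn_modules, ffn_modules, attn_router, ffn_router, steps):
    """Assemble a computation path for one token by routing at every step."""
    path = []
    for _ in range(steps):
        # Assumption: the "attention" module is a linear map with a residual.
        a = route(attn_router, x)
        x = [xi + yi for xi, yi in zip(x, matvec(attn_modules[a], x))]
        # Assumption: the "feed-forward" module is linear + ReLU with a residual.
        f = route(ffn_router, x)
        h = [max(0.0, v) for v in matvec(ffn_modules[f], x)]
        x = [xi + hi for xi, hi in zip(x, h)]
        path.append((a, f))
    return x, path


rng = random.Random(0)
dim, n_modules, steps = 4, 3, 5  # more steps than modules, so modules get reused

attn_modules = [make_linear(dim, dim, rng) for _ in range(n_modules)]
ffn_modules = [make_linear(dim, dim, rng) for _ in range(n_modules)]
attn_router = make_linear(n_modules, dim, rng)
ffn_router = make_linear(n_modules, dim, rng)

token = [rng.gauss(0.0, 1.0) for _ in range(dim)]
out, path = mom_forward(token, attn_modules, ffn_modules,
                        attn_router, ffn_router, steps)
print(path)  # list of (attention module index, feed-forward module index) pairs
```

Because the number of routing steps is independent of the number of modules, the same module can appear more than once on a path. That decoupling of computation depth from parameter count is the property the findings above refer to.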

Taxonomy

Topics: Manufacturing Process and Optimization

Methods: Sparse Evolutionary Training · Attention Is All You Need · Softmax · Byte Pair Encoding · Layer Normalization · Linear Layer · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam