MoEUT: Mixture-of-Experts Universal Transformers
Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, Christopher Potts, Christopher D. Manning

TL;DR
MoEUT is a mixture-of-experts, shared-layer Transformer that slightly outperforms standard Transformers on language modeling while using significantly less compute and memory, addressing the parameter-compute limitation of earlier shared-layer designs.
Contribution
It proposes a novel MoE-based shared-layer Transformer architecture with new normalization and grouping schemes, enabling competitive performance with fewer resources.
Findings
MoEUT slightly outperforms standard Transformers on language modeling benchmarks.
MoEUT uses significantly less compute and memory than standard Transformers.
The architecture successfully combines recent MoE advances with UT-specific modifications.
Abstract
Previous work on Universal Transformers (UTs) has demonstrated the importance of parameter sharing across layers. By allowing recurrence in depth, UTs have advantages over standard Transformers in learning compositional generalizations, but layer-sharing comes with a practical limitation of parameter-compute ratio: it drastically reduces the parameter count compared to the non-shared model with the same dimensionality. Naively scaling up the layer size to compensate for the loss of parameters makes its computational resource requirements prohibitive. In practice, no previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling. Here we propose MoEUT (pronounced "moot"), an effective mixture-of-experts (MoE)-based shared-layer Transformer architecture, which combines several recent advances in…
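The parameter-compute trade-off described in the abstract can be illustrated with rough arithmetic. The sketch below is not from the paper: it assumes a feed-forward width of 4 × d_model, counts only the attention and feed-forward weight matrices, ignores attention-score compute, and uses illustrative sizes (d_model = 1024, 16 layers). The function names are hypothetical.

```python
def layer_params(d_model, d_ff=None):
    """Approximate weight count of one Transformer layer (biases and norms ignored)."""
    if d_ff is None:
        d_ff = 4 * d_model               # assumption: common 4x feed-forward expansion
    attention = 4 * d_model * d_model    # Q, K, V and output projections
    feed_forward = 2 * d_model * d_ff    # up and down projections
    return attention + feed_forward


def stack_stats(d_model, n_layers, shared):
    """Return (parameter count, approximate multiply-accumulates per token)."""
    per_layer = layer_params(d_model)
    # With layer sharing, one set of weights is reused at every depth step.
    params = per_layer if shared else per_layer * n_layers
    # Every layer application touches all of its weights once per token,
    # whether or not those weights are shared.
    macs_per_token = per_layer * n_layers
    return params, macs_per_token


baseline = stack_stats(1024, 16, shared=False)   # standard Transformer
shared = stack_stats(1024, 16, shared=True)      # same width, shared layer
# Widening the shared layer by sqrt(16) = 4 restores the parameter count.
widened = stack_stats(4096, 16, shared=True)

print("baseline:", baseline)
print("shared:  ", shared)
print("widened: ", widened)
print("compute ratio widened/baseline:", widened[1] // baseline[1])
```

Under these assumptions, sharing at the same width cuts the parameter count by a factor equal to the depth while leaving compute unchanged, and widening the shared layer until the parameter counts match multiplies compute per token by that same factor. This is the scaling problem the abstract calls prohibitive.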
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections
