MoEUT: Mixture-of-Experts Universal Transformers

Róbert Csordás; Kazuki Irie; Jürgen Schmidhuber; Christopher Potts; Christopher D. Manning

arXiv:2405.16039 · cs.LG · October 15, 2024 · 1 citation

TL;DR

MoEUT is a mixture-of-experts (MoE) shared-layer Transformer that slightly outperforms standard Transformers on language modeling while using less compute and memory, addressing the parameter-count limitation that has held back layer sharing.

Contribution

The paper proposes an MoE-based shared-layer Transformer architecture with new normalization and grouping schemes, making shared-layer models competitive at a lower compute and memory cost.

Findings

1. MoEUT slightly outperforms standard Transformers on language modeling benchmarks.
2. MoEUT uses significantly less compute and memory than standard Transformers.
3. The architecture combines recent MoE advances with modifications specific to Universal Transformers (UTs).

Abstract

Previous work on Universal Transformers (UTs) has demonstrated the importance of parameter sharing across layers. By allowing recurrence in depth, UTs have advantages over standard Transformers in learning compositional generalizations, but layer-sharing comes with a practical limitation of parameter-compute ratio: it drastically reduces the parameter count compared to the non-shared model with the same dimensionality. Naively scaling up the layer size to compensate for the loss of parameters makes its computational resource requirements prohibitive. In practice, no previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling. Here we propose MoEUT (pronounced "moot"), an effective mixture-of-experts (MoE)-based shared-layer Transformer architecture, which combines several recent advances in…
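The parameter-compute trade-off described in the abstract can be made concrete with a rough count. The sketch below is illustrative only and is not taken from the paper: it assumes a plain Transformer layer with four d_model × d_model attention projections and a two-matrix feedforward block, ignores biases, embeddings, and normalization, and uses hypothetical sizes (12 layers, d_model = 768, d_ff = 3072). All function names are invented for this example.

```python
def attention_params(d_model: int) -> int:
    # Assumption: query, key, value and output projections, each d_model x d_model.
    return 4 * d_model * d_model


def ffn_params(d_model: int, d_ff: int) -> int:
    # Assumption: two weight matrices (d_model x d_ff and d_ff x d_model), no biases.
    return 2 * d_model * d_ff


def layer_params(d_model: int, d_ff: int) -> int:
    return attention_params(d_model) + ffn_params(d_model, d_ff)


def stack_params(n_layers: int, d_model: int, d_ff: int, shared: bool) -> int:
    # With layer sharing, one set of weights is reused at every depth step.
    per_layer = layer_params(d_model, d_ff)
    return per_layer if shared else n_layers * per_layer


def weight_macs_per_token(n_layers: int, d_model: int, d_ff: int) -> int:
    # Rough compute proxy: every layer application touches all of its weights
    # once per token, whether or not those weights are shared across depth.
    return n_layers * layer_params(d_model, d_ff)


def shared_d_ff_to_match(n_layers: int, d_model: int, d_ff: int) -> int:
    # Feedforward width a single shared layer would need (d_model fixed)
    # to hold as many parameters as the non-shared stack.
    target = stack_params(n_layers, d_model, d_ff, shared=False)
    return (target - attention_params(d_model)) // (2 * d_model)


if __name__ == "__main__":
    n_layers, d_model, d_ff = 12, 768, 3072  # hypothetical sizes

    dense = stack_params(n_layers, d_model, d_ff, shared=False)
    shared = stack_params(n_layers, d_model, d_ff, shared=True)
    print("non-shared params:", dense)
    print("shared params:    ", shared)

    wide_d_ff = shared_d_ff_to_match(n_layers, d_model, d_ff)
    print("d_ff needed to match params with one shared layer:", wide_d_ff)
    print("MACs/token, non-shared:   ", weight_macs_per_token(n_layers, d_model, d_ff))
    print("MACs/token, widened share:", weight_macs_per_token(n_layers, d_model, wide_d_ff))
```

Under these assumptions, sharing one layer across 12 depth steps cuts the parameter count by a factor of 12 at unchanged compute, while widening the shared layer to recover the parameters multiplies the per-token weight compute by roughly the same factor. This is the gap that a sparsely activated (MoE) shared layer is meant to close.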

Code & Models

Repositories

robertcsordas/moeut (PyTorch, official)

Taxonomy

Topics: Semantic Web and Ontologies

Methods: Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections