Sparse Universal Transformer

Shawn Tan; Yikang Shen; Zhenfang Chen; Aaron Courville; Chuang Gan

arXiv:2310.07096·cs.CL·October 12, 2023

TL;DR

The Sparse Universal Transformer (SUT) reduces the computation of Universal Transformers by combining a Sparse Mixture of Experts with a dynamic halting mechanism, while retaining their parameter efficiency, performance, and generalization.

Contribution

Introduces SUT, which combines a Sparse Mixture of Experts (SMoE) with a new stick-breaking-based dynamic halting mechanism to reduce computation while preserving the benefits of Universal Transformers.
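To make the combination concrete, here is a minimal sketch of top-k sparse expert routing applied with the same experts at every depth step, which is the parameter-sharing idea behind Universal Transformers. It is an illustration under simplifying assumptions, not the paper's implementation: inputs are plain scalars, and the names `top_k_gate`, `smoe_layer`, and `shared_depth_forward` are hypothetical.

```python
import math


def top_k_gate(logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def smoe_layer(x, experts, router_logits, k=2):
    """Evaluate only the selected experts and mix their outputs by gate weight."""
    gates = top_k_gate(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())


def shared_depth_forward(x, experts, router, depth, k=2):
    """Reuse the same experts and router at every step (shared parameters)."""
    for _ in range(depth):
        x = smoe_layer(x, experts, router(x), k)
    return x


# Toy experts and a constant router, both assumptions for illustration only.
experts = [lambda v: v + 1.0, lambda v: 2.0 * v, lambda v: -v]
router = lambda v: [0.0, 0.0, -5.0]
result = shared_depth_forward(2.0, experts, router, depth=2)
```

With `k=2`, the third expert is never evaluated in this toy setup, which is where the compute saving of sparse routing comes from: total parameters grow with the number of experts, but per-step cost grows only with `k`.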

Findings

01. Matches the performance of strong baseline models on WMT'14 while using half the computation and parameters.

02. Demonstrates strong generalization on formal language tasks.

03. Enables a 50% reduction in inference computation with minimal performance loss.

Abstract

The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers. Empirical evidence shows that UTs have better compositional generalization than Vanilla Transformers (VTs) in formal language tasks. The parameter sharing also gives UTs better parameter efficiency than VTs. Despite these advantages, scaling up UT parameters is much more compute- and memory-intensive than scaling up a VT. This paper proposes the Sparse Universal Transformer (SUT), which leverages Sparse Mixture of Experts (SMoE) and a new stick-breaking-based dynamic halting mechanism to reduce UT's computational complexity while retaining its parameter efficiency and generalization ability. Experiments show that SUT achieves the same performance as strong baseline models while using only half the computation and parameters on WMT'14 and strong generalization results on formal language…
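As a rough illustration of the stick-breaking idea named in the abstract, the sketch below turns per-layer halting probabilities into a distribution over depths: each layer takes a fraction of whatever probability mass is still left. This is the generic stick-breaking construction, not necessarily the paper's exact formulation; the function names and the `threshold` parameter are assumptions made for this example.

```python
def stick_breaking(halt_probs):
    """Convert per-layer halting probabilities into per-layer halting weights.

    Layer l receives halt_probs[l] times the mass not yet claimed by
    earlier layers. Returns (weights, remaining_mass).
    """
    weights = []
    remaining = 1.0  # portion of the "stick" not yet broken off
    for p in halt_probs:
        weights.append(remaining * p)
        remaining *= 1.0 - p
    return weights, remaining


def layers_until_halt(halt_probs, threshold=0.99):
    """Number of layers applied before cumulative halting mass reaches threshold.

    The threshold is a hypothetical inference-time knob: lowering it stops
    earlier and saves computation.
    """
    weights, _ = stick_breaking(halt_probs)
    cumulative = 0.0
    for depth, weight in enumerate(weights, start=1):
        cumulative += weight
        if cumulative >= threshold:
            return depth
    return len(halt_probs)


weights, leftover = stick_breaking([0.5, 0.5, 0.5])
depth_used = layers_until_halt([0.5, 0.5, 0.5], threshold=0.7)
```

Because the weights and the leftover mass always sum to one, the weights can be used to average layer outputs during training, and a cumulative-mass cutoff like the one above gives a simple way to trade accuracy for computation at inference time.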

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Softmax · Byte Pair Encoding · Label Smoothing · Adam · Absolute Position Encodings