Universal Transformers

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Łukasz Kaiser

arXiv:1807.03819 · cs.CL · March 6, 2019 · 396 cites

TL;DR

The Universal Transformer extends the Transformer architecture by integrating recurrent processing and dynamic halting, enabling better generalization and performance on sequence tasks like language modeling and translation.

Contribution

It introduces a parallel-in-time self-attentive recurrent model that combines the strengths of RNNs and Transformers, with a dynamic halting mechanism for improved accuracy.
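The idea of recurrence over depth with per-position halting can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`init_params`, `shared_step`, `run_universal_sketch`), the single attention head, the random weights, and the cumulative-probability halting rule are all assumptions made for illustration, and layer normalization, dropout, and position or timestep embeddings are left out. What it does show is the core pattern: one set of weights is reused at every step, all positions are updated in parallel, and each position stops updating once its accumulated halting score crosses a threshold.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_params(d_model, d_ff, seed=0):
    # Hypothetical random weights; a real model would learn these.
    rng = np.random.default_rng(seed)
    shapes = {
        "Wq": (d_model, d_model), "Wk": (d_model, d_model), "Wv": (d_model, d_model),
        "W1": (d_model, d_ff), "W2": (d_ff, d_model),
        "w_halt": (d_model,),
    }
    return {name: rng.normal(0.0, 0.1, size=shape) for name, shape in shapes.items()}

def shared_step(h, p):
    # Single-head self-attention over all positions at once (parallel in time),
    # followed by a position-wise feed-forward transition, both with residuals.
    q, k, v = h @ p["Wq"], h @ p["Wk"], h @ p["Wv"]
    attn = softmax(q @ k.T / np.sqrt(h.shape[-1]))
    h = h + attn @ v
    return h + np.maximum(h @ p["W1"], 0.0) @ p["W2"]

def run_universal_sketch(x, p, max_steps=8, threshold=0.9):
    # x: (seq_len, d_model). The SAME parameters p are applied at every step.
    h = x.copy()
    cum_halt = np.zeros(x.shape[0])             # accumulated halting score per position
    steps_used = np.zeros(x.shape[0], dtype=int)
    for _ in range(max_steps):
        active = cum_halt < threshold           # positions still being refined
        if not active.any():
            break
        halt = 1.0 / (1.0 + np.exp(-(h @ p["w_halt"])))  # sigmoid halting score
        new_h = shared_step(h, p)
        # Halted positions keep their state but remain visible to attention.
        h = np.where(active[:, None], new_h, h)
        cum_halt = cum_halt + halt * active
        steps_used += active
    return h, steps_used

if __name__ == "__main__":
    x = np.random.default_rng(1).normal(size=(5, 16))
    h, steps = run_universal_sketch(x, init_params(16, 32))
    print(h.shape, steps)
```

In this sketch, `steps_used` records how many refinement steps each position received, so different positions can receive different amounts of computation; `max_steps` caps the depth regardless of the halting scores.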

Findings

01. UTs outperform Transformers on algorithmic tasks

02. UTs achieve state-of-the-art on LAMBADA language modeling

03. UTs improve BLEU scores on WMT14 En-De translation

Abstract

Recurrent neural networks (RNNs) sequentially process data by updating their state with each new data point, and have long been the de facto choice for sequence modeling tasks. However, their inherently sequential computation makes them slow to train. Feed-forward and convolutional architectures have recently been shown to achieve superior results on some sequence modeling tasks such as machine translation, with the added advantage that they concurrently process all inputs in the sequence, leading to easy parallelization and faster training times. Despite these successes, however, popular feed-forward sequence models like the Transformer fail to generalize in many simple tasks that recurrent models handle with ease, e.g. copying strings or even simple logical inference when the string or formula lengths exceed those observed at training time. We propose the Universal Transformer (UT), a…

Code & Models

Models

🤗 fcxfcx/owlv2 · model · ♡ 1

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Attention Dropout · Universal Transformer · Byte Pair Encoding · Dense Connections · Label Smoothing