When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute

Tao Lei

arXiv:2102.12459 · cs.CL · September 16, 2021 · 1 citation

TL;DR

This paper introduces SRU++, an efficient sequence modeling architecture combining fast recurrence and minimal attention, achieving state-of-the-art results with significantly reduced training costs on language modeling benchmarks.

Contribution

SRU++ is a novel architecture that effectively merges fast recurrence with limited attention, reducing training costs while maintaining high modeling capacity.

Findings

1. Achieves better bits-per-character and perplexity than top-performing Transformer models on the Enwik8, Wiki-103 and Billion Word datasets.

2. Uses 3x-10x less training cost than those Transformer models.

3. Reaches near state-of-the-art performance with minimal attention.
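The findings are reported in bits-per-character (character-level data such as Enwik8) and perplexity (word-level data such as Wiki-103 and Billion Word). As a minimal sketch of how the two metrics relate to a model's average cross-entropy loss, assuming the loss is measured in nats per token, the conversion looks like this. The function names are hypothetical and are not taken from the paper or its repository.

```python
import math


def perplexity(mean_nll_nats: float) -> float:
    """Perplexity is the exponential of the mean negative log-likelihood (in nats)."""
    return math.exp(mean_nll_nats)


def bits_per_character(mean_nll_nats: float) -> float:
    """Convert a per-character loss from nats to bits by dividing by ln(2)."""
    return mean_nll_nats / math.log(2)


# Illustrative inputs only (not results from the paper):
# a model that guesses uniformly over 256 byte values has loss ln(256) nats.
uniform_byte_loss = math.log(256)
print("BPC of a uniform byte model:", bits_per_character(uniform_byte_loss))

# A word-level model with mean loss ln(50) nats has perplexity of about 50.
print("Perplexity:", perplexity(math.log(50)))
```

Lower is better for both metrics, so "better bits-per-character and perplexity" means a lower average loss on held-out text.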

Abstract

Large language models have become increasingly difficult to train because of the growing computation time and cost. In this work, we present SRU++, a highly-efficient architecture that combines fast recurrence and attention for sequence modeling. SRU++ exhibits strong modeling capacity and training efficiency. On standard language modeling tasks such as Enwik8, Wiki-103 and Billion Word datasets, our model obtains better bits-per-character and perplexity while using 3x-10x less training cost compared to top-performing Transformer models. For instance, our model achieves a state-of-the-art result on the Enwik8 dataset using 1.6 days of training on an 8-GPU machine. We further demonstrate that SRU++ requires minimal attention for near state-of-the-art performance. Our results suggest jointly leveraging fast recurrence with little attention as a promising direction for accelerating model…

Code & Models

Repositories

asappresearch/sru (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Linear Layer · SRU++ · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Dense Connections · Label Smoothing · Dropout · Attention Is All You Need · Layer Normalization