When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute
Tao Lei

TL;DR
This paper introduces SRU++, an efficient sequence modeling architecture that combines fast recurrence with minimal attention. It achieves state-of-the-art results on language modeling benchmarks at 3x-10x lower training cost than top-performing Transformer models.
Contribution
SRU++ is a novel architecture that effectively merges fast recurrence with limited attention, reducing training costs while maintaining high modeling capacity.
Findings
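Achieves better bits-per-character and perplexity than top-performing Transformer models on the Enwik8, Wiki-103, and Billion Word datasets.
Uses 3x-10x less training cost than those Transformer models; for example, the state-of-the-art Enwik8 result takes 1.6 days of training on an 8-GPU machine.
Reaches near state-of-the-art performance with minimal attention.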
Abstract
Large language models have become increasingly difficult to train because of the growing computation time and cost. In this work, we present SRU++, a highly-efficient architecture that combines fast recurrence and attention for sequence modeling. SRU++ exhibits strong modeling capacity and training efficiency. On standard language modeling tasks such as Enwik8, Wiki-103 and Billion Word datasets, our model obtains better bits-per-character and perplexity while using 3x-10x less training cost compared to top-performing Transformer models. For instance, our model achieves a state-of-the-art result on the Enwik8 dataset using 1.6 days of training on an 8-GPU machine. We further demonstrate that SRU++ requires minimal attention for near state-of-the-art performance. Our results suggest jointly leveraging fast recurrence with little attention as a promising direction for accelerating model…
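To make "fast recurrence combined with attention" concrete, here is a minimal NumPy sketch of one such layer. It follows the published SRU-style gating equations as best recalled. It is not the author's implementation, and it omits details such as layer normalization and multi-head attention. All names (`attention_projection`, `sru_recurrence`, `Wq`, `Wk`, `Wv`, `Wo`, `alpha`, `v_f`, `v_r`) and the tensor shapes are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_projection(X, Wq, Wk, Wv, Wo, alpha=1.0):
    """Causal single-head self-attention used in place of a plain linear map.

    X: (T, d) layer input. Returns U with shape (T, 3, d), i.e. three
    projected vectors per time step for the recurrence below.
    Hypothetical shapes: Wq (d, d'), Wk and Wv (d', d'), Wo (d', 3*d).
    """
    T = X.shape[0]
    Q = X @ Wq                       # down-project to a smaller dimension d'
    K = Q @ Wk
    V = Q @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    # Mask future positions so step t only attends to steps <= t.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[future] = -np.inf
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)
    out = (Q + alpha * (A @ V)) @ Wo              # residual, then up-project
    return out.reshape(T, 3, -1)

def sru_recurrence(U, X, v_f, v_r, b_f, b_r, c0=None):
    """Elementwise gated recurrence: no matrix multiply inside the loop."""
    T, _, d = U.shape
    c = np.zeros(d) if c0 is None else c0
    H = np.empty((T, d))
    for t in range(T):
        f = sigmoid(U[t, 0] + v_f * c + b_f)      # forget gate
        r = sigmoid(U[t, 1] + v_r * c + b_r)      # reset (highway) gate
        c_next = f * c + (1.0 - f) * U[t, 2]      # cell state update
        H[t] = r * c_next + (1.0 - r) * X[t]      # highway connection to input
        c = c_next
    return H, c
```

The point of the design is that all matrix multiplications happen once, in parallel over the whole sequence, inside `attention_projection`. The sequential loop only performs cheap elementwise operations, which is what makes the recurrence "fast".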
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Linear Layer · SRU++ · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Dense Connections · Label Smoothing · Dropout · Attention Is All You Need · Layer Normalization
