Simple Hardware-Efficient Long Convolutions for Sequence Modeling
Daniel Y. Fu, Elliot L. Epstein, Eric Nguyen, Armin W. Thomas, Michael Zhang, Tri Dao, Atri Rudra, Christopher Ré

TL;DR
This paper demonstrates that simple, smooth long convolutions can match state space models in sequence modeling tasks, and introduces FlashButterfly, an efficient algorithm that accelerates these convolutions and improves performance on long sequences.
Contribution
The paper introduces a simple method for long sequence modeling using smooth convolutions, and develops FlashButterfly, an IO-efficient algorithm that significantly speeds up training and achieves state-of-the-art results.
Findings
Smooth kernels recover SSM performance across tasks.
FlashButterfly speeds up convolutions by 2.2× and enables faster training on sequences of length 64K.
Extended FlashButterfly outperforms Transformers on WikiText103.
Abstract
State space models (SSMs) have high performance on long sequence modeling but require sophisticated initialization techniques and specialized implementations for high quality and runtime performance. We study whether a simple alternative can match SSMs in performance and efficiency: directly learning long convolutions over the sequence. We find that a key requirement to achieving high performance is keeping the convolution kernels smooth. We find that simple interventions, such as squashing the kernel weights, result in smooth kernels and recover SSM performance on a range of tasks including the long range arena, image classification, language modeling, and brain data modeling. Next, we develop FlashButterfly, an IO-aware algorithm to improve the runtime performance of long convolutions. FlashButterfly appeals to classic Butterfly decompositions of the convolution to reduce GPU memory…
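The idea of a directly learned long convolution with a squashed kernel can be illustrated with a minimal sketch. It is not the authors' implementation and not FlashButterfly: the soft-threshold form of `squash`, the threshold `lam`, and all function names are assumptions made for illustration, and the FFT is a plain recursive radix-2 routine written with the standard library only. The sketch checks that an FFT-based causal convolution with a kernel as long as the input agrees with the direct quadratic-time sum.

```python
import cmath
import random

def squash(kernel, lam):
    """Shrink each weight toward zero by lam (assumed soft-threshold form of 'squashing')."""
    return [max(abs(w) - lam, 0.0) * (1.0 if w >= 0 else -1.0) for w in kernel]

def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. The inverse is unscaled."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly step: combine the two half-size transforms with a twiddle factor.
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def long_conv_fft(u, kernel):
    """Causal convolution of u with a kernel of the same length, computed via FFT."""
    n = len(u)
    size = 1
    while size < 2 * n:  # zero-pad so the circular convolution equals the linear one
        size *= 2
    u_f = fft([complex(v) for v in u] + [0j] * (size - n))
    k_f = fft([complex(w) for w in kernel] + [0j] * (size - len(kernel)))
    y = fft([a * b for a, b in zip(u_f, k_f)], inverse=True)
    return [v.real / size for v in y[:n]]  # divide by size to finish the inverse

def long_conv_direct(u, kernel):
    """Reference O(N^2) causal convolution: y[i] = sum_{j<=i} kernel[j] * u[i-j]."""
    return [sum(kernel[j] * u[i - j] for j in range(i + 1)) for i in range(len(u))]

random.seed(0)
n = 64
u = [random.gauss(0.0, 1.0) for _ in range(n)]
raw_kernel = [random.gauss(0.0, 0.1) for _ in range(n)]
kernel = squash(raw_kernel, lam=0.05)  # lam is a hypothetical value

y_fft = long_conv_fft(u, kernel)
y_ref = long_conv_direct(u, kernel)
max_err = max(abs(a - b) for a, b in zip(y_fft, y_ref))
```

In a trained model the kernel weights would be learned parameters rather than random draws, and the squashing would be applied during training. Here the random kernel only serves to show that the FFT path and the direct sum compute the same output.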
Taxonomy
Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Machine Learning in Healthcare
Methods: Attention Is All You Need · Linear Layer · Dropout · Layer Normalization · Dense Connections · Multi-Head Attention · Absolute Position Encodings · Adam · Position-Wise Feed-Forward Layer · Softmax
