Simple Hardware-Efficient Long Convolutions for Sequence Modeling

Daniel Y. Fu; Elliot L. Epstein; Eric Nguyen; Armin W. Thomas; Michael Zhang; Tri Dao; Atri Rudra; Christopher Ré

arXiv:2302.06646 · cs.LG · February 15, 2023 · 6 cites

TL;DR

This paper demonstrates that simple, smooth long convolutions can match state space models in sequence modeling tasks, and introduces FlashButterfly, an efficient algorithm that accelerates these convolutions and improves performance on long sequences.

Contribution

The paper shows that directly learned long convolutions with smooth kernels are a simple method for long sequence modeling, and develops FlashButterfly, an IO-aware algorithm that speeds up these convolutions by 2.2× and achieves state-of-the-art results.

Findings

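01  Smooth kernels recover SSM performance on the long range arena, image classification, language modeling, and brain data modeling.

02  FlashButterfly speeds up long convolutions by 2.2× and makes training on sequences of length 64K faster.

03  An extension of FlashButterfly outperforms Transformers on WikiText103.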

Abstract

State space models (SSMs) have high performance on long sequence modeling but require sophisticated initialization techniques and specialized implementations for high quality and runtime performance. We study whether a simple alternative can match SSMs in performance and efficiency: directly learning long convolutions over the sequence. We find that a key requirement to achieving high performance is keeping the convolution kernels smooth. We find that simple interventions--such as squashing the kernel weights--result in smooth kernels and recover SSM performance on a range of tasks including the long range arena, image classification, language modeling, and brain data modeling. Next, we develop FlashButterfly, an IO-aware algorithm to improve the runtime performance of long convolutions. FlashButterfly appeals to classic Butterfly decompositions of the convolution to reduce GPU memory…
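
The two ideas in the abstract, squashing the kernel weights and computing a long convolution through a butterfly-structured transform, can be illustrated with a small sketch. This is not the authors' implementation and not FlashButterfly itself. It assumes that "squashing" means soft-thresholding each weight by a hypothetical parameter `lam`, and it uses a textbook recursive radix-2 FFT (whose combine step is the classic butterfly) in plain Python. All function names here are illustrative.

```python
import cmath

def squash(kernel, lam):
    """Assumed squash: soft-threshold each weight, shrinking its magnitude by lam."""
    return [(1 if w > 0 else -1) * max(abs(w) - lam, 0.0) for w in kernel]

def fft(x, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two.
    The inverse transform is unnormalized (caller divides by len(x))."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly step: combine one even and one odd coefficient.
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def long_conv_fft(u, kernel):
    """Causal convolution y[t] = sum_{s<=t} kernel[s] * u[t-s], computed via FFT."""
    n = len(u)
    size = 1
    while size < 2 * n:  # zero-pad so circular wrap-around cannot reach y[:n]
        size *= 2
    u_hat = fft([complex(v) for v in u] + [0j] * (size - n))
    k_hat = fft([complex(w) for w in kernel] + [0j] * (size - len(kernel)))
    y = fft([a * b for a, b in zip(u_hat, k_hat)], inverse=True)
    return [val.real / size for val in y[:n]]

def long_conv_direct(u, kernel):
    """Reference O(N^2) causal convolution, used only to check the FFT path."""
    return [
        sum(kernel[s] * u[t - s] for s in range(min(t + 1, len(kernel))))
        for t in range(len(u))
    ]

# Toy input and a kernel as long as the sequence (values are made up).
u = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 1.5]
raw_kernel = [0.9, -0.05, 0.4, 0.02, -0.3, 0.01, 0.2, -0.03]
kernel = squash(raw_kernel, lam=0.1)  # small weights become zero
y_fft = long_conv_fft(u, kernel)
y_direct = long_conv_direct(u, kernel)
```

In this sketch the FFT path and the direct path should agree up to floating-point error. The FFT path costs O(N log N) instead of O(N^2), which is why long convolutions are usually computed this way.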

Code & Models

Repositories

hazyresearch/safari (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Machine Learning in Healthcare

Methods: Attention Is All You Need · Linear Layer · Dropout · Layer Normalization · Dense Connections · Multi-Head Attention · Absolute Position Encodings · Adam · Position-Wise Feed-Forward Layer · Softmax