Pretraining Without Attention
Junxiong Wang, Jing Nathan Yan, Albert Gu, Alexander M. Rush

TL;DR
This paper introduces BiGS, an attention-free pretraining model that uses sequence routing based on state-space models (SSMs) and matches BERT's accuracy on GLUE.
Contribution
The paper presents BiGS (Bidirectional Gated SSM), a novel attention-free pretraining model that combines SSM-based sequence routing with multiplicative gating and matches BERT's accuracy on GLUE.
Findings
BiGS matches BERT's accuracy on GLUE.
BiGS can be extended to long-form pretraining of 4096 tokens without approximation.
The model exhibits different inductive biases than BERT.
Abstract
Transformers have been essential to pretraining success in NLP. While other architectures have been used, downstream accuracy is either significantly worse, or requires attention layers to match standard benchmarks such as GLUE. This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs). Our proposed model, Bidirectional Gated SSM (BiGS), combines SSM layers with a multiplicative gating architecture that has been effective in simplified sequence modeling architectures. The model learns static layers that do not consider pair-wise interactions. Even so, BiGS is able to match BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation. Analysis shows that while the models have similar average accuracy, the approach has different inductive biases than BERT in terms…
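As a rough illustration of the idea described in the abstract, the sketch below runs a linear state-space recurrence over the sequence in both directions and multiplies the result by a learned gate. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the diagonal parameterization, the sigmoid gate, summing the two directions, and the names ssm_scan, make_params, and bidirectional_gated_ssm are all hypothetical choices made for illustration.

```python
import numpy as np


def ssm_scan(u, a, b, c):
    """Diagonal linear state-space recurrence (illustrative assumption).

    u: (L, D) input sequence; a, b, c: (D, N) per-channel parameters.
    x_k = a * x_{k-1} + b * u_k,   y_k = sum over the state axis of c * x_k
    """
    x = np.zeros_like(a)                      # hidden state, shape (D, N)
    y = np.empty(u.shape, dtype=float)
    for k in range(u.shape[0]):
        x = a * x + b * u[k][:, None]         # static update, no token-pair scores
        y[k] = (c * x).sum(axis=-1)
    return y


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def make_params(d_model, d_state, seed=0):
    """Random parameters for the sketch (hypothetical initialization)."""
    rng = np.random.default_rng(seed)

    def ssm():
        a = rng.uniform(0.5, 0.95, size=(d_model, d_state))  # |a| < 1 keeps the state bounded
        b = rng.normal(size=(d_model, d_state))
        c = rng.normal(size=(d_model, d_state)) / d_state
        return a, b, c

    scale = 1.0 / np.sqrt(d_model)
    return {
        "fwd": ssm(),
        "bwd": ssm(),
        "W_gate": rng.normal(size=(d_model, d_model)) * scale,
        "W_out": rng.normal(size=(d_model, d_model)) * scale,
    }


def bidirectional_gated_ssm(u, p):
    """Forward and backward SSM passes combined with a multiplicative gate."""
    fwd = ssm_scan(u, *p["fwd"])              # left-to-right context
    bwd = ssm_scan(u[::-1], *p["bwd"])[::-1]  # right-to-left context
    gate = sigmoid(u @ p["W_gate"])           # per-position gate (assumed form)
    return (gate * (fwd + bwd)) @ p["W_out"]


if __name__ == "__main__":
    u = np.random.default_rng(1).normal(size=(6, 4))  # 6 tokens, 4 channels
    out = bidirectional_gated_ssm(u, make_params(4, 3))
    print(out.shape)
```

The recurrence parameters are the same at every position and never compare one token against another, which is the sense in which such a layer is static; the backward pass is what lets early positions see later tokens.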
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Semantic Web and Ontologies
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Attention Dropout · Residual Connection · Weight Decay · WordPiece · Linear Warmup With Linear Decay
