Pretraining Without Attention

Junxiong Wang; Jing Nathan Yan; Albert Gu; Alexander M. Rush

arXiv:2212.10544·cs.CL·May 10, 2023

TL;DR

This paper introduces BiGS, a pretraining model that replaces attention layers with sequence routing based on state-space models (SSMs) and achieves accuracy comparable to BERT on GLUE.

Contribution

The paper presents BiGS, an attention-free pretraining model that combines SSM-based sequence routing with multiplicative gating and matches BERT's pretraining accuracy on GLUE.

Findings

01

BiGS matches BERT's accuracy on GLUE.

02

BiGS can be extended to long-form pretraining on 4096-token sequences without approximation.

03

The model exhibits different inductive biases than BERT.

Abstract

Transformers have been essential to pretraining success in NLP. While other architectures have been used, downstream accuracy is either significantly worse, or requires attention layers to match standard benchmarks such as GLUE. This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs). Our proposed model, Bidirectional Gated SSM (BiGS), combines SSM layers with a multiplicative gating architecture that has been effective in simplified sequence modeling architectures. The model learns static layers that do not consider pair-wise interactions. Even so, BiGS is able to match BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation. Analysis shows that while the models have similar average accuracy, the approach has different inductive biases than BERT in terms…
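To make the idea of "static" sequence routing with multiplicative gating concrete, here is a toy sketch in plain Python. It is not the BiGS layer from the paper or the official repository: the scalar single-channel SSM, the choice to sum the forward and backward passes, the sigmoid gate, and all function and parameter names (`ssm_scan`, `bidirectional_ssm`, `gated_bidirectional_layer`, `w_gate`) are assumptions made for illustration. What it is meant to show is that the mixing weights between positions depend only on their offset (through powers of `a`), never on a comparison between pairs of tokens.

```python
import math


def ssm_scan(u, a, b, c):
    """Scalar linear SSM (illustrative): x_k = a*x_{k-1} + b*u_k, y_k = c*x_k."""
    x, ys = 0.0, []
    for u_k in u:
        x = a * x + b * u_k
        ys.append(c * x)
    return ys


def bidirectional_ssm(u, fwd_params, bwd_params):
    """Mix the sequence left-to-right and right-to-left, then sum (an assumption)."""
    forward = ssm_scan(u, *fwd_params)
    # Run the second SSM on the reversed sequence, then flip its output back.
    backward = ssm_scan(u[::-1], *bwd_params)[::-1]
    return [f + g for f, g in zip(forward, backward)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_bidirectional_layer(u, fwd_params, bwd_params, w_gate):
    """Multiply the routed sequence elementwise by a gate computed per position."""
    mixed = bidirectional_ssm(u, fwd_params, bwd_params)
    return [sigmoid(w_gate * u_k) * m_k for u_k, m_k in zip(u, mixed)]


if __name__ == "__main__":
    impulse = [1.0, 0.0, 0.0]
    params = (0.5, 1.0, 1.0)  # hypothetical (a, b, c) values
    print(ssm_scan(impulse, *params))
    print(gated_bidirectional_layer(impulse, params, params, w_gate=0.0))
```

Feeding an impulse through `ssm_scan` exposes the fixed kernel `c * a**j * b` that the layer applies at offset `j`; the gate then rescales each position using only that position's own input.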

Code & Models

Repositories

jxiw/bigs (JAX, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Semantic Web and Ontologies

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Attention Dropout · Residual Connection · Weight Decay · WordPiece · Linear Warmup With Linear Decay