SHAQ: Single Headed Attention with Quasi-Recurrence

Nashwin Bharwani; Warren Kushner; Sangeet Dandona; Ben Schreiber

arXiv:2108.08207·cs.CL·August 23, 2021

TL;DR

SHAQ is a simplified neural network architecture that combines single-headed attention with quasi-recurrence. It reaches accuracy comparable to SHA-RNN while training four times faster, making language modeling more accessible to researchers with limited compute.

Contribution

The paper presents SHAQ, a novel architecture that merges single-headed attention with quasi-recurrence, offering a faster training alternative to complex transformer models.
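
To make the two named ingredients concrete, here is a minimal pure-Python sketch of each: causal scaled dot-product attention with a single head, and the QRNN-style f-pooling recurrence. This is an illustration under stated assumptions, not the authors' implementation. The function names, the toy inputs, and the choice to feed the attention output into the pooling step are hypothetical; this summary does not say how SHAQ wires the components together, and a real quasi-recurrent layer would compute its candidates and gates with learned convolutions rather than take them as arguments.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(queries, keys, values):
    """Causal scaled dot-product attention with one head.

    Each argument is a list of per-position vectors (lists of floats).
    Position t attends only to positions 0..t.
    """
    d = len(keys[0])
    out = []
    for t, q in enumerate(queries):
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        out.append([
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ])
    return out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def f_pooling(candidate_logits, forget_logits):
    """QRNN-style f-pooling: h_t = f_t * h_{t-1} + (1 - f_t) * z_t.

    Assumption: the pre-activation candidates and gates are given.
    Here z_t = tanh(candidate) and f_t = sigmoid(forget), applied
    elementwise; only this cheap recurrence is sequential.
    """
    h = [0.0] * len(candidate_logits[0])
    states = []
    for z_row, f_row in zip(candidate_logits, forget_logits):
        h = [
            sigmoid(f) * h_prev + (1.0 - sigmoid(f)) * math.tanh(z)
            for z, f, h_prev in zip(z_row, f_row, h)
        ]
        states.append(h)
    return states


if __name__ == "__main__":
    # Toy sequence of three positions with two features each (made-up numbers).
    x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    attended = single_head_attention(x, x, x)
    # Hypothetical wiring: reuse the attention output as both candidate and gate logits.
    for state in f_pooling(attended, attended):
        print(state)
```

The point of the sketch is the cost profile: the attention step uses one head instead of many, and the recurrent step is an elementwise update with no matrix multiply inside the time loop, which is the general reason quasi-recurrent layers tend to train faster than LSTM-style layers.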

Findings

01 SHAQ achieves similar accuracy to SHA-RNN.

02 SHAQ trains four times faster than SHA-RNN.

03 The architecture reduces computational complexity while maintaining performance.

Abstract

Natural Language Processing research has recently been dominated by large scale transformer models. Although they achieve state of the art on many important language tasks, transformers often require expensive compute resources, and days spanning to weeks to train. This is feasible for researchers at big tech companies and leading research universities, but not for scrappy start-up founders, students, and independent researchers. Stephen Merity's SHA-RNN, a compact, hybrid attention-RNN model, is designed for consumer-grade modeling as it requires significantly fewer parameters and less training time to reach near state of the art results. We analyze Merity's model here through an exploratory model analysis over several units of the architecture considering both training time and overall quality in our assessment. Ultimately, we combine these findings into a new architecture which we…

Taxonomy

Topics

Topic Modeling · Advanced Neural Network Applications · Natural Language Processing Techniques

Methods

Attention Is All You Need · Sigmoid Activation · Tanh Activation · Softmax · Long Short-Term Memory · Dropout · Dense Connections · Layer Normalization · Single-Headed Attention