Pay Attention when Required

Swetha Mandava; Szymon Migacz; Alex Fit-Florea

arXiv:2009.04534 · cs.LG · May 18, 2021 · 5 cites

TL;DR

This paper introduces the PAR Transformer, which reduces computational cost by replacing a significant portion of self-attention blocks with feed-forward blocks, maintaining performance across multiple language modeling benchmarks.

Contribution

The paper proposes the PAR Transformer architecture that optimizes block ordering and reduces compute time by replacing self-attention with feed-forward blocks, without sacrificing accuracy.

Findings

01  35% lower compute time than Transformer-XL

02  Retains perplexity on the WikiText-103 benchmark

03  Validated on text8, enwiki8, and BERT models

Abstract

Transformer-based models consist of interleaved feed-forward blocks, which capture content meaning, and relatively more expensive self-attention blocks, which capture context meaning. In this paper, we explored trade-offs and ordering of the blocks to improve upon the current Transformer architecture and proposed the PAR Transformer. It needs 35% less compute time than Transformer-XL, achieved by replacing ~63% of the self-attention blocks with feed-forward blocks, while retaining perplexity on the WikiText-103 language modelling benchmark. We further validated our results on the text8 and enwiki8 datasets, as well as on the BERT model.
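The block-replacement idea in the abstract can be illustrated with a small Python sketch that treats an architecture as a string of self-attention (`s`) and feed-forward (`f`) blocks. Everything concrete here is an assumption for illustration, not taken from the paper: the 16-layer interleaved baseline, the placement of the remaining attention blocks in the candidate pattern, the 3:1 per-block cost ratio, and the helper names `count_blocks`, `attention_replaced`, and `relative_cost`.

```python
def count_blocks(pattern: str) -> dict:
    """Count self-attention ('s') and feed-forward ('f') blocks in a pattern."""
    unknown = set(pattern) - {"s", "f"}
    if unknown:
        raise ValueError(f"unexpected block types: {sorted(unknown)}")
    return {"s": pattern.count("s"), "f": pattern.count("f")}


def attention_replaced(baseline: str, candidate: str) -> float:
    """Fraction of the baseline's self-attention blocks that the candidate drops."""
    if len(baseline) != len(candidate):
        raise ValueError("patterns must have the same total number of blocks")
    base, cand = count_blocks(baseline), count_blocks(candidate)
    return 1 - cand["s"] / base["s"]


def relative_cost(baseline: str, candidate: str,
                  attn_cost: float, ff_cost: float) -> float:
    """Candidate cost divided by baseline cost under a per-block cost model."""
    def total(pattern: str) -> float:
        counts = count_blocks(pattern)
        return counts["s"] * attn_cost + counts["f"] * ff_cost
    return total(candidate) / total(baseline)


# Hypothetical patterns: an interleaved baseline with 16 attention and 16
# feed-forward blocks, and a candidate that keeps only 6 attention blocks.
BASELINE = "sf" * 16
CANDIDATE = "sf" * 6 + "f" * 20

# Hypothetical cost ratio: one attention block costs 3x one feed-forward block.
replaced = attention_replaced(BASELINE, CANDIDATE)
cost = relative_cost(BASELINE, CANDIDATE, attn_cost=3.0, ff_cost=1.0)

print(f"self-attention blocks replaced: {replaced:.1%}")
print(f"relative cost under the assumed cost model: {cost:.4f}")
```

Under these assumptions, keeping 6 of 16 attention blocks corresponds to replacing 62.5% of them, which is in line with the "~63%" figure in the abstract. The cost ratio printed by the sketch depends entirely on the assumed per-block costs and is not a reproduction of the reported 35% reduction.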

Taxonomy

Topics: Topic Modeling

Methods: Linear Layer · PAR Transformer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Layer Normalization · Weight Decay · Dropout · Adaptive Softmax