Pay Attention when Required
Swetha Mandava, Szymon Migacz, Alex Fit-Florea

TL;DR
This paper introduces the PAR Transformer, which reduces computational cost by replacing a significant portion of self-attention blocks with feed-forward blocks, maintaining performance across multiple language modeling benchmarks.
Contribution
The paper proposes the PAR Transformer, an architecture that reorders Transformer blocks and replaces a large share of the self-attention blocks with cheaper feed-forward blocks, reducing compute time without sacrificing accuracy.
Findings
35% lower compute time compared to Transformer-XL
Retains Transformer-XL's perplexity on the WikiText-103 benchmark
Validated on the text8 and enwiki8 datasets and on the BERT model
Abstract
Transformer-based models consist of interleaved feed-forward blocks - that capture content meaning, and relatively more expensive self-attention blocks - that capture context meaning. In this paper, we explored trade-offs and ordering of the blocks to improve upon the current Transformer architecture and proposed PAR Transformer. It needs 35% lower compute time than Transformer-XL achieved by replacing ~63% of the self-attention blocks with feed-forward blocks, and retains the perplexity on WikiText-103 language modelling benchmark. We further validated our results on text8 and enwiki8 datasets, as well as on the BERT model.
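As a rough illustration of the block replacement described in the abstract, the sketch below builds an interleaved stack of self-attention (`s`) and feed-forward (`f`) blocks and converts most of the attention blocks to feed-forward ones. The 32-block baseline, the choice to keep 6 attention blocks, and the decision to keep the earliest ones are assumptions made for this example, not the configuration reported in the paper; the helper names are hypothetical.

```python
# Illustrative sketch only: block counts and placement are assumptions,
# not the configuration reported in the paper.

def interleaved(n_pairs):
    """Baseline pattern: alternating self-attention ('s') and feed-forward ('f') blocks."""
    return "sf" * n_pairs

def replace_attention(pattern, keep):
    """Keep the first `keep` self-attention blocks; turn the rest into feed-forward blocks."""
    out, seen = [], 0
    for block in pattern:
        if block == "s":
            seen += 1
            out.append("s" if seen <= keep else "f")
        else:
            out.append(block)
    return "".join(out)

def replaced_fraction(baseline, variant):
    """Fraction of the baseline's self-attention blocks that were replaced."""
    return 1 - variant.count("s") / baseline.count("s")

baseline = interleaved(16)                      # hypothetical 32-block stack
variant = replace_attention(baseline, keep=6)   # hypothetical: keep 6 of 16 attention blocks
print(variant)
print(f"{replaced_fraction(baseline, variant):.1%}")  # 62.5%
```

Under these assumed counts, replacing 10 of 16 attention blocks gives 62.5%, which is in line with the "~63%" figure quoted in the abstract. The total number of blocks is unchanged; only the mix of block types shifts toward the cheaper feed-forward kind.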
Taxonomy
Topics: Topic Modeling
Methods: Linear Layer · PAR Transformer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Layer Normalization · Weight Decay · Dropout · Adaptive Softmax
