Fine-Tuning Pre-trained Transformers into Decaying Fast Weights
Huanru Henry Mao

TL;DR
This paper introduces decaying fast weights, a simple recurrent substitute for causal self-attention in autoregressive Transformers. It retains 99% of attention's performance for GPT-2, avoids the O(T) per-token generation cost of self-attention, and outperforms prior kernel-based approaches.
Contribution
The paper proposes a straightforward decaying fast weights approach as an alternative to complex kernel-based methods for efficient Transformer inference.
Findings
Decaying fast weights run efficiently on GPU.
They retain 99% of attention's performance for GPT-2.
They achieve competitive results on WikiText-103 against more complex attention substitutes.
Abstract
Autoregressive Transformers are strong language models but incur O(T) complexity during per-token generation due to the self-attention mechanism. Recent work proposes kernel-based methods to approximate causal self-attention by replacing it with recurrent formulations with various update rules and feature maps to achieve O(1) time and memory complexity. We explore these approaches and find that they are unnecessarily complex, and propose a simple alternative - decaying fast weights - that runs fast on GPU, outperforms prior methods, and retains 99% of attention's performance for GPT-2. We also show competitive performance on WikiText-103 against more complex attention substitutes.
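To make the complexity argument in the abstract concrete, here is a minimal sketch of a recurrent fast-weight layer with decay, written in plain Python. It is an illustration under stated assumptions, not the paper's exact update rule: the single scalar `decay`, the absence of a feature map or learned gate, and the function names `recurrent_decay` and `explicit_decay` are all hypothetical choices made for this example. The recurrent version keeps one fixed-size state matrix, so each generated token costs the same regardless of position, while the explicit version revisits every earlier token and so costs O(T) per token.

```python
def recurrent_decay(queries, keys, values, decay):
    """Per-token cost is independent of sequence position.

    Assumed update (illustrative, not taken from the paper):
        state_t = decay * state_{t-1} + outer(k_t, v_t)
        out_t   = q_t @ state_t
    """
    d_k, d_v = len(keys[0]), len(values[0])
    # Fixed-size memory: a d_k x d_v matrix, however long the sequence is.
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # Decay the old memory and write the new key-value outer product.
        state = [
            [decay * state[i][j] + k[i] * v[j] for j in range(d_v)]
            for i in range(d_k)
        ]
        # Read from the memory with the current query.
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def explicit_decay(queries, keys, values, decay):
    """Same quantity computed the slow way: token t looks at all tokens i <= t."""
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * len(values[0])
        for i in range(t + 1):
            # Older tokens are down-weighted by decay ** (t - i).
            weight = decay ** (t - i) * sum(a * b for a, b in zip(q, keys[i]))
            out = [o + weight * vj for o, vj in zip(out, values[i])]
        outputs.append(out)
    return outputs


# Toy inputs (made up for the example): 3 tokens, d_k = 2, d_v = 2.
queries = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
keys = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
values = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]

fast = recurrent_decay(queries, keys, values, decay=0.5)
slow = explicit_decay(queries, keys, values, decay=0.5)
```

Both functions produce the same outputs up to floating-point rounding; the difference is that `recurrent_decay` never stores or revisits past tokens, which is the property that makes constant-time, constant-memory generation possible.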
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Discriminative Fine-Tuning · Layer Normalization · Cosine Annealing · Residual Connection · Dropout · Weight Decay
