Fine-Tuning Pre-trained Transformers into Decaying Fast Weights

Huanru Henry Mao

arXiv:2210.04243·cs.LG·October 11, 2022

TL;DR

This paper introduces decaying fast weights, a simple recurrent substitute for causal self-attention in autoregressive Transformers. It needs O(1) time and memory per generated token instead of O(T), retains 99% of attention's performance for GPT-2, and outperforms prior kernel-based approaches.

Contribution

The paper proposes a straightforward decaying fast weights approach as an alternative to complex kernel-based methods for efficient Transformer inference.

Findings

01. Decaying fast weights run fast on GPU.

02. They retain 99% of attention's performance for GPT-2.

03. They achieve competitive results on WikiText-103 against more complex attention substitutes.

Abstract

Autoregressive Transformers are strong language models but incur O(T) complexity during per-token generation due to the self-attention mechanism. Recent work proposes kernel-based methods to approximate causal self-attention by replacing it with recurrent formulations with various update rules and feature maps to achieve O(1) time and memory complexity. We explore these approaches and find that they are unnecessarily complex, and propose a simple alternative - decaying fast weights - that runs fast on GPU, outperforms prior methods, and retains 99% of attention's performance for GPT-2. We also show competitive performance on WikiText-103 against more complex attention substitutes.
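
To illustrate why a recurrent formulation gives O(1) time and memory per generated token, here is a minimal sketch of one possible decaying fast-weight step. The update rule (a per-dimension decay gate blending the previous state with the outer product of the current key and value) is an assumption chosen for illustration and may differ from the paper's exact rule. The function names and the plain-list representation are hypothetical.

```python
def decaying_fast_weight_step(state, key, value, query, decay):
    """Run one generation step with a fixed-size state.

    state: d_k x d_v matrix stored as a list of rows, updated in place
    key, query, decay: lists of length d_k (decay entries in [0, 1])
    value: list of length d_v
    Assumed update (illustrative only):
        S[i][j] = decay[i] * S[i][j] + (1 - decay[i]) * key[i] * value[j]
    """
    d_k, d_v = len(key), len(value)
    for i in range(d_k):
        for j in range(d_v):
            state[i][j] = decay[i] * state[i][j] + (1.0 - decay[i]) * key[i] * value[j]
    # Read out with the query: output[j] = sum_i query[i] * S[i][j]
    return [sum(query[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]


def run_sequence(keys, values, queries, decays):
    """Process a sequence token by token, keeping only the d_k x d_v state."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for k, v, q, g in zip(keys, values, queries, decays):
        outputs.append(decaying_fast_weight_step(state, k, v, q, g))
    return outputs


if __name__ == "__main__":
    outs = run_sequence(
        keys=[[1.0, 0.0], [0.0, 1.0]],
        values=[[2.0, 4.0], [6.0, 8.0]],
        queries=[[1.0, 1.0], [1.0, 1.0]],
        decays=[[0.5, 0.5], [0.5, 0.5]],
    )
    print(outs)
```

The cost of each step depends only on the state size (d_k times d_v), not on how many tokens came before. Causal self-attention, by contrast, must revisit all T previous keys and values at every step.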

Code & Models

Repositories

jenni-ai/t2fw (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Discriminative Fine-Tuning · Layer Normalization · Cosine Annealing · Residual Connection · Dropout · Weight Decay