Breaking the Attention Bottleneck
Kalle Hilsenbek

TL;DR
This paper introduces a novel attention replacement method for transformers that reduces complexity and improves performance, demonstrated by lower loss in a nanoGPT setting with a smaller model.
Contribution
It proposes a generative function as an alternative to traditional attention, addressing the quadratic complexity bottleneck in transformer models.
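To make the quadratic bottleneck concrete, the sketch below counts how many pairwise scores one head of full self-attention computes for a sequence, next to the number of comparisons needed if each token is compared only with its predecessor, as the abstract describes. Both function names are hypothetical, and the counts illustrate scaling only; they are not measurements from the paper.

```python
def attention_score_count(seq_len: int) -> int:
    """Query-key scores computed by one head of full self-attention (n * n)."""
    return seq_len * seq_len


def previous_token_comparison_count(seq_len: int) -> int:
    """Comparisons needed when each token is compared only with its predecessor."""
    return max(seq_len - 1, 0)


if __name__ == "__main__":
    for n in (256, 1024, 4096):
        # Quadratic growth on the left, linear growth on the right.
        print(n, attention_score_count(n), previous_token_comparison_count(n))
```

Quadrupling the sequence length multiplies the first count by sixteen but the second by roughly four.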
Findings
In the nanoGPT test setting, the attention replacement reaches a lower loss with a smaller model.
Incorporating an average context vector reduces the loss further.
The approach is publicly available under the GNU AGPL v3 license.
Abstract
Attention-based transformers have become the standard architecture in many deep learning fields, primarily due to their ability to model long-range dependencies and handle variable-length input sequences. However, the attention mechanism, with its quadratic complexity, is a significant bottleneck in the transformer architecture. In the decoder, attention is only uni-directional, and in over-parametrized decoder-only models it converges to a static pattern. I address this issue by developing a generative function as a replacement for attention or activation. It retains the auto-regressive character by comparing each token with the previous one. In my test setting with nanoGPT, this yields a lower loss with a smaller model. The loss drops further when an average context vector is incorporated. This concept of attention replacement is distributed under the GNU AGPL v3 license at…
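The abstract does not spell out the generative function, so the following is only a minimal sketch of the two ingredients it names: comparing each token with the previous one, and an average context vector. Both functions are hypothetical stand-ins, not the author's method. In particular, the elementwise product in `previous_token_mix` and the causal running mean in `causal_average_context` are assumptions chosen because neither lets a position see later tokens.

```python
from typing import List

Vector = List[float]


def causal_average_context(tokens: List[Vector]) -> List[Vector]:
    """Running mean of token vectors: position t averages tokens 0..t only."""
    if not tokens:
        return []
    running_sum = [0.0] * len(tokens[0])
    contexts = []
    for t, tok in enumerate(tokens, start=1):
        running_sum = [s + x for s, x in zip(running_sum, tok)]
        contexts.append([s / t for s in running_sum])
    return contexts


def previous_token_mix(tokens: List[Vector]) -> List[Vector]:
    """Assumed comparison: elementwise product of each token with its predecessor.

    The first token has no predecessor, so it is paired with a zero vector.
    """
    outputs = []
    prev = [0.0] * len(tokens[0]) if tokens else []
    for tok in tokens:
        outputs.append([x * p for x, p in zip(tok, prev)])
        prev = tok
    return outputs


if __name__ == "__main__":
    toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(causal_average_context(toy))
    print(previous_token_mix(toy))
```

Both functions make a single pass over the sequence, so their cost grows linearly with sequence length rather than quadratically.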
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
