Breaking the Attention Bottleneck

Kalle Hilsenbek

arXiv:2406.10906·cs.LG·June 18, 2024

TL;DR

This paper replaces the attention mechanism in transformers with a generative function that avoids attention's quadratic complexity. In the author's nanoGPT test setting, the replacement reaches a lower loss with a smaller model.

Contribution

The paper proposes a generative function as a replacement for attention or activation, targeting the quadratic complexity bottleneck in transformer models while keeping the auto-regressive character by comparing each token with the previous one.
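
To make the quadratic bottleneck concrete, the sketch below counts pairwise interactions for a sequence of n tokens. Full self-attention scores every query against every key, a causal mask keeps roughly half of those scores, and a scheme that pairs each token only with its predecessor touches n - 1 pairs. The function names are illustrative and are not taken from the paper or its repository.

```python
def attention_pairs(n: int) -> int:
    """Query-key scores in unmasked self-attention over n tokens."""
    return n * n


def causal_attention_pairs(n: int) -> int:
    """Scores kept under a causal mask: token i attends to tokens 0..i."""
    return n * (n + 1) // 2


def adjacent_pairs(n: int) -> int:
    """Comparisons when each token is paired only with the previous one."""
    return max(n - 1, 0)


if __name__ == "__main__":
    # Print the three counts side by side for growing sequence lengths.
    for n in (128, 1024, 8192):
        print(n, attention_pairs(n), causal_attention_pairs(n), adjacent_pairs(n))
```

The counts describe only the number of token pairs considered; they say nothing about the per-pair cost of the paper's actual function.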

Findings

01

In the nanoGPT test setting, the attention replacement yields a lower loss with a smaller model.

02

Incorporating an average context vector further reduces loss.

03

The approach is distributed under the GNU AGPL v3 license.

Abstract

Attention-based transformers have become the standard architecture in many deep learning fields, primarily due to their ability to model long-range dependencies and handle variable-length input sequences. However, the attention mechanism with its quadratic complexity is a significant bottleneck in the transformer architecture. This algorithm is only uni-directional in the decoder and converges to a static pattern in over-parametrized decoder-only models. I address this issue by developing a generative function as attention or activation replacement. It still has the auto-regressive character by comparing each token with the previous one. In my test setting with nanoGPT this yields a smaller loss while having a smaller model. The loss further drops by incorporating an average context vector. This concept of attention replacement is distributed under the GNU AGPL v3 license at…
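
The abstract names two ingredients: a comparison of each token with the previous one, and an average context vector. The following is a minimal sketch of what such causal operations can look like on plain Python lists. The specific choices here (an element-wise difference as the comparison, a running mean over the prefix as the context vector) and the names `previous_token_diff` and `causal_mean` are assumptions for illustration only; they are not the paper's generative function, which is defined in the linked repository.

```python
from typing import List

Vector = List[float]


def previous_token_diff(tokens: List[Vector]) -> List[Vector]:
    """Assumed comparison: x_t - x_{t-1}, with a zero vector before the first token."""
    out: List[Vector] = []
    prev: Vector = [0.0] * len(tokens[0]) if tokens else []
    for x in tokens:
        out.append([a - b for a, b in zip(x, prev)])
        prev = x
    return out


def causal_mean(tokens: List[Vector]) -> List[Vector]:
    """Running mean over positions 0..t, so position t never sees later tokens."""
    out: List[Vector] = []
    running: Vector = [0.0] * len(tokens[0]) if tokens else []
    for t, x in enumerate(tokens, start=1):
        running = [r + a for r, a in zip(running, x)]
        out.append([r / t for r in running])
    return out


if __name__ == "__main__":
    seq = [[1.0, 2.0], [3.0, 6.0], [5.0, 1.0]]
    print(previous_token_diff(seq))
    print(causal_mean(seq))
```

Both functions make a single pass over the sequence, so their cost grows linearly with the number of tokens, and neither looks at positions after the current one.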

Code & Models

Repositories

https://gitlab.com/bachstelze/causal_generation
PyTorch · Official

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning