TransformerFAM: Feedback attention is working memory

Dongseong Hwang; Weiran Wang; Zhuoyuan Huo; Khe Chai Sim; Pedro Moreno Mengibar

arXiv:2404.09173·cs.LG·May 8, 2024·2 cites

TL;DR

TransformerFAM adds a feedback loop that gives Transformers a working memory, letting them process indefinitely long sequences without extra weights and significantly improving performance on long-context tasks.

Contribution

The paper presents Feedback Attention Memory (FAM), a Transformer architecture with a built-in feedback loop that allows sequences of unlimited length to be processed with no additional parameters.

Findings

01 Improved performance on long-context tasks across multiple model sizes.

02 Enables processing of sequences of unlimited length.

03 No additional weights needed for integration.

Abstract

While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel Transformer architecture that leverages a feedback loop to enable the network to attend to its own latent representations. This design fosters the emergence of working memory within the Transformer, allowing it to process indefinitely long sequences. TransformerFAM requires no additional weights, enabling seamless integration with pre-trained models. Our experiments show that TransformerFAM significantly improves Transformer performance on long-context tasks across various model sizes (1B, 8B, and 24B). These results showcase the potential to empower Large Language Models (LLMs) to process sequences of unlimited length.
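The feedback idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single attention head, no learned projections, no causal masking, and block-by-block processing, and the names `attend`, `fam_layer`, `block_size`, and `fam_init` are hypothetical. It only shows how a fixed-size memory can attend to each block and to its own previous state, so that information is carried forward without adding weights.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Assumption: single head, no learned projections, no masking.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return out


def fam_layer(tokens, block_size, fam_init):
    """Process `tokens` block by block while carrying a fixed-size memory.

    Hypothetical helper: `fam` plays the role of the feedback memory.
    """
    fam = fam_init
    outputs = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        # The block sees itself plus the memory from the previous step.
        context = fam + block
        outputs.extend(attend(block, context, context))
        # Feedback: the memory attends to the block and to its own previous
        # state, producing the memory used for the next block.
        fam = attend(fam, context, context)
    return outputs, fam


tokens = [[float(i), 1.0] for i in range(10)]
fam0 = [[0.0, 0.0], [0.0, 0.0]]
outputs, fam = fam_layer(tokens, block_size=4, fam_init=fam0)
print(len(outputs), len(fam))
```

In this sketch the memory keeps the same shape however long the input is, so the per-block attention cost does not grow with sequence length.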

Videos

TransformerFAM: Feedback attention is working memory · YouTube

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis

Methods: Attention Is All You Need · Dropout · Adam · Position-Wise Feed-Forward Layer · Linear Layer · Layer Normalization · Byte Pair Encoding · Absolute Position Encodings · Dense Connections · Label Smoothing