TransformerFAM: Feedback attention is working memory
Dongseong Hwang, Weiran Wang, Zhuoyuan Huo, Khe Chai Sim, Pedro Moreno Mengibar

TL;DR
TransformerFAM introduces a feedback loop mechanism that creates working memory within Transformers, enabling processing of infinitely long sequences without extra weights, significantly enhancing long-context task performance.
Contribution
It presents a novel Feedback Attention Memory (FAM) architecture that integrates feedback loops into Transformers, allowing for unlimited sequence processing with no additional parameters.
Findings
Improved performance on long-context tasks across multiple model sizes.
Enables processing of sequences of unlimited length.
No additional weights needed for integration.
Abstract
While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel Transformer architecture that leverages a feedback loop to enable the network to attend to its own latent representations. This design fosters the emergence of working memory within the Transformer, allowing it to process indefinitely long sequences. TransformerFAM requires no additional weights, enabling seamless integration with pre-trained models. Our experiments show that TransformerFAM significantly improves Transformer performance on long-context tasks across various model sizes (1B, 8B, and 24B). These results showcase the potential to empower Large Language Models (LLMs) to process sequences of unlimited length.
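The feedback loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a single attention head, identity query/key/value projections, no causal masking inside a block, and a single layer. The names `attend`, `fam_forward`, `block_size`, and `init_fam` are hypothetical and chosen only for this illustration. The sketch shows the general idea: the input is processed block by block, each block attends to itself plus a small fixed-size memory, and the memory is then updated by attending to the current block and its own previous state, reusing the same attention function so that no new weights appear.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Scaled dot-product attention.

    Assumption: identity Q/K/V projections, so each context vector serves
    as both key and value. A real Transformer layer uses learned projections.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)]
        )
    return outputs


def fam_forward(tokens, block_size, fam):
    """Process `tokens` block by block while carrying a fixed-size memory.

    Each attention call sees at most block_size + len(fam) vectors, so the
    cost per block stays constant regardless of total sequence length.
    """
    outputs = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        # Block queries attend to the current block and the fed-back memory.
        outputs.extend(attend(block, block + fam))
        # Memory queries attend to the current block and the previous memory,
        # reusing the same attention function (no additional parameters).
        fam = attend(fam, block + fam)
    return outputs, fam


# Toy usage: 10 two-dimensional "token" vectors, blocks of 4, memory of 2 slots.
tokens = [[float(i % 3), 1.0] for i in range(10)]
init_fam = [[0.0, 0.0], [0.0, 0.0]]
outputs, fam = fam_forward(tokens, block_size=4, fam=init_fam)
```

In this sketch the memory has the same shape before and after every block, which is what allows the loop to run over an input of any length with bounded state.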
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
Methods: Attention Is All You Need · Dropout · Adam · Position-Wise Feed-Forward Layer · Linear Layer · Layer Normalization · Byte Pair Encoding · Absolute Position Encodings · Dense Connections · Label Smoothing
