Enhancing Transformers Through Conditioned Embedded Tokens
Hemanth Saratchandran, Simon Lucey

TL;DR
This paper identifies ill-conditioning in transformer attention mechanisms and introduces conditioned embedded tokens to improve training stability and efficiency across multiple tasks and architectures.
Contribution
It provides a theoretical framework linking attention conditioning to embedded token data and proposes a novel method to enhance transformer training stability.
Findings
Improved training stability in transformers.
Consistent performance gains across vision and NLP tasks.
Reduced ill-conditioning in attention mechanisms.
Abstract
Transformers have transformed modern machine learning, driving breakthroughs in computer vision, natural language processing, and robotics. At the core of their success lies the attention mechanism, which enables the modeling of global dependencies among input tokens. However, we reveal that the attention block in transformers suffers from inherent ill-conditioning, which hampers gradient-based optimization and leads to inefficient training. To address this, we develop a theoretical framework that establishes a direct relationship between the conditioning of the attention block and that of the embedded tokenized data. Building on this insight, we introduce conditioned embedded tokens, a method that systematically modifies the embedded tokens to improve the conditioning of the attention mechanism. Our analysis demonstrates that this approach significantly mitigates ill-conditioning,…
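To make the notion of conditioning in the abstract concrete, here is a minimal NumPy sketch that measures the condition number of an embedded-token matrix (the ratio of its largest to smallest singular value) and then lowers it. The helper `recondition_tokens`, the `max_cond` parameter, and the singular-value flooring it uses are illustrative assumptions, not the paper's actual construction, which this page does not describe.

```python
import numpy as np


def condition_number(x):
    """Ratio of the largest to the smallest singular value of x."""
    s = np.linalg.svd(x, compute_uv=False)  # singular values, descending
    return s[0] / s[-1]


def recondition_tokens(x, max_cond=10.0):
    """Hypothetical helper (not the paper's method): raise small singular
    values so the condition number of the result is at most max_cond."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    floor = s[0] / max_cond          # smallest singular value we allow
    s_new = np.maximum(s, floor)     # lift anything below the floor
    return (u * s_new) @ vt          # rebuild the token matrix


# Synthetic, deliberately ill-conditioned tokens: 16 tokens of dimension 8
# whose feature columns span four orders of magnitude.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8)) * np.logspace(0, -4, 8)

tokens_conditioned = recondition_tokens(tokens, max_cond=10.0)
cond_before = condition_number(tokens)
cond_after = condition_number(tokens_conditioned)
print(f"condition number before: {cond_before:.1f}, after: {cond_after:.1f}")
```

Flooring the singular values leaves the dominant directions of the token matrix untouched and changes only the weakest ones, which is one simple way to see how a modification of the tokens can bound their condition number.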
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
Methods: Softmax · Attention Is All You Need
