Enhancing Transformers Through Conditioned Embedded Tokens

Hemanth Saratchandran; Simon Lucey

arXiv:2505.12789 · cs.CV · October 7, 2025

Open Access

TL;DR

This paper identifies ill-conditioning in transformer attention mechanisms and introduces conditioned embedded tokens to improve training stability and efficiency across multiple tasks and architectures.

Contribution

The paper provides a theoretical framework linking the conditioning of the attention block to that of the embedded token data, and proposes a method for improving transformer training stability.

Findings

01. Improved training stability in transformers.
02. Consistent performance gains across vision and NLP tasks.
03. Reduced ill-conditioning in attention mechanisms.

Abstract

Transformers have transformed modern machine learning, driving breakthroughs in computer vision, natural language processing, and robotics. At the core of their success lies the attention mechanism, which enables the modeling of global dependencies among input tokens. However, we reveal that the attention block in transformers suffers from inherent ill-conditioning, which hampers gradient-based optimization and leads to inefficient training. To address this, we develop a theoretical framework that establishes a direct relationship between the conditioning of the attention block and that of the embedded tokenized data. Building on this insight, we introduce conditioned embedded tokens, a method that systematically modifies the embedded tokens to improve the conditioning of the attention mechanism. Our analysis demonstrates that this approach significantly mitigates ill-conditioning,…
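To make the notion of conditioning concrete, the following is a minimal NumPy sketch, not the authors' method. The abstract excerpt does not specify how the embedded tokens are modified, so the function `flatten_spectrum` and its `floor_ratio` parameter are hypothetical stand-ins chosen only for illustration, as are the synthetic token matrix and the random query projection `w_q`. The sketch measures the 2-norm condition number (largest singular value divided by the smallest) of a token matrix and of its query projection, before and after a modification that lifts small singular values.

```python
import numpy as np


def condition_number(a):
    """2-norm condition number: largest singular value over the smallest."""
    s = np.linalg.svd(a, compute_uv=False)  # singular values, descending
    return s[0] / s[-1]


def flatten_spectrum(x, floor_ratio=0.5):
    """Hypothetical stand-in for a token modification (NOT the paper's method).

    Raises every singular value of x to at least floor_ratio * s_max,
    which bounds the condition number of the result by 1 / floor_ratio.
    """
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    s_new = np.maximum(s, floor_ratio * s[0])
    return (u * s_new) @ vt  # scale the columns of u, then recombine


rng = np.random.default_rng(0)
n_tokens, d_model = 16, 8

# Synthetic embedded tokens whose feature columns span three orders of
# magnitude, so the matrix is badly conditioned by construction.
x = rng.standard_normal((n_tokens, d_model)) * np.logspace(0, -3, d_model)

# Random query projection, standing in for a learned weight matrix.
w_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

x_mod = flatten_spectrum(x)

print("cond(X)         :", condition_number(x))
print("cond(X W_Q)     :", condition_number(x @ w_q))
print("cond(X_mod)     :", condition_number(x_mod))
print("cond(X_mod W_Q) :", condition_number(x_mod @ w_q))
```

The reason the token matrix matters for the projection is a standard bound: when X has full column rank and W_Q is square and invertible, cond(X W_Q) is at most cond(X) times cond(W_Q). Lowering cond(X) therefore lowers the worst-case conditioning of the projected queries in this toy setting. How this carries over to the full attention block is the subject of the paper's analysis and is not reproduced here.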

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis

Methods: Softmax · Attention Is All You Need