Gradient Flow Polarizes Softmax Outputs towards Low-Entropy Solutions
Aditya Varre, Mark Rofin, Nicolas Flammarion

TL;DR
This paper analyzes the gradient flow dynamics of softmax-based models, revealing an inherent tendency towards low-entropy solutions, which explains certain empirical behaviors in transformer training.
Contribution
It provides a theoretical analysis of gradient flow in softmax models, showing the universal low-entropy bias across different objectives and linking it to transformer training phenomena.
Findings
Gradient flow drives solutions towards low-entropy outputs.
The low-entropy polarizing effect holds across objectives, including logistic and square loss.
Theoretical insights explain phenomena like attention sinks.
Abstract
Understanding the intricate non-convex training dynamics of softmax-based models is crucial for explaining the empirical success of transformers. In this article, we analyze the gradient flow dynamics of the value-softmax model, defined as $\mathcal{L}(V\sigma(a))$, where $V$ and $a$ are a learnable value matrix and attention vector, respectively. As the matrix times softmax vector parameterization constitutes the core building block of self-attention, our analysis provides direct insight into transformers' training dynamics. We reveal that gradient flow on this structure inherently drives the optimization toward solutions characterized by low-entropy outputs. We demonstrate the universality of this polarizing effect across various objectives, including logistic and square loss. Furthermore, we discuss the practical implications of these theoretical results,…
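The setup described in the abstract can be explored numerically with a minimal sketch. It assumes the model output is a value matrix V times softmax(a), uses a square loss against a fixed random target y, and approximates gradient flow with small-step gradient descent (an Euler discretization). The dimensions, initialization scale, step size, and the function names `simulate`, `softmax`, and `entropy` are illustrative choices, not taken from the paper.

```python
import numpy as np

def softmax(a):
    # Numerically stable softmax of a 1-D vector.
    z = np.exp(a - a.max())
    return z / z.sum()

def entropy(p):
    # Shannon entropy in nats; clip to avoid log(0).
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def simulate(d=3, n=5, steps=2000, eta=0.01, seed=0):
    # Hypothetical toy instance: V is d x n, a has length n, y is a fixed target.
    rng = np.random.default_rng(seed)
    V = 0.1 * rng.standard_normal((d, n))  # small random value matrix (assumption)
    a = np.zeros(n)                        # uniform softmax at initialization
    y = rng.standard_normal(d)
    losses, entropies = [], []
    for _ in range(steps):
        s = softmax(a)
        r = V @ s - y                      # residual of the square loss
        losses.append(0.5 * float(r @ r))
        entropies.append(entropy(s))
        g = V.T @ r                        # gradient with respect to s
        grad_V = np.outer(r, s)            # gradient with respect to V
        grad_a = s * (g - s @ g)           # softmax Jacobian (diag(s) - s s^T) applied to g
        V -= eta * grad_V                  # Euler step approximating gradient flow
        a -= eta * grad_a
    return np.array(losses), np.array(entropies)

if __name__ == "__main__":
    losses, entropies = simulate()
    print("loss:    first", losses[0], "last", losses[-1])
    print("entropy: first", entropies[0], "last", entropies[-1])
```

Plotting `entropies` against the step index shows how the softmax output evolves from the uniform distribution in this toy run. How quickly the entropy moves depends on the initialization, the target, and the step size, and a discrete simulation like this only approximates the continuous-time dynamics the paper analyzes.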
Taxonomy
Topics: Advanced Memory and Neural Computing · Stochastic Gradient Optimization Techniques · Machine Learning in Materials Science
