Gradient Flow Polarizes Softmax Outputs towards Low-Entropy Solutions

Aditya Varre; Mark Rofin; Nicolas Flammarion

arXiv:2603.06248·cs.LG·March 9, 2026

Open Access

TL;DR

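This paper analyzes gradient flow on the value-softmax model $L(V\sigma(a))$, the matrix-times-softmax building block of self-attention, and shows that training is inherently driven toward low-entropy outputs. This bias helps explain empirical behaviors observed in transformer training, such as attention sinks.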

Contribution

The paper provides a theoretical analysis of gradient flow in softmax-based models, shows that the low-entropy bias holds across objectives, including logistic and square loss, and links it to phenomena observed in transformer training.

Findings

01. Gradient flow drives solutions towards low-entropy outputs.

02. The low-entropy polarizing effect is universal across objectives.

03. Theoretical insights explain phenomena like attention sinks.

Abstract

Understanding the intricate non-convex training dynamics of softmax-based models is crucial for explaining the empirical success of transformers. In this article, we analyze the gradient flow dynamics of the value-softmax model, defined as $L(V\sigma(a))$, where $V$ and $a$ are a learnable value matrix and attention vector, respectively. As the matrix-times-softmax-vector parameterization constitutes the core building block of self-attention, our analysis provides direct insight into transformers' training dynamics. We reveal that gradient flow on this structure inherently drives the optimization toward solutions characterized by low-entropy outputs. We demonstrate the universality of this polarizing effect across various objectives, including logistic and square loss. Furthermore, we discuss the practical implications of these theoretical results,…
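To make the value-softmax model concrete, the following is a minimal numerical sketch, not the authors' code. It assumes the square loss $L(f) = \tfrac{1}{2}\|f - y\|^2$ with a random target $y$, and it approximates gradient flow by explicit Euler steps (plain gradient descent with a small step size). The dimensions, initialization scale, step size, and all function names (`softmax`, `entropy`, `square_loss`, `gradients`, `simulate`) are illustrative assumptions. The sketch records the loss and the entropy of $\sigma(a)$ along the trajectory so that the entropy behavior described in the abstract can be inspected for a given seed.

```python
import numpy as np

def softmax(a):
    # Numerically stable softmax of a 1-D vector.
    z = np.exp(a - np.max(a))
    return z / z.sum()

def entropy(p):
    # Shannon entropy (in nats) of a probability vector.
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def square_loss(V, a, y):
    # L(V sigma(a)) with L(f) = 0.5 * ||f - y||^2 (assumed objective).
    r = V @ softmax(a) - y
    return 0.5 * float(r @ r)

def gradients(V, a, y):
    # Chain rule through f = V s with s = softmax(a).
    s = softmax(a)
    g = V @ s - y                  # dL/df for the square loss
    grad_V = np.outer(g, s)        # dL/dV = g s^T
    u = V.T @ g
    grad_a = s * u - s * (s @ u)   # (diag(s) - s s^T) V^T g
    return grad_V, grad_a

def simulate(d=3, n=5, steps=2000, lr=1e-2, seed=0):
    # Explicit Euler discretization of gradient flow (an assumption:
    # true gradient flow is the continuous-time limit lr -> 0).
    rng = np.random.default_rng(seed)
    V = 0.1 * rng.standard_normal((d, n))  # small random value matrix
    a = np.zeros(n)                        # uniform softmax at start
    y = rng.standard_normal(d)             # hypothetical target
    history = []
    for _ in range(steps):
        history.append((square_loss(V, a, y), entropy(softmax(a))))
        grad_V, grad_a = gradients(V, a, y)
        V -= lr * grad_V
        a -= lr * grad_a
    history.append((square_loss(V, a, y), entropy(softmax(a))))
    return V, a, history

if __name__ == "__main__":
    V, a, history = simulate()
    for step in (0, len(history) // 2, len(history) - 1):
        loss, ent = history[step]
        print(f"step {step}: loss={loss:.6f}, entropy={ent:.6f}")
```

Starting from `a = 0` puts the softmax at its maximum entropy, $\log n$, so any movement of `a` away from uniform shows up directly in the recorded entropy. Swapping `square_loss` and the `g` term in `gradients` for another differentiable objective, such as the logistic loss, would be one way to probe the same question under a different loss.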

Taxonomy

Topics: Advanced Memory and Neural Computing · Stochastic Gradient Optimization Techniques · Machine Learning in Materials Science