Diagonalizing the Softmax: Hadamard Initialization for Tractable Cross-Entropy Dynamics
Connall Garrod, Jonathan P. Keating, Christos Thrampoulidis

TL;DR
This paper analyzes the training dynamics of cross-entropy loss in a non-convex setting. It shows that Hadamard initialization diagonalizes the softmax operator, which makes it possible to track convergence to neural collapse in detail.
Contribution
The paper analyzes CE dynamics in a non-convex two-layer linear network, proves that gradient flow converges to the neural collapse geometry, and shows that Hadamard initialization diagonalizes the softmax operator.
Findings
Gradient flow on CE converges to neural collapse geometry.
Hadamard initialization diagonalizes the softmax operator.
The analysis provides a pathway for studying CE dynamics beyond simple models.
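The summary does not spell out the construction behind the diagonalization claim, so the following is only a sketch of one setting in which a statement of this form holds exactly. It assumes a Sylvester-type Hadamard matrix H of order K = 2^m and a logit matrix of the form Z = H diag(λ) Hᵀ / K; both choices, the spectrum `eigs`, and all function names are illustrative assumptions, not the paper's definitions. The check is numerical: after a column-wise softmax, conjugating by H still yields a diagonal matrix.

```python
import math

def sylvester_hadamard(m):
    """2^m x 2^m Sylvester Hadamard matrix: entry (i, j) is (-1)^popcount(i & j)."""
    k = 1 << m
    return [[-1 if bin(i & j).count("1") % 2 else 1 for j in range(k)]
            for i in range(k)]

def transpose(a):
    return [list(row) for row in zip(*a)]

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def column_softmax(z):
    """Softmax applied independently to each column of z."""
    rows, cols = len(z), len(z[0])
    out = [[0.0] * cols for _ in range(rows)]
    for j in range(cols):
        col = [z[i][j] for i in range(rows)]
        shift = max(col)  # subtract the max for numerical stability
        exps = [math.exp(v - shift) for v in col]
        total = sum(exps)
        for i in range(rows):
            out[i][j] = exps[i] / total
    return out

def conjugate_by_hadamard(h, a):
    """Return H^T A H / K, i.e. A expressed in the (normalized) Hadamard basis."""
    k = len(h)
    prod = matmul(matmul(transpose(h), a), h)
    return [[v / k for v in row] for row in prod]

def max_off_diagonal(a):
    n = len(a)
    return max(abs(a[i][j]) for i in range(n) for j in range(n) if i != j)

# Assumed setup: K = 4 classes and an arbitrary illustrative spectrum.
H = sylvester_hadamard(2)
K = len(H)
eigs = [0.0, 1.5, -0.7, 2.3]
D = [[eigs[i] if i == j else 0.0 for j in range(K)] for i in range(K)]

# Logits that are diagonal in the Hadamard basis: Z = H D H^T / K.
Z = [[v / K for v in row] for row in matmul(matmul(H, D), transpose(H))]

P = column_softmax(Z)
P_hat = conjugate_by_hadamard(H, P)

print("max off-diagonal of H^T Z H / K:", max_off_diagonal(conjugate_by_hadamard(H, Z)))
print("max off-diagonal of H^T softmax(Z) H / K:", max_off_diagonal(P_hat))
```

The reason this works in the assumed setting is that Z[i][j] depends only on the bitwise XOR of i and j, so every column is a permutation of the same values. The softmax normalizer is therefore identical across columns, and the XOR structure that the Hadamard matrix diagonalizes is preserved.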
Abstract
Cross-entropy (CE) training loss dominates deep learning practice, yet existing theory often relies on simplifications, either replacing it with squared loss or restricting to convex models, that miss essential behavior. CE and squared loss generate fundamentally different dynamics, and convex linear models cannot capture the complexities of non-convex optimization. We provide an in-depth characterization of multi-class CE optimization dynamics beyond the convex regime by analyzing a canonical two-layer linear neural network with standard-basis vectors as inputs: the simplest non-convex extension for which the implicit bias remained unknown. This model coincides with the unconstrained features model used to study neural collapse, making our work the first to prove that gradient flow on CE converges to the neural collapse geometry. We construct an explicit Lyapunov function that…
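For orientation, the model described in the abstract can be written in symbols as below. The notation (W, H, h_i, y_i, K, n, d) and the 1/n averaging are assumptions made here for illustration and may differ from the paper's conventions.

```latex
% Two-layer linear network on standard-basis inputs e_i \in \mathbb{R}^n.
% Since H e_i = h_i, each sample gets its own free feature vector,
% which is the unconstrained features model.
f(e_i) = W H e_i = W h_i,
\qquad W \in \mathbb{R}^{K \times d},
\quad H = [h_1, \dots, h_n] \in \mathbb{R}^{d \times n}

% Cross-entropy loss with labels y_i \in \{1, \dots, K\}.
\mathcal{L}(W, H) = -\frac{1}{n} \sum_{i=1}^{n}
  \log \frac{\exp\big((W h_i)_{y_i}\big)}{\sum_{k=1}^{K} \exp\big((W h_i)_k\big)}

% Gradient flow on both layers.
\dot{W}(t) = -\nabla_W \mathcal{L}\big(W(t), H(t)\big),
\qquad
\dot{H}(t) = -\nabla_H \mathcal{L}\big(W(t), H(t)\big)
```

The loss depends on the parameters only through the product WH, which is what makes the problem non-convex in (W, H) even though the network is linear.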
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural dynamics and brain function · Neural Networks and Reservoir Computing
