Neural Collapse is Globally Optimal in Deep Regularized ResNets and Transformers
Peter Súkeník, Christoph H. Lampert, Marco Mondelli

TL;DR
This paper proves that the global optima of deep regularized ResNets and transformers are approximately collapsed, and that the approximation tightens as depth grows. This gives neural collapse a theoretical foundation in modern architectures, supported by empirical validation.
Contribution
The paper extends neural collapse analysis beyond data-agnostic models and multi-layer perceptrons to ResNets and transformers in a data-aware regime, showing that their global optima are approximately collapsed at large depth.
Findings
At global optima, neural collapse becomes more pronounced as depth increases.
End-to-end training of a large-depth ResNet or transformer reduces formally to an equivalent unconstrained features model.
The theory is validated empirically on vision and language datasets.
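As a rough illustration of what "more pronounced collapse" can mean in practice, the sketch below computes two quantities commonly used in the neural collapse literature on a set of penultimate-layer features: a within-class to between-class variability ratio, and the average cosine between centered class means. These are assumptions made for illustration and not necessarily the metrics used in the paper. The function names are hypothetical, the features are plain lists of floats, and classes are assumed to be balanced, so the global mean is taken as the mean of the class means.

```python
import math
from collections import defaultdict


def class_means(features, labels):
    """Return ({label: class mean}, mean of the class means)."""
    groups = defaultdict(list)
    for x, y in zip(features, labels):
        groups[y].append(x)
    means = {y: [sum(col) / len(xs) for col in zip(*xs)] for y, xs in groups.items()}
    # Assumption: balanced classes, so the global mean is the mean of class means.
    global_mean = [sum(col) / len(means) for col in zip(*means.values())]
    return means, global_mean


def variability_ratio(features, labels):
    """tr(S_W) / tr(S_B); it is 0 when every feature equals its class mean."""
    means, g = class_means(features, labels)
    within = sum(
        sum((a - b) ** 2 for a, b in zip(x, means[y]))
        for x, y in zip(features, labels)
    ) / len(features)
    between = sum(
        sum((a - b) ** 2 for a, b in zip(m, g)) for m in means.values()
    ) / len(means)
    return within / between


def mean_pairwise_cosine(features, labels):
    """Average cosine between distinct centered class means.

    For K class means forming a simplex equiangular tight frame,
    every pairwise cosine equals -1 / (K - 1).
    """
    means, g = class_means(features, labels)
    centered = [[a - b for a, b in zip(m, g)] for m in means.values()]
    cosines = []
    for i in range(len(centered)):
        for j in range(i + 1, len(centered)):
            u, v = centered[i], centered[j]
            dot = sum(a * b for a, b in zip(u, v))
            norm_u = math.sqrt(sum(a * a for a in u))
            norm_v = math.sqrt(sum(b * b for b in v))
            cosines.append(dot / (norm_u * norm_v))
    return sum(cosines) / len(cosines)
```

Under these assumptions, one would extract penultimate-layer features from networks of increasing depth and check whether `variability_ratio` moves toward 0 while `mean_pairwise_cosine` moves toward -1 / (K - 1), where K is the number of classes.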
Abstract
The empirical emergence of neural collapse -- a surprising symmetry in the feature representations of the training data in the penultimate layer of deep neural networks -- has spurred a line of theoretical research aimed at its understanding. However, existing work focuses on data-agnostic models or, when data structure is taken into account, it remains limited to multi-layer perceptrons. Our paper fills both these gaps by analyzing modern architectures in a data-aware regime: we prove that global optima of deep regularized transformers and residual networks (ResNets) with LayerNorm trained with cross entropy or mean squared error loss are approximately collapsed, and the approximation gets tighter as the depth grows. More generally, we formally reduce any end-to-end large-depth ResNet or transformer training into an equivalent unconstrained features model, thus justifying its wide use…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning · Neural Networks and Applications
Methods: Average Pooling · Convolution · Global Average Pooling · Kaiming Initialization · Max Pooling
