Neural Collapse is Globally Optimal in Deep Regularized ResNets and Transformers

Peter Súkeník; Christoph H. Lampert; Marco Mondelli

arXiv:2505.15239 · cs.LG · May 22, 2025

TL;DR

This paper proves that the global optima of deep regularized ResNets and transformers are approximately collapsed, with the approximation tightening as depth grows. This gives neural collapse a theoretical foundation in modern architectures, supported by empirical validation.

Contribution

It extends neural collapse analysis from data-agnostic models and multi-layer perceptrons to ResNets and transformers with LayerNorm in a data-aware regime, showing that their global optima are approximately collapsed at large depth.

Findings

01 Global optima are approximately collapsed, and the approximation tightens as depth grows.

02 End-to-end training of a large-depth ResNet or transformer formally reduces to an equivalent unconstrained features model.

03 Empirical validation on vision and language datasets.

Abstract

The empirical emergence of neural collapse -- a surprising symmetry in the feature representations of the training data in the penultimate layer of deep neural networks -- has spurred a line of theoretical research aimed at its understanding. However, existing work focuses on data-agnostic models or, when data structure is taken into account, it remains limited to multi-layer perceptrons. Our paper fills both these gaps by analyzing modern architectures in a data-aware regime: we prove that global optima of deep regularized transformers and residual networks (ResNets) with LayerNorm trained with cross entropy or mean squared error loss are approximately collapsed, and the approximation gets tighter as the depth grows. More generally, we formally reduce any end-to-end large-depth ResNet or transformer training into an equivalent unconstrained features model, thus justifying its wide use…
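The symmetry described in the abstract can be checked numerically on penultimate-layer features. The sketch below is a minimal illustration in plain Python, not the paper's evaluation code: the function names (`class_means`, `nc1_ratio`, `mean_cosines`) and the toy data are hypothetical, and the two quantities are simplified versions of diagnostics commonly used in the neural collapse literature (within-class variability relative to between-class variability, and the angles between centered class means).

```python
import math

def class_means(features, labels):
    """Return {label: mean feature vector}."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def global_mean(features):
    """Mean of all feature vectors, ignoring labels."""
    n = len(features)
    return [sum(x[d] for x in features) / n for d in range(len(features[0]))]

def nc1_ratio(features, labels):
    """Within-class over between-class variability (traces of the covariances).

    A simplified diagnostic (an assumption of this sketch): values near 0
    mean that features of each class concentrate at their class mean.
    """
    means = class_means(features, labels)
    g = global_mean(features)
    within = sum(
        sum((v - m) ** 2 for v, m in zip(x, means[y]))
        for x, y in zip(features, labels)
    ) / len(features)
    between = sum(
        sum((m - c) ** 2 for m, c in zip(mu, g)) for mu in means.values()
    ) / len(means)
    return within / between

def mean_cosines(features, labels):
    """Pairwise cosines between globally centered class means."""
    means = class_means(features, labels)
    g = global_mean(features)
    centered = [[m - c for m, c in zip(means[y], g)] for y in sorted(means)]
    cosines = []
    for i in range(len(centered)):
        for j in range(i + 1, len(centered)):
            dot = sum(a * b for a, b in zip(centered[i], centered[j]))
            norm_i = math.sqrt(sum(a * a for a in centered[i]))
            norm_j = math.sqrt(sum(b * b for b in centered[j]))
            cosines.append(dot / (norm_i * norm_j))
    return cosines

# Toy data (hypothetical): 3 classes, 2 samples each, already perfectly
# collapsed onto the vertices of an equilateral triangle in 2D.
s = math.sqrt(3) / 2
feats = [[1.0, 0.0], [1.0, 0.0], [-0.5, s], [-0.5, s], [-0.5, -s], [-0.5, -s]]
labels = [0, 0, 1, 1, 2, 2]

print(nc1_ratio(feats, labels))     # 0.0 for perfectly collapsed features
print(mean_cosines(feats, labels))  # each close to -1/(K-1) with K = 3 classes
```

On real networks, `feats` would be the penultimate-layer activations of the training set. Under the paper's result one would expect the ratio to shrink and the cosines to approach -1/(K-1) as depth grows, though the exact metrics reported by the authors may differ from this sketch.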

Videos

Neural Collapse is Globally Optimal in Deep Regularized ResNets and Transformers · SlidesLive

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning · Neural Networks and Applications

Methods: Average Pooling · Convolution · Global Average Pooling · Kaiming Initialization · Max Pooling