Optimizer choice matters for the emergence of Neural Collapse

Jim Zhao; Tin Sum Cheng; Wojciech Masarczyk; Aurelien Lucchi

arXiv:2602.16642·cs.LG·February 26, 2026


Open Access · 3 Reviews

TL;DR

This paper investigates how the choice of optimizer influences Neural Collapse (NC) in deep neural networks, introducing a new diagnostic metric and providing both theoretical and empirical evidence that optimizer type affects NC emergence.

Contribution

The study introduces NC0 as a new metric for analyzing NC, and demonstrates that optimizer type and weight decay coupling critically impact NC emergence, supported by theoretical proofs and extensive experiments.

Findings

01

NC cannot emerge under decoupled weight decay in adaptive optimizers

02

Different optimizers exhibit distinct NC0 dynamics

03

Momentum accelerates NC beyond loss convergence
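To make the coupled versus decoupled distinction in the first finding concrete, the sketch below contrasts the two ways weight decay can enter a sign-based update. It uses SignGD because the reviews describe it as the proxy for Adam/AdamW in the formal results. The function names, learning rate, and decay values are illustrative assumptions, not the authors' code or experimental settings.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def signgd_step_coupled(w, grad, lr, wd):
    """Coupled decay: wd * w is added to the gradient BEFORE the sign,
    so the decay term is squashed by the sign operation as well."""
    return [wi - lr * sign(gi + wd * wi) for wi, gi in zip(w, grad)]


def signgd_step_decoupled(w, grad, lr, wd):
    """Decoupled decay (AdamW-style): the sign is applied to the loss
    gradient only, and the weights are shrunk in a separate term."""
    return [wi - lr * sign(gi) - lr * wd * wi for wi, gi in zip(w, grad)]


# Hypothetical toy values, chosen only to make the two updates differ.
w = [1.0, -2.0]
grad = [0.5, 0.1]
coupled = signgd_step_coupled(w, grad, lr=0.1, wd=0.5)
decoupled = signgd_step_decoupled(w, grad, lr=0.1, wd=0.5)
print("coupled:  ", coupled)
print("decoupled:", decoupled)
```

In the coupled step the decay can flip the direction of the update for a coordinate (here the second one), whereas in the decoupled step it only rescales the weights. This is a one-step illustration of the mechanism, not a reproduction of the paper's analysis.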

Abstract

Neural Collapse (NC) refers to the emergence of highly symmetric geometric structures in the representations of deep neural networks during the terminal phase of training. Despite its prevalence, the theoretical understanding of NC remains limited. Existing analyses largely ignore the role of the optimizer, thereby suggesting that NC is universal across optimization methods. In this work, we challenge this assumption and demonstrate that the choice of optimizer plays a critical role in the emergence of NC. The phenomenon is typically quantified through NC metrics, which, however, are difficult to track and analyze theoretically. To overcome this limitation, we introduce a novel diagnostic metric, NC0, whose convergence to zero is a necessary condition for NC. Using NC0, we provide theoretical evidence that NC cannot emerge under decoupled weight decay in adaptive optimizers, as…
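For readers unfamiliar with how NC metrics are computed, the sketch below measures a simplified within-class to between-class variability ratio on last-layer features, in the spirit of the commonly used NC1 metric. It is not NC0, whose definition is not given in this summary, and it is not necessarily the exact NC1 formula used in the paper. The function names and the toy data are assumptions made for illustration.

```python
def class_means(features, labels):
    """Return a dict mapping each label to the mean feature vector of its class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, xi in enumerate(x):
            acc[i] += xi
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def variability_ratio(features, labels):
    """Trace of the within-class covariance divided by the trace of the
    between-class covariance. A value near 0 means that features of the
    same class have collapsed onto their class mean."""
    means = class_means(features, labels)
    dim = len(features[0])
    n = len(features)
    global_mean = [sum(x[i] for x in features) / n for i in range(dim)]
    within = sum(
        sum((xi - mi) ** 2 for xi, mi in zip(x, means[y]))
        for x, y in zip(features, labels)
    ) / n
    between = sum(
        sum((mi - gi) ** 2 for mi, gi in zip(m, global_mean))
        for m in means.values()
    ) / len(means)
    return within / between


# Hypothetical 2-D features for two classes.
labels = [0, 0, 1, 1]
collapsed = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
spread = [[1.2, 0.0], [0.8, 0.0], [0.0, 1.2], [0.0, 0.8]]
print(variability_ratio(collapsed, labels))
print(variability_ratio(spread, labels))
```

The ratio is exactly zero only when every feature coincides with its class mean, which is the within-class variability collapse that NC describes. Tracking a quantity like this across training epochs is one way to compare optimizers empirically.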

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. An actionable mechanism that practitioners can immediately test and reason about. 2. NC0 provides a tractable and provable indicator, enabling convergence and impossibility statements. 3. Highlights the importance of optimization algorithm choices in NC.

Weaknesses

1. Modeling gap to real Adam/AdamW. The formal results use SignGD and unconstrained features as a proxy. While the qualitative match to Adam/AdamW is persuasive, the absence of finite-step analysis with (β₁, β₂, ε) leaves open whether corner cases could break the claimed dynamics. 2. NC0 is necessary, not sufficient. NC0→0 can hold while full NC (NC1–NC3) still fails. Without parallel theory for the other NC metrics, one could over-infer the presence or absence of collapse. 3. Exte

Reviewer 02 · Rating 8 · Confidence 3

Strengths

- This work investigates Neural Collapse from a novel perspective, supported by both experimentation and theoretical evidence. - It opens up for further discussion and deeper investigation in the field. - The work is clearly written. - The authors critically discuss their findings and limitations.

Weaknesses

- Some plots are difficult to interpret due to overlapping lines and similar colors.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- S1: Both theoretical and empirical evidence presented in the paper is convincing (modulo a few caveats mentioned in the weaknesses). The paper makes it clear that the presented phenomenon is indeed happening. Some of the ablations in the appendix (for instance the one with 2000 training epochs) reassure me even more. - S2: The paper tackles an important topic of better understanding the conditions under which the neural collapse emerges. - S3: The paper is mostly well written and easy to fol

Weaknesses

- W1: If I correctly understand the mathematical essence of the paper, it seems that the authors are misinterpreting their results. To me, the main distinction should not be coupled vs. decoupled weight decay but rather first- vs. second-order optimizers. It seems that the crucial parameter that really determines the limiting behavior is not so much the implementation of the weight decay, but rather whether the gradient computation corresponds precisely to the original loss function and thus, whethe

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural dynamics and brain function · Advanced Memory and Neural Computing