Entropic Confinement and Mode Connectivity in Overparameterized Neural Networks
Luca Di Carlo, Chase Goddard, David J. Schwab

TL;DR
This paper investigates why neural network optimization remains confined despite connected low-loss paths, revealing entropic barriers caused by curvature variations and noise that influence solution localization.
Contribution
It introduces the concept of entropic barriers in loss landscapes, explaining the confinement and connectivity phenomena in overparameterized neural networks.
Findings
Curvature increases away from minima, creating entropic barriers.
Noise biases dynamics back toward endpoints, maintaining confinement.
Entropic barriers last longer than energetic barriers, affecting late-time solutions.
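The findings above can be illustrated with a standard Gaussian-marginalization argument. The derivation below is a sketch on a toy model and is not taken from the paper: the loss L(x, y), the path coordinate x, the n transverse directions y, the curvature profile k(x), and the noise temperature T are all assumptions introduced for illustration, and the optimizer noise is idealized as isotropic Langevin noise with a Gibbs stationary density.

```latex
% Toy model (assumed for illustration): x runs along a zero-loss path,
% y \in \mathbb{R}^n are transverse directions with curvature k(x) > 0.
\begin{align}
L(x, y) &= \tfrac{1}{2}\, k(x)\, \lVert y \rVert^2,
  \qquad L(x, 0) = 0 \ \text{for all } x, \\
% Stationary density of overdamped Langevin dynamics at temperature T.
p(x, y) &\propto \exp\!\left(-\frac{L(x, y)}{T}\right), \\
% Integrate out the transverse directions (Gaussian integral).
p(x) &\propto \int_{\mathbb{R}^n}
  \exp\!\left(-\frac{k(x)\, \lVert y \rVert^2}{2T}\right) \mathrm{d}^n y
  = \left(\frac{2\pi T}{k(x)}\right)^{n/2}, \\
% Effective free energy along the path, and the resulting entropic force.
F(x) &\equiv -T \log p(x) = \frac{nT}{2} \log k(x) + \text{const}, \\
-\frac{\mathrm{d}F}{\mathrm{d}x} &= -\frac{nT}{2}\, \frac{k'(x)}{k(x)}.
\end{align}
```

In this toy setting the force points toward lower curvature even though the loss along the path is exactly flat. It vanishes as T goes to zero and grows with the number of transverse directions n, which is consistent with the qualitative picture of noise plus rising curvature producing confinement.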
Abstract
Modern neural networks exhibit a striking property: basins of attraction in the loss landscape are often connected by low-loss paths, yet optimization dynamics generally remain confined to a single convex basin and rarely explore intermediate points. We resolve this paradox by identifying entropic barriers arising from the interplay between curvature variations along these paths and noise in optimization dynamics. Empirically, we find that curvature systematically rises away from minima, producing effective forces that bias noisy dynamics back toward the endpoints - even when the loss remains nearly flat. These barriers persist longer than energetic barriers, shaping the late-time localization of solutions in parameter space. Our results highlight the role of curvature-induced entropic forces in governing both connectivity and confinement in deep learning landscapes.
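The mechanism described in the abstract can be sketched numerically on a toy problem. The example below is a minimal illustration under stated assumptions, not the paper's experimental setup: it uses a hypothetical loss L(x, y) = 0.5 * k(x) * |y|^2 that is exactly zero along the path y = 0, a hypothetical curvature profile k(x) = k0 * (1 + a * x^2), and isotropic Langevin noise as a stand-in for SGD noise. The function name `simulate` and all parameter values are illustrative choices.

```python
import numpy as np


def simulate(a, n_transverse=20, n_walkers=1000, temperature=0.05,
             k0=1.0, dt=0.01, n_steps=2000, seed=0):
    """Euler-Maruyama integration of noisy gradient dynamics on the toy loss
    L(x, y) = 0.5 * k(x) * |y|^2 with k(x) = k0 * (1 + a * x^2).

    x is the coordinate along the zero-loss path; y holds the transverse
    directions. Returns the final x position of every walker.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(n_walkers)                    # all walkers start at x = 0
    y = np.zeros((n_walkers, n_transverse))    # and exactly on the flat path
    noise_scale = np.sqrt(2.0 * temperature * dt)
    for _ in range(n_steps):
        k = k0 * (1.0 + a * x**2)              # transverse curvature at x
        dk_dx = 2.0 * k0 * a * x               # its derivative along the path
        grad_x = 0.5 * dk_dx * np.sum(y**2, axis=1)   # dL/dx
        grad_y = k[:, None] * y                        # dL/dy
        # Gradient step plus isotropic Gaussian noise (assumed noise model).
        x = x - grad_x * dt + noise_scale * rng.standard_normal(n_walkers)
        y = y - grad_y * dt + noise_scale * rng.standard_normal(y.shape)
    return x


# Control: constant curvature (a = 0), so x diffuses freely along the path.
x_flat = simulate(a=0.0)
# Curvature rising away from x = 0: same flat loss along the path.
x_entropic = simulate(a=1.0)

msd_flat = float(np.mean(x_flat**2))
msd_entropic = float(np.mean(x_entropic**2))
print(f"mean squared displacement, constant curvature: {msd_flat:.3f}")
print(f"mean squared displacement, rising curvature:   {msd_entropic:.3f}")
```

In both runs the loss along y = 0 is identically zero, so any difference in how far the walkers spread along x comes from the curvature profile interacting with the noise rather than from an energetic barrier. Using many transverse directions is a deliberate choice here, since each one contributes to the restoring effect in this toy model.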
Peer Reviews
Decision·ICLR 2026 Poster
This paper presents an interesting study of curvature-induced biases in neural network optimization. It offers valuable insights into how stochasticity combined with geometry can influence the parameter dynamics, mostly towards the end of training, when the role of the loss is diminished. The paper helps explain why certain solutions, though equivalent in terms of training loss, are preferred, and therefore challenges our understanding of the implicit biases of optimizers. The results
While the paper is conceptually interesting and seems technically sound, it suffers from awkward or overly elliptical phrasing, despite no apparent space constraints (see questions and minors below for specifics). This gives the impression that the manuscript was not thoroughly reread, which also somewhat undermines confidence in the results. The paper sometimes seems to overreach: - line 155: The reference to “the constant loss non-linear path” overstates uniqueness. In addition, identifying p
- The paper provides an interesting and fresh perspective on how optimization is biased towards certain solutions, drawing interesting connections to the implicit bias of SGD (though the relationship to the literature could be expanded here)
- The paper is sound. The authors measure curvature using the trace and maximum eigenvalue of the Hessian as well as the SVD of the Fisher, providing robust evidence that curvature systematically increases along connecting paths.
- The observations relating to the spawning experiments in
- While the core findings are convincing, the work would benefit significantly from additional ablation studies. Given that the experimental setting (ResNet-20, Wide ResNet-16-4) is not prohibitively expensive, several natural extensions would strengthen the paper:
  - How do entropic barriers manifest when learning rates are reduced? Do they become weaker (less noise)?
  - Algorithms like K-FAC, Shampoo, or natural gradient descent explicitly account for curvature. Do these methods navigate ent
- The paper's motivation is well presented and the introduction/literature review clearly set up the problem.
- The paper connects two well-known phenomena: mode connectivity and the low curvature bias of gradient-based optimizers. These two nicely merge in the paper's analysis, giving an intuitive and convincing picture that the proposed explanation is actually behind the phenomenon observed.
- I think that the experimental setups are interesting and well-designed. I particularly like the ide
1. My main concern with this work is that, while the ideas are very interesting, the experimental validation is too limited in scope to fully support them. Given that the paper makes the strong claim that these energy barriers are a "universal" phenomenon, the experiments should definitely be performed on more than two architectures (Wide ResNet-16-4 and ResNet-20) and one dataset (CIFAR-10), as the paper currently does.
2. Overall, the feeling I get from reading the paper is that a lot of space
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Reservoir Computing · Model Reduction and Neural Networks
