Connecting Independently Trained Modes via Layer-Wise Connectivity
Yongding Tian, Zaid Al-Ars, Maksim Kitsak, Peter Hofstee

TL;DR
This paper introduces a new empirical algorithm that effectively connects independently trained neural network models across diverse modern architectures, enhancing the understanding of mode connectivity in complex models.
Contribution
The authors propose a generalized empirical method for mode connectivity that works across various modern architectures and training conditions, whereas earlier empirical methods were mainly effective for basic CNNs, VGG, and ResNet.
Findings
Supports a wider range of architectures including MobileNet, ShuffleNet, and EfficientNet.
Provides more consistent connectivity paths between independently trained models.
Enables connecting modes trained with different hyperparameters.
Abstract
Empirical and theoretical studies have shown that continuous low-loss paths can be constructed between independently trained neural network models. This phenomenon, known as mode connectivity, refers to the existence of such paths between distinct modes, i.e., well-trained solutions in parameter space. However, existing empirical methods are primarily effective for older and relatively simple architectures such as basic CNNs, VGG, and ResNet, raising concerns about their applicability to modern and structurally diverse models. In this work, we propose a new empirical algorithm for connecting independently trained modes that generalizes beyond traditional architectures and supports a broader range of networks, including MobileNet, ShuffleNet, EfficientNet, RegNet, Deep Layer Aggregation (DLA), and Compact Convolutional Transformers (CCT). In addition to broader applicability, the proposed…
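To make the idea of a low-loss path concrete, here is a minimal toy sketch. It is not the paper's algorithm: the 2-D loss, the bend point, and all function names are assumptions chosen for illustration. The loss has a ring of minima on the unit circle, so the straight line between two minima crosses a high-loss region while a curved path stays close to the ring.

```python
# Toy illustration only: a 2-D "loss" whose minima form the unit circle.
# Nothing here comes from the paper; all names are hypothetical.

def loss(p):
    x, y = p
    return (x * x + y * y - 1.0) ** 2

def linear_path(a, b, t):
    """Point at position t in [0, 1] on the straight segment from a to b."""
    return tuple((1 - t) * ai + t * bi for ai, bi in zip(a, b))

def bezier_path(a, c, b, t):
    """Point at position t on a quadratic Bezier curve from a to b with control point c."""
    return tuple((1 - t) ** 2 * ai + 2 * t * (1 - t) * ci + t ** 2 * bi
                 for ai, ci, bi in zip(a, c, b))

def barrier(path, loss_fn, steps=100):
    """Largest excess of the path loss over the linearly interpolated endpoint loss."""
    start, end = loss_fn(path(0.0)), loss_fn(path(1.0))
    worst = 0.0
    for i in range(steps + 1):
        t = i / steps
        baseline = (1 - t) * start + t * end
        worst = max(worst, loss_fn(path(t)) - baseline)
    return worst

# Two "modes" (both have zero loss) and an assumed bend point for the curve.
mode_a, mode_b = (1.0, 0.0), (-1.0, 0.0)
bend = (0.0, 2.0)

linear_barrier = barrier(lambda t: linear_path(mode_a, mode_b, t), loss)
curved_barrier = barrier(lambda t: bezier_path(mode_a, bend, mode_b, t), loss)
print(f"linear barrier: {linear_barrier:.4f}")
print(f"curved barrier: {curved_barrier:.4f}")
```

In this toy setting the straight segment passes through the origin, where the loss is 1, while the curved path keeps the loss far lower. For real networks the points would be full parameter vectors and the loss would be evaluated on data, which this sketch does not attempt.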
Peer Reviews
Decision: Submitted to ICLR 2026
- The authors revisit non-linear mode connectivity for a broader set of architectures beyond the commonly studied ResNet/VGG models.
- I believe the discussion of the models trained with different hyper-parameters and the focus on different variances is an important point that is not addressed in the prior literature. The algorithm and the discussion of data flow order are sound.
- The paper does not provide clear motivation for why demonstrating mode connectivity in MobileNet, ShuffleNet, or the other selected architectures is important or beneficial. What practical or theoretical insights would we gain from showing mode connectivity in these specific models? The choice of architectures appears arbitrary and is not justified. Moreover, the proposed architectures (MobileNet, ShuffleNet, EfficientNet, RegNet) are roughly contemporary with the mode connectivity literature
- The proposed method to find low-loss paths is novel and interesting.
- The proposed method is tested for recent model architectures.
- The notion of a _variance sphere_ and the layer-wise variance correction step are intuitive and sound.
- The empirical evaluation shows the method works for a wide variety of model architectures and results are consistent across seeds.
- Algorithms are clearly presented and linked to geometric reasoning.
- Consistent paths across seeds and architectures suggest
- The experiments are _somewhat_ limited to visualizations of loss/accuracy trajectories; no quantitative comparison of path quality (e.g., path length, interpolation efficiency, energy landscape visualization).
- The effect of layer order, variance correction, and training steps is not systematically analyzed in an ablation study.
- The approach is iterative and layer-wise. The paper acknowledges high compute requirements but gives no runtime analysis.
- The presentation is clear and the paper is easy to follow.
- The experimental results verify the effectiveness of the proposed algorithm.
- I don't see why this problem is important. In my understanding, mode connectivity is more like a phenomenon that helps us better understand the loss landscape, instead of a challenge that needs to be solved by an algorithm. Perhaps the authors can describe more of the practical usefulness of their algorithm?
- The algorithm looks trivial, and is not explained at all. For example, why do you need to project back to the variance sphere at each step? How do you ensure there is no loss barrier bet
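The reviews above mention a variance sphere and a layer-wise variance correction, but this summary does not give the paper's definitions. As a hedged illustration of why such a correction might be wanted, the sketch below shows a general fact: linearly interpolating two independently drawn weight vectors shrinks their variance, and rescaling around the mean restores a chosen target. The function `project_to_variance`, the random stand-in layers, and the choice of target variance are all assumptions, not the authors' method.

```python
import random
import statistics

def interpolate(a, b, t):
    """Element-wise linear interpolation between two weight lists."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]

def project_to_variance(weights, target_var):
    """Rescale weights around their mean so their population variance equals target_var.

    Hypothetical helper: one plausible reading of "projecting onto a variance
    sphere", not the definition used in the paper.
    """
    mean = statistics.fmean(weights)
    var = statistics.pvariance(weights, mu=mean)
    if var == 0.0:
        return list(weights)  # a constant layer cannot be rescaled
    scale = (target_var / var) ** 0.5
    return [mean + scale * (w - mean) for w in weights]

# Two independent random "layers" stand in for independently trained weights.
rng = random.Random(0)
layer_a = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
layer_b = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

var_a = statistics.pvariance(layer_a)
var_b = statistics.pvariance(layer_b)

# The midpoint of two independent vectors has roughly half the variance.
midpoint = interpolate(layer_a, layer_b, 0.5)
var_mid = statistics.pvariance(midpoint)

# Assumed target: interpolate the endpoint variances, then rescale to match.
target_var = 0.5 * var_a + 0.5 * var_b
corrected = project_to_variance(midpoint, target_var)
var_corrected = statistics.pvariance(corrected)

print(f"endpoint variances: {var_a:.3f}, {var_b:.3f}")
print(f"midpoint variance before correction: {var_mid:.3f}")
print(f"midpoint variance after correction:  {var_corrected:.3f}")
```

Rescaling around the mean keeps the direction of the centered weights and changes only their spread, which is why the operation can be read as a projection onto a sphere of fixed variance. Whether the paper applies it per layer in exactly this form is not stated in the material above.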
Taxonomy
Topics: Molecular Communication and Nanonetworks · Neural Networks and Reservoir Computing · Quantum-Dot Cellular Automata
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Adam · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax
