Layer-wise Linear Mode Connectivity

Linara Adilova; Maksym Andriushchenko; Michael Kamp; Asja Fischer; Martin Jaggi

arXiv:2307.06966·cs.LG·March 20, 2024

TL;DR

This paper investigates layer-wise linear connectivity in deep networks, revealing that models can be connected through linear paths at the layer level, which enhances understanding of model averaging and federated learning.

Contribution

It introduces the concept of layer-wise linear connectivity and provides empirical and theoretical analysis showing deep networks lack layer-wise barriers.

Findings

01

Layer-wise averaging can produce well-performing models.

02

Deep networks exhibit layer-wise linear connectivity.

03

Models trained on different data can be connected through linear paths.

Abstract

Averaging neural network parameters is an intuitive method for fusing the knowledge of two independent models. It is most prominently used in federated learning. If models are averaged at the end of training, this can only lead to a good performing model if the loss surface of interest is very particular, i.e., the loss in the midpoint between the two models needs to be sufficiently low. This is impossible to guarantee for the non-convex losses of state-of-the-art networks. For averaging models trained on vastly different datasets, it was proposed to average only the parameters of particular layers or combinations of layers, resulting in better performing models. To get a better understanding of the effect of layer-wise averaging, we analyse the performance of the models that result from averaging single layers, or groups of layers. Based on our empirical and theoretical investigation,…
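The layer-wise averaging described in the abstract can be illustrated with a minimal sketch. It assumes a model is a dictionary mapping layer names to flat weight lists, that layers outside the interpolated group stay fixed to the first model, and that the barrier is measured against the convex combination of the two full models' losses (the reading of Definition 2 that Reviewer 01 describes below). The names `interpolate`, `barrier`, and `toy_loss` are hypothetical and are not taken from the paper or its repository.

```python
from typing import Callable, Dict, Iterable, List

Params = Dict[str, List[float]]  # layer name -> flat list of weights


def interpolate(a: Params, b: Params, layers: Iterable[str], alpha: float) -> Params:
    """Copy model `a`, replacing each layer in `layers` by alpha*a + (1-alpha)*b.

    Assumption: layers outside `layers` keep the weights of model `a`.
    """
    mixed = {name: list(w) for name, w in a.items()}
    for name in layers:
        mixed[name] = [alpha * x + (1 - alpha) * y for x, y in zip(a[name], b[name])]
    return mixed


def barrier(loss: Callable[[Params], float], a: Params, b: Params,
            layers: Iterable[str], steps: int = 11) -> float:
    """Largest gap, over a grid of alpha in [0, 1], between the loss of the
    partially interpolated model and the same convex combination of the
    losses of the two full models."""
    layers = list(layers)
    loss_a, loss_b = loss(a), loss(b)
    gaps = []
    for k in range(steps):
        alpha = k / (steps - 1)
        reference = alpha * loss_a + (1 - alpha) * loss_b
        gaps.append(loss(interpolate(a, b, layers, alpha)) - reference)
    return max(gaps)


def toy_loss(p: Params) -> float:
    """Toy separable double-well loss, used only for illustration."""
    return sum((w * w - 1.0) ** 2 for layer in p.values() for w in layer)


if __name__ == "__main__":
    model_a = {"fc1": [1.0], "fc2": [1.0]}
    model_b = {"fc1": [-1.0], "fc2": [-1.0]}
    print("single-layer barrier:", barrier(toy_loss, model_a, model_b, ["fc1"]))
    print("full-model barrier:", barrier(toy_loss, model_a, model_b, ["fc1", "fc2"]))
```

Passing a single layer name gives a layer-wise barrier, while passing several names gives the barrier for a group of layers. The toy loss is only a stand-in; in practice `loss` would evaluate a trained network on held-out data.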

Peer Reviews

Decision·ICLR 2024 poster

Reviewer 01 · Rating 6: marginally above the acceptance threshold · Confidence 4

Strengths

The results on LLMC for single layers and groups of layers across models and datasets are thorough and generally interesting. I did want to confirm that LLMC is the barrier with respect to *the average error of the two full models*, not the average error of one full model and that model with a specified layer swapped. The latter would not be very informative if this significantly increased the error, but Definition 2 seems to imply the authors used the former. I would emphasize

Weaknesses

**Section 5.1:** I found the conclusions of this section hard to parse. I assume in Fig. 4 the intent is to compare the rows of the left 2 plots to the right 2 plots. One would then see that the full model has less curvature along the averaging direction as compared to a random perturbation, but a few of the layers have the opposite trend. The paper states: "Moreover, the networks are much more robust to random perturbations compared to the direction of interpolation between models. This

Reviewer 02 · Rating 6: marginally above the acceptance threshold · Confidence 3

Strengths

- The narrative construction is good, and the literature review is comprehensive and informative.
- The reviewer appreciates the bold conjectures made throughout the paper. Although these may not always be rigorous, such daring speculations can be beneficial in stimulating further research.
- Certain experimental outcomes are intriguing, for instance, the most sensitive layers of ViTs are the early attention and fully-connected weights; averaging directions exhibit a peculiar characteristic of h

Weaknesses

- The reviewer acknowledges some insightful observations in this paper. However, from the reviewer's perspective, while the findings are interesting, they may not introduce profound novelty.
- The presented phenomenon of smaller layer-wise barriers may not be surprising. For instance, if we consider a situation where all layers are created equally, the averaging of only $\frac{1}{s}$ layers would typically result in approximately $\frac{1}{s}$ of the loss increase induced by avera

Reviewer 03 · Rating 6: marginally above the acceptance threshold · Confidence 4

Strengths

* The motivation for studying layer-wise linear mode connectivity (LLMC) is novel and interesting. It contributes new ideas in the community of linear mode connectivity.
* The authors study LLMC from different perspectives which are thorough.
* The experiments are solid. The ViTs and LLMs are also studied to show the prevalence of the findings across a large range of model architectures.
* The findings and takeaway insights are intriguing.

Weaknesses

Despite the strengths listed above, I think this paper should be improved in the following aspects.
* The analysis of cumulative LLMC (LLMC over a group of layers) needs further investigation. The authors should study the LLMC of $l$ consecutive layers on different parts of the models. The authors can conduct an experiment with a moving window of $l$ layers to show the group-layer-wise connectivity. For instance, given a 20-layer network with $l=5$; the experiments should be conducted: LLM

Code & Models

Repositories

link-er/layer-wise-lmc
PyTorch · Official

Videos

Layer-wise linear mode connectivity · SlidesLive

Taxonomy

Topics: Optical Wireless Communication Technologies · Energy Harvesting in Wireless Networks · Semiconductor Lasers and Optical Devices