Transferring Learning Trajectories of Neural Networks
Daiki Chijiwa

TL;DR
This paper introduces a novel method for transferring the learning trajectory of neural networks to new initializations, enabling faster training and improved initial accuracy, which reduces computational costs.
Contribution
It formulates the learning transfer problem and proposes the first algorithm to transfer trajectories by matching gradients, demonstrating practical benefits.
Findings
Transferred parameters achieve non-trivial accuracy before training.
Transferred parameters can be trained faster than from scratch.
The method effectively utilizes previous learning trajectories.
Abstract
Training deep neural networks (DNNs) is computationally expensive, which is problematic especially when performing duplicated or similar training runs in model ensemble or fine-tuning pre-trained models, for example. Once we have trained one DNN on some dataset, we have its learning trajectory (i.e., a sequence of intermediate parameters during training) which may potentially contain useful information for learning the dataset. However, there has been no attempt to utilize such information of a given learning trajectory for another training. In this paper, we formulate the problem of "transferring" a given learning trajectory from one initial parameter to another one (learning transfer problem) and derive the first algorithm to approximately solve it by matching gradients successively along the trajectory via permutation symmetry. We empirically show that the transferred parameters…
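The abstract's reference to permutation symmetry can be made concrete with a toy example. The sketch below is not the paper's algorithm. It only assumes a two-layer ReLU MLP without biases, and every name in it (`forward`, `permute_hidden`, `best_permutation`) is hypothetical. It shows that reordering hidden units leaves the network function unchanged, and that a permutation aligning one update direction with another can be found by search. The brute-force search is a stand-in for a proper linear assignment solver and is only feasible for very small hidden widths.

```python
import itertools
import numpy as np


def forward(W1, W2, x):
    """Two-layer ReLU MLP without biases (an assumption of this sketch)."""
    return W2 @ np.maximum(W1 @ x, 0.0)


def permute_hidden(W1, W2, perm):
    """Reorder hidden units: rows of W1 and columns of W2 move together."""
    return W1[perm, :], W2[:, perm]


def best_permutation(src_W1, src_W2, tgt_W1, tgt_W2):
    """Brute-force the hidden-unit permutation of the source tensors that
    maximizes their inner product with the target tensors.
    Illustrative only: cost grows factorially with the hidden width."""
    hidden = src_W1.shape[0]
    best, best_score = None, -np.inf
    for perm in itertools.permutations(range(hidden)):
        p = list(perm)
        pW1, pW2 = permute_hidden(src_W1, src_W2, p)
        score = np.sum(pW1 * tgt_W1) + np.sum(pW2 * tgt_W2)
        if score > best_score:
            best, best_score = p, score
    return best


rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # hidden x input
W2 = rng.normal(size=(2, 4))   # output x hidden
x = rng.normal(size=3)

# 1) Permuting hidden units does not change the function the network computes.
true_perm = [2, 0, 3, 1]
pW1, pW2 = permute_hidden(W1, W2, true_perm)
same_output = np.allclose(forward(W1, W2, x), forward(pW1, pW2, x))

# 2) Given a source update and a target direction that is a hidden-unit
#    permutation of it, the search recovers that permutation.
src_dW1 = rng.normal(size=(4, 3))
src_dW2 = rng.normal(size=(2, 4))
tgt_dW1, tgt_dW2 = permute_hidden(src_dW1, src_dW2, true_perm)
recovered = best_permutation(src_dW1, src_dW2, tgt_dW1, tgt_dW2)
```

Maximizing the inner product here is equivalent to minimizing the squared distance between the permuted source tensors and the target tensors, because permutations preserve norms.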
Peer Reviews
Decision: ICLR 2024 poster
1. The proposed algorithm is a novel approach to the problem of transferring a learning trajectory from one initial parameter to another. The idea is interesting.
2. The algorithm is theoretically grounded and can be solved efficiently with only several tens of gradient computations and lightweight linear optimization.
3. The empirical results show that the transferred parameters achieve non-trivial accuracy before any direct training and can be trained significantly faster than training from sc
1. The empirical evaluation of the algorithm is conducted on a limited set of benchmark datasets, and it is unclear how well the algorithm would perform on other types of datasets or in real-world scenarios.
2. The paper assumes that the source and target tasks are related, and it is unclear how well the algorithm would perform when the tasks are not so closely related.
3. The paper does not provide a detailed analysis of the computational cost of the algorithm, which may be a concern for large-scale
### Originality and significance
The proposed learning transfer problem is novel and very interesting, with potentially wide applications, as the foundation-model paradigm prevails in many AI / DL fields. The proposed method is to progressively merge the target network with the source network using Git Re-basin, which is straightforward and efficient.
### Quality
Theoretical analysis is performed to justify the adopted method, in addition to a series of insightful experiments. The experi
I am open to changing my score if the authors can address the following concerns:
### Lack of experiments more closely demonstrating the actual usage of the proposed method
1. One potential usage of the proposed method suggested by the authors is to transfer the update of a foundation model to its fine-tuned versions. However, all experiments are limited to network architectures of relatively small scale, and to cases where the fine-tuning task shares exactly the same number of classes as the
The “learning transfer problem” that the authors propose is novel to me. To address this problem, the authors propose an algorithm to match the trajectories between source and target, which seems convincing. To evaluate the validity of the proposed algorithm, the authors conducted an experiment that, without any training, transfers the calculated parameters to match the trajectory, and it was somewhat successful. In addition, as a result of fine-tuning after transferring the para
1. The motivation is lacking despite the promising results. There is no evidence that having the same trajectory between tasks is always good.
2. To show experimentally that initialization or architecture affects the similarity more than the dataset does, verification on more datasets is necessary.
Taxonomy
Topics: Machine Learning and Data Classification · Machine Learning and Algorithms · Anomaly Detection Techniques and Applications
