Enhancing Neural Training via a Correlated Dynamics Model
Jonathan Brokman, Roy Betser, Rotem Turjeman, Tom Berkov, Ido Cohen, Guy Gilboa

TL;DR
This paper introduces Correlation Mode Decomposition (CMD), a novel method that clusters neural network parameters based on their correlated dynamics during training, improving efficiency and generalization in large-scale models.
Contribution
The paper presents CMD, a new algorithm that captures training dynamics through correlated parameter modes, improves model efficiency and generalization, and is applicable to federated learning.
Findings
CMD outperforms the state-of-the-art method for compactly modeling training dynamics on image classification.
CMD improves training efficiency and reduces communication overhead.
Preliminary experiments show benefits in federated learning scenarios.
Abstract
As neural networks grow in scale, their training becomes both computationally demanding and rich in dynamics. Amidst the flourishing interest in these training dynamics, we present a novel observation: Parameters during training exhibit intrinsic correlations over time. Capitalizing on this, we introduce Correlation Mode Decomposition (CMD). This algorithm clusters the parameter space into groups, termed modes, that display synchronized behavior across epochs. This enables CMD to efficiently represent the training dynamics of complex networks, like ResNets and Transformers, using only a few modes. Moreover, test set generalization is enhanced. We introduce an efficient CMD variant, designed to run concurrently with training. Our experiments indicate that CMD surpasses the state-of-the-art method for compactly modeled dynamics on image classification. Our modeling can improve training…
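The clustering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`pearson`, `affine_fit`, `cmd_sketch`), the greedy choice of reference trajectories, and the toy data are all assumptions made for illustration. The sketch relies only on the fact that two perfectly correlated trajectories are affine transformations of each other, so each parameter in a mode can be modeled as `a * reference + b`.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    su = math.sqrt(sum((x - mu) ** 2 for x in u))
    sv = math.sqrt(sum((y - mv) ** 2 for y in v))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0


def affine_fit(ref, w):
    """Least-squares (a, b) such that w[t] ~= a * ref[t] + b."""
    n = len(ref)
    mr, mw = sum(ref) / n, sum(w) / n
    var = sum((x - mr) ** 2 for x in ref)
    a = sum((x - mr) * (y - mw) for x, y in zip(ref, w)) / var if var > 0 else 0.0
    return a, mw - a * mr


def cmd_sketch(trajectories, n_modes):
    """Group parameter trajectories into modes of correlated dynamics.

    trajectories: list of per-parameter lists, one value per epoch.
    Returns (labels, ref_ids, coeffs, reconstruction).
    """
    # Hypothetical reference selection (an assumption, not from the paper):
    # start with parameter 0, then repeatedly add the parameter that is
    # least correlated, in absolute value, with the references chosen so far.
    ref_ids = [0]
    while len(ref_ids) < n_modes:
        ref_ids.append(min(
            range(len(trajectories)),
            key=lambda i: max(abs(pearson(trajectories[i], trajectories[r]))
                              for r in ref_ids),
        ))

    labels, coeffs, recon = [], [], []
    for w in trajectories:
        # Assign the parameter to the mode whose reference it tracks best.
        corrs = [abs(pearson(w, trajectories[r])) for r in ref_ids]
        k = corrs.index(max(corrs))
        ref = trajectories[ref_ids[k]]
        # Model the parameter as an affine function of its mode's reference.
        a, b = affine_fit(ref, w)
        labels.append(k)
        coeffs.append((a, b))
        recon.append([a * x + b for x in ref])
    return labels, ref_ids, coeffs, recon


# Toy data: six "parameters" over 50 epochs, built from two uncorrelated signals.
T = 50
t = [i / (T - 1) for i in range(T)]
r1 = t
r2 = [math.cos(2 * math.pi * x) for x in t]
W = [
    [2 * x + 1 for x in r1], [-3 * x + 0.5 for x in r1], [0.5 * x for x in r1],
    [1.5 * y - 2 for y in r2], [-y + 4 for y in r2], [2 * y for y in r2],
]
labels, ref_ids, coeffs, recon = cmd_sketch(W, n_modes=2)
```

On this toy input the first three trajectories land in one mode and the last three in the other, and each trajectory is recovered from two reference trajectories plus a pair of scalars per parameter. Real training trajectories are only approximately correlated, so the reconstruction there would be an approximation rather than exact.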
Peer Reviews
Decision: ICLR 2024 poster
* In general this manuscript is well-structured.
* This manuscript considers an interesting aspect of modeling the training dynamics of complex networks. The idea of using clustered parameters looks novel to the reviewer.
* The manuscript has a good logic flow, from the definition of the post-hoc CMD to online CMD and embedded CMD.
* Sufficient numerical results also justify the effectiveness of the CMD. An extension to FL is also provided in the manuscript.

* Authors are encouraged to improve the writing quality of the current manuscript.
* Regarding the experiments on FL, it remains unclear to the reviewer why only two communication-efficient FL baselines, namely APF and A-APF, are considered for the evaluation. More recent SOTA methods need to be taken into account.

* The paper introduces a novel observation regarding the intrinsic correlations over time of parameters during the training of neural networks. This insight is leveraged to develop a new algorithm, Correlation Mode Decomposition (CMD), which is a creative contribution to the field.
* Despite the complexity of the topic, the paper seems to be structured and articulated in a manner that allows the reader to follow the authors' logic and methodologies.

* The citation format within the text could be improved for consistency and adherence to academic conventions. Utilizing citation commands like \citet or \citep would enhance the readability and professionalism of the references within the text.
* Figure 2 Analysis: The benefits of the CMD method as depicted in Figure 2 are not evidently clear. In the left plot, it would be helpful to see the results over a more extended range of epochs to ascertain the method's effectiveness over a longer trai
1. The procedure of CMD seems reasonable and also novel in dimensionality reduction of learning dynamics.
2. Their proposal to use the dimensionality reduction for distributed training may also be novel, but I am less confident since I'm not an expert in this area.
3. Experimental results (Figure 3) show a surprising result: Online/Embedded CMD outperforms full SGD on CIFAR-10, which seems somewhat contradictory because Online/Embedded CMD was designed to approximate the full SGD.

1. There are many unclear points in the experimental results/figures.
   1. In each mode block in Figure 1 (Left & Middle), the correlation between most parameters tends to be less than 1.0, which does not satisfy the hypothesis behind the proposed method: `Any two time trajectories u, v that are perfectly correlated can be expressed as an affine transformation of each other`.
   2. The y-axis in Figure 1 (Right) is unclear.
   3. What are the different/common points between CMD and DMD? The paper
Taxonomy
Topics: Neural Networks and Reservoir Computing · Anomaly Detection Techniques and Applications · Neural Networks and Applications
Methods: Sparse Evolutionary Training
