TL;DR
This paper introduces Soft Weight Rescaling (SWR), a method to prevent unbounded weight growth in neural networks, thereby recovering plasticity and improving learning performance without losing learned information.
Contribution
The paper proposes SWR, a novel technique that bounds weight magnitudes and maintains network plasticity. It is supported by theoretical proofs that SWR bounds weight magnitude and balances it between layers, and by empirical validation on warm-start and continual learning.
Findings
SWR effectively bounds weight magnitudes during training.
SWR improves performance in continual and warm-start learning.
SWR maintains learned information while enhancing plasticity.
Abstract
Recent studies have shown that as training progresses, neural networks gradually lose their capacity to learn new information, a phenomenon known as plasticity loss. Unbounded weight growth is one of the main causes of plasticity loss. Furthermore, it harms generalization capability and disrupts optimization dynamics. Re-initializing the network can be a solution, but it results in the loss of learned information, leading to performance drops. In this paper, we propose Soft Weight Rescaling (SWR), a novel approach that prevents unbounded weight growth without losing information. SWR recovers the plasticity of the network by simply scaling down the weights at each step of the learning process. We theoretically prove that SWR bounds weight magnitude and balances weight magnitude between layers. Our experiments show that SWR improves performance on warm-start learning, continual…
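The idea of scaling weights down at each step can be illustrated with a small sketch. The exact SWR update rule is not given in this summary, so the rule below is an assumption chosen only for illustration: each layer's Frobenius norm is pulled part of the way back toward its norm at initialization. The names soft_rescale, simulate, lam, and growth are hypothetical, and the multiplicative growth factor is a stand-in for optimizer steps that inflate the weights.

```python
import numpy as np


def soft_rescale(weights, init_norms, lam=0.1):
    """Shrink each layer toward its initial Frobenius norm.

    Assumed rule (not taken from the paper): the target norm is an
    interpolation between the current norm and the initial norm.
    """
    rescaled = []
    for w, n0 in zip(weights, init_norms):
        n = np.linalg.norm(w)
        target = (1.0 - lam) * n + lam * n0
        scale = target / n if n > 0 else 1.0
        # A positive scalar multiple keeps the direction of w unchanged.
        rescaled.append(w * scale)
    return rescaled


def simulate(steps=200, growth=1.05, lam=0.1, rescale=True, seed=0):
    """Return the final norm divided by the initial norm."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((4, 4))
    n0 = np.linalg.norm(w)
    for _ in range(steps):
        w = w * growth  # stand-in for a training step that grows the weights
        if rescale:
            w = soft_rescale([w], [n0], lam)[0]
    return np.linalg.norm(w) / n0


if __name__ == "__main__":
    print("norm ratio with rescaling:   ", simulate(rescale=True))
    print("norm ratio without rescaling:", simulate(rescale=False))
```

Under these assumptions the norm stays bounded whenever (1 - lam) * growth < 1, while the unscaled run grows geometrically. Because each layer is multiplied by a positive scalar, its direction is preserved, which is one way to read the claim that rescaling does not discard learned information.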
Peer Reviews
Decision·Submitted to ICLR 2025
1. The paper is easy to follow.
2. I think the authors are focusing on an interesting topic, i.e., loss of plasticity, that is worth probing.
3. The proposed method is simple and can be easily implemented in practice.
1. Unbounded weight growth is one of the main causes of plasticity loss, and the authors propose reducing weight magnitude through weight scaling. Reducing weight magnitude is already common in training, where L2 regularization is widely used. So I think the key here lies in comparing the proposed method to L2. However, after reviewing the text, I did not find a clear rationale for choosing the proposed method over L2. Could the authors provide specific cases that demonstrate the
- The paper is overall clearly written and the method is adequately described.
- The proposed method SWR is computationally more efficient than previously proposed methods.
- The experiment results and analysis provided in the paper are insightful.
- The experimental results on smaller models are quite weak. For example, in warm-start and continual learning experiments, L2 (or S&P) seems to be better in most experiments (including the ones in the appendix). Even in Table 1, except for VGG, I wouldn't say the improvements are significantly higher since there's quite a bit of overlap with L2 in terms of standard deviations in MLP, and CNN cases. SWR only performs well on VGG which is not a very popular architecture even for vision-based expe
- This work progressively establishes and justifies its framework, making the paper easy to follow.
- The results are promising; however, I have some concerns regarding the results, as discussed below
- One main drawback is the paper's limited applicability. The authors make many assumptions (e.g., the network is affine, homogeneous with ReLU), which limits the contributions and the applicability of the paper in real-world scenarios.
- Some crucial statements are made without proper references. Furthermore, these statements conflict with statements in various peer-reviewed and significant publications.
- The paper came up with many theorems and definitions without e
