Rethinking Weight Decay for Robust Fine-Tuning of Foundation Models
Junjiao Tian, Chengyue Huang, Zsolt Kira

TL;DR
This paper introduces Selective Projection Decay (SPD), a new weight decay method that selectively regularizes model layers to improve robustness and generalization when fine-tuning foundation models.
Contribution
SPD is a novel weight decay technique that selectively applies regularization to certain layers, enhancing robustness and generalization in fine-tuning foundation models.
Findings
SPD improves in-distribution generalization.
SPD enhances out-of-distribution robustness.
Adam with SPD outperforms standard methods on several standard robust fine-tuning benchmarks.
Abstract
Modern optimizers such as AdamW, equipped with momentum and adaptive learning rate, are designed to escape local minima and explore the vast parameter space. This exploration is beneficial for finding good loss basins when training from scratch. It is not necessarily ideal when resuming from a powerful foundation model because it can lead to large deviations from the pre-trained initialization and, consequently, worse robustness and generalization. At the same time, strong regularization on all parameters can lead to under-fitting. We hypothesize that selectively regularizing the parameter space is the key to fitting and retaining the pre-trained knowledge. This paper proposes a new weight decay technique, Selective Projection Decay (SPD), that selectively imposes a strong penalty on certain layers while allowing others to change freely. Intuitively, SPD expands and contracts the…
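The idea of penalizing some layers while leaving others free can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract above does not state SPD's actual selection condition or projection, so the function name `selective_decay_step`, the plain gradient step standing in for Adam, the selection test, and the values of `lr` and `lam` are all assumptions made for illustration. The test used here pulls a layer back toward its pre-trained weights only when, to first order, doing so would also lower the loss.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def selective_decay_step(
    params: Dict[str, Vector],
    init: Dict[str, Vector],
    grads: Dict[str, Vector],
    lr: float = 0.1,
    lam: float = 0.5,
) -> Tuple[Dict[str, Vector], List[str]]:
    """One update per layer, with a pull toward `init` on selected layers only.

    Hypothetical rule (an assumption, not taken from the paper): a layer is
    pulled back when grad . (theta - theta0) > 0, i.e. when shrinking the
    deviation from the pre-trained weights reduces the loss to first order.
    """
    updated: Dict[str, Vector] = {}
    pulled: List[str] = []
    for name, theta in params.items():
        theta0, g = init[name], grads[name]
        # Plain gradient step; a stand-in for the Adam update.
        stepped = [t - lr * gi for t, gi in zip(theta, g)]
        deviation = [t - t0 for t, t0 in zip(theta, theta0)]
        if dot(g, deviation) > 0:
            # L2-SP-style shrinkage toward the pre-trained weights.
            stepped = [s - lam * (s - t0) for s, t0 in zip(stepped, theta0)]
            pulled.append(name)
        updated[name] = stepped
    return updated, pulled


# Toy example with two "layers" (hypothetical names and values).
init = {"backbone": [1.0, 1.0], "head": [0.0, 0.0]}
params = {"backbone": [2.0, 1.0], "head": [0.0, 1.0]}
grads = {"backbone": [1.0, 0.0], "head": [0.0, -1.0]}
new_params, pulled = selective_decay_step(params, init, grads)
print(pulled)  # only "backbone" meets the selection test
```

In this toy run the backbone layer is drawn back toward its initialization while the head moves freely, which mirrors the contrast with regularizing every parameter uniformly.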
Peer Reviews
Decision: NeurIPS 2024 poster
The simplicity of the proposed method is its major strength, making it easy for many users to adopt. Compatibility with parameter-efficient fine-tuning is a significant advantage. Additionally, its simplicity ensures good reproducibility. Moreover, the experiment details are well described, further supporting reproducibility. The presentation is very clear. Mostly, it’s easy to follow the main idea, interpretations, and experiment results. Specifically, the difference between L2-SP and SPD is c
No obvious weaknesses are observed. One minor issue is that the performance improvement, while consistently observed, is not significantly high. Additionally, I have noted some unclear justifications in the manuscript, as described in the questions section.
- SPD is well-motivated and simple to implement.
- The paper is well-written and easy to follow. It clearly lays out intuition and motivation. I especially liked sections 3.3 and 3.4, which relate the condition to online hyperparameter optimization and the deviation ratio to a re-interpretation of L2-SP as a projection. I thought the line of logic here was quite clear.
- The paper demonstrates strong performance on several standard benchmarks for robust fine-tuning.
- Overall, I think this is a strong paper, and its strengths outweigh its weaknesses. I think the paper could benefit from an ablation experiment, for example, trying Adam-SPD without the condition or benchmarking SPD's performance with optimizers other than Adam.
- [1] reports that ensembling with the initial weights (whether in model space or weight space) is a simple strategy that improves OOD robustness, and a few recent works [2, 3] report that ensembling continues to improve performance af
Taxonomy
Topics: Infrastructure Maintenance and Monitoring
Methods: AdamW · Adam · Weight Decay
