How to Fine-Tune Vision Models with SGD
Ananya Kumar, Ruoqi Shen, Sebastien Bubeck, Suriya Gunasekar

TL;DR
This paper compares SGD and AdamW optimizers for fine-tuning vision models, revealing that freezing the embedding layer improves SGD performance and reduces memory usage, achieving state-of-the-art results on distribution shift benchmarks.
Contribution
It demonstrates that freezing the embedding layer during fine-tuning enhances SGD's performance and efficiency, leading to new state-of-the-art results on several benchmarks.
Findings
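With full fine-tuning, AdamW substantially outperforms SGD on modern Vision Transformer and ConvNeXt models, especially on tasks with distribution shift.
Freezing the embedding layer (less than 1% of the parameters) lets SGD, with or without momentum, perform slightly better than AdamW.
SGD with frozen embeddings achieves state-of-the-art accuracy on several distribution shift benchmarks.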
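The freezing step described in the findings could be sketched in PyTorch roughly as follows. This is a minimal illustration, not the authors' code: the toy model `TinyPatchModel`, its attribute name `embed`, the helper `build_sgd_with_frozen_embedding`, and all hyperparameters are hypothetical stand-ins. In a real Vision Transformer or ConvNeXt, the embedding layer lives under a different attribute name depending on the library.

```python
import torch
from torch import nn


class TinyPatchModel(nn.Module):
    """Hypothetical stand-in for a ViT-style model: patch embedding plus head."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        # The "embedding" layer: maps 4x4 image patches to 32-dim tokens.
        self.embed = nn.Conv2d(3, 32, kernel_size=4, stride=4)
        # Everything after the embedding layer (heavily simplified here).
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(x))


def build_sgd_with_frozen_embedding(model, lr=1e-3, momentum=0.9):
    """Freeze the embedding layer and build SGD over the remaining parameters."""
    for p in model.embed.parameters():
        p.requires_grad_(False)  # no gradient is computed or stored for these
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=lr, momentum=momentum)


torch.manual_seed(0)
model = TinyPatchModel()
optimizer = build_sgd_with_frozen_embedding(model)

# One fine-tuning step on random data (illustrative values only).
x = torch.randn(8, 3, 16, 16)
y = torch.randint(0, 10, (8,))
embed_before = model.embed.weight.clone()

loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Passing only the trainable parameters to the optimizer means no momentum buffer is allocated for the frozen layer, and setting `requires_grad` to `False` means no gradient is stored for it either.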
Abstract
SGD and AdamW are the two most used optimizers for fine-tuning large neural networks in computer vision. When the two methods perform the same, SGD is preferable because it uses less memory (12 bytes/parameter with momentum and 8 bytes/parameter without) than AdamW (16 bytes/parameter). However, on a suite of downstream tasks, especially those with distribution shifts, we find that fine-tuning with AdamW performs substantially better than SGD on modern Vision Transformer and ConvNeXt models. We find that large gaps in performance between SGD and AdamW occur when the fine-tuning gradients in the first "embedding" layer are much larger than in the rest of the model. Our analysis suggests an easy fix that works consistently across datasets and models: freezing the embedding layer (less than 1% of the parameters) leads to SGD with or without momentum performing slightly better than AdamW…
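The per-parameter memory figures in the abstract can be reproduced with simple arithmetic. The sketch below assumes 32-bit floats and counts one copy each of the weights and gradients plus the optimizer's state buffers; this breakdown, the function names, and the 100-million-parameter model size are illustrative assumptions, and activation memory is not counted.

```python
BYTES_PER_FLOAT32 = 4

# Extra float32 buffers each optimizer keeps per trainable parameter.
OPTIMIZER_STATE_BUFFERS = {
    "sgd": 0,           # plain SGD keeps no state
    "sgd_momentum": 1,  # one momentum buffer
    "adamw": 2,         # first- and second-moment estimates
}


def bytes_per_trainable_param(optimizer: str) -> int:
    """Weight + gradient + optimizer state, in bytes, for one trainable parameter."""
    return BYTES_PER_FLOAT32 * (2 + OPTIMIZER_STATE_BUFFERS[optimizer])


def training_memory_bytes(num_params: int, optimizer: str, num_frozen: int = 0) -> int:
    """Parameter-related memory; frozen parameters store only their weights."""
    trainable = num_params - num_frozen
    return (trainable * bytes_per_trainable_param(optimizer)
            + num_frozen * BYTES_PER_FLOAT32)


if __name__ == "__main__":
    num_params = 100_000_000  # hypothetical model size
    for name in OPTIMIZER_STATE_BUFFERS:
        gigabytes = training_memory_bytes(num_params, name) / 1e9
        print(f"{name}: {bytes_per_trainable_param(name)} bytes/param, {gigabytes:.1f} GB")
```

Under these assumptions, a frozen parameter costs only 4 bytes, since it needs neither a gradient nor optimizer state. Because the embedding layer is under 1% of the parameters, most of the savings over AdamW come from the optimizer choice rather than from the freezing itself.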
Peer Reviews
Decision: ICLR 2024 poster
I found the paper to be extremely well written. All claims were backed up by experimental results. The method is simple, yet effective. Very good paper.
1) It seems that the entire paper hinges on models which have an embedding layer. While these models are popular now, they may not be popular forever, which limits the long-term impact of this method. 2) In models beyond vision, such as recommendation models (e.g. DLRM, https://arxiv.org/abs/1906.00091), the embedding layer contains most of the model parameters. In such cases, it is also unclear whether this kind of method could work. Of course, the authors are explicit that this paper is about v
1. Very extensive experiments are conducted and thorough analysis is provided. 2. The presentation is good and motivation is clear and strong. 3. The minor yet effective modification on SGD shows good performance improvement, with less memory requirement than Adam.
Since I do not work in optimization research, I cannot clearly identify the weaknesses of this paper. See my questions below.
* As models become larger, reducing the memory footprint needed to train or fine-tune them becomes increasingly important. One way to reduce the memory footprint is to use optimizers that keep little additional state. The paper shows that, by freezing the initial embedding layer, SGD with momentum or plain SGD can match the performance of fine-tuning the full model with AdamW.
* The ablation studies show interesting observations on the role of the optimizer difference when pre-training and
* All the downstream fine-tuning experiments are focused on classification. It would be good to see whether the observations hold for other tasks such as detection and segmentation.
* Gradual unfreezing is mentioned in Table 5, but the method is not clearly described in the paper. The appendix mentions the learning rate schedule used with gradual unfreezing, but it does not fully explain what was done.
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Brain Tumor Detection and Classification
Methods: Multi-Head Attention · Attention Is All You Need · ConvNeXt · Position-Wise Feed-Forward Layer · Label Smoothing · Linear Layer · Adam · Softmax · Absolute Position Encodings · Dropout
