Less is More: Selective Layer Finetuning with SubTuning
Gal Kaplun, Andrey Gurevich, Tal Swisa, Mazor David, Shai Shalev-Shwartz, Eran Malach

TL;DR
This paper introduces SubTuning, a selective layer finetuning method that trains only a subset of layers in a pretrained model, achieving comparable or better performance than full finetuning, especially with limited data.
Contribution
The paper proposes SubTuning, a novel finetuning approach that reduces computational costs and enhances multi-task learning by selectively training layers of pretrained models.
Findings
SubTuning matches full finetuning accuracy on various tasks.
SubTuning outperforms full finetuning with scarce training data.
SubTuning enables efficient multi-task learning with shared resources.
Abstract
Finetuning a pretrained model has become a standard approach for training neural networks on novel tasks, resulting in fast convergence and improved performance. In this work, we study an alternative finetuning method, where instead of finetuning all the weights of the network, we only train a carefully chosen subset of layers, keeping the rest of the weights frozen at their initial (pretrained) values. We demonstrate that subset finetuning (or SubTuning) often achieves accuracy comparable to full finetuning of the model, and even surpasses the performance of full finetuning when training data is scarce. Therefore, SubTuning allows deploying new tasks at minimal computational cost, while enjoying the benefits of finetuning the entire model. This yields a simple and effective method for multi-task learning, where different tasks do not interfere with one another, and yet share…
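The abstract does not spell out how the subset of layers is chosen; one of the reviews below mentions greedily choosing blocks of layers. The following is a minimal sketch of such a greedy selection, not the authors' implementation. The function `greedy_layer_selection`, its parameters, and the `evaluate` callback are hypothetical names introduced here for illustration: `evaluate` is assumed to finetune only the given layers (all others frozen) and return a validation score. The toy scores at the end are made up purely to make the example runnable.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def greedy_layer_selection(
    layers: Sequence[str],
    evaluate: Callable[[FrozenSet[str]], float],
    max_layers: int,
    min_gain: float = 0.0,
) -> Tuple[List[str], float]:
    """Greedily grow the set of layers to finetune.

    `evaluate` is a hypothetical callback: given the set of trainable layers,
    it is assumed to return a validation score (higher is better).
    """
    chosen: List[str] = []
    best_score = evaluate(frozenset())  # score with every layer frozen
    while len(chosen) < max_layers:
        candidates = [name for name in layers if name not in chosen]
        if not candidates:
            break
        # Score each candidate when added to the layers chosen so far.
        scored = [(evaluate(frozenset(chosen + [c])), c) for c in candidates]
        top_score, top_layer = max(scored, key=lambda pair: pair[0])
        # Stop once the best remaining layer no longer improves enough.
        if top_score - best_score <= min_gain:
            break
        chosen.append(top_layer)
        best_score = top_score
    return chosen, best_score


# Toy stand-in for real finetuning runs (made-up, additive gains per block).
TOY_GAIN = {"block1": 0.01, "block2": 0.05, "block3": 0.03}


def toy_evaluate(subset: FrozenSet[str]) -> float:
    return 0.80 + sum(TOY_GAIN[name] for name in subset)


selected, score = greedy_layer_selection(list(TOY_GAIN), toy_evaluate, max_layers=2)
print(selected, round(score, 2))
```

In a real setting each call to `evaluate` is a full finetuning run, so the cost grows with the number of candidate layers times the number of selected layers; the `min_gain` threshold is one assumed way to stop early.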
Peer Reviews
Decision: Submitted to ICLR 2024
1. The proposed method can be applied in conjunction with other parameter-efficient methods such as LoRA and Head2Toe. 2. The method is simple and can be broadly applied to any architecture / domain. 3. Some observations in Section 2 are quite interesting, for example how the finetuning profiles change for different downstream tasks and for different pre-training methods (even for the same architecture) is quite intriguing to me. This indicates that the kind of features learned by different pre-
1. I have some concerns about the contributions of this paper -- the key idea that we do not need to finetune all layers and that finetuning only a subset of the parameters can outperform full-feature finetuning was already observed in the "Surgical finetuning" paper of Lee et al. To me, the only new thing shown in this paper (beyond what was already shown in Lee et al), is the fact that we can greedily choose more than just one block of layers to finetune. In Related Work, the authors mention "
1. The overall problem of fine-tuning only the necessary layers to save compute is interesting. 2. There are some valuable empirical observations made by the paper, such as the fine-tuning profiles and the layer importance of different network architectures. 3. The presentation is clear, and the idea is easy to follow. The visualization also helps understanding the method's performance.
1. **Lack of novelty:** Performing layer selection during fine-tuning has been explored by previous work [1, 2, 3]. These works use more advanced selection techniques such as policy networks or genetic algorithms. The proposed way of measuring layer importance simply through fine-tuning accuracy on a specific dataset is also not generalizable, i.e., the observed trend only holds for a specific model and a specific dataset, and it is hard to deduce any more general and useful rules from that. 2. **R
1. The method is simple and easy to understand. The performance gain is good. 2. The concept of “finetuning profiles”, which demonstrates the contribution of different layers to the final results, is interesting. (But I believe only showing them is not enough. Explaining why the profiles behave like this under specific scenarios can make the paper stronger.) 3. Abundant experimental results from different perspectives.
1. The theoretical guarantees and explanations are not sufficient. Although Theorem 1 provides a generalization bound that depends on the number of tuned parameters, it is hard to link this theory to the algorithm studied in this paper. Maybe the analysis provided in [1, 2] (the analysis on an overparameterized model) would be helpful. 2. The results in Table 1 are based on VTAB-1k, which means the samples used for transfer learning are only 1k. Will the proposed method still work when fine-tuning
Taxonomy
Topics: Advanced Neural Network Applications · Neural Networks and Applications · Domain Adaptation and Few-Shot Learning
