Overcoming Generic Knowledge Loss with Selective Parameter Update
Wenxuan Zhang, Paul Janson, Rahaf Aljundi, Mohamed Elhoseiny

TL;DR
This paper introduces a selective parameter update method for foundation models that improves learning of new tasks while preserving original knowledge, reporting up to 7% higher new-task accuracy with a 0.9% decrease in original knowledge accuracy.
Contribution
It proposes a sparse parameter update approach that localizes learning to relevant parameters, balancing efficiency, new task performance, and knowledge retention.
Findings
Up to 7% accuracy improvement on new tasks
Negligible 0.9% decrease in original knowledge accuracy
Effective on diverse vision-language continual learning tasks
Abstract
Foundation models encompass an extensive knowledge base and offer remarkable transferability. However, this knowledge becomes outdated or insufficient over time. The challenge lies in continuously updating foundation models to accommodate novel information while retaining their original capabilities. Leveraging the fact that foundation models have initial knowledge on various tasks and domains, we propose a novel approach that, instead of updating all parameters equally, localizes the updates to a sparse set of parameters relevant to the task being learned. We strike a balance between efficiency and new task performance, while maintaining the transferability and generalizability of foundation models. We extensively evaluate our method on foundational vision-language models with a diverse spectrum of continual learning tasks. Our method achieves improvements on the accuracy of the newly…
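The idea of localizing updates to a sparse set of parameters can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring rule (top-k by gradient magnitude), the function names `top_k_mask` and `sparse_sgd_step`, and the `keep_ratio` parameter are all assumptions chosen for illustration. The page only states that a reviewer describes the selection as gradient-based.

```python
from typing import List


def top_k_mask(grads: List[float], keep_ratio: float) -> List[bool]:
    """Mark the keep_ratio fraction of parameters with the largest |gradient|.

    Hypothetical scoring rule: the paper's actual scoring function may differ.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, int(len(grads) * keep_ratio))
    # Rank parameter indices by descending gradient magnitude.
    ranked = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)
    selected = set(ranked[:k])
    return [i in selected for i in range(len(grads))]


def sparse_sgd_step(
    params: List[float], grads: List[float], mask: List[bool], lr: float
) -> List[float]:
    """Apply an SGD step only where mask is True; all other parameters stay frozen."""
    return [p - lr * g if m else p for p, g, m in zip(params, grads, mask)]


# Toy example: four parameters, update only the half with the largest gradients.
params = [0.5, -1.0, 2.0, 0.1]
grads = [0.01, -0.8, 0.05, 0.6]
mask = top_k_mask(grads, keep_ratio=0.5)
updated = sparse_sgd_step(params, grads, mask, lr=0.1)
```

In this toy run only the second and fourth parameters change. The unselected parameters are left exactly as they were, which is the mechanism by which a sparse update is expected to limit drift from the pretrained model.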
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
1) New ideas: This paper proposes updating only a small, sparse set of parameters that are most important for the new task. 2) New technologies: The authors select the parameters most relevant to the new task by designing a learnable gradient-based scoring function. 3) Better experimental results: This method outperforms other recent methods in accuracy and forgetting rate on image classification, indicating that the knowledge obtained from model pre-training is retained in cont
1) The paper lacks sufficient theoretical support and generalization ability to explain how to identify the specific layers that need parameter updates. It is not sufficient to argue that only the parameters in the MLP layer need to be updated by pointing out that the MLP layer has the ability of pattern detection. The paper's final choice to update only the first MLP layer is determined by designing a permutation experiment. Imagine that when the number of layers to be selected becomes larger, d
Limiting the parameters to be updated seems to have a strong effect on maintaining performance on pretrained tasks (as shown in Table 5). This work exploits that to present SPU as performing better than alternatives that finetune all parameters.
Lack of baselines: Given that the number of parameters being tuned has a significant impact on new task and pretrained task performance, I think there is a lack of baselines around this to justify the presented method. In particular, how do existing PEFT methods such as Adapters, LoRA, and KAdaptation perform vs. SPU? Lack of clarity: SPU is presented in a rather convoluted way. IIUC, SPU is a rather simple method, and that should be a plus if presented as such. Unfortunately, in my opinion, that is not ho
1. Clear paper writing. 2. The investigated task is interesting. 3. The direction of the idea seems to be plausible.
This paper's main flaw is insufficient comparison and elaborations regarding closely related methods. Please refer to my questions below for more details.
1. The presentation of this paper is clear and easy to follow. 2. Preserving generic knowledge in foundation models during continual learning and providing quantitative assessment for generic knowledge loss is meaningful. 3. The results based on the CLIP model are good.
1. The quantitative evaluation of generic knowledge loss can only be applied to vision-language models. Specifically, the control set accuracy (C.) is measured as the zero-shot performance difference between the tuned and frozen pre-trained models, which cannot be obtained for pre-trained unimodal models. 2. The reasons for **only** updating the (first) MLP layer in each transformer block are not sufficient. This paper ignores the discussion of other components, such as the attention block
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · COVID-19 diagnosis using AI
Methods: Balanced Selection
