Spurious Forgetting in Continual Learning of Language Models
Junhao Zheng, Xidi Cai, Shengjie Qiu, Qianli Ma

TL;DR
This paper investigates why large language models appear to forget earlier tasks during continual learning. It identifies 'spurious forgetting', where performance drops reflect a loss of task alignment rather than true knowledge loss, and proposes a freezing strategy to improve performance.
Contribution
It introduces the concept of spurious forgetting, analyzes its causes through experiments and theory, and proposes a freezing method to enhance continual learning in language models.
Findings
Early optimization disrupts task alignment
Freezing bottom layers improves continual learning
Task misalignment explains performance drops
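The second finding, freezing bottom layers, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Layer` class, `freeze_bottom_layers`, `sgd_step`, the four-layer toy model, and the choice of freezing two layers are all hypothetical names and values chosen for illustration. The sketch only shows the general mechanism, in which layers closest to the input are excluded from gradient updates while the remaining layers continue to train on the new task.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    """A toy layer: a flat list of weights plus a trainable flag (hypothetical)."""
    weights: List[float]
    trainable: bool = True


def freeze_bottom_layers(layers: List[Layer], num_frozen: int) -> None:
    """Mark the first `num_frozen` layers (closest to the input) as frozen."""
    for index, layer in enumerate(layers):
        layer.trainable = index >= num_frozen


def sgd_step(layers: List[Layer], grads: List[List[float]], lr: float = 0.1) -> None:
    """Apply one plain SGD update, skipping every frozen layer."""
    for layer, layer_grads in zip(layers, grads):
        if not layer.trainable:
            continue  # frozen layers keep their weights unchanged
        layer.weights = [w - lr * g for w, g in zip(layer.weights, layer_grads)]


# Hypothetical 4-layer model; the number of frozen layers is an assumption.
model = [Layer([1.0, 1.0]) for _ in range(4)]
freeze_bottom_layers(model, num_frozen=2)

# Dummy gradients standing in for a new-task training step.
grads = [[1.0, 1.0] for _ in model]
sgd_step(model, grads, lr=0.1)

frozen_flags = [not layer.trainable for layer in model]
```

In a PyTorch model the same effect is usually obtained by setting `requires_grad = False` on the parameters of the layers to be frozen, so that the optimizer never updates them. How many layers to freeze is a tuning choice that this sketch does not address.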
Abstract
Recent advancements in large language models (LLMs) reveal a perplexing phenomenon in continual learning: despite extensive training, models experience significant performance declines, raising questions about task alignment and underlying knowledge retention. This study first explores the concept of "spurious forgetting", proposing that such performance drops often reflect a decline in task alignment rather than true knowledge loss. Through controlled experiments with a synthesized dataset, we investigate the dynamics of model performance during the initial training phases of new tasks, discovering that early optimization steps can disrupt previously established task alignments. Our theoretical analysis connects these shifts to orthogonal updates in model weights, providing a robust framework for understanding this behavior. Ultimately, we introduce a Freezing strategy that fixes the…
Peer Reviews
Decision: ICLR 2025 Poster
1. The paper brings a novel perspective by distinguishing between knowledge forgetting and spurious forgetting, addressing a misconception in continual learning research. This distinction enhances our understanding of model behavior across sequential tasks, highlighting the significance of this work. 2. The work is generally well-articulated and of good quality, with well-structured arguments and detailed explanations, making the complex ideas accessible to a broad audience. 3. The "Freeze" ap
1. Freezing layers for fine-tuning has long been a common practice and is not a novel approach. For instance, freezing lower layers was explored in the VGG paper [1]. 2. The empirical results in this work do not provide strong evidence to support the assumptions related to orthogonality. 3. The proposed method is only compared to the SEQ method in real-world scenarios, which is a pretty weak baseline. [1] Simonyan, K., & Zisserman, A. (2014). VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCAL
1. The paper presents a novel and compelling claim regarding the phenomenon of forgetting in continual learning. 2. It adopts multiple perspectives to effectively substantiate this claim. 3. The paper provides a clear theoretical analysis based on the observation of orthogonal updates in model weights. 4. The proposed method, grounded in the theoretical analysis and observation, outperforms existing baseline methods.
The support for the assumption that the weight perturbation matrix lies in the left null space of the weight matrix in Assumption 4.4 appears to be simply based on the observations from Figure 4. However, in Figure 4, orthogonal updates occur only after the first 150 steps in the model's bottom layers. Based on performance observations, the accuracy of Task 0 has already significantly declined during these initial 150 steps. This does not support the conclusion that "observed performance decli
Proposes a novel perspective on catastrophic forgetting in LLM training, suggesting that LLMs do not lose knowledge but rather lose the ability to align with tasks. Provides some theoretical support for the parameter-freezing training method through experiments. Conducts thorough theoretical analysis. Experimental results across different methods are adequately explained, and a formalized definition of "pseudo-forgetting" is given. The proposed method of freezing lower-layer parameters shows goo
1. Lack of Originality and Insufficient Literature Review: The use of parameter freezing to mitigate forgetting in models is not a novel approach, and similar techniques have been explored previously. However, the authors do not provide an adequate discussion of prior research related to parameter freezing. This omission is significant, as a comparison with existing methods or a clear differentiation of this approach from prior studies would enhance the work's originality and position it within t
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
