Scalable Language Model with Generalized Continual Learning
Bohao Peng, Zhuotao Tian, Shu Liu, Mingchang Yang, Jiaya Jia

TL;DR
This paper introduces the Scalable Language Model (SLM) for generalized continual learning. It removes earlier methods' reliance on experience replay, optimization constraints, and inference-time task IDs by combining adaptive re-parameterization with task retrieval, and reports state-of-the-art results across diverse benchmarks and tasks.
Contribution
The study presents Joint Adaptive Re-Parameterization (JARe) and Dynamic Task-related Knowledge Retrieval (DTKR), which together enable effective continual learning across multiple tasks and domains in large language models.
Findings
State-of-the-art performance on diverse benchmarks
Effective continual learning with minimal forgetting
Successful scaling across various domains and task types
Abstract
Continual learning has gained increasing importance as it facilitates the acquisition and refinement of scalable knowledge and skills in language models. However, existing methods typically encounter strict limitations and challenges in real-world scenarios, such as reliance on experience replay, optimization constraints, and inference task-ID. In this study, we introduce the Scalable Language Model (SLM) to overcome these limitations within a more challenging and generalized setting, representing a significant advancement toward practical applications for continual learning. Specifically, we propose the Joint Adaptive Re-Parameterization (JARe), integrated with Dynamic Task-related Knowledge Retrieval (DTKR), to enable adaptive adjustment of language models based on specific downstream tasks. This approach leverages the task distribution within the vector space, aiming to achieve a…
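To make the retrieve-then-adapt idea concrete, here is a minimal sketch, not the authors' implementation. It assumes each stored entry is a key vector paired with a flattened weight increment, that similarity is cosine similarity, and that the top-k retrieved increments are blended with softmax weights; the function names `retrieve_increment` and `reparameterize` and the blending rule are illustrative assumptions rather than details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_increment(query, keys, increments, top_k=2):
    """Blend the top_k stored weight increments whose keys best match the query.

    Assumption: softmax over cosine similarities is used as the blending rule.
    """
    scored = sorted(
        ((cosine(query, key), idx) for idx, key in enumerate(keys)),
        reverse=True,
    )[:top_k]
    best = max(score for score, _ in scored)
    exps = [math.exp(score - best) for score, _ in scored]
    total = sum(exps)
    blended = [0.0] * len(increments[0])
    for weight, (_, idx) in zip(exps, scored):
        for j, value in enumerate(increments[idx]):
            blended[j] += (weight / total) * value
    return blended


def reparameterize(base_weights, increment):
    """Return adapted weights: frozen base weights plus the retrieved increment."""
    return [w + d for w, d in zip(base_weights, increment)]


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]            # hypothetical task keys
    increments = [[1.0, 1.0], [-1.0, -1.0]]    # hypothetical flattened increments
    delta = retrieve_increment([1.0, 0.0], keys, increments, top_k=1)
    print(reparameterize([0.5, 0.5], delta))
```

In this sketch the base weights are never overwritten; only the retrieved increment changes per input, which is one way to adapt to a task without needing a task ID at inference.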
Peer Reviews
Decision: ICLR 2024 poster
1. The proposed method exhibits good results on continual learning benchmarks, as evidenced by the experiments.
2. The simplicity of the method, along with the lack of additional overhead, stands out.
1. The method's description appears somewhat convoluted, making it challenging to grasp the benefits fully. Relevant questions highlighting each part are in the Questions section.
2. A basic baseline employing an average feature extractor across examples of a task as a key and training a singular low-rank weight for each task as value is absent. In this method, each task would have just one key and one low-rank weight, and retrieval for examples from a new task or an existing task could be execu
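The baseline this reviewer asks for can be sketched as follows. This is only an illustration of the reviewer's description, assuming features are plain vectors, the key is their per-task mean, and retrieval picks the task with the nearest key by squared Euclidean distance; the class name `TaskMemory`, the distance choice, and the task names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


class TaskMemory:
    """One key (mean feature) and one low-rank weight per task."""

    def __init__(self):
        self.keys = {}
        self.values = {}

    def add_task(self, name, example_features, low_rank_weight):
        # The key is the average feature over the task's examples.
        self.keys[name] = mean_vector(example_features)
        # The value is treated as opaque here, e.g. a pair of low-rank factors.
        self.values[name] = low_rank_weight

    def retrieve(self, feature):
        """Return (task name, low-rank weight) for the nearest task key."""
        def squared_distance(key):
            return sum((a - b) ** 2 for a, b in zip(feature, key))

        best = min(self.keys, key=lambda name: squared_distance(self.keys[name]))
        return best, self.values[best]


if __name__ == "__main__":
    memory = TaskMemory()
    memory.add_task("sentiment", [[1.0, 0.0], [3.0, 0.0]], "delta_sentiment")
    memory.add_task("qa", [[0.0, 2.0], [0.0, 4.0]], "delta_qa")
    print(memory.retrieve([1.5, 0.2]))
```

Because each task contributes exactly one key and one value, storage grows linearly with the number of tasks, which is what makes this a natural point of comparison.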
Originality
- The authors have proposed a novel continual learning (CL) method, the Scalable Language Model (SLM), which eliminates the use of "regularization constraint" and "data replay" by retrieving relevant past knowledge in vector space, thus making it scalable across a variety of downstream tasks.
- Although SLM seems to align with the CL methods that incorporate additional trainable parameters for each encountered task in the sequence, SLM does not append additional
Quality
- The number of data points in a task should also be considered when using the weight increments from that task during the retrieval step. Otherwise, the weight increments of a task with 100 data points will have the same importance as those of a task with 10000 data points. The weighted averaging strategy of the FedAvg algorithm in federated learning handles this by weighting each contribution by its data count.
- The computational complexity of the proposed model seems high as the training is done for one
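The count-based weighting raised in the first Quality point can be illustrated with a FedAvg-style weighted average, where each contribution is scaled by its share of the total number of examples. This is a sketch of the reviewer's suggestion, not something the paper is stated to do; the function name `count_weighted_average` and the flattened-vector representation of the increments are assumptions.

```python
def count_weighted_average(increments, num_examples):
    """Average weight increments, weighting each by n_k / sum(n), as in FedAvg."""
    if len(increments) != len(num_examples):
        raise ValueError("need one example count per increment")
    total = sum(num_examples)
    if total <= 0:
        raise ValueError("total number of examples must be positive")
    merged = [0.0] * len(increments[0])
    for increment, count in zip(increments, num_examples):
        share = count / total
        for j, value in enumerate(increment):
            merged[j] += share * value
    return merged


if __name__ == "__main__":
    # Hypothetical increments from a 100-example task and a 10000-example task.
    increments = [[1.0, 0.0], [0.0, 1.0]]
    print(count_weighted_average(increments, [100, 10000]))
```

With these counts the larger task receives a share of 10000/10100 and the smaller one 100/10100, so the two no longer contribute equally.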
**Strengths**
(1) The overall presentation is very good, and the writing is very clear.
(2) The method is sound and shows effective performance.
(3) This direction resembles the recent interest in model soups [1], and also some prior works in meta-learning that linearly interpolate the learned modulations to adapt the current model [2]. I think it would be great to discuss the connection.

Reference
[1] Wortsman, Model soups: averaging weights of multiple fine-tuned models imp
**Weaknesses**
(1) I disagree with the claim regarding the weakness of replay-based methods in the introduction.
- While I do agree that replay-based methods require additional storage to save the previous datasets, this paper also requires saving the parameters of the previous task's low-rank parameters and saving the keys.
- Since text dataset saving does not require much storage, **I believe comparing the storage is needed** for a fair comparison.
(2) Comparing with the upper bound perfor
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
