Merge before Forget: A Single LoRA Continual Learning via Continual Merging
Fuli Qiao, Mehrdad Mahdavi

TL;DR
This paper introduces a novel continual learning method for large language models that merges LoRA updates into a single model, reducing memory usage and task interference while improving performance.
Contribution
The paper proposes a continual learning approach that orthogonally initializes each new LoRA and sequentially merges the updates into a single LoRA, keeping memory constant across tasks and reducing task interference.
Findings
Maintains constant memory complexity across tasks
Reduces interference between old and new tasks
Outperforms existing LoRA merging methods in experiments
Abstract
Parameter-efficient continual learning has emerged as a promising approach for large language models (LLMs) to mitigate catastrophic forgetting while enabling adaptation to new tasks. Current Low-Rank Adaptation (LoRA) continual learning techniques often retain and freeze previously learned LoRAs or generate data representations to overcome forgetting, typically using these to help new LoRAs learn new tasks. However, these methods not only ignore the growth of computational memory with the number of tasks and the limits of storage space but also suffer from potential task interference due to the lack of effective LoRA merging mechanisms. In this paper, we propose a novel continual learning method that orthogonally initializes and sequentially merges LoRA updates into a single unified LoRA. Our method leverages orthogonal basis extraction from previously learned LoRA to initialize the learning of new…
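The two steps named in the abstract, orthogonal basis extraction from the previous LoRA and merging into a single LoRA, can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes the usual LoRA factorization ΔW = B·A, uses QR decomposition for the basis extraction (as one of the reviews mentions), and merges only B (as another review describes). The function names, the shapes, and the `weight` merge coefficient are hypothetical; the text here does not specify the actual merge rule.

```python
import numpy as np


def orthogonal_init_from_previous(A_prev: np.ndarray) -> np.ndarray:
    """Return an (r x d) matrix whose rows are an orthonormal basis
    of the row space of the previous task's A matrix."""
    # Reduced QR of the transpose: A_prev.T (d x r) = Q (d x r) @ R (r x r).
    Q, _ = np.linalg.qr(A_prev.T)
    return Q.T  # rows are orthonormal


def merge_B(B_merged: np.ndarray, B_new: np.ndarray, weight: float) -> np.ndarray:
    """Hypothetical merge rule: convex combination of the running B and
    the newly trained B. The real coefficient schedule is an assumption."""
    return (1.0 - weight) * B_merged + weight * B_new


rng = np.random.default_rng(0)
d_in, d_out, r = 16, 8, 4

# State carried over from earlier tasks: one A and one B, nothing else.
A_prev = rng.standard_normal((r, d_in))
B_merged = rng.standard_normal((d_out, r))

# New task: start A from the orthonormal basis of the previous A.
A_init = orthogonal_init_from_previous(A_prev)

# Stand-in for training on the new task (random perturbation, not real training).
B_new = B_merged + 0.1 * rng.standard_normal((d_out, r))

# Fold the new B into the single stored B; memory stays constant in the task count.
B_merged = merge_B(B_merged, B_new, weight=0.5)

delta_W = B_merged @ A_init  # effective low-rank update, shape (d_out, d_in)
```

Because the rows of `A_init` are orthonormal, `A_init @ A_init.T` is the identity up to floating-point error, and only one `(A, B)` pair is stored no matter how many tasks have been seen.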
Peer Reviews
Decision·ICLR 2026 Poster
+ The motivation of the intrinsic asymmetry property of LoRA is clearly presented in Figure 1.
+ The paper provides theoretical analyses grounded in NTK theory.
+ The writing is generally clear and well-organized.
+ Unclear mechanism for avoiding forgetting. 1) While I can understand how InfLoRA prevents forgetting by projecting updates into the null space of old task features or directly using old task samples, I find it difficult to see how the proposed orthogonal initialization in this paper achieves the same effect. For the merging step, many prior works in multi-task learning compute task vectors as ΔW = B·A. In your formulation, however, A is replaced, and only B is merged. It remains unclear why mer
1. SLAO is the first to enable CL with a single shared LoRA via sequential merging.
2. SLAO is robust to hyperparameters (e.g., LoRA rank, learning rate) and model scales. In particular, performance improves with larger models.
1. While SLAO’s training overhead is low, QR decomposition for orthogonal basis extraction adds a one-time cost per task. The paper does not quantify this cost for long sequences (e.g., 50+ tasks) or analyze whether approximate orthogonal methods (e.g., randomized SVD) could reduce the cost without performance loss.
+ Provides a formal analysis of forgetting and intransigence in the NTK regime and motivates the design via LoRA asymmetry.
+ The concept of continual merging into a single LoRA is novel and addresses key limitations of existing LoRA-based CL methods.
+ The method is designed for LLMs but is evaluated on models that are not state of the art and on tasks that are easy for current LLMs. Models such as the Qwen2.5/3 series and tasks such as AIME, LiveCodeBench, or others at the same difficulty level are needed. At least, the reviewer thinks the tasks should be more diverse.
+ Could the orthogonal initialization strategy be combined with other PEFT methods?
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
