Merge before Forget: A Single LoRA Continual Learning via Continual Merging

Fuli Qiao; Mehrdad Mahdavi

arXiv:2512.23017·cs.LG·December 30, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a novel continual learning method for large language models that merges LoRA updates into a single model, reducing memory usage and task interference while improving performance.

Contribution

The paper proposes a new continual learning approach that orthogonally merges LoRA updates into one, maintaining constant memory and reducing task interference.

Findings

01

Maintains constant memory complexity across tasks

02

Reduces interference between old and new tasks

03

Outperforms existing LoRA merging methods in experiments

Abstract

Parameter-efficient continual learning has emerged as a promising approach for large language models (LLMs) to mitigate catastrophic forgetting while enabling adaptation to new tasks. Current Low-Rank Adaptation (LoRA) continual learning techniques often retain and freeze previously learned LoRAs or generate data representations to overcome forgetting, typically using these to help new LoRAs learn new tasks. However, these methods not only ignore the memory cost that grows with the number of tasks and the limits on storage space, but also suffer from potential task interference due to the lack of effective LoRA merging mechanisms. In this paper, we propose a novel continual learning method that orthogonally initializes and sequentially merges LoRA updates into a single unified LoRA. Our method leverages orthogonal basis extraction from previously learned LoRA to initialize the learning of new…
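
The two steps named in the abstract, orthogonal initialization from the previous LoRA and sequential merging into a single LoRA, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names `orthogonal_init` and `merge_B`, the matrix sizes, the running-average merge weight, and the random perturbation standing in for training are all assumptions made for the example.

```python
import numpy as np

def orthogonal_init(prev_A):
    """Return a matrix with orthonormal rows spanning the row space of prev_A (rank x d_in)."""
    # Reduced QR of the transpose: prev_A.T = Q @ R, where Q is (d_in x rank)
    # with orthonormal columns.
    Q, _ = np.linalg.qr(prev_A.T)
    return Q.T

def merge_B(merged_B, new_B, task_index):
    """Hypothetical merge rule: running average of B over tasks (an assumption)."""
    weight = 1.0 / task_index
    return merged_B + weight * (new_B - merged_B)

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 12, 4          # assumed sizes for illustration
A = rng.normal(size=(rank, d_in))      # single shared LoRA: A is (rank x d_in)
B = np.zeros((d_out, rank))            # single shared LoRA: B is (d_out x rank)

for task_index in range(1, 4):
    # Start the new task from an orthonormal basis of the previous A.
    A_init = orthogonal_init(A)
    # Stand-in for fine-tuning on the new task (random perturbation, not real training).
    A_task = A_init + 0.01 * rng.normal(size=A_init.shape)
    B_task = B + 0.01 * rng.normal(size=B.shape)
    # Keep one LoRA only: A is replaced, B is merged.
    A = A_task
    B = merge_B(B, B_task, task_index)
```

Only one `(A, B)` pair is stored at any time, so the memory footprint in this sketch does not grow with the number of tasks.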

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

+ The motivation of the intrinsic asymmetry property of LoRA is clearly presented in Figure 1.
+ The paper provides theoretical analyses grounded in NTK theory.
+ The writing is generally clear and well-organized.

Weaknesses

+ Unclear mechanism for avoiding forgetting. 1) While I can understand how InfLoRA prevents forgetting by projecting updates into the null space of old task features or by directly using old task samples, I find it difficult to see how the orthogonal initialization proposed in this paper achieves the same effect. For the merging step, many prior works in multi-task learning compute task vectors as ΔW = B·A. In your formulation, however, A is replaced, and only B is merged. It remains unclear why mer
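
The contrast the reviewer draws between merging full task vectors ΔW = B·A and merging only B can be illustrated numerically. The sketch below is an assumption-laden toy, not the paper's procedure: the two LoRA pairs are random, both merges use equal weights, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 8, 2            # assumed sizes for illustration

# Two hypothetical task-specific LoRA pairs (B_i: d_out x rank, A_i: rank x d_in).
B1, A1 = rng.normal(size=(d_out, rank)), rng.normal(size=(rank, d_in))
B2, A2 = rng.normal(size=(d_out, rank)), rng.normal(size=(rank, d_in))

# Task-vector merging: average the full updates ΔW_i = B_i @ A_i.
delta_task_vector = 0.5 * (B1 @ A1 + B2 @ A2)

# B-only merging: A is replaced by the latest A, and only B is averaged
# (equal weights are an assumption of this sketch).
delta_b_only = (0.5 * (B1 + B2)) @ A2

rank_task_vector = np.linalg.matrix_rank(delta_task_vector)
rank_b_only = np.linalg.matrix_rank(delta_b_only)
gap = np.linalg.norm(delta_task_vector - delta_b_only)
```

The two merged updates differ in general (`gap` is nonzero). The B-only update always factors as a single rank-`rank` LoRA, whereas the averaged task vector can have rank up to `2 * rank` and so may not fit in one LoRA of the same size.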

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. SLAO is the first to enable CL with a single shared LoRA via sequential merging.
2. SLAO is robust to hyperparameters (e.g., LoRA rank, learning rate) and model scales. In particular, performance improves with larger models.

Weaknesses

1. While SLAO’s training overhead is low, QR decomposition for orthogonal basis extraction adds a one-time cost per task. The paper does not quantify this cost for long sequences (e.g., 50+ tasks) or analyze whether approximate orthogonal methods (e.g., randomized SVD) could reduce the cost without performance loss.
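
The per-task QR cost the reviewer asks about can be estimated with a small experiment. The sketch below assumes a LoRA matrix A of shape (rank × d_in) with rank 8 and d_in 4096; these sizes, the helper names `qr_flops` and `time_qr`, and the repeat count are illustrative choices, not figures from the paper. The flop estimate is the standard count for Householder QR of an m × n matrix with m ≥ n; forming Q explicitly adds a comparable amount.

```python
import time
import numpy as np

def qr_flops(m, n):
    """Approximate flops for Householder QR of an m x n matrix with m >= n."""
    return 2 * m * n**2 - (2 / 3) * n**3

def time_qr(d_in, rank, repeats=20, seed=0):
    """Time a reduced QR of A.T for a random A of shape (rank x d_in)."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(rank, d_in))
    start = time.perf_counter()
    for _ in range(repeats):
        Q, _ = np.linalg.qr(A.T)       # Q is (d_in x rank) with orthonormal columns
    elapsed = (time.perf_counter() - start) / repeats
    return Q, elapsed

# Assumed sizes: hidden dimension 4096, LoRA rank 8.
Q, seconds = time_qr(d_in=4096, rank=8)
print(f"approx. flops per QR: {qr_flops(4096, 8):.0f}, measured: {seconds:.6f} s")
```

Because the cost scales as roughly d_in · rank² per adapted matrix and is paid once per task, the total over a task sequence grows linearly with the number of tasks and adapted layers; the measured time depends on hardware and the linear algebra backend.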

Reviewer 03 · Rating 4 · Confidence 3

Strengths

+ Provides a formal analysis of forgetting and intransigence in the NTK regime and motivates the design via LoRA asymmetry.
+ The concept of continual merging into a single LoRA is novel and addresses key limitations of existing LoRA-based CL methods.

Weaknesses

+ The method is designed for LLMs but is evaluated on models that are not state-of-the-art and on tasks that are easy for current LLMs. Models such as the Qwen2.5/3 series and tasks such as AIME, LiveCodeBench, or others at the same difficulty level are needed. At the least, the reviewer thinks the tasks should be more diverse.
+ Could the orthogonal initialization strategy be combined with other PEFT methods?

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis