Adaptive Budget Allocation for Orthogonal-Subspace Adapter Tuning in LLMs Continual Learning

Zhiyi Wan; Wanrou Du; Liang Li; Miao Pan; Xiaoqi Qin

arXiv:2505.22358·cs.LG·October 17, 2025

TL;DR

This paper introduces OA-Adapter, a novel method for continual learning in large language models that dynamically allocates parameter budgets and applies orthogonal constraints to mitigate forgetting and improve efficiency.

Contribution

OA-Adapter unifies dynamic budget adaptation with orthogonal subspace learning in an end-to-end training process for LLMs in continual learning.

Findings

01 Outperforms state-of-the-art methods in accuracy.

02 Uses 58.5% fewer parameters on standard benchmarks.

03 Maintains advantages on larger, multi-task benchmarks.

Abstract

Large language models (LLMs) often suffer from catastrophic forgetting in continual learning (CL) scenarios, where performance on previously learned tasks degrades severely while training on sequentially arriving tasks. Although pioneering CL approaches using orthogonal subspaces can mitigate task interference, they typically employ fixed budget allocation, neglecting the varying complexity across tasks and layers. Besides, recent budget-adaptive tuning methods for LLMs often adopt multi-stage paradigms that decouple optimization and budget allocation. Such decoupling results in potential misalignment, which hinders those approaches' practical application in CL scenarios. To address these limitations, we propose OA-Adapter, a novel parameter-efficient approach for continual learning in LLMs that unifies dynamic budget adaptation with orthogonal subspace learning in an end-to-end…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 4

Strengths

1. The learnable threshold (τ) allows bidirectional dimension adjustment (activation/deactivation), adapting to task complexity and layer needs. 2. Orthogonal constraints, applied to dynamically allocated subspaces, effectively reduce cross-task interference.

Weaknesses

1. The learnable threshold (τ) is critical for budget adaptation, but the paper provides no analysis of how its initial value or update dynamics affect performance. 2. While OA-Adapter is parameter-efficient, the orthogonality regularization introduces computational overhead that grows linearly with task count. 3. The paper evaluates only classification tasks (sentiment, topic, NLI, QA). It does not test complex tasks like open-ended generation (summarization, dialogue) or reasoning (math, logic), wh

Reviewer 02·Rating 4·Confidence 5

Strengths

1. OA-Adapter addresses dynamic budget allocation in continual learning instead of constraining all layers to the same fixed rank. 2. The paper is well written.

Weaknesses

1. The paper does not discuss the motivation for applying the parameter-efficient method after both the feed-forward and multi-head attention blocks, whereas other LoRA methods are applied only in multi-head attention. 2. The paper lacks a strong theoretical analysis of how orthogonal parameter subspace constraints affect the final loss. For example, the paper can analyze the training loss reduction or forgetting error in c

Reviewer 03·Rating 4·Confidence 4

Strengths

This is the first work that unifies learnable budget allocation and orthogonal-subspace updates inside a single training stage for LLM continual learning. The soft-threshold mask with a learnable τ is technically simple yet novel, enabling bidirectional (expand/shrink) capacity adjustment that prior multi-stage methods cannot perform.
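The bidirectional adjustment this reviewer describes can be illustrated with a generic soft-threshold operator. The sketch below is not taken from the paper: the names `soft_threshold`, `active_rank`, and `gates`, and the numeric values, are hypothetical, and τ is passed as a plain number where the paper would treat it as learnable. It only shows why moving τ up or down changes how many adapter dimensions stay active.

```python
def soft_threshold(values, tau):
    """Shrink each value toward zero by tau; values with |v| <= tau become zero."""
    out = []
    for v in values:
        magnitude = max(abs(v) - tau, 0.0)
        out.append(magnitude if v >= 0 else -magnitude)
    return out


def active_rank(values, tau):
    """Count the dimensions that survive the soft threshold (the effective budget)."""
    return sum(1 for m in soft_threshold(values, tau) if m != 0.0)


# Hypothetical per-dimension gate values for one adapter layer (an assumption).
gates = [0.9, -0.4, 0.15, 0.05]

print(active_rank(gates, tau=0.2))  # larger tau: fewer active dimensions (shrink)
print(active_rank(gates, tau=0.1))  # smaller tau: a masked dimension returns (expand)
```

With these illustrative values, raising τ from 0.1 to 0.2 deactivates the third dimension, and lowering it again reactivates that dimension, which is the expand/shrink behavior noted above.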

Weaknesses

1. Orthogonality is enforced via a simple cosine penalty between current and previous subspace bases; there is no discussion of how this relates to classical projection-based CL guarantees or how the soft mask interacts with the orthogonality constraint from an optimization-theory viewpoint. 2. All experiments stop at 15 tasks. Because the sum of previously frozen orthogonal subspaces implicitly reduces the available rank, it is unclear whether the method will collapse when T≫50. A curve showing accurac
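The cosine penalty mentioned in the first weakness can be sketched as follows. The exact regularizer used in the paper is not given here, so the squared-cosine form, the function names `cosine` and `orthogonality_penalty`, and the toy vectors are all assumptions made for illustration. The outer loop over previous tasks also shows why Reviewer 01 expects the cost to grow linearly with task count.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def orthogonality_penalty(current_basis, previous_bases):
    """Sum of squared cosines between current and stored basis vectors.

    Assumed form, not the paper's definition. One pass is made per previous
    task, so the work grows linearly with the number of stored bases.
    """
    penalty = 0.0
    for prev_basis in previous_bases:
        for u in current_basis:
            for v in prev_basis:
                penalty += cosine(u, v) ** 2
    return penalty


# Toy example: one previous task whose subspace is spanned by a single vector.
previous = [[[1.0, 0.0, 0.0]]]
print(orthogonality_penalty([[0.0, 1.0, 0.0]], previous))  # orthogonal direction
print(orthogonality_penalty([[1.0, 1.0, 0.0]], previous))  # overlapping direction
```

The penalty is zero when the new direction is orthogonal to every stored one and increases as the directions overlap, so minimizing it pushes the current task's subspace away from earlier ones.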

Taxonomy

Methods: ADaptive gradient method with the OPTimal convergence rate