GRID: Scalable Task-Agnostic Prompt-Based Continual Learning for Language Models
Anushka Tiwari, Sayantan Pal, Rohini K. Srihari, Kaiyi Ji

TL;DR
GRID introduces a scalable, task-agnostic prompt-based continual learning framework for language models that enhances performance, reduces memory usage, and effectively handles long task sequences.
Contribution
The paper proposes GRID, a novel framework that improves backward transfer, enables automatic task identification, and compresses prompts for scalable continual learning in language models.
Findings
Improves average accuracy and backward transfer on benchmarks.
Reduces prompt memory usage significantly.
Achieves competitive forward transfer performance.
Abstract
Prompt-based continual learning (CL) provides a parameter-efficient approach for adapting large language models (LLMs) across task sequences. However, most existing methods rely on task-aware inference and maintain a growing set of task-specific prompts, which introduces two major challenges: (1) severe performance degradation on earlier tasks under task-agnostic inference, and (2) limited scalability due to prompt memory accumulation as task sequences grow. In this paper, we present GRID, a unified framework designed to address these challenges. GRID incorporates a decoding mechanism that enhances backward transfer by leveraging representative inputs, automatic task identification, and constrained decoding. Furthermore, it employs a gradient-guided prompt selection strategy to compress less informative prompts into a single aggregated representation, ensuring scalable and…
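The gradient-guided compression step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each prompt is a flattened vector with one precomputed gradient-norm score, that the lowest-scoring prompts are the "less informative" ones, and that they are merged by an average weighted by those scores. The function name `merge_low_gradient_prompts` and the `keep` parameter are hypothetical.

```python
def merge_low_gradient_prompts(prompts, grad_norms, keep):
    """Keep the `keep` highest-scoring prompts; merge the rest into one vector.

    prompts:    list of equal-length lists of floats (flattened prompt vectors)
    grad_norms: one non-negative score per prompt (assumed informativeness proxy)
    keep:       number of prompts retained unchanged (hypothetical parameter)
    """
    if len(prompts) != len(grad_norms):
        raise ValueError("need exactly one gradient norm per prompt")
    if keep < 0:
        raise ValueError("keep must be non-negative")

    # Rank prompt indices from largest to smallest gradient norm.
    order = sorted(range(len(prompts)), key=lambda i: grad_norms[i], reverse=True)
    kept_idx = sorted(order[:keep])   # preserve the original task order
    merged_idx = order[keep:]         # low-norm prompts to be compressed

    kept = [prompts[i] for i in kept_idx]
    if not merged_idx:
        return kept

    # Assumption: merge weights are the normalised gradient norms;
    # fall back to a uniform average if every norm is zero.
    total = sum(grad_norms[i] for i in merged_idx)
    if total > 0:
        weights = [grad_norms[i] / total for i in merged_idx]
    else:
        weights = [1.0 / len(merged_idx)] * len(merged_idx)

    dim = len(prompts[0])
    merged = [
        sum(w * prompts[i][d] for w, i in zip(weights, merged_idx))
        for d in range(dim)
    ]
    return kept + [merged]


# Four toy prompts: the two with the smallest norms collapse into one vector.
pool = merge_low_gradient_prompts(
    [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]],
    [1.0, 3.0, 50.0, 20.0],
    keep=2,
)
print(len(pool))  # 3 prompts instead of 4
```

With this scheme the pool size is bounded by `keep + 1` regardless of how many tasks arrive, which is the scalability property the abstract emphasises.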
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The problem is interesting: there is a lot of literature on CL in the visual domain but limited work in the NLP domain, so this work may be beneficial for the NLP community. 2. The proposed architecture seems novel, and the gradient-norm-based prompt pool selection looks interesting. Gradient-guided prompt selection and merging prevent unbounded prompt growth while maintaining high performance with over 60% less memory than prior prompt-based continual learners. 3. Paper does not requ
1. In the vision domain, the CIL setting is very common for prompt-based continual learning. Why do the authors highlight it as a key challenge here? At a high level, once tokenization is done, language and vision models are similar, so why can the same concepts [a,b,c] not be applied here? 2. The task identification method in Section 3.1 appears to violate the CL setting: the paper uses a pretrained model (say, Phi) for task ID prediction, which is much more powerful and mostly knows all the sequ
The problem is very clear: task-agnostic inference with a bounded prompt pool. Representative-input sampling, task ID with constrained decoding, and gradient-guided prompt selection/merging work in synergy. The method cuts prompt memory by $\approx \frac{2}{3}$ with better BWT/FTC than ProgPrompt/SHLPT, and the paper includes ablations and runtime/memory reporting.
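The constrained decoding component mentioned in this review can be illustrated with a minimal sketch. It assumes the model has already produced a score (for example, a log-probability) for each candidate output string and that a task identifier has been predicted. The `TASK_LABELS` table, the task names, and the function `constrained_decode` are hypothetical and are not taken from the paper.

```python
# Hypothetical label sets per task; the real mapping is not given in the source.
TASK_LABELS = {
    "sentiment": {"positive", "negative"},
    "nli": {"entailment", "neutral", "contradiction"},
}


def constrained_decode(label_scores, task_id, task_labels=TASK_LABELS):
    """Return the highest-scoring output that is valid for the identified task.

    label_scores: dict mapping candidate output strings to model scores
    task_id:      predicted task identifier (a key of `task_labels`)
    """
    allowed = task_labels[task_id]
    # Discard any candidate that is not a legal label for this task.
    candidates = {lab: s for lab, s in label_scores.items() if lab in allowed}
    if not candidates:
        raise ValueError("no candidate output is valid for task %r" % task_id)
    return max(candidates, key=candidates.get)


# The unconstrained argmax would be a label from another task ("entailment");
# restricting to the sentiment label set avoids that kind of label drift.
scores = {"positive": -1.2, "entailment": -0.4, "negative": -2.0}
print(constrained_decode(scores, "sentiment"))  # positive
```

The sketch filters whole label strings after scoring. A token-level variant would instead mask disallowed tokens at each generation step.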
Useful engineering, but the conceptual novelty is modest and the evaluation might be too simple. Each component is well-known: representative sampling/clustering, label remapping + constrained decoding, and gradient-norm scoring. The “gradient-weighted merging” is a straightforward heuristic. Even though the number of tasks is large, most tasks are short-label classification (BoolQ, MNLI, SST-2, etc.). No open-ended generation, no tool use, no long-context or instruction-following streams, little
1. The paper addresses an underexplored but practically important setting where task identities are unknown at inference time. By integrating task identification and constrained decoding, GRID effectively mitigates label drift and latent forgetting—issues largely ignored in prior prompt-based continual learning works such as ProgPrompt and SHLPT. 2. The proposed gradient-guided prompt compression provides a simple yet effective mechanism to control prompt pool growth, reducing memory usage by o
1. **Lack of theoretical analysis.** For example, the gradient-based prompt selection in Eq. (1–4) is heuristic: prompts with smaller gradient norms are merged by a simple weighted average. However, the paper provides no theoretical justification for why the gradient magnitude correlates with “informativeness”. 2. **Limited experimental evidence supporting the core claim on task-agnostic inference and constrained decoding.** Although the paper claims that GRID enables task-agnostic continual l
1. This work tries to address the challenging task of continual learning in task-agnostic settings, achieving good results on long-sequence and negative transfer benchmarks. 2. The intuition of using gradient to determine the usefulness of prompts is interesting.
1. I am doubtful whether label mapping is still widely adopted in mainstream decoder-only models. It seems to me that although decoder-only models may slightly underperform encoder-decoder models, they are scalable and do not require specific mapping strategies. Moreover, I am wondering why the authors do not apply their method to decoder-only models such as Llama and Qwen. 2. There is redundancy in the writing. Both the introduction section and section 2.3
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
