GRID: Scalable Task-Agnostic Prompt-Based Continual Learning for Language Models

Anushka Tiwari; Sayantan Pal; Rohini K. Srihari; Kaiyi Ji

arXiv:2507.14725·cs.LG·October 2, 2025


Open Access · 4 Reviews

TL;DR

GRID introduces a scalable, task-agnostic prompt-based continual learning framework for language models that enhances performance, reduces memory usage, and effectively handles long task sequences.

Contribution

The paper proposes GRID, a novel framework that improves backward transfer, enables automatic task identification, and compresses prompts for scalable continual learning in language models.

Findings

01

Improves average accuracy and backward transfer on benchmarks.

02

Reduces prompt memory usage significantly.

03

Achieves competitive forward transfer performance.
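
As a rough illustration of the first finding, the sketch below computes average accuracy and backward transfer (BWT) from an accuracy matrix, using the commonly cited definition BWT = 1/(T-1) Σ_i (R[T][i] - R[i][i]). This page does not give the paper's exact evaluation protocol, so the matrix layout and the function names `average_accuracy` and `backward_transfer` are assumptions made for illustration only.

```python
def average_accuracy(acc):
    """Mean accuracy over all tasks after training on the final task.

    Assumed layout: acc[i][j] is the accuracy on task j measured after
    training on tasks 0..i in sequence.
    """
    final_row = acc[-1]
    return sum(final_row) / len(final_row)


def backward_transfer(acc):
    """Average change in accuracy on earlier tasks after learning all tasks.

    Negative values indicate forgetting; positive values indicate that
    later training helped earlier tasks.
    """
    num_tasks = len(acc)
    if num_tasks < 2:
        raise ValueError("backward transfer needs at least two tasks")
    deltas = [acc[-1][i] - acc[i][i] for i in range(num_tasks - 1)]
    return sum(deltas) / (num_tasks - 1)
```

With a three-task matrix, `backward_transfer` compares the last row against the diagonal for the first two tasks, so a method that forgets less yields a value closer to zero or above it.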

Abstract

Prompt-based continual learning (CL) provides a parameter-efficient approach for adapting large language models (LLMs) across task sequences. However, most existing methods rely on task-aware inference and maintain a growing set of task-specific prompts, which introduces two major challenges: (1) severe performance degradation on earlier tasks under task-agnostic inference, and (2) limited scalability due to prompt memory accumulation as task sequences grow. In this paper, we present GRID, a unified framework designed to address these challenges. GRID incorporates a decoding mechanism that enhances backward transfer by leveraging representative inputs, automatic task identification, and constrained decoding. Furthermore, it employs a gradient-guided prompt selection strategy to compress less informative prompts into a single aggregated representation, ensuring scalable and…
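
The gradient-guided compression described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes each prompt is flattened to a list of floats, that "less informative" means a smaller gradient norm, and that the low-scoring prompts are merged by a gradient-norm-weighted average. The names `grad_norm` and `compress_prompts` and the `keep` parameter are hypothetical.

```python
import math


def grad_norm(grad):
    """Euclidean norm of a flattened gradient vector."""
    return math.sqrt(sum(g * g for g in grad))


def compress_prompts(prompts, grads, keep):
    """Keep the `keep` prompts with the largest gradient norms and merge the rest.

    prompts: list of equal-length float lists (flattened prompt embeddings).
    grads:   gradient of the loss with respect to each prompt, same shapes.
    Returns (kept_prompts, merged_prompt); merged_prompt is None when
    nothing is left to merge.
    """
    if len(prompts) != len(grads):
        raise ValueError("need exactly one gradient per prompt")
    norms = [grad_norm(g) for g in grads]
    # Rank prompt indices from largest to smallest gradient norm.
    order = sorted(range(len(prompts)), key=lambda i: norms[i], reverse=True)
    kept_idx = sorted(order[:keep])  # restore original task order
    merge_idx = order[keep:]
    kept = [prompts[i] for i in kept_idx]
    if not merge_idx:
        return kept, None
    total = sum(norms[i] for i in merge_idx)
    if total == 0:
        # Assumption: fall back to a plain average when all norms are zero.
        weights = [1.0 / len(merge_idx)] * len(merge_idx)
    else:
        weights = [norms[i] / total for i in merge_idx]
    dim = len(prompts[0])
    merged = [
        sum(w * prompts[i][d] for w, i in zip(weights, merge_idx))
        for d in range(dim)
    ]
    return kept, merged
```

Under these assumptions the pool size stays bounded at `keep + 1` vectors regardless of how many tasks have been seen, which is the scalability property the abstract points to.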

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The problem is interesting: there is a lot of literature on CL in the visual domain but limited work in the NLP domain, so this work may be beneficial for the NLP community. 2. The proposed architecture seems novel, and the gradient-norm-based prompt pool selection looks interesting. The gradient-guided prompt selection and merging prevents unbounded prompt growth, maintaining high performance with over 60% less memory than prior prompt-based continual learners. 3. Paper does not requ

Weaknesses

1. In the vision domain, the CIL setting is very common for prompt-based continual learning, yet here the authors highlight it as a key challenge. Why? At a high level, once tokenization is done, language and vision models are similar, so why can the same concepts [a,b,c] not be applied here? 2. The task identification method in Section 3.1 appears to violate the CL setting: the paper uses a pretrained model (say, Phi) for task-ID prediction, which is much more powerful and mostly knows all the sequ

Reviewer 02 · Rating 2 · Confidence 3

Strengths

The problem is clearly stated: task-agnostic inference with a bounded prompt pool. The paper combines representative-input sampling, task identification, constrained decoding, and gradient-guided prompt selection/merging. It cuts prompt memory by $\approx \frac{2}{3}$ with better BWT/FTC than ProgPrompt/SHLPT, and includes ablations and runtime/memory reporting.

Weaknesses

This is useful engineering, but the conceptual novelty is modest and the evaluation may be too simple. Each component is well-known: representative sampling/clustering, label remapping + constrained decoding, and gradient-norm scoring. The “gradient-weighted merging” is a straightforward heuristic. Even though the number of tasks is large, most tasks are short-label classification (BoolQ, MNLI, SST-2, etc.). No open-ended generation, no tool use, no long-context or instruction-following streams, little

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The paper addresses an underexplored but practically important setting where task identities are unknown at inference time. By integrating task identification and constrained decoding, GRID effectively mitigates label drift and latent forgetting—issues largely ignored in prior prompt-based continual learning works such as ProgPrompt and SHLPT. 2. The proposed gradient-guided prompt compression provides a simple yet effective mechanism to control prompt pool growth, reducing memory usage by o

Weaknesses

1. **Lack of theoretical analysis.** For example, the gradient-based prompt selection in Eq. (1–4) is heuristic: prompts with smaller gradient norms are merged by a simple weighted average. However, the paper provides no theoretical justification for why the gradient magnitude correlates with “informativeness”. 2. **Limited experimental evidence supporting the core claim on task-agnostic inference and constrained decoding.** Although the paper claims that GRID enables task-agnostic continual l

Reviewer 04 · Rating 2 · Confidence 3

Strengths

1. This work tries to address the challenging task of continual learning in task-agnostic settings, achieving good results on long-sequence and negative-transfer benchmarks. 2. The intuition of using gradients to determine the usefulness of prompts is interesting.

Weaknesses

1. I am doubtful whether label mapping is still widely adopted in mainstream decoder-only models. It seems to me that although decoder-only models may slightly underperform encoder-decoder models, they are scalable and do not require specific mapping strategies. Moreover, I am wondering why the authors do not apply their method to decoder-only models such as Llama and Qwen. 2. There is redundancy in the writing. Both the introduction section and section 2.3

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis