Memory-Free Continual Learning with Null Space Adaptation for Zero-Shot Vision-Language Models

Yujin Jo; Taesup Kim

arXiv:2510.21175·cs.AI·October 27, 2025

3 Reviews

TL;DR

NuSA-CL is a memory-free continual learning method for vision-language models that preserves zero-shot capabilities by constraining weight updates within a null space, enabling efficient adaptation without catastrophic forgetting.

Contribution

It introduces a lightweight, memory-free continual learning framework that maintains zero-shot abilities of VLMs using null space adaptation, avoiding replay buffers or distillation.

Findings

1. Effectively preserves zero-shot transfer capabilities.
2. Achieves competitive performance on continual learning benchmarks.
3. Imposes minimal computational and memory overhead.

Abstract

Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated remarkable zero-shot generalization, enabling deployment in a wide range of real-world tasks without additional task-specific training. However, in real deployment scenarios with evolving environments or emerging classes, these models inevitably face distributional shifts and novel tasks. In such contexts, static zero-shot capabilities are insufficient, and there is a growing need for continual learning methods that allow models to adapt over time while avoiding catastrophic forgetting. We introduce NuSA-CL (Null Space Adaptation for Continual Learning), a lightweight memory-free continual learning framework designed to address this challenge. NuSA-CL employs low-rank adaptation and constrains task-specific weight updates to lie within an approximate null space of the model's current parameters. This strategy…
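The abstract only names the mechanism, so the following is a minimal NumPy sketch of one way a null-space-constrained update could look, not the authors' implementation. It assumes the approximate null space is taken from the low-energy (tail) singular directions of a weight matrix, and that the update is parameterized by a small matrix `M` expressed in that tail basis. The function names `split_subspaces` and `tail_update`, the `energy_keep` threshold, and the toy weight matrix are all hypothetical.

```python
import numpy as np


def split_subspaces(W, energy_keep=0.99):
    """Split the singular directions of W into principal and low-energy (tail) parts.

    energy_keep is a hypothetical threshold: the fraction of squared singular
    value mass assigned to the principal subspace.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    energy = np.cumsum(S**2) / np.sum(S**2)
    k_top = int(np.searchsorted(energy, energy_keep)) + 1
    k_top = min(k_top, len(S) - 1)  # always leave at least one tail direction
    return (U[:, :k_top], Vt[:k_top].T), (U[:, k_top:], Vt[k_top:].T)


def tail_update(U_tail, V_tail, M):
    """Build an update confined to the tail subspace; M is the only trainable part."""
    return U_tail @ M @ V_tail.T


rng = np.random.default_rng(0)
# Toy weight: rank-4 signal plus small noise, standing in for a pre-trained layer.
W = rng.normal(size=(16, 4)) @ rng.normal(size=(4, 12)) + 1e-3 * rng.normal(size=(16, 12))
(U_top, V_top), (U_tail, V_tail) = split_subspaces(W)

# Random stand-in for an update that would normally be learned on the new task.
M = 0.01 * rng.normal(size=(U_tail.shape[1], V_tail.shape[1]))
delta_W = tail_update(U_tail, V_tail, M)
W_merged = W + delta_W  # merged into the weights, so the parameter count is unchanged

# The update has no component along the principal directions (values near zero).
print(np.abs(U_top.T @ delta_W).max(), np.abs(delta_W @ V_top).max())
```

Because the columns of `U_tail` and `V_tail` are orthogonal to the principal singular vectors, the merged weight keeps its principal component exactly, which is the intuition behind preserving pre-trained behavior while adapting. How the paper selects the rank, trains the update, and recomputes the decomposition per task is not specified in the text above.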

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The theoretical motivation is clear and reasonable.
2. The experiments demonstrate a favorable result among memory-free methods. The analysis and discussions are comprehensive, showing the method's practicality and robustness.

Weaknesses

1. The storage-free baselines are not particularly strong. This raises a question: constraining updates to the null space may limit the model's expressivity. Can the method still trade additional memory cost for stronger performance, or does the null-space constraint also cap the performance upper bound?

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. Simple, natural idea: SVD-based separation of principal vs. null-like directions; persistent constraint prevents drift and reduces interference.
2. Truly memory-free and fixed-size: no replay, no task-specific modules; efficient and scalable.
3. Strong empirical results with comprehensive cost reporting; SOTA among storage-free baselines and close to storage-based methods.
4. Clear, informative ablations (Top/Tail/Random; rmax; persistent constraint; multimodal adaptation) and mechanism evide

Weaknesses

1. Theory is mostly motivational; tighter links between parameter-space bounds and function-level forgetting would help.
2. Long-horizon spectral drift and null-space quality: Although the method re-computes SVD per task, after many merges the spectrum will evolve. A more systematic study of whether low-energy subspaces gradually become “contaminated” (especially for highly correlated task sequences) and how to monitor/remedy this (e.g., periodic re-orthogonalization, spectral gating) would be v

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. Introduces the novel idea of updating in null-space directions (Tail), which is conceptually different from previous approaches that focus on Top singular directions or random subspaces. Combines multimodal adaptation and rank-limited updates, which is a creative integration addressing the stability–plasticity trade-off.
2. Significance: Tackles catastrophic forgetting, a central challenge in continual learning. Demonstrates practical low-cost, robust implementation suitable for large-scale m

Weaknesses

1. The method assumes that tasks have sufficiently distinct distributions, enabling interference reduction via orthogonalization or projection. However, when tasks are highly correlated or share overlapping features, the model may struggle to separate old and new knowledge, leading to reduced stability and adaptability.
2. The experimental evaluation is limited to standard vision benchmarks (e.g., CIFAR, ImageNet subsets) with relatively few tasks, focusing only on class-incremental learning. Th
