TL;DR
This survey comprehensively reviews continual learning challenges and solutions for vision-language models and multimodal large language models, emphasizing unique issues like cross-modal feature drift and zero-shot capability erosion.
Contribution
It introduces a challenge-driven taxonomy for continual learning in VLMs and MLLMs, analyzing failure modes and proposing future research directions.
Findings
Deconstructed failure modes of VLMs and MLLMs in continual learning.
Proposed a four-paradigm taxonomy for addressing continual learning challenges.
Highlighted the importance of dual-track benchmarks and micro-diagnostic evaluations.
Abstract
Vision-language models (VLMs) and the recent surge of Multimodal Large Language Models (MLLMs) have revolutionized artificial intelligence with unprecedented cross-modal alignment and zero-shot generalization. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross-modal alignment and generalization capabilities are particularly vulnerable to catastrophic forgetting. Unlike traditional unimodal continual learning (CL), VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. Furthermore, generative MLLMs exhibit a unique "alignment tax," where catastrophic forgetting manifests not merely as factual amnesia, but as a systemic collapse of deep Chain-of-Thought (CoT) reasoning. This survey presents the first comprehensive, diagnostic review…
