Preserving Cross-Modal Consistency for CLIP-based Class-Incremental Learning
Haoran Chen, Houze Xu, Micah Goldblum, Daoguo Dong, Zuxuan Wu

TL;DR
This paper introduces DMC and DMC-OT, two frameworks for CLIP-based class-incremental learning that decouple vision and text adaptation, preserve cross-modal alignment, and address distributional drift, achieving state-of-the-art results.
Contribution
The paper proposes DMC, a two-stage decoupled framework, and DMC-OT, an enhanced version that adds optimal-transport calibration, for improved CLIP-based continual learning.
Findings
DMC and DMC-OT outperform existing methods on multiple datasets.
DMC-OT achieves an average accuracy increase of 1.80%.
The methods effectively mitigate classifier bias and distributional drift.
Abstract
Class-incremental learning (CIL) enables models to continuously learn new categories from sequential tasks without forgetting previously acquired knowledge. While recent advances in vision-language models such as CLIP have demonstrated strong generalization across domains, extending them to continual settings remains challenging. In particular, learning task-specific soft prompts for newly introduced classes often leads to severe classifier bias, as the text prototypes overfit to recent categories when prior data are unavailable. In this paper, we propose DMC, a simple yet effective two-stage framework for CLIP-based CIL that decouples the adaptation of the vision encoder and the optimization of textual soft prompts. Each stage is trained with the other frozen, allowing one modality to act as a stable semantic anchor for the other to preserve cross-modal alignment. Furthermore, current…
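The two-stage idea in the abstract (train one modality while the other is frozen, then swap) can be illustrated with a toy sketch. This is not the authors' implementation: each CLIP branch is reduced to a single hypothetical scalar weight, and the names `train_stage`, `stage1_data`, and `stage2_data`, the squared-gap loss, and the learning rate are all assumptions chosen so the example runs with the Python standard library alone.

```python
# Toy sketch of decoupled two-stage training (hypothetical, not the paper's code).
# "vision" and "text" stand in for the two CLIP branches, each reduced to one
# scalar weight. The loss is the squared gap between the scaled features.

def train_stage(params, trainable, data, lr=0.05, epochs=200):
    """Run gradient descent, updating only the branch named in `trainable`.

    The other branch is frozen and acts as a fixed anchor.
    """
    if trainable not in ("vision", "text"):
        raise ValueError("trainable must be 'vision' or 'text'")
    params = dict(params)  # copy so the caller's dict is not mutated
    for _ in range(epochs):
        for x, t in data:
            gap = params["vision"] * x - params["text"] * t
            if trainable == "vision":
                # d/d(vision) of gap**2 is 2 * gap * x
                params["vision"] -= lr * 2 * gap * x
            else:
                # d/d(text) of gap**2 is 2 * gap * (-t)
                params["text"] -= lr * 2 * gap * (-t)
    return params


init = {"vision": 1.0, "text": 1.0}

# Stage 1: adapt the vision weight while the text weight stays frozen.
stage1_data = [(1.0, 2.0), (2.0, 4.0)]  # made-up (image, text) feature pairs
stage1 = train_stage(init, "vision", stage1_data)

# Stage 2: adapt the text weight while the vision weight stays frozen.
stage2_data = [(1.0, 1.0), (3.0, 3.0)]  # made-up pairs for the second stage
stage2 = train_stage(stage1, "text", stage2_data)

print(stage1, stage2)
```

In each stage the frozen weight is left exactly as it was, so the trainable branch can only reduce the loss by moving toward the anchor. That is the sense in which one modality stabilizes the other in this sketch; the real method operates on encoders and soft prompts rather than scalars.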
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Face recognition and analysis · Multimodal Machine Learning Applications
