DesCLIP: Robust Continual Learning via General Attribute Descriptions for VLM-Based Visual Recognition
Chiyuan He, Zihuan Qiu, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li

TL;DR
DesCLIP introduces a novel continual learning approach for vision-language models that uses general attribute descriptions to improve robustness and reduce forgetting in visual recognition tasks.
Contribution
The paper proposes DesCLIP, which leverages general attribute descriptions and a language assistant to enhance VLM continual learning by establishing robust vision-GA-class associations.
Findings
Outperforms existing continual learning methods in VLM recognition tasks.
Effectively reduces knowledge forgetting in continual learning scenarios.
Demonstrates robustness across various downstream datasets.
Abstract
Continual learning of vision-language models (VLMs) focuses on leveraging cross-modal pretrained knowledge to incrementally adapt to expanding downstream tasks and datasets, while tackling the challenge of knowledge forgetting. Existing research often focuses on connecting visual features with specific class text in downstream tasks, overlooking the latent relationships between general and specialized knowledge. Our findings reveal that forcing models to optimize inappropriate visual-text matches exacerbates forgetting of VLM's recognition ability. To tackle this issue, we propose DesCLIP, which leverages general attribute (GA) descriptions to guide the understanding of specific class objects, enabling VLMs to establish robust vision-GA-class trilateral associations rather than relying solely on vision-class connections. Specifically, we introduce a language assistant to generate…
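The vision-GA-class idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the blending weight `alpha`, the averaging over GA descriptions, the temperature, and all embedding values below are assumptions chosen for illustration. It only shows how a class score might combine a direct vision-class similarity with the similarity between the image and that class's general attribute descriptions.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def trilateral_scores(image_emb, class_embs, ga_embs_per_class, alpha=0.5):
    """Blend vision-class similarity with mean vision-GA similarity.

    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    scores = []
    for class_emb, ga_embs in zip(class_embs, ga_embs_per_class):
        vision_class = cosine(image_emb, class_emb)
        vision_ga = sum(cosine(image_emb, g) for g in ga_embs) / len(ga_embs)
        scores.append((1 - alpha) * vision_class + alpha * vision_ga)
    return scores


# Toy embeddings (made up): one image, two classes, two GA descriptions each.
image_emb = [0.9, 0.1, 0.3]
class_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
ga_embs_per_class = [
    [[0.8, 0.0, 0.6], [1.0, 0.2, 0.0]],  # attribute descriptions for class 0
    [[0.0, 0.9, 0.1], [0.1, 1.0, 0.0]],  # attribute descriptions for class 1
]

scores = trilateral_scores(image_emb, class_embs, ga_embs_per_class)
probs = softmax(scores, temperature=0.1)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

In a real setting the embeddings would come from a VLM's image and text encoders, and the GA descriptions from a language assistant as the abstract describes. How DesCLIP actually builds and trains these associations is not specified in the text above.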
