Learning without Forgetting for Vision-Language Models

Da-Wei Zhou; Yuanhan Zhang; Yan Wang; Jingyi Ning; Han-Jia Ye; De-Chuan Zhan; Ziwei Liu

arXiv:2305.19270·cs.CV·February 13, 2025·5 cites

TL;DR

This paper introduces PROOF, a method for vision-language models to learn new tasks continually without forgetting previous knowledge, by using task-specific projections and a fusion module to leverage multi-modal information.

Contribution

It proposes a novel approach with task-specific projections and a fusion module to enable continual learning in vision-language models, addressing catastrophic forgetting and multi-modal utilization.

Findings

01. PROOF achieves state-of-the-art results on nine benchmark datasets.
02. The method effectively alleviates forgetting of old concepts during continual learning.
03. Joint adjustment of visual and textual features enhances semantic representation.

Abstract

Class-Incremental Learning (CIL) or continual learning is a desired capability in the real world, which requires a learning system to adapt to new tasks without forgetting former ones. While traditional CIL methods focus on visual information to grasp core features, recent advances in Vision-Language Models (VLM) have shown promising capabilities in learning generalizable representations with the aid of textual information. However, when continually trained with new classes, VLMs often suffer from catastrophic forgetting of former knowledge. Applying VLMs to CIL poses two major challenges: 1) how to adapt the model without forgetting; and 2) how to make full use of the multi-modal information. To this end, we propose PROjectiOn Fusion (PROOF) that enables VLMs to learn without forgetting. To handle the first challenge, we propose training task-specific projections based on the frozen…
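The abstract's first idea, task-specific projections trained on top of a frozen backbone, can be illustrated with a minimal sketch. It follows the description in the reviews on this page: one projection is added per task, earlier projections are fixed, and all projections are merged at inference. Everything concrete here is an assumption made for illustration, not the authors' implementation: the `ProjectionBank` class, the `matvec` helper, the use of plain linear maps, and merging by summation are hypothetical choices, and the official PyTorch repository should be consulted for the real method.

```python
import copy
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ProjectionBank:
    """Hypothetical container: one linear projection per task, applied to
    features from a frozen encoder (the encoder itself is not modelled)."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self._rng = random.Random(seed)
        self.projections = []  # one dim x dim matrix per task
        self.trainable = []    # parallel flags: True only for the newest task

    def add_task(self):
        # Freeze every earlier projection, then append a fresh trainable one.
        self.trainable = [False] * len(self.projections)
        new_proj = [[self._rng.gauss(0.0, 0.1) for _ in range(self.dim)]
                    for _ in range(self.dim)]
        self.projections.append(new_proj)
        self.trainable.append(True)

    def forward(self, frozen_feature):
        # Assumption: projections are merged by summing their outputs.
        fused = [0.0] * self.dim
        for proj in self.projections:
            fused = [f + p for f, p in zip(fused, matvec(proj, frozen_feature))]
        return fused

    def sgd_step(self, grads, lr=0.1):
        # Apply gradients only to the trainable (newest) projection.
        for proj, grad, is_trainable in zip(self.projections, grads, self.trainable):
            if not is_trainable:
                continue
            for i in range(self.dim):
                for j in range(self.dim):
                    proj[i][j] -= lr * grad[i][j]


# Usage: learn two tasks in sequence; the first projection stays untouched.
bank = ProjectionBank(dim=2)
bank.add_task()
old_snapshot = copy.deepcopy(bank.projections[0])
bank.add_task()
ones = [[1.0, 1.0], [1.0, 1.0]]  # placeholder gradients, not real ones
bank.sgd_step([ones, ones])
```

In this sketch the old projection's weights never change, yet its contribution at inference is still mixed with the output of later projections, which is the side effect the reviewers question.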

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 5

Strengths

- In general, the proposed method is well motivated and clearly presented.
- The paper turns a VLM into a continual learner that is both retentive and comprehensive.
- Good performance is achieved.

Weaknesses

- The effectiveness of alleviating forgetting is uncertain. The process involves incrementally learning image projection heads and text projection heads, which are then combined for various tasks. When new tasks are learned, the projections of previous tasks are fixed and not updated. However, during inference, the projections of all tasks are merged, which might not be ideal for test data from older tasks due to potential side effects caused by the projections from the new tasks.
- The extent t

Reviewer 02 · Rating 6 · marginally above the acceptance threshold · Confidence 5

Strengths

Technically a novel idea to incorporate both the visual and the text encoders. Improves upon SOTA.

Weaknesses

- Inference Mismatch: Projections are combined at inference time which may not fully match the training conditions for a specific task projection.
- Representation Drift: The post-attention module representations learned by the frozen projections may drift or shift slightly during new task training due to weight updates elsewhere. Small drifts can accumulate.
- Section 3 is really long and has a lot of redundant information, it should be made much shorter. That space should be given to incr

Reviewer 03 · Rating 5 · marginally below the acceptance threshold · Confidence 4

Strengths

- The authors tested for the first time a VLM model for continual learning.
- The authors tested their PROOF on a variety of datasets testing the effectiveness of the model.
- The authors proved the effectiveness of the model with very interesting and detailed ablation studies.

Weaknesses

- The paper lacks motivation and innovation: The authors suggest using CLIP for class-incremental continual learning, but it would be more interesting to see its performance on tasks like incremental captioning or retrieval. Unlike L2P, where a large pretrained model was used, CIL could have been just one application.
- Furthermore, the PROOF mechanism, while innovative, lacks depth. Projection networks are common in continual learning, and the new context definition isn't explored.
- The main p

Code & Models

Repositories

zhoudw-zdw/PROOF
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Cancer-related molecular mechanisms research

Methods: Focus