Continual Vision-Language Representation Learning with Off-Diagonal Information

Zixuan Ni; Longhui Wei; Siliang Tang; Yueting Zhuang; Qi Tian

arXiv:2305.07437 · cs.LG · June 2, 2023 · 2 cites

TL;DR

This paper investigates the challenges of continual vision-language learning with streaming data, identifies spatial disorder as a key issue, and proposes Mod-X to maintain off-diagonal information for improved model stability.

Contribution

The paper introduces Mod-X, a framework that preserves off-diagonal information in contrastive matrices to mitigate spatial disorder during continual training.
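
To make the idea of "off-diagonal information in contrastive matrices" concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a CLIP-style contrastive matrix (row-wise softmax over image-text cosine similarities) and uses a KL divergence between the off-diagonal entries of an old and a new model's matrices as a stand-in for a preservation objective. The function names `contrastive_matrix` and `off_diagonal_kl`, the temperature value, and the choice of KL are all assumptions made for illustration; the exact Mod-X loss may differ.

```python
import numpy as np


def contrastive_matrix(img, txt, temperature=0.07):
    """Row-wise softmax over image-to-text cosine similarities.

    img, txt: arrays of shape (n, d) holding paired embeddings.
    The temperature value is an assumption for this sketch.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def off_diagonal_kl(old_m, new_m, eps=1e-12):
    """Mean KL(old || new) over rows, using only off-diagonal entries.

    Hypothetical preservation term: each row's off-diagonal entries are
    renormalised into a distribution before the divergence is computed.
    """
    n = old_m.shape[0]
    mask = ~np.eye(n, dtype=bool)           # True everywhere except the diagonal
    p = old_m[mask].reshape(n, n - 1)       # row-major, so rows are preserved
    q = new_m[mask].reshape(n, n - 1)
    p = p / p.sum(axis=1, keepdims=True)
    q = q / q.sum(axis=1, keepdims=True)
    kl_rows = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)
    return float(kl_rows.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random stand-ins for embeddings from an old and a newly trained encoder.
    old = contrastive_matrix(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
    new = contrastive_matrix(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
    print("off-diagonal KL:", off_diagonal_kl(old, new))
```

In a continual-training loop, a term like this would typically be added to the usual contrastive loss on the new data, so that the new model's similarity structure among non-matching pairs stays close to the old model's.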

Findings

01. Mod-X effectively reduces spatial disorder in continual CLIP training.

02. The proposed method improves cross-modal retrieval performance over baseline models.

03. Experiments show robustness across various datasets and scales.

Abstract

Large-scale multi-modal contrastive learning frameworks like CLIP typically require a large amount of image-text samples for training. However, in real scenarios these samples are collected continuously. This paper discusses the feasibility of continual CLIP training using streaming data. Unlike continual learning based on self-supervised learning methods for pure images, which is empirically robust against catastrophic forgetting, CLIP's performance degeneration in the continual setting is significant and non-negligible. By analyzing the changes in the model's representation space during continual CLIP training from a spatial geometry perspective, we explore and summarize these spatial variations as Spatial Disorder (SD), which can be divided into Intra-modal Rotation and Inter-modal Deviation. Moreover, we empirically and theoretically demonstrate how SD leads to a performance…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Learning · Contrastive Language-Image Pre-training