TiC-CLIP: Continual Training of CLIP Models

Saurabh Garg; Mehrdad Farajtabar; Hadi Pouransari; Raviteja Vemulapalli; Sachin Mehta; Oncel Tuzel; Vaishaal Shankar; Fartash Faghri

arXiv:2310.16226 · cs.CV · March 22, 2024 · 1 citation

TL;DR

This paper introduces TiC benchmarks for continual training of vision-language models, evaluates the temporal robustness of existing models, and proposes an efficient rehearsal-based training method that reduces computational costs.

Contribution

It presents the first large-scale web data benchmarks for time-continuous training of vision-language models and demonstrates an effective rehearsal-based training approach.

Findings

01

Existing models lose accuracy over time without updates.

02

Rehearsal-based continual training reduces compute by 2.5x compared with retraining from scratch.

03

New benchmarks enable evaluation of temporal robustness.
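
The rehearsal idea in finding 02 can be illustrated with a toy batch schedule: as data arrives in time order, each batch mixes examples from the newest time step with examples replayed from earlier steps. This is a minimal sketch, not the authors' implementation. The function name `build_rehearsal_schedule`, the `replay_fraction` parameter, and the 50/50 mix are assumptions chosen for illustration only.

```python
import random


def build_rehearsal_schedule(buckets, replay_fraction=0.5, batch_size=4, seed=0):
    """Return one (step, batch) pair per time step, mixing new and replayed data.

    buckets: list of lists, one list of examples per time step, oldest first.
    replay_fraction: assumed share of each batch drawn from earlier time steps
        (a hypothetical knob, not a value taken from the paper).
    """
    rng = random.Random(seed)
    old_pool = []  # rehearsal buffer: everything seen in earlier time steps
    schedule = []
    for step, new_data in enumerate(buckets):
        # No replay is possible on the first step because the buffer is empty.
        n_old = int(batch_size * replay_fraction) if old_pool else 0
        n_old = min(n_old, len(old_pool))
        n_new = min(batch_size - n_old, len(new_data))
        batch = rng.sample(new_data, n_new) + rng.sample(old_pool, n_old)
        schedule.append((step, batch))
        # After training on this step, its data becomes eligible for replay.
        old_pool.extend(new_data)
    return schedule


buckets = [
    ["a0", "a1", "a2", "a3"],  # oldest time step
    ["b0", "b1", "b2", "b3"],
    ["c0", "c1", "c2", "c3"],  # newest time step
]
schedule = build_rehearsal_schedule(buckets)
for step, batch in schedule:
    print(step, batch)
```

In a real training loop, each batch would update a model that starts from the previous time step's checkpoint rather than from a fresh initialization. Continuing from the checkpoint, instead of retraining on all data each time, is where the compute saving would come from.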

Abstract

Keeping large foundation models up to date on latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-Redcaps. TiC-DataComp, our largest dataset, contains over 12.7B timestamped image-text pairs spanning 9 years (2014-2022). We first use our benchmarks to curate various dynamic evaluations to measure temporal robustness of existing models. We show OpenAI's CLIP (trained on data up to 2020) loses $\approx 8\%$ zero-shot accuracy on our curated retrieval task from 2021-2022 compared with more recently trained models in OpenCLIP repository. We then study how to…

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

The paper collects a large amount of dynamic data to study how to effectively train CLIP models continuously, ensuring the comprehensiveness of the research. In order to ensure fairness in the evaluation, the paper has established a corresponding experimental protocol.

Weaknesses

The dataset being solely focused on training CLIP may be somewhat limited. Can the article consider incorporating more vision-language models? The YFCC100M dataset might be somewhat outdated in terms of the years it covers. It may be more representative to explore newer datasets for the research.

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 4

Strengths

S1) **Tackles an important problem** [Critical]: This work correctly highlights the need to shift focus in continual learning and introduces time-evolving benchmarks for evaluating continual pretraining, which turns out to be quite important. I really liked the dynamic retrieval and the classification task design. Retrieval captures performance shifts over time caused by new concepts and distribution shifts, whereas the classification task ablates the performance gap caused by new things (e.g. covid) by choo

Weaknesses

W1) **Sequential and cumulative models behave quite differently between TiC-YFCC15M and TiC-Datacomp** [Critical]
- The paper nicely illustrates that YFCC15M has strong distribution shifts in Figure 15.
- However, does TiC-DataComp have significant distribution shifts?
- The case for continually training CLIP primarily relies on Datacomp-like data having strong distribution shifts.
- I suspect the case there is far weaker than YFCC15M (I am worried it's too small to make this setting e

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

1. The authors construct multiple datasets with time information, based on existing datasets, for continual learning settings. This is a non-trivial contribution that lets continual learning research evaluate how effective algorithms are under natural distribution shifts.
2. This paper is well-written and easy to follow.

Weaknesses

1. This benchmark lacks various types of continual learning methods [1]: elastic weight consolidation methods, progressive neural network methods, dynamic architecture methods, etc. Therefore, the experiments on this benchmark are relatively weak and insufficient.
2. This paper lacks in-depth analysis of how vision-language models handle continual learning. Vision-language models enable various novel model tuning paradigms, such as prompt tuning, vision prompt tuning, parameter-efficient tuning

Code & Models

Repositories

apple/ml-tic-clip
Official

Models

Datasets

apple/TiC-DataComp
dataset · 2.1k downloads

Videos

TiC-CLIP: Continual Training of CLIP Models · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training