TiC-CLIP: Continual Training of CLIP Models
Saurabh Garg, Mehrdad Farajtabar, Hadi Pouransari, Raviteja Vemulapalli, Sachin Mehta, Oncel Tuzel, Vaishaal Shankar, Fartash Faghri

TL;DR
This paper introduces TiC benchmarks for continual training of vision-language models, evaluates the temporal robustness of existing models, and proposes an efficient rehearsal-based training method that reduces computational costs.
Contribution
It presents the first large-scale web data benchmarks for time-continuous training of vision-language models and demonstrates an effective rehearsal-based training approach.
Findings
Existing models lose accuracy over time without updates.
Rehearsal-based continual training reduces compute by 2.5x.
New benchmarks enable evaluation of temporal robustness.
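To make the rehearsal idea concrete, here is a minimal sketch of one way it could look: training continues with the same model across time steps, and each batch mixes new examples with examples replayed from earlier steps. The function names, the fixed `replay_fraction`, and the uniform sampling from the old pool are illustrative assumptions, not the paper's actual recipe or hyperparameters.

```python
import random


def build_rehearsal_batches(old_pool, new_data, batch_size, replay_fraction, seed=0):
    """Return batches that mix new examples with replayed old ones.

    Hypothetical helper: the fixed replay fraction and uniform sampling
    are assumptions made for illustration only.
    """
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # With no old data yet (first time step), every slot goes to new data.
    n_replay = int(round(batch_size * replay_fraction)) if old_pool else 0
    n_new = batch_size - n_replay
    if n_new < 1:
        raise ValueError("batch_size too small for this replay_fraction")

    shuffled = list(new_data)
    rng.shuffle(shuffled)
    batches = []
    for start in range(0, len(shuffled), n_new):
        fresh = shuffled[start:start + n_new]
        replayed = rng.sample(old_pool, min(n_replay, len(old_pool)))
        batches.append(fresh + replayed)
    return batches


def continual_train(time_steps, train_step, batch_size=4, replay_fraction=0.5):
    """Visit time steps in order, warm-starting from whatever train_step updated last."""
    old_pool = []
    for new_data in time_steps:
        for batch in build_rehearsal_batches(old_pool, new_data, batch_size, replay_fraction):
            train_step(batch)  # stand-in for one optimizer update on the current model
        old_pool.extend(new_data)  # data seen so far becomes eligible for replay
```

In a real setup `train_step` would run a contrastive update on image-text pairs; here it is only a callback, so the data-mixing logic can be inspected without a model.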
Abstract
Keeping large foundation models up to date on latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-Redcaps. TiC-DataComp, our largest dataset, contains over 12.7B timestamped image-text pairs spanning 9 years (2014-2022). We first use our benchmarks to curate various dynamic evaluations to measure temporal robustness of existing models. We show OpenAI's CLIP (trained on data up to 2020) loses zero-shot accuracy on our curated retrieval task from 2021-2022 compared with more recently trained models in OpenCLIP repository. We then study how to…
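As a rough illustration of what a time-bucketed retrieval evaluation could look like, the sketch below computes image-to-text recall@k separately for each year. It assumes a precomputed similarity matrix in which image `i` matches caption `i`, and it ranks each image against all captions; the function name, the input layout, and the pooling choice are assumptions for illustration, not the benchmark's evaluation code.

```python
from collections import defaultdict


def recall_at_k_by_year(similarity, years, k=1):
    """Per-year recall@k for image-to-text retrieval.

    similarity[i][j] is the score of image i against caption j, and the
    correct caption for image i is assumed to be j == i (hypothetical layout).
    years[i] is the timestamp bucket of pair i.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for i, row in enumerate(similarity):
        # Indices of the k highest-scoring captions for this image.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        totals[years[i]] += 1
        hits[years[i]] += int(i in top_k)
    return {year: hits[year] / totals[year] for year in totals}
```

Comparing the per-year values for a model trained up to a given cutoff is one way to surface the kind of accuracy drop on newer data that the abstract describes.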
Peer Reviews
Decision: ICLR 2024 poster
The paper collects a large amount of dynamic data to study how to effectively train CLIP models continuously, ensuring the comprehensiveness of the research. In order to ensure fairness in the evaluation, the paper has established a corresponding experimental protocol.
The dataset being solely focused on training CLIP may be somewhat limited. Can the article consider incorporating more vision-language models? The YFCC100M dataset might be somewhat outdated in terms of the years it covers. It may be more representative to explore newer datasets for the research.
S1) **Tackles an important problem** [Critical]: This work correctly highlights the need to shift focus in continual learning and introduces time-evolving benchmarks for evaluating continual pretraining, which turns out to be quite important. I really liked the dynamic retrieval and the classification task design. Retrieval captures performance shifts over time caused by new concepts and distribution shifts, whereas the classification task ablates the performance gap caused by new things (e.g. covid) by choo
W1) **Sequential and cumulative models behave quite differently between TiC-YFCC15M and TiC-DataComp** [Critical]
- The paper nicely illustrates that YFCC15M has strong distribution shifts in Figure 15.
- However, does TiC-DataComp have significant distribution shifts?
- The case for continually training CLIP primarily relies on DataComp-like data having strong distribution shifts.
- I suspect the case there is far weaker than YFCC15M (I am worried it's too small to make this setting e
1. The authors construct multiple datasets with time information, based on existing datasets, for continual learning settings. This is a non-trivial contribution to continual learning for evaluating the effectiveness of algorithms under natural distribution shifts. 2. The paper is well written and easy to follow.
1. The benchmark does not cover several types of continual learning methods [1]: elastic weight consolidation methods, progressive neural network methods, dynamic architecture methods, etc. Therefore, the benchmark's experiments are relatively weak and insufficient. 2. The paper lacks in-depth analysis of how vision-language models address continual learning. Vision-language models enable various novel model tuning paradigms, such as prompt tuning, vision prompt tuning, parameter-efficient tuning
Code & Models
- 🤗 apple/TiC-CLIP-basic-oracle · model · 14 dl
- 🤗 apple/TiC-CLIP-basic-sequential · model · 126 dl · ♡ 1
- 🤗 apple/TiC-CLIP-bestpool-cumulative · model · 44 dl · ♡ 3
- 🤗 apple/TiC-CLIP-bestpool-oracle · model · 4 dl
- 🤗 apple/TiC-CLIP-bestpool-sequential · model · 44 dl
- 🤗 apple/TiC-CLIP-basic-cumulative · model · 134 dl · ♡ 3
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training
