SCAN: Bootstrapping Contrastive Pre-training for Data Efficiency
Yangyang Guo, Mohan Kankanhalli

TL;DR
This paper introduces a dynamic dataset pruning method for contrastive pre-training that improves data efficiency by iteratively updating training data, achieving comparable performance with significantly less data across vision-language and vision-centric models.
Contribution
The paper proposes a novel dynamic bootstrapping dataset pruning technique that tracks data usefulness throughout contrastive pre-training, improving data efficiency while keeping model performance close to full-data training.
Findings
Achieves less than 1% performance drop with 30-35% data pruning.
Outperforms static coreset selection methods in downstream tasks.
Demonstrates effectiveness across multiple models and datasets.
Abstract
While contrastive pre-training is widely employed, its data efficiency problem has remained relatively under-explored thus far. Existing methods often rely on static coreset selection algorithms to pre-identify important data for training. However, this static nature renders them unable to dynamically track the data usefulness throughout pre-training, leading to subpar pre-trained models. To address this challenge, our paper introduces a novel dynamic bootstrapping dataset pruning method. It involves pruning data preparation followed by dataset mutation operations, both of which undergo iterative and dynamic updates. We apply this method to two prevalent contrastive pre-training frameworks: CLIP and MoCo, representing vision-language and vision-centric domains, respectively. In particular, we individually pre-train seven CLIP models on two large-scale image-text pair…
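To make the idea of loss-driven dynamic pruning concrete, the following is a minimal sketch rather than the authors' algorithm. It computes a per-sample InfoNCE loss from a similarity matrix and re-selects the kept subset every round, so the selection can change as the losses change. The function names, the temperature value, and the rule of dropping the lowest-loss fraction are all illustrative assumptions; the abstract does not specify which samples SCAN prunes or how its mutation operations work.

```python
import math


def info_nce_per_sample(sim, temperature=0.07):
    """Per-sample InfoNCE loss for a square similarity matrix.

    sim[i][j] is the similarity between anchor i and candidate j;
    the positive for anchor i is assumed to be candidate i.
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - logits[i])
    return losses


def select_kept_indices(losses, prune_ratio):
    """Return sorted indices of samples kept for the next round.

    Assumption (not from the paper): the lowest-loss fraction is
    treated as least useful and pruned.
    """
    n_prune = int(len(losses) * prune_ratio)
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[n_prune:])


def dynamic_pruning_rounds(loss_per_round, prune_ratio):
    """Re-select the kept subset from fresh losses at every round.

    loss_per_round is a list of per-sample loss lists, one per round,
    standing in for losses recorded during pre-training.
    """
    return [select_kept_indices(losses, prune_ratio) for losses in loss_per_round]
```

A static coreset method would call `select_kept_indices` once before training; the dynamic variant calls it again each round, which is the contrast the abstract draws.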
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The target problem of data efficiency is important and has the potential for broad applicability. 2. The proposed method is simple yet highly effective. Its effectiveness is thoroughly validated through extensive experiments across various components, hyperparameters, datasets, and architectures. Additionally, the paper includes numerous ablation studies and analyses to support its claims. 3. The paper is well-structured, making it easy to follow.
The method appears quite straightforward, but both the approach and motivation seem effective. In my opinion, the experiments are sufficient to demonstrate its effectiveness.
The idea is straightforward, and the experimental results demonstrate its effectiveness.
1. Please improve the paper writing to avoid unnecessary confusion. (See "Questions") a. Please clearly state the experimental settings in each section. E.g., what are the settings (task & model) in Table 6? b. How are the total numbers of training steps determined for different settings? Are they always the same? c. Quantitatively, what is the overhead of pruning operations? d. Why did the Info-Batch model on ViT-B/32 collapse in Table 1? And Swin-Base in Table 3? 2. I am concerned
1. SCAN’s dynamic pruning approach represents an advancement over traditional static pruning techniques, effectively enhancing data efficiency by continuously adjusting the dataset based on sample relevance throughout training. 2. The paper is well written and the proposed method is easy to follow. 3. The approach achieves comparable performance to full data training, even with up to 35% data pruning. This success indicates its potential as a resource-saving, efficient method for contrastive pre
1. Incorporating the results without data pruning into Table 7 and Table 8 would enhance the clarity and provide a more intuitive presentation of the findings. 2. I am interested in extreme dataset pruning, with pruning ratios of 70%~90%. In my opinion, the performance of SCAN relies heavily on the careful tuning of pruning ratios, which requires additional experimentation to achieve optimal results across different datasets.
- The idea of using loss values to dynamically prune the dataset is interesting. - The proposed method shows some performance gain compared to some existing core-set selection methods (Info-Batch, D-Pruning).
- The data scale for experiments is relatively small. The results with CLIP only achieve ~26% on ImageNet zero-shot classification, which is far from the original work with 70+% on a much larger dataset. The effectiveness of this method for larger scale data is not clear from the paper. - Some important baselines and related work are not compared. For CLIP training, a standard data filtering method is CLIP-score (using a pretrained CLIP model to filter data). It will remove low image-text simila
Taxonomy
Topics: Time Series Analysis and Forecasting · Machine Learning and Data Classification · Anomaly Detection Techniques and Applications
Methods: Dataset Pruning · Pruning · Batch Normalization · InfoNCE · Contrastive Language-Image Pre-training · Coresets · Momentum Contrast
