A CLIP-Powered Framework for Robust and Generalizable Data Selection

Suorong Yang; Peng Ye; Wanli Ouyang; Dongzhan Zhou; Furao Shen

arXiv:2410.11215·cs.CV·June 23, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a CLIP-powered multimodal data selection framework that improves the robustness and efficiency of training deep learning models by effectively removing noisy data and selecting representative samples.

Contribution

The paper presents a novel multimodal data selection framework leveraging CLIP's knowledge, enhancing robustness and generalization over existing single-modality methods.

Findings

01 · Outperforms state-of-the-art data selection methods on benchmark datasets.

02 · Effectively removes noisy or damaged samples, improving data quality.

03 · Achieves higher performance with less training data.

Abstract

Large-scale datasets have been pivotal to the advancements of deep learning models in recent years, but training on such large datasets invariably incurs substantial storage and computational overhead. Meanwhile, real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance. Data selection has shown promise in identifying the most representative samples from the entire dataset, which aims to minimize the performance gap with reduced training costs. Existing works typically rely on single-modality information to assign importance scores for individual samples, which may lead to inaccurate assessments, especially when dealing with noisy or corrupted samples. To address this limitation, we propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample…

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. The method performs more robustly than other baseline data selection methods on various popular datasets. 2. The algorithm is easy to understand and follow. 3. The writing of this paper is overall great.

Weaknesses

Some details and motivation should be explained further. 1. The method requires training an adapter with data from the training dataset for data selection. If the noise ratio is high (such as 50%), wouldn't it be better not to use an adapter? 2. Do the Actual Selection Costs in Appendix G include the training time for the adapter? Is the comparison fair? 3. The ablation experiment shows that the selection loss has a significant impact on the final result. So, what is the se

Reviewer 02 · Rating 8 · Confidence 4

Strengths

The idea of exploiting multimodal features from the CLIP model is interesting and plausible. The proposed method is simple, which is a strength in my opinion. The method is also efficient and can control alignment, diversity, and selection ratio within a single multi-objective optimization. The paper is well written and easy to follow. The results are good.

Weaknesses

The proposed method relies on a pretrained CLIP model, so any biases in CLIP will propagate to the selected dataset. The proposed method optimizes alignment and diversity, but does it have any indirect effect on bias in the dataset? Is it possible to control bias in the dataset or in the subsequent models that are trained on the selected dataset? Does the STE cause convergence issues? The variable d (sample-wise parameter) can be easily confused with d (feature dimension). Cons

Reviewer 03 · Rating 8 · Confidence 3

Strengths

- Connecting the text and image information for dataset selection is novel. The proposed CLIP-based method implements the idea well. - The SGD-based selection optimization with multiple objectives is interesting. It differs from existing sampling strategies based on combinatorial optimization. - Sufficient experimental results show the effectiveness, including different datasets, different settings, and different model architectures.

Weaknesses

- The work relies on two adapters to project the CLIP features to the dataset-specific embedding space. The adapter training is based on perfect data. In fact, under the setting of noisy or corrupted data, this ideal data is inaccessible. Therefore, the experiments would be problematic. In Lines 514-515, the authors state that the adapter "is essential for effectively transferring the model’s generalization ability to target datasets". It seems the adapter has a large influence on the performance. Will using t
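
The reviews mention sample-wise parameters d, SGD-based selection, a selection ratio, and a straight-through estimator (STE). As a rough illustration only, and not the authors' implementation, the sketch below shows how such a selection could be optimized in plain Python. Everything here is an assumption made for the example: the function name `select_with_ste`, the loss (an alignment reward plus a squared penalty on the gap to the target ratio), the hyperparameters, and the alignment scores, which are made-up placeholders rather than CLIP outputs. The diversity objective is omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def select_with_ste(alignment, ratio, penalty=1.0, lr=1.0, steps=500):
    """Toy SGD-style selection with a straight-through estimator.

    alignment: hypothetical per-sample alignment scores (higher is better).
    ratio: target fraction of samples to keep.
    Loss (assumed for this sketch):
        -mean(mask * alignment) + penalty * (mean(mask) - ratio) ** 2
    """
    n = len(alignment)
    d = [1.0] * n  # sample-wise parameters; every sample starts selected
    for _ in range(steps):
        s = [sigmoid(v) for v in d]
        # Forward pass: hard 0/1 mask from thresholding the soft scores.
        mask = [1 if p >= 0.5 else 0 for p in s]
        gap = sum(mask) / n - ratio
        for i in range(n):
            # Gradient of the loss with respect to mask[i].
            grad_mask = (-alignment[i] + 2.0 * penalty * gap) / n
            # STE: treat d(mask)/d(s) as 1, then chain through the sigmoid.
            d[i] -= lr * grad_mask * s[i] * (1.0 - s[i])
    return [1 if sigmoid(v) >= 0.5 else 0 for v in d]


scores = [0.9, 0.8, 0.2, 0.1]  # placeholder scores, not real CLIP similarities
mask = select_with_ste(scores, ratio=0.5)
print(mask)
```

In this toy setup the two highest-scoring samples should end up selected and the lowest-scoring one dropped. A borderline sample may flip between selected and unselected from step to step, because the hard threshold makes the ratio penalty switch on and off; that behavior is one concrete way to read the reviewer's question about STE convergence.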

Taxonomy

Topics: Fuzzy Logic and Control Systems