SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification
Benjamin Feuer, Jiawei Xu, Niv Cohen, Patrick Yubeaton, Govind Mittal, Chinmay Hegde

TL;DR
This paper introduces SELECT, a large-scale benchmark for evaluating data curation strategies in image classification, along with a new dataset, ImageNet++, to systematically compare different approaches.
Contribution
It presents the first comprehensive benchmark for data curation strategies in image classification and creates a new dataset, ImageNet++, for systematic evaluation.
Findings
Curation strategies like synthetic data and CLIP-based lookup are competitive for certain tasks.
The original ImageNet-1K curation remains the most effective strategy.
The benchmark and dataset enable systematic comparison of data curation methods.
Abstract
Data curation is the problem of how to collect and organize samples into a dataset that supports efficient learning. Despite the centrality of the task, little work has been devoted towards a large-scale, systematic comparison of various curation methods. In this work, we take steps towards a formal evaluation of data curation strategies and introduce SELECT, the first large-scale benchmark of curation strategies for image classification. In order to generate baseline methods for the SELECT benchmark, we create a new dataset, ImageNet++, which constitutes the largest superset of ImageNet-1K to date. Our dataset extends ImageNet with 5 new training-data shifts, each approximately the size of ImageNet-1K itself, and each assembled using a distinct curation strategy. We evaluate our data curation baselines in two ways: (i) using each training-data shift to train identical image…
Taxonomy
Topics: AI in cancer detection · Image Retrieval and Classification Techniques
Methods: Contrastive Language-Image Pre-training
