Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
Huy V. Vo, Vasil Khalidov, Timothée Darcet, Théo Moutakanni, Nikita Smetanin, Marc Szafraniec, Hugo Touvron, Camille Couprie, Maxime Oquab, Armand Joulin, Hervé Jégou, Patrick Labatut, Piotr Bojanowski

TL;DR
This paper introduces a clustering-based method for automatically curating high-quality, diverse, and balanced datasets for self-supervised learning, reducing manual effort and improving feature quality across multiple data domains.
Contribution
The authors propose a novel hierarchical clustering and sampling approach for automatic dataset curation tailored for self-supervised pre-training, demonstrating its effectiveness across various data types.
Findings
Features trained on curated datasets outperform those trained on uncurated data.
Automatically curated datasets match or surpass manually curated datasets.
The method is effective across image and text domains.
Abstract
Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation typically require extensive human effort. This manual process has some limitations similar to those encountered in supervised learning, e.g., the crowd-sourced selection of data is costly and time-consuming, preventing scaling the dataset size. In this work, we consider the problem of automatic curation of high-quality datasets for self-supervised pre-training. We posit that such datasets should be large, diverse and balanced, and propose a clustering-based approach for building ones satisfying all these criteria. Our method involves successive and hierarchical applications of k-means on a large and diverse data repository to obtain clusters that distribute uniformly among data concepts, followed by a hierarchical, balanced…
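The pipeline the abstract outlines (successive k-means, then balanced sampling over the resulting hierarchy) can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated before the sampling step is described, so the equal-quota rule below is an assumption, as are the function names, the farthest-first initialisation (chosen only to keep the example deterministic), and the idea of clustering the previous level's centroids at each new level.

```python
import numpy as np

def _assign(points, centroids):
    # Index of the nearest centroid (squared Euclidean distance) for every point.
    d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's k-means with farthest-first initialisation (an assumption)."""
    chosen = [0]
    min_d = ((points - points[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        nxt = int(min_d.argmax())  # point farthest from all chosen centroids
        chosen.append(nxt)
        min_d = np.minimum(min_d, ((points - points[nxt]) ** 2).sum(axis=1))
    centroids = points[chosen].copy()
    for _ in range(n_iter):
        assign = _assign(points, centroids)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return centroids, _assign(points, centroids)

def hierarchical_kmeans(points, ks):
    """Level 0 clusters the points; each later level clusters the previous
    level's centroids. Returns, per level, the cluster id of every point."""
    labels_per_level = []
    current = points
    point_to_node = np.arange(len(points))
    for k in ks:
        centroids, assign = kmeans(current, k)
        point_to_node = assign[point_to_node]
        labels_per_level.append(point_to_node.copy())
        current = centroids
    return labels_per_level

def balanced_sample(labels_per_level, budget, seed=0):
    """Hypothetical top-down sampler: split the budget equally among the
    clusters at each level, then draw points at random inside the leaves."""
    rng = np.random.default_rng(seed)

    def recurse(indices, level, quota):
        if level < 0:
            take = min(quota, len(indices))
            return rng.choice(indices, size=take, replace=False).tolist()
        labels = labels_per_level[level][indices]
        groups = [indices[labels == c] for c in np.unique(labels)]
        share = quota // len(groups)
        picked = []
        for g in groups:
            picked.extend(recurse(g, level - 1, share))
        return picked

    all_idx = np.arange(len(labels_per_level[0]))
    return np.array(recurse(all_idx, len(labels_per_level) - 1, budget), dtype=int)
```

Calling `hierarchical_kmeans(features, ks=[10, 2])` and then `balanced_sample(levels, budget=40)` returns indices into `features`. Because the quota is split equally at each level rather than in proportion to cluster size, rare concepts get a larger share of the sample than they have in the raw pool, which is the balancing effect the abstract is after.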
Taxonomy
Topics: Data Mining Algorithms and Applications
