Curator: Creating Large-Scale Curated Labelled Datasets using Self-Supervised Learning
Tarun Narayanan, Ajay Krishnan, Anirudh Koul, Siddha Ganju

TL;DR
Curator is a no-code pipeline that combines self-supervised learning, nearest neighbor search, and active learning to build large labeled datasets from vast pools of unlabelled data, substantially reducing curation time.
Contribution
The paper introduces Curator, a novel end-to-end system that automates dataset curation using self-supervised learning and scalable search, applicable across various domains.
Findings
Dramatically reduces dataset curation time.
Enables creation of comprehensive datasets from minimal references.
Applicable to multiple domains with unlabelled data.
Abstract
Applying machine learning to domains like Earth Sciences is impeded by the lack of labeled data, despite a large corpus of raw data available in such domains. For instance, training a wildfire classifier on satellite imagery requires curating a massive and diverse dataset, which is an expensive and time-consuming process that can span from weeks to months. Searching for relevant examples in over 40 petabytes of unlabelled data requires researchers to manually hunt for such images, much like finding a needle in a haystack. We present a no-code end-to-end pipeline, Curator, which dramatically minimizes the time taken to curate an exhaustive labeled dataset. Curator is able to search massive amounts of unlabelled data by combining self-supervision, scalable nearest neighbor search, and active learning to learn and differentiate image representations. The pipeline can also be readily…
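To make the retrieval idea in the abstract concrete, here is a minimal sketch of two of the named ingredients: nearest neighbor search over image embeddings, and an active learning step that picks items for human labeling. It is not Curator's implementation. The brute-force cosine search stands in for a scalable index, uncertainty sampling is an assumed choice of active learning strategy, and the tile names, embedding vectors, and classifier scores are hypothetical toy values.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, embeddings, k):
    """Return the ids of the k embeddings most similar to the query.

    Brute force for clarity; a large archive would need an approximate
    nearest neighbor index instead (assumption, not from the source).
    """
    scored = [(cosine_similarity(query, vec), key) for key, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [key for _, key in scored[:k]]


def most_uncertain(scores, k):
    """Uncertainty sampling: the k items whose score is closest to 0.5."""
    return sorted(scores, key=lambda key: abs(scores[key] - 0.5))[:k]


# Hypothetical embeddings, as if produced by a self-supervised encoder.
embeddings = {
    "tile_a": [0.9, 0.1, 0.0],
    "tile_b": [0.8, 0.2, 0.1],
    "tile_c": [0.0, 1.0, 0.0],
    "tile_d": [0.1, 0.0, 0.9],
}

# One labeled reference image (for example, a known wildfire tile).
reference = [1.0, 0.0, 0.0]
candidates = nearest_neighbors(reference, embeddings, k=2)

# Hypothetical classifier probabilities for the positive class.
scores = {"tile_a": 0.95, "tile_b": 0.55, "tile_c": 0.48, "tile_d": 0.05}
to_label = most_uncertain(scores, k=2)
```

In this toy setup, the search surfaces the tiles whose embeddings point in nearly the same direction as the reference, and the sampling step sends the tiles the classifier is least sure about to a human annotator.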
Taxonomy
Topics: Image Retrieval and Classification Techniques · Cell Image Analysis Techniques · AI in cancer detection
