Active Data Discovery: Mining Unknown Data using Submodular Information Measures
Suraj Kothawade, Shivang Chopra, Saikat Ghosh, Rishabh Iyer

TL;DR
This paper introduces an active data discovery framework that efficiently identifies rare classes and slices in datasets using submodular information measures, improving labeling efficiency and accuracy in various scenarios.
Contribution
The paper proposes a framework that uses submodular functions to discover unknown rare data instances even when the labeled seed set contains none of them, and applies it to image classification and object detection.
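As a rough illustration of how a submodular function can steer selection toward data unlike what is already labeled, the sketch below greedily maximizes a facility-location conditional gain, f(A | P) = f(A ∪ P) - f(P), on a toy 2-D dataset. It is a toy under stated assumptions, not the authors' implementation: the RBF similarity, the function names (facility_location, greedy_conditional_gain), and the synthetic points are all hypothetical choices made for this example.

```python
import numpy as np


def facility_location(sim, selected):
    """Facility location: f(S) = sum_i max_{j in S} sim[i, j]; f(empty) = 0."""
    if len(selected) == 0:
        return 0.0
    return float(sim[:, list(selected)].max(axis=1).sum())


def greedy_conditional_gain(sim, known, candidates, budget):
    """Greedily pick `budget` candidates maximizing f(A | known) = f(A ∪ known) - f(known)."""
    base = list(known)
    chosen = []
    for _ in range(budget):
        current = facility_location(sim, base + chosen)
        best, best_gain = None, -np.inf
        for c in candidates:
            if c in chosen:
                continue
            # Marginal gain of adding c on top of the known set and earlier picks.
            gain = facility_location(sim, base + chosen + [c]) - current
            if gain > best_gain:
                best, best_gain = c, gain
        chosen.append(best)
    return chosen


# Hypothetical toy data: one dense "common" cluster and two small "rare" clusters.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # cluster 0 (common)
    [5.0, 5.0], [5.1, 5.0],                          # cluster 1 (rare)
    [-5.0, 5.0], [-5.0, 5.1],                        # cluster 2 (rare)
])
labels = [0, 0, 0, 0, 1, 1, 2, 2]

# Assumed similarity: RBF kernel on squared Euclidean distance.
diff = points[:, None, :] - points[None, :, :]
sim = np.exp(-(diff ** 2).sum(axis=-1))

known = [0, 1]                  # labeled seed set: common cluster only
candidates = [2, 3, 4, 5, 6, 7]  # unlabeled pool
picks = greedy_conditional_gain(sim, known, candidates, budget=2)
print("picked indices:", picks, "clusters:", [labels[i] for i in picks])
```

Because the seed set here covers only the dense cluster, candidates from that cluster add almost nothing to the objective, and the two picks fall in the two small clusters the labeled data has never seen.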
Findings
Significant accuracy improvements over existing methods.
Enhanced labeling efficiency in discovering rare data.
Effective across image classification and object detection.
Abstract
Active Learning is a very common yet powerful framework for iteratively and adaptively sampling subsets of the unlabeled sets with a human in the loop with the goal of achieving labeling efficiency. Most real-world datasets have imbalance either in classes or slices, and correspondingly, parts of the dataset are rare. As a result, there has been a lot of work in designing active learning approaches for mining these rare data instances. Most approaches assume access to a seed set of instances that contain these rare data instances. However, in the event of more extreme rareness, it is reasonable to assume that these rare data instances (either classes or slices) may not even be present in the seed labeled set, and a critical need for the active learning paradigm is to efficiently discover these rare data instances. In this work, we provide an active data discovery framework which can…
Taxonomy
Topics: Machine Learning and Algorithms · Imbalanced Data Classification Techniques · SARS-CoV-2 detection and testing
