Submodular Mutual Information for Targeted Data Subset Selection
Suraj Kothawade, Vishal Kaushal, Ganesh Ramakrishnan, Jeff Bilmes, Rishabh Iyer

TL;DR
This paper uses Submodular Mutual Information (SMI) functions for targeted data subset selection: unlabeled points similar to an auxiliary set are labeled and added to the training data, improving image classification performance at an additional labeling cost.
Contribution
It proposes targeted data subset selection (TSS) with SMI functions and compares it against state-of-the-art methods for closely related problems such as active learning.
Findings
Observed a ~20-30% gain over the model's performance before re-training with the added targeted subset.
Reported a gain ~12% larger than that of the other methods compared.
Validated on CIFAR-10 and MNIST datasets.
Abstract
With the rapid growth of data, it is becoming increasingly difficult to train or improve deep learning models with the right subset of data. We show that this problem can be effectively solved at an additional labeling cost by targeted data subset selection (TSS), where a subset of unlabeled data points similar to an auxiliary set is added to the training data. We do so by using a rich class of Submodular Mutual Information (SMI) functions and demonstrate their effectiveness for image classification on the CIFAR-10 and MNIST datasets. Lastly, we compare the performance of SMI functions for TSS with other state-of-the-art methods for closely related problems like active learning. Using SMI functions, we observe a ~20-30% gain over the model's performance before re-training with the added targeted subset; ~12% more than other methods.
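To make the selection step concrete, here is a minimal sketch of greedy targeted selection. The abstract does not say which SMI functions are used, so the objective below is an assumption: a facility-location-style function I(A;Q) = sum over q in Q of max over a in A of s(a,q), plus eta times the sum over a in A of max over q in Q of s(a,q), where A is the selected subset, Q is the auxiliary (target) set, and s is a non-negative similarity. The names `flqmi_value`, `greedy_select`, and `eta`, and the use of clipped cosine similarity on feature vectors, are all hypothetical choices for illustration and not the authors' implementation.

```python
import numpy as np

def cosine_similarity(X, Y):
    """Pairwise cosine similarity, clipped to [0, 1] so the objective stays monotone."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return np.clip(Xn @ Yn.T, 0.0, 1.0)

def flqmi_value(S, selected, eta=1.0):
    """Assumed facility-location-style SMI objective.

    S has shape (n_unlabeled, n_query); selected is a list of row indices.
    """
    if not selected:
        return 0.0
    sub = S[selected]                       # similarities of chosen points to the query set
    query_coverage = sub.max(axis=0).sum()  # best match in A for each query point
    relevance = sub.max(axis=1).sum()       # best match in Q for each chosen point
    return float(query_coverage + eta * relevance)

def greedy_select(S, budget, eta=1.0):
    """Greedily add the unlabeled point with the largest marginal gain."""
    selected = []
    remaining = set(range(S.shape[0]))
    current = 0.0
    for _ in range(min(budget, S.shape[0])):
        best_gain, best_idx = -np.inf, None
        for i in sorted(remaining):
            gain = flqmi_value(S, selected + [i], eta) - current
            if gain > best_gain:
                best_gain, best_idx = gain, i
        selected.append(best_idx)
        remaining.remove(best_idx)
        current += best_gain
    return selected
```

In use, `S = cosine_similarity(pool_features, target_features)` followed by `greedy_select(S, budget)` would return the indices of pool points to label and add to the training set; the feature extractor and the budget are left to the reader. The naive greedy loop re-evaluates the objective for every candidate, which is fine for a sketch but would need incremental gain updates for large pools.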
Taxonomy
Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Algorithms and Data Compression
