Selection via Proxy: Efficient Data Selection for Deep Learning
Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, Matei Zaharia

TL;DR
This paper introduces a proxy-based data selection method that significantly accelerates active learning and core-set selection in deep learning by using small, fast-to-train models as proxies, maintaining accuracy while reducing computation.
Contribution
The authors propose 'selection via proxy' (SVP), a novel approach that employs lightweight proxy models for efficient data selection in deep learning tasks, reducing training time substantially.
Findings
SVP achieves an order of magnitude faster data selection in active learning.
In core-set selection, proxies that train over 10x faster than the target model can remove up to 50% of the data without accuracy loss.
End-to-end training time improves by up to 1.6x with SVP.
Abstract
Data selection methods, such as active learning and core-set selection, are useful tools for machine learning on large datasets. However, they can be prohibitively expensive to apply in deep learning because they depend on feature representations that need to be learned. In this work, we show that we can greatly improve the computational efficiency by using a small proxy model to perform data selection (e.g., selecting data points to label for active learning). By removing hidden layers from the target model, using smaller architectures, and training for fewer epochs, we create proxies that are an order of magnitude faster to train. Although these small proxy models have higher error rates, we find that they empirically provide useful signals for data selection. We evaluate this "selection via proxy" (SVP) approach on several data selection tasks across five datasets: CIFAR10, CIFAR100,…
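As a rough illustration of the idea in the abstract, the sketch below trains a cheap proxy on a small labeled set, scores an unlabeled pool with it, and returns the points the proxy is least certain about. Everything here is an assumption made for illustration and not the authors' implementation: a NumPy softmax regression stands in for the proxy (the paper's proxies are smaller or shorter-trained deep networks), entropy is used as the selection criterion, and the names `train_proxy` and `select_via_proxy` are hypothetical.

```python
import numpy as np


def softmax(z):
    # Subtract the row-wise max for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def entropy(probs):
    # Predictive entropy per example; higher means more uncertain.
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)


def train_proxy(X, y, num_classes, epochs=20, lr=0.5, seed=0):
    # Hypothetical stand-in proxy: softmax regression trained for a few
    # epochs of full-batch gradient descent (cheap by construction).
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(X.shape[1], num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]  # one-hot labels
    for _ in range(epochs):
        grad = (softmax(X @ W + b) - Y) / len(X)
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def select_via_proxy(X_labeled, y_labeled, X_pool, num_classes, budget):
    # 1) Train the cheap proxy, 2) score the pool, 3) keep the `budget`
    # most uncertain points. The expensive target model would then be
    # trained on the selected points once they are labeled.
    W, b = train_proxy(X_labeled, y_labeled, num_classes)
    scores = entropy(softmax(X_pool @ W + b))
    return np.argsort(-scores)[:budget]


# Toy data (synthetic, for illustration only): two Gaussian blobs.
rng = np.random.default_rng(0)
X_labeled = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y_labeled = np.array([0] * 20 + [1] * 20)
X_pool = rng.normal(0, 2, (200, 2))

selected = select_via_proxy(X_labeled, y_labeled, X_pool, num_classes=2, budget=10)
print(selected)
```

The point of the design is that only the proxy is retrained and re-scored during selection; the larger target model is trained only on the data the proxy picked. For core-set selection, the same scoring step could instead be used to keep a fixed fraction of an already labeled dataset.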
Taxonomy
Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Imbalanced Data Classification Techniques
