Coresets for Data-efficient Training of Machine Learning Models
Baharan Mirzasoleiman, Jeff Bilmes, Jure Leskovec

TL;DR
This paper introduces CRAIG, a novel coreset selection method that enables data-efficient training of machine learning models by selecting a weighted subset of data, achieving significant speedups without sacrificing accuracy.
Contribution
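To the authors' knowledge, CRAIG is the first rigorous method for data-efficient training of general machine learning models. Applying incremental gradient methods to the selected weighted subset is proven to converge to a (near-)optimal solution at the same rate as IG for convex optimization, which yields a speedup inversely proportional to the subset size.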
Findings
CRAIG achieves up to 6x speedup for logistic regression.
CRAIG achieves up to 3x speedup for deep neural networks.
CRAIG maintains solution quality comparable to full dataset training.
Abstract
Incremental gradient (IG) methods, such as stochastic gradient descent and its variants, are commonly used for large-scale optimization in machine learning. Despite the sustained effort to make IG methods more data-efficient, it remains an open question how to select a training data subset that can theoretically and practically perform on par with the full dataset. Here we develop CRAIG, a method to select a weighted subset (or coreset) of training data that closely estimates the full gradient by maximizing a submodular function. We prove that applying IG to this subset is guaranteed to converge to the (near)optimal solution with the same convergence rate as that of IG for convex optimization. As a result, CRAIG achieves a speedup that is inversely proportional to the size of the subset. To our knowledge, this is the first rigorous method for data-efficient training of general machine…
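The selection step described in the abstract (pick a weighted subset by maximizing a submodular function) can be illustrated with a minimal sketch. The abstract does not name the submodular function, so this sketch assumes a facility-location objective over pairwise distances, maximized with the standard greedy algorithm. The function name select_coreset, the use of raw feature vectors as a stand-in for per-example gradients, and the rule that each weight equals the number of points assigned to a selected element are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def select_coreset(features, k):
    """Greedy facility-location selection (illustrative sketch, not the paper's code).

    features: (n, d) array used here as a hypothetical proxy for per-example gradients.
    k: number of elements to select.
    Returns (indices, weights); weights[j] counts the points assigned to indices[j].
    """
    n = features.shape[0]
    # Pairwise Euclidean distances between all points.
    sq = np.sum(features ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    dist = np.sqrt(np.maximum(d2, 0.0))
    # Turn distances into non-negative similarities.
    sim = dist.max() - dist

    best = np.zeros(n)               # best similarity of each point to the chosen set
    chosen = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(k):
        # Marginal gain of adding candidate j: improvement in sum_i max-similarity.
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        gains[chosen] = -np.inf      # never pick the same element twice
        j = int(np.argmax(gains))
        selected.append(j)
        chosen[j] = True
        best = np.maximum(best, sim[:, j])

    # Assign every point to its most similar selected element; weight = cluster size.
    assign = np.argmax(sim[:, selected], axis=1)
    weights = np.bincount(assign, minlength=k)
    return np.array(selected), weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(0.0, 0.1, (30, 2)), rng.normal(5.0, 0.1, (20, 2))])
    idx, w = select_coreset(data, 2)
    print(idx, w)
```

In a training loop, each selected example's gradient would be scaled by its weight so that the weighted sum over the subset approximates the gradient over the full dataset. Because the weights count assigned points, they always sum to the dataset size.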
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Machine Learning and ELM
Methods: Logistic Regression
