Coresets for Data-efficient Training of Machine Learning Models

Baharan Mirzasoleiman; Jeff Bilmes; Jure Leskovec

arXiv:1906.01827 · cs.LG · November 18, 2020 · 37 cites

TL;DR

This paper introduces CRAIG, a novel coreset selection method that enables data-efficient training of machine learning models by selecting a weighted subset of data, achieving significant speedups without sacrificing accuracy.

Contribution

CRAIG is the first rigorous method for data-efficient training that guarantees convergence and speedup for general machine learning models using incremental gradient methods.

Findings

01 CRAIG achieves up to 6x speedup for logistic regression.

02 CRAIG achieves up to 3x speedup for deep neural networks.

03 CRAIG maintains solution quality comparable to full dataset training.

Abstract

Incremental gradient (IG) methods, such as stochastic gradient descent and its variants, are commonly used for large-scale optimization in machine learning. Despite the sustained effort to make IG methods more data-efficient, it remains an open question how to select a training data subset that can theoretically and practically perform on par with the full dataset. Here we develop CRAIG, a method to select a weighted subset (or coreset) of training data that closely estimates the full gradient by maximizing a submodular function. We prove that applying IG to this subset is guaranteed to converge to the (near-)optimal solution with the same convergence rate as that of IG for convex optimization. As a result, CRAIG achieves a speedup that is inversely proportional to the size of the subset. To our knowledge, this is the first rigorous method for data-efficient training of general machine…
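
To make the selection step in the abstract concrete, here is a minimal sketch of greedy weighted-subset selection under a submodular objective. It is an illustration under stated assumptions, not the authors' implementation: it assumes a facility-location objective, uses Euclidean distance between feature vectors as a stand-in for per-example gradient differences, and the function name `greedy_facility_location` is hypothetical.

```python
import numpy as np

def greedy_facility_location(features, k):
    """Greedily pick k points that maximize a facility-location objective.

    Assumption: similarity between examples is derived from Euclidean
    distance in feature space, used here as a proxy for how close their
    gradients are. Returns (selected indices, per-element weights).
    """
    n = features.shape[0]
    # Pairwise distances, turned into non-negative similarities.
    diffs = features[:, None, :] - features[None, :, :]
    dist = np.linalg.norm(diffs, axis=2)
    sim = dist.max() - dist

    selected = []
    best_sim = np.zeros(n)  # best similarity of each point to the chosen set
    for _ in range(k):
        # Marginal gain of candidate j: total improvement in coverage.
        gains = np.maximum(sim - best_sim[:, None], 0.0).sum(axis=0)
        if selected:
            gains[selected] = -np.inf  # never pick the same point twice
        j = int(np.argmax(gains))
        selected.append(j)
        best_sim = np.maximum(best_sim, sim[:, j])

    # Weight of a selected element = number of points it represents best.
    assignment = np.argmax(sim[:, selected], axis=1)
    weights = np.bincount(assignment, minlength=k)
    return np.array(selected), weights
```

In a training loop, each selected example's gradient would then be scaled by its weight, so that the weighted sum over the subset approximates the sum over the full dataset. The quadratic-memory distance matrix is acceptable for a sketch but would need batching or approximation at scale.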

Code & Models

Videos

Coresets for Data-efficient Training of Machine Learning Models · SlidesLive

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Machine Learning and ELM

Methods: Logistic Regression