apricot: Submodular selection for data summarization in Python

Jacob Schreiber; Jeffrey Bilmes; William Stafford Noble

arXiv:1906.03543 · cs.LG · June 11, 2019 · 25 cites

TL;DR

Apricot is a Python package that efficiently selects representative data subsets using submodular optimization, enabling scalable data summarization for machine learning applications with strong theoretical guarantees.

Contribution

The paper introduces apricot, a new Python library implementing scalable submodular selection algorithms with practical efficiency and theoretical guarantees for data summarization.

Findings

01

Greedy subset selection is efficient and carries strong theoretical guarantees on the quality of the selected set.

02

The feature-based function scales to data sets with millions of examples.

03

Models trained on a representative subset reach accuracy comparable to models trained on the full data set.
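
To illustrate why a feature-based function can scale, here is a minimal pure-Python sketch. It assumes the common form f(S) = sum over features of sqrt(total feature value in S) on non-negative features; the square root is one possible concave function, and the names `feature_based_value` and `greedy_feature_based` are hypothetical, not apricot's API. Evaluating a gain needs only per-feature running totals, so no pairwise similarity matrix is stored.

```python
import math


def feature_based_value(totals):
    """f(S) = sum over features of sqrt(total feature value in S)."""
    return sum(math.sqrt(t) for t in totals)


def greedy_feature_based(X, k):
    """Greedily pick k rows of a non-negative feature matrix X (list of lists)."""
    totals = [0.0] * len(X[0])  # per-feature sums over the selected rows
    selected = []
    remaining = set(range(len(X)))
    for _ in range(min(k, len(X))):
        current = feature_based_value(totals)
        best_j, best_gain = None, -1.0
        for j in sorted(remaining):
            new_totals = [t + x for t, x in zip(totals, X[j])]
            gain = feature_based_value(new_totals) - current
            if gain > best_gain:  # ties keep the lowest index
                best_j, best_gain = j, gain
        selected.append(best_j)
        remaining.remove(best_j)
        totals = [t + x for t, x in zip(totals, X[best_j])]
    return selected


# Toy data (made up for illustration): rows 0 and 1 are duplicates.
X = [
    [9.0, 0.0, 0.0],
    [9.0, 0.0, 0.0],
    [0.0, 4.0, 0.0],
    [0.0, 0.0, 2.25],
]
selected = greedy_feature_based(X, 3)
print(selected)
```

Because the square root is concave, adding a second copy of an already-covered feature yields a smaller gain than adding a row that covers a new feature, which is why the duplicate row is passed over in this toy example.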

Abstract

We present apricot, an open source Python package for selecting representative subsets from large data sets using submodular optimization. The package implements an efficient greedy selection algorithm that offers strong theoretical guarantees on the quality of the selected set. Two submodular set functions are implemented in apricot: facility location, which is broadly applicable but requires memory quadratic in the number of examples in the data set, and a feature-based function that is less broadly applicable but can scale to millions of examples. Apricot is extremely efficient, using both algorithmic speedups such as the lazy greedy algorithm and code optimizers such as numba. We demonstrate the use of subset selection by training machine learning models to comparable accuracy using either the full data set or a representative subset thereof. This paper presents an explanation of…
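
The facility location function and the lazy greedy speedup mentioned in the abstract can be sketched as follows. This is a pure-Python illustration under stated assumptions, not apricot's implementation: the function names, the toy points, and the similarity 1 / (1 + distance) are all hypothetical choices. Facility location scores a set S as the sum, over every example, of its maximum similarity to any member of S, which is why the full pairwise similarity matrix (quadratic memory) is needed.

```python
import heapq


def facility_location_gain(sim, best, j):
    """Marginal gain of adding candidate j, given each example's best similarity so far."""
    return sum(max(s - b, 0.0) for s, b in zip(sim[j], best))


def lazy_greedy_facility_location(sim, k):
    """Select k indices from a symmetric, non-negative similarity matrix."""
    n = len(sim)
    best = [0.0] * n  # best[i] = max similarity of example i to the selected set
    # Max-heap (via negated gains) of upper bounds on each candidate's gain.
    heap = [(-facility_location_gain(sim, best, j), j) for j in range(n)]
    heapq.heapify(heap)
    selected = []
    while heap and len(selected) < k:
        _, j = heapq.heappop(heap)
        gain = facility_location_gain(sim, best, j)  # refresh the stale bound
        # Submodularity means gains never grow as the set grows, so a refreshed
        # gain that still beats the next upper bound is the true greedy choice.
        if not heap or gain >= -heap[0][0]:
            selected.append(j)
            best = [max(b, s) for b, s in zip(best, sim[j])]
        else:
            heapq.heappush(heap, (-gain, j))
    return selected


# Toy 1-D data (made up): three loose groups near 0, 5, and 10.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0]
sim = [[1.0 / (1.0 + abs(a - b)) for b in points] for a in points]
selected = lazy_greedy_facility_location(sim, 3)
print(selected)
```

The lazy variant returns the same set as plain greedy selection but usually re-evaluates only a few candidates per step, since most stale upper bounds are already too small to win.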

Code & Models

Repositories

jmschrei/apricot (official)

Taxonomy

Topics: Algorithms and Data Compression · Machine Learning and Algorithms · Machine Learning and Data Classification