Optimal Data Selection: An Online Distributed View
Mariel Werner, Anastasios Angelopoulos, Stephen Bates, Michael I. Jordan

TL;DR
This paper introduces algorithms for online distributed data selection using submodular maximization, effectively reducing redundant data while maintaining theoretical guarantees and improving learning performance on ImageNet and MNIST.
Contribution
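The paper develops what are, to the authors' knowledge, the first online distributed submodular optimization algorithms with provable guarantees, built on a flexible core selection routine that works with any data stream, any value function, and any formulation of selection cost.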
Findings
The algorithms achieve the same theoretical guarantees as their offline counterparts.
Methods outperform random selection by 5-20%.
Effective in large-scale learning tasks on ImageNet and MNIST.
Abstract
The blessing of ubiquitous data also comes with a curse: the communication, storage, and labeling of massive, mostly redundant datasets. We seek to solve this problem at its core, collecting only valuable data and throwing out the rest via submodular maximization. Specifically, we develop algorithms for the online and distributed version of the problem, where data selection occurs in an uncoordinated fashion across multiple data streams. We design a general and flexible core selection routine for our algorithms which, given any stream of data, any assessment of its value, and any formulation of its selection cost, extracts the most valuable subset of the stream up to a constant factor while using minimal memory. Notably, our methods have the same theoretical guarantees as their offline counterparts, and, as far as we know, provide the first guarantees for online distributed submodular…
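As a rough illustration of what selecting from a stream by marginal value looks like, the sketch below applies a generic single-threshold rule to a coverage function, a standard example of a monotone submodular objective. It is not the paper's core selection routine: the names `coverage_value` and `threshold_stream_select`, the cardinality `budget`, and the fixed `threshold` are all assumptions made for this example, and the sketch makes no claim to the guarantees stated in the abstract.

```python
from typing import Callable, Hashable, Iterable, List, Set

Item = Set[Hashable]


def coverage_value(selected: List[Item]) -> float:
    """Monotone submodular value: number of distinct elements covered."""
    covered: Set[Hashable] = set()
    for s in selected:
        covered |= s
    return float(len(covered))


def threshold_stream_select(
    stream: Iterable[Item],
    value: Callable[[List[Item]], float],
    budget: int,
    threshold: float,
) -> List[Item]:
    """Hypothetical one-pass rule: keep an item if its marginal gain
    over the current selection is at least `threshold`, until `budget`
    items are held. Memory is bounded by the selection itself."""
    selected: List[Item] = []
    current = value(selected)
    for item in stream:
        if len(selected) >= budget:
            break  # budget exhausted; ignore the rest of the stream
        gain = value(selected + [item]) - current
        if gain >= threshold:
            selected.append(item)
            current += gain
    return selected


# Toy stream: each item "covers" a few elements; redundant items add little.
stream = [{1, 2, 3}, {2, 3}, {4, 5}, {1, 5}, {6}, {7, 8, 9}]
chosen = threshold_stream_select(stream, coverage_value, budget=3, threshold=2.0)
print(chosen, coverage_value(chosen))
```

In this toy run the redundant items `{2, 3}` and `{1, 5}` are skipped because they add nothing new, and `{6}` is skipped because its gain falls below the threshold. A practical method would also need a principled way to choose the threshold and to account for selection cost, which this sketch leaves out.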
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Privacy-Preserving Technologies in Data
