Loss-Proportional Subsampling for Subsequent ERM
Paul Mineiro, Nikos Karampatziakis

TL;DR
This paper introduces a loss-proportional subsampling method that reduces a data set before empirical risk minimization (ERM), with a guarantee that the final excess risk compares favorably with training on the full data. The authors demonstrate it on a large dataset that is subsampled and then fit with boosted trees.
Contribution
The paper presents a sampling scheme that considers only a subset of the ultimate (unknown) hypothesis set when reducing the data, yet still guarantees that the final excess risk compares favorably with using the entire original data set.
Findings
Effective data reduction prior to ERM
Guarantees on excess risk compared to full data
Practical benefits shown on a large dataset that is subsampled and then fit with boosted trees
Abstract
We propose a sampling scheme suitable for reducing a data set prior to selecting a hypothesis with minimum empirical risk. The sampling only considers a subset of the ultimate (unknown) hypothesis set, but can nonetheless guarantee that the final excess risk will compare favorably with utilizing the entire original data set. We demonstrate the practical benefits of our approach on a large dataset which we subsample and subsequently fit with boosted trees.
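The abstract does not spell out the sampling rule, so the following is only a minimal sketch of the general idea the title suggests, not the authors' exact procedure. It assumes that per-example losses have already been computed under some cheap pilot hypothesis, that each example is kept with probability proportional to that loss (clipped to a floor `p_min` and a ceiling of 1), and that kept examples carry importance weight 1/p so weighted sums stay unbiased. The function name and the parameters `rate`, `p_min`, and `seed` are hypothetical.

```python
import random


def loss_proportional_subsample(losses, rate, p_min=0.01, seed=0):
    """Keep example i with probability p_i = clip(rate * loss_i, p_min, 1).

    Returns a list of (index, importance_weight) pairs with weight = 1 / p_i,
    so that a weighted sum over the subsample is an unbiased estimate of the
    corresponding sum over the full data set.

    Assumption (not from the abstract): `losses` are non-negative losses of a
    pilot hypothesis on each example.
    """
    rng = random.Random(seed)
    kept = []
    for i, loss in enumerate(losses):
        # Higher pilot loss means higher chance of being kept.
        p = min(1.0, max(p_min, rate * loss))
        if rng.random() < p:
            kept.append((i, 1.0 / p))
    return kept


# Illustrative pilot losses (made-up numbers, for demonstration only).
pilot_losses = [0.0, 0.02, 0.5, 1.3, 0.01, 2.0]
sample = loss_proportional_subsample(pilot_losses, rate=0.5)

# Importance-weighted total loss, an estimate of sum(pilot_losses).
weighted_total = sum(w * pilot_losses[i] for i, w in sample)
```

The subsequent ERM step would then train on the kept examples using these importance weights, for instance through a learner's per-example weight argument. The floor `p_min` is a design choice in this sketch that bounds the largest weight at `1 / p_min`.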
Taxonomy
Topics: Machine Learning and Algorithms · Imbalanced Data Classification Techniques · Domain Adaptation and Few-Shot Learning
