Learning from Multiple Corrupted Sources, with Application to Learning from Label Proportions
Clayton Scott, Jianxin Zhang

TL;DR
This paper develops a method for binary classification from multiple corrupted training samples, proves a generalization error bound for it, and applies it to learning with label proportions (LLP), where the authors report the most general statistical performance guarantees known for that problem.
Contribution
Introduces an approach that minimizes a weighted combination of corruption-corrected empirical risks over multiple corrupted sources, with a generalization error bound and an application to learning with label proportions.
Findings
The paper establishes a generalization error bound for the proposed method and identifies the weights that optimize that bound.
The bound-optimizing weights are interpretable functions of each source's sample size and degree of corruption.
Experiments demonstrate the utility of the theory.
Abstract
We study binary classification in the setting where the learner is presented with multiple corrupted training samples, with possibly different sample sizes and degrees of corruption, and introduce an approach based on minimizing a weighted combination of corruption-corrected empirical risks. We establish a generalization error bound, and further show that the bound is optimized when the weights are certain interpretable and intuitive functions of the sample sizes and degrees of corruption. We then apply this setting to the problem of learning with label proportions (LLP), and propose an algorithm that enjoys the most general statistical performance guarantees known for LLP. Experiments demonstrate the utility of our theory.
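As a rough illustration of what "a weighted combination of corruption-corrected empirical risks" can look like, the sketch below uses the standard backward-corrected loss for class-conditional label noise. This is a swapped-in, well-known correction and not necessarily the corruption model or estimator used in the paper. The function names, the logistic loss, and the noise rates `rho_pos` and `rho_neg` (assumed known) are all illustrative assumptions, and the weights are left as user-supplied inputs because the abstract does not give the bound-optimizing formula.

```python
import numpy as np


def logistic_loss(scores, labels):
    """Elementwise logistic loss log(1 + exp(-y * t)) for labels in {-1, +1}."""
    return np.logaddexp(0.0, -labels * scores)


def corrected_risk(scores, noisy_labels, rho_pos, rho_neg):
    """Corruption-corrected empirical risk for one source (illustrative).

    Assumes class-conditional label noise with known flip rates:
      rho_pos = P(observed -1 | true +1), rho_neg = P(observed +1 | true -1),
    and rho_pos + rho_neg < 1. This is an assumption of the sketch,
    not a statement of the paper's corruption model.
    """
    # Flip rate of the observed label's class, and of the opposite class.
    rho_y = np.where(noisy_labels == 1, rho_pos, rho_neg)
    rho_not_y = np.where(noisy_labels == 1, rho_neg, rho_pos)
    loss_y = logistic_loss(scores, noisy_labels)
    loss_flipped = logistic_loss(scores, -noisy_labels)
    # Backward correction: unbiased for the clean loss under the assumed noise.
    corrected = ((1.0 - rho_not_y) * loss_y - rho_y * loss_flipped) / (
        1.0 - rho_pos - rho_neg
    )
    return float(corrected.mean())


def weighted_corrected_risk(sources, weights):
    """Weighted combination of per-source corrected risks.

    `sources` is a list of (scores, noisy_labels, rho_pos, rho_neg) tuples.
    `weights` are supplied by the caller and normalized to sum to one.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    risks = [corrected_risk(s, y, rp, rn) for (s, y, rp, rn) in sources]
    return float(np.dot(weights, risks))
```

In use, the scores would come from the classifier being trained, and the combined risk would be minimized over its parameters. Intuitively, larger and cleaner sources should receive more weight, which matches the abstract's description of weights that depend on sample size and degree of corruption.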
Taxonomy
Topics: Machine Learning and Data Classification · Machine Learning and Algorithms · Imbalanced Data Classification Techniques
