Learning from Multiple Corrupted Sources, with Application to Learning from Label Proportions

Clayton Scott; Jianxin Zhang

arXiv:1910.04665 · stat.ML · October 11, 2019 · 5 cites

Open Access

TL;DR

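This paper develops a method for binary classification from multiple corrupted training samples, establishes a generalization error bound for it, and applies it to learning from label proportions (LLP), where the resulting algorithm carries what the authors describe as the most general statistical performance guarantees known for LLP.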

Contribution

Introduces a weighted empirical risk minimization approach for learning from multiple corrupted sources, with theoretical error bounds and application to label proportions learning.

Findings

01

The paper establishes a generalization error bound for the proposed method and identifies the weights that optimize it.

02

The bound-optimizing weights are interpretable functions of each sample's size and degree of corruption.

03

Experiments demonstrate the utility of the theory.

Abstract

We study binary classification in the setting where the learner is presented with multiple corrupted training samples, with possibly different sample sizes and degrees of corruption, and introduce an approach based on minimizing a weighted combination of corruption-corrected empirical risks. We establish a generalization error bound, and further show that the bound is optimized when the weights are certain interpretable and intuitive functions of the sample sizes and degrees of corruptions. We then apply this setting to the problem of learning with label proportions (LLP), and propose an algorithm that enjoys the most general statistical performance guarantees known for LLP. Experiments demonstrate the utility of our theory.
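
As a rough illustration of what "a weighted combination of corruption-corrected empirical risks" can look like, the sketch below assumes class-conditional label noise with known flip rates and uses the standard unbiased loss correction for that noise model. This is a swapped-in, simplified setting: the paper's own corruption model, its correction, and its bound-optimizing weights are not reproduced here. The function names, the choice of logistic loss, and the user-supplied weights are all illustrative assumptions.

```python
import numpy as np


def logistic_loss(scores, labels):
    """Logistic loss log(1 + exp(-y * t)) for labels in {-1, +1}."""
    return np.logaddexp(0.0, -labels * scores)


def corrected_loss(scores, noisy_labels, rho_pos, rho_neg):
    """Unbiased loss estimate under class-conditional label noise.

    Assumed (hypothetical) noise model:
      rho_pos = P(observed -1 | true +1)
      rho_neg = P(observed +1 | true -1)
    Both rates are taken as known, with rho_pos + rho_neg < 1.
    """
    # Flip rate of the observed class and of the opposite class.
    rho_same = np.where(noisy_labels == 1, rho_pos, rho_neg)
    rho_other = np.where(noisy_labels == 1, rho_neg, rho_pos)
    loss_observed = logistic_loss(scores, noisy_labels)
    loss_flipped = logistic_loss(scores, -noisy_labels)
    denom = 1.0 - rho_pos - rho_neg
    return ((1.0 - rho_other) * loss_observed - rho_same * loss_flipped) / denom


def weighted_corrected_risk(w, sources, weights):
    """Weighted sum of per-source corrected empirical risks for a linear scorer.

    sources: list of (X, noisy_y, rho_pos, rho_neg) tuples, one per sample.
    weights: nonnegative numbers, one per source (placeholders chosen by
             the caller, not the optimal weights derived in the paper).
    """
    total = 0.0
    for (X, y, rho_pos, rho_neg), weight in zip(sources, weights):
        total += weight * corrected_loss(X @ w, y, rho_pos, rho_neg).mean()
    return total
```

Under these assumptions, each source contributes an unbiased estimate of the clean risk, and the weights control how much each source is trusted. The resulting objective can be passed to any generic optimizer; note that the corrected loss need not be convex in `w` even when the base loss is.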

Taxonomy

Topics: Machine Learning and Data Classification · Machine Learning and Algorithms · Imbalanced Data Classification Techniques