Training Subset Selection for Weak Supervision
Hunter Lang, Aravindan Vijayaraghavan, David Sontag

TL;DR
This paper demonstrates that selecting high-quality subsets of weakly-labeled data can significantly improve classifier performance, challenging the common practice of using all available weakly-labeled data in weak supervision.
Contribution
The paper introduces a simple subset selection method based on the cut statistic that improves weak supervision by trading off the quantity of weakly-labeled data against the quality of its labels.
Findings
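Training on less weakly-labeled data improves the accuracy of weak supervision pipelines by up to 19% on benchmarks.
The method applies to any label model and classifier and plugs into existing pipelines with a few lines of code.
Theory and experiments both show that training on a selected subset can outperform training on all weakly-labeled data.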
Abstract
Existing weak supervision approaches use all the data covered by weak signals to train a classifier. We show both theoretically and empirically that this is not always optimal. Intuitively, there is a tradeoff between the amount of weakly-labeled data and the precision of the weak labels. We explore this tradeoff by combining pretrained data representations with the cut statistic (Muhlenbach et al., 2004) to select (hopefully) high-quality subsets of the weakly-labeled training data. Subset selection applies to any label model and classifier and is very simple to plug in to existing weak supervision pipelines, requiring just a few lines of code. We show our subset selection method improves the performance of weak supervision for a wide range of label models, classifiers, and datasets. Using less weakly-labeled data improves the accuracy of weak supervision pipelines by up to 19%…
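The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a directed k-nearest-neighbor graph over the pretrained representations, edge weights of 1 / (1 + distance), and a null model in which labels are drawn independently from the empirical class frequencies. The function name `cut_statistic_select` and the parameters `k` and `keep_frac` are hypothetical. Each example gets a z-score measuring how much more often its weak label disagrees with its neighbors' labels than chance would predict, and the examples with the lowest scores are kept.

```python
import numpy as np

def cut_statistic_select(features, weak_labels, k=5, keep_frac=0.5):
    """Return (kept indices, z-scores) for a cut-statistic style filter.

    Assumptions made for this sketch: directed kNN graph, weights
    1 / (1 + Euclidean distance), labels i.i.d. under the null model.
    """
    X = np.asarray(features, dtype=float)
    y = np.asarray(weak_labels)
    n = len(y)

    # Pairwise Euclidean distances (brute force; fine for small n).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    dist = np.sqrt(np.maximum(d2, 0.0))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor

    # k nearest neighbors of each point and the corresponding edge weights.
    nbrs = np.argsort(dist, axis=1)[:, :k]
    nbr_dist = np.take_along_axis(dist, nbrs, axis=1)
    w = 1.0 / (1.0 + nbr_dist)

    # Observed statistic: total weight of edges whose endpoints disagree.
    disagree = (y[nbrs] != y[:, None]).astype(float)
    observed = np.sum(w * disagree, axis=1)

    # Mean and variance of that statistic under the null model.
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes.tolist(), (counts / n).tolist()))
    p = np.array([prior[c] for c in y.tolist()])
    mean = (1.0 - p) * w.sum(axis=1)
    var = p * (1.0 - p) * (w ** 2).sum(axis=1)

    # Low z-score: the weak label agrees with its neighborhood.
    z = (observed - mean) / np.sqrt(var + 1e-12)

    n_keep = max(1, int(round(keep_frac * n)))
    keep = np.argsort(z, kind="stable")[:n_keep]
    return np.sort(keep), z
```

In a pipeline, `features` would be the pretrained representations of the covered examples and `weak_labels` the hard labels produced by the label model; the classifier is then trained only on the returned indices. The kept fraction is a tunable quantity that reflects the quantity versus precision tradeoff the abstract describes.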
Taxonomy
Topics: Machine Learning and Data Classification · Neural Networks and Applications · Face and Expression Recognition
