Data Programming: Creating Large Training Sets, Quickly
Alexander Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, Christopher Ré

TL;DR
Data programming creates large training sets quickly: users encode weak supervision as labeling functions, and a generative model denoises the resulting noisy, conflicting labels. Models trained on these labels perform better, and the approach is usable by non-experts.
Contribution
The paper introduces a novel paradigm for programmatically creating training data using labeling functions and a generative model to denoise labels, with theoretical guarantees and practical improvements.
Findings
Achieved a new winning score on the TAC-KBP challenge.
Improved LSTM performance by nearly 6 F1 points using data programming.
Demonstrated ease of use for non-experts in creating training data.
Abstract
Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most time-consuming and expensive part of applying machine learning. We therefore propose a paradigm for the programmatic creation of training sets called data programming in which users express weak supervision strategies or domain heuristics as labeling functions, which are programs that label subsets of the data, but that are noisy and may conflict. We show that by explicitly representing this training set labeling process as a generative model, we can "denoise" the generated training set, and establish theoretically that we can recover the parameters of these generative models in a handful of settings. We then show how to modify a discriminative loss function to make it noise-aware,…
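The labeling-function idea in the abstract can be made concrete with a small sketch. Everything below is illustrative and not taken from the paper's code: the toy spouse-relation task, the three labeling functions, and the names `ASSUMED_ACC`, `prob_positive`, and `noise_aware_log_loss` are hypothetical. The accuracies are fixed by hand here, whereas the paper learns the generative model's parameters from the unlabeled data. Under the simplifying assumption that labeling functions are independent and the class prior is uniform, the votes combine into a log-odds sum, and the resulting probability can weight a standard logistic loss.

```python
import math

ABSTAIN, NEG, POS = 0, -1, 1

# Hypothetical labeling functions for a toy task:
# "does this sentence describe a spouse relation?"
def lf_married(text):
    return POS if "married" in text.lower() else ABSTAIN

def lf_wife_husband(text):
    t = text.lower()
    return POS if ("wife" in t or "husband" in t) else ABSTAIN

def lf_colleague(text):
    return NEG if "colleague" in text.lower() else ABSTAIN

LFS = [lf_married, lf_wife_husband, lf_colleague]

# ASSUMPTION: accuracies are hand-picked for illustration.
# In data programming they would be estimated, not given.
ASSUMED_ACC = [0.9, 0.8, 0.7]

def label_matrix(texts):
    """Apply every labeling function to every example."""
    return [[lf(t) for lf in LFS] for t in texts]

def prob_positive(votes, accs):
    """P(y = +1 | votes) for independent labeling functions, uniform prior.

    Each non-abstaining vote adds +/- log(acc / (1 - acc)) to the log-odds;
    abstentions (0) contribute nothing.
    """
    log_odds = sum(v * math.log(a / (1 - a)) for v, a in zip(votes, accs))
    return 1.0 / (1.0 + math.exp(-log_odds))

def noise_aware_log_loss(score, p_pos):
    """Expected logistic loss of a classifier score under a probabilistic label."""
    loss_if_pos = math.log1p(math.exp(-score))
    loss_if_neg = math.log1p(math.exp(score))
    return p_pos * loss_if_pos + (1.0 - p_pos) * loss_if_neg

texts = [
    "Alice married Bob",
    "Her husband, a colleague of Dan",
    "No signal here",
]
L = label_matrix(texts)
probs = [prob_positive(row, ASSUMED_ACC) for row in L]
```

In this sketch an example on which every labeling function abstains gets probability 0.5, so it carries no preference toward either class in the loss, while conflicting votes (the second sentence) yield a probability between the extremes rather than a hard label.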
Taxonomy
Topics: Machine Learning and Data Classification · Multimodal Machine Learning Applications · Machine Learning and Algorithms
Methods: Sigmoid Activation · Tanh Activation · Logistic Regression · Long Short-Term Memory
