The Word is Mightier than the Label: Learning without Pointillistic Labels using Data Programming
Chufan Gao, Mononito Goswami

TL;DR
This paper explores the Data Programming framework, which leverages noisy heuristics to generate labels for text classification, reducing reliance on manual point-by-point labeling and demonstrating competitive results.
Contribution
It provides a detailed analysis of Data Programming's mathematical foundations and empirically compares it with traditional active and semi-supervised learning methods.
Findings
DP effectively denoises heuristic labels for text classification
Compared to traditional methods, DP reduces labeling effort and maintains competitive accuracy
DP is applicable to real-world text datasets, as shown on two text classification tasks
Abstract
Most advanced supervised Machine Learning (ML) models rely on vast amounts of point-by-point labelled training examples. Hand-labelling vast amounts of data can be tedious, expensive, and error-prone. Recently, some studies have explored the use of diverse sources of weak supervision to produce competitive end model classifiers. In this paper, we survey recent work on weak supervision, and in particular, we investigate the Data Programming (DP) framework. Taking a set of potentially noisy heuristics as input, DP assigns denoised probabilistic labels to each data point in a dataset using a probabilistic graphical model of heuristics. We analyze the mathematical fundamentals behind DP and demonstrate its power by applying it to two real-world text classification tasks. Furthermore, we compare DP with pointillistic active and semi-supervised learning techniques traditionally applied in…
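To make the idea of turning noisy heuristics into probabilistic labels concrete, here is a minimal Python sketch. It is not the paper's implementation: the keyword heuristics, the toy sentiment task, and the fixed accuracy values are all hypothetical, and the combination rule is a simple conditionally independent vote with a uniform class prior. Data Programming itself would estimate the heuristic accuracies from unlabeled data with a probabilistic graphical model rather than assume them.

```python
import math

# Vote values: a heuristic may label a point positive, negative, or abstain.
ABSTAIN, NEG, POS = 0, -1, 1


# Hypothetical keyword heuristics ("labeling functions") for a toy sentiment task.
def lf_contains_great(text):
    return POS if "great" in text.lower() else ABSTAIN


def lf_contains_terrible(text):
    return NEG if "terrible" in text.lower() else ABSTAIN


def lf_contains_not(text):
    # Pad with spaces so "not" is matched as a whole word only.
    return NEG if " not " in f" {text.lower()} " else ABSTAIN


LFS = [lf_contains_great, lf_contains_terrible, lf_contains_not]

# ASSUMPTION: accuracies are fixed by hand here for illustration.
# Data Programming would learn them from agreements and disagreements
# among the heuristics on unlabeled data.
ACCURACIES = [0.9, 0.9, 0.6]


def probabilistic_label(text, lfs=LFS, accuracies=ACCURACIES):
    """Return P(y = POS | votes), assuming conditionally independent
    heuristics, a uniform class prior, and class-independent abstentions."""
    log_odds = 0.0
    for lf, acc in zip(lfs, accuracies):
        vote = lf(text)
        # A POS vote adds log(acc / (1 - acc)), a NEG vote subtracts it,
        # and an abstention (vote == 0) contributes nothing.
        log_odds += vote * math.log(acc / (1.0 - acc))
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    for sentence in ["A great movie", "This was not great", "No opinion"]:
        print(f"{sentence!r}: P(positive) = {probabilistic_label(sentence):.3f}")
```

In this sketch a data point on which every heuristic abstains keeps the prior of 0.5, while conflicting votes are resolved in favor of the heuristic assumed to be more accurate. The resulting soft labels could then be used to train an end model classifier.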
Taxonomy
Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Imbalanced Data Classification Techniques
