Automating Weak Label Generation for Data Programming with Clinicians in the Loop
Jean Park, Sydney Pugh, Kaustubh Sridhar, Mengyu Liu, Navish Yarna, Ramneet Kaur, Souradeep Dutta, Elena Bernardis, Oleg Sokolsky, and Insup Lee

TL;DR
This paper introduces a novel approach to generate weak labels for medical data by selecting representative samples for expert labeling, improving data programming accuracy in high-dimensional medical datasets.
Contribution
The paper proposes a distance-based sampling algorithm that efficiently captures the dataset's distribution, so that a small set of expert-labeled samples can drive weak label generation for data programming on high-dimensional medical data.
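This summary does not spell out the sampling algorithm, so the following is only a minimal sketch of one well-known distance-based selection strategy, greedy farthest-point (k-center) sampling, swapped in for illustration. The names `euclidean` and `farthest_point_sample` and the choice of Euclidean distance are assumptions, not the authors' method.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_point_sample(points, k, distance=euclidean):
    """Greedily pick up to k indices that spread out over the data.

    Illustrative stand-in only: each step picks the point farthest
    from everything chosen so far, so the picks cover the dataset.
    """
    if not points or k <= 0:
        return []
    chosen = [0]  # arbitrary starting point (an assumption of this sketch)
    # min_dist[i] holds the distance from point i to its nearest chosen point
    min_dist = [distance(p, points[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            d = distance(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return chosen


# Toy data: two tight pairs and one isolated point
samples = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0)]
representatives = farthest_point_sample(samples, 3)
```

In a clinician-in-the-loop setting, the returned indices would be the samples handed to an expert for labeling. Any domain-specific distance can be passed through the `distance` parameter.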
Findings
17-28% accuracy improvement in time series data
5-15% accuracy improvement in medical images
Significant F1 score enhancements over baseline methods
Abstract
Large Deep Neural Networks (DNNs) are often data hungry and need copious amounts of high-quality labeled data for learning to converge. This is a challenge in medicine, where high-quality labeled data is often scarce. Data programming has been a ray of hope in this regard, since it allows unlabeled data to be labeled using multiple weak labeling functions. Such functions are often supplied by a domain expert. Data programming can combine multiple weak labeling functions and suggest labels better than simple majority voting over the different functions. However, it is not straightforward to express such weak labeling functions, especially in high-dimensional settings such as images and time-series data. In this paper, we propose a way to bypass this issue using distance functions. In high-dimensional spaces, it is easier to find meaningful distance metrics which can…
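To make the terms in the abstract concrete, here is a hedged sketch of how a distance function can stand in for a hand-written weak labeling function, together with the simple majority-vote baseline the abstract mentions. It does not implement the data-programming label model itself. The names `make_distance_lf`, `majority_vote`, and `ABSTAIN`, and the fixed `radius` threshold, are hypothetical choices for illustration.

```python
import math
from collections import Counter

ABSTAIN = -1  # hypothetical sentinel for "this function gives no label"


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def make_distance_lf(exemplar, label, radius, distance=euclidean):
    """Build a weak labeling function from one expert-labeled exemplar.

    The function votes `label` for inputs within `radius` of the
    exemplar and abstains otherwise (an assumption of this sketch).
    """
    def lf(x):
        return label if distance(x, exemplar) <= radius else ABSTAIN
    return lf


def majority_vote(lfs, x):
    """Baseline aggregation: most common non-abstaining vote.

    Ties go to the label that was voted first.
    """
    votes = [v for v in (lf(x) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]


# Three labeling functions built from three labeled exemplars
lfs = [
    make_distance_lf((0, 0), 0, 2.0),
    make_distance_lf((1, 0), 0, 2.0),
    make_distance_lf((5, 5), 1, 2.0),
]
near_class_0 = majority_vote(lfs, (0.5, 0))
near_class_1 = majority_vote(lfs, (5, 4))
far_from_all = majority_vote(lfs, (20, 20))
```

Data programming would replace `majority_vote` with a model that weighs the labeling functions by their estimated accuracies. The baseline is shown here only because it is the point of comparison named in the abstract.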
Taxonomy
Topics: Scheduling and Timetabling Solutions · Statistical Methods in Clinical Trials · Organizational Management and Leadership
