Structured Exploration and Exploitation of Label Functions for Automated Data Annotation
Phong Lam, Ha-Linh Nguyen, Thu-Trang Nguyen, Son Nguyen, and Hieu Dinh Vo

TL;DR
EXPONA is an automated framework for generating high-quality label functions that balance diversity and reliability, significantly improving data annotation coverage and downstream model performance.
Contribution
The paper introduces a systematic exploration of multi-level label functions (surface, structural, and semantic) combined with reliability-aware filtering, advancing automated data labeling techniques.
Findings
Achieved up to 98.9% label coverage across datasets.
Improved weak label quality by up to 87%.
Enhanced downstream weighted F1 scores by up to 46%.
Abstract
High-quality labeled data is critical for training reliable machine learning and deep learning models, yet manual annotation remains costly and error-prone. Programmatic labeling addresses this challenge by using label functions (LFs), i.e., heuristic rules that automatically generate weak labels for training datasets. However, existing automated LF generation methods either rely on large language models (LLMs) to synthesize surface-level heuristics or employ model-based synthesis over hand-crafted primitives. These approaches often result in limited coverage and unreliable label quality. In this paper, we introduce EXPONA, an automated framework for programmatic labeling that formulates LF generation as a principled process balancing diversity and reliability. EXPONA systematically explores multi-level LFs, spanning surface, structural, and semantic perspectives. EXPONA further applies…
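To make the terms in the abstract concrete, here is a minimal sketch of programmatic labeling in general. It is not EXPONA's method: the three label functions, the spam/ham task, the majority-vote aggregation, and all names (`lf_contains_free`, `apply_lfs`, `coverage`, and so on) are hypothetical and chosen only for illustration. Each label function either votes for a class or abstains, and coverage is the fraction of examples that receive at least one vote.

```python
from collections import Counter
from typing import Callable, List, Optional

ABSTAIN = None  # a label function returns None when its heuristic does not apply
SPAM, HAM = 1, 0  # hypothetical binary task used only for this illustration

LabelFunction = Callable[[str], Optional[int]]


def lf_contains_free(text: str) -> Optional[int]:
    # Surface-level keyword heuristic.
    return SPAM if "free" in text.lower() else ABSTAIN


def lf_many_exclamations(text: str) -> Optional[int]:
    # Surface-level punctuation heuristic.
    return SPAM if text.count("!") >= 2 else ABSTAIN


def lf_greeting(text: str) -> Optional[int]:
    # Messages that open with a greeting are treated as legitimate.
    return HAM if text.lower().startswith(("hi", "hello")) else ABSTAIN


def apply_lfs(lfs: List[LabelFunction], texts: List[str]) -> List[List[Optional[int]]]:
    # Build the vote matrix: one row per example, one column per label function.
    return [[lf(t) for lf in lfs] for t in texts]


def majority_vote(votes: List[Optional[int]]) -> Optional[int]:
    # Aggregate non-abstaining votes; ties go to the label that was voted first.
    cast = [v for v in votes if v is not ABSTAIN]
    if not cast:
        return ABSTAIN
    return Counter(cast).most_common(1)[0][0]


def coverage(matrix: List[List[Optional[int]]]) -> float:
    # Fraction of examples labeled by at least one label function.
    if not matrix:
        return 0.0
    covered = sum(any(v is not ABSTAIN for v in row) for row in matrix)
    return covered / len(matrix)


texts = [
    "Win a FREE phone now!!",
    "Hello, are we still meeting at noon?",
    "The report is attached.",
]
lfs = [lf_contains_free, lf_many_exclamations, lf_greeting]
matrix = apply_lfs(lfs, texts)
labels = [majority_vote(row) for row in matrix]

print(labels)            # weak label per example, None where every LF abstained
print(coverage(matrix))  # share of examples with at least one vote
```

In this toy run the third message matches no heuristic, so it stays unlabeled and lowers coverage. That gap is the kind of limitation the abstract attributes to narrow, surface-level label functions.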
