Probably Approximately Correct Labels
Emmanuel J. Candès, Andrew Ilyas, Tijana Zrnic

TL;DR
This paper introduces a method to leverage pre-trained AI models for creating high-quality datasets with guaranteed low error rates, reducing labeling costs while maintaining accuracy.
Contribution
It proposes a nonasymptotic, theoretically grounded approach for dataset curation using AI models with high-probability accuracy guarantees.
Findings
Effective text annotation with large language models
Accurate image labeling using pre-trained vision models
Reliable protein folding analysis with AlphaFold
Abstract
Obtaining high-quality labeled datasets is often costly, requiring either human annotation or expensive experiments. In theory, powerful pre-trained AI models provide an opportunity to automatically label datasets and save costs. Unfortunately, these models come with no guarantees on their accuracy, making wholesale replacement of manual labeling impractical. In this work, we propose a method for leveraging pre-trained AI models to curate cost-effective and high-quality datasets. In particular, our approach results in probably approximately correct labels: with high probability, the overall labeling error is small. Our method is nonasymptotically valid under minimal assumptions on the dataset or the AI model being studied, and thus enables rigorous yet efficient dataset curation using modern AI models. We demonstrate the benefits of the methodology through text annotation with large…
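The guarantee described in the abstract can be read as follows: for a user-chosen error level and failure probability, the fraction of wrong labels in the final dataset stays below the error level with probability at least one minus the failure probability. The Python sketch below illustrates one way such a guarantee could be obtained, and is not the authors' algorithm. It assumes each AI label comes with a scalar uncertainty score, uses a Hoeffding upper confidence bound on a randomly sampled, expert-labeled calibration subset, and keeps AI labels only below an uncertainty threshold that the bound certifies. The names `hoeffding_ucb`, `choose_threshold`, `pac_label`, `expert`, `epsilon`, `delta`, and `m` are all hypothetical.

```python
import math
import random


def hoeffding_ucb(values, delta):
    """Upper confidence bound on the mean of i.i.d. values in [0, 1]."""
    m = len(values)
    return sum(values) / m + math.sqrt(math.log(1.0 / delta) / (2.0 * m))


def choose_threshold(cal_uncertainty, cal_loss, epsilon, delta):
    """Largest uncertainty threshold whose certified error is at most epsilon.

    Returns None when no threshold can be certified, in which case every
    point should go to the expert.
    """
    chosen = None
    for u in sorted(set(cal_uncertainty)):
        # Loss a point contributes if AI labels are kept when uncertainty <= u.
        z = [l if s <= u else 0.0 for s, l in zip(cal_uncertainty, cal_loss)]
        if hoeffding_ucb(z, delta) > epsilon:
            break  # the error is nondecreasing in u, so stop at first failure
        chosen = u
    return chosen


def pac_label(ai_labels, uncertainty, expert, epsilon, delta, m, seed=0):
    """Return (labels, threshold); `expert(i)` is a hypothetical oracle."""
    rng = random.Random(seed)
    n = len(ai_labels)
    # Calibration sample drawn uniformly with replacement from the dataset.
    idx = [rng.randrange(n) for _ in range(m)]
    expert_cache = {i: expert(i) for i in set(idx)}
    cal_u = [uncertainty[i] for i in idx]
    cal_loss = [float(ai_labels[i] != expert_cache[i]) for i in idx]
    u_hat = choose_threshold(cal_u, cal_loss, epsilon, delta)

    labels = []
    for i in range(n):
        if i in expert_cache:
            labels.append(expert_cache[i])  # already paid for this label
        elif u_hat is not None and uncertainty[i] <= u_hat:
            labels.append(ai_labels[i])     # confident enough: keep AI label
        else:
            labels.append(expert(i))        # too uncertain: ask the expert
    return labels, u_hat


if __name__ == "__main__":
    n = 2000
    truth = [i % 3 for i in range(n)]
    uncertainty = [i / n for i in range(n)]
    # Synthetic model: wrong exactly on the 20% most uncertain points.
    ai = [t if s < 0.8 else (t + 1) % 3 for t, s in zip(truth, uncertainty)]
    labels, u_hat = pac_label(ai, uncertainty, lambda i: truth[i],
                              epsilon=0.05, delta=0.05, m=1000)
    error = sum(a != b for a, b in zip(labels, truth)) / n
    print(f"threshold={u_hat}, realized error={error:.4f}")
```

Scanning thresholds in increasing order and stopping at the first failure relies on the kept-label error being nondecreasing in the threshold, which avoids a union bound over thresholds. Hoeffding's inequality is chosen here only for simplicity; a tighter mean upper bound would certify a larger threshold and save more expert labels.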
Peer Reviews
Decision·Submitted to ICLR 2026
1. Relevant and well-motivated problem formulation. Applying PAC reasoning to automatic labeling is timely and practically meaningful. It formalizes a common heuristic (confidence-based filtering) into a statistically grounded process. 2. Simplicity and generality. The proposed approach is model-agnostic and directly applicable across diverse labeling pipelines. 3. Comprehensive empirical coverage. Experiments span multiple modalities and show consistent compliance with PAC bounds whi
1. Limited overall contribution despite novel framing. While the paper introduces new formulations (e.g., PAC Labeling and PAC Router) that are conceptually fresh, the underlying theoretical substance remains limited. The analysis primarily builds on existing mean-upper-bound PAC results without introducing new bounds, assumptions, or insights into the nature of uncertainty in labeling. Consequently, although the problem setting is well-motivated, the contribution lies more in repackaging and in
- The paper considers a very practical setup in which AI models generate labels for all data and experts then annotate the most valuable subset. This is helpful for generating labeled training data in low-resource domains. - The proposed method has a nice statistical guarantee: with high probability, the final labeled dataset's error will not exceed a user-specified threshold. - The effectiveness of the method in terms of cost savings is demonstrated empirically.
- In practice, the cost of high-quality AI models is not negligible, especially given the existence of test-time scaling. How does this affect the proposed approach? Would allocating the whole expert budget to test-time scaling of the AI models achieve better results? - The considered baselines are too naive. A natural baseline is active learning. For example, first initialize the active learning model to be the model trained on labels generated by AI models, then select data
This paper addresses an important problem and is well-timed with the current trend of synthetic data labeling, particularly in its focus on providing statistical guarantees.
The biggest problem is that the baseline comparison feels contrived, and the paper's contribution seems overstated. The authors claim that existing methods fail to provide mathematical guarantees, but such methodologies clearly exist. For example, Conformal Prediction (CP), which the authors themselves mention, deals with a very similar problem. The CP methodology quantifies prediction uncertainty into a prediction set with guaranteed coverage; it statistically guarantees the probability that th
Taxonomy
Topics: Machine Learning and Data Classification · Generative Adversarial Networks and Image Synthesis · Cell Image Analysis Techniques
Methods: AlphaFold
