Probably Approximately Correct Labels

Emmanuel J. Candès; Andrew Ilyas; Tijana Zrnic

arXiv:2506.10908 · stat.ML · October 21, 2025

Open Access · 1 Repo · 3 Reviews

TL;DR

This paper introduces a method to leverage pre-trained AI models for creating high-quality datasets with guaranteed low error rates, reducing labeling costs while maintaining accuracy.

Contribution

It proposes a nonasymptotic, theoretically grounded approach for dataset curation using AI models with high-probability accuracy guarantees.

Findings

01. Effective text annotation with large language models
02. Accurate image labeling using pre-trained vision models
03. Reliable protein folding analysis with AlphaFold

Abstract

Obtaining high-quality labeled datasets is often costly, requiring either human annotation or expensive experiments. In theory, powerful pre-trained AI models provide an opportunity to automatically label datasets and save costs. Unfortunately, these models come with no guarantees on their accuracy, making wholesale replacement of manual labeling impractical. In this work, we propose a method for leveraging pre-trained AI models to curate cost-effective and high-quality datasets. In particular, our approach results in probably approximately correct labels: with high probability, the overall labeling error is small. Our method is nonasymptotically valid under minimal assumptions on the dataset or the AI model being studied, and thus enables rigorous yet efficient dataset curation using modern AI models. We demonstrate the benefits of the methodology through text annotation with large…
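The "probably approximately correct" guarantee described in the abstract can be read as a single probability statement. The notation below is introduced here for illustration and is an assumption, not quoted from the paper: $n$ is the dataset size, $Y_i$ the true label of item $i$, $\tilde{Y}_i$ the label that ends up in the curated dataset, $\ell$ a loss taking values in $[0, 1]$ (for example, the 0-1 loss), $\epsilon$ the error tolerance, and $\alpha$ the allowed failure probability.

```latex
\[
  \mathbb{P}\!\left( \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(Y_i, \tilde{Y}_i\bigr) \le \epsilon \right) \ge 1 - \alpha
\]
```

Here "approximately correct" corresponds to the average error being at most $\epsilon$, and "probably" corresponds to the statement holding with probability at least $1 - \alpha$.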

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. Relevant and well-motivated problem formulation. Applying PAC reasoning to automatic labeling is timely and practically meaningful. It formalizes a common heuristic (confidence-based filtering) into a statistically grounded process.
2. Simplicity and generality. The proposed approach is model-agnostic and directly applicable across diverse labeling pipelines.
3. Comprehensive empirical coverage. Experiments span multiple modalities and show consistent compliance with PAC bounds whi

Weaknesses

1. Limited overall contribution despite novel framing. While the paper introduces new formulations (e.g., PAC Labeling and PAC Router) that are conceptually fresh, the underlying theoretical substance remains limited. The analysis primarily builds on existing mean-upper-bound PAC results without introducing new bounds, assumptions, or insights into the nature of uncertainty in labeling. Consequently, although the problem setting is well-motivated, the contribution lies more in repackaging and in

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The paper considers a very practical setup in which AI models generate labels for all data and experts then annotate the most valuable subset. This is helpful for generating labeled training data in low-resource domains.
- The proposed method has a nice statistical guarantee: with high probability, the final labeled dataset's error will not exceed a user-specified threshold.
- The effectiveness of the method in terms of cost savings is demonstrated empirically.
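The setup described above (AI labels for every item, expert labels for a selected subset, and a high-probability bound on the final error) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a 0-1 loss, a per-item uncertainty score supplied by the AI model, a Hoeffding upper confidence bound, and a scan over thresholds that stops at the first one whose bound exceeds the tolerance. The names `pac_label`, `choose_threshold`, `hoeffding_ucb`, and the `expert_label` callable are hypothetical.

```python
import math
import random


def hoeffding_ucb(values, alpha):
    """Upper confidence bound on the mean of i.i.d. values in [0, 1] (Hoeffding)."""
    m = len(values)
    return sum(values) / m + math.sqrt(math.log(1.0 / alpha) / (2.0 * m))


def choose_threshold(uncertainty, ai_labels, expert_label, epsilon, alpha, m, rng):
    """Pick a threshold; AI labels are kept only where uncertainty < threshold."""
    n = len(uncertainty)
    # Calibration sample: m indices drawn uniformly with replacement.
    idx = [rng.randrange(n) for _ in range(m)]
    calib = {i: expert_label(i) for i in set(idx)}  # expert labels paid for here
    wrong = [1.0 if ai_labels[i] != calib[i] else 0.0 for i in idx]

    threshold = -math.inf  # default: keep no AI labels at all
    # Scan thresholds from most to least conservative and stop at the first
    # one whose error bound exceeds epsilon (assumed fixed-sequence scan).
    for u in sorted(set(uncertainty)) + [math.inf]:
        # Loss of a sampled item at threshold u: counted only if its AI label is kept.
        losses = [w if uncertainty[i] < u else 0.0 for i, w in zip(idx, wrong)]
        if hoeffding_ucb(losses, alpha) > epsilon:
            break
        threshold = u
    return threshold, calib


def pac_label(uncertainty, ai_labels, expert_label,
              epsilon=0.1, alpha=0.05, m=500, seed=0):
    """Return (labels, threshold, number of items labeled by the expert)."""
    rng = random.Random(seed)
    threshold, calib = choose_threshold(
        uncertainty, ai_labels, expert_label, epsilon, alpha, m, rng
    )
    labels = []
    n_expert = len(calib)
    for i, (u, y_ai) in enumerate(zip(uncertainty, ai_labels)):
        if i in calib:
            labels.append(calib[i])         # reuse the calibration expert label
        elif u < threshold:
            labels.append(y_ai)             # confident enough: keep the AI label
        else:
            labels.append(expert_label(i))  # too uncertain: ask the expert
            n_expert += 1
    return labels, threshold, n_expert
```

A call such as `pac_label(uncertainty, ai_labels, expert_label, epsilon=0.1, alpha=0.05)` returns the curated labels together with the chosen threshold and the number of expert labels used. Because the error of the kept AI labels can only grow as the threshold grows, stopping at the first failing threshold avoids a union bound over all candidates in this sketch; tighter confidence bounds than Hoeffding would typically keep more AI labels.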

Weaknesses

- In practice, the cost of high-quality AI models is not negligible, especially given the existence of test-time scaling. How does this affect the proposed approach? Would allocating the whole expert budget to test-time scaling of the AI models achieve better results?
- The considered baselines are too naive. A natural baseline is to use active learning. For example, first initialize the active learning model to be the model trained on labels generated by AI models, then select data

Reviewer 03 · Rating 4 · Confidence 2

Strengths

This paper addresses an important problem and is well-timed with the current trend of synthetic data labeling, particularly in its focus on providing statistical guarantees.

Weaknesses

The biggest problem is that the baseline comparison feels contrived, and the paper's contribution seems overstated. The authors claim that existing methods fail to provide mathematical guarantees, but such methodologies clearly exist. For example, Conformal Prediction (CP), which the authors themselves mention, deals with a very similar problem. The CP methodology quantifies prediction uncertainty into a prediction set with guaranteed coverage; it statistically guarantees the probability that th

Code & Models

Repositories

tijana-zrnic/pac-labels
PyTorch · Official

Taxonomy

Topics: Machine Learning and Data Classification · Generative Adversarial Networks and Image Synthesis · Cell Image Analysis Techniques

Methods: AlphaFold