Prompt Candidates, then Distill: A Teacher-Student Framework for LLM-driven Data Annotation

Mingxuan Xia; Haobo Wang; Yixuan Li; Zewei Yu; Jindong Wang; Junbo Zhao; Runze Wu

arXiv:2506.03857·cs.LG·June 5, 2025

TL;DR

CanDist is a teacher-student framework for LLM-driven data annotation: the LLM outputs several candidate labels when it is uncertain, and a Small Language Model (SLM) then distills those candidates into a single label per sample, improving the quality of the annotated data.

Contribution

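The paper proposes a candidate annotation paradigm, in which the LLM outputs all plausible labels for uncertain samples, and a distillation method that resolves those candidates into unique labels. The approach is backed by theoretical analysis and yields higher annotation quality.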

Findings

01 Outperforms existing single-label annotation methods across six text classification tasks.

02 Theoretical analysis shows that distilling from candidate annotations carries stronger guarantees than single-label annotation.

03 Demonstrates robustness to LLM uncertainty in data annotation.

Abstract

Recently, Large Language Models (LLMs) have demonstrated significant potential for data annotation, markedly reducing the labor costs associated with downstream applications. However, existing methods mostly adopt an aggressive strategy by prompting the LLM to determine a single gold label for each unlabeled sample. Due to the inherent uncertainty within LLMs, they often produce incorrect labels for difficult samples, severely compromising the data quality for downstream applications. Motivated by ambiguity aversion in human behaviors, we propose a novel candidate annotation paradigm wherein large language models are encouraged to output all possible labels when they are uncertain. To ensure unique labels are provided for downstream tasks, we develop a teacher-student framework CanDist that distills candidate annotations with a Small Language Model (SLM). We further provide a rigorous…
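The candidate-then-distill idea in the abstract can be illustrated with a minimal sketch. It assumes the teacher LLM has already returned a set of candidate labels for a sample and that a student model supplies class probabilities; the student's probability mass is renormalized over the candidates and the most likely candidate becomes the single label. The function names `candidate_targets` and `distill_label` and the example probabilities are hypothetical, and this is a generic partial-label heuristic, not the authors' CanDist training procedure.

```python
from typing import Dict, Set


def candidate_targets(student_probs: Dict[str, float], candidates: Set[str]) -> Dict[str, float]:
    """Renormalize the student's probabilities over the teacher's candidate labels.

    Hypothetical helper for illustration; labels outside the candidate set get no mass.
    """
    if not candidates:
        raise ValueError("candidate set must not be empty")
    mass = sum(student_probs.get(label, 0.0) for label in candidates)
    if mass == 0.0:
        # The student gives no mass to any candidate: fall back to a uniform target.
        return {label: 1.0 / len(candidates) for label in candidates}
    return {label: student_probs.get(label, 0.0) / mass for label in candidates}


def distill_label(student_probs: Dict[str, float], candidates: Set[str]) -> str:
    """Pick one label per sample: the candidate with the highest renormalized probability."""
    targets = candidate_targets(student_probs, candidates)
    # Iterating in sorted order makes tie-breaking deterministic (alphabetical).
    return max(sorted(targets), key=targets.get)


if __name__ == "__main__":
    # Assumed inputs: the teacher was unsure between two labels for this sample.
    teacher_candidates = {"neutral", "negative"}
    student_probs = {"positive": 0.5, "neutral": 0.3, "negative": 0.2}
    print(candidate_targets(student_probs, teacher_candidates))
    print(distill_label(student_probs, teacher_candidates))
```

In this sketch the student's top overall guess ("positive") is discarded because the teacher did not list it as a candidate, so the candidate set acts as a constraint while the student resolves the remaining ambiguity.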

Code & Models

Repositories

mingxuanxia/candist (PyTorch, official)

Videos

Prompt Candidates, then Distill: A Teacher-Student Framework for LLM-driven Data Annotation

Taxonomy

Topics: Topic Modeling · Machine Learning and Data Classification · Explainable Artificial Intelligence (XAI)

Methods: ADaptive gradient method with the OPTimal convergence rate