Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement

Liqin Ye; Agam Shah; Chao Zhang; Sudheer Chava

arXiv:2505.19675·cs.CL·June 23, 2025

TL;DR

This paper introduces SiDyP, a method that makes classifiers more robust to noisy labels generated by large language models by iteratively refining label predictions with a simplex diffusion approach.

Contribution

The paper presents SiDyP, a novel iterative refinement framework that calibrates classifier predictions to handle LLM-generated noisy labels, improving NLP classifier accuracy.

Findings

01 Increases BERT classifier performance by over 7% on noisy datasets.

02 Effectively refines noisy labels using neighborhood distribution and diffusion.

03 Demonstrates robustness across various LLMs and NLP tasks.
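
The "neighborhood distribution" idea in finding 02 can be illustrated with a small sketch. This is not the authors' implementation: the function name `neighborhood_label_distribution`, the cosine-similarity k-nearest-neighbor rule, the choice of `k`, and the toy data are all assumptions made for illustration. The sketch only shows how the noisy labels of nearby examples in an embedding space can be summarized as a per-example label distribution, which a refinement step could then use as a soft prior.

```python
import numpy as np

def neighborhood_label_distribution(embeddings, noisy_labels, num_classes, k=3):
    """Hypothetical helper: for each example, return the empirical distribution
    of noisy labels among its k nearest neighbors (cosine similarity),
    excluding the example itself."""
    x = np.asarray(embeddings, dtype=float)
    y = np.asarray(noisy_labels, dtype=int)
    # Normalize rows so that the dot product equals cosine similarity.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = x @ x.T
    # An example must not count as its own neighbor.
    np.fill_diagonal(sim, -np.inf)
    # Indices of the k most similar examples for every row.
    nn_idx = np.argsort(-sim, axis=1)[:, :k]
    dist = np.zeros((len(y), num_classes))
    for c in range(num_classes):
        # Fraction of the k neighbors that carry noisy label c.
        dist[:, c] = (y[nn_idx] == c).mean(axis=1)
    return dist

# Toy data (assumed): two well-separated clusters, one mislabeled point in each.
embeddings = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.95, 0.05],
              [0.0, 1.0], [0.1, 0.9], [0.1, 1.0], [0.05, 0.95]]
noisy_labels = [0, 0, 1, 0, 1, 1, 0, 1]  # indices 2 and 6 disagree with their cluster

dist = neighborhood_label_distribution(embeddings, noisy_labels, num_classes=2, k=3)
candidate_labels = dist.argmax(axis=1)
```

In this toy setup the two mislabeled points are surrounded by neighbors carrying the other label, so their neighborhood distributions point away from their own noisy labels. How SiDyP actually builds and uses such distributions, and how the diffusion step consumes them, is described in the paper and the official repository.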

Abstract

The traditional process of creating labeled datasets is labor-intensive and expensive. Recent breakthroughs in open-source large language models (LLMs) have opened up a new avenue for generating labeled datasets automatically for various natural language processing (NLP) tasks, providing an alternative to this expensive annotation process. However, the reliability of such auto-generated labels remains a significant concern due to inherent inaccuracies. When a model learns from noisy labels, its generalization is likely to suffer because it is prone to overfitting the label noise. While previous studies on learning from noisy labels mainly focus on synthetic noise and real-world noise, LLM-generated label noise has received less attention. In this paper, we propose SiDyP: Simplex Label Diffusion with Dynamic Prior to calibrate the classifier's prediction, thus enhancing its robustness…

Code & Models

Repositories

liqinye/sidyp (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Linear Layer · Attention Dropout · Softmax · WordPiece · Weight Decay · Multi-Head Attention · Layer Normalization · Dropout