Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification

Hsun-Yu Kuo; Yin-Hsiang Liao; Yu-Chieh Chao; Wei-Yun Ma; Pu-Jen Cheng

arXiv:2410.21526 · cs.LG · March 25, 2025 · 2 cites

TL;DR

This paper introduces weighted-loss methods to improve the use of synthetic LLM-generated data in text classification, emphasizing high-quality data to enhance model performance when real data is limited.

Contribution

The paper proposes weighted-loss techniques that align synthetic data with the real-world distribution, improving classification accuracy over standard cross-entropy training and other data weighting approaches.

Findings

01. Weighted-loss approaches outperform standard cross-entropy.

02. The quality of synthetic data significantly affects model performance.

03. The method is effective across multiple text classification tasks.

Abstract

Synthetic data augmentation via large language models (LLMs) allows researchers to leverage additional training data, thus enhancing the performance of downstream tasks, especially when real-world data is scarce. However, the generated data can deviate from the real-world data, and this misalignment can lead to poor outcomes when the trained model is applied in practice. Therefore, we propose efficient weighted-loss approaches that align synthetic data with the real-world distribution by emphasizing high-quality and diversified data generated by LLMs, using only a small amount of real-world data. We empirically assessed the effectiveness of our method on multiple text classification tasks, and the results show that applying our approaches to a BERT-level model robustly outperforms standard cross-entropy and other data weighting approaches, providing potential solutions to effectively…
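To make the contrast with standard cross-entropy concrete, the following is a minimal sketch of a generic per-example weighted cross-entropy loss. It is not the paper's weighting scheme, which this page does not describe; the function names (`softmax`, `weighted_cross_entropy`) and the idea of passing one fixed weight per synthetic example are illustrative assumptions. With uniform weights the loss reduces to ordinary mean cross-entropy; lowering an example's weight reduces its influence on training.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one example's class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(batch_logits, labels, weights):
    """Weight-normalized mean of per-example cross-entropy losses.

    Hypothetical helper for illustration: `weights` holds one non-negative
    score per example (for instance, an estimate of how well a synthetic
    example matches real data). How such scores are obtained is the
    paper's contribution and is NOT reproduced here.
    """
    if not (len(batch_logits) == len(labels) == len(weights)):
        raise ValueError("logits, labels and weights must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Per-example cross-entropy: negative log-probability of the true class.
    losses = [
        -math.log(softmax(logits)[label])
        for logits, label in zip(batch_logits, labels)
    ]
    return sum(w * l for w, l in zip(weights, losses)) / total_weight


# Three examples, two classes. The second example is confidently wrong.
batch_logits = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
labels = [0, 0, 1]

uniform = weighted_cross_entropy(batch_logits, labels, [1.0, 1.0, 1.0])
downweighted = weighted_cross_entropy(batch_logits, labels, [1.0, 0.1, 1.0])
```

Here `uniform` is the standard cross-entropy average, while `downweighted` shrinks the contribution of the second example, as one might for a synthetic example judged to be low quality.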

Videos

Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques

Methods: ALIGN