Data Quality Enhancement on the Basis of Diversity with Large Language Models for Text Classification: Uncovered, Difficult, and Noisy

Min Zeng; Caiquan Liu; Shiqi Zhang; Li Xie; Chen Sang; Xiaoxin Chen

arXiv:2412.06575 · cs.CL · December 11, 2024

TL;DR

This paper introduces a data quality enhancement method using large language models for text classification, which improves accuracy, reduces training time, and achieves state-of-the-art results by selecting and fine-tuning on high-quality data.

Contribution

The paper proposes a novel data quality enhancement approach based on LLMs that effectively identifies and utilizes high-quality data to improve classification performance and efficiency.

Findings

1. Significant accuracy improvements in text classification tasks.
2. Nearly 50% reduction in training time.
3. Achieved state-of-the-art results on multiple datasets.

Abstract

In recent years, the use of large language models (LLMs) for text classification has attracted widespread attention. Despite this, the classification accuracy of LLMs has not yet universally surpassed that of smaller models. LLMs can enhance their performance in text classification through fine-tuning. However, existing data quality research based on LLMs is challenging to apply directly to solve text classification problems. To further improve the performance of LLMs in classification tasks, this paper proposes a data quality enhancement (DQE) method for text classification based on LLMs. This method starts by using a greedy algorithm to select data, dividing the dataset into sampled and unsampled subsets, and then performing fine-tuning of the LLMs using the sampled data. Subsequently, this model is used to predict the outcomes for the unsampled data, categorizing incorrectly…
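The first steps the abstract describes (greedy selection, the sampled/unsampled split, and flagging wrong predictions on the unsampled part) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which greedy criterion the authors use, so the sketch swaps in farthest-first (k-center) selection over cosine distance. The names `greedy_diverse_split`, `mispredicted`, and the `predict` callback (a stand-in for the fine-tuned LLM) are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def greedy_diverse_split(vectors, k):
    """Farthest-first greedy selection of k diverse items.

    Returns (sampled, unsampled) as lists of indices into `vectors`.
    Assumption: diversity is measured by cosine distance between text
    embeddings; the paper's actual criterion is not given in the abstract.
    """
    if not 0 < k <= len(vectors):
        raise ValueError("k must be between 1 and len(vectors)")
    sampled = [0]  # arbitrary seed item
    chosen = {0}
    # Distance from every item to its nearest already-sampled item.
    nearest = [cosine_distance(v, vectors[0]) for v in vectors]
    while len(sampled) < k:
        # Greedy step: pick the unsampled item farthest from the sample.
        candidates = (i for i in range(len(vectors)) if i not in chosen)
        nxt = max(candidates, key=nearest.__getitem__)
        sampled.append(nxt)
        chosen.add(nxt)
        for i, v in enumerate(vectors):
            nearest[i] = min(nearest[i], cosine_distance(v, vectors[nxt]))
    unsampled = [i for i in range(len(vectors)) if i not in chosen]
    return sampled, unsampled


def mispredicted(unsampled, labels, predict):
    """Indices in `unsampled` where predict(index) disagrees with the label.

    `predict` is a hypothetical stand-in for the model fine-tuned on the
    sampled subset.
    """
    return [i for i in unsampled if predict(i) != labels[i]]
```

With four toy embeddings `[[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]` and `k=2`, the split keeps items 0 and 2 as the sampled subset and leaves items 1 and 3 for prediction. How the incorrectly predicted items are then handled is cut off in the abstract above, so the sketch stops at flagging them.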

Taxonomy

Topics: Text and Document Classification Technologies · Data Quality and Management