Text Grafting: Near-Distribution Weak Supervision for Minority Classes in Text Classification

Letian Peng; Yi Gu; Chengyu Dong; Zihan Wang; Jingbo Shang

arXiv:2406.11115·cs.CL·June 18, 2024

Open Access · 1 Repo

TL;DR

This paper introduces a novel text grafting framework that combines mining and synthesis techniques to generate near-distribution data for minority classes in weakly supervised text classification, improving classifier performance.

Contribution

The paper proposes a new framework that effectively bridges the gap between data mining and synthesis, enhancing minority class data generation in weak supervision scenarios.

Findings

1. Text grafting outperforms previous methods in minority class classification.

2. Synthesized texts are more in-distribution and relevant to target classes.

3. Significant accuracy improvements demonstrated on benchmark datasets.

Abstract

For extremely weak-supervised text classification, pioneer research generates pseudo labels by mining texts similar to the class names from the raw corpus, which may end up with very limited or even no samples for the minority classes. Recent works have started to generate the relevant texts by prompting LLMs using the class names or definitions; however, there is a high risk that LLMs cannot generate in-distribution (i.e., similar to the corpus where the text classifier will be applied) data, leading to ungeneralizable classifiers. In this paper, we combine the advantages of these two approaches and propose to bridge the gap via a novel framework, text grafting, which aims to obtain clean and near-distribution weak supervision for minority classes. Specifically, we first use LLM-based logits to mine masked templates from the raw corpus, which have a high potential for data…
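
The mining problem the abstract opens with can be illustrated with a toy sketch. It is not the paper's method: plain token overlap with the class name stands in for whatever similarity measure a real mining approach would use, and the function name `mine_pseudo_labels`, the `min_overlap` parameter, and the four-sentence corpus are all hypothetical. It only shows how a class that is rare in the raw corpus can end up with no pseudo-labeled samples.

```python
from collections import Counter


def mine_pseudo_labels(corpus, class_names, min_overlap=1):
    """Toy mining step (assumption: token overlap stands in for similarity).

    Each text gets the class whose name shares the most tokens with it;
    texts that match no class name are left unlabeled.
    """
    labels = {}
    for idx, text in enumerate(corpus):
        tokens = set(text.lower().split())
        scores = {
            name: len(tokens & set(name.lower().split()))
            for name in class_names
        }
        best = max(scores, key=scores.get)
        if scores[best] >= min_overlap:
            labels[idx] = best
    return labels


# Hypothetical raw corpus in which the minority class never appears.
corpus = [
    "the sports team won the match",
    "sports fans cheered loudly",
    "new sports stadium opens downtown",
    "markets rallied after the report",
]
class_names = ["sports", "phishing"]

labels = mine_pseudo_labels(corpus, class_names)
counts = Counter(labels.values())
print(counts["sports"], counts["phishing"])
```

In this toy corpus the majority class collects every mined label and the minority class collects none, which is the gap the abstract says the grafting framework is meant to close.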

Code & Models

Repositories

KomeijiForce/TextGrafting
PyTorch · Official

Taxonomy

Topics: Text and Document Classification Technologies · Spam and Phishing Detection · Internet Traffic Analysis and Secure E-voting