Adaptive Ranking-based Sample Selection for Weakly Supervised Class-imbalanced Text Classification

Linxin Song; Jieyu Zhang; Tianxiang Yang; Masayuki Goto

arXiv:2210.03092·cs.CL·October 10, 2022

TL;DR

This paper introduces ARS2, a flexible framework that improves weakly supervised, class-imbalanced text classification by ranking and sampling data based on model confidence and rule expertise, significantly boosting performance.

Contribution

The paper presents a novel adaptive ranking-based sample selection method that addresses data imbalance in weak supervision for NLP tasks, improving classification accuracy.

Findings

01. ARS2 outperforms state-of-the-art methods by 2%-57.8% in F1-score.

02. It effectively balances data and leverages rule expertise.

03. Experimental results validate its robustness across datasets.

Abstract

To obtain a large amount of training labels inexpensively, researchers have recently adopted the weak supervision (WS) paradigm, which leverages labeling rules to synthesize training labels rather than using individual annotations to achieve competitive results for natural language processing (NLP) tasks. However, data imbalance is often overlooked in applying the WS paradigm, despite being a common issue in a variety of NLP tasks. To address this challenge, we propose Adaptive Ranking-based Sample Selection (ARS2), a model-agnostic framework to alleviate the data imbalance issue in the WS paradigm. Specifically, it calculates a probabilistic margin score based on the output of the current model to measure and rank the cleanliness of each data point. Then, the ranked data are sampled based on both class-wise and rule-aware ranking. In particular, the two sampling strategies correspond to…
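The scoring and class-wise sampling steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the exact margin formula is not given here, so the code assumes a common definition (probability of the weak label minus the highest probability among the other classes), and the function names `margin_scores` and `classwise_top_k`, the parameter `k`, and the toy data are all hypothetical. The rule-aware ranking is omitted because the abstract is truncated before describing it.

```python
from collections import defaultdict


def margin_scores(probs, weak_labels):
    """Score each example by an ASSUMED probabilistic margin:
    p(weak label) minus the largest probability among the other classes.
    A higher score is read as a cleaner (more trustworthy) weak label."""
    scores = []
    for p, y in zip(probs, weak_labels):
        best_other = max(q for c, q in enumerate(p) if c != y)
        scores.append(p[y] - best_other)
    return scores


def classwise_top_k(scores, weak_labels, k):
    """Rank examples within each weak-label class and keep at most k
    per class, so that no class dominates the selected batch."""
    by_class = defaultdict(list)
    for i, y in enumerate(weak_labels):
        by_class[y].append(i)
    selected = []
    for members in by_class.values():
        ranked = sorted(members, key=lambda i: scores[i], reverse=True)
        selected.extend(ranked[:k])
    return sorted(selected)


# Toy data (hypothetical): model probabilities for 2 classes and weak labels.
probs = [
    [0.90, 0.10],
    [0.60, 0.40],
    [0.20, 0.80],
    [0.55, 0.45],
    [0.30, 0.70],
]
weak_labels = [0, 0, 1, 1, 0]

scores = margin_scores(probs, weak_labels)
selected = classwise_top_k(scores, weak_labels, k=2)
print(selected)
```

In this toy run, class 0 has three weakly labeled examples and class 1 has two; capping each class at `k=2` drops the class-0 example whose weak label the model disagrees with most. How ARS2 actually sets the per-class budget and updates it during training is not stated in this summary.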

Taxonomy

Topics: Imbalanced Data Classification Techniques · Text and Document Classification Technologies · Topic Modeling