Adaptive Ranking-based Sample Selection for Weakly Supervised Class-imbalanced Text Classification
Linxin Song, Jieyu Zhang, Tianxiang Yang, Masayuki Goto

TL;DR
This paper introduces ARS2, a model-agnostic framework for weakly supervised, class-imbalanced text classification. It ranks training data by a probabilistic margin score computed from the current model's output, then samples the ranked data using class-wise and rule-aware strategies, improving F1-score over prior methods.
Contribution
The paper presents an adaptive ranking-based sample selection method that addresses data imbalance in weak supervision for NLP tasks, improving classification F1-score.
Findings
ARS2 outperforms state-of-the-art methods by 2%-57.8% in F1-score.
It effectively balances data and leverages rule expertise.
Experimental results validate its robustness across datasets.
Abstract
To obtain a large number of training labels inexpensively, researchers have recently adopted the weak supervision (WS) paradigm, which leverages labeling rules to synthesize training labels rather than using individual annotations to achieve competitive results for natural language processing (NLP) tasks. However, data imbalance is often overlooked in applying the WS paradigm, despite being a common issue in a variety of NLP tasks. To address this challenge, we propose Adaptive Ranking-based Sample Selection (ARS2), a model-agnostic framework to alleviate the data imbalance issue in the WS paradigm. Specifically, it calculates a probabilistic margin score based on the output of the current model to measure and rank the cleanliness of each data point. Then, the ranked data are sampled based on both class-wise and rule-aware ranking. In particular, the two sampling strategies correspond to…
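The ranking and sampling steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define the margin score, so the sketch assumes it is the gap between the model's probability for a point's weak label and the highest probability among the other classes. The function names, the `per_class` and `per_rule` budgets, and the `-1` abstain code are hypothetical choices made for illustration.

```python
import numpy as np


def margin_scores(probs, weak_labels):
    """Assumed margin: P(weak label) minus the best probability of any other class."""
    probs = np.asarray(probs, dtype=float)
    weak_labels = np.asarray(weak_labels)
    idx = np.arange(len(weak_labels))
    p_label = probs[idx, weak_labels]
    others = probs.copy()
    others[idx, weak_labels] = -np.inf  # mask out the weak label's own column
    return p_label - others.max(axis=1)


def classwise_select(scores, weak_labels, per_class):
    """Keep the `per_class` highest-scoring points within each weak-label class."""
    weak_labels = np.asarray(weak_labels)
    keep = []
    for c in np.unique(weak_labels):
        members = np.flatnonzero(weak_labels == c)
        ranked = members[np.argsort(-scores[members], kind="stable")]
        keep.extend(ranked[:per_class].tolist())
    return sorted(keep)


def rule_aware_select(scores, rule_votes, per_rule, abstain=-1):
    """Keep the `per_rule` highest-scoring points among those each rule labeled."""
    rule_votes = np.asarray(rule_votes)
    keep = set()
    for j in range(rule_votes.shape[1]):
        fired = np.flatnonzero(rule_votes[:, j] != abstain)
        ranked = fired[np.argsort(-scores[fired], kind="stable")]
        keep.update(ranked[:per_rule].tolist())
    return sorted(keep)


# Toy data: 5 points, 2 classes, 2 labeling rules (-1 means the rule abstained).
probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.55, 0.45], [0.3, 0.7]]
weak_labels = [0, 0, 1, 1, 0]
rule_votes = [[0, -1], [0, 0], [-1, 1], [1, 1], [0, -1]]

scores = margin_scores(probs, weak_labels)
balanced_idx = classwise_select(scores, weak_labels, per_class=1)
rule_idx = rule_aware_select(scores, rule_votes, per_rule=2)
```

Selecting the same number of top-ranked points per class is one simple way to counter imbalance in the weak labels. Selecting per rule keeps the highest-scoring points each labeling rule covers, so rules with small coverage still contribute. How ARS2 combines the two selections and adapts them during training is not specified in the text above and is not modeled here.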
Taxonomy
Topics: Imbalanced Data Classification Techniques · Text and Document Classification Technologies · Topic Modeling
