Semi-Supervised Text Classification via Self-Pretraining
Payam Karisani, Negin Karisani

TL;DR
This paper introduces Self-Pretraining, a novel semi-supervised text classification method that iteratively trains two classifiers, effectively handling semantic drift and outperforming existing models on social media datasets.
Contribution
The paper proposes a threshold-free, iterative semi-supervised learning model with a dual-classifier setup. To address semantic drift in text classification, it uses an iterative distillation process and transfers hypotheses across iterations.
Findings
Outperforms state-of-the-art semi-supervised classifiers
Effective on social media datasets
Handles semantic drift successfully
Abstract
We present a neural semi-supervised learning model termed Self-Pretraining. Our model is inspired by the classic self-training algorithm. However, as opposed to self-training, Self-Pretraining is threshold-free, it can potentially update its belief about previously labeled documents, and can cope with the semantic drift problem. Self-Pretraining is iterative and consists of two classifiers. In each iteration, one classifier draws a random set of unlabeled documents and labels them. This set is used to initialize the second classifier, to be further trained by the set of labeled documents. The algorithm proceeds to the next iteration and the classifiers' roles are reversed. To improve the flow of information across the iterations and also to cope with the semantic drift problem, Self-Pretraining employs an iterative distillation process, transfers hypotheses across the iterations,…
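The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `TinyLogReg` is a hypothetical logistic-regression stand-in for the neural classifiers, documents are assumed to be already encoded as numeric feature vectors, the parameters `iterations` and `sample_size` are invented for the example, and the iterative distillation and hypothesis transfer mentioned in the abstract are omitted. The role swap is simplified by creating a fresh learner in each iteration.

```python
import math
import random


class TinyLogReg:
    """Hypothetical stand-in for a neural classifier: logistic regression with SGD."""

    def __init__(self, dim):
        self.w = [0.0] * dim
        self.b = 0.0

    def prob(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clip to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, x):
        return 1 if self.prob(x) >= 0.5 else 0

    def fit(self, X, y, epochs=20, lr=0.1):
        # Continues from the current weights, so calling fit twice
        # behaves like "pretrain, then further train".
        for _ in range(epochs):
            for x, t in zip(X, y):
                g = self.prob(x) - t
                self.w = [wi - lr * g * xi for wi, xi in zip(self.w, x)]
                self.b -= lr * g
        return self


def self_pretraining_sketch(X_lab, y_lab, X_unl,
                            iterations=4, sample_size=50, seed=0):
    """Simplified two-classifier loop (assumed parameters, no distillation)."""
    rng = random.Random(seed)
    dim = len(X_lab[0])
    # Assumption: the first labeler is bootstrapped from the labeled set alone.
    labeler = TinyLogReg(dim).fit(X_lab, y_lab)
    for _ in range(iterations):
        # One classifier draws a random set of unlabeled documents and labels
        # them. No confidence threshold is applied.
        sample = rng.sample(X_unl, min(sample_size, len(X_unl)))
        pseudo = [labeler.predict(x) for x in sample]
        # The pseudo-labeled set initializes the second classifier ...
        learner = TinyLogReg(dim).fit(sample, pseudo)
        # ... which is then further trained on the labeled documents.
        learner.fit(X_lab, y_lab)
        # Roles are reversed for the next iteration.
        labeler = learner
    return labeler
```

Because every drawn document receives a pseudo-label regardless of confidence, and a document drawn again in a later iteration is labeled afresh by the current labeler, the sketch reflects the threshold-free and belief-updating properties the abstract names, though only in this toy form.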
Taxonomy
Topics: Text and Document Classification Technologies · Topic Modeling · Domain Adaptation and Few-Shot Learning
