Self-Training for Sample-Efficient Active Learning for Text Classification with Pre-Trained Language Models

Christopher Schröder; Gerhard Heyer

arXiv:2406.09206·cs.CL·October 7, 2024

TL;DR

This paper explores how self-training can make active learning for text classification with pre-trained language models more sample-efficient, reaching comparable results with only 25% of the labeled data.

Contribution

It introduces HAST, a novel self-training strategy that improves active learning efficiency, and provides a comprehensive evaluation of self-training approaches in NLP.

Findings

1. HAST outperforms previous self-training methods.
2. Achieves comparable results with only 25% of labeled data.
3. Effective across four text classification benchmarks.

Abstract

Active learning is an iterative labeling process used to obtain a small labeled subset when no labeled data is available, thereby making it possible to train a model for supervised tasks such as text classification. While active learning has made considerable progress in recent years thanks to pre-trained language models, there is untapped potential in the often neglected unlabeled portion of the data, even though it is available in considerably larger quantities than the usually small set of labeled data. In this work, we investigate how self-training, a semi-supervised approach that uses a model to obtain pseudo-labels for unlabeled data, can be used to improve the efficiency of active learning for text classification. Building on a comprehensive reproduction of four previous self-training approaches, some of which are evaluated for the first time in the context of…
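To make the interplay described in the abstract concrete, the following is a minimal sketch of one round that combines an active learning query with generic confidence-based pseudo-labeling. It is not the HAST strategy from the paper: the function names, the least-confidence query, the 0.9 threshold, and the toy probabilities are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def query_least_confident(probs: Sequence[Sequence[float]], k: int) -> List[int]:
    """Active learning step (hypothetical helper): return the indices of the
    k unlabeled instances whose top predicted probability is lowest."""
    ranked = sorted(range(len(probs)), key=lambda i: max(probs[i]))
    return ranked[:k]


def select_pseudo_labels(
    probs: Sequence[Sequence[float]],
    threshold: float,
    exclude: Sequence[int] = (),
) -> List[Tuple[int, int]]:
    """Self-training step (hypothetical helper): return (index, predicted label)
    for every instance whose top probability reaches the threshold."""
    skip = set(exclude)
    selected = []
    for i, p in enumerate(probs):
        if i in skip:
            continue  # instances sent to the human annotator get true labels
        confidence = max(p)
        if confidence >= threshold:
            selected.append((i, list(p).index(confidence)))
    return selected


# Toy class probabilities for four unlabeled texts (assumed values).
probs = [[0.97, 0.03], [0.55, 0.45], [0.10, 0.90], [0.51, 0.49]]

queried = query_least_confident(probs, k=1)                     # goes to the annotator
pseudo = select_pseudo_labels(probs, 0.9, exclude=queried)      # labeled by the model

print(queried)  # [3]
print(pseudo)   # [(0, 0), (2, 1)]
```

In this sketch the uncertain instance is routed to a human, while confident predictions on the remaining unlabeled data are added as pseudo-labels, so both portions of the pool contribute to the next training round.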

Code & Models

Repositories

chschroeder/self-training-for-sample-efficient-active-learning
PyTorch (official)

Videos

Self-Training for Sample-Efficient Active Learning for Text Classification with Pre-Trained Language Models · Underline

Taxonomy

Topics: Educational Assessment and Pedagogy

Methods: Contrastive Learning