Improving Probabilistic Models in Text Classification via Active Learning

Mitchell Bosley; Saki Kuzushima; Ted Enamorado; Yuki Shiraito

arXiv:2202.02629 · cs.CL · May 14, 2025

TL;DR

This paper introduces an active learning algorithm for text classification that reduces labeling costs while maintaining high accuracy, validated through empirical studies and replication of existing research.

Contribution

A novel active learning algorithm combining probabilistic models with focused labeling on difficult documents, improving efficiency in text classification tasks.

Findings

01. Performance comparable to state-of-the-art methods

02. Significant reduction in labeled data needed

03. Successful replication of previous studies with less data

Abstract

Social scientists often classify text documents and use the resulting labels as an outcome or a predictor in empirical research. Automated text classification has become a standard tool because it requires less human coding. However, scholars still need many human-labeled documents to train automated classifiers. To reduce labeling costs, we propose a new algorithm for text classification that combines a probabilistic model with active learning. The probabilistic model uses both labeled and unlabeled data, and active learning concentrates labeling effort on the documents that are difficult to classify. Our validation study shows that the classification performance of our algorithm is comparable to state-of-the-art methods at a fraction of the computational cost. Moreover, we replicate two recently published articles and reach the same substantive conclusions with only a small proportion of the…
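The general idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a multinomial mixture (naive Bayes) model fit by EM as the probabilistic model, and predictive entropy as the measure of how difficult a document is to classify. The function names `fit_semisupervised_nb` and `most_uncertain`, the smoothing parameter `alpha`, and the toy count matrix are all hypothetical.

```python
import numpy as np

def fit_semisupervised_nb(X, y, n_classes=2, n_iter=20, alpha=1.0):
    """EM for a multinomial mixture model (an assumed stand-in model).

    X: (n_docs, n_terms) term-count matrix.
    y: (n_docs,) integer labels, with -1 marking unlabeled documents.
    Returns (n_docs, n_classes) class probabilities; labeled documents
    stay fixed at their observed label.
    """
    n_docs = X.shape[0]
    labeled = y >= 0
    one_hot = np.eye(n_classes)[y[labeled]]
    # Start with uniform probabilities for unlabeled documents.
    resp = np.full((n_docs, n_classes), 1.0 / n_classes)
    resp[labeled] = one_hot
    for _ in range(n_iter):
        # M-step: smoothed class priors and per-class term probabilities,
        # estimated from labeled AND unlabeled documents.
        prior = (resp.sum(axis=0) + alpha) / (n_docs + n_classes * alpha)
        term_counts = resp.T @ X + alpha
        term_prob = term_counts / term_counts.sum(axis=1, keepdims=True)
        # E-step: posterior class probabilities, computed in log space.
        log_post = np.log(prior) + X @ np.log(term_prob).T
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
        resp[labeled] = one_hot  # observed labels are never overwritten
    return resp

def most_uncertain(resp, y, k=1):
    """Indices of the k unlabeled documents with the highest entropy."""
    entropy = -(resp * np.log(resp + 1e-12)).sum(axis=1)
    entropy[y >= 0] = -np.inf  # never query an already-labeled document
    return np.argsort(entropy)[::-1][:k]

# Toy corpus: 6 documents, 3 terms, only two documents labeled.
X = np.array([[5, 0, 1], [4, 1, 0], [0, 5, 1],
              [1, 4, 0], [2, 2, 1], [5, 1, 0]], dtype=float)
y = np.array([0, -1, 1, -1, -1, -1])

resp = fit_semisupervised_nb(X, y)
query = most_uncertain(resp, y, k=1)
print("Next document to send to a human coder:", query)
```

In a full loop, the queried document would be labeled by a human coder, its entry in `y` updated, and the model refit, repeating until the labeling budget is spent.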


Taxonomy

Topics: Text and Document Classification Technologies · Advanced Text Analysis Techniques · Topic Modeling