Improving Probabilistic Models in Text Classification via Active Learning
Mitchell Bosley, Saki Kuzushima, Ted Enamorado, Yuki Shiraito

TL;DR
This paper introduces an active learning algorithm for text classification that reduces labeling costs while maintaining high accuracy, validated through empirical studies and replication of existing research.
Contribution
A novel active learning algorithm combining probabilistic models with focused labeling on difficult documents, improving efficiency in text classification tasks.
Findings
Performance comparable to state-of-the-art methods
Significant reduction in labeled data needed
Successful replication of previous studies with less data
Abstract
Social scientists often classify text documents to use the resulting labels as an outcome or a predictor in empirical research. Automated text classification has become a standard tool because it requires less human coding. However, scholars still need many human-labeled documents to train automated classifiers. To reduce labeling costs, we propose a new algorithm for text classification that combines a probabilistic model with active learning. The probabilistic model uses both labeled and unlabeled data, and active learning concentrates labeling efforts on documents that are difficult to classify. Our validation study shows that the classification performance of our algorithm is comparable to state-of-the-art methods at a fraction of the computational cost. Moreover, we replicate two recently published articles and reach the same substantive conclusions with only a small proportion of the…
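The abstract does not spell out how "difficult" documents are chosen, so the following is only a minimal sketch of one common active learning criterion, entropy-based uncertainty sampling, and not necessarily the authors' procedure. The function names `entropy` and `select_uncertain` and the probability values are hypothetical, introduced purely for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_uncertain(pred_probs, k):
    """Return indices of the k documents with the highest predictive entropy.

    Assumption: pred_probs[i] holds the current model's class probabilities
    for unlabeled document i. Higher entropy is treated as "more difficult".
    """
    ranked = sorted(
        range(len(pred_probs)),
        key=lambda i: entropy(pred_probs[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical predicted probabilities for four unlabeled documents, two classes.
pred_probs = [
    [0.98, 0.02],  # confident prediction
    [0.55, 0.45],  # uncertain
    [0.80, 0.20],  # moderately confident
    [0.50, 0.50],  # maximally uncertain
]

# Documents to send to a human coder in this round.
to_label = select_uncertain(pred_probs, k=2)
print(to_label)  # [3, 1]
```

In a full loop, the newly labeled documents would be added to the training set, the model refit on labeled and unlabeled data together, and the selection repeated until the labeling budget is exhausted.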
Taxonomy
Topics: Text and Document Classification Technologies · Advanced Text Analysis Techniques · Topic Modeling
