Dominant Set-based Active Learning for Text Classification and its Application to Online Social Media

Toktam A. Oghaz; Ivan Garibay

arXiv:2202.00540·cs.CL·February 2, 2022

Open Access

TL;DR

This paper introduces a parameter-free, dominant set-based active learning method for text classification that efficiently reduces labeling costs while maintaining high accuracy, outperforming existing strategies across various datasets and models.

Contribution

The novel active learning approach identifies cohesive local clusters and selects boundary samples without parameter tuning, enhancing efficiency and effectiveness in NLP tasks.

Findings

01. Achieves accuracy comparable to full-data training while using fewer labeled samples.

02. Outperforms state-of-the-art active learning methods.

03. Is effective across diverse datasets and neural network architectures.

Abstract

Recent advances in natural language processing (NLP) for online social media owe much to large-scale datasets. However, labeling, storing, and processing a large number of textual data points, e.g., tweets, remains challenging. Moreover, in applications such as hate speech detection, labeling a sufficiently large dataset containing offensive content can be mentally and emotionally taxing for human annotators. Thus, NLP methods that can make the best use of significantly fewer labeled data points are of great interest. In this paper, we present a novel pool-based active learning method that can be used to train on a large unlabeled corpus with minimum annotation cost. To that end, we propose to find the dominant sets of local clusters in the feature space. These sets represent maximally cohesive structures in the data. Then, the samples that do not belong to any of…
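To make the idea of a "maximally cohesive" dominant set concrete, here is a minimal sketch of how one dominant set can be extracted from a pairwise affinity matrix with discrete replicator dynamics, the standard procedure in the dominant-set clustering literature. This excerpt does not confirm that the authors use exactly this procedure. The Gaussian affinity, its `sigma`, the support `cutoff`, the toy points, and the function names `gaussian_affinity` and `dominant_set` are all assumptions made for illustration and are not part of the paper's method.

```python
import numpy as np


def gaussian_affinity(X, sigma=1.0):
    """Pairwise Gaussian similarities with a zero diagonal (assumed affinity)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    A = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)  # dominant sets assume no self-similarity
    return A


def dominant_set(A, tol=1e-8, max_iter=1000, cutoff=1e-5):
    """Extract one dominant set from a symmetric, non-negative affinity matrix.

    Runs discrete replicator dynamics, x_i <- x_i * (A x)_i / (x^T A x),
    starting from the uniform distribution. Points whose final weight
    exceeds `cutoff` (a hypothetical threshold) form the dominant set.
    """
    n = A.shape[0]
    x = np.full(n, 1.0 / n)  # uniform start on the simplex
    for _ in range(max_iter):
        Ax = A @ x
        cohesiveness = x @ Ax  # x^T A x, the quantity the dynamics increase
        if cohesiveness <= 0.0:
            break  # no positive affinity left to exploit
        x_next = x * Ax / cohesiveness
        converged = np.abs(x_next - x).sum() < tol
        x = x_next
        if converged:
            break
    return np.flatnonzero(x > cutoff), x


# Toy data (hypothetical): three tightly grouped points and two isolated ones.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [-5.0, 4.0]])
members, weights = dominant_set(gaussian_affinity(X))

# Indices that fall outside the extracted dominant set.
outside = np.setdiff1d(np.arange(len(X)), members)
```

On this toy input, `members` should contain the three clustered points and `outside` the two isolated ones. How the paper goes from dominant sets to the samples sent for annotation is not covered by this sketch.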

Taxonomy

Topics: Machine Learning and Algorithms · COVID-19 diagnosis using AI · Hate Speech and Cyberbullying Detection