Dominant Set-based Active Learning for Text Classification and its Application to Online Social Media
Toktam A. Oghaz, Ivan Garibay

TL;DR
This paper introduces a parameter-free, dominant set-based active learning method for text classification that efficiently reduces labeling costs while maintaining high accuracy, outperforming existing strategies across various datasets and models.
Contribution
The novel active learning approach identifies cohesive local clusters and selects boundary samples without parameter tuning, enhancing efficiency and effectiveness in NLP tasks.
Findings
Achieves comparable accuracy to full data training with fewer samples.
Outperforms state-of-the-art active learning methods.
Effective across diverse datasets and neural network architectures.
Abstract
Recent advances in natural language processing (NLP) in online social media are evidently owed to large-scale datasets. However, labeling, storing, and processing a large number of textual data points, e.g., tweets, has remained challenging. On top of that, in applications such as hate speech detection, labeling a sufficiently large dataset containing offensive content can be mentally and emotionally taxing for human annotators. Thus, NLP methods that can make the best use of significantly fewer labeled data points are of great interest. In this paper, we present a novel pool-based active learning method that can be used for training on a large unlabeled corpus with minimum annotation cost. For that, we propose to find the dominant sets of local clusters in the feature space. These sets represent maximally cohesive structures in the data. Then, the samples that do not belong to any of…
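The abstract does not show how a dominant set is computed, so the following is only a minimal sketch of the standard approach from the dominant-set clustering literature: replicator dynamics run on a pairwise affinity matrix. It is not the authors' implementation. The function names (`gaussian_affinity`, `dominant_set`), the Gaussian kernel, and the parameters `sigma`, `tol`, and `cutoff` are all assumptions made for illustration, and the sketch covers only the extraction of one cohesive group, not the paper's rule for choosing which samples to label.

```python
import numpy as np


def gaussian_affinity(X, sigma=1.0):
    """Pairwise Gaussian affinities with a zero diagonal (assumed kernel)."""
    # Squared Euclidean distance between every pair of rows in X.
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    A = np.exp(-sq_dist / (2.0 * sigma ** 2))
    # Dominant-set formulations use no self-affinity.
    np.fill_diagonal(A, 0.0)
    return A


def dominant_set(A, max_iter=1000, tol=1e-8, cutoff=1e-5):
    """Extract one dominant set with discrete replicator dynamics.

    Returns the indices whose final weight exceeds `cutoff` and the
    weight vector itself. `tol` and `cutoff` are illustrative defaults.
    """
    n = A.shape[0]
    # Start from the barycenter of the simplex: equal weight on every point.
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        Ax = A @ x
        cohesion = x @ Ax  # x^T A x, the quantity the dynamics increase
        if cohesion <= 0.0:
            break  # no positive affinity left among the weighted points
        x_new = x * Ax / cohesion
        converged = np.abs(x_new - x).sum() < tol
        x = x_new
        if converged:
            break
    members = np.flatnonzero(x > cutoff)
    return members, x
```

Under these assumptions, calling `dominant_set(gaussian_affinity(features))` on a small feature matrix returns the indices of one tightly connected group. Points with negligible affinity to that group end up with near-zero weight and fall outside it.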
Taxonomy
Topics: Machine Learning and Algorithms · COVID-19 diagnosis using AI · Hate Speech and Cyberbullying Detection
