The Application of Active Query K-Means in Text Classification
Yukun Jiang

TL;DR
This paper introduces an active query K-Means algorithm for text classification that improves accuracy and reduces labeling costs by combining semi-supervised clustering with active learning techniques.
Contribution
It extends traditional K-Means into a semi-supervised and active learning framework using Penalized Min-Max-selection for more efficient text classification.
Findings
Increased classification accuracy on a Chinese news dataset
Reduced labeling costs in the active learning process
Stable initial centroids through penalized selection
Abstract
Active learning is a state-of-the-art machine learning approach for dealing with an abundance of unlabeled data. In Natural Language Processing, annotating all the data is typically costly and time-consuming. This inefficiency motivates our application of active learning to text classification. In this research, traditional unsupervised k-means clustering is first modified into a semi-supervised version. Then, in a novel attempt, the algorithm is further extended to an active learning scenario with Penalized Min-Max-selection, which makes a limited number of queries that yield more stable initial centroids. The method uses both the interactive query results from users and the underlying distance representation. Tested on a Chinese news dataset, it shows a consistent increase in accuracy while lowering the cost of training.
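The abstract does not spell out the selection rule, so the following is only a rough sketch of how the pipeline might look, not the paper's implementation. It assumes that min-max selection means farthest-first querying (pick the unlabeled point whose nearest already-queried point is farthest away) and that the penalty subtracts a point's average distance to all other points, so that outliers are less likely to be queried. The penalty form, the weight `lam`, the `oracle` callback, and all function names are hypothetical. The queried labels then seed the k-means centroids, and queried points keep their labels during clustering.

```python
import math
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def penalized_min_max_queries(X, oracle, k, budget, lam=0.5):
    """Query labels until all k classes are seen or the budget is spent."""
    n = len(X)
    # Assumed penalty term: average distance to every other point.
    avg = [sum(dist(X[i], X[j]) for j in range(n)) / (n - 1) for i in range(n)]
    first = min(range(n), key=lambda i: avg[i])  # start from the most central point
    labeled = {first: oracle(first)}
    while len(labeled) < budget and len(set(labeled.values())) < k:
        def score(i):
            # Min-max part: distance to the nearest already-queried point.
            nearest = min(dist(X[i], X[j]) for j in labeled)
            return nearest - lam * avg[i]
        cand = max((i for i in range(n) if i not in labeled), key=score)
        labeled[cand] = oracle(cand)
    return labeled


def seeded_kmeans(X, labeled, iters=20):
    """K-means whose initial centroids are the means of the queried seeds."""
    seeds = defaultdict(list)
    for i, y in labeled.items():
        seeds[y].append(X[i])
    centroids = {y: mean(pts) for y, pts in seeds.items()}
    assign = []
    for _ in range(iters):
        # Queried points keep their label; the rest go to the nearest centroid.
        new_assign = [
            labeled[i] if i in labeled
            else min(centroids, key=lambda y: dist(X[i], centroids[y]))
            for i in range(len(X))
        ]
        if new_assign == assign:
            break
        assign = new_assign
        groups = defaultdict(list)
        for x, y in zip(X, assign):
            groups[y].append(x)
        centroids = {y: mean(pts) for y, pts in groups.items()}
    return assign, centroids


# Toy stand-in for document vectors: three well-separated 2-D groups.
X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (0, 10), (0, 11), (1, 10)]
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labeled = penalized_min_max_queries(X, lambda i: truth[i], k=3, budget=5)
assign, centroids = seeded_kmeans(X, labeled)
print(len(labeled), assign)
```

In this toy run the `oracle` is a lookup into known labels; in the interactive setting the abstract describes, it would be a user answering the query. For real text data, `X` would hold document vectors such as TF-IDF features, and a different distance measure may be more appropriate.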
Taxonomy
Topics: Machine Learning and Algorithms · Text and Document Classification Technologies · Algorithms and Data Compression
Methods: k-Means Clustering
