An Improved k-Nearest Neighbor Algorithm for Text Categorization
Baoli Li, Shiwen Yu, and Qin Lu

TL;DR
This paper introduces an improved k-Nearest Neighbor algorithm for text categorization that adapts the number of neighbors per category, reducing bias and improving classification of smaller classes.
Contribution
The proposed method uses variable neighbor counts per category, addressing class imbalance and parameter sensitivity issues in traditional kNN for text classification.
Findings
Less sensitive to the parameter k than traditional kNN
Better classification of small classes than traditional kNN when k is large
Effective in Chinese text categorization
Abstract
k is the most important parameter in a text categorization system based on the k-Nearest Neighbor algorithm (kNN). In the classification process, the k documents in the training set nearest to the test document are determined first. Then, the prediction can be made according to the category distribution among these k nearest neighbors. Generally speaking, the class distribution in the training set is uneven: some classes may have more samples than others. Therefore, the system performance is very sensitive to the choice of the parameter k, and it is very likely that a fixed k value will result in a bias toward large categories. To deal with these problems, we propose an improved kNN algorithm, which uses different numbers of nearest neighbors for different categories, rather than a fixed number across all categories. More samples (nearest neighbors) will be used for deciding whether a test document…
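The contrast between a fixed k and a per-category k can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract is cut off before it states how the per-category neighbor count is chosen, so the rule used here (scaling k by each category's share of the largest category's training size) is an assumption, as are the cosine similarity, the similarity-weighted scoring, and all function names (`cosine`, `ranked_neighbors`, `knn_fixed`, `knn_adaptive`).

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ranked_neighbors(doc, train):
    """Return (similarity, label) pairs, most similar training document first."""
    pairs = [(cosine(doc, vec), label) for vec, label in train]
    return sorted(pairs, key=lambda p: p[0], reverse=True)


def knn_fixed(doc, train, k):
    """Baseline kNN: one similarity-weighted vote over the same top-k neighbors."""
    scores = defaultdict(float)
    for sim, label in ranked_neighbors(doc, train)[:k]:
        scores[label] += sim
    return max(scores, key=scores.get)


def knn_adaptive(doc, train, k):
    """Per-category neighbor counts (illustrative rule, not taken from the paper)."""
    sizes = Counter(label for _, label in train)
    largest = max(sizes.values())
    ranked = ranked_neighbors(doc, train)
    scores = {}
    for category, size in sizes.items():
        # ASSUMPTION: k_c grows with the category's training size, capped at k.
        k_c = max(1, min(k, math.ceil(k * size / largest)))
        top = ranked[:k_c]
        total = sum(sim for sim, _ in top)
        hits = sum(sim for sim, label in top if label == category)
        # Normalize by the similarity mass of the k_c neighbors that were inspected,
        # so categories judged on different neighbor counts remain comparable.
        scores[category] = hits / total if total else 0.0
    return max(scores, key=scores.get)


# Toy imbalanced training set: six "sports" documents, two "finance" documents.
train = [({"team": 1.0}, "sports")] * 6 + [({"market": 2.0, "team": 1.0}, "finance")] * 2
query = {"market": 1.0, "team": 1.0}

print(knn_fixed(query, train, k=6))
print(knn_adaptive(query, train, k=6))
```

In this toy data the two most similar documents belong to the small category, but with k = 6 the fixed vote also counts four documents from the large category, which is the bias the abstract describes. The adaptive variant judges the small category on its two nearest neighbors only.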
Taxonomy
Topics: Text and Document Classification Technologies · Spam and Phishing Detection · Web Data Mining and Analysis
