An Improved k-Nearest Neighbor Algorithm for Text Categorization

Baoli Li, Shiwen Yu, and Qin Lu

arXiv:cs/0306099 · cs.CL · May 23, 2007 · 86 citations

TL;DR

This paper introduces an improved k-Nearest Neighbor algorithm for text categorization that adapts the number of neighbors per category, reducing bias and improving classification of smaller classes.

Contribution

The proposed method uses variable neighbor counts per category, addressing class imbalance and parameter sensitivity issues in traditional kNN for text classification.

Findings

1. Less sensitive to the parameter k than traditional kNN

2. Better classification of small classes with large k

3. Effective in Chinese text categorization

Abstract

k is the most important parameter in a text categorization system based on the k-Nearest Neighbor algorithm (kNN). In the classification process, the k documents in the training set nearest to the test document are determined first. Then, the prediction can be made according to the category distribution among these k nearest neighbors. Generally speaking, the class distribution in the training set is uneven: some classes may have more samples than others. Therefore, the system performance is very sensitive to the choice of the parameter k, and a fixed k value is very likely to result in a bias toward large categories. To deal with these problems, we propose an improved kNN algorithm, which uses different numbers of nearest neighbors for different categories, rather than a fixed number across all categories. More samples (nearest neighbors) will be used for deciding whether a test document…
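The contrast drawn in the abstract, a single k for all categories versus a category-specific number of neighbors, can be illustrated with a small sketch. The abstract is truncated before it states the exact rule, so everything specific here is an assumption rather than the authors' formulation: the helper names (`cosine`, `ranked_neighbors`, `knn_fixed`, `per_category_k`, `knn_adaptive`), the rule that scales k by category size relative to the largest category, the `min_k` floor, the similarity-weighted scoring, and the toy data.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def ranked_neighbors(train, doc):
    """(similarity, label) pairs, most similar training document first."""
    pairs = [(cosine(doc, vec), label) for vec, label in train]
    return sorted(pairs, key=lambda p: p[0], reverse=True)


def knn_fixed(train, doc, k):
    """Baseline kNN: sum similarities per category over the k nearest."""
    scores = defaultdict(float)
    for sim, label in ranked_neighbors(train, doc)[:k]:
        scores[label] += sim
    return max(scores, key=scores.get)


def per_category_k(train, k, min_k=1):
    """ASSUMPTION (not from the source): scale k by each category's size
    relative to the largest category, with a floor of min_k."""
    sizes = Counter(label for _, label in train)
    largest = max(sizes.values())
    return {c: max(min_k, math.ceil(k * n / largest)) for c, n in sizes.items()}


def knn_adaptive(train, doc, k, min_k=1):
    """Variant: judge each category on its own number of nearest neighbors.

    ASSUMPTION: a category's score is the share of similarity mass that
    belongs to it among its own top-k_c neighbors."""
    ranked = ranked_neighbors(train, doc)
    scores = {}
    for cat, k_c in per_category_k(train, k, min_k).items():
        top = ranked[:k_c]
        total = sum(sim for sim, _ in top)
        hit = sum(sim for sim, label in top if label == cat)
        scores[cat] = hit / total if total else 0.0
    return max(scores, key=scores.get)


# Toy, made-up data: six "sports" documents and two "finance" documents.
train = [({"ball": 1.0, "bank": 0.3}, "sports") for _ in range(6)]
train += [({"bank": 1.0}, "finance"), ({"bank": 1.0, "ball": 0.2}, "finance")]
doc = {"bank": 1.0, "ball": 0.3}  # closest to the two finance documents

fixed_label = knn_fixed(train, doc, k=6)
adaptive_label = knn_adaptive(train, doc, k=6)
```

On this toy data with k = 6, the fixed-k baseline assigns the test document to the large "sports" class because four moderately similar sports documents outweigh the two very similar finance documents, while the adaptive variant evaluates "finance" on only its two nearest neighbors and picks it. This shows the kind of large-category bias the abstract describes; it makes no claim about the paper's reported results.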

Taxonomy

Topics: Text and Document Classification Technologies · Spam and Phishing Detection · Web Data Mining and Analysis