Text Categorization via Similarity Search: An Efficient and Effective Novel Algorithm
Hubert Haoyang Duan, Vladimir Pestov, and Varun Singla

TL;DR
This paper introduces a supervised text categorization algorithm based on similarity search in the metric space of measure distributions on the dictionary. The algorithm placed 2nd in the CDMC'2012 text categorization division and performed effectively on the Reuters-21578 dataset.
Contribution
It proposes classifying texts by similarity between measure distributions. Unlike centroid-based methods, it represents each category by an outlier point: a uniform measure distribution on a selection of domain-specific words.
Findings
Achieved 2nd place in CDMC'2012 text categorization division.
Performed effectively on the Reuters-21578 dataset.
Efficient in both training and classification stages.
Abstract
We present a supervised learning algorithm for text categorization which brought the team of authors 2nd place in the text categorization division of the 2012 Cybersecurity Data Mining Competition (CDMC'2012) and a 3rd prize overall. The algorithm is quite different from existing approaches in that it is based on similarity search in the metric space of measure distributions on the dictionary. At the preprocessing stage, given a labeled learning sample of texts, we associate to every class label (document category) a point in the space in question. Unlike what is usual in clustering, this point is not a centroid of the category but rather an outlier, a uniform measure distribution on a selection of domain-specific words. At the execution stage, an unlabeled text is assigned a text category as defined by the closest labeled neighbour to the point representing the frequency…
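The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions: the abstract does not say how domain-specific words are selected (here, hypothetically, words that occur in only one category), which metric is used on measure distributions (here, the L1 distance), or how a text is represented beyond the truncated final sentence (here, its relative word frequencies). All function names are illustrative.

```python
from collections import Counter


def frequency_distribution(tokens):
    """Relative word frequencies of a text: a probability measure on the dictionary."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def uniform_distribution(words):
    """Uniform measure supported on a set of selected words."""
    words = set(words)
    return {word: 1.0 / len(words) for word in words}


def l1_distance(p, q):
    """L1 distance between two measures (assumed metric, not taken from the paper)."""
    return sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in set(p) | set(q))


def build_class_points(labeled_docs):
    """Preprocessing: map each label to a uniform measure on its selected words.

    Hypothetical selection rule: keep words seen in exactly one category.
    """
    vocab = {}
    for label, tokens in labeled_docs:
        vocab.setdefault(label, set()).update(tokens)
    points = {}
    for label, words in vocab.items():
        others = set().union(*(v for k, v in vocab.items() if k != label))
        specific = words - others
        if specific:
            points[label] = uniform_distribution(specific)
    return points


def classify(tokens, class_points):
    """Execution: return the label whose point is closest to the text's distribution."""
    doc = frequency_distribution(tokens)
    return min(class_points, key=lambda label: l1_distance(doc, class_points[label]))


train = [
    ("spam", "win money now win prize".split()),
    ("spam", "free money prize claim".split()),
    ("ham", "meeting schedule tomorrow project".split()),
    ("ham", "project report meeting notes".split()),
]
class_points = build_class_points(train)
prediction = classify("claim your free prize money".split(), class_points)
```

Because each category is reduced to a single point, classifying a text in this sketch costs one distance computation per category, which is consistent with the efficiency claim in the findings.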
Taxonomy
Topics: Text and Document Classification Technologies · Spam and Phishing Detection · Algorithms and Data Compression
