Cluster Based Symbolic Representation for Skewed Text Categorization

Lavanya Narayana Raju; Mahamad Suhil; D S Guru; Harsha S Gowda

arXiv:1706.07912 · cs.IR · June 27, 2017

TL;DR

This paper introduces a clustering-based symbolic representation method to balance skewed text datasets, improving classification efficiency and accuracy on benchmark datasets compared to existing models.

Contribution

It presents a novel approach combining clustering and symbolic representation to address class imbalance in text categorization, reducing dimensionality and computational cost.

Findings

01. Outperforms existing models on the Reuters-21578 and TDT2 datasets.

02. Reduces classification time and space requirements.

03. Effective in handling imbalanced text corpora.

Abstract

This work addresses a problem associated with imbalanced text corpora and presents a method for converting an imbalanced text corpus into a balanced one using a clustering algorithm. First, to avoid the curse of dimensionality, an effective representation scheme based on a term-class relevancy measure is adapted, which drastically reduces the dimension to the number of classes in the corpus. Subsequently, the samples of the larger classes are grouped into a number of smaller subclasses to make the entire corpus balanced. Each subclass is then given a single symbolic vector representation using interval-valued features. Besides being compact, this symbolic representation reduces both the space requirement and the classification time. The proposed model has been empirically demonstrated for its superiority on…
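The pipeline described in the abstract (split large classes into subclasses by clustering, then summarize each subclass as one interval-valued vector) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the documents are already represented in the reduced space (one feature per class), uses a plain k-means as the clustering algorithm, takes each interval as mean ± standard deviation, and classifies by counting how many features of a test vector fall inside a subclass's intervals. The names `kmeans`, `build_symbolic_model`, `classify`, and `target_size` are hypothetical and chosen only for this example.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per row of X."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct samples (fancy indexing copies).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def build_symbolic_model(X, y, target_size):
    """Return a list of (class_label, lower, upper) interval vectors.

    Classes larger than target_size are split into roughly
    len(class) / target_size subclasses (assumption of this sketch).
    """
    model = []
    for c in np.unique(y):
        Xc = X[y == c]
        k = max(1, int(np.ceil(len(Xc) / target_size)))
        labels = kmeans(Xc, k) if k > 1 else np.zeros(len(Xc), dtype=int)
        for j in range(k):
            members = Xc[labels == j]
            if len(members) == 0:
                continue  # skip clusters that ended up empty
            mu, sigma = members.mean(axis=0), members.std(axis=0)
            # Assumed interval definition: [mean - std, mean + std].
            model.append((c, mu - sigma, mu + sigma))
    return model

def classify(x, model):
    """Assign the class of the subclass whose intervals contain most features."""
    scores = [np.sum((x >= lo) & (x <= hi)) for _, lo, hi in model]
    return model[int(np.argmax(scores))][0]
```

With `target_size` set to the size of the smallest class, every class ends up represented by subclasses of comparable size, and the stored model holds one pair of bound vectors per subclass instead of every training sample, which is where the space and time savings mentioned in the abstract would come from in this sketch.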
