Text Mining using Nonnegative Matrix Factorization and Latent Semantic Analysis
Ali Hassani, Amir Iranmanesh, Najme Mansouri

TL;DR
This paper introduces a novel feature agglomeration method using Nonnegative Matrix Factorization for text clustering, enhancing stability and performance over existing techniques like Latent Semantic Analysis.
Contribution
It proposes a new feature agglomeration approach with NMF and a deterministic initialization for spherical K-Means, improving clustering stability and effectiveness.
Findings
Significant improvement in clustering performance
Enhanced stability of results
Performance comparable to or better than existing methods
Abstract
Text clustering is arguably one of the most important topics in modern data mining. However, text data require tokenization, which usually yields a very large and highly sparse term-document matrix that is difficult to process with conventional machine learning algorithms. Methods such as Latent Semantic Analysis have helped mitigate this issue, but are not completely stable in practice. We therefore propose a new feature agglomeration method based on Nonnegative Matrix Factorization: NMF is used to separate the terms into groups, and each group's term vectors are then agglomerated into a new feature vector. Together, these feature vectors form a new feature space much more suitable for clustering. In addition, we propose a new deterministic initialization for spherical K-Means, which proves very useful for this specific type of data. In order to…
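The feature agglomeration idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the NMF is computed with standard multiplicative updates, each term is assigned to the factor with the largest weight in its row of W, and a group's term vectors are agglomerated by summing them. The function names `nmf_multiplicative` and `agglomerate_terms` and the toy matrix are hypothetical.

```python
import numpy as np


def nmf_multiplicative(X, k, n_iter=200, seed=0, eps=1e-10):
    """Approximate X (n_terms x n_docs, nonnegative) as W @ H.

    Uses Lee-Seung multiplicative updates for the Frobenius objective.
    This is an assumed, generic NMF solver, not the paper's own.
    """
    rng = np.random.default_rng(seed)
    n_terms, n_docs = X.shape
    W = rng.random((n_terms, k)) + eps
    H = rng.random((k, n_docs)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def agglomerate_terms(X, k, **nmf_kwargs):
    """Group terms via NMF, then merge each group's term vectors.

    Assumptions (not taken from the paper): a term belongs to the
    factor with the largest entry in its row of W, and a group is
    agglomerated by summing its term vectors.
    """
    W, _ = nmf_multiplicative(X, k, **nmf_kwargs)
    groups = W.argmax(axis=1)            # one group index per term
    Z = np.zeros((k, X.shape[1]))
    np.add.at(Z, groups, X)              # sum term rows within each group
    return Z, groups


# Toy term-document matrix: 6 terms (rows) x 4 documents (columns).
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [3, 1, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
    [0, 0, 1, 3],
], dtype=float)

Z, groups = agglomerate_terms(X, k=2)
# Each column of Z is a document described by k agglomerated features
# instead of the original 6 sparse term counts.
```

Under these assumptions the documents (columns of `Z`) live in a k-dimensional space that could then be length-normalized and passed to a clustering step such as spherical K-Means. The deterministic initialization the abstract mentions is not sketched here because the truncated text does not describe it.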
