Text Mining using Nonnegative Matrix Factorization and Latent Semantic Analysis

Ali Hassani; Amir Iranmanesh; Najme Mansouri

arXiv:1911.04705·cs.LG·February 25, 2020

TL;DR

This paper introduces a novel feature agglomeration method using Nonnegative Matrix Factorization for text clustering, enhancing stability and performance over existing techniques like Latent Semantic Analysis.

Contribution

It proposes two things: a new feature agglomeration approach based on NMF, and a deterministic initialization for spherical K-Means. Both are aimed at improving clustering stability and effectiveness.

Findings

01 Significant improvement in clustering performance

02 Enhanced stability of results

03 Comparable to or better than existing methods

Abstract

Text clustering is arguably one of the most important topics in modern data mining. Nevertheless, text data require tokenization which usually yields a very large and highly sparse term-document matrix, which is usually difficult to process using conventional machine learning algorithms. Methods such as Latent Semantic Analysis have helped mitigate this issue, but are nevertheless not completely stable in practice. As a result, we propose a new feature agglomeration method based on Nonnegative Matrix Factorization, which is employed to separate the terms into groups, and then each group's term vectors are agglomerated into a new feature vector. Together, these feature vectors create a new feature space much more suitable for clustering. In addition, we propose a new deterministic initialization for spherical K-Means, which proves very useful for this specific type of data. In order to…
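The feature agglomeration idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a terms-by-documents matrix, a plain multiplicative-update NMF with a fixed random seed, hard assignment of each term to its strongest NMF factor, and summation as the agglomeration rule. The function names `nmf` and `agglomerate_terms`, the toy matrix, and all parameter values are hypothetical choices made for illustration; the paper's actual grouping and agglomeration rules may differ.

```python
import numpy as np


def nmf(X, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate X (nonnegative) as W @ H using multiplicative updates.

    Assumption: Frobenius-norm objective with random initialization.
    """
    rng = np.random.default_rng(seed)
    n_terms, n_docs = X.shape
    W = rng.random((n_terms, k)) + eps
    H = rng.random((k, n_docs)) + eps
    for _ in range(n_iter):
        # eps in the denominators guards against division by zero.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def agglomerate_terms(X, k, **nmf_kwargs):
    """Group terms with NMF, then merge each group's rows into one feature.

    X is terms x documents. Returns a (k x documents) feature matrix and
    the group index assigned to each term.
    """
    W, _ = nmf(X, k, **nmf_kwargs)
    # Assumption: a term belongs to the factor with its largest weight.
    groups = W.argmax(axis=1)
    features = np.zeros((k, X.shape[1]))
    for g in range(k):
        # Assumption: agglomeration is a plain sum of the group's term vectors.
        features[g] = X[groups == g].sum(axis=0)
    return features, groups


# Toy term-document matrix: terms 0-2 occur only in documents 0-1,
# terms 3-5 occur only in documents 2-3.
X_toy = np.array(
    [
        [2, 1, 0, 0],
        [4, 2, 0, 0],
        [2, 1, 0, 0],
        [0, 0, 1, 3],
        [0, 0, 2, 6],
        [0, 0, 1, 3],
    ],
    dtype=float,
)

features, groups = agglomerate_terms(X_toy, k=2)
```

Each document is then described by `k` agglomerated features instead of one feature per term, which is the kind of smaller, denser feature space the abstract refers to. A clustering step would operate on the columns of `features`.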
