Taming Wild High Dimensional Text Data with a Fuzzy Lash

Amir Karami

arXiv:1712.05997·stat.ML·December 19, 2017

TL;DR

This paper introduces a fuzzy clustering-based dimension reduction method for high-dimensional text data and reports that it represents documents more effectively than traditional techniques such as PCA and SVD.

Contribution

It is the first to apply fuzzy clustering as a dimension reduction technique within the Unsupervised Feature Transformation framework for text data.

Findings

01. Fuzzy clustering outperforms PCA and SVD in document representation.

02. The method effectively reduces dimensionality while preserving important information.

03. Experimental results demonstrate superior performance of the proposed approach.

Abstract

The bag of words (BOW) model represents a corpus as a matrix whose elements are word frequencies. However, each row in the matrix is a very high-dimensional sparse vector. Dimension reduction (DR) is a popular method to address the sparsity and high-dimensionality issues. Among the different strategies for developing DR methods, Unsupervised Feature Transformation (UFT) is a popular strategy that maps all words onto a new basis to represent the BOW. The recent increase in text data and its challenges imply that the DR area still needs new perspectives. Although a wide range of methods based on the UFT strategy have been developed, the fuzzy approach has not been considered for DR based on this strategy. This research investigates the application of fuzzy clustering as a DR method based on the UFT strategy to collapse the BOW matrix to provide a lower-dimensional representation of documents instead of the words in a…
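
The idea in the abstract can be illustrated with a minimal sketch. It assumes plain fuzzy c-means with fuzzifier m = 2 applied to the rows of a small BOW matrix, so that each document is re-expressed by its k cluster memberships instead of its word counts. The abstract does not specify the exact algorithm or settings, so the functions `bag_of_words` and `fuzzy_c_means`, the toy corpus, and all parameter values are hypothetical choices for illustration, not the author's implementation.

```python
import random
from collections import Counter


def bag_of_words(docs):
    """Return (vocab, X) where X[i][j] is the count of vocab[j] in docs[i]."""
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[float(c[w]) for w in vocab] for c in counts]


def fuzzy_c_means(X, k, m=2.0, iters=100, tol=1e-6, seed=0):
    """Plain fuzzy c-means (assumed variant, not confirmed by the paper).

    Returns U, where U[i][c] is the membership of row i in cluster c.
    Each row of U sums to 1 and serves as a k-dimensional representation.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])

    # Random initial memberships, each row normalised to sum to 1.
    U = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(k)]
        total = sum(row)
        U.append([v / total for v in row])

    for _ in range(iters):
        # Centroid update: mean of the rows weighted by membership ** m.
        W = [[U[i][c] ** m for i in range(n)] for c in range(k)]
        centers = [
            [sum(W[c][i] * X[i][j] for i in range(n)) / sum(W[c]) for j in range(d)]
            for c in range(k)
        ]

        # Membership update: proportional to distance ** (-2 / (m - 1)).
        new_U = []
        for i in range(n):
            dist = [
                max(sum((X[i][j] - centers[c][j]) ** 2 for j in range(d)) ** 0.5, 1e-12)
                for c in range(k)
            ]
            inv = [dv ** (-2.0 / (m - 1.0)) for dv in dist]
            total = sum(inv)
            new_U.append([v / total for v in inv])

        shift = max(abs(new_U[i][c] - U[i][c]) for i in range(n) for c in range(k))
        U = new_U
        if shift < tol:
            break
    return U


# Toy corpus (hypothetical): 4 documents over a 7-word vocabulary.
docs = [
    "cat dog cat pet",
    "dog pet cat",
    "stock market trade",
    "market stock price trade",
]
vocab, X = bag_of_words(docs)
U = fuzzy_c_means(X, k=2)  # 4 x 7 word counts collapsed to 4 x 2 memberships

for doc, memberships in zip(docs, U):
    print(doc, [round(v, 3) for v in memberships])
```

In this sketch the number of clusters k plays the role of the reduced dimension, analogous to the number of components kept in PCA or SVD. Unlike those projections, every coordinate is non-negative and each document's coordinates sum to 1.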
