# Subspace Clustering of Very Sparse High-Dimensional Data

**Authors:** Hankui Peng, Nicos Pavlidis, Idris Eckley, and Ioannis Tsalamanis

arXiv: 1901.09108 · 2019-01-29

## TL;DR

This paper introduces a simple subspace clustering algorithm, built on linear algebra, for very sparse, high-dimensional short texts. On the task of identifying product categories from Amazon product names, it can be competitive against state-of-the-art clustering algorithms.

## Contribution

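The paper proposes a new, simple subspace clustering method for collections of very short texts. It targets the main difficulty of such data: the vector representation is high-dimensional, because the corpus contains many unique terms, and extremely sparse, because each text contains only a few words with no repetition.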

## Key findings

- The algorithm can be competitive against state-of-the-art clustering algorithms
- Evaluated on identifying product categories from product names obtained from the US Amazon website
- Designed for high-dimensional, extremely sparse representations of short texts

## Abstract

In this paper we consider the problem of clustering collections of very short texts using subspace clustering. This problem arises in many applications such as product categorisation, fraud detection, and sentiment analysis. The main challenge lies in the fact that the vectorial representation of short texts is both high-dimensional, due to the large number of unique terms in the corpus, and extremely sparse, as each text contains a very small number of words with no repetition. We propose a new, simple subspace clustering algorithm that relies on linear algebra to cluster such datasets. Experimental results on identifying product categories from product names obtained from the US Amazon website indicate that the algorithm can be competitive against state-of-the-art clustering algorithms.
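To make the sparsity described in the abstract concrete, here is a minimal Python sketch of a binary bag-of-words representation for short texts. It does not implement the paper's clustering algorithm. The product names, the whitespace tokenisation, and the helper names `binary_bag_of_words` and `density` are all illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List, Set, Tuple


def binary_bag_of_words(texts: List[str]) -> Tuple[Dict[str, int], List[Set[int]]]:
    """Represent each text by the set of column indices of the terms it contains.

    Storing only the non-zero column indices is a simple sparse format:
    a term is either present (1) or absent (0) in a text.
    """
    vocab: Dict[str, int] = {}
    rows: List[Set[int]] = []
    for text in texts:
        row: Set[int] = set()
        # Assumption: lowercase + whitespace split stands in for real tokenisation.
        for term in text.lower().split():
            # Assign the next free column index to a term seen for the first time.
            idx = vocab.setdefault(term, len(vocab))
            row.add(idx)
        rows.append(row)
    return vocab, rows


def density(vocab: Dict[str, int], rows: List[Set[int]]) -> float:
    """Fraction of non-zero entries in the implied texts-by-terms matrix."""
    total_cells = len(rows) * len(vocab)
    nonzeros = sum(len(row) for row in rows)
    return nonzeros / total_cells if total_cells else 0.0


# Hypothetical product names, made up for illustration only.
product_names = [
    "stainless steel water bottle",
    "wireless bluetooth headphones",
    "organic green tea bags",
    "leather wallet for men",
]

vocab, rows = binary_bag_of_words(product_names)
print(f"texts: {len(rows)}, unique terms: {len(vocab)}")
print(f"density of the binary matrix: {density(vocab, rows):.2f}")
```

In this toy example every text introduces new terms, so the number of columns grows with the corpus while each row keeps only three or four non-zero entries. On a real corpus the vocabulary is far larger and the density correspondingly much lower.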

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1901.09108/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/1901.09108/full.md

## References

19 references — full list in the complete paper: https://tomesphere.com/paper/1901.09108/full.md

---
Source: https://tomesphere.com/paper/1901.09108