Efficient Sparse Spherical k-Means for Document Clustering

Johannes Knittel; Steffen Koch; Thomas Ertl

arXiv:2108.00895 · cs.LG · August 3, 2021

TL;DR

This paper introduces an optimized indexing method for spherical k-Means clustering that leverages data sparsity and convergence properties to enhance scalability for large cluster counts in document collections.

Contribution

It presents a novel indexing structure that significantly reduces comparison operations in spherical k-Means, improving its efficiency for large-scale document clustering.
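To illustrate how an index can cut down on comparisons, here is a minimal sketch of one possible structure: an inverted index from term to the centroids that contain it. This is not the authors' data structure, which the summary does not describe in detail. The function names (`normalize`, `build_centroid_index`, `assign`), the dict-based sparse representation, and the assumption of non-negative weights (as with tf-idf) are all hypothetical choices made for this example.

```python
import math
from collections import defaultdict

def normalize(vec):
    """Scale a sparse vector (dict: term index -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}

def build_centroid_index(centroids):
    """Map each term index to the (cluster id, weight) pairs that contain it."""
    index = defaultdict(list)
    for cid, centroid in enumerate(centroids):
        for term, weight in centroid.items():
            index[term].append((cid, weight))
    return index

def assign(doc, index):
    """Return (cluster id, cosine similarity) for a unit-length sparse document.

    Only centroids that share at least one term with the document are scored.
    With non-negative weights (an assumption here), all other centroids have
    similarity 0 and cannot win unless no centroid overlaps at all.
    """
    scores = defaultdict(float)
    for term, w in doc.items():
        for cid, cw in index.get(term, ()):
            scores[cid] += w * cw  # accumulate the dot product term by term
    if not scores:
        return 0, 0.0  # arbitrary fallback when nothing overlaps
    best = max(scores, key=scores.get)
    return best, scores[best]

# Usage: two unit-length centroids over a three-term vocabulary.
centroids = [normalize({0: 1.0, 1: 1.0}), normalize({2: 2.0})]
index = build_centroid_index(centroids)
cluster_id, similarity = assign(normalize({2: 3.0}), index)
```

In this sketch the work per document depends on the number of non-zero terms and the length of their posting lists rather than on k directly. The paper's approach additionally exploits the convergence behavior of k-Means, which this example does not attempt to model.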

Findings

01 Reduces the number of comparisons per iteration

02 Improves scalability with respect to the number of clusters

03 Maintains clustering quality while increasing efficiency

Abstract

Spherical k-Means is frequently used to cluster document collections because it performs reasonably well in many settings and is computationally efficient. However, the time complexity increases linearly with the number of clusters k, which limits the suitability of the algorithm for larger values of k depending on the size of the collection. Optimizations targeted at the Euclidean k-Means algorithm largely do not apply because the cosine distance is not a metric. We therefore propose an efficient indexing structure to improve the scalability of Spherical k-Means with respect to k. Our approach exploits the sparsity of the input vectors and the convergence behavior of k-Means to reduce the number of comparisons on each iteration significantly.
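The statement that the cosine distance is not a metric can be checked with a small counterexample. The sketch below assumes the common definition of cosine distance as 1 minus cosine similarity. The three vectors are chosen for illustration only and do not come from the paper. Accelerated Euclidean k-Means variants typically prune comparisons using the triangle inequality, which fails here.

```python
import math

def cosine_distance(u, v):
    """Cosine distance, defined here as 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

# b lies at 45 degrees between the orthogonal vectors a and c.
a, b, c = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)

direct = cosine_distance(a, c)                          # 1.0
detour = cosine_distance(a, b) + cosine_distance(b, c)  # 2 - sqrt(2), about 0.586

# A metric would require direct <= detour; here the direct distance is larger.
violates_triangle_inequality = direct > detour
```

Because going through `b` is "shorter" than going from `a` to `c` directly, bounds derived from the triangle inequality would be invalid for this distance.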

Code & Models

Repositories

johpro/esp-kmeans (official)
