Accelerating spherical K-means clustering for large-scale sparse document data
Kazuo Aoyama, Kazumi Saito

TL;DR
This paper introduces an accelerated spherical K-means clustering algorithm optimized for large-scale, high-dimensional sparse document data. It gains its speed from an architecture-aware design and purpose-built data structures.
Contribution
The paper proposes a novel architecture-friendly algorithm that leverages data-object and cluster characteristics to reduce computational costs and improve clustering speed on large sparse datasets.
Findings
Achieves superior speed performance compared to state-of-the-art methods.
Effectively reduces instruction count, branch mispredictions, and cache misses.
Demonstrates efficiency on large-scale document clustering tasks.
Abstract
This paper presents an accelerated spherical K-means clustering algorithm for large-scale and high-dimensional sparse document data sets. We design an algorithm that works in an architecture-friendly manner (AFM), that is, a procedure that suppresses performance-degradation factors such as the numbers of instructions, branch mispredictions, and cache misses in the CPUs of a modern computer system. For the AFM operation, we leverage unique universal characteristics (UCs) of a data-object and a cluster's mean set, which are skewed distributions on data relationships such as Zipf's law and a feature-value concentration phenomenon. The UCs indicate that most of the multiplications for similarity calculations are executed on terms with high document frequencies (df) and that most of the similarity between an object- and a mean-feature vector is obtained by the…
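For orientation, here is a minimal sketch of plain (unaccelerated) spherical K-means on sparse vectors. It is not the authors' AFM algorithm; it only illustrates the baseline that the abstract sets out to speed up, with similarities accumulated term by term through an inverted index over the means. All function names, the dict-based sparse representation, and the multiplication counter are assumptions made for illustration.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector (term -> weight) to unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}


def build_mean_index(means):
    """Inverted index over the means: term -> list of (cluster id, weight)."""
    index = defaultdict(list)
    for k, mean in enumerate(means):
        for term, weight in mean.items():
            index[term].append((k, weight))
    return index


def assign(docs, means):
    """Assign each unit-norm document to its most similar mean (cosine).

    Also returns the number of multiplications performed, the cost that
    the abstract says is dominated by high-df terms.
    """
    index = build_mean_index(means)
    labels, multiplications = [], 0
    for doc in docs:
        sims = [0.0] * len(means)
        for term, w in doc.items():
            for k, mean_w in index.get(term, ()):
                sims[k] += w * mean_w
                multiplications += 1
        labels.append(max(range(len(means)), key=sims.__getitem__))
    return labels, multiplications


def update(docs, labels, n_clusters):
    """Recompute each mean as the normalized sum of its member documents."""
    sums = [defaultdict(float) for _ in range(n_clusters)]
    for doc, k in zip(docs, labels):
        for term, w in doc.items():
            sums[k][term] += w
    return [normalize(s) for s in sums]


def spherical_kmeans(docs, initial_means, max_iter=20):
    """Alternate assignment and update steps until the labels stop changing."""
    docs = [normalize(d) for d in docs]
    means = [normalize(m) for m in initial_means]
    labels = None
    for _ in range(max_iter):
        new_labels, _ = assign(docs, means)
        if new_labels == labels:
            break
        labels = new_labels
        new_means = update(docs, labels, len(means))
        # Keep the previous mean if a cluster lost all of its members.
        means = [nm if nm else m for nm, m in zip(new_means, means)]
    return labels, means
```

In this sketch, every nonzero term of a document touches every mean that contains that term, so frequent terms account for a large share of the multiplications. That is the cost profile the abstract's UCs describe.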
Taxonomy
Topics: Advanced Clustering Algorithms Research · Data Management and Algorithms · Bayesian Methods and Mixture Models
Methods: k-Means Clustering · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
