Structured Inverted-File k-Means Clustering for High-Dimensional Sparse Data
Kazuo Aoyama, Kazumi Saito

TL;DR
This paper introduces SIVF, an architecture-friendly k-means clustering algorithm for large-scale, high-dimensional sparse data. It reduces the number of similarity calculations and runs faster with lower memory consumption than existing algorithms on real large-scale document data sets.
Contribution
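SIVF combines an invariant centroid-pair based filter (ICP), which reduces the number of similarity calculations between each data object and the centroids, with an inverted file over the centroid set that is structured to reduce pipeline hazards on the underlying computer architecture.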
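To make the inverted-file idea concrete, here is a minimal Python sketch of an assignment step that indexes the centroid set by term. It is an illustration under stated assumptions, not the authors' implementation: the function names (build_inverted_file, assign), the dict-based sparse vectors, and the use of a plain dot product as the similarity are all hypothetical choices, and the sketch includes neither the ICP filter nor the structured memory layout that SIVF uses to reduce pipeline hazards.

```python
from collections import defaultdict


def build_inverted_file(centroids):
    """Index sparse centroids by term.

    centroids: list of dicts mapping term_id -> weight (assumed representation).
    Returns a mapping term_id -> list of (centroid_id, weight) postings.
    """
    inverted = defaultdict(list)
    for cid, centroid in enumerate(centroids):
        for term, weight in centroid.items():
            if weight != 0.0:
                inverted[term].append((cid, weight))
    return inverted


def assign(obj, inverted, k):
    """Return (best_centroid_id, similarities) for one sparse data object.

    Only the postings of terms that occur in the object are visited, so the
    cost depends on the object's nonzero terms rather than on the full
    dimensionality of the data.
    """
    sims = [0.0] * k
    for term, value in obj.items():
        for cid, weight in inverted.get(term, ()):
            sims[cid] += value * weight  # accumulate partial dot products
    best = max(range(k), key=sims.__getitem__)
    return best, sims


# Toy example with two centroids over three terms (made-up values).
centroids = [{0: 1.0}, {1: 0.6, 2: 0.8}]
inverted = build_inverted_file(centroids)
best, sims = assign({1: 1.0}, inverted, k=len(centroids))
```

In this toy example the object shares a term only with the second centroid, so the first centroid's similarity stays at zero without its vector ever being touched. A filter such as ICP would additionally skip some of the centroids inside the inner loop.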
Findings
SIVF outperforms existing algorithms in speed and memory efficiency.
SIVF reduces cache misses and branch mispredictions, improving performance.
Experimental results on large-scale document datasets validate SIVF's effectiveness.
Abstract
This paper presents an architecture-friendly k-means clustering algorithm called SIVF for large-scale, high-dimensional sparse data sets. The time efficiency of an algorithm is often measured by the number of costly operations such as similarity calculations. In practice, however, it depends greatly on how well the algorithm adapts to the architecture of the computer system on which it is executed. Our proposed SIVF employs an invariant centroid-pair based filter (ICP) to decrease the number of similarity calculations between a data object and the centroids of all the clusters. To maximize the ICP performance, SIVF uses an inverted file for the centroid set that is structured so as to reduce pipeline hazards. We demonstrate in our experiments on real large-scale document data sets that SIVF operates at higher speed and with lower memory consumption than existing algorithms. Our performance analysis…
Taxonomy
Topics: Advanced Clustering Algorithms Research · Caching and Content Delivery · Network Security and Intrusion Detection
Methods: k-Means Clustering
