Accelerating spherical K-means clustering for large-scale sparse document data

Kazuo Aoyama; Kazumi Saito

arXiv:2411.11300·stat.ML·November 19, 2024

Open Access

TL;DR

This paper introduces an accelerated spherical K-means clustering algorithm for large-scale, high-dimensional sparse document data. It improves speed through an architecture-aware design and data structures that exploit skewed data characteristics.

Contribution

The paper proposes a novel architecture-friendly algorithm that leverages data-object and cluster characteristics to reduce computational costs and improve clustering speed on large sparse datasets.

Findings

01 Achieves superior speed performance compared to state-of-the-art methods.

02 Effectively reduces instruction count, branch mispredictions, and cache misses.

03 Demonstrates efficiency on large-scale document clustering tasks.

Abstract

This paper presents an accelerated spherical K-means clustering algorithm for large-scale, high-dimensional sparse document data sets. We design an algorithm that works in an architecture-friendly manner (AFM), that is, a procedure that suppresses performance-degradation factors such as the numbers of instructions, branch mispredictions, and cache misses in the CPUs of a modern computer system. For the AFM operation, we leverage unique universal characteristics (UCs) of a data-object and a cluster's mean set, which are skewed distributions on data relationships such as Zipf's law and a feature-value concentration phenomenon. The UCs indicate that most of the multiplications for similarity calculations are executed on terms with high document frequencies (df) and that most of the similarity between an object-feature vector and a mean-feature vector is obtained by the…
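For orientation, the baseline that such an algorithm accelerates can be sketched as follows. This is a minimal Lloyd-type spherical K-means on sparse unit vectors, not the AFM algorithm described in the abstract. The function names, the dict-based sparse representation (term id mapped to weight), and the seeding with the first k documents are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector (dict: term_id -> weight) to unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)


def dot(obj, mean):
    """Sparse dot product; one multiplication per term in the object.

    On unit vectors this equals the cosine similarity.
    """
    return sum(w * mean.get(t, 0.0) for t, w in obj.items())


def spherical_kmeans(docs, k, max_iter=100):
    """Baseline spherical K-means (illustrative, unoptimized).

    docs: list of sparse vectors (dict: term_id -> weight).
    Returns (assignments, unit-norm mean vectors).
    """
    docs = [normalize(d) for d in docs]
    # Assumption: naive seeding with the first k documents.
    means = [dict(d) for d in docs[:k]]
    assign = [-1] * len(docs)
    for _ in range(max_iter):
        # Assignment step: each object goes to its most similar mean.
        new_assign = [
            max(range(k), key=lambda j: dot(d, means[j])) for d in docs
        ]
        if new_assign == assign:
            break  # converged: no object changed cluster
        assign = new_assign
        # Update step: sum the members, then project back to the unit sphere.
        sums = [defaultdict(float) for _ in range(k)]
        for d, j in zip(docs, assign):
            for t, w in d.items():
                sums[j][t] += w
        # An empty cluster keeps its previous mean.
        means = [normalize(s) if s else means[j] for j, s in enumerate(sums)]
    return assign, means
```

In this sketch the assignment step dominates the cost: every object is compared with every mean, and each comparison performs one multiplication per term in the object. That similarity calculation is the part the abstract targets.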

Taxonomy

Topics: Advanced Clustering Algorithms Research · Data Management and Algorithms · Bayesian Methods and Mixture Models

Methods: k-Means Clustering · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings