Parallelization of the K-Means Algorithm with Applications to Big Data Clustering

Ashish Srivastava; Mohammed Nawfal

arXiv:2405.12052·cs.DC·May 21, 2024

Open Access

TL;DR

This paper explores parallelizing the K-Means clustering algorithm using OpenMP and OpenACC to improve performance on big data, comparing synchronization and GPU-based approaches.

Contribution

It introduces and compares two parallelization strategies for K-Means, highlighting their efficiency and scalability on large datasets.

Findings

01

OpenMP-based parallelization with explicit synchronization improves speed.

02

GPU-based OpenACC approach offers significant time reduction.

03

Performance varies with data size and number of processes.

Abstract

K-Means clustering using Lloyd's algorithm is an iterative approach that partitions a given dataset into K clusters. The algorithm assigns each point to a cluster so as to minimize the objective function \[\min \sum_{i=1}^{n} \|x_i - \mu_{x_i}\|^2\] where \(\mu_{x_i}\) is the centroid of the cluster to which \(x_i\) is assigned. The serial algorithm iterates over steps in which we compute the distance of each data point from the centroids and assign the data point to the nearest centroid. This approach is essentially known as the expectation-maximization step. Clustering requires extensive distance computations at each iteration, and this cost grows with the number of data points. This provides scope for parallelism. However, we must ensure that in a parallel process, each thread has access to the updated centroid values and that no race condition exists on any centroid value. We will compare two different approaches in this…
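
The race-condition concern in the abstract can be illustrated with a minimal sketch. It is not the authors' OpenMP or OpenACC code; it only mirrors one common structure, assuming each worker reads the shared centroids, accumulates private partial sums over its own chunk of points, and a single thread merges those sums before the centroids change. The names `assign_chunk`, `kmeans_parallel`, and `n_workers` are hypothetical, and under CPython's global interpreter lock the threads show the synchronization pattern rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_chunk(chunk, centroids):
    """Assign every point in one chunk to its nearest centroid.

    Returns the labels plus per-cluster sums and counts that are private
    to this call, so no worker ever writes to shared centroid state.
    """
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    labels = []
    for point in chunk:
        nearest = min(range(k), key=lambda j: squared_distance(point, centroids[j]))
        labels.append(nearest)
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += point[d]
    return labels, sums, counts


def kmeans_parallel(points, init_centroids, n_workers=4, max_iter=100):
    """Lloyd's algorithm with a chunked, race-free assignment step (sketch)."""
    centroids = [list(c) for c in init_centroids]
    k, dim = len(centroids), len(centroids[0])
    size = max(1, -(-len(points) // n_workers))  # ceiling division
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    labels = []
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(max_iter):
            # Workers only read `centroids`; collecting every result acts
            # as the barrier before any centroid is modified.
            results = list(pool.map(assign_chunk, chunks, [centroids] * len(chunks)))
            new_labels = [label for part in results for label in part[0]]
            # Single-threaded reduction of the private partial sums.
            sums = [[0.0] * dim for _ in range(k)]
            counts = [0] * k
            for _, part_sums, part_counts in results:
                for j in range(k):
                    counts[j] += part_counts[j]
                    for d in range(dim):
                        sums[j][d] += part_sums[j][d]
            # An empty cluster keeps its previous centroid (an assumption).
            centroids = [
                [s / counts[j] for s in sums[j]] if counts[j] else centroids[j]
                for j in range(k)
            ]
            if new_labels == labels:  # assignments stable: converged
                break
            labels = new_labels
    return centroids, labels


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, labels = kmeans_parallel(points, [(0.0, 0.0), (10.0, 10.0)], n_workers=2)
# labels == [0, 0, 1, 1]; centroids == [[0.0, 0.5], [10.0, 10.5]]
```

Keeping the partial sums private and merging them once per iteration plays the role that a reduction or a critical section would play in a shared-memory implementation: every worker sees the same centroid values during an iteration, and the update happens only after all assignments are complete.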

Taxonomy

Topics: Advanced Clustering Algorithms Research · Face and Expression Recognition

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · k-Means Clustering