Parallelization of the K-Means Algorithm with Applications to Big Data Clustering
Ashish Srivastava, Mohammed Nawfal

TL;DR
This paper explores parallelizing the K-Means clustering algorithm using OpenMP and OpenACC to improve performance on big data, comparing synchronization and GPU-based approaches.
Contribution
It introduces and compares two parallelization strategies for K-Means, highlighting their efficiency and scalability on large datasets.
Findings
OpenMP parallelization with thread synchronization improves speed over the serial algorithm.
The GPU-based OpenACC approach offers a significant reduction in run time.
Performance varies with data size and number of processes.
Abstract
K-Means clustering using Lloyd's algorithm is an iterative approach to partitioning a given dataset into K clusters. The algorithm assigns each point to a cluster based on the following objective function \[\min \sum_{i=1}^{n} \|x_i - \mu_{x_i}\|^2\] where \(\mu_{x_i}\) is the centroid of the cluster to which \(x_i\) is assigned. The serial algorithm involves iterative steps in which we compute the distance of each data point from the centroids and assign the data point to the nearest centroid. This approach is essentially an expectation-maximization step. Clustering involves extensive distance computations at each iteration, and their cost grows with the number of data points. This provides scope for parallelism. However, we must ensure that in a parallel process, each thread has access to the updated centroid values and that no race condition exists on any centroid value. We will compare two different approaches in this…
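The race-condition concern in the abstract can be illustrated with a minimal sketch. It is not the authors' OpenMP or OpenACC code, which this page does not show. It assumes one common pattern: each worker accumulates into its own private buffers, and the centroids are rewritten only after all workers have finished. The function names (`nearest`, `partial_sums`, `lloyd_step`) and the `workers` parameter are hypothetical, and Python threads are used only to show the structure, not to obtain a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    best, best_d = 0, float("inf")
    for k, c in enumerate(centroids):
        d = sum((p - q) ** 2 for p, q in zip(point, c))
        if d < best_d:
            best, best_d = k, d
    return best


def partial_sums(chunk, centroids):
    """Assignment step on one chunk, accumulating into worker-private buffers."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in chunk:
        k = nearest(point, centroids)  # centroids are only read here
        counts[k] += 1
        for j, value in enumerate(point):
            sums[k][j] += value
    return sums, counts


def lloyd_step(points, centroids, workers=2):
    """One Lloyd iteration: parallel assignment, then a single merged update."""
    size = max(1, -(-len(points) // workers))  # ceiling division, at least 1
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list(...) returns only once every chunk is processed, so it acts as
        # the barrier before any centroid is rewritten.
        results = list(pool.map(lambda ch: partial_sums(ch, centroids), chunks))
    dim = len(centroids[0])
    new_centroids = []
    for k, old in enumerate(centroids):
        count = sum(r[1][k] for r in results)
        if count == 0:
            new_centroids.append(list(old))  # keep an empty cluster's centroid
            continue
        new_centroids.append(
            [sum(r[0][k][j] for r in results) / count for j in range(dim)]
        )
    return new_centroids
```

Because no worker writes to shared centroid storage, the result does not depend on how the points are split across workers. In OpenMP the same idea is usually expressed with thread-private accumulators or a reduction, though whether the paper does so is not stated here.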
Taxonomy
Topics: Advanced Clustering Algorithms Research · Face and Expression Recognition
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · k-Means Clustering
