A parallel sampling based clustering
Aditya AV Sastry, Kalyan Netti

TL;DR
This paper proposes parallel and serial subclustering schemes to improve the efficiency of data clustering by dividing datasets into smaller parts, enabling faster processing with minimal accuracy loss.
Contribution
Introduces two novel subclustering schemes that enable parallel processing, significantly reducing clustering time while maintaining accuracy.
Findings
Parallel execution of the subclustering schemes is much faster than running the original clustering algorithm on the full dataset.
The error introduced by subclustering is very low.
Both schemes (parallel and serial) run faster than the original algorithms.
Abstract
Automatically clustering data is an age-old problem, and numerous algorithms have been created to tackle it. The execution time of any of these algorithms grows with the number of input points and the number of cluster centers required. To reduce the number of input points, we could average the points locally and use the means, or local centers, as the input for clustering. However, since the required number of local centers is very high, running the clustering algorithm on the entire dataset to obtain these representative points is very time-consuming. To remedy this problem, in this paper we propose two subclustering schemes in which we subdivide the dataset into smaller sets and run the clustering algorithm on each smaller set to obtain the required number of data points to run our clustering algorithm with. As we are subdividing the given dataset, we…
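The idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`kmeans`, `subcluster`), the parameters (`n_parts`, `local_k`, `final_k`), the random split, and the choice of plain Lloyd's k-means as the underlying clustering algorithm are all assumptions made for illustration. It splits the data into parts, finds local centers for each part concurrently, and then clusters the pooled local centers.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length tuples; returns k centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
            )
            groups[nearest].append(p)
        # Recompute each center as the mean of its group;
        # keep the old center if the group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers


def subcluster(points, n_parts, local_k, final_k, seed=0):
    """Hypothetical subclustering scheme: local centers per part, then a final pass."""
    shuffled = points[:]
    random.Random(seed).shuffle(shuffled)
    # Split the shuffled data into n_parts roughly equal subsets (assumed split rule).
    parts = [shuffled[i::n_parts] for i in range(n_parts)]
    # Threads keep the sketch short; a process pool would be needed for a real
    # CPU-bound speedup in CPython.
    with ThreadPoolExecutor() as pool:
        local = list(pool.map(lambda part: kmeans(part, local_k, seed=seed), parts))
    # Pool all local centers and cluster these representatives instead of the raw data.
    representatives = [c for part_centers in local for c in part_centers]
    return kmeans(representatives, final_k, seed=seed)


# Usage: two synthetic 2-D blobs, reduced to 4 * 10 = 40 representatives
# before the final clustering into 2 centers.
rng = random.Random(1)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
data += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(500)]
centers = subcluster(data, n_parts=4, local_k=10, final_k=2)
print(centers)
```

In this sketch the final clustering sees only `n_parts * local_k` points rather than the full dataset, which is where the time saving would come from. The local passes are independent of one another, so they can run in parallel.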
Taxonomy
Topics: Advanced Clustering Algorithms Research · Data Management and Algorithms · Text and Document Classification Technologies
