Efficient Big Text Data Clustering Algorithms using Hadoop and Spark
Sergios Gerakidis, Sofia Megarchioti, Basilis Mamalis

TL;DR
This paper introduces scalable document clustering algorithms optimized for big data using Hadoop and Spark, combining variations of K-Means and hierarchical methods to improve efficiency and maintain quality.
Contribution
It presents two novel scalable clustering approaches tailored for large-scale text data, integrating hierarchical sampling with K-Means for faster processing.
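The general idea of pairing a hierarchical pass over a sample with K-Means can be sketched as follows. This is an illustrative sketch only, not the paper's BKC variant or its hybrid algorithm, and it runs on a single machine rather than on Hadoop or Spark. It assumes term-frequency vectors, cosine similarity, and centroid-based merging; the names `tf_vector`, `hac_seeds`, `kmeans`, and `cluster` are all hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def tf_vector(text):
    """L2-normalised term-frequency vector stored as a dict (assumed representation)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {term: c / norm for term, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def centroid(vectors):
    """Normalised sum of a group of sparse vectors."""
    total = defaultdict(float)
    for v in vectors:
        for term, w in v.items():
            total[term] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {term: w / norm for term, w in total.items()}


def hac_seeds(sample, k):
    """Agglomerative pass over a small sample: repeatedly merge the two
    clusters whose centroids are most similar until k clusters remain."""
    clusters = [[v] for v in sample]
    while len(clusters) > k:
        cents = [centroid(c) for c in clusters]
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(cents[i], cents[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return [centroid(c) for c in clusters]


def kmeans(vectors, seeds, iterations=10):
    """K-Means over the full collection, started from the given seed centroids."""
    cents = list(seeds)
    labels = []
    for _ in range(iterations):
        labels = [
            max(range(len(cents)), key=lambda c: cosine(v, cents[c]))
            for v in vectors
        ]
        new_cents = []
        for c in range(len(cents)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            new_cents.append(centroid(members) if members else cents[c])
        if new_cents == cents:  # assignments have stabilised
            break
        cents = new_cents
    return labels


def cluster(docs, k, sample_size, seed=0):
    """Seed K-Means with centroids found hierarchically on a random sample."""
    vectors = [tf_vector(d) for d in docs]
    rng = random.Random(seed)
    sample = rng.sample(vectors, min(sample_size, len(vectors)))
    return kmeans(vectors, hac_seeds(sample, k))
```

The quadratic hierarchical step touches only the sample, so its cost is bounded by `sample_size`, while the full collection is processed only by the cheaper K-Means passes. In a distributed setting the assignment step is the natural candidate for parallelisation, but how the paper maps its algorithms onto Hadoop and Spark is not described in this summary.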
Findings
Both algorithms achieve significant time reductions compared to standard K-Means.
The methods maintain acceptable clustering quality on large datasets.
Experimental results validate the effectiveness of Hadoop and Spark implementations.
Abstract
Document clustering is a traditional, efficient, and quite effective text mining technique for gaining better insight into which documents of a collection could be grouped together. The K-Means algorithm and the Hierarchical Agglomerative Clustering (HAC) algorithm are two of the best-known and most commonly used clustering algorithms; the former for its low time cost and the latter for its accuracy. However, even K-Means can incur unacceptable time costs when clustering text over large-scale collections. In this paper we first address some of the most valuable approaches for document clustering over such 'big data' (large-scale) collections. We then present two very promising alternatives: (a) a variation of an existing K-Means-based fast clustering technique (known as BigKClustering - BKC) so that it can be applied in document clustering, and (b) a hybrid…
Taxonomy
Topics: Advanced Clustering Algorithms Research · Data Management and Algorithms · Data Mining Algorithms and Applications
