Efficient Big Text Data Clustering Algorithms using Hadoop and Spark

Sergios Gerakidis; Sofia Megarchioti; Basilis Mamalis

arXiv:2112.00200·cs.DC·December 2, 2021

Open Access

TL;DR

This paper introduces scalable document clustering algorithms optimized for big data using Hadoop and Spark, combining variations of K-Means and hierarchical methods to improve efficiency and maintain quality.

Contribution

It presents two novel scalable clustering approaches tailored for large-scale text data, integrating hierarchical sampling with K-Means for faster processing.

Findings

01. Both algorithms achieve significant time reductions compared to standard K-Means.

02. The methods maintain acceptable clustering quality on large datasets.

03. Experimental results validate the effectiveness of Hadoop and Spark implementations.

Abstract

Document clustering is a traditional, efficient, and effective text mining technique for gaining better insight into which documents of a collection can be grouped together. The K-Means algorithm and the Hierarchical Agglomerative Clustering (HAC) algorithm are two of the best-known and most commonly used clustering algorithms: the former for its low time cost and the latter for its accuracy. However, even K-Means can incur unacceptable time costs when clustering text over large-scale collections. In this paper we first discuss some of the most valuable approaches for document clustering over such 'big data' (large-scale) collections. We then present two very promising alternatives: (a) a variation of an existing K-Means-based fast clustering technique (known as BigKClustering - BKC) so that it can be applied in document clustering, and (b) a hybrid…
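
The trade-off between K-Means (cheap) and HAC (accurate but costly) can be illustrated with a small single-machine sketch. It runs HAC only on a random sample to obtain seed centroids, then runs K-Means over the full collection. This is a generic Buckshot-style seeding scheme, not the paper's BKC variation or its hybrid algorithm, whose details are not given on this page. All names here (`seeded_kmeans`, `hac_seeds`, `sample_size`) are hypothetical, documents are assumed to be already vectorized (for example as TF-IDF weights), and no Hadoop or Spark distribution is attempted.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def cosine(a, b):
    # Assumes both vectors are already unit-length.
    return sum(x * y for x, y in zip(a, b))

def centroid(vectors):
    """Unit-length mean of a non-empty list of vectors."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return normalize(mean)

def hac_seeds(sample, k):
    """Centroid-linkage agglomerative clustering on a small sample.

    Repeatedly merges the two clusters with the most similar centroids
    until k clusters remain, then returns their centroids. The cost grows
    quickly with the sample size, which is why only a sample is used.
    """
    clusters = [[v] for v in sample]
    while len(clusters) > k:
        cents = [centroid(c) for c in clusters]
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(cents[i], cents[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]

def kmeans(docs, seeds, iterations=10):
    """Plain K-Means with cosine similarity, starting from the given seeds."""
    cents = list(seeds)
    labels = [0] * len(docs)
    for _ in range(iterations):
        # Assignment step: each document goes to its most similar centroid.
        labels = [max(range(len(cents)), key=lambda c: cosine(d, cents[c]))
                  for d in docs]
        # Update step: recompute each centroid from its members.
        for c in range(len(cents)):
            members = [d for d, l in zip(docs, labels) if l == c]
            if members:
                cents[c] = centroid(members)
    return labels, cents

def seeded_kmeans(docs, k, sample_size, seed=0):
    """HAC on a random sample for seeds, then K-Means on all documents.

    Assumption: sample_size >= k, otherwise fewer than k seeds are produced.
    """
    docs = [normalize(d) for d in docs]
    rng = random.Random(seed)
    sample = rng.sample(docs, min(sample_size, len(docs)))
    return kmeans(docs, hac_seeds(sample, k))

# Toy usage: two obvious groups of 2-dimensional "document vectors".
docs = [[1, 0], [0.9, 0.1], [1, 0.05], [0, 1], [0.1, 0.9], [0.05, 1]]
labels, cents = seeded_kmeans(docs, k=2, sample_size=4)
```

In this sketch the expensive pairwise step touches only `sample_size` documents, while the full collection is processed only by the K-Means passes, which is the general reason sampling-based seeding can reduce total clustering time.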

Taxonomy

Topics: Advanced Clustering Algorithms Research · Data Management and Algorithms · Data Mining Algorithms and Applications