Ontology Based Document Clustering Using MapReduce
Abdelrahman Elsayed, Hoda M. O. Mokhtar, Osama Ismail

TL;DR
This paper presents a distributed MapReduce implementation of bisecting k-means for large-scale document clustering, and integrates the WordNet ontology to exploit semantic relations between words and improve clustering quality.
Contribution
It introduces a novel distributed clustering method combining MapReduce with ontology-based semantic enhancement, reducing feature dimensionality and improving clustering results.
Findings
Semantic integration improves clustering quality.
Feature dimensionality is significantly reduced.
Distributed implementation scales to large datasets.
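To illustrate why replacing words with coarser semantic categories can shrink the feature space, here is a minimal sketch. The lookup table `TOY_CATEGORIES` and the helper `to_category_features` are hypothetical, hand-written stand-ins for a real WordNet lookup; the category labels follow WordNet's lexical category naming style, but nothing here is taken from the paper's implementation.

```python
from collections import Counter

# Hypothetical hand-written noun -> lexical-category table.
# This is a toy stand-in for a WordNet lookup, not the paper's mapping.
TOY_CATEGORIES = {
    "dog": "noun.animal",
    "cat": "noun.animal",
    "horse": "noun.animal",
    "car": "noun.artifact",
    "truck": "noun.artifact",
    "bridge": "noun.artifact",
}


def to_category_features(tokens):
    """Count lexical categories instead of raw words; unknown tokens are dropped."""
    return Counter(TOY_CATEGORIES[t] for t in tokens if t in TOY_CATEGORIES)


docs = [
    ["dog", "cat", "car"],
    ["horse", "truck", "bridge", "the"],
]

# Vocabulary of the plain bag-of-words representation.
word_vocab = {t for d in docs for t in d}
# Vocabulary after mapping nouns to lexical categories.
cat_vocab = {c for d in docs for c in to_category_features(d)}

print(len(word_vocab), "word features ->", len(cat_vocab), "category features")
```

In this toy corpus, several distinct nouns collapse onto the same category, so documents about related things share features even when they share no words.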
Abstract
Document clustering is now considered a data-intensive task because of the rapid growth in the number of available documents. Moreover, the feature sets that represent those documents are also very large. The most common method for representing documents is the vector space model, which represents document features as a bag of words and does not capture semantic relations between words. In this paper we introduce a distributed implementation of bisecting k-means using the MapReduce programming model. The aim of our proposed implementation is to solve the problem of clustering data-intensive document collections. In addition, we propose integrating the WordNet ontology with bisecting k-means in order to use the semantic relations between words to enhance document clustering results. Our presented experimental results show that using lexical categories for nouns only…
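The general idea of expressing bisecting k-means as map and reduce steps can be sketched as follows. This is a single-process illustration under stated assumptions, not the authors' implementation: the function names (`map_assign`, `reduce_centroids`, `two_means`, `bisecting_kmeans`), the squared Euclidean distance, the deterministic seeding, and the "split the largest cluster" rule are all choices made for the example. In a real MapReduce job the map calls would run in parallel over document partitions.

```python
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance (an assumption; other measures are possible)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(point, centroids):
    """Map step: emit (index of nearest centroid, (point, 1))."""
    idx = min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))
    return idx, (point, 1)


def reduce_centroids(pairs, dim):
    """Reduce step: sum vectors and counts per key, then average into centroids."""
    sums = defaultdict(lambda: [0.0] * dim)
    counts = defaultdict(int)
    for idx, (point, n) in pairs:
        sums[idx] = [s + x for s, x in zip(sums[idx], point)]
        counts[idx] += n
    return {idx: [s / counts[idx] for s in sums[idx]] for idx in sums}


def two_means(points, iters=10):
    """Split one cluster in two with repeated map/reduce rounds."""
    first = points[0]
    # Deterministic seeding for the example: first point and the point farthest from it.
    far = max(points, key=lambda p: sq_dist(p, first))
    centroids = [list(first), list(far)]
    for _ in range(iters):
        pairs = [map_assign(p, centroids) for p in points]
        new = reduce_centroids(pairs, len(first))
        # Keep the old centroid if a side received no points this round.
        centroids = [new.get(i, centroids[i]) for i in range(2)]
    halves = [[], []]
    for p in points:
        halves[map_assign(p, centroids)[0]].append(p)
    return halves


def bisecting_kmeans(points, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        if len(largest) < 2:
            clusters.append(largest)
            break
        left, right = two_means(largest)
        if not left or not right:  # cannot split further (e.g. identical points)
            clusters.append(largest)
            break
        clusters += [left, right]
    return clusters


points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
clusters = bisecting_kmeans(points, 3)
```

Emitting `(point, 1)` pairs lets the reduce step compute each centroid from partial sums and counts, which is what makes the update easy to distribute.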
Taxonomy
Topics: Advanced Clustering Algorithms Research · Data Mining Algorithms and Applications · Text and Document Classification Technologies
