An effective web document clustering for information retrieval

R.K. Roul; S.K. Sahay

arXiv:1211.1107 · cs.IR · November 7, 2012 · 2 cites

Open Access

TL;DR

This paper presents a combined web document clustering method using Frequent Pattern growth and Fuzzy C-Means to improve clustering accuracy and efficiency for information retrieval from large web datasets.

Contribution

It introduces a novel hybrid approach that enhances traditional clustering by integrating frequent pattern mining with fuzzy clustering, addressing initial centroid sensitivity.

Findings

01 Outperforms traditional clustering methods in efficiency

02 Handles initial centroid sensitivity better

03 More effective for large web datasets

Abstract

The size of the web has increased exponentially over the past few years, with thousands of documents related to a subject available to the user. With this much information available, it is not possible to take full advantage of the World Wide Web without a proper framework for searching through the available data. This requisite organization can be done in many ways. In this paper we introduce a combined approach to clustering web pages, which first finds the frequent sets and then clusters the documents. These frequent sets are generated using the Frequent Pattern growth technique. Then, by applying the Fuzzy C-Means algorithm to them, we found clusters containing documents which are highly related and have similar features. We used the Gensim package to implement our approach because of its simplicity and robust nature. We have compared our results with the combined approach of (Frequent…
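The two-stage idea in the abstract (mine frequent term sets, then cluster softly) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: it does not use Gensim, it swaps in a brute-force frequent term-set count as a stand-in for Frequent Pattern growth (the same result on tiny inputs, but without the FP-tree), and it assumes documents are encoded as binary vectors over the frequent sets they contain. The names `frequent_itemsets`, `fuzzy_c_means`, and `cluster_documents`, and the parameters `min_support`, `max_size`, `m`, and `seed`, are hypothetical choices made for this illustration.

```python
import random
from collections import Counter
from itertools import combinations


def frequent_itemsets(docs, min_support, max_size=2):
    """Count term sets up to max_size and keep those seen in >= min_support docs.

    Brute-force stand-in for FP-growth (assumption: inputs are small).
    """
    counts = Counter()
    for terms in docs:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(terms), size):
                counts[combo] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Standard Fuzzy C-Means with fuzzifier m; returns (centers, memberships)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(iters):
        # Centers are means of the points weighted by membership ** m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append(
                [sum(wi * p[d] for wi, p in zip(w, points)) / total
                 for d in range(dim)]
            )
        # Memberships follow from relative distances to every center.
        new_u = []
        for p in points:
            dists = [
                max(sum((a - b) ** 2 for a, b in zip(p, ctr)) ** 0.5, 1e-12)
                for ctr in centers
            ]
            new_u.append(
                [1.0 / sum((dj / dk) ** (2.0 / (m - 1.0)) for dk in dists)
                 for dj in dists]
            )
        shift = max(abs(a - b) for ra, rb in zip(u, new_u) for a, b in zip(ra, rb))
        u = new_u
        if shift < tol:
            break
    return centers, u


def cluster_documents(docs, min_support, n_clusters):
    """Encode each document by the frequent sets it contains, then run FCM."""
    itemsets = sorted(frequent_itemsets(docs, min_support))
    vectors = [[1.0 if set(s) <= doc else 0.0 for s in itemsets] for doc in docs]
    centers, memberships = fuzzy_c_means(vectors, n_clusters)
    return itemsets, centers, memberships


# Toy corpus: each document is a set of terms (made-up data for illustration).
toy_docs = [
    {"python", "code", "library"},
    {"python", "code", "gensim"},
    {"code", "library", "python"},
    {"football", "goal", "match"},
    {"football", "match", "league"},
    {"goal", "match", "football"},
]
toy_itemsets, toy_centers, toy_memberships = cluster_documents(toy_docs, 3, 2)
for doc, row in zip(toy_docs, toy_memberships):
    print(sorted(doc), [round(v, 3) for v in row])
```

Each row of the membership matrix sums to 1, so a document can belong partly to several clusters. Restricting the feature space to frequent sets is one plausible way to reduce dimensionality before the fuzzy step; how the paper itself feeds frequent sets into Fuzzy C-Means is not specified in the text above.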

Taxonomy

Topics: Web Data Mining and Analysis · Advanced Clustering Algorithms Research · Text and Document Classification Technologies