Web Document Clustering and Ranking using Tf-Idf based Apriori Approach

R.K. Roul; O. R. Devanand; S.K. Sahay

arXiv:1406.5617 · cs.IR · June 24, 2014 · 20 cites

Open Access

TL;DR

This paper introduces a novel Tf-Idf based Apriori approach for clustering and ranking web documents, improving relevance and retrieval efficiency for large unstructured datasets.

Contribution

The paper proposes a new clustering and ranking method combining Tf-Idf with Apriori, tailored for web documents, and demonstrates its effectiveness on large datasets.

Findings

01

Better clustering results at higher minimum support

02

Achieved an F-measure of 78% in ranking accuracy

03

Outperforms traditional Apriori algorithm

Abstract

The dynamic web has grown exponentially over the past few years, and thousands of documents related to any given subject are now available to the user. Most web documents are unstructured and unorganized, so users find it difficult to locate relevant documents. A more useful and efficient mechanism combines clustering with ranking: clustering groups similar documents in one place, and ranking is applied to each cluster so that the top documents appear first. Besides the particular clustering algorithm, the term weighting function applied to the selected features to represent web documents is a main aspect of the clustering task. With this in mind, we propose a new mechanism called Tf-Idf based Apriori for clustering web documents. We then rank the documents in each cluster using Tf-Idf and similarity…
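
The abstract is truncated here and does not spell out how Tf-Idf and Apriori are combined, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that each document is reduced to its highest Tf-Idf terms, that Apriori mines term sets shared by at least a minimum number of documents, that each largest frequent term set defines a cluster, and that documents in a cluster are ranked by the summed Tf-Idf weight of the cluster's terms (the similarity component mentioned in the abstract is left out). The names `tf_idf`, `apriori`, `cluster_and_rank`, `top_k`, and `min_support` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    def count(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: s for c, s in counts.items() if s >= min_support}

    level = count({frozenset([i]) for t in transactions for i in t})
    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = count(candidates)
    return frequent


def cluster_and_rank(docs, top_k=3, min_support=2):
    """Assumed pipeline: Tf-Idf term selection, Apriori, then per-cluster ranking."""
    weights = tf_idf(docs)
    # Assumption: a document's "transaction" is its top_k terms by Tf-Idf
    # (ties broken alphabetically so the result is deterministic).
    transactions = [
        frozenset(sorted(w, key=lambda t: (-w[t], t))[:top_k]) for w in weights
    ]
    frequent = apriori(transactions, min_support)
    if not frequent:
        return {}
    # Assumption: only the largest frequent term sets act as cluster labels.
    max_len = max(len(s) for s in frequent)
    clusters = {}
    for itemset in (s for s in frequent if len(s) == max_len):
        members = [i for i, t in enumerate(transactions) if itemset <= t]
        # Rank members by summed Tf-Idf weight of the cluster's terms.
        members.sort(key=lambda i: -sum(weights[i][term] for term in itemset))
        clusters[itemset] = members
    return clusters


if __name__ == "__main__":
    sample = [
        ["web", "cluster", "page"],
        ["web", "cluster", "web", "cluster", "rank"],
        ["sport", "match", "team", "team"],
        ["sport", "match", "goal"],
    ]
    for terms, ranked in cluster_and_rank(sample).items():
        print(sorted(terms), ranked)
```

In this sketch a term that occurs in every document gets a Tf-Idf weight of zero, so the term selection step favors distinctive terms while `min_support` favors shared ones; the two parameters pull in opposite directions and would need tuning on real data.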

Taxonomy

Topics: Text and Document Classification Technologies · Web Data Mining and Analysis · Advanced Clustering Algorithms Research