PDD Crawler: A focused web crawler using link and content analysis for relevance prediction

Prashant Dahiwale; M M Raghuwanshi; Latesh Malik

arXiv:1411.4366·cs.IR·November 18, 2014

Open Access

TL;DR

This paper introduces PDD Crawler, a focused web crawler that combines link analysis and content analysis to predict page relevance more effectively, aiming to improve search efficiency.

Contribution

It presents a novel crawling strategy that integrates HTML tag content analysis with link-based methods to assess page relevance.

Findings

01. Enhanced relevance prediction accuracy

02. Effective content and link analysis integration

03. Potential for improved search engine performance

Abstract

The majority of computer and mobile phone enthusiasts use the web for searching. Web search engines handle these searches; the results a search engine returns are supplied to it by a software module known as a web crawler. The web is growing around the clock, and the principal problem is to search this huge database for specific information. Deciding whether a web page is relevant to a search topic is difficult. This paper proposes a crawler, called PDD Crawler, that follows both a link-based and a content-based approach. The crawler uses a new crawling strategy to compute the relevance of a page. It analyses the content of the page based on the information contained in various tags within the HTML source code and then computes the total weight of the page. The page with the highest weight thus has the maximum content…
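The tag-based weighting described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not list which tags are used or how they are weighted, so the `TAG_WEIGHTS` table, the `TagWeightParser` class, and the `page_weight` function are all hypothetical names and values chosen for illustration. The sketch counts topic-term occurrences inside each weighted tag and sums them, scaled by the tag's assumed weight.

```python
import re
from html.parser import HTMLParser

# Hypothetical per-tag weights; the abstract does not give the paper's values.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "a": 2.0, "p": 1.0}


class TagWeightParser(HTMLParser):
    """Accumulates a page weight from topic-term hits inside weighted tags."""

    def __init__(self, topic_terms):
        super().__init__()
        self.topic_terms = {t.lower() for t in topic_terms}
        self.open_tags = []  # stack of currently open weighted tags
        self.weight = 0.0

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to (and including) the matching open tag.
            while self.open_tags and self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if not self.open_tags:
            return  # text outside any weighted tag contributes nothing
        tag_weight = TAG_WEIGHTS[self.open_tags[-1]]
        words = re.findall(r"[a-z0-9]+", data.lower())
        hits = sum(1 for w in words if w in self.topic_terms)
        self.weight += tag_weight * hits


def page_weight(html, topic_terms):
    """Return the total tag-weighted score of one HTML page (assumed scheme)."""
    parser = TagWeightParser(topic_terms)
    parser.feed(html)
    parser.close()
    return parser.weight


def best_page(pages, topic_terms):
    """Return the URL whose page has the highest weight; pages maps URL to HTML."""
    return max(pages, key=lambda url: page_weight(pages[url], topic_terms))
```

Under this assumed scheme, a crawler would call `best_page` on the candidate pages it has fetched and follow the highest-scoring one first. The link-based half of the approach mentioned in the abstract is not modeled here.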

Taxonomy

Topics: Web visibility and informetrics · Web Data Mining and Analysis · Complex Network Analysis Techniques