Analysis and Evaluation of the Link and Content Based Focused Treasure-Crawler

Ali Seyfi

arXiv:1306.0054·cs.IR·October 2, 2015

TL;DR

This paper presents the design, implementation, and evaluation of the Treasure-Crawler, a focused web crawler that uses HTML analysis and a hierarchical T-Graph to prioritize links for efficient topic-specific web indexing.

Contribution

It introduces a novel approach combining HTML element analysis and a hierarchical T-Graph for improved focused crawling accuracy and efficiency.

Findings

01. Recall and precision close to 0.5 demonstrate balanced retrieval performance.

02. Hierarchical T-Graph effectively guides link prioritization.

03. Proposed method enhances focused crawling in large-scale web indexing.
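
For readers unfamiliar with the two metrics named in the findings, the snippet below is a minimal sketch of the standard information-retrieval definitions: precision is the fraction of retrieved pages that are relevant, and recall is the fraction of relevant pages that were retrieved. The function name `precision_recall` and the toy URL identifiers are assumptions made for illustration; they are not taken from the paper or its evaluation data.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of page identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # pages that are both retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data (hypothetical): 4 pages crawled, 4 pages truly on-topic, 2 in common.
crawled = ["u1", "u2", "u3", "u4"]
on_topic = ["u1", "u2", "u5", "u6"]
p, r = precision_recall(crawled, on_topic)
```

In this toy case both values come out to 0.5, the kind of balance the first finding describes.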

Abstract

Indexing the Web is becoming a laborious task for search engines as the Web grows exponentially in size and distribution. At present, the most effective known approach to this problem is the use of focused crawlers. A focused crawler applies a suitable algorithm to detect the pages on the Web that relate to its topic of interest. For this purpose, we propose a custom method that uses specific HTML elements of a page to predict the topical focus of every page reachable through an unvisited link on the current page. The pages recognized as on-topic are then sorted by their relevance to the crawler's main topic before they are actually downloaded. In the Treasure-Crawler, we use a hierarchical structure called the T-Graph, which serves as an exemplary guide for assigning an appropriate priority score to each unvisited link. These URLs will later be downloaded based on this…
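
The general idea described in the abstract (inspect HTML elements around each unvisited link, score the link, and download in priority order) can be sketched as follows. This is only an illustration under stated assumptions: the T-Graph itself is not reproduced here, and the `score` function is a hypothetical stand-in that measures keyword overlap between a link's anchor text and a set of topic terms. The names `LinkExtractor`, `score`, `enqueue_links`, and the example URLs are all invented for this sketch.

```python
import heapq
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects (absolute_url, anchor_text) pairs from <a href=...> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._text).strip()))
            self._href = None


def score(anchor_text, topic_terms):
    """Hypothetical stand-in for the T-Graph score: share of topic terms in the anchor text."""
    words = set(anchor_text.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def enqueue_links(html, base_url, topic_terms, frontier, seen):
    """Score every unvisited link on a page and push it onto the priority frontier."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    for url, text in parser.links:
        if url not in seen:
            seen.add(url)
            # heapq is a min-heap, so negate the score to pop the best link first.
            heapq.heappush(frontier, (-score(text, topic_terms), url))


page = """
<a href="/a">Weather report</a>
<a href="/b">focused crawler design</a>
<a href="/c">crawler news</a>
"""
frontier, seen = [], set()
enqueue_links(page, "http://example.org/", {"focused", "crawler", "web"}, frontier, seen)
order = [heapq.heappop(frontier)[1] for _ in range(len(frontier))]
```

Here `order` lists the links from most to least promising: `/b` first (two topic terms matched), then `/c` (one), then `/a` (none). A real focused crawler would replace `score` with its own relevance model and fetch each popped URL in turn.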
