Analysis and Evaluation of the Link and Content Based Focused Treasure-Crawler
Ali Seyfi

TL;DR
This paper presents the design, implementation, and evaluation of the Treasure-Crawler, a focused web crawler that uses HTML analysis and a hierarchical T-Graph to prioritize links for efficient topic-specific web indexing.
Contribution
It introduces a novel approach combining HTML element analysis and a hierarchical T-Graph for improved focused crawling accuracy and efficiency.
Findings
Recall and precision close to 0.5 demonstrate balanced retrieval performance.
Hierarchical T-Graph effectively guides link prioritization.
Proposed method enhances focused crawling in large-scale web indexing.
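For readers unfamiliar with the two metrics cited above, the following is a minimal sketch of how precision and recall are conventionally computed for a crawl. The function name `precision_recall` and the toy URL sets are illustrative assumptions, not taken from the paper, and the paper's own evaluation procedure may differ.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data: pages the crawler downloaded vs. pages that are truly on-topic.
crawled = ["a", "b", "c", "d"]
on_topic = ["a", "b", "e", "f"]
p, r = precision_recall(crawled, on_topic)
```

In this toy case two of the four crawled pages are on-topic and two of the four on-topic pages were crawled, so both values come out at 0.5.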
Abstract
Indexing the Web is becoming a laborious task for search engines as the Web grows exponentially in size and distribution. Presently, the most effective known approach to this problem is the use of focused crawlers. A focused crawler applies a suitable algorithm to detect the pages on the Web that relate to its topic of interest. For this purpose we propose a custom method that uses specific HTML elements of the current page to predict the topical focus of each page that an unvisited link within it points to. The pages recognized as on-topic are then sorted by their relevance to the crawler's main topic before they are actually downloaded. In the Treasure-Crawler, we use a hierarchical structure called the T-Graph, which serves as an exemplary guide for assigning an appropriate priority score to each unvisited link. These URLs will later be downloaded based on this…
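The general idea of scoring unvisited links from HTML elements and downloading them in priority order can be sketched as follows. This is a simplified illustration only: it scores each link by keyword overlap between its anchor text and a set of topic terms, which stands in for (and is not) the T-Graph scoring described in the abstract. The names `LinkExtractor`, `score_link`, `prioritize`, and the sample HTML are all hypothetical.

```python
import heapq
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects (href, anchor_text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a> element with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._text).strip()))
            self._href = None


def score_link(anchor_text, topic_terms):
    """Assumed stand-in score: fraction of topic terms found in the anchor text."""
    words = set(anchor_text.lower().split())
    return len(words & topic_terms) / len(topic_terms) if topic_terms else 0.0


def prioritize(html, topic_terms):
    """Return the page's links ordered from highest to lowest score."""
    parser = LinkExtractor()
    parser.feed(html)
    heap = []
    for order, (href, text) in enumerate(parser.links):
        # heapq is a min-heap, so negate the score; `order` breaks ties stably.
        heapq.heappush(heap, (-score_link(text, topic_terms), order, href))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


sample_html = (
    '<a href="/sports">football scores</a>'
    '<a href="/ml">machine learning crawler</a>'
    '<a href="/web">web crawler</a>'
)
topic = {"web", "crawler", "focused"}
ranked = prioritize(sample_html, topic)
```

Here `ranked` lists `/web` first, then `/ml`, then `/sports`, since their anchor texts match two, one, and zero topic terms respectively. A real focused crawler would draw on more page elements than anchor text and would replace `score_link` with its own relevance model.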
