A Focused Crawler Combinatory Link and Content Model Based on T-Graph Principles

Ali Seyfi

arXiv:1305.7265·cs.IR·October 2, 2015

TL;DR

This paper introduces Treasure-Crawler, a focused Web crawler that combines link and content analysis using a T-Graph to accurately identify and prioritize topic-specific web pages for efficient crawling.

Contribution

It presents a novel combined link-content approach and a T-Graph based scoring method for improved focused crawling accuracy and prioritization.

Findings

01 High accuracy in predicting topical relevance of unvisited pages

02 Effective prioritization of URLs using T-Graph scoring

03 Successful architectural validation through test results

Abstract

The two significant tasks of a focused Web crawler are finding relevant topic-specific documents on the Web and analytically prioritizing them for later effective and reliable download. For the first task, we propose a custom algorithm that fetches and analyzes the most effective HTML structural elements of the page, as well as the topical boundary and anchor text of each unvisited link; from these, the topical focus of an unvisited page can be predicted with high accuracy. Our method thus combines link-based and content-based approaches. For the second task, we propose a scoring function for the relevant URLs that uses the T-Graph (Treasure Graph) to help prioritize the unvisited links that will later be put into the fetching queue. Our Web search system is called the Treasure-Crawler. This research paper embodies the…
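The general idea of scoring unvisited links and feeding them into a prioritized fetching queue can be illustrated with a minimal sketch. It is not the paper's T-Graph scoring function, whose details are not given here; a simple term-overlap score stands in for it. All names (tokenize, overlap_score, link_score, Frontier) and the anchor_weight parameter are hypothetical, chosen only for illustration.

```python
import heapq
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(text, topic_terms):
    """Fraction of tokens in `text` that are topic terms (0.0 if empty)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in topic_terms)
    return hits / len(tokens)


def link_score(anchor_text, context_text, topic_terms, anchor_weight=0.6):
    """Weighted mix of anchor-text and surrounding-text relevance.

    Assumption: a stand-in for the paper's T-Graph based score, which
    is not reproduced here. The 0.6 weight is arbitrary.
    """
    anchor = overlap_score(anchor_text, topic_terms)
    context = overlap_score(context_text, topic_terms)
    return anchor_weight * anchor + (1.0 - anchor_weight) * context


class Frontier:
    """Fetching queue that returns the highest-scored unvisited URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: earlier pushes win on equal scores

    def push(self, url, score):
        """Queue a URL once; repeated pushes of the same URL are ignored."""
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the maximum.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        """Remove and return (url, score) for the best-scored URL."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

In this sketch, a crawler would call link_score for each link extracted from a downloaded page, push the result into a Frontier, and pop the next URL to fetch. Replacing link_score with a T-Graph based function would leave the queue logic unchanged.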
