Effective performance of information retrieval on web by using web crawling

Sk. AbdulNabi; P. Premchand

arXiv:1205.2891·cs.IR·May 15, 2012·1 cites

Open Access

TL;DR

This paper introduces the EPOW web crawler architecture designed to efficiently retrieve information from the rapidly expanding web by optimizing download speed and robustness through parallelization and data structure enhancements.

Contribution

The paper presents a novel web crawler architecture, EPOW, with optimized parallelization and data structures to improve web information retrieval performance.

Findings

01 High download rate of pages per second

02 Robustness against system crashes

03 Improved crawler performance through data structures

Abstract

The World Wide Web consists of more than 50 billion pages. It is highly dynamic: it continuously introduces new capabilities and attracts many users. Because of this explosion in size, an effective information retrieval system or search engine is needed to access the information. In this paper we propose the EPOW (Effective Performance of WebCrawler) architecture. It is a software agent whose main objective is to minimize the user's overload in locating needed information. We designed the web crawler around a parallelization policy. Because the EPOW crawler is highly optimized, it can download a large number of pages per second while remaining robust against crashes. We also propose using data structure concepts to implement the scheduler and a circular queue, which improves the performance of the web crawler.
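The abstract names a scheduler and a circular queue but does not describe how they are implemented. The following is a minimal sketch of one plausible reading, not the authors' design: a fixed-capacity ring buffer holds pending URLs, and a round-robin scheduler hands them to parallel workers. The names `CircularQueue` and `schedule_round_robin`, the fixed capacity, and the round-robin policy are all assumptions made for illustration.

```python
from typing import List, Optional


class CircularQueue:
    """Fixed-capacity FIFO ring buffer of URLs (hypothetical frontier)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots: List[Optional[str]] = [None] * capacity
        self._head = 0  # index of the oldest stored URL
        self._size = 0  # number of URLs currently stored

    def __len__(self) -> int:
        return self._size

    def is_full(self) -> bool:
        return self._size == len(self._slots)

    def enqueue(self, url: str) -> bool:
        """Store a URL at the tail; return False if the buffer is full."""
        if self.is_full():
            return False
        tail = (self._head + self._size) % len(self._slots)
        self._slots[tail] = url
        self._size += 1
        return True

    def dequeue(self) -> Optional[str]:
        """Remove and return the oldest URL, or None if empty."""
        if self._size == 0:
            return None
        url = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        self._size -= 1
        return url


def schedule_round_robin(queue: CircularQueue, worker_count: int) -> List[List[str]]:
    """Drain the queue, assigning URLs to workers in turn (assumed policy)."""
    if worker_count <= 0:
        raise ValueError("worker_count must be positive")
    assignments: List[List[str]] = [[] for _ in range(worker_count)]
    turn = 0
    while len(queue) > 0:
        url = queue.dequeue()
        if url is not None:
            assignments[turn].append(url)
        turn = (turn + 1) % worker_count
    return assignments
```

A ring buffer reuses freed slots without shifting elements, so both enqueue and dequeue take constant time. That property is one reason such a structure might suit a crawler that adds and removes URLs at a high rate.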

Taxonomy

Topics: Web Data Mining and Analysis · Caching and Content Delivery · Peer-to-Peer Network Technologies