WebParF: A Web partitioning framework for Parallel Crawlers

Sonali Gupta; Komal kumar Bhatia; Pikakshi Manchanda

arXiv:1406.5690 · cs.IR · June 24, 2014 · 2 cites

Open Access

TL;DR

WebParF is a framework designed to improve web crawling efficiency by partitioning the URL frontier for parallel crawlers, addressing URL distribution and ordering challenges in multi-threaded environments.

Contribution

The paper introduces WebParF, a novel framework that effectively partitions URL frontiers for parallel crawlers, enhancing crawling performance and scalability.

Findings

01. WebParF improves crawling efficiency through effective URL partitioning.

02. The framework addresses URL distribution and ordering challenges.

03. Experimental results show increased crawling throughput.

Abstract

With the ever-growing size and scale of the WWW [1], efficient ways of exploring content are of increasing importance. How can we efficiently retrieve information from it through crawling? In this era of tera- and multi-core processors, multi-threaded processes are a natural solution. Better still, how can we improve crawling performance by using parallel crawlers that work independently? The paper is devoted to the fundamental developments in the field of parallel crawlers [4], highlighting the advantages and challenges arising from their design. The paper also focuses on distributing URLs among the parallel crawling processes or threads and on ordering the URLs within each distributed set. How to distribute URLs from the URL frontier to the concurrently executing crawling processes or threads is an orthogonal problem. The paper…
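To make the two concerns in the abstract concrete (distributing frontier URLs among crawlers, then ordering URLs within each crawler's set), here is a minimal Python sketch. It is not WebParF's partitioning scheme, which the truncated abstract does not describe. It assumes a simple host-hash assignment and a hypothetical per-URL importance score. The names `assign_partition`, `partition_frontier`, and `drain` are illustrative only.

```python
import hashlib
import heapq
from urllib.parse import urlsplit


def assign_partition(url: str, num_crawlers: int) -> int:
    """Map a URL to a crawler index by hashing its host (an assumed policy)."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_crawlers


def partition_frontier(scored_urls, num_crawlers):
    """Split (url, score) pairs into one priority queue per crawler.

    Scores are hypothetical importance values; higher means crawl sooner.
    """
    queues = [[] for _ in range(num_crawlers)]
    for order, (url, score) in enumerate(scored_urls):
        # heapq is a min-heap, so the score is negated; `order` breaks ties.
        target = assign_partition(url, num_crawlers)
        heapq.heappush(queues[target], (-score, order, url))
    return queues


def drain(queue):
    """Yield URLs from one crawler's queue in descending score order."""
    while queue:
        _, _, url = heapq.heappop(queue)
        yield url
```

Hashing on the host keeps all pages of one site with the same crawler, which is one common way to avoid duplicate fetches across crawlers. Other assignment policies, such as topic-based or geographic ones, would replace only `assign_partition`.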

Taxonomy

Topics: Web Data Mining and Analysis · Caching and Content Delivery · Algorithms and Data Compression