WebParF: A Web partitioning framework for Parallel Crawlers
Sonali Gupta, Komal Kumar Bhatia, Pikakshi Manchanda

TL;DR
WebParF is a framework designed to improve web crawling efficiency by partitioning the URL frontier for parallel crawlers, addressing URL distribution and ordering challenges in multi-threaded environments.
Contribution
The paper introduces WebParF, a novel framework that effectively partitions URL frontiers for parallel crawlers, enhancing crawling performance and scalability.
Findings
WebParF improves crawling efficiency through effective URL partitioning.
The framework addresses URL distribution and ordering challenges.
Experimental results show increased crawling throughput.
Abstract
With the ever-proliferating size and scale of the WWW [1], efficient ways of exploring content are of increasing importance. How can we efficiently retrieve information from it through crawling? In this era of tera and multi-core processors, we ought to think of multi-threaded processes as a serving solution. Better still, how can we improve crawling performance by using parallel crawlers that work independently? The paper is devoted to fundamental developments in the field of parallel crawlers [4], highlighting the advantages and challenges arising from their design. It also focuses on distributing URLs among the parallel crawling processes or threads and on ordering the URLs within each distributed set. How to distribute URLs from the URL frontier to the concurrently executing crawling processes or threads is an orthogonal problem. The paper…
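The two concerns named in the abstract, distributing URLs among parallel crawlers and ordering the URLs within each distributed set, can be illustrated with a minimal sketch. This is not the WebParF partitioning scheme, which the truncated abstract does not specify; it assumes a simple host-hash assignment and a numeric priority score per URL. The names `assign_partition` and `partition_frontier`, and the priority values, are hypothetical.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def assign_partition(url: str, num_crawlers: int) -> int:
    """Map a URL to a crawler index by hashing its host name.

    Host hashing is an assumed scheme for illustration only; it keeps all
    URLs of one host with the same crawler.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


def partition_frontier(frontier, num_crawlers):
    """Split (url, priority) pairs into per-crawler URL lists.

    Within each partition, URLs are ordered from highest to lowest priority.
    The priority score is a hypothetical input, not defined by the source.
    """
    partitions = defaultdict(list)
    for url, priority in frontier:
        partitions[assign_partition(url, num_crawlers)].append((url, priority))
    return {
        idx: [u for u, _ in sorted(items, key=lambda pair: -pair[1])]
        for idx, items in partitions.items()
    }


if __name__ == "__main__":
    frontier = [
        ("http://a.example/1", 0.2),
        ("http://a.example/2", 0.9),
        ("http://b.example/x", 0.5),
    ]
    for idx, urls in sorted(partition_frontier(frontier, 4).items()):
        print(idx, urls)
```

Hashing on the host rather than the full URL is one common design choice because it lets each crawler apply per-host politeness limits locally; other partitioning criteria are equally possible.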
Taxonomy
Topics: Web Data Mining and Analysis · Caching and Content Delivery · Algorithms and Data Compression
