Efficient Crawling for Scalable Web Data Acquisition (Extended Version)
Antoine Gauquier, Ioana Manolescu, Pierre Senellart

TL;DR
This paper introduces SB-CLASSIFIER, a focused web crawling algorithm based on reinforcement learning. It retrieves high-quality statistical datasets from large websites by following only the links likely to lead to targets, so that it crawls much less than the full website.
Contribution
The paper presents a novel RL-based approach using sleeping bandits for focused web crawling, improving efficiency in accessing statistical data from large websites.
Findings
High target retrieval rates with minimal crawling
Efficient learning of relevant hyperlinks
Scalable performance on large websites
Abstract
Journalistic fact-checking, as well as social or economic research, requires analyzing high-quality statistics datasets (SDs, in short). However, retrieving SD corpora at scale may be hard, inefficient, or impossible, depending on how they are published online. To improve open statistics data accessibility, we present a focused Web crawling algorithm that retrieves as many targets, i.e., resources of certain types, as possible, from a given website, in an efficient and scalable way, by crawling (much) less than the full website. We show that optimally solving this problem is intractable, and propose an approach based on reinforcement learning, namely using sleeping bandits. We propose SB-CLASSIFIER, a crawler that efficiently learns which hyperlinks lead to pages that link to many targets, based on the paths leading to the links in their enclosing webpages. Our experiments on websites…
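To make the sleeping-bandit idea in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes that links are grouped into arms by the path leading to them in the enclosing page, that an arm is "awake" only while it still has unvisited links, and that the reward for fetching a page is the number of targets found on it. The selection rule is a plain UCB restricted to awake arms, which may differ from the rule used by SB-CLASSIFIER. All names (`SleepingUCB`, `crawl`, `toy_frontier`, `toy_fetch`), the example paths, and the example URLs are hypothetical.

```python
import math
from collections import defaultdict


class SleepingUCB:
    """UCB-style bandit in which only a subset of arms is awake each round."""

    def __init__(self, exploration=1.0):
        self.exploration = exploration
        self.pulls = defaultdict(int)    # arm -> number of times it was chosen
        self.mean = defaultdict(float)   # arm -> running mean reward
        self.rounds = 0

    def select(self, awake_arms):
        """Return the awake arm with the highest upper confidence bound."""
        if not awake_arms:
            raise ValueError("no awake arm to select")
        self.rounds += 1
        best_arm, best_score = None, -math.inf
        for arm in sorted(awake_arms):   # sorted for deterministic tie-breaking
            if self.pulls[arm] == 0:
                return arm               # try each newly seen arm once
            bonus = self.exploration * math.sqrt(
                2 * math.log(self.rounds) / self.pulls[arm]
            )
            score = self.mean[arm] + bonus
            if score > best_score:
                best_arm, best_score = arm, score
        return best_arm

    def update(self, arm, reward):
        """Fold one observed reward into the arm's running mean."""
        self.pulls[arm] += 1
        self.mean[arm] += (reward - self.mean[arm]) / self.pulls[arm]


def crawl(frontier, fetch, budget):
    """Crawl at most `budget` pages.

    frontier: dict mapping an arm (link path) to its list of unvisited URLs.
    fetch:    callable returning the number of targets found at a URL.
    """
    bandit = SleepingUCB()
    found = 0
    for _ in range(budget):
        # An arm is awake only while it still has unvisited links.
        awake = [arm for arm, urls in frontier.items() if urls]
        if not awake:
            break
        arm = bandit.select(awake)
        url = frontier[arm].pop()
        reward = fetch(url)
        bandit.update(arm, reward)
        found += reward
    return found, bandit


def toy_frontier():
    """Hypothetical site: one link path leads to datasets, the other does not."""
    return {
        "div.datasets > a": [f"https://example.org/data/{i}" for i in range(20)],
        "nav > ul > li > a": [f"https://example.org/about/{i}" for i in range(20)],
    }


def toy_fetch(url):
    """Simulated fetch: pages under /data/ expose 3 targets, others none."""
    return 3 if "/data/" in url else 0


found, bandit = crawl(toy_frontier(), toy_fetch, budget=20)
print(found, dict(bandit.pulls))
```

In this toy run the bandit samples each link path once and then spends the rest of its budget on the path that yields targets, which is the behaviour the abstract describes as learning which hyperlinks lead to pages that link to many targets. A real crawler would also have to discover new links and arms as pages are fetched, which this sketch leaves out.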
Taxonomy
Topics: Web Data Mining and Analysis · Web visibility and informetrics · Complex Network Analysis Techniques
