Efficient Crawling for Scalable Web Data Acquisition (Extended Version)

Antoine Gauquier; Ioana Manolescu; Pierre Senellart

arXiv:2602.11874·cs.IR·February 13, 2026

Open Access

TL;DR

This paper introduces SB-CLASSIFIER, a focused web crawler based on reinforcement learning. It retrieves statistical datasets from large websites by learning which links are worth following, so that it crawls much less than the full website.

Contribution

The paper presents a novel RL-based approach using sleeping bandits for focused web crawling, improving efficiency in accessing statistical data from large websites.

Findings

01 High target retrieval rates with minimal crawling

02 Efficient learning of relevant hyperlinks

03 Scalable performance on large websites

Abstract

Journalistic fact-checking, as well as social or economic research, requires analyzing high-quality statistics datasets (SDs, in short). However, retrieving SD corpora at scale may be hard, inefficient, or impossible, depending on how they are published online. To improve open statistics data accessibility, we present a focused Web crawling algorithm that retrieves as many targets, i.e., resources of certain types, as possible, from a given website, in an efficient and scalable way, by crawling (much) less than the full website. We show that optimally solving this problem is intractable, and propose an approach based on reinforcement learning, namely using sleeping bandits. We propose SB-CLASSIFIER, a crawler that efficiently learns which hyperlinks lead to pages that link to many targets, based on the paths leading to the links in their enclosing webpages. Our experiments on websites…
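To make the sleeping-bandit idea in the abstract more concrete, here is a minimal sketch, not the paper's SB-CLASSIFIER. It assumes that each arm is a group of links sharing the same path in their enclosing page, that an arm is "asleep" whenever it has no uncrawled link left, and that the reward of a crawl step is the number of targets found behind the link. The class `SleepingUCB`, the function `crawl`, the arm names, and the toy rewards are all hypothetical; the selection rule is a generic UCB restricted to awake arms, chosen for illustration only.

```python
import math
from collections import defaultdict


class SleepingUCB:
    """UCB-style bandit in which only the currently awake arms can be chosen."""

    def __init__(self, exploration=2.0):
        self.exploration = exploration    # weight of the exploration bonus (assumed value)
        self.pulls = defaultdict(int)     # arm -> number of times it was chosen
        self.reward = defaultdict(float)  # arm -> cumulative reward
        self.rounds = 0

    def select(self, awake_arms):
        """Return the awake arm with the highest upper confidence bound."""
        awake = list(awake_arms)
        if not awake:
            raise ValueError("no awake arm to select")
        self.rounds += 1
        # Try every awake arm once before trusting the estimates.
        for arm in awake:
            if self.pulls[arm] == 0:
                return arm

        def ucb(arm):
            mean = self.reward[arm] / self.pulls[arm]
            bonus = math.sqrt(
                self.exploration * math.log(self.rounds) / self.pulls[arm]
            )
            return mean + bonus

        return max(awake, key=ucb)

    def update(self, arm, reward):
        """Record the reward observed after following a link from this arm."""
        self.pulls[arm] += 1
        self.reward[arm] += reward


def crawl(frontier, budget, bandit=None):
    """Simulate a crawl over toy data.

    frontier maps an arm (a hypothetical link-path group) to the list of
    rewards hidden behind its pending links. An arm with an empty list is
    asleep and cannot be selected.
    """
    if bandit is None:
        bandit = SleepingUCB()
    frontier = {arm: list(links) for arm, links in frontier.items()}
    found = 0
    for _ in range(budget):
        awake = [arm for arm, links in frontier.items() if links]
        if not awake:
            break  # nothing left to crawl
        arm = bandit.select(awake)
        reward = frontier[arm].pop()
        bandit.update(arm, reward)
        found += reward
    return found, bandit


if __name__ == "__main__":
    toy_site = {"nav": [0] * 50, "data": [3] * 50}
    found, bandit = crawl(toy_site, budget=20)
    print(found, dict(bandit.pulls))
```

In this toy setting the bandit quickly concentrates its budget on the link group that yields targets, and falls back to the other group only once the productive one is exhausted. A real crawler would also have to discover new links and arms as pages are fetched, which this sketch leaves out.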

Taxonomy

Topics: Web Data Mining and Analysis · Web visibility and informetrics · Complex Network Analysis Techniques