A Scalable Crawling Algorithm Utilizing Noisy Change-Indicating Signals
Róbert Busa-Fekete, Julian Zimmert, András György, Linhai Qiu, Tzu-Wei Sung, Hao Shen, Hyomin Choi, Sharmila Subramaniam, Li Xiao

TL;DR
This paper introduces a scalable web crawling algorithm that exploits noisy change-indicating signals, such as sitemap and CDN pings, to keep cached content fresh under a limited bandwidth budget while adapting to varying network conditions.
Contribution
The paper presents a novel, scalable crawling method that incorporates noisy side information, operates without heavy central computation, and adapts to bandwidth changes in real time.
Findings
Effective use of noisy signals improves crawling freshness.
The algorithm maintains constant bandwidth usage without spikes.
Experiments demonstrate the algorithm's versatility.
Abstract
Web refresh crawling is the problem of keeping a cache of web pages fresh, that is, having the most recent copy available when a page is requested, given the limited bandwidth available to the crawler. Under the assumption that the change events and the request events for each web page follow independent Poisson processes, the optimal scheduling policy was derived by Azar et al. (2018). In this paper, we study an extension of this problem where side information indicating content changes is available, such as various types of web pings (for example, signals from sitemaps or content delivery networks). Incorporating such side information into the crawling policy is challenging, because (i) the signals can be noisy with false positive events and with missing change events; and (ii) the crawler should achieve a fair performance over web pages regardless of the quality of the side information,…
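The setting described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's algorithm: it only models a single page whose changes follow a Poisson process, generates a noisy signal stream with missed changes and false positives, and evaluates the standard closed form for the time-averaged freshness of a page recrawled at a fixed interval. All names (`expected_freshness`, `simulate_noisy_signals`, `recall`, `false_positive_rate`) are hypothetical and chosen for illustration.

```python
import math
import random


def expected_freshness(change_rate: float, crawl_interval: float) -> float:
    """Fraction of time the cached copy is fresh when a page that changes as a
    Poisson process with rate `change_rate` is recrawled every `crawl_interval`
    time units: (1 - exp(-x)) / x with x = change_rate * crawl_interval."""
    x = change_rate * crawl_interval
    if x == 0:
        return 1.0  # a page that never changes is always fresh
    return -math.expm1(-x) / x


def _poisson_times(rng: random.Random, rate: float, horizon: float) -> list:
    """Event times of a Poisson process with the given rate on [0, horizon)."""
    times = []
    if rate <= 0:
        return times
    t = rng.expovariate(rate)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


def simulate_noisy_signals(change_rate, false_positive_rate, recall, horizon, seed=0):
    """Return (change_times, signal_times) for one page.

    Assumption for illustration only: each true change emits a signal with
    probability `recall` (the rest are missed), and false positive signals
    arrive as an independent Poisson process with rate `false_positive_rate`.
    """
    rng = random.Random(seed)
    changes = _poisson_times(rng, change_rate, horizon)
    true_signals = [t for t in changes if rng.random() < recall]
    false_signals = _poisson_times(rng, false_positive_rate, horizon)
    return changes, sorted(true_signals + false_signals)


if __name__ == "__main__":
    changes, signals = simulate_noisy_signals(
        change_rate=0.5, false_positive_rate=0.2, recall=0.7, horizon=100.0
    )
    print(len(changes), "changes,", len(signals), "signals")
    print("freshness at interval 2.0:", expected_freshness(0.5, 2.0))
```

Varying `recall` and `false_positive_rate` shows why such signals cannot be trusted blindly: with low recall a crawler that waits for signals misses changes, and with a high false positive rate it wastes bandwidth on unchanged pages.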
Taxonomy
Topics: Evolutionary Algorithms and Applications · Embedded Systems Design Techniques
