Look back, look around: a systematic analysis of effective predictors for new outlinks in focused Web crawling
Thi Kim Nhung Dang (1), Doina Bucur (1), Berk Atil (2), Guillaume Pitel (3, 4), Frank Ruis (1), Hamidreza Kadkhodaei (1), and Nelly Litvak (1, 5) ((1) University of Twente, The Netherlands, (2) Bogazici University, Turkey, (3) Babbar, France, (4) Exensa, France, (5) Eindhoven

TL;DR
This paper systematically analyzes predictors for new outlinks in focused Web crawling, introducing a new 'look back, look around' model that outperforms existing methods by focusing on recent history and content-related pages.
Contribution
It unifies various feature designs into a taxonomy, introduces a new model based on recent history, and demonstrates its superior performance in predicting new outlinks.
Findings
The 'look back, look around' (LBLA) model outperforms other predictors.
The recent history of new outlinks, on the page itself and on content-related pages, is the most informative signal.
NGBoost effectively models the number of new outlinks.
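As a rough illustration of the 'look back, look around' idea in the findings above, the following Python sketch derives two features from crawl snapshots: the recent counts of new outlinks on a page itself (look back) and the mean recent count over pages related to it (look around). All names here (`new_outlinks`, `look_back`, `look_around`, `history`, `window`) and the toy data are hypothetical, chosen for illustration only; the paper's actual feature definitions, its notion of content-relatedness, and its models may differ.

```python
from statistics import mean

def new_outlinks(prev, curr):
    """Outlinks present in the current snapshot but absent from the previous one."""
    return set(curr) - set(prev)

def look_back(snapshots, window=3):
    """New-outlink counts for the most recent `window` consecutive snapshot pairs."""
    counts = [len(new_outlinks(a, b)) for a, b in zip(snapshots, snapshots[1:])]
    return counts[-window:]

def look_around(history, related, window=3):
    """Mean recent new-outlink total over related pages (0.0 if none are known)."""
    totals = [sum(look_back(history[p], window)) for p in related if p in history]
    return mean(totals) if totals else 0.0

# Toy data (assumption): each page maps to a list of outlink sets, one per crawl.
history = {
    "a": [{"x"}, {"x", "y"}, {"x", "y", "z", "w"}],
    "b": [{"u"}, {"u"}, {"u", "v"}],
    "c": [{"k"}, {"k", "l"}, {"k", "l", "m"}],
}

back = look_back(history["a"])              # page "a" gained 1, then 2 new outlinks
around = look_around(history, ["b", "c"])   # pages "b" and "c" gained 1 and 2 in total
print(back, around)
```

Features of this kind could then be passed to a count regressor; which pages count as related to a given page is left as an input here because it depends on the crawler's content model.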
Abstract
Small and medium enterprises rely on detailed Web analytics to be informed about their market and competition. Focused crawlers meet this demand by crawling and indexing specific parts of the Web. Critically, a focused crawler must quickly find new pages that have not yet been indexed. Since a new page can be discovered only by following a new outlink, predicting new outlinks is very relevant in practice. In the literature, many feature designs have been proposed for predicting changes in the Web. In this work we provide a structured analysis of this problem, using new outlinks as our running prediction target. Specifically, we unify earlier feature designs in a taxonomic arrangement of features along two dimensions: static versus dynamic features, and features of a page versus features of the network around it. Within this taxonomy, complemented by our new (mainly, dynamic network)…
Taxonomy
Topics: Web Data Mining and Analysis
