Neural Prioritisation for Web Crawling
Francesca Pezzuti, Sean MacAvaney, and Nicola Tonellotto

TL;DR
This paper introduces a neural semantic quality-driven prioritization method for web crawling that improves early-stage search effectiveness by surfacing semantically rich content aligned with modern natural language search trends.
Contribution
It proposes embedding neural semantic quality estimators into crawling to prioritize high-quality, semantically rich pages, advancing beyond traditional link-based techniques.
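To make the idea concrete, here is a minimal sketch of a quality-prioritised crawl frontier. It is an illustration under stated assumptions, not the authors' implementation: the callables `fetch`, `extract_links`, and `quality` are hypothetical placeholders (in practice `quality` would be a neural quality estimator), and the choice to give each newly discovered URL the quality score of the page that linked to it is an assumption made here, since a URL's own content is unknown before it is fetched.

```python
import heapq
from typing import Callable, Iterable, List, Set, Tuple


def prioritised_crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], str],                   # hypothetical: URL -> page text
    extract_links: Callable[[str], Iterable[str]], # hypothetical: page text -> outlinks
    quality: Callable[[str], float],               # hypothetical: page text -> quality score
    budget: int,
) -> List[str]:
    """Crawl up to `budget` pages, visiting links found on high-quality pages first."""
    # heapq is a min-heap, so priorities are stored negated.
    # The counter breaks ties in discovery (FIFO) order.
    frontier: List[Tuple[float, int, str]] = []
    seen: Set[str] = set()
    counter = 0

    for url in seeds:
        if url not in seen:
            seen.add(url)
            # Seeds are always crawled before any discovered link.
            heapq.heappush(frontier, (float("-inf"), counter, url))
            counter += 1

    crawled: List[str] = []
    while frontier and len(crawled) < budget:
        _, _, url = heapq.heappop(frontier)
        text = fetch(url)
        crawled.append(url)

        # Assumption: outlinks inherit the quality of the page they were found on.
        score = quality(text)
        for link in extract_links(text):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, counter, link))
                counter += 1

    return crawled
```

With a constant `quality` function this reduces to breadth-first crawling, which makes it easy to compare the two orderings on the same link graph.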
Findings
Significantly improves harvest rate and maxNDCG in early crawling stages.
Maintains comparable search performance on keyword queries.
Opens new research directions for semantic-aware web crawling.
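As a rough illustration of the first finding, the sketch below computes a harvest-rate curve. It assumes the common focused-crawling definition (the fraction of pages crawled so far that are relevant); the paper's exact formulation may differ, and the names `crawl_order` and `relevant` are hypothetical.

```python
from typing import Iterable, List, Set


def harvest_rate_curve(crawl_order: Iterable[str], relevant: Set[str]) -> List[float]:
    """Return the harvest rate after each crawled page.

    Entry t-1 is (relevant pages among the first t crawled) / t.
    """
    curve: List[float] = []
    hits = 0
    for t, url in enumerate(crawl_order, start=1):
        if url in relevant:
            hits += 1
        curve.append(hits / t)
    return curve
```

A crawler that reaches relevant pages earlier produces a curve that is higher at small t, which is what an improvement "in early crawling stages" refers to.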
Abstract
Given the vast scale of the Web, crawling prioritisation techniques based on link graph traversal, popularity, link analysis, and textual content are frequently applied to surface documents that are most likely to be valuable. While existing techniques are effective for keyword-based search, both retrieval methods and user search behaviours are shifting from keyword-based matching to natural language semantic matching. The remarkable success of applying semantic matching and quality signals during ranking leads us to hypothesize that crawling could be improved by prioritizing Web pages with high semantic quality. To investigate this, we propose a semantic quality-driven prioritisation technique to enhance the effectiveness of crawling and align the crawler behaviour with the recent shift towards natural language search. We embed semantic understanding directly into the crawling process --…
