Timely crawling of high-quality ephemeral new content
Damien Lefortier, Liudmila Ostroumova, Egor Samosvat, Pavel Serdyukov

TL;DR
This paper presents a new approach to timely crawling of ephemeral web pages: it prioritizes high-quality content sources and optimizes a novel metric that measures how much user interest the crawler captures. The approach is evaluated in real-world experiments.
Contribution
It introduces a new metric for ephemeral page crawling, identifies key content sources, and proposes an adaptive recrawl and crawl strategy based on user interest and search logs.
Findings
Most ephemeral pages can be found at a small number of content sources.
The proposed method improves crawling timeliness and relevance.
Experimental results show increased user interest capture.
Abstract
Nowadays, more and more people use the Web as their primary source of up-to-date information. In this context, fast crawling and indexing of newly created Web pages has become crucial for search engines, especially because user traffic to a significant fraction of these new pages (such as news, blog, and forum posts) grows very quickly right after they appear but lasts only for several days. In this paper, we study the problem of finding and crawling such ephemeral new pages (ephemeral in terms of user interest) in a timely manner. Traditional crawling policies do not give any particular priority to such pages and may therefore crawl them too slowly, or even crawl content that is already obsolete. We thus propose a new metric, designed for this task, which takes into account the decrease of user interest in ephemeral pages over time. We show that most ephemeral new pages can be found at a…
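The idea of a metric that accounts for decaying user interest can be illustrated with a small sketch. The abstract does not give the formula, so the code below is not the paper's definition: it assumes, purely for illustration, that interest in a page decays exponentially with its age, and it scores each page by the interest still to come if the page is crawled now. The function names, the dictionary fields, and the decay model are all hypothetical.

```python
import math


def remaining_interest(total_clicks: float, decay_rate: float, delay_hours: float) -> float:
    """Estimate the user interest still capturable after a crawl delay.

    Assumption (not from the paper's abstract): interest decays
    exponentially, so a page crawled `delay_hours` after it appears
    retains total_clicks * exp(-decay_rate * delay_hours).
    """
    return total_clicks * math.exp(-decay_rate * delay_hours)


def crawl_order(pages: list[dict]) -> list[dict]:
    """Order pages so those with the most remaining interest come first.

    Each page is a dict with hypothetical keys:
    'url', 'total_clicks', 'decay_rate', 'age_hours'.
    """
    return sorted(
        pages,
        key=lambda p: remaining_interest(
            p["total_clicks"], p["decay_rate"], p["age_hours"]
        ),
        reverse=True,
    )


# Illustrative, made-up inputs: a fresh news post versus an older,
# initially more popular one whose interest has mostly faded.
example_pages = [
    {"url": "old-post", "total_clicks": 1000.0, "decay_rate": 0.5, "age_hours": 24.0},
    {"url": "fresh-post", "total_clicks": 200.0, "decay_rate": 0.5, "age_hours": 1.0},
]
ordered_urls = [p["url"] for p in crawl_order(example_pages)]
```

Under these assumptions the fresh post ranks ahead of the older one even though the older one was expected to attract more clicks in total, which is the behavior a decay-aware metric is meant to reward.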
Taxonomy
Topics: Web Data Mining and Analysis · Algorithms and Data Compression · Caching and Content Delivery
