Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure
Martin Klein, Michael L Nelson

TL;DR
This paper evaluates four automated methods for rediscovering missing web pages, finding that combining titles and lexical signatures significantly improves retrieval success rates.
Contribution
It compares multiple automated techniques for web page recovery and recommends an effective combination of methods for better retrieval performance.
Findings
Over 60% of missing pages are returned top-ranked by Yahoo! when queried with either lexical signatures or titles.
Combining methods increases top-ranked retrieval to over 75%.
Querying titles first, then lexical signatures, is the most effective approach.
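The title-first ordering can be sketched as a simple fallback. This is a minimal illustration, not the authors' implementation: `rediscover`, the `search` callable (query string in, ranked URIs out), and the `is_sufficient` predicate are hypothetical names, and what counts as "insufficient results" is left to the caller.

```python
from typing import Callable, List, Sequence

# Hypothetical search interface: takes a query string, returns ranked URIs.
SearchFn = Callable[[str], List[str]]


def rediscover(
    title: str,
    lexical_signature: Sequence[str],
    search: SearchFn,
    is_sufficient: Callable[[List[str]], bool],
) -> List[str]:
    """Query with the title first; fall back to the lexical signature."""
    cleaned = title.strip()
    if cleaned:
        results = search(cleaned)
        if is_sufficient(results):
            return results
    # Title missing or results judged insufficient: try the signature terms.
    return search(" ".join(lexical_signature))
```

The title is tried first because it is cheap to extract, while the signature is only computed into a query when the title fails.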
Abstract
Missing web pages (pages that return the 404 "Page Not Found" error) are part of the browsing experience. The manual use of search engines to rediscover missing pages can be frustrating and unsuccessful. We compare four automated methods for rediscovering web pages. We extract the page's title, generate the page's lexical signature (LS), obtain the page's tags from the bookmarking website delicious.com and generate an LS from the page's link neighborhood. We use the output of all methods to query Internet search engines and analyze their retrieval performance. Our results show that both LSs and titles perform fairly well, with over 60% of URIs returned top-ranked from Yahoo!. However, the combination of methods improves the retrieval performance. Considering the complexity of the LS generation, querying the title first and in case of insufficient results querying the LSs second is the…
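The abstract excerpt does not spell out how a lexical signature is built. A common formulation, assumed here, takes the few terms of a page with the highest TF-IDF scores. The sketch below follows that assumption; `lexical_signature`, `doc_freq` (term to document-frequency counts), `corpus_size`, and the default of five terms are all hypothetical choices for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def lexical_signature(
    text: str, doc_freq: Dict[str, int], corpus_size: int, k: int = 5
) -> List[str]:
    """Return the k terms of `text` with the highest TF-IDF score."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    tf = Counter(terms)

    def score(term: str) -> float:
        # Assumption: a term absent from doc_freq is treated as very rare.
        df = max(doc_freq.get(term, 1), 1)
        return tf[term] * math.log(corpus_size / df)

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(tf, key=lambda t: (-score(t), t))
    return ranked[:k]
```

In practice the document frequencies would have to be estimated from some corpus or index, which is one source of the generation cost the abstract mentions.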
Taxonomy
Topics: Web Data Mining and Analysis · Information Retrieval and Search Behavior · Text and Document Classification Technologies
