CRAWLDoc: A Dataset for Robust Ranking of Bibliographic Documents
Fabian Karl, Ansgar Scherp

TL;DR
CRAWLDoc is a new method that improves the ranking of web documents linked to publications, enhancing metadata extraction across diverse web layouts and formats.
Contribution
Introduces CRAWLDoc, a layout-independent ranking method for linked web resources, supported by a new dataset of 600 manually labeled publications.
Findings
CRAWLDoc outperforms existing methods in ranking relevant documents.
The method is layout-independent and robust across publishers.
The paper provides a new, manually labeled dataset of 600 publications from six top publishers in computer science for evaluating document ranking methods.
Abstract
Publication databases rely on accurate metadata extraction from diverse web sources, yet variations in web layouts and data formats present challenges for metadata providers. This paper introduces CRAWLDoc, a new method for contextual ranking of linked web documents. Starting with a publication's URL, such as a digital object identifier, CRAWLDoc retrieves the landing page and all linked web resources, including PDFs, ORCID profiles, and supplementary materials. It embeds these resources, along with anchor texts and the URLs, into a unified representation. For evaluating CRAWLDoc, we have created a new, manually labeled dataset of 600 publications from six top publishers in computer science. Our method CRAWLDoc demonstrates a robust and layout-independent ranking of relevant documents across publishers and data formats. It lays the foundation for improved metadata extraction from web…
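The pipeline described in the abstract (collect the resources linked from a landing page, combine each resource's content with its anchor text and URL, then rank) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name an embedding model, so the sketch substitutes a plain token-count cosine similarity for the learned embedding, and it scores each resource against the landing page text, which is an assumption about how relevance is measured. The names LinkCollector, rank_linked_resources, and the resource dictionary keys are hypothetical, and fetching is left out so the example runs offline.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from landing-page HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (url, anchor_text) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the landing page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def tokens(text):
    """Lowercase alphanumeric tokens; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_linked_resources(landing_text, resources):
    """Rank resources (dicts with 'url', 'anchor', 'content') by similarity.

    Assumption: anchor text, URL, and content are concatenated into one
    string per resource and compared with the landing page text.
    """
    query = Counter(tokens(landing_text))
    scored = []
    for res in resources:
        unified = " ".join([res["anchor"], res["url"], res["content"]])
        scored.append((cosine(query, Counter(tokens(unified))), res["url"]))
    return sorted(scored, reverse=True)
```

In use, the landing page HTML would be passed to LinkCollector.feed, each collected URL would be fetched and converted to text by whatever means suits its format (HTML, PDF, and so on), and the resulting dictionaries would be handed to rank_linked_resources. Replacing tokens and cosine with a neural text encoder would bring the sketch closer in spirit to the embedding-based approach the abstract describes.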
Taxonomy
Topics: Web Data Mining and Analysis · Web visibility and informetrics · Text and Document Classification Technologies
