CRAWLDoc: A Dataset for Robust Ranking of Bibliographic Documents

Fabian Karl; Ansgar Scherp

arXiv:2506.03822·cs.CL·June 5, 2025

Open Access

TL;DR

CRAWLDoc is a new method for ranking the web documents linked from a publication's landing page by relevance, laying the groundwork for metadata extraction across diverse web layouts and formats.

Contribution

Introduces CRAWLDoc, a layout-independent ranking method for linked web resources, supported by a new dataset of 600 manually labeled publications from six top publishers in computer science.

Findings

1. CRAWLDoc outperforms existing methods in ranking relevant documents.

2. The method is layout-independent and robust across publishers.

3. Provides a new dataset for evaluating document ranking methods.

Abstract

Publication databases rely on accurate metadata extraction from diverse web sources, yet variations in web layouts and data formats present challenges for metadata providers. This paper introduces CRAWLDoc, a new method for contextual ranking of linked web documents. Starting with a publication's URL, such as a digital object identifier, CRAWLDoc retrieves the landing page and all linked web resources, including PDFs, ORCID profiles, and supplementary materials. It embeds these resources, along with anchor texts and the URLs, into a unified representation. For evaluating CRAWLDoc, we have created a new, manually labeled dataset of 600 publications from six top publishers in computer science. Our method CRAWLDoc demonstrates a robust and layout-independent ranking of relevant documents across publishers and data formats. It lays the foundation for improved metadata extraction from web…
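
The pipeline described in the abstract (collect linked resources, combine each one with its anchor text and URL, rank by relevance) can be illustrated with a minimal sketch. This is not the authors' implementation. Everything in it is an assumption made for illustration: the `LinkedResource` type, the function names, the use of the landing-page text as the ranking query, and above all the bag-of-words cosine similarity, which stands in for whatever learned embedding the paper actually uses. Retrieval over the network is left out, so the resources are assumed to be already fetched and converted to text.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class LinkedResource:
    """Hypothetical container for one resource linked from a landing page."""
    url: str
    anchor_text: str
    content: str


def tokenize(text):
    # Lowercase and split into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    # Toy stand-in for a learned text embedding: raw term counts.
    return Counter(tokenize(text))


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_resources(landing_page_text, resources):
    # Assumption: the landing page serves as the query that the
    # linked resources are scored against.
    query = embed(landing_page_text)
    scored = []
    for res in resources:
        # Unified representation: URL, anchor text, and content together.
        unified = " ".join([res.url, res.anchor_text, res.content])
        scored.append((cosine(query, embed(unified)), res))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

Calling `rank_resources` with the landing-page text and a list of `LinkedResource` objects returns `(score, resource)` pairs in descending order of similarity. Term counts ignore layout entirely, which mirrors the layout-independence goal only in spirit; a real system would replace `embed` with a trained encoder.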


Taxonomy

Topics: Web Data Mining and Analysis · Web visibility and informetrics · Text and Document Classification Technologies