A pipeline for matching bibliographic references with incomplete metadata: experiments with Crossref and OpenCitations
Matteo Guenci, Ivan Heibi, Chiara Parravicini, Silvio Peroni, Marta Soricetti

TL;DR
This paper presents a methodology for matching bibliographic references with incomplete metadata by analyzing unstructured text and partial data, improving citation link creation between entities in bibliographic databases.
Contribution
It introduces a heuristic-based tool and methodology for mapping references with incomplete metadata to existing entities, enhancing citation network completeness.
Findings
High matching precision achieved
Effective integration of partial metadata
Recall limitations indicate need for further improvements
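For readers unfamiliar with the two measures named in the findings, the following is a minimal sketch of how precision and recall could be computed for a reference matcher against a gold standard. The function name `precision_recall` and the dictionary layout (reference id mapped to matched entity id, with `None` for no match) are assumptions made for illustration; they are not taken from the paper.

```python
def precision_recall(predicted, gold):
    """Compare proposed matches with gold-standard matches.

    Both arguments map a reference id to the id of the matched entity,
    or to None when no match exists (hypothetical layout).
    """
    proposed = {ref: ent for ref, ent in predicted.items() if ent is not None}
    expected = {ref: ent for ref, ent in gold.items() if ent is not None}
    # A proposed match is correct only if the gold standard names the same entity.
    correct = sum(1 for ref, ent in proposed.items() if expected.get(ref) == ent)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall
```

A matcher that proposes few matches but gets them all right scores high on precision and low on recall, which is the pattern the findings describe.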
Abstract
While Crossref makes available more than 1.8 billion bibliographic references from publications for which it provides a DOI, more than 698 million of these references do not specify a DOI, making the creation of a formal citation link between the citing entity and the cited entity problematic. In this article, we propose an analysis of Crossref bibliographic references to show how we can use the unstructured text defining such references and the available (and partial) metadata specified in them to (a) map them to existing entities included in OpenCitations Meta and, then, (b) enable the potential inclusion of additional and valid citation links among these entities. We have defined a precise methodology to address the analysis and run it against a manually defined Gold Standard and a subset of Crossref. While the heuristic-based tool developed has demonstrated strong matching precision…
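The general idea of combining unstructured reference text with partial metadata can be illustrated with a small sketch. This is not the authors' tool: the field names (`unstructured`, `year`, `volume`, `first_page`, `title`, `id`), the scoring weights, the 0.8 title-overlap ratio, and the threshold are all hypothetical choices made for this example.

```python
import re


def normalize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(reference, candidate):
    """Score a candidate record by how much of it the reference supports."""
    tokens = normalize(reference.get("unstructured", ""))
    points = 0
    for field in ("year", "volume", "first_page"):
        value = candidate.get(field)
        if not value:
            continue
        # A field counts if the structured value agrees or if the value
        # appears as a token in the free-text reference string.
        if reference.get(field) == value or value.lower() in tokens:
            points += 1
    title_tokens = normalize(candidate.get("title", ""))
    # Title overlap is weighted more heavily than a single numeric field (assumed weight).
    if title_tokens and len(title_tokens & tokens) / len(title_tokens) >= 0.8:
        points += 2
    return points


def best_match(reference, candidates, threshold=3):
    """Return the id of the single best candidate, or None if weak or ambiguous."""
    scored = sorted(((score(reference, c), c["id"]) for c in candidates), reverse=True)
    if not scored or scored[0][0] < threshold:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # a tie is treated as no match
    return scored[0][1]
```

Refusing to return a match below the threshold or on a tie trades recall for precision, a design choice in line with the precision-first behaviour the summary reports.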
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Data Quality and Management · Advanced Text Analysis Techniques
