RenoBench: A Citation Parsing Benchmark
Parth Sarin, Juan Pablo Alperin, Adam Buttrick, Dione Mentis

TL;DR
RenoBench is a publicly available, multilingual benchmark dataset of 10,000 annotated citations drawn from PDFs released on four publishing ecosystems (SciELO, Redalyc, the Public Knowledge Project, and Open Research Europe), designed to standardize and improve the evaluation of citation parsing systems.
Contribution
The paper introduces RenoBench, a large, diverse, and publicly accessible dataset for citation parsing that enables reproducible and standardized system evaluation.
Findings
Language models, especially when fine-tuned, perform strongly on citation parsing.
RenoBench covers multiple languages, publication types, and platforms.
Automated validation and feature-based sampling improve dataset quality.
Abstract
Accurate parsing of citations is necessary for machine-readable scholarly infrastructure. But, despite sustained interest in this problem, existing evaluation techniques are often not generalizable, based on synthetic data, or not publicly available. We introduce RenoBench, a public domain benchmark for citation parsing, sourced from PDFs released on four publishing ecosystems: SciELO, Redalyc, the Public Knowledge Project, and Open Research Europe. Starting from 161,000 annotated citations, we apply automated validation and feature-based sampling to produce a dataset of 10,000 citations spanning multiple languages, publication types, and platforms. We then evaluate a variety of citation parsing systems and report field-level precision and recall. Our results show strong performance from language models, particularly when fine-tuned. RenoBench enables reproducible, standardized…
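To make "field-level precision and recall" concrete, here is a minimal Python sketch of how such scores might be computed. It is an illustration only, not the paper's evaluation code: the function name `field_level_scores`, the field names, the sample citations, and the matching rule (exact match after lowercasing and whitespace normalization) are all assumptions.

```python
from collections import defaultdict


def field_level_scores(gold, predicted):
    """Per-field precision and recall over parallel lists of parsed citations.

    Each citation is a dict mapping a field name to a string value.
    ASSUMPTION: a predicted field is correct only if it equals the gold
    value after lowercasing and collapsing whitespace. The benchmark's
    actual matching criterion may differ.
    """
    def norm(value):
        return " ".join(value.lower().split())

    correct = defaultdict(int)  # predicted fields that match gold
    n_pred = defaultdict(int)   # fields emitted by the parser
    n_gold = defaultdict(int)   # fields present in the annotation

    for g, p in zip(gold, predicted):
        for field in g:
            n_gold[field] += 1
        for field, value in p.items():
            n_pred[field] += 1
            if field in g and norm(g[field]) == norm(value):
                correct[field] += 1

    scores = {}
    for field in set(n_gold) | set(n_pred):
        precision = correct[field] / n_pred[field] if n_pred[field] else 0.0
        recall = correct[field] / n_gold[field] if n_gold[field] else 0.0
        scores[field] = {"precision": precision, "recall": recall}
    return scores


# Hypothetical gold annotations and parser output, for illustration only.
gold = [
    {"author": "Doe, J.", "title": "A study of things", "year": "2020"},
    {"author": "Roe, R.", "title": "Another study", "year": "2019"},
]
predicted = [
    {"author": "Doe, J.", "title": "A Study of Things", "year": "2021"},
    {"author": "Roe, R.", "year": "2019"},
]
scores = field_level_scores(gold, predicted)
```

In this toy input the parser omits one title and gets one year wrong, so `title` has perfect precision but halved recall, while `year` loses on both. Scoring each field separately exposes that kind of difference, which a single citation-level accuracy figure would hide.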
Taxonomy
Topics: Scientific Computing and Data Management · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies
