Citation Parsing and Analysis with Language Models
Parth Sarin, Juan Pablo Alperin

TL;DR
This paper tests whether open-weight language models can accurately parse citations in research papers into an indexable format, with the aim of supporting citation network tracking on a global scale and reducing the exclusion of Global South scholarship from indexing services.
Contribution
The paper shows that open-weight language models can identify the components of a citation more accurately than existing citation parsing methods, and it suggests that small, robust citation parsing tools could be built on such models.
Findings
Language models achieve high accuracy in citation component identification.
The smallest model, Qwen3-0.6B, can parse citations efficiently with minimal passes.
Open models outperform state-of-the-art citation parsing methods.
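To make "citation component identification" concrete, the sketch below shows one plausible way to score a parsed citation against a gold annotation, field by field. It is an illustration only: the `ParsedCitation` fields, the `normalize` rule, and the `field_accuracy` metric are assumptions made for this example, not the paper's annotation schema or evaluation protocol, and the sample citation is invented.

```python
from dataclasses import dataclass, fields


@dataclass
class ParsedCitation:
    # Hypothetical component set; the paper's actual schema may differ.
    authors: str = ""
    year: str = ""
    title: str = ""
    container: str = ""  # journal, book, or repository name


def normalize(value: str) -> str:
    """Lowercase, collapse whitespace, and trim surrounding punctuation."""
    return " ".join(value.lower().split()).strip(" .,;")


def field_accuracy(predicted: ParsedCitation, gold: ParsedCitation) -> float:
    """Fraction of components whose normalized text matches the gold annotation."""
    names = [f.name for f in fields(ParsedCitation)]
    matches = sum(
        normalize(getattr(predicted, name)) == normalize(getattr(gold, name))
        for name in names
    )
    return matches / len(names)


# Invented example: the model recovers three of four components.
gold = ParsedCitation(
    authors="Doe, J.",
    year="2020",
    title="An Example Title",
    container="Journal of Examples",
)
predicted = ParsedCitation(
    authors="doe, j",
    year="2020",
    title="An example title.",
    container="",
)
score = field_accuracy(predicted, gold)
```

Exact matching after light normalization is a deliberately strict choice; a real evaluation might instead use token-level overlap or per-field precision and recall.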
Abstract
A key type of resource needed to address global inequalities in knowledge production and dissemination is a tool that can support journals in understanding how knowledge circulates. The absence of such a tool has resulted in comparatively less information about networks of knowledge sharing in the Global South. In turn, this gap authorizes the exclusion of researchers and scholars from the South in indexing services, reinforcing colonial arrangements that de-center and minoritize those scholars. In order to support citation network tracking on a global scale, we investigate the capacity of open-weight language models to mark up manuscript citations in an indexable format. We assembled a dataset of matched plaintext and annotated citations from preprints and published research papers. Then, we evaluated a number of open-weight language models on the annotation task. We find that, even…
Taxonomy
Topics: Scientometrics and Bibliometrics Research · Academic Publishing and Open Access · Biomedical Text Mining and Ontologies
