Benchmarking Large Language Models on Reference Extraction and Parsing in the Social Sciences and Humanities
Yurui Zhu, Giovanni Colavizza, Matteo Romanello

TL;DR
This paper introduces a comprehensive benchmark for reference extraction and parsing in social sciences and humanities, evaluating LLMs and traditional methods across diverse, realistic document conditions.
Contribution
It presents a unified benchmark with datasets reflecting SSH-specific citation challenges and compares LLMs with GROBID, highlighting strengths, limitations, and hybrid deployment strategies.
Findings
Extraction performance saturates once models pass a moderate capability threshold.
Parsing and end-to-end parsing are the primary bottlenecks because structured output is brittle.
Lightweight LoRA adaptation improves performance, especially on SSH-heavy benchmarks.
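The structured-output brittleness named in the findings can be illustrated with a minimal sketch. It assumes a model is asked to return one parsed reference as a JSON object; the field names and the `parse_reference_output` helper are hypothetical and are not taken from the paper or its datasets. Under a strict check, a reply that is correct in content but wrapped in extra prose, or that omits a field, counts as a failure.

```python
import json

# Hypothetical field names for illustration only; the benchmark's actual
# schema is not stated in this summary.
REQUIRED_FIELDS = ("authors", "title", "year")


def parse_reference_output(raw: str):
    """Return the parsed reference as a dict, or None if the output is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Any text outside the JSON object makes strict parsing fail.
        return None
    if not isinstance(data, dict):
        return None
    if any(field not in data for field in REQUIRED_FIELDS):
        return None
    return data


good = '{"authors": ["A. Author"], "title": "A Title", "year": "1999"}'
wrapped = "Here is the parsed reference:\n" + good  # right content, wrong format
missing = '{"authors": ["A. Author"], "title": "A Title"}'  # no "year" field

for label, raw in (("good", good), ("wrapped", wrapped), ("missing", missing)):
    print(label, parse_reference_output(raw) is not None)
```

Only the first input passes the check. A real pipeline might add a more lenient recovery step, but that is a design choice beyond what this summary describes.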
Abstract
Bibliographic reference extraction and parsing are foundational for citation indexing, linking, and downstream scholarly knowledge-graph construction. However, most established evaluations focus on clean, English, end-of-document bibliographies, and therefore underrepresent the Social Sciences and Humanities (SSH), where citations are frequently multilingual, embedded in footnotes, abbreviated, and shaped by heterogeneous historical conventions. We present a unified benchmark that targets these SSH-realistic conditions across three complementary datasets: CEX (English journal articles spanning multiple disciplines), EXCITE (German/English documents with end-section, footnote-only, and mixed regimes), and LinkedBooks (humanities references with strong stylistic variation and multilinguality). We evaluate three tasks of increasing difficulty -- reference extraction, reference parsing, and…
