Benchmarking Large Language Models on Reference Extraction and Parsing in the Social Sciences and Humanities

Yurui Zhu; Giovanni Colavizza; Matteo Romanello

arXiv:2603.13651·cs.CL·April 3, 2026

TL;DR

This paper introduces a unified benchmark for reference extraction and parsing in the Social Sciences and Humanities (SSH), evaluating LLMs and traditional methods such as GROBID on realistic document conditions, including multilingual, footnote-embedded, and abbreviated citations.

Contribution

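The paper presents a unified benchmark built on three datasets (CEX, EXCITE, and LinkedBooks) that reflect citation challenges specific to the Social Sciences and Humanities (SSH). It compares LLMs with GROBID, highlighting the strengths and limitations of each as well as hybrid deployment strategies.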

Findings

01 Extraction performance saturates once models pass a moderate capability threshold.

02 Parsing and end-to-end parsing are the primary bottlenecks, owing to structured-output brittleness.

03 Lightweight LoRA adaptation improves performance, especially on SSH-heavy benchmarks.
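
To make "structured-output brittleness" concrete, here is a minimal sketch of how a parsing pipeline might validate model output. It is not the paper's pipeline: it assumes the model is asked to return parsed references as a JSON list, and the names `parse_structured_output` and `REQUIRED_FIELDS`, the three-field schema, and the sample reference are all hypothetical. The point is only that one malformed character or one missing field can invalidate an otherwise correct parse.

```python
import json

# Hypothetical minimal schema, chosen for illustration only.
REQUIRED_FIELDS = ("authors", "title", "year")

# Built programmatically so this example does not contain a literal code fence.
FENCE = "`" * 3


def parse_structured_output(raw):
    """Return a list of reference dicts, or None if the output is unusable."""
    text = raw.strip()

    # Models often wrap JSON in a Markdown code fence; unwrap it if present.
    if len(text) >= 6 and text.startswith(FENCE) and text.endswith(FENCE):
        text = text[3:-3]
        first_newline = text.find("\n")
        # Drop an optional language tag such as "json" on the first line.
        if first_newline != -1 and text[:first_newline].strip().isalpha():
            text = text[first_newline + 1:]

    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None  # truncated or otherwise malformed JSON

    if isinstance(data, dict):
        data = [data]  # tolerate a single object instead of a list
    if not isinstance(data, list):
        return None

    # Keep only records that carry every required field.
    refs = [
        item for item in data
        if isinstance(item, dict) and all(key in item for key in REQUIRED_FIELDS)
    ]
    return refs or None


# Placeholder reference, not taken from any of the benchmark datasets.
clean = '[{"authors": ["Example, A."], "title": "A Sample Title", "year": "1999"}]'
fenced = FENCE + "json\n" + clean + "\n" + FENCE
truncated = clean[:-5]            # simulates output cut off mid-record
incomplete = '[{"title": "A Sample Title"}]'

for label, output in [("clean", clean), ("fenced", fenced),
                      ("truncated", truncated), ("incomplete", incomplete)]:
    print(label, "->", parse_structured_output(output))
```

Only the clean and fenced outputs yield a usable record here; the truncated and incomplete ones are rejected outright. A real system would probably attempt repair or re-prompting before discarding the output.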

Abstract

Bibliographic reference extraction and parsing are foundational for citation indexing, linking, and downstream scholarly knowledge-graph construction. However, most established evaluations focus on clean, English, end-of-document bibliographies, and therefore underrepresent the Social Sciences and Humanities (SSH), where citations are frequently multilingual, embedded in footnotes, abbreviated, and shaped by heterogeneous historical conventions. We present a unified benchmark that targets these SSH-realistic conditions across three complementary datasets: CEX (English journal articles spanning multiple disciplines), EXCITE (German/English documents with end-section, footnote-only, and mixed regimes), and LinkedBooks (humanities references with strong stylistic variation and multilinguality). We evaluate three tasks of increasing difficulty -- reference extraction, reference parsing, and…
