BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation
Delip Rao, Chris Callison-Burch

TL;DR
This paper evaluates the accuracy of search-enabled language models in generating BibTeX entries for scientific papers, identifies common error modes, and proposes a two-stage revision tool to improve correctness.
Contribution
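It introduces a benchmark of 931 papers spanning four scientific domains and three citation tiers, together with a six-way error taxonomy for BibTeX hallucinations, and shows that a two-stage revision process raises both field-level accuracy and the share of fully correct entries.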
Findings
Overall field-level accuracy is 83.6%, but only 50.9% of entries are fully correct
Accuracy drops by 27.7 percentage points from popular to recent papers
Two-stage revision increases accuracy to 91.5% and fully correct entries to 78.3%
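The gap between the two headline numbers comes from how they are counted: field-level accuracy gives credit for every correct field, while an entry counts as fully correct only if all of its scored fields match. The sketch below illustrates that distinction on toy data. It is not the paper's scoring code; the field names, the `normalize` helper, and exact matching after whitespace and case normalization are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Entry = Dict[str, str]


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplistic match rule)."""
    return " ".join(value.lower().split())


def score(predicted: List[Entry], truth: List[Entry],
          fields: List[str]) -> Tuple[float, float]:
    """Return (field-level accuracy, fraction of fully correct entries)."""
    field_hits = 0
    fully_correct = 0
    for pred, gold in zip(predicted, truth):
        # Count the fields of this entry that match the reference.
        hits = sum(
            normalize(pred.get(f, "")) == normalize(gold.get(f, ""))
            for f in fields
        )
        field_hits += hits
        # An entry is fully correct only when every scored field matches.
        fully_correct += int(hits == len(fields))
    return field_hits / (len(truth) * len(fields)), fully_correct / len(truth)


# Hypothetical toy data: three scored fields, two entries, one wrong year.
FIELDS = ["title", "author", "year"]
gold = [
    {"title": "Paper A", "author": "Doe, J.", "year": "2021"},
    {"title": "Paper B", "author": "Roe, R.", "year": "2024"},
]
pred = [
    {"title": "Paper A", "author": "Doe, J.", "year": "2021"},
    {"title": "Paper B", "author": "Roe, R.", "year": "2023"},
]

field_acc, entry_acc = score(pred, gold, FIELDS)
print(f"field-level: {field_acc:.3f}, fully correct: {entry_acc:.3f}")
```

In this toy case a single wrong year leaves field-level accuracy at 5/6 while halving the share of fully correct entries, which is the same pattern the findings report at scale.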
Abstract
Large language models with web search are increasingly used in scientific publishing agents, yet they still produce BibTeX entries with pervasive field-level errors. Prior evaluations tested base models without search, which does not reflect current practice. We construct a benchmark of 931 papers across four scientific domains and three citation tiers -- popular, low-citation, and recent post-cutoff -- designed to disentangle parametric memory from search dependence, with version-aware ground truth accounting for multiple citable versions of the same paper. Three search-enabled frontier models (GPT-5, Claude Sonnet-4.6, Gemini-3 Flash) generate BibTeX entries scored on nine fields and a six-way error taxonomy, producing ~23,000 field-level observations. Overall accuracy is 83.6%, but only 50.9% of entries are fully correct; accuracy drops 27.7pp from popular to recent papers, revealing…
