BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation

Delip Rao; Chris Callison-Burch

arXiv:2604.03159 · cs.DL · April 6, 2026

TL;DR

This paper evaluates the accuracy of search-enabled language models in generating BibTeX entries for scientific papers, identifies common error modes, and proposes a two-stage revision tool to improve correctness.

Contribution

The paper introduces a 931-paper benchmark and a six-way error taxonomy for BibTeX hallucinations, and shows that a two-stage revision process raises overall accuracy from 83.6% to 91.5%.

Findings

01. Field-level accuracy is 83.6% overall, but only 50.9% of entries are fully correct.

02. Accuracy drops by 27.7 percentage points from popular to recent papers.

03. Two-stage revision raises accuracy to 91.5% and the share of fully correct entries to 78.3%.

Abstract

Large language models with web search are increasingly used in scientific publishing agents, yet they still produce BibTeX entries with pervasive field-level errors. Prior evaluations tested base models without search, which does not reflect current practice. We construct a benchmark of 931 papers across four scientific domains and three citation tiers -- popular, low-citation, and recent post-cutoff -- designed to disentangle parametric memory from search dependence, with version-aware ground truth accounting for multiple citable versions of the same paper. Three search-enabled frontier models (GPT-5, Claude Sonnet-4.6, Gemini-3 Flash) generate BibTeX entries scored on nine fields and a six-way error taxonomy, producing ~23,000 field-level observations. Overall accuracy is 83.6%, but only 50.9% of entries are fully correct; accuracy drops 27.7pp from popular to recent papers, revealing…
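The gap between 83.6% accuracy and 50.9% fully correct entries follows from scoring at two levels: one wrong field is enough to disqualify an otherwise correct entry. The sketch below illustrates that distinction on toy data. It is not the paper's scoring code. The field names in `FIELDS`, the `normalize` helper, and exact-match comparison are all assumptions made for illustration, since the abstract says nine fields are scored but does not list them or describe the matching rules.

```python
# Hypothetical field list: nine fields are scored in the paper, but these
# particular names are an assumption, not taken from the source.
FIELDS = ("author", "title", "year", "venue", "volume",
          "number", "pages", "doi", "entry_type")


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace (a stand-in for real normalization)."""
    return " ".join(value.lower().split())


def score_entry(predicted: dict, truth: dict) -> dict:
    """Return {field: is_correct} for each field present in the ground truth."""
    return {
        field: normalize(predicted.get(field, "")) == normalize(truth[field])
        for field in FIELDS
        if field in truth
    }


def summarize(scored: list) -> tuple:
    """Return (field-level accuracy, share of fully correct entries)."""
    total_fields = sum(len(s) for s in scored)
    correct_fields = sum(sum(s.values()) for s in scored)
    fully_correct = sum(all(s.values()) for s in scored)
    return correct_fields / total_fields, fully_correct / len(scored)


# Toy data: two generated entries for the same ground-truth record.
truth = {"author": "Ada Lovelace",
         "title": "Notes on the Analytical Engine",
         "year": "1843"}
good = {"author": "Ada Lovelace",
        "title": "Notes on the  analytical engine",  # differs only in case/spacing
        "year": "1843"}
bad = dict(good, year="1842")  # a single wrong field

scored = [score_entry(good, truth), score_entry(bad, truth)]
field_acc, full_rate = summarize(scored)
print(f"field-level accuracy: {field_acc:.3f}, fully correct: {full_rate:.3f}")
```

In this toy case five of six fields match, yet only one of the two entries is fully correct, which is the same pattern the headline numbers show at scale.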
