Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models

Yukun Huang; Sanxing Chen; Jian Pei; Manzil Zaheer; Bhuwan Dhingra

arXiv:2506.17585·cs.AI·April 7, 2026

TL;DR

This paper introduces a training approach enabling large language models to reliably attribute information to specific documents without test-time retrieval, enhancing citation accuracy and robustness.

Contribution

It proposes Active Indexing, a training method that improves an LLM's ability to cite the documents it saw during continual pretraining, removing the need for external retrieval at inference time.

Findings

01. Active Indexing outperforms Passive Indexing with up to 30.2% citation precision gains.

02. Scaling augmented data improves citation performance.

03. Internal citations increase robustness to retrieval noise.

Abstract

Trustworthy language models should provide both correct and verifiable answers. However, citations generated directly by standalone LLMs are often unreliable. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. We explore whether LLMs can be made to reliably attribute to the documents seen during continual pretraining without test-time retrieval, by revising the training process. To study this, we construct CitePretrainBench, a benchmark that mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel documents and probes both short-form (single-fact) and long-form (multi-fact) citation tasks. Our approach follows a two-stage process: (1) continual pretraining to index factual knowledge by binding it to persistent document identifiers; and (2)…
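
Stage (1) of the process described above can be pictured with a minimal sketch. It assumes the simplest possible form of binding, in which an identifier string is appended to each document's text before continual pretraining. The `Document` type, the `[source: ID]` marker format, and the function names are hypothetical illustrations and are not taken from the paper; the sketch does not attempt to reproduce Active Indexing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    doc_id: str  # persistent identifier (hypothetical format)
    text: str    # raw document text


def bind_identifier(doc: Document) -> str:
    """Append the document identifier to the text (assumed marker format)."""
    return f"{doc.text.strip()} [source: {doc.doc_id}]"


def build_corpus(docs: List[Document]) -> List[str]:
    """Turn documents into training strings that carry their identifiers."""
    return [bind_identifier(d) for d in docs]


# Illustrative documents only; not drawn from CitePretrainBench.
docs = [
    Document("doc-001", "The Eiffel Tower was completed in 1889."),
    Document("doc-002", "Water boils at 100 degrees Celsius at sea level.  "),
]
corpus = build_corpus(docs)
for line in corpus:
    print(line)
```

Each training string ends with its identifier, so a model trained on such text sees a fact and its source side by side. Whether that exposure alone is enough for reliable citation is the question the paper studies.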
