Training-Induced Bias Toward LLM-Generated Content in Dense Retrieval

William Xion; Wolfgang Nejdl

arXiv:2602.10833·cs.IR·February 17, 2026

Open Access

TL;DR

This paper examines how training shapes the preference of dense retrieval models for LLM-generated content. It finds that this bias is induced by training rather than inherent to the models, which has implications for how retrievers are trained.

Contribution

The paper provides a controlled evaluation showing that source bias in dense retrieval models depends on training data and procedures, especially fine-tuning on LLM-generated text.

Findings

01  Supervised fine-tuning on MS MARCO shifts preference toward LLM-generated content.

02  Unsupervised retrievers do not show a uniform bias; it varies by dataset.

03  Fine-tuning on LLM-generated corpora induces a strong pro-LLM bias.

Abstract

Dense retrieval is a promising approach for acquiring relevant context or world knowledge in open-domain natural language processing tasks and is now widely used in information retrieval applications. However, recent reports claim that these retrievers broadly prefer text generated by large language models (LLMs). This bias is called "source bias", and it has been hypothesized that lower perplexity contributes to this effect. In this study, we revisit this claim by conducting a controlled evaluation to trace the emergence of such preferences across training stages and data sources. Using parallel human- and LLM-generated counterparts of the SciFact and Natural Questions (NQ320K) datasets, we compare unsupervised checkpoints with models fine-tuned using in-domain human text, in-domain LLM-generated text, and MS MARCO. Our results show the following: 1) Unsupervised retrievers do not exhibit a uniform…
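To make the idea of comparing parallel human- and LLM-generated counterparts concrete, the following is a minimal sketch of one way such a preference could be measured. It is an illustration under stated assumptions and not the paper's evaluation protocol or metric: the function name llm_preference_rate, the use of cosine similarity, and the toy two-dimensional embeddings are all hypothetical. In practice the vectors would come from the retriever being studied.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def llm_preference_rate(query_vecs, human_vecs, llm_vecs):
    """Fraction of queries whose LLM-written counterpart scores strictly higher.

    Hypothetical helper (not from the paper). Index i of each list is assumed
    to hold the query embedding, the embedding of the human-written relevant
    document, and the embedding of its LLM-rewritten counterpart.
    A value near 0.5 suggests no systematic preference on this sample.
    """
    if not (len(query_vecs) == len(human_vecs) == len(llm_vecs)):
        raise ValueError("inputs must be aligned and of equal length")
    if not query_vecs:
        raise ValueError("need at least one query")
    wins = 0
    for q, h, l in zip(query_vecs, human_vecs, llm_vecs):
        if cosine(q, l) > cosine(q, h):
            wins += 1
    return wins / len(query_vecs)


# Toy embeddings, invented purely for illustration.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
human_docs = [[1.0, 0.5], [0.0, 1.0], [1.0, 0.0]]
llm_docs = [[1.0, 0.1], [0.2, 1.0], [1.0, 1.0]]

rate = llm_preference_rate(queries, human_docs, llm_docs)
print(f"LLM version preferred for {rate:.0%} of queries")
```

Running the same measurement on an unsupervised checkpoint and on each fine-tuned variant of one model would show whether the preference shifts with training, which is the kind of comparison the abstract describes.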

Taxonomy

Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Information Retrieval and Search Behavior