How Do LLM-Generated Texts Impact Term-Based Retrieval Models?
Wei Huang, Keping Bi, Yinqiong Cai, Wei Chen, Jiafeng Guo, Xueqi Cheng

TL;DR
This study examines how texts generated by large language models (LLMs) influence traditional term-based retrieval models. It finds that these models favor documents whose term distributions resemble those of the queries, and that they show no inherent source bias.
Contribution
The paper provides a detailed linguistic analysis of LLM-generated texts and investigates whether term-based retrieval models exhibit source bias, offering insights into how they behave on collections that mix human-written and LLM-generated content.
Findings
LLM-generated texts have higher term specificity and greater document-level diversity.
Term-based models favor documents with term distributions similar to queries.
No inherent source bias was found in term-based retrieval models.
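To make the second and third findings concrete, here is a minimal sketch of Okapi BM25, the term-based retriever named in the abstract. It is an illustration only, not the paper's experimental setup: the function name `bm25_scores`, the toy query and documents, and the parameter values `k1=1.2` and `b=0.75` (common defaults) are all assumptions introduced here. The scoring function takes only term statistics as input, so a document's score depends on how its terms overlap with the query, and nothing in the formula refers to who or what wrote the document.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25.

    k1 and b are common default values, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy data: the first document shares terms with the query,
# the second does not.
query = "zipf law term frequency".split()
docs = [
    "zipf law describes term frequency in text".split(),
    "the distribution of words follows a power curve".split(),
]
scores = bm25_scores(query, docs)
```

In this toy example the first document outscores the second purely because of term overlap with the query; the second shares no query terms and therefore scores zero.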
Abstract
As more content generated by large language models (LLMs) floods into the Internet, information retrieval (IR) systems now face the challenge of distinguishing and handling a blend of human-authored and machine-generated texts. Recent studies suggest that neural retrievers may exhibit a preferential inclination toward LLM-generated content, while classic term-based retrievers like BM25 tend to favor human-written documents. This paper investigates the influence of LLM-generated content on term-based retrieval models, which are valued for their efficiency and robust generalization across domains. Our linguistic analysis reveals that LLM-generated texts exhibit smoother high-frequency and steeper low-frequency Zipf slopes, higher term specificity, and greater document-level diversity. These traits are aligned with LLMs being trained to optimize reader experience through diverse and…
