The Attribution Crisis in LLM Search Results
Ilan Strauss, Jangho Yang, Tim O'Reilly, Sruly Rosenblat, Isobel Moure

TL;DR
This paper investigates the attribution gap in web-enabled LLMs, revealing patterns of low citation rates despite extensive web content consumption, and advocates for transparent search architectures.
Contribution
It provides an empirical analysis of attribution practices in web-enabled LLMs, identifies three exploitation patterns, and proposes standards for transparency in search and citation logging.
Findings
34% of Google Gemini responses are generated without explicitly fetching any online content
92% of Gemini answers have no clickable citation source
Models vary in citation efficiency from 0.19 to 0.45
Abstract
Web-enabled LLMs frequently answer queries without crediting the web pages they consume, creating an "attribution gap" - the difference between relevant URLs read and those actually cited. Drawing on approximately 14,000 real-world LMArena conversation logs with search-enabled LLM systems, we document three exploitation patterns: 1) No Search: 34% of Google Gemini and 24% of OpenAI GPT-4o responses are generated without explicitly fetching any online content; 2) No citation: Gemini provides no clickable citation source in 92% of answers; 3) High-volume, low-credit: Perplexity's Sonar visits approximately 10 relevant pages per query but cites only three to four. A negative binomial hurdle model shows that the average query answered by Gemini or Sonar leaves about 3 relevant websites uncited, whereas GPT-4o's tiny uncited gap is best explained by its selective log disclosures rather than…
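The abstract defines the attribution gap as the difference between the relevant URLs a model reads and those it actually cites. A minimal sketch of that count for a single query follows. It assumes the gap is the number of relevant URLs read but not cited; the function name `attribution_gap` and the example.com URLs are hypothetical, and the paper's own measurement pipeline may differ.

```python
def attribution_gap(relevant_urls_read, cited_urls):
    """Count relevant URLs that were read but never cited.

    Assumption: URLs are compared as exact strings, and cited URLs
    that were never read do not reduce the gap.
    """
    read = set(relevant_urls_read)
    cited = set(cited_urls)
    return len(read - cited)


# Hypothetical query in the "high-volume, low-credit" pattern:
# ten relevant pages visited, four of them cited.
visited = [f"https://example.com/page{i}" for i in range(10)]
cited = visited[:4]

gap = attribution_gap(visited, cited)
print(gap)
```

With ten pages read and four cited, this toy query leaves six pages uncited. A response generated without any search has an empty read set and therefore a gap of zero under this definition.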
