Assessing the performance of 8 AI chatbots in bibliographic reference retrieval: Grok and DeepSeek outperform ChatGPT, but none are fully accurate
Álvaro Cabezas-Clavijo, Pavel Sidorenko-Bautista

TL;DR
This study evaluates the accuracy of eight AI chatbots in generating academic references, finding that only about a quarter (26.5%) are fully correct and that Grok and DeepSeek are the only chatbots that generated no false references.
Contribution
It provides a comparative analysis of AI chatbots' performance in bibliographic reference retrieval, highlighting their limitations and risks in academic contexts.
Findings
Grok and DeepSeek were the only chatbots that did not generate false references.
Only 26.5% of references were fully correct.
The models show high overlap in the sources they cite, especially DeepSeek, Grok, Gemini, and ChatGPT.
Abstract
This study analyzes the performance of eight generative artificial intelligence chatbots -- ChatGPT, Claude, Copilot, DeepSeek, Gemini, Grok, Le Chat, and Perplexity -- in their free versions, in the task of generating academic bibliographic references within the university context. A total of 400 references were evaluated across the five major areas of knowledge (Health, Engineering, Experimental Sciences, Social Sciences, and Humanities), based on a standardized prompt. Each reference was assessed according to five key components (authorship, year, title, source, and location), along with document type, publication age, and error count. The results show that only 26.5% of the references were fully correct, 33.8% partially correct, and 39.8% were either erroneous or entirely fabricated. Grok and DeepSeek stood out as the only chatbots that did not generate false references, while…
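As a quick sanity check on the figures in the abstract, the reported percentages can be converted back into reference counts. The sketch below assumes the percentages are shares of all 400 references; the even split of references across chatbots and areas is also an assumption, since the abstract does not state how the 400 were distributed. The names `implied_count` and `per_cell` are illustrative only.

```python
# Figures taken from the abstract.
TOTAL_REFERENCES = 400
CHATBOTS = 8
AREAS = 5

# Reported shares in percent, as stated in the abstract.
reported = {
    "fully correct": 26.5,
    "partially correct": 33.8,
    "erroneous or fabricated": 39.8,
}


def implied_count(percent, total=TOTAL_REFERENCES):
    """Return the nearest whole number of references for a reported share."""
    return round(percent * total / 100)


counts = {label: implied_count(p) for label, p in reported.items()}

# Assumption: references were split evenly across chatbots and areas.
per_cell = TOTAL_REFERENCES // (CHATBOTS * AREAS)

for label, n in counts.items():
    print(f"{label}: {n}")
print("total:", sum(counts.values()))
print("references per chatbot per area (if evenly split):", per_cell)
```

Under these assumptions the three categories come to 106, 135, and 159 references, which sum to 400. The reported percentages add up to 100.1% only because each share is rounded to one decimal place.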
