Analyzing the Presentation, Content, and Utilization of References in LLM-powered Conversational AI Systems
Jianheng Ouyang, Arpit Narechania

TL;DR
This paper systematically analyzes how nine conversational AI systems present references and how good those references are, revealing significant variation and highlighting the need for improved interface design to enhance user trust.
Contribution
It provides the first comprehensive analysis of reference presentation and quality in LLM-powered conversational AI, with insights into user interaction and trust.
Findings
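ChatGPT provides more references (9.5 per response on average) of higher quality (15.48/20 CRAAP score).
Hunyuan-TurboS provides fewer references (4.0 per response) of lower quality (11.65/20 CRAAP score).
In a preliminary user study, people rarely interact with references, and their behavior varies across systems.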
Abstract
As conversational AI systems become popular for information retrieval and question-answering, the references they cite are key to ensuring their answers are reliable and trustworthy. Yet, no prior work systematically analyzes how these references are presented or their quality. We examine 1,517 references from 30 question-answer pairs across nine systems, focusing on their (1) presentation in the user interface and (2) quality using the CRAAP criteria. We find notable variations in the presentation, quality, and quantity of references across systems. For instance, ChatGPT provides more references (9.5 per response on average) with higher quality (15.48/20 CRAAP score), while Hunyuan-TurboS provides fewer references (4.0) and lower quality (11.65/20). Additionally, a preliminary user study shows that people rarely interact with these references and that their behavior differs across…
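The two headline numbers in the abstract (references per response and a CRAAP score out of 20) can be illustrated with a minimal sketch. This is not the authors' pipeline. It assumes the five standard CRAAP criteria (Currency, Relevance, Authority, Accuracy, Purpose) are each rated from 0 to 4 and summed to reach the 20-point maximum; the abstract does not state the per-criterion scale, and the `Reference` record, field names, and `summarize` helper are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean

# The five CRAAP criteria. The 0-4 scale per criterion is an ASSUMPTION
# chosen so that the total matches the 20-point maximum in the abstract.
CRITERIA = ("currency", "relevance", "authority", "accuracy", "purpose")
MAX_PER_CRITERION = 4


@dataclass(frozen=True)
class Reference:
    """Hypothetical record for one cited reference."""
    system: str        # name of the conversational AI system
    response_id: int   # which response cited this reference
    scores: dict       # criterion name -> integer rating


def craap_total(ref: Reference) -> int:
    """Sum the five criterion ratings after checking their range."""
    for name in CRITERIA:
        value = ref.scores[name]
        if not 0 <= value <= MAX_PER_CRITERION:
            raise ValueError(f"{name} rating {value} is out of range")
    return sum(ref.scores[name] for name in CRITERIA)


def summarize(refs: list) -> dict:
    """Per system: average references per response and mean CRAAP total."""
    by_system = {}
    for ref in refs:
        by_system.setdefault(ref.system, []).append(ref)

    summary = {}
    for system, items in by_system.items():
        # Only responses that cite at least one reference are counted here.
        n_responses = len({r.response_id for r in items})
        summary[system] = {
            "refs_per_response": len(items) / n_responses,
            "mean_craap": mean(craap_total(r) for r in items),
        }
    return summary
```

Note that this sketch counts only responses with at least one reference; a faithful replication would also need the number of responses that cited nothing, which the abstract does not report.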
