Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses
Pranav Narayanan Venkit, Philippe Laban, Yilun Zhou, Yixin Mao, Chien-Sheng Wu

TL;DR
This paper critically examines the limitations of LLM-based answer engines, revealing common issues like hallucinations and citation inaccuracies, and introduces an evaluation benchmark to improve their transparency and reliability.
Contribution
It provides a 21-participant user study, 16 design recommendations, and an automated evaluation benchmark built on 8 metrics for assessing LLM-based answer engines.
Findings
Frequent hallucinations in answer generation
Inaccurate source citations by answer engines
Variation in answer confidence levels across systems
Abstract
Large Language Model (LLM)-based applications are graduating from research prototypes to products serving millions of users, influencing how people write and consume information. A prominent example is the appearance of Answer Engines: LLM-based generative search engines supplanting traditional search engines. Answer engines not only retrieve relevant sources to a user query but synthesize answer summaries that cite the sources. To understand these systems' limitations, we first conducted a study with 21 participants, evaluating interactions with answer vs. traditional search engines and identifying 16 answer engine limitations. From these insights, we propose 16 answer engine design recommendations, linked to 8 metrics. An automated evaluation implementing our metrics on three popular engines (You.com, Perplexity.ai, BingChat) quantifies common limitations (e.g., frequent…
Taxonomy
Topics: Misinformation and Its Impacts · Ethics and Social Impacts of AI
