Search Arena: Analyzing Search-Augmented LLMs
Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan, Xinyan Hu, Wei-Lin Chiang, Anastasios N. Angelopoulos, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez

TL;DR
Search Arena introduces a large-scale, diverse dataset of human preferences for search-augmented LLMs. Its analysis yields insights into perceived credibility, source preferences, and performance across search and non-search environments, and the dataset is intended to support future research.
Contribution
The paper presents Search Arena, a comprehensive dataset and analysis framework for understanding user preferences and system performance in search-augmented LLMs.
Findings
User preferences are influenced by the number of citations, even when the cited content does not directly support the attributed claims.
Community-driven sources are generally preferred over static encyclopedic sources.
Web search does not degrade, and may even improve, LLM performance in non-search settings; in search-intensive settings, however, quality suffers when a model relies solely on its parametric knowledge.
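The citation-count finding above is correlational, and one way to probe such a relationship in pairwise preference data is a simple logistic (Bradley-Terry-style) fit on the difference in citation counts between the two responses. The sketch below is only an illustration under stated assumptions: the votes are hypothetical toy data, the function name `fit_preference_slope` is invented here, and the model omits the intercept and control features a real analysis would need. It is not the authors' pipeline.

```python
import math

def fit_preference_slope(diffs, wins, lr=0.1, steps=2000):
    """Fit P(A preferred) = sigmoid(beta * diff) by gradient ascent
    on the average log-likelihood. Returns the fitted slope beta."""
    beta = 0.0
    n = len(diffs)
    for _ in range(steps):
        grad = 0.0
        for d, y in zip(diffs, wins):
            p = 1.0 / (1.0 + math.exp(-beta * d))
            grad += (y - p) * d  # derivative of the log-likelihood w.r.t. beta
        beta += lr * grad / n
    return beta

# Hypothetical toy votes (NOT from the dataset):
# (citations in response A minus citations in response B, 1 if A was preferred)
toy_votes = [
    (3, 1), (2, 1), (1, 1), (1, 0), (-1, 0),
    (-2, 0), (-3, 0), (-1, 1), (2, 0), (-2, 1),
]
diffs = [d for d, _ in toy_votes]
wins = [y for _, y in toy_votes]

beta = fit_preference_slope(diffs, wins)
print(f"fitted slope on citation-count difference: {beta:.3f}")
```

A positive slope indicates that, in this toy data, the response with more citations tends to win. It says nothing about whether those citations support the claims, which is the gap between perceived and actual credibility that the finding describes.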
Abstract
Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations, even when the cited content does not directly support the attributed claims, uncovering a gap between perceived and actual credibility. Furthermore, user preferences vary across cited sources,…
Peer Reviews
Decision: ICLR 2026 Poster
1. Unlike prior datasets such as SimpleQA and BrowseComp, which are static, English-only, single-turn fact-seeking queries, the proposed Search Arena evaluates models in diverse, open-ended, multilingual, and multi-turn settings. 2. The human-preference analyses are detailed, covering the number of citations, supporting claims, cited sources, etc. This is the crucial point in terms of the construction principle of Search Arena. 3. A set of detailed experimental results and analysis are provided to
1. The reliability of the collected data should be further assessed, although more sophisticated approaches are expensive. 2. The definitions of search-augmented LLM and retrieval-augmented generation should be further distinguished if the authors discuss them in different contexts, as in the related work. Besides, conversational search [1,2] is highly related to the proposed Search Arena, i.e., multi-turn human-AI interaction in a search setting. 3. Section 2 discuss the difference c
- The study is impressively comprehensive, drawing on a diverse dataset that spans 11,650 users across 136 countries, 13 models, and 70 languages. - The authors conducted a series of insightful and relevant analyses on human preferences in search-augmented LLMs, such as the “Types of Cited Sources” section. The finding that Wikipedia citations correlate negatively with user preference (while social media and community sources correlate positively) is both counterintuitive and well-explained (i.e.
The current premise of Search Arena relies on human preference signals, but the paper’s own findings cast doubt on whether these signals are reliable indicators of true search quality. In particular, the Citation Attribution analysis shows that users often fail to distinguish between supporting and irrelevant citations and tend to prefer responses with a higher number of citations (regardless of validity). Human preference may conflate perceived credibility with actual factual correctness,
- The paper is well-motivated and easy to follow. It addresses a clear gap, i.e. the lack of large-scale multi-turn and multilingual datasets for evaluating search-augmented LLMs. - The Search Arena dataset can be very useful to researchers working on a broad set of topics. Its scale as well as diversity in languages and query intents make it a valuable resource. - The analysis of user preferences is interesting and results in some non-obvious findings. The finding that user preference is positi
- The paper presents a lot of correlational evidence, e.g., about which response features and cited sources users find important, but it states these findings more definitively than the correlational analysis supports. I think this is a major weakness and I would suggest rewriting these claims with caution. - The paper relies heavily on LLM-based pipelines for analysis, particularly for user intent classification and citation attribution. For the citation attribution validation, the process i
Taxonomy
Topics: Topic Modeling · Expert finding and Q&A systems · Text Readability and Simplification
