InteractComp: Evaluating Search Agents With Ambiguous Queries

Mingyi Deng; Lijun Huang; Yani Fan; Jiayi Zhang; Fashen Ren; Jinyi Bai; Fuzhen Yang; Dayi Miao; Zhaoyang Yu; Yifan Wu; Yanfei Zhang; Fengwei Teng; Yingjia Wan; Song Hu; Yude Li; Xin Jin; Conghao Hu; Haoyu Li; Qirui Fu; Tai Zhong; Xinyu Wang; Xiangru Tang; Nan Tang; Chenglin Wu; Yuyu Luo

arXiv:2510.24668·cs.CL·October 29, 2025

TL;DR

InteractComp is a benchmark that evaluates whether search agents can recognize ambiguous queries and actively interact to clarify them, revealing significant gaps in current models' interactive capabilities despite improvements in search performance.

Contribution

This paper introduces InteractComp, a novel benchmark for assessing and training search agents' ability to handle ambiguous queries through interaction, addressing a critical gap in existing evaluation methods.

Findings

01. Current models show overconfidence and poor disambiguation ability.

02. Interaction significantly improves search accuracy.

03. Interaction capabilities have stagnated over 15 months despite search improvements.

Abstract

Language agents have demonstrated remarkable potential in web search and information retrieval. However, these search agents assume user queries are complete and unambiguous, an assumption that diverges from reality where users begin with incomplete queries requiring clarification through interaction. Yet most agents lack interactive mechanisms during the search process, and existing benchmarks cannot assess this capability. To address this gap, we introduce InteractComp, a benchmark designed to evaluate whether search agents can recognize query ambiguity and actively interact to resolve it during search. Following the principle of easy to verify, interact to disambiguate, we construct 210 expert-curated questions across 9 domains through a target-distractor methodology that creates genuine ambiguity resolvable only through interaction. Evaluation of 17 models reveals striking failure:…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- The proposed benchmark targets an important scenario: in real life, user queries are often ambiguous or underspecified. To address such queries effectively, models need to know when and how to interact with users to resolve the ambiguity.
- The answers are short and easily verifiable, which reduces subjectivity in evaluation.
- Experiments reveal a clear and useful empirical insight: overconfidence and a lack of uncertainty awareness are core weaknesses.

Weaknesses

- Based on the example in Table 1, the query appears to be something a model could likely resolve using its internal world knowledge rather than through web search. During data construction, there was no verification that these queries require external search to be answered correctly. If that’s the case, the rationale for putting the task in an agentic, search-based setup becomes unclear, especially since prior work has already shown that language models struggle to ask clarification questions.

Reviewer 02 · Rating 2 · Confidence 3

Strengths

- The paper identifies a gap between current search benchmarks and real-world use cases of search systems.
- To address the above-mentioned gap, the authors construct a new expert-written benchmark that specifically tests the interaction capabilities of the models.
- The developed benchmark is not saturated (~14%) and can be used for evaluating both search and interaction capabilities of the models.
- The authors evaluate both proprietary and open-weight models and identify varying behaviors acr

Weaknesses

### Major

**Surface-level analyses**

The analysis of the models is limited to raw numbers and comparisons across models and ablations. A deeper analysis is required to support the claims and explain the observed behaviors. For example: why does increasing the number of interaction turns have different effects on different models? Additionally, qualitative examples would provide intuition and help researchers better understand the failure modes on the proposed

Reviewer 03 · Rating 6 · Confidence 4

Strengths

Addresses a clear gap in current search benchmarks: interactivity. The paper makes a compelling argument that real-life search involves an iterative refinement process, and it takes a step in that direction.

Weaknesses

Ecological validity of benchmark construction. Annotators are deliberately asked to construct queries starting from a target answer. It remains an open question whether scoring well on the benchmark would represent a meaningful and demonstrable improvement in real-life search tasks. The synthetic interaction channel, which forces a yes/no response from a responder who has the context, is a strange and not fully justified design decision. A more realistic scenario could have the responder resp

Reviewer 04 · Rating 4 · Confidence 4

Strengths

1. This paper demonstrates a good motivation, highlighting that human search behavior is typically iterative, beginning with ambiguous queries and progressively refining them through interaction. This perspective makes the evaluation setting more closely aligned with real-world scenarios.
2. The data construction is based on an insightful idea: ambiguity arises when similar entities share overlapping attributes. The benchmark also considers domain generalization and includes data quality verifi

Weaknesses

1. This paper can essentially be understood as targeting a multi-turn collaborative/conversational search scenario. Although InteractComp is verifiable, its query format (as exemplified in Table 1) does not show a clear distinction from complex search benchmarks like BrowseComp. The core design of BrowseComp involves multiple constraints, many of which are themselves ambiguous. Evidently, both benchmarks inherently possess interactive characteristics.
2. Although I find the heuristic approach f

Code & Models

Datasets

Rubbisheep/InteractComp
dataset· 23 dl