UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking
Chang Liu, Chuqiao Kuang, Tianyi Zhuang, Yuxin Cheng, Huichi Zhou, Xiaoguang Li, Lifeng Shang

TL;DR
This paper introduces UIS-Digger, a multi-agent framework designed to address the critical challenge of Unindexed Information Seeking (UIS), which involves retrieving vital information not captured by search engines, and provides a new benchmark for evaluating such systems.
Contribution
The paper presents UIS-QA, the first dedicated benchmark for UIS, and proposes UIS-Digger, a novel multi-agent system that effectively searches unindexed sources, setting a new baseline in the field.
Findings
State-of-the-art agents perform poorly on UIS-QA, highlighting the problem's severity.
UIS-Digger outperforms existing systems like O3 and GPT-4.1 on the benchmark.
Proactive interaction with unindexed sources improves information-seeking performance.
Abstract
Recent advancements in LLM-based information-seeking agents have achieved record-breaking performance on established benchmarks. However, these agents remain heavily reliant on search-engine-indexed knowledge, leaving a critical blind spot: Unindexed Information Seeking (UIS). This paper identifies and explores the UIS problem, where vital information is not captured by search engine crawlers, such as overlooked content, dynamic webpages, and embedded files. Despite its significance, UIS remains an underexplored challenge. To address this gap, we introduce UIS-QA, the first dedicated UIS benchmark, comprising 110 expert-annotated QA pairs. Notably, even state-of-the-art agents experience a drastic performance drop on UIS-QA (e.g., from 70.90 on GAIA and 46.70 on BrowseComp-zh to 24.55 on UIS-QA), underscoring the severity of the problem. To mitigate this, we propose UIS-Digger, a novel…
Peer Reviews
Decision: ICLR 2026 Poster
1) The experimental results are impressive, and the paper explores both SFT and RFT fine-tuning approaches. 2) The paper provides a comprehensive analysis of the different actions and tool calls used.
1) The proposed UIS-QA benchmark is limited in size, with only 110 examples (84 questions in Chinese and 26 in English). The authors should consider expanding the dataset to ~300 instances. It would also be worthwhile to report expert human performance on this dataset to give a sense of the upper bound. 2) While the proposed UIS-Digger models excel on the UIS-QA dataset, their performance is low on the GAIA and BrowseComp-zh datasets, bringing into question the ge
1. **Valuable dataset contribution with clear curation criteria**: The manually created dataset of 110 validated QA pairs that explicitly avoid search engine shortcuts represents a concrete contribution. 2. **Valuable empirical evaluation**: Testing 13+ baseline systems across multiple categories provides valuable comparisons. 3. **Detailed failure analysis**: The breakdown of error modes and tool usage evolution across training stages offers useful insights.
1. **Insufficient differentiation from existing web agent research and benchmarks**: The paper claims that UIS represents an "underexplored challenge" and a "critical blind spot" in current agent systems, yet the capabilities required (e.g., interactive navigation, form filling, file downloading, multi-step exploration) are precisely what existing web agents can already do. The action space that UIS-Digger employs (search, crawl, click, scroll, type, download files) already exists in prior syste
1. This work identifies an interesting unindexed information-seeking problem, which is practical in reality and still under-explored. 2. This work introduces a dataset to study this problem and also constructs an agent to solve this challenge correspondingly.
1. The scope of UIS is not clearly defined. How is “unindexed” information precisely defined? Does it include API-gated or private web data? 2. The boundary with traditional search is also unclear. How are UIS tasks distinguished from regular information-seeking tasks where the answer is simply poorly ranked or paraphrased online? 3. This work claims to build a benchmark, but the benchmark is not sufficiently characterized. For example, what fraction of examples involve d
Taxonomy
Topics: Information Retrieval and Search Behavior · Web Data Mining and Analysis · Expert finding and Q&A systems
