SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval
Ningyuan Li, Haiyang Shen, Mugeng Liu, Yudong Han, Zhuofan Shi, Sixiong Xie, and Yun Ma

TL;DR
SGR-Bench is a benchmark that evaluates search agents on state-gated retrieval: tasks in which the agent must first establish the correct site-specific retrieval state before it can produce an answer.
Contribution
The paper introduces SGR-Bench, a benchmark of 100 expert-curated tasks for evaluating agents on specialized retrieval that requires establishing site-specific state, and analyzes system performance and failure modes.
Findings
The strongest system achieves only 66.18% item-level F1 on SGR-Bench.
Agents often reach relevant sources but establish incorrect retrieval states.
Retrieval-scope drift and criterion mismatch are major failure causes.
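The headline number above is an item-level F1. This summary does not give the paper's exact scoring rule, so the following is only a minimal sketch under the assumption that a structured answer is a set of items and that items match by exact equality. The function name `item_level_f1` and the convention for two empty answers are illustrative choices, not taken from the paper.

```python
def item_level_f1(predicted, gold):
    """Set-overlap F1 between predicted and gold answer items.

    Assumption: items are hashable and compared by exact equality.
    The benchmark itself may normalize or match items differently.
    """
    pred_set = set(predicted)
    gold_set = set(gold)

    # Assumed convention: two empty answers count as a perfect match.
    if not pred_set and not gold_set:
        return 1.0

    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0

    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Two of three predicted items are correct, and two of three gold items are found.
score = item_level_f1(["a", "b", "c"], ["b", "c", "d"])
```

Under this definition an agent that reaches the right source but returns items from the wrong scope is penalized on both precision and recall, which is consistent with the failure modes listed above.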
Abstract
Recent advances in large language models and tool-using agents have expanded the range of benchmarked web tasks. Yet an important class of specialized retrieval tasks remains undercharacterized. On many specialized data-retrieval websites, answer-bearing evidence becomes accessible only after establishing the correct site-specific retrieval state through filters, views, hierarchies, or scopes. We term this capability state-gated retrieval (SGR). We introduce SGR-Bench, a benchmark for this setting containing 100 expert-curated tasks spanning six source families and 12 public data ecosystems. Each task requires discovering the appropriate website and configuring its site-specific retrieval state to produce a structured answer. SGR-Bench pairs constraint-guided and goal-oriented formulations of the same underlying problems, enabling controlled comparisons between explicit and implicit…
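To make the idea of a retrieval state concrete, here is a toy sketch of a site whose records become visible only under a matching view, scope, and filter set. Everything in it is hypothetical: `RetrievalState`, `TOY_SITE`, and `visible_items` are illustrative names, and the data is invented for the example rather than drawn from SGR-Bench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalState:
    """Hypothetical site state: which view, scope, and filters are active."""
    view: str = "summary"
    scope: str = "global"
    filters: tuple = ()  # (field, value) pairs


# Invented toy records; each is only reachable under one view and scope.
TOY_SITE = [
    {"id": "A1", "view": "detail", "scope": "EU", "year": 2021},
    {"id": "A2", "view": "detail", "scope": "EU", "year": 2022},
    {"id": "B1", "view": "detail", "scope": "US", "year": 2021},
    {"id": "S1", "view": "summary", "scope": "global", "year": 2021},
]


def visible_items(state, records=TOY_SITE):
    """Return the ids of records exposed under the given retrieval state."""
    hits = []
    for rec in records:
        # Records outside the active view or scope are not shown at all.
        if rec["view"] != state.view or rec["scope"] != state.scope:
            continue
        # Every active filter must match the record.
        if all(rec.get(key) == value for key, value in state.filters):
            hits.append(rec["id"])
    return hits


correct = RetrievalState(view="detail", scope="EU", filters=(("year", 2021),))
drifted = RetrievalState(view="detail", scope="US", filters=(("year", 2021),))

correct_items = visible_items(correct)  # the intended evidence
drifted_items = visible_items(drifted)  # same site, wrong scope
```

Both states query the same site and both return a well-formed result, but only one exposes the intended evidence. That is the sense in which an agent can reach a relevant source and still answer from an incorrect retrieval state.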
