SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval

Ningyuan Li, Haiyang Shen, Mugeng Liu, Yudong Han, Zhuofan Shi, Sixiong Xie, and Yun Ma

arXiv:2605.22219 · cs.AI · May 22, 2026

TL;DR

SGR-Bench is a benchmark that evaluates search agents on state-gated retrieval: tasks in which an agent must establish the correct site-specific retrieval state before it can produce an answer.

Contribution

The paper introduces SGR-Bench, a benchmark of 100 expert-curated tasks for evaluating agents on specialized retrieval that requires establishing a site-specific retrieval state, together with an analysis of system performance and failure modes.

Findings

01

The strongest system achieves only 66.18% item-level F1 on SGR-Bench.

02

Agents often reach relevant sources but establish incorrect retrieval states.

03

Retrieval-scope drift and criterion mismatch are major failure causes.
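
The item-level F1 cited in the first finding is not defined on this page. As a minimal sketch, assuming each structured answer is a collection of items and that scoring compares the predicted items with the gold items as sets, it could be computed as below. The function name `item_level_f1` and the whitespace and case normalization are illustrative assumptions, not the paper's actual scorer.

```python
from typing import Iterable


def item_level_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Set-based F1 over answer items (assumed definition, not the paper's)."""
    # Assumption: items match after trimming whitespace and lowercasing.
    pred_set = {p.strip().lower() for p in predicted}
    gold_set = {g.strip().lower() for g in gold}

    # Convention chosen here: two empty answers count as a perfect match.
    if not pred_set and not gold_set:
        return 1.0

    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0

    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Two of three predicted items are correct, and one gold item is missed.
    print(item_level_f1(["a", "b", "c"], ["a", "b", "d"]))
```

Under this reading, an agent that reaches the right source but sets the wrong filter or scope can still return well-formed items and yet score low, because the items come from the wrong retrieval state.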

Abstract

Recent advances in large language models and tool-using agents have expanded the range of benchmarked web tasks. Yet an important class of specialized retrieval tasks remains undercharacterized. On many specialized data-retrieval websites, answer-bearing evidence becomes accessible only after establishing the correct site-specific retrieval state through filters, views, hierarchies, or scopes. We term this capability state-gated retrieval (SGR). We introduce SGR-Bench, a benchmark for this setting containing 100 expert-curated tasks spanning six source families and 12 public data ecosystems. Each task requires discovering the appropriate website and configuring its site-specific retrieval state to produce a structured answer. SGR-Bench pairs constraint-guided and goal-oriented formulations of the same underlying problems, enabling controlled comparisons between explicit and implicit…

Code & Models

Repositories

https://huggingface.co/datasets/PKUAIWeb/SGR-BENCH

Datasets

PKUAIWeb/SGR-BENCH
dataset · 303 downloads
