SeekerGym: A Benchmark for Reliable Information Seeking

Remy Kim; Minseung Lee; Shuo Li; Osbert Bastani

arXiv:2604.17143·cs.LG·April 21, 2026


TL;DR

SeekerGym is a new benchmark that evaluates how completely AI agents retrieve information from documents such as Wikipedia pages and survey papers, and how well they quantify their uncertainty about what they may have missed.

Contribution

The paper introduces a benchmark that measures both the completeness of the information an AI agent retrieves and how well the agent quantifies its uncertainty about that completeness.

Findings

01 Best models retrieve 42.5% of Wikipedia passages

02 Best models retrieve 29.2% of survey paper passages

03 Significant room for improvement in retrieval completeness

Abstract

Despite their substantial successes, AI agents continue to face fundamental challenges in terms of trustworthiness. Consider deep research agents, tasked with searching for information relevant to a given topic: while AI agents can perform effective information retrieval, there is little guarantee regarding the completeness of this information. Gaps in retrieved information can leave biases that mislead users even if the information they are given is correct and relevant. We introduce SeekerGym, a benchmark designed to evaluate the completeness of information retrieved by AI agents. SeekerGym also measures how well agents quantify their uncertainty in the completeness of their information; if an agent fails to retrieve all relevant information, it is useful for it to at least quantify how much might be missing. At a high level, each task in SeekerGym is a document (e.g., a…
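
To make the two quantities described above concrete, here is a minimal sketch. It assumes completeness is scored as the fraction of a document's reference passages that the agent retrieved, and that the agent reports a single self-estimate of that fraction. The function names, the passage identifiers, and the absolute-error score are hypothetical illustrations, not the metrics defined in the paper.

```python
from typing import Iterable


def completeness(retrieved_ids: Iterable[str], reference_ids: Iterable[str]) -> float:
    """Fraction of reference passages that appear among the retrieved ones.

    Assumption: passages are matched by exact identifier. A real benchmark
    may use a softer matching rule.
    """
    reference = set(reference_ids)
    if not reference:
        raise ValueError("reference_ids must contain at least one passage")
    return len(reference & set(retrieved_ids)) / len(reference)


def estimate_error(estimated: float, actual: float) -> float:
    """Absolute gap between the agent's self-estimated completeness and the
    measured value (hypothetical score; lower is better)."""
    return abs(estimated - actual)


# Hypothetical example: a document with four reference passages.
reference = ["p1", "p2", "p3", "p4"]
retrieved = ["p2", "p4", "p9"]  # "p9" is not a reference passage and is ignored

actual = completeness(retrieved, reference)
print(actual)  # 0.5

# The agent believes it found 80% of the relevant passages.
print(round(estimate_error(0.8, actual), 3))  # 0.3
```

Under this reading, an agent that retrieves half the passages but reports 80% completeness is overconfident by 0.3, which is the kind of gap the uncertainty measurement is meant to expose.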
