SeekerGym: A Benchmark for Reliable Information Seeking
Remy Kim, Minseung Lee, Shuo Li, Osbert Bastani

TL;DR
SeekerGym is a new benchmark that evaluates how completely AI agents retrieve information from documents such as Wikipedia articles and survey papers, and how well they quantify their uncertainty about what they may have missed.
Contribution
SeekerGym introduces a benchmark that measures two things: the completeness of the information AI agents retrieve, and how well those agents quantify their uncertainty about that completeness.
Findings
Best models retrieve 42.5% of Wikipedia passages
Best models retrieve 29.2% of survey paper passages
Significant room for improvement in retrieval completeness
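One way to read the percentages above is as recall over a document's passages, paired with an agent's own estimate of how much it found. The snippet below is a minimal sketch under that assumption, not the paper's actual scoring code: the function names `completeness` and `completeness_estimate_error`, the passage identifiers, and the absolute-error measure are all hypothetical and chosen only for illustration.

```python
def completeness(retrieved_ids, reference_ids):
    """Fraction of reference passages that the agent retrieved (recall).

    Hypothetical helper: assumes passages can be matched by identifier.
    """
    reference = set(reference_ids)
    if not reference:
        raise ValueError("reference_ids must be non-empty")
    return len(reference & set(retrieved_ids)) / len(reference)


def completeness_estimate_error(estimated, actual):
    """Absolute gap between the agent's self-reported completeness and
    the measured one (an assumed, simple stand-in for an uncertainty score)."""
    return abs(estimated - actual)


# Illustrative data only: five reference passages, two of them retrieved.
reference = ["p1", "p2", "p3", "p4", "p5"]
retrieved = ["p2", "p4", "p9"]  # "p9" is not in the reference document

score = completeness(retrieved, reference)
gap = completeness_estimate_error(0.7, score)  # agent claimed 70% coverage
print(f"completeness={score:.2f}, estimate error={gap:.2f}")
```

Under this reading, an agent that retrieves few passages but reports an accurate completeness estimate is still giving the user a usable signal about how much might be missing.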
Abstract
Despite their substantial successes, AI agents continue to face fundamental challenges in terms of trustworthiness. Consider deep research agents, tasked with searching for information relevant to a given topic: while AI agents can perform effective information retrieval, there is little guarantee regarding the completeness of this information. Gaps in retrieved information can leave biases that mislead users even if the information they are given is correct and relevant. We introduce SeekerGym, a benchmark designed to evaluate the completeness of information retrieved by AI agents. SeekerGym also measures how well agents quantify their uncertainty in the completeness of their information; if an agent fails to retrieve all relevant information, it is useful for it to at least quantify how much might be missing. At a high level, each task in SeekerGym is a document (e.g., a…
