AgentSearchBench: A Benchmark for AI Agent Search in the Wild

Bin Wu; Arastun Mammadli; Xiaoyu Zhang; Emine Yilmaz

arXiv:2604.22436·cs.AI·April 27, 2026

TL;DR

AgentSearchBench is a large-scale benchmark for evaluating agent search methods in real-world scenarios. It emphasizes execution-grounded relevance signals over matching on textual descriptions alone.

Contribution

The paper introduces a comprehensive benchmark for agent search that incorporates execution signals, revealing limitations of existing methods and proposing improvements through behavioral signals.

Findings

01 Semantic similarity does not reliably predict agent performance.

02 Execution-aware probing significantly improves ranking quality.

03 Lightweight behavioral signals enhance agent discovery effectiveness.
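
To make the first two findings concrete, here is a minimal toy sketch of two-stage agent search: retrieval by description overlap, then reranking with a small execution probe. It is not the benchmark's method or code. The `Agent` class, the Jaccard scorer, the probe format, and the `weight` parameter are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Agent:
    name: str
    description: str
    run: Callable[[str], str]  # hypothetical: executes the agent on one input


def description_score(query: str, agent: Agent) -> float:
    """Jaccard token overlap, a crude stand-in for semantic similarity."""
    q = set(query.lower().split())
    d = set(agent.description.lower().split())
    return len(q & d) / len(q | d) if (q | d) else 0.0


def probe_score(agent: Agent, probes: List[Tuple[str, str]]) -> float:
    """Fraction of (input, expected) probes the agent answers correctly."""
    if not probes:
        return 0.0
    passed = sum(1 for inp, expected in probes if agent.run(inp) == expected)
    return passed / len(probes)


def search(query: str, agents: List[Agent], probes: List[Tuple[str, str]],
           top_k: int = 2, weight: float = 0.8) -> List[str]:
    # Stage 1 (retrieval): keep the top_k agents by description overlap.
    candidates = sorted(
        agents, key=lambda a: description_score(query, a), reverse=True
    )[:top_k]

    # Stage 2 (reranking): blend the execution probe with the description score.
    # The blend weight is an arbitrary illustrative choice.
    def blended(a: Agent) -> float:
        return weight * probe_score(a, probes) + (1 - weight) * description_score(query, a)

    return [a.name for a in sorted(candidates, key=blended, reverse=True)]


# Toy pool: "flashy" has the best-matching description but does not do the task.
agents = [
    Agent("flashy", "reverse any string text fast", lambda s: s),
    Agent("plain", "string utility", lambda s: s[::-1]),
    Agent("other", "weather forecast lookup", lambda s: "sunny"),
]
query = "reverse a string"
probes = [("abc", "cba"), ("hello", "olleh")]

by_description = [
    a.name
    for a in sorted(agents, key=lambda a: description_score(query, a), reverse=True)
]
ranked = search(query, agents, probes)
print(by_description)  # description-only order
print(ranked)          # order after execution-aware reranking
```

In this toy pool the description-only ranking puts `flashy` first, while the probe-based reranking promotes `plain`, the only agent that actually reverses strings. Real agents are costly to execute, so a practical system would presumably probe only a short candidate list, as the `top_k` cut does here.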

Abstract

The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, and evaluates relevance using execution-grounded…

Code & Models

Repositories

Bingo-W/AgentSearchBench (GitHub)
