TL;DR
AgentSearchBench is a large-scale benchmark, built from nearly 10,000 real-world agents, for evaluating agent search methods. It judges relevance with execution-grounded signals rather than relying on agent descriptions alone.
Contribution
The paper introduces a benchmark for agent search that grounds relevance in execution signals, shows where existing description-based methods fall short, and demonstrates that lightweight behavioral signals improve ranking.
Findings
Semantic similarity does not reliably predict agent performance.
Execution-aware probing significantly improves ranking quality.
Lightweight behavioral signals enhance agent discovery effectiveness.
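To make the findings above concrete, here is a minimal sketch of execution-aware reranking: each candidate agent is scored by a blend of description similarity and its pass rate on a few probe tasks. Everything in it is an assumption for illustration, not the paper's method: the `Agent` class and its `run` interface, the bag-of-words similarity (a toy stand-in for an embedding model), the `alpha` weight, and the two example agents are all hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Agent:
    name: str
    description: str
    run: Callable[[str], str]  # hypothetical execution interface: task in, output out


def description_score(query: str, description: str) -> float:
    """Cosine similarity over bag-of-words counts (toy stand-in for embeddings)."""
    q = Counter(query.lower().split())
    d = Counter(description.lower().split())
    dot = sum(q[w] * d[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0


def probe_score(agent: Agent, probes: List[Tuple[str, str]]) -> float:
    """Fraction of probe tasks whose output matches the expected answer."""
    if not probes:
        return 0.0
    passed = 0
    for task, expected in probes:
        try:
            passed += agent.run(task).strip() == expected
        except Exception:
            pass  # an agent that crashes simply fails this probe
    return passed / len(probes)


def rerank(query: str, agents: List[Agent], probes: List[Tuple[str, str]], alpha: float = 0.3) -> List[str]:
    """Order agent names by alpha * description score + (1 - alpha) * probe score."""
    scored = [
        (alpha * description_score(query, a.description) + (1 - alpha) * probe_score(a, probes), a.name)
        for a in agents
    ]
    return [name for _, name in sorted(scored, key=lambda s: s[0], reverse=True)]


# Hypothetical agents: one with a well-matched description that fails on execution,
# one with a sparse description that actually solves the probes.
query = "add two numbers"
talker = Agent("talker", "add two numbers quickly and reliably", lambda t: "I can add numbers!")
calc = Agent("calc", "arithmetic evaluator", lambda t: str(sum(int(x) for x in t.split("+"))))
probes = [("2+3", "5"), ("10+7", "17")]

description_only = rerank(query, [talker, calc], probes, alpha=1.0)
execution_aware = rerank(query, [talker, calc], probes, alpha=0.3)
```

With `alpha=1.0` the ranking uses descriptions only and favors `talker`; once probe results are weighted in, `calc` moves to the top. In this toy setup the description score and the probe score disagree, which is the kind of mismatch the first finding refers to.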
Abstract
The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, and evaluates relevance using execution-grounded…
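As an illustration of scoring a ranking against execution-grounded relevance labels, the sketch below computes nDCG@k. The choice of nDCG, the graded relevance values, and the agent names are assumptions made for this example; the text above does not specify which measure the benchmark uses.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranking: List[str], relevance: Dict[str, float], k: int) -> float:
    """nDCG@k of a ranking, given per-agent relevance (e.g. observed task success rates)."""
    gains = [relevance.get(agent, 0.0) for agent in ranking]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical relevance labels, imagined as success rates from executing each agent.
relevance = {"calc": 1.0, "solver": 0.5, "talker": 0.0}

by_description = ["talker", "solver", "calc"]  # hypothetical description-based ranking
by_execution = ["calc", "solver", "talker"]    # hypothetical execution-aware ranking

score_description = ndcg_at_k(by_description, relevance, k=3)
score_execution = ndcg_at_k(by_execution, relevance, k=3)
```

Because the relevance labels come from execution rather than from text overlap, a ranking that places the agents that actually succeed near the top scores higher, independent of how well their descriptions match the query.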
